This article consists of notes I took while learning Airflow for the first time. It includes a brief overview of Airflow, installation on a local computer, and running some basic DAGs. If you have any input or suggestions related to this learning, just drop me a message and I will definitely look into it. Enjoy!
Airflow is a tool to configure, schedule, and monitor workflows. It is mostly used by data engineers to build data pipelines from various data sources to data warehouses, so that we can gather information and insights from the data.
The basic form of a workflow is the Directed Acyclic Graph (DAG). A DAG is structured as a sequence of tasks and their relationships, which are executed by Airflow workers. In Airflow, we can flexibly manage our DAGs and their properties through code. For example, we can define a DAG's start date, its schedule interval, how many retries it should do if a failure occurs, notifications by email, and many more.
A task is the most basic unit of execution in a workflow. A workflow can consist of several tasks, each of which has its own unique functionality. Tasks are generally represented by operators. An operator is a task template that lets us generate and build our workflows quickly without importing many modules into the script. Some examples of operators are:
- PostgresOperator to interact with PostgreSQL databases
- BashOperator to run a task written as a Bash command
- PythonOperator to define a task that runs a Python function
- and many more.
I use a computer with Windows OS. After several trials and errors browsing through the official Apache Airflow documentation and other tutorials, I successfully installed it with the following steps:
- Download and install Ubuntu from the Microsoft Store.
- Set up Windows Subsystem for Linux (WSL) on the computer.
- Install Docker Desktop.
- Get the `docker-compose.yaml` configuration file from the Apache Airflow documentation here.
- Create three directories named `./dags`, `./logs`, and `./plugins` so that the setup is compatible with the `.yaml` config file.
- Create a `.env` file containing the `AIRFLOW_UID` and `AIRFLOW_GID` environment variables. By default, `AIRFLOW_UID` should have a value of 50000 and `AIRFLOW_GID` a value of 0. Now the directory tree should look something like this:
- Run `docker-compose up airflow-init` in the WSL terminal to initialize the Airflow database.
- Run the `docker-compose up` command. It will start several components, and when everything is ready the terminal will look like this:
- View the user interface by opening http://localhost:8080/
After I successfully installed and launched Airflow on my local machine, I tried to implement several example DAGs that are given as tutorials in the official Airflow documentation.
The first example consists of simple Bash scripts that print the current date, sleep, and output params in the terminal. The code can be accessed here. As we can see in the picture below, the DAG is dark green, indicating that it finished successfully.
We can also see the log of each task while it runs; here is an example of the log of the print_date task. We can see that it successfully outputs today's date.
The second example creates a table called 'pet' in a Postgres database and populates it with some entries. The code can be accessed here. At first, I encountered an error saying that the Airflow workers could not connect to the Postgres server, as stated in the log below.
After some time, I figured out that we have to configure the 'postgres_default' connection under Admin -> Connections in the Airflow UI to be able to connect to a specific database instance. I then used the PostgresOperator and the task completed successfully.
We can also inspect through the Postgres terminal that the table 'pet' has been created and filled with some data.
Load to PostgreSQL from CSV file
I also tried another challenge. The project is to download a CSV file from a website and load it into a PostgreSQL database. The code can be accessed here. However, I made some modifications to the code: I added a task to download the CSV file first and to create an empty table in the database before loading the data.
Here is my DAG's tree view:
I successfully completed the task and populated the "Employee" table in the database. Here is how it looked in the PostgreSQL terminal.
While I was running Airflow using Docker, my computer often heated up. It turns out that Docker consumes a lot of memory and CPU. I solved this problem by following this tutorial.