Learning Airflow

Picture from Apache-Airflow

This article consists of notes that I took while learning Airflow for the first time. It includes a brief overview of Airflow, installation on a local computer, and running some basic DAGs. If you have any input or suggestions related to this learning, just drop me a message and I will definitely look into it. Enjoy!

Overview

Airflow is a tool for configuring, scheduling, and monitoring workflows. It is mostly used by data engineers to build data pipelines that move data from various sources into data warehouses, so that we can gather information and insights from it.

Illustration of DAG, source from Apache-Airflow

The basic form of a workflow is a Directed Acyclic Graph (DAG). A DAG is structured as a set of tasks and their relationships, which are executed by Airflow workers. In Airflow, we manage our DAGs and their properties flexibly through code. For example, we can define the start date, the schedule interval, how many retries a task should attempt if a failure occurs, email notifications, and much more.

A task is the most basic unit of execution in a workflow. A workflow can consist of several tasks, each with its own functionality. Tasks are generally created from operators. An operator is a reusable task template that lets us build workflows quickly without writing much boilerplate in the script. Some examples of operators are (a minimal DAG using a couple of them is sketched after the list):

  • PostgresOperator to interact with PostgreSQL databases
  • BashOperator to run Bash commands or scripts as a task
  • PythonOperator to run a Python function as a task
  • and many more.
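
To make these properties concrete, here is a minimal sketch of a DAG that sets a start date, a daily schedule, retries, and email notification on failure, and chains a BashOperator to a PythonOperator. The DAG id, schedule, and email address are placeholders of my own, not taken from the documentation:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    # Plain Python function wrapped by the PythonOperator below
    print("Hello from Airflow!")


default_args = {
    "retries": 2,                             # retry a failed task twice
    "retry_delay": timedelta(minutes=5),      # wait 5 minutes between retries
    "email": ["me@example.com"],              # placeholder address for failure emails
    "email_on_failure": True,
}

with DAG(
    dag_id="hello_airflow",                   # placeholder DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",               # run once a day
    default_args=default_args,
    catchup=False,
) as dag:
    print_date = BashOperator(task_id="print_date", bash_command="date")
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)

    print_date >> say_hello                   # say_hello runs after print_date
```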

Installation

I use a computer running Windows. After several trials and errors browsing through the Apache Airflow official documentation and other tutorials, I successfully installed it with the following steps:

  • Download and install Ubuntu from the Microsoft Store.
  • Set up Windows Subsystem for Linux (WSL) on the computer.
  • Install Docker Desktop.
  • Get the docker-compose.yaml configuration file from the Apache Airflow documentation here.
  • Create three directories named ./dags, ./logs, and ./plugins so that the project layout matches the .yaml config file.
  • Create a .env file containing the AIRFLOW_UID and AIRFLOW_GID environment variables. By default, AIRFLOW_UID should be 50000 and AIRFLOW_GID should be 0. Now the directory tree should look something like this:
Snippet of the directory tree
  • Run the command docker-compose up airflow-init in the WSL terminal to initialize the Airflow database.
  • Run the docker-compose up command; it will start several components, and once everything is ready the terminal will look like this:
Snippet of terminal

Implementing DAG

After I successfully installed and launched Airflow on my local machine, I tried to implement several example DAGs given as tutorials in the Airflow official documentation.

Bash script

The first example consists of simple Bash commands that print the current date, sleep, and echo templated parameters in the terminal. The code can be accessed here. In the picture below, we can see that the DAG runs are dark green, indicating that they finished successfully.

Screenshot of Airflow UI

We can also see the log of each task while it runs; here is an example of the log of the print_date task. We can see that it successfully outputs today's date.

Task logs, accessed in the Airflow UI
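
For reference, the DAG behind these tasks is roughly the sketch below. It follows the structure of the tutorial (print_date, sleep, and a templated echo), but the DAG id and the details are simplified; the code I actually ran is in the link above.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="tutorial_bash",                   # placeholder id
    start_date=datetime(2022, 1, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    # Prints today's date; this is the print_date task shown in the log above
    print_date = BashOperator(task_id="print_date", bash_command="date")

    # Sleeps for a few seconds, retrying up to 3 times if it fails
    sleep = BashOperator(task_id="sleep", bash_command="sleep 5", retries=3)

    # Uses Jinja templating to echo the logical date of the run
    templated = BashOperator(
        task_id="templated",
        bash_command="echo 'run date: {{ ds }}'",
    )

    print_date >> [sleep, templated]
```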

Using PostgresOperator

The second example creates a table called ‘pet’ in a PostgreSQL database and populates it with some entries. The code can be accessed here. At first, I encountered an error saying that the Airflow workers could not connect to the PostgreSQL server, as shown in the log below.

Logs from Airflow UI

After some time, I figured out that we have to configure the ‘postgres_default’ connection under Admin -> Connections in the Airflow UI to be able to connect to a specific database instance. Once that was set, the PostgresOperator tasks completed successfully.
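
A rough sketch of such a DAG is below. The DAG id and the table columns are simplified placeholders rather than a copy of the documentation's example; the key point is that each PostgresOperator task points at the ‘postgres_default’ connection configured in the UI:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="postgres_pet_demo",               # placeholder id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_pet_table = PostgresOperator(
        task_id="create_pet_table",
        postgres_conn_id="postgres_default",  # the connection set under Admin -> Connections
        sql="""
            CREATE TABLE IF NOT EXISTS pet (
                pet_id SERIAL PRIMARY KEY,
                name VARCHAR NOT NULL,
                pet_type VARCHAR NOT NULL
            );
        """,
    )

    populate_pet_table = PostgresOperator(
        task_id="populate_pet_table",
        postgres_conn_id="postgres_default",
        sql="INSERT INTO pet (name, pet_type) VALUES ('Max', 'Dog');",
    )

    create_pet_table >> populate_pet_table
```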

Tree view of the DAG from Airflow UI

We can also verify through the psql terminal that the table ‘pet’ has been created and filled with some data.

Snippet of psql terminal

Load to PostgreSQL from CSV file

I also tried another challenge: download a CSV file from a website and load it into a PostgreSQL database. The code can be accessed here. However, I made some modifications to the code: I added a task to download the CSV file first and another to create an empty table in the database before loading the data.
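
A rough sketch of this modified pipeline is below. The URL, table columns, and task ids are placeholders I chose for illustration (the actual code is in the link above): one task downloads the CSV, another creates an empty “Employee” table, and a final task bulk-loads the file into PostgreSQL via PostgresHook.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.postgres.operators.postgres import PostgresOperator

CSV_URL = "https://example.com/employees.csv"   # placeholder URL, not the one I used
CSV_PATH = "/tmp/employees.csv"


def download_csv():
    # Fetch the CSV file and save it locally for the load step
    response = requests.get(CSV_URL, timeout=60)
    response.raise_for_status()
    with open(CSV_PATH, "wb") as f:
        f.write(response.content)


def load_csv():
    # Bulk-load the file into the "Employee" table with Postgres COPY.
    # Note: this assumes the download and load tasks run on the same worker,
    # so they share /tmp.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    hook.copy_expert('COPY "Employee" FROM STDIN WITH CSV HEADER', CSV_PATH)


with DAG(
    dag_id="load_employee_csv",                 # placeholder id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    download = PythonOperator(task_id="download_csv", python_callable=download_csv)

    create_table = PostgresOperator(
        task_id="create_employee_table",
        postgres_conn_id="postgres_default",
        # placeholder columns; the real file has its own schema
        sql='CREATE TABLE IF NOT EXISTS "Employee" (name TEXT, department TEXT);',
    )

    load = PythonOperator(task_id="load_to_postgres", python_callable=load_csv)

    # The load step waits for both the download and the empty table
    [download, create_table] >> load
```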

Here is my DAG’s tree view:

Snippet of the DAG tree from Airflow UI

I successfully completed the task and filled the “Employee” table in the database. Here is how it looked in the PostgreSQL terminal.

Snippet of psql terminal

Notes

While I was running Airflow using Docker, my computer often heated up. It turns out that Docker consumes a lot of memory and CPU. I solved this problem by following this tutorial.

References

Apache Airflow Documentation

How to Stop WSL2 from Hogging All Your Ram With Docker

Airflow Tutorial on YouTube by Tuan Vu
