Utilizing Airbyte to create ELT pipeline

Photo by Hunter Harritt on Unsplash

This article consists of notes that I took when I explored Airbyte for the first time. This includes a brief overview about Airbyte, local installation, and connection demo.

If you have any inputs and suggestions related to this learning, just drop me a message and I definitely would look up to it. Thank you!

Overview

Airbyte is open-source data integration tools that could help to sync our data from applications, APIs, and databases to data warehouses, data lakes or other destinations. It has a wide variety choice of customizable connectors and we can also build our custom connector with Connector Development Kit (CDK).

Picture from Airbyte Docs

To understand how Airbyte works, we need to know some terms that are commonly used in Airbyte.

Source

Source is defined as data source that we want to ingest from. Airbyte offers many option from RDBMS (Postgres, MySQL, OracleDB, etc), APIs (Instagram, Google Analytics, Facebook Pages), and many more. You can see the full list of sources in the catalog here.

Airbyte also provides CDK as framework to develop our customized connector for HTTP APIs, Singer Taps, or other Python sources. CDK gives basic implementation structure to improve developer experience while designing the connector. Here is the link to the official documentation of Airbyte CDK

Destination

As its name, destination is defined as the end point of our pipeline. We can set our destination to be data warehouses, data lakes, databases, and many more. For the full list of the destination, you can check it in the catalog here.

Connection

In Airbyte, connection (or ELT pipeline), is defined as a synchronization process between source and destination. To set up a connection, we need to set the source and destination connector, fill the sync frequency, and select the stream that we want to sync. Stream in Airbyte is like a table in a database.

Install Airbyte locally

To install Airbyte in your computer, we will need to follow these steps:

  • Install Docker on our computer. Airbyte docs recommend following this guide to install Docker on Windows. We will use docker-compose to build Airbyte from source.
  • Clone and build the repo using docker-compose with this command.
$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up
Airbyte UI

Connection demo

After I successfully deployed Airbyte locally, I tried to create a custom Python HTTP API source connector with CDK framework and sync the data into a local database. Following this tutorial, I created a pipeline to sync data from PokeAPI to Postgre database.

Develop the code

The steps that I took to develop the code are:

  • Create the source by using generator with this command. Select the Python HTTP API Source template on the interactive helper and input the name of the connector.
$ cd airbyte-integrations/connector-templates/generator 
$ ./generate.sh
  • Create a virtual environment in the directory of the source atairbyte-integrations/connectors/source-<inputted-name> and install all the dependencies with pip install -r requirements.txt
  • Define inputs needed to read the data by editing the source_<name>/spec.json file. In my case, I add pokemon_name in required field since it is needed by the PokeAPI. This align with Airbyte Protocol’s spec operation.
  • Define check Airbyte operation by implementing the def check_connection in source.py file. This operation is needed to verify whether the input given by the user are sufficient enough to connect to the data source.
  • The next Airbyte operation is discover which returns all streams output from the connector, its corresponding schemas, and the sync mode (full refresh or incremental). We will need to create an extended Python HttpStream class in source.py file and <stream_name>.json file in the /schemas directory.
  • The last Airbyte operation is read which ingests data from the HTTP endpoints. We need to add built in function in our HttpStream class that at least define URL base and path of the endpoint, show the way to parse the response, and help with the pagination.
  • Build the docker image for the container by running docker build . -t airbyte/source-<source-name>:dev
  • The connector is now ready to be used in the UI

Set up in UI

  • Initially, we need to add our custom connector to the source. Go to Settings > New connector. Fill the Connector display name, docker repository name (docker image that was built before), and docker image tag (dev in my case).
  • Then in Sources, choose new source > fill the name > select our connector in Source type. We will need to pass the input parameter that are defined in spec.json file before. In my case, I filled charizard as the pokemon name.
Snippet of Source Settings
  • After that, we need to set the Destinations. I used PostgreSQL with Docker as the destination following this article. I then filled the form as below.
Snippet of Destination Settings
  • The last step is to set up the Connections. Fill the Source connector, Destination connector, and sync frequency. Save the configuration.
  • If done correctly, we can look at the sync history and see the results of the sync process.

Because I input charizard as the pokemon name, I got table describing about charizard properties in my Postgre as below.

Snippet of PgAdmin

--

--

--

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

继续坚持发 XTM求关注分享 https://www.torum.com/signup?referral_code=17704383579

Microsoft Has Finally Accepted Open-Source

Essentials of Software Testing Jobs

Sinatra — Movies catalog (Part-1)

The Installation Guide of Selenium IDE

Spree User Generation from Command Line

CSS Grid Series — Full Bleed Blog Layout

Making a Chat App like WhatsApp

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Fauzan Ragitya

Fauzan Ragitya

More from Medium

NoSQL for Beginners

talend as an ETL Tool

Somewhat complex reporting and data analysis in MySql 8 — Part 1

ETL for Best Selling Wrist Watch in Shopee