Utilizing Airbyte to create an ELT pipeline
This article consists of notes I took when I explored Airbyte for the first time. It includes a brief overview of Airbyte, a local installation, and a connection demo.
If you have any input or suggestions related to this learning, just drop me a message and I will definitely look into it. Thank you!
Airbyte is an open-source data integration tool that helps sync data from applications, APIs, and databases to data warehouses, data lakes, or other destinations. It offers a wide variety of customizable connectors, and we can also build our own connector with the Connector Development Kit (CDK).
To understand how Airbyte works, we need to know some terms that are commonly used in Airbyte.
A source is defined as the data source that we want to ingest from. Airbyte offers many options, from RDBMSs (Postgres, MySQL, OracleDB, etc.) to APIs (Instagram, Google Analytics, Facebook Pages) and many more. You can see the full list of sources in the catalog here.
Airbyte also provides the CDK as a framework for developing customized connectors for HTTP APIs, Singer Taps, or other Python sources. The CDK gives a basic implementation structure to improve the developer experience while designing a connector. Here is the link to the official documentation of the Airbyte CDK.
As its name suggests, a destination is the endpoint of our pipeline. We can set our destination to be a data warehouse, data lake, database, and many more. For the full list of destinations, you can check the catalog here.
In Airbyte, a connection (or ELT pipeline) is defined as a synchronization process between a source and a destination. To set up a connection, we need to set the source and destination connectors, fill in the sync frequency, and select the streams that we want to sync. A stream in Airbyte is analogous to a table in a database.
Install Airbyte locally
To install Airbyte on your computer, follow these steps:
- Install Docker on our computer. The Airbyte docs recommend following this guide to install Docker on Windows. We will use `docker-compose` to build Airbyte from source.
- Clone and build the repo using `docker-compose` with these commands.
```
$ git clone https://github.com/airbytehq/airbyte.git
$ cd airbyte
$ docker-compose up
```
- Visit http://localhost:8000 in the browser to see the UI
After I successfully deployed Airbyte locally, I tried to create a custom Python HTTP API source connector with the CDK framework and sync the data into a local database. Following this tutorial, I created a pipeline to sync data from PokeAPI to a Postgres database.
Develop the code
The steps that I took to develop the code are:
- Create the source by running the generator. Select the `Python HTTP API Source` template on the interactive helper and input the name of the connector.

```
$ cd airbyte-integrations/connector-templates/generator
$ ./generate.sh
```
- Create a virtual environment in the directory of the source at `airbyte-integrations/connectors/source-<inputted-name>` and install all the dependencies with `pip install -r requirements.txt`.
- Define the inputs needed to read the data by editing the `source_<name>/spec.json` file. In my case, I added a `required` field since it is needed by the PokeAPI. This aligns with the Airbyte Protocol's `spec` operation.
- Implement the `check` Airbyte operation by defining the `check_connection` method in the `source.py` file. This operation is needed to verify whether the inputs given by the user are sufficient to connect to the data source.
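To make the idea concrete, here is a trimmed-down sketch of what a `check` implementation does. It is written as a plain function without the `airbyte_cdk` dependency; the `pokemon_name` input field and the abbreviated list of valid names are assumptions for the sketch, not the tutorial's exact code (the real connector validates against the full PokeAPI pokemon list).

```python
# Hedged sketch of the CDK-style "check" operation: validate the user's
# config before any sync runs. VALID_POKEMON is abbreviated on purpose.
from typing import Any, Mapping, Optional, Tuple

VALID_POKEMON = {"charizard", "pikachu", "bulbasaur"}  # illustrative subset

def check_connection(config: Mapping[str, Any]) -> Tuple[bool, Optional[str]]:
    """Return (True, None) when the config is usable, (False, reason) otherwise."""
    name = str(config.get("pokemon_name", "")).strip().lower()
    if not name:
        return False, "pokemon_name is a required input"
    if name not in VALID_POKEMON:
        return False, f"{name} is not a valid pokemon name"
    return True, None
```

In the real CDK, this logic lives in the source class's `check_connection` method, and Airbyte calls it whenever a user tests or saves the source in the UI.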
- The next Airbyte operation is `discover`, which returns all the streams output by the connector, their corresponding schemas, and the supported sync modes (full refresh or incremental). We will need to declare each stream's schema in a `<stream_name>.json` file in the connector's `schemas` directory.
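For illustration, a pared-down stream schema file along these lines might declare just a few fields. The field names below come from the public PokeAPI response; the tutorial's actual schema covers many more properties.

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id":     { "type": ["null", "integer"] },
    "name":   { "type": ["null", "string"] },
    "height": { "type": ["null", "integer"] },
    "weight": { "type": ["null", "integer"] }
  }
}
```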
- The last Airbyte operation is `read`, which ingests data from the HTTP endpoints. We need to override built-in methods in our `HttpStream` subclass that, at a minimum, define the URL base and the path of the endpoint, describe how to parse the response, and handle pagination.
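The pieces the stream class has to provide can be sketched as below. This is a hedged stand-in written without the `airbyte_cdk` dependency so the shape is easy to see; in the real connector the class extends `airbyte_cdk`'s `HttpStream`, and `parse_response`/`next_page_token` receive a `requests.Response` rather than parsed JSON.

```python
# Stand-in for the HttpStream overrides the tutorial describes:
# a base URL, an endpoint path, pagination, and response parsing.
from typing import Any, Iterable, Mapping, Optional

class PokemonStream:
    url_base = "https://pokeapi.co/api/v2/"  # base URL shared by all endpoints

    def __init__(self, pokemon_name: str):
        self.pokemon_name = pokemon_name

    def path(self) -> str:
        # Endpoint path appended to url_base for each request.
        return f"pokemon/{self.pokemon_name}"

    def next_page_token(self, response_json: Mapping[str, Any]) -> Optional[Mapping[str, Any]]:
        # The single-pokemon endpoint is not paginated, so stop after one page.
        return None

    def parse_response(self, response_json: Mapping[str, Any]) -> Iterable[Mapping[str, Any]]:
        # The endpoint returns one JSON object describing the pokemon;
        # yield it as a single record.
        yield response_json
```

Returning `None` from `next_page_token` is how the CDK knows there are no further pages to fetch for a stream.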
- Build the Docker image for the connector by running `docker build . -t airbyte/source-<source-name>:dev`.
- The connector is now ready to be used in the UI
Set up in UI
- Initially, we need to add our custom connector as a source. Go to Settings > New connector. Fill in the connector display name, the Docker repository name (the Docker image that was built before), and the Docker image tag (`dev` in my case).
- Then, in Sources, choose new source, fill in the name, and select our connector as the Source type. We will need to pass the input parameters that were defined in the `spec.json` file before. In my case, I filled in `charizard` as the pokemon name.
- After that, we need to set up the Destinations. I used PostgreSQL with Docker as the destination, following this article, and then filled in the connection form with the database details.
- The last step is to set up the Connections. Fill in the source connector, the destination connector, and the sync frequency, then save the configuration.
- If done correctly, we can look at the sync history and see the results of the sync process.
Because I input charizard as the pokemon name, I got a table describing charizard's properties in my Postgres database.