Ensuring data quality with Great Expectations
Great Expectations is an open-source data quality tool that helps data teams with data testing, documentation, and automated profiling. Data testing matters because it ensures that the data we load or transform from our databases or vendors actually meets our expectations. If we run analysis on invalid data, it can lead to wrong interpretations, and in the worst case we could end up making wrong decisions that negatively impact the company or product we are working on.
As the name suggests, expectations are specifications of the characteristics we want to guarantee in our data. They are expressed in a simple declarative language so they are easy to understand (even for someone new to programming). Several expectations against one data source are collected into an Expectation Suite.
Here is an example of an expectation where we expect at least half of the values in the pageviews column to be between 0 and 20, using the expectation type expect_column_values_to_be_between.
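As a sketch, such an expectation can be written as a JSON-style entry of the kind Great Expectations stores in an Expectation Suite; the column name and thresholds come from the example above, and the mostly parameter expresses the "at least half" requirement:

```python
# One expectation entry in JSON-style form. "mostly": 0.5 means at least
# 50% of the column's values must fall in the range for the check to pass.
expectation = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
        "column": "pageviews",
        "min_value": 0,
        "max_value": 20,
        "mostly": 0.5,
    },
}
```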
The next step after creating an expectation is to validate it. The validation result tells us whether our data passes our expectations and also reports any unexpected values in the data.
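To make the pass/fail semantics concrete, here is a small standalone sketch in plain Python (not the Great Expectations API) of how a range check with a "mostly" threshold decides:

```python
def mostly_between(values, min_value, max_value, mostly):
    """Pass if at least a `mostly` fraction of values lies in [min_value, max_value]."""
    in_range = sum(min_value <= v <= max_value for v in values)
    return in_range / len(values) >= mostly

# 3 of 4 values are in [0, 20], so a 0.5 ("at least half") threshold passes
print(mostly_between([1, 5, 30, 10], 0, 20, 0.5))
```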
Instead of writing your own expectations, Great Expectations also offers automated data profiling. This process inspects the data source to gather basic statistics and automatically generates an Expectation Suite from the given data. This approach can be useful when the data is complex and hard to understand up front.
Great Expectations also provides Data Docs, a report containing all of our expectations and their results rendered as a clean webpage. This UI makes it much easier to inspect and understand validation results.
Great Expectations supports a wide range of data sources, from SQL databases and cloud storage to Pandas dataframes and even CSV files. It also integrates with various parts of the data stack such as Spark, Apache Airflow, Snowflake, BigQuery, and many more.
When using it for the first time, we may want to deploy it locally. The first step is to install the
great_expectations library using pip with this command:
pip install great_expectations
We can confirm the Great Expectations library is installed by checking its version, e.g. by running great_expectations --version.
To initialize the project and create a Data Context, run
great_expectations init in the terminal. The Data Context manages all the configuration needed for the project. The init command generates a directory named
great_expectations which contains several subdirectories:
great_expectations.yml: the main configuration file of the project. It stores the connected data sources, the expectations path, the checkpoints path, backend connections, and more.
checkpoints: contains checkpoint configuration files, which determine which Expectation Suites correspond to which data and what to do with the validation results.
expectations: directory that stores Expectation Suites as JSON files holding our criteria for the validation process.
plugins: holds any custom plugins used in the project.
uncommitted: all files and directories in it should be kept out of version control.
uncommitted/data_docs: contains the Data Docs HTML files
uncommitted/validations: contains all of the validation results produced by Great Expectations
uncommitted/config_variables.yml: file containing secrets and sensitive information, e.g. database credentials.
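Putting the pieces above together, the generated layout looks roughly like this (exact contents vary by version):

```
great_expectations/
├── great_expectations.yml
├── checkpoints/
├── expectations/
├── plugins/
└── uncommitted/
    ├── config_variables.yml
    ├── data_docs/
    └── validations/
```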
Creating expectations for sample data
In this part, I am going to try Great Expectations on NYC taxi data from January and February 2019, following the tutorial provided in the Great Expectations documentation.
The data sources I used are CSV files stored locally. To accommodate this, I used
InferredAssetFilesystemDataConnector, which lets me batch data on the local filesystem based on the filename pattern of the data asset. I also specified the path to the data directory, which is
../data . Here is the snippet of the
datasources configuration in great_expectations.yml:
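A configuration in this style might look like the following sketch, based on Great Expectations' filesystem datasource format; the datasource and connector names here are illustrative, and the exact schema depends on the library version:

```yaml
# Sketch of a datasources block in great_expectations.yml
# (names like "taxi_datasource" are illustrative)
datasources:
  taxi_datasource:
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      default_inferred_data_connector_name:
        class_name: InferredAssetFilesystemDataConnector
        base_directory: ../data
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)\.csv
```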
I used automated data profiling to fill the Expectation Suite. To do this, I executed
great_expectations suite new in the CLI and chose option 3 to run the automatic profiling method. The terminal then asks which data should be profiled and prompts for a suite name. From there, I was redirected to a Jupyter Notebook and followed the instructions in it.
As a result, my Expectation Suite was filled with expectations based on basic statistics such as the minimum value, maximum value, mean, median, row count, and so on.
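For illustration, a profiled suite contains entries of roughly the following kinds; the expectation types are standard Great Expectations types, but the column names and numeric bounds here are made up, not the profiler's actual output for this dataset:

```python
# Illustrative only: the kinds of entries an auto-profiled suite contains.
# Bounds are hypothetical placeholders, not real profiler output.
profiled_suite = [
    {"expectation_type": "expect_table_row_count_to_be_between",
     "kwargs": {"min_value": 9000, "max_value": 11000}},
    {"expectation_type": "expect_column_min_to_be_between",
     "kwargs": {"column": "passenger_count", "min_value": 0, "max_value": 1}},
    {"expectation_type": "expect_column_mean_to_be_between",
     "kwargs": {"column": "passenger_count", "min_value": 1.0, "max_value": 2.0}},
]
print(len(profiled_suite))
```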
To validate the data, I first need to set up a checkpoint. A checkpoint is a configuration file that specifies which Expectation Suites are evaluated, the data source name with its connector, and the naming template for the validation results.
great_expectations checkpoint new <checkpoint_name>
I executed the CLI command above, which opens a Jupyter Notebook session. Following the notebook through to the last cell runs the validation. The validation result is then also available in the Data Docs.
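The checkpoint configuration produced by this flow looks roughly like the following sketch; the checkpoint, datasource, asset, and suite names are illustrative, and the exact fields depend on the Great Expectations version:

```yaml
# Sketch of a checkpoint configuration file (illustrative names)
name: taxi_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: '%Y%m%d-%H%M%S'
validations:
  - batch_request:
      datasource_name: taxi_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: yellow_tripdata_2019-01
    expectation_suite_name: taxi_suite
```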
That wraps up my journey learning Great Expectations as a data quality management tool. If you have any feedback or suggestions, please leave them in the comment section, or drop me a message. Thank you!