Ensure data quality by using Great Expectations

Overview

Great Expectations is an open-source data quality tool that helps data teams with data testing, documentation, and automated profiling. Data testing matters because it makes sure the data we load or transform from our databases or vendors is valid according to our expectations. Imagine doing analysis on invalid data: it could lead to wrong interpretations, and in the worst case we could end up making wrong decisions that negatively impact the company or product we are working on.

Expectations

As the name suggests, expectations are the specifications or characteristics that we want our data to satisfy. They are expressed in a fairly simple declarative language so that we can understand them easily (even if we are new to programming). Several expectations for a data source are collected into an Expectation Suite.

Here is an example of an expectation where we expect at least half of the values in the pageviews column to be between 0 and 20, using the expectation type expect_column_values_to_be_between:

{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "pageviews",
    "max_value": 20,
    "min_value": 0,
    "mostly": 0.5
  }
}
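
The same expectation can also be created interactively from Python. Below is a minimal sketch assuming the legacy Pandas API (ge.from_pandas, available in the 0.x releases) and a made-up sample dataframe; in a real project the data would come from your configured datasource.

import great_expectations as ge
import pandas as pd

# Hypothetical sample data, just for illustration
df = ge.from_pandas(pd.DataFrame({"pageviews": [0, 3, 12, 25, 7]}))

# Same expectation as the JSON above, expressed through the Python API;
# "mostly" relaxes the expectation so that 50% of values passing is enough
result = df.expect_column_values_to_be_between(
    column="pageviews", min_value=0, max_value=20, mostly=0.5
)
print(result.success)  # True here: 4 out of 5 values are within [0, 20]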

Validation

The next thing to do after creating an expectation is to validate it. The validation result tells us whether our data passes our expectations and also reports any unexpected values in the data.
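
To make this concrete, here is a small sketch of what a validation run looks like, again assuming the legacy Pandas API and made-up data: after an expectation is attached to the dataset, validate() re-evaluates all of its expectations and returns an overall success flag plus per-expectation details, including a sample of unexpected values.

import great_expectations as ge
import pandas as pd

# Hypothetical sample data; attach one expectation, then validate
df = ge.from_pandas(pd.DataFrame({"pageviews": [0, 3, 12, 25, 7]}))
df.expect_column_values_to_be_between(
    column="pageviews", min_value=0, max_value=20, mostly=0.5
)

validation_result = df.validate()
print(validation_result.success)  # overall pass/fail of the whole run
for res in validation_result.results:
    # each entry reports the expectation type, its own success flag,
    # and a sample of the values that did not meet the expectation
    print(res.expectation_config.expectation_type, res.success)
    print(res.result.get("partial_unexpected_list", []))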

Profiling

Instead of writing every expectation yourself, Great Expectations also provides data profiling. This process inspects the data source to gather basic statistics and automatically generates an Expectation Suite based on the given data. This method can be suitable if our data is very complex and we are having a hard time understanding it.
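
In Python, one way to do this is with the UserConfigurableProfiler; the sketch below assumes the 0.13+ API and made-up sample data rather than a real batch.

import great_expectations as ge
import pandas as pd
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

# Hypothetical sample data; in practice this would be a batch of real data
dataset = ge.from_pandas(pd.DataFrame({"pageviews": [1, 5, 9, 14, 18]}))

# The profiler inspects the data and proposes an Expectation Suite for it
profiler = UserConfigurableProfiler(profile_dataset=dataset)
suite = profiler.build_suite()
print(len(suite.expectations))  # number of automatically generated expectations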

Documentation

Great Expectations also provides Data Docs, a report that presents all of our expectations and their validation results as a clean web page. This UI makes it easier for us to inspect and understand the validation results.

Snippet of Data Docs
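
Data Docs can also be rebuilt and opened from Python; a minimal sketch, assuming an already-initialized project (see the Local deployment section below):

from great_expectations.data_context import DataContext

# Load the project's Data Context, rebuild the Data Docs site,
# and open it in the default browser
context = DataContext()
context.build_data_docs()
context.open_data_docs()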

Integration

Great Expectations supports a wide range of data sources, from SQL databases and cloud storage to Pandas dataframes and even CSV files. It also integrates with various data stacks such as Spark, Apache Airflow, Snowflake, BigQuery, and many more.

Local deployment

When using it for the first time, we may want to deploy it locally. The first thing to do is to install the great_expectations library using pip with this command: pip install great_expectations

We can make sure the Great Expectations library is installed correctly by checking its version with great_expectations --version
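
The installation can also be verified from Python itself:

import great_expectations as ge

# Confirm that the library imports cleanly and print the installed version
print(ge.__version__)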

Initialization

To initialize the project and create a Data Context, run great_expectations init in the terminal. The Data Context manages all the configuration needed for the project. The init command generates a directory named great_expectations which contains several subdirectories.

Directory structure

Snippet of directory structure
  • great_expectations.yml : the main configuration file of the project. It stores information about the connected data sources, the expectations path, the checkpoints path, backend connections, and more.
  • checkpoints : contains checkpoint configuration files, whose role is to determine which Expectation Suites correspond to which data and what to do with the validation results.
  • expectations : directory that stores Expectation Suites as JSON files, which hold our criteria for the validation process.
  • plugins : holds any custom plugins used in the project.
  • uncommitted : all files and directories in it should not be committed to version control.
  • uncommitted/data_docs : contains the Data Docs HTML files
  • uncommitted/validations : contains all of the validation results produced by Great Expectations
  • uncommitted/config_variables.yml : file containing secrets and sensitive information, e.g. database credentials.

Create expectation for sample data

In this part, I am going to try Great Expectations on the NYC taxi data from January and February 2019, following the tutorial provided in the Great Expectations documentation.

Data source

Snippet of data source directory

The data sources that I used are CSV files stored locally. To accommodate this, I used the InferredAssetFilesystemDataConnector so that I could create batches from the local filesystem based on the filename pattern of the data asset. I also specified the path to the data directory, which is ../data . Here is the datasources configuration snippet from the great_expectations.yml file.

datasources:
  getting_started_datasource:
    module_name: great_expectations.datasource
    class_name: Datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        class_name: InferredAssetFilesystemDataConnector
        base_directory: ../data
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
        batch_identifiers:
          - default_identifier_name
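
With this datasource configured, a batch of data can also be loaded and inspected interactively from Python. Here is a minimal sketch; the data asset name and the suite name below are hypothetical and should match a CSV filename in ../data and an Expectation Suite that already exists in the project.

from great_expectations.core.batch import BatchRequest
from great_expectations.data_context import DataContext

context = DataContext()

# The asset name is inferred from the CSV filename by the
# InferredAssetFilesystemDataConnector configured above (hypothetical here)
batch_request = BatchRequest(
    datasource_name="getting_started_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="yellow_tripdata_sample_2019-01.csv",
)

# Hypothetical suite name; it must already exist in the expectations directory
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="getting_started_expectation_suite",
)
print(validator.head())  # peek at the batch to confirm it loaded correctly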

Expectations

I used automated data profiling to fill the Expectation Suite. To do this, I executed great_expectations suite new in the CLI and chose option 3 to run the automatic profiling method. The terminal then asked which data should be profiled and prompted me for the suite name. After that, I was redirected to a Jupyter Notebook and followed the instructions there.

As a result, my expectation suite was filled with basic statistical metrics such as the minimum value, maximum value, mean, median, row count, etc.

Snippet of the expectation suite viewed on Data Docs

Validation

To validate the data, I first needed to set up a checkpoint. A checkpoint is a configuration file that specifies which Expectation Suites are evaluated against which data source (with its connector), along with the naming template for the validation results.

great_expectations checkpoint new <checkpoint_name>

I executed the CLI command above, which opened a Jupyter Notebook session. Following the notebook through to the last cell runs the validation. The validation result is also available in the Data Docs, as shown below.

Snippet of the validation result in the Data Docs
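
Once the checkpoint exists, it can also be run from Python instead of re-running the notebook; a minimal sketch, assuming a hypothetical checkpoint name and a recent 0.x release that provides run_checkpoint:

from great_expectations.data_context import DataContext

context = DataContext()

# Run an existing checkpoint by name (hypothetical here); this validates the
# configured data against its Expectation Suite(s) and updates the Data Docs
result = context.run_checkpoint(checkpoint_name="getting_started_checkpoint")
print(result.success)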

That is all of my journey learning about Great Expectations as a data quality management tool. If you have any feedback or suggestions, please share it in the comment section, or drop me a message. Thank you!

References

Great Expectations documentation
