How to Test Data Pipelines: Approaches, Tools, and Tips
If you're in charge of maintaining a large set of data pipelines, how can you ensure that your data continues to meet expectations after every transformation? That's where data quality testing comes in. Using a set of rules to check if the data conforms to certain requirements, data tests are implemented throughout a data pipeline, from the ingestion point to the destination.
This blog post elaborates on three approaches to testing data pipelines, gives an overview of data testing tools, and makes the case for using a data observability tool to detect the issues your tests didn’t account for.
Three Approaches to Data Testing
1. Validate the data after a pipeline has run
In this approach, tests don't run in the intermediate stages of a data pipeline; instead, a single test checks whether the fully processed data matches established business rules. This is the most cost-effective way to detect data quality issues, but running tests solely at the data destination has a set of drawbacks that range from tedious to downright disastrous.
First, it's impossible to detect data quality issues early on, so data pipelines can break when one transformation's output doesn't match the next step's input criteria. Take the example of one transformational step that converts a Unix timestamp to a date, while the next step changes notation from dd/MM/yyyy to yyyy-MM-dd. If the first step produces something erroneous, the second step will fail and most likely throw an error.
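To make the failure mode concrete, here is a minimal Python sketch of those two steps (hypothetical function names, no framework assumed):

```python
from datetime import datetime, timezone

def to_ddmmyyyy(unix_ts: int) -> str:
    """Step 1: convert a Unix timestamp to a dd/MM/yyyy string."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%d/%m/%Y")

def to_iso(date_str: str) -> str:
    """Step 2: change notation from dd/MM/yyyy to yyyy-MM-dd."""
    return datetime.strptime(date_str, "%d/%m/%Y").strftime("%Y-%m-%d")

print(to_iso(to_ddmmyyyy(1700000000)))  # "2023-11-14" -- the happy path

# If step 1 ever emits something erroneous (a null encoded as text, a locale quirk),
# step 2 fails far from the actual root cause, and no test points back to step 1:
to_iso("not-a-date")  # raises ValueError
```

A check between the two steps, even a simple format assertion, would surface the bad value where it is produced rather than where it happens to explode.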
It's also worth considering that there are no tests to flag the root cause of a data error, as data pipelines are more or less a black box. Consequently, debugging is challenging when something breaks or produces unexpected results.
2. Validate data from data source to destination
In this approach, the solution is to set up tests throughout the pipeline, in multiple steps, often spanning various technologies and stakeholders. Although time-intensive, this approach makes tracking down any data quality issues much easier.
Be aware that this approach can also be costly if your organization uses legacy technology (like traditional ETL tools) that doesn't scale and requires a large engineering team to maintain your validation rules over time. If you are moving to a modern data architecture, it is also time to update your data testing and validation tools.
3. Validate data as a synthesis of the previous two
If you’re using a modern cloud-based data warehouse like BigQuery, Snowflake, or Redshift, or a data lakehouse like Delta Lake, both raw and production data live on a single platform, and the data can be transformed in that same technology. This paradigm, known as ELT, has led organizations to embed tests directly in their data modeling efforts.
This ELT approach offers additional benefits. First of all, data tests can be configured with a single tool. Second, it gives you the flexibility to embed data tests in the transformation code or to configure them in the orchestration tool. Finally, because data tests are so centralized, they can be set up in a declarative manner: when upstream changes occur, you don't need to comb through swaths of code to find the right place to implement new tests. Instead, it's often just a matter of adding a line to a configuration file.
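As a rough sketch of what that declarative style can look like, the example below describes tests as configuration and applies them with a generic helper. The CHECKS structure and run_checks function are hypothetical, not the syntax of dbt, Soda, or any other specific tool:

```python
import pandas as pd

# Hypothetical declarative test configuration: one entry per column.
CHECKS = {
    "order_id":   {"not_null": True, "unique": True},
    "amount_usd": {"not_null": True, "min": 0},
    "status":     {"accepted_values": ["placed", "shipped", "delivered"]},
}

def run_checks(df: pd.DataFrame, checks: dict) -> list[str]:
    """Evaluate each declarative rule and return a list of failure messages."""
    failures = []
    for col, rules in checks.items():
        s = df[col]
        if rules.get("not_null") and s.isna().any():
            failures.append(f"{col}: contains nulls")
        if rules.get("unique") and s.duplicated().any():
            failures.append(f"{col}: contains duplicates")
        if "min" in rules and (s.dropna() < rules["min"]).any():
            failures.append(f"{col}: values below {rules['min']}")
        if "accepted_values" in rules and not s.dropna().isin(rules["accepted_values"]).all():
            failures.append(f"{col}: unexpected values")
    return failures
```

When an upstream change introduces a new constraint, you extend CHECKS with one more line instead of hunting through transformation code.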
Data Testing Tools
There are many ways to set up data tests. A homebrew solution would be to set up try-catch statements or assertions that check the data for certain properties. However, this isn't standardized or resilient. That's why many vendors have come up with scalable solutions, including dbt, Great Expectations, Soda, and Deequ. A brief overview of data testing tools:
- When you manage a modern data stack, there's a good chance you're also using dbt. This community darling, offered as commercial open source, has a built-in test module.
- A popular tool for implementing tests in Python is Great Expectations. It offers four different ways of implementing out-of-the-box or custom tests; a minimal example follows this list. Like dbt, it has an open source and a commercial offering.
- Soda, another commercial open source tool, comes with testing capabilities that are in line with Great Expectations' features. The difference is that Soda is a broader data reliability engineering solution that also encompasses data monitoring.
- When working with Spark, all your data is processed as a Spark DataFrame at some point. Deequ offers a simple way to implement tests and metrics on Spark DataFrames. Best of all, it doesn't have to reprocess a whole data set when a test reruns: it caches the previous results and updates them incrementally.
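To give a feel for what these tools look like in practice, here is a minimal check written against Great Expectations' classic pandas-flavored API; the exact entry points vary between versions, so treat this as an illustration rather than a recipe:

```python
import pandas as pd
import great_expectations as ge

orders = pd.DataFrame({"order_id": [1, 2, 3], "amount_usd": [10.0, 25.5, None]})
df = ge.from_pandas(orders)  # wrap the DataFrame so expectations can run on it

# Each expectation returns a validation result with a `success` flag.
result = df.expect_column_values_to_not_be_null("amount_usd")
print(result.success)  # False: one value is missing
```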
Data Observability Tools Make it Even Easier to Test Data Pipelines
Implementing data testing in an end-to-end manner can be a daunting task. While you could use open source technology to learn each system's dialect and implement testing frameworks at every step of your pipeline, many organizations today find it easier to embed a data observability tool instead.
Telmai is a no-code data observability tool that has out-of-the-box integrations with various databases, data lakes, and data warehouses. Plug it in anywhere in your data pipeline to help you test data at ingest, in your data warehouse, and anywhere in between. Simple, fast, and SOC 2 type 2 compliant, Telmai automatically produces alerts when your data drifts and offers the tools to perform root cause analysis quickly. To learn more, sign up for a free starter account.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, data quality issues can undermine critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process: data collection, discovery and analysis, documenting the findings, and data quality monitoring. We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
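As a small illustration (the file, database, and table names below are hypothetical), consolidation can be as simple as loading each source and tagging its origin before profiling:

```python
import sqlite3
import pandas as pd

# Hypothetical sources: a CRM export and an operational billing database.
crm_customers = pd.read_csv("crm_customers.csv")
with sqlite3.connect("billing.db") as conn:
    billing_customers = pd.read_sql("SELECT * FROM customers", conn)

# Tag each record with its origin so discrepancies can be traced back later.
combined = pd.concat(
    [crm_customers.assign(source="crm"), billing_customers.assign(source="billing")],
    ignore_index=True,
)
```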
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery matters for your use case, make sure you collect and profile your data in its entirety; don't rely on samples, as sampling will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
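A basic version of structure and content discovery is easy to sketch with pandas; the toy table below is hypothetical, and a dedicated profiling tool automates and scales the same idea:

```python
import pandas as pd

# Hypothetical table to profile; in practice this is the consolidated data from step 1.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "zip_code": ["94061", "94 061", "10001", None, "30301"],
    "lifetime_value": [120.0, 95.5, 95.5, 20000.0, 80.0],
})

# Structure discovery: schema, completeness, and cardinality per column.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean().round(2),
    "distinct": df.nunique(),
})
print(profile)

# Content discovery: distributions expose outliers such as the 20,000 lifetime value.
print(df["lifetime_value"].describe())
```

Even on this toy table, the profile flags missing values and suspicious cardinality, and the distribution exposes an outlier; the malformed "94 061" ZIP code is the kind of finding the next step documents.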
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
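Captured as code, that finding becomes a reusable rule for future profiling runs. A minimal sketch, assuming standard 5-digit or ZIP+4 formats:

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # e.g. 94061 or 94061-1234

def valid_zip(value: str) -> bool:
    """Return True only for well-formed US ZIP codes."""
    return bool(ZIP_RE.fullmatch(value))

print(valid_zip("94061"))   # True
print(valid_zip("94 061"))  # False: the stray space is caught by the rule
```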
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. Left unchecked, data quality defects will continue to occur as a result of both system changes and shifts in user behavior.
Build a platform that can measure and monitor data quality on an ongoing basis.
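As a bare-bones illustration of the idea (a hand-rolled sketch with made-up numbers, not a substitute for a monitoring platform), an ongoing check can compare today's value of a quality metric against its historical baseline and alert on drift:

```python
import statistics

# Hypothetical history of a daily metric, e.g. the null rate of a key column.
history = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.011]
today = 0.045

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Simple drift rule: alert when today's value sits more than 3 standard deviations
# away from the historical mean.
if abs(today - mean) > 3 * stdev:
    print(f"ALERT: null rate drifted to {today:.3f} (baseline {mean:.3f} ± {stdev:.4f})")
```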
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis; semi-structured data sets, nested data formats, blob storage, and streaming data have no place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
How data observability compares with traditional data quality:

| Data Observability | Data Quality |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects issues, investigates their root cause, and helps remediate them | Detects issues and helps remediate them |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing of data quality rules adds to its cost |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open source data validation software |
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed