Companies are spending a lot of money on data and analytics capabilities, creating more and more data products for people inside and outside the company. These products rely on a tangle of data pipelines, each a choreography of software executions transporting data from one place to another. As these pipelines become more complex, it’s important to have tools and practices to develop and debug changes and mitigate issues before they cause downstream effects. Data observability, monitoring, and testing are all ways to improve pipelines, but they’re not the same.
If you’re confused about how these three concepts relate to each other, keep reading. This article will explain and compare data observability, monitoring, and testing by answering these questions for each of them:
- What is it?
- Why do you need it?
- Which tools offer it?
First, you’ll learn about data observability and why it’s needed.
What Is Data Observability?
Data observability is a more complete, holistic approach to data quality and is often a progression in the maturity of data pipelines.
Data observability goes beyond traditional monitoring. It aims to reduce the time that data is unreliable by using intelligent tools that track various data metrics and help troubleshoot and investigate data quality issues, lowering both the mean time to detect (MTTD) and the mean time to resolve (MTTR) them.
Data observability tools come with specific types of intelligence in the form of ML-driven anomaly detection models that automatically detect issues.
Unlike data testing and data monitoring, which check for known issues, data observability can observe data patterns and detect issues without any preconceived rules and policies.
Additionally, data observability can track changes in patterns and data values and use that as intelligence to predict future behavior in data. It often serves those predictions in the form of a metric threshold. For example, based on observed values for row count, the tool will predict an expected range, and when the data falls outside that range, it creates and sends an alert.
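As a rough illustration of the row count example, here is a minimal Python sketch. It assumes a simple statistical band (mean plus or minus three standard deviations) as a stand-in for the ML models an observability tool would use; the function names and the sample counts are hypothetical.

```python
from statistics import mean, stdev

def expected_range(history, k=3.0):
    """Return a (low, high) band of mean +/- k standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma

def check_row_count(history, observed, k=3.0):
    """Return an alert message if observed falls outside the band, else None."""
    low, high = expected_range(history, k)
    if observed < low or observed > high:
        return f"row count {observed} outside expected range [{low:.0f}, {high:.0f}]"
    return None

# Hypothetical daily row counts for one table
daily_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(check_row_count(daily_counts, 1003))  # close to the historical average
print(check_row_count(daily_counts, 400))   # far below anything seen so far
```

The point of the sketch is that nobody wrote a rule saying "row count must be near 1,000"; the range is derived from the observed history and shifts as that history changes.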
Modern data observability tools can be tightly integrated with your data stack. They give you a detailed view of data quality and pipeline reliability at every step and can work as a control plane for your data pipelines. This capability is not available in pure data testing or data monitoring.
Why Do You Need Data Observability?
Data products, analytic reports, and ML-based algorithms often rely on input from multiple source systems and data transformation workflows. If one changes or malfunctions, it can break all downstream dependencies.
Changing a data pipeline can feel like dealing with a Jenga tower. Change a single piece, and the whole thing can come crashing down.
Data observability helps data owners understand and resolve unexpected issues inside the data pipelines that feed downstream data products and applications, in both development and production environments. Used this way, it helps prevent unreliable data from flowing through the pipeline.
For example, say interactions with your app are stored as semi-structured logs in a NoSQL database such as MongoDB; data is extracted via Apache Beam and landed in Amazon S3 storage. Next, a stored procedure in Snowflake queries these logs and loads them in a tabular format in a staging schema. Finally, dbt processes the data and adds it to the data model in the production schema. Apache Airflow orchestrates the whole process.
Since six systems handle the data in sequence, data observability can monitor each individually and the flow as a whole. The flow can be programmed to use data quality signals and alerts from the data observability tool to open a ticket, label bad data for future remediation, or stop the pipeline altogether.
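To make the idea of a pipeline reacting to data quality signals more concrete, here is a minimal Python sketch. The severity levels, the `Action` names, and the policy mapping are all assumptions made for illustration; a real observability tool and orchestrator would each have their own alert format and hooks.

```python
from enum import Enum

class Action(Enum):
    CONTINUE = "continue"
    OPEN_TICKET = "open_ticket"
    LABEL_BAD_DATA = "label_bad_data"
    STOP_PIPELINE = "stop_pipeline"

def decide_action(severity):
    """Map a (hypothetical) alert severity to a pipeline action."""
    policy = {
        "none": Action.CONTINUE,
        "low": Action.OPEN_TICKET,
        "medium": Action.LABEL_BAD_DATA,
        "high": Action.STOP_PIPELINE,
    }
    # Unknown severities fail safe by stopping the pipeline.
    return policy.get(severity, Action.STOP_PIPELINE)

def run_step(step_name, severity):
    """Return the chosen action, or raise to halt the pipeline on a severe alert."""
    action = decide_action(severity)
    if action is Action.STOP_PIPELINE:
        raise RuntimeError(f"{step_name}: halted on data quality alert")
    return action

print(run_step("load_staging", "low"))
```

Raising an exception is one simple way to stop a run, since most orchestrators treat a failed task as a reason not to run its downstream tasks.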
Data Observability Tools
It’s possible to build your own data observability platform. However, this means not only implementing data validation tests but also adding trending, continuous monitoring, and analysis of data quality outcomes, creating a visualization layer on top, and implementing ML capabilities for anomaly detection.
If that seems like a lot of work, it is. That’s why vendors provide most of these capabilities out of the box. However, there is quite a bit of variance among these tools. Some can only observe analytical and SQL-based sources, while others are centralized data observability platforms that can monitor data across all systems and sources in a data pipeline, regardless of its structure. Additionally, the way data quality metrics are calculated can add processing costs to your cloud data warehouse and storage systems. Distinguish the platforms that push computation down to your databases from those that don’t, and factor that into your total cost of ownership (TCO).
Often, data monitoring is used in the same sentence as data observability. However, there are differences between the two.
What Is Data Monitoring?
Data monitoring is a step beyond data testing. It is usually implemented after data tests have been put in place while building new data pipelines or changing existing ones. Once tests validate your data at the right points, you need a monitoring system to keep checking it on an ongoing basis.
Data monitoring is a practice in which pre-defined data metrics are continuously checked against acceptable thresholds so that issues trigger alerts. Proper data monitoring should start from observability: identify data patterns and anomalies that are not yet known issues, and from there define what needs to be measured and monitored. Data monitoring without observability only shows surface problems; data observability offers a deeper understanding of ongoing issues.
You could call monitoring holistic because it goes a step beyond data testing, and comparing metrics over time produces patterns and insights that you wouldn’t get from a single data test.
Why Do You Need Data Monitoring?
When it’s obvious what you need to track, data monitoring is the right choice. If you monitor a specific data artifact and know exactly how that data will behave over time, you can set rules to monitor it and set up alerts to get notifications.
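Rule-based monitoring of this kind can be sketched in a few lines of Python. The `Rule` class, the metric names, and the thresholds below are hypothetical; the sketch only shows pre-defined metrics being compared against fixed, hand-picked bounds.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    metric: str
    min_value: float
    max_value: float

def evaluate(rules, metrics):
    """Return alert messages for metrics that are missing or out of bounds."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            alerts.append(f"{rule.metric}: no value reported")
        elif not (rule.min_value <= value <= rule.max_value):
            alerts.append(
                f"{rule.metric}: {value} outside [{rule.min_value}, {rule.max_value}]"
            )
    return alerts

# Hypothetical rules and one set of measured metrics
rules = [Rule("null_rate", 0.0, 0.01), Rule("row_count", 900, 1100)]
alerts = evaluate(rules, {"null_rate": 0.05, "row_count": 1000})
for message in alerts:
    print(message)
```

Unlike an observability tool, nothing here learns what "normal" looks like. The thresholds are only as good as your knowledge of how the data behaves.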
Which Tools Offer Data Monitoring?
It is often hard to find standalone tools of this kind, partly because some data monitoring tools have repositioned themselves as data observability platforms without offering the complete functionality of data observability, and partly because data monitoring is technically a subset of data observability.
For a simple solution, setting up data monitoring can be as quick as feeding a Plotly chart with a metric, with conditional formatting that changes when a threshold is reached. Or, you can use data validation rules on an ongoing basis and gradually build a baseline for automatically detecting outliers and anomalies, which leads to your data observability practice.
While the first two concepts help you measure data quality, data testing helps you confirm it.
What Is Data Testing?
Data tests, or “data quality tests,” validate the assumptions that need to hold true for data to be processed as planned. We could break down tests into two categories:
- The appearance of the data: data type, nulls, format, etc.
- Business rules: unique email addresses, customer age, etc.
Erroneous data requires specific actions, including marking it, processing it differently, storing it for later processing, or triggering a notification requesting manual intervention.
There are many dimensions of data quality that you can test for, including the following:
- Data validity: Dates and times need to be stored in the correct format. A “MM/DD/YY” string could be misinterpreted if “YYYY-MM-DD” is expected. Other common tests check for NULLs and data types.
- Data uniqueness: No two rows in a table should be the same.
- Data completeness: Moving data without filtering or transforming should result in the same number of rows in the destination as in the source.
- Data consistency: If data in multiple places is not identical when it should be, it isn’t consistent. For example, when a customer profile exists in the e-commerce platform and the CRM, the address should be the same in both places.
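The four dimensions above can each be expressed as a small check. The following Python sketch uses only the standard library; the function names and the sample records are hypothetical, and rows are assumed to be dictionaries with hashable values.

```python
from datetime import datetime

def is_valid_date(value, fmt="%Y-%m-%d"):
    """Validity: the value parses as a date in the expected format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

def has_duplicates(rows):
    """Uniqueness: True if any two rows are identical."""
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            return True
        seen.add(key)
    return False

def is_complete(source_rows, destination_rows):
    """Completeness: an unfiltered move keeps the row count unchanged."""
    return len(source_rows) == len(destination_rows)

def inconsistent_keys(left, right, field):
    """Consistency: keys present in both systems whose field values differ."""
    return [k for k in left.keys() & right.keys() if left[k][field] != right[k][field]]

# Hypothetical customer profiles from two systems
ecommerce = {"c1": {"address": "1 Main St"}, "c2": {"address": "2 Oak Ave"}}
crm = {"c1": {"address": "1 Main St"}, "c2": {"address": "9 Elm Rd"}}

print(is_valid_date("03/04/21"))                     # MM/DD/YY where YYYY-MM-DD is expected
print(has_duplicates([{"id": 1}, {"id": 1}]))        # two identical rows
print(is_complete([1, 2, 3], [1, 2]))                # a row went missing in transit
print(inconsistent_keys(ecommerce, crm, "address"))  # customers whose addresses differ
```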
Why Do You Need Data Testing?
Whether you’re scraping the web, using sensors, or collecting user input from open text fields, there are many ways that data can become corrupted. This could break business-critical models or skew important reports, among other problems. A critical part of building a data pipeline that feeds business applications, analytics, or even data products is testing that data for accuracy, validity, and freshness.
Which Tools Offer Data Testing?
First of all, data tests can easily be written with vanilla Python. Conditional statements or assertions could do the trick for simple pipelines. However, for large projects, you need to keep your tests manageable.
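As a minimal sketch of what vanilla Python data tests might look like, the example below uses assertions to fail fast on broken assumptions and a conditional statement to set aside rows that violate a business rule. The field names and the age bounds are hypothetical.

```python
def assert_batch(rows):
    """Fail fast if the batch breaks assumptions the pipeline depends on."""
    assert rows, "batch is empty"
    for i, row in enumerate(rows):
        assert isinstance(row.get("email"), str), f"row {i}: email must be a string"
        assert row.get("age") is not None, f"row {i}: age is NULL"

def split_by_business_rule(rows, min_age=0, max_age=120):
    """Set aside rows with an implausible age instead of failing the whole batch."""
    good, flagged = [], []
    for row in rows:
        if min_age <= row["age"] <= max_age:
            good.append(row)
        else:
            flagged.append(row)
    return good, flagged

# Hypothetical batch of user records
batch = [
    {"email": "a@example.com", "age": 30},
    {"email": "b@example.com", "age": 150},
]

assert_batch(batch)  # passes: every row has a string email and a non-NULL age
good, flagged = split_by_business_rule(batch)
print(len(good), len(flagged))
```

Keep in mind that Python skips `assert` statements when run with the `-O` flag, so for production pipelines you may prefer explicit `if` checks that raise exceptions.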
That’s why most observability platforms offer some framework to perform data tests.
Data observability, data monitoring, and data testing may be separate concepts, but as you’ve seen in this article, they are intertwined.
Data observability, a relatively new practice within the data sphere, is a set of measures that can help predict and identify data issues through external symptoms. By processing the output and data artifacts of data pipelines in relation to one another, it can detect anomalies and indicate what’s causing them.
Data monitoring, a subset of observability, is a practice in which pre-defined data metrics are continuously checked against acceptable thresholds. It only confirms that there is an anomaly.
Data tests check either format, such as null checks, or validity, such as business rules, to match your data against a specified list of assumptions. Each test is limited in scope and operates in isolation from the other tests.
In an ideal world, you could develop all three to detect every possible data issue, but your organization’s resources aren’t endless. Using the right tool for exactly what you need will help you maintain high-quality data while keeping your resources and efforts focused.
This article was originally published on Dataversity.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project. Without it, poor data quality could affect critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
- Data collection
- Discovery and analysis
- Documenting the findings
- Data quality monitoring
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety rather than sampling it, because samples can skew your results.
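To give a feel for what structure and content discovery produce, here is a minimal Python sketch that profiles a single column. The `profile_column` function and the sample values are hypothetical; a profiling tool would compute many more statistics than this.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: size, nulls, distinct values, types, and top values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": dict(Counter(type(v).__name__ for v in non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical "country" column with mixed casing, a NULL, and a stray number
country = ["US", "US", "us", None, "DE", 44]

profile = profile_column(country)
for name, value in profile.items():
    print(f"{name}: {value}")
```

Even this small summary surfaces questions worth asking: why does a country column contain an integer, and are "US" and "us" meant to be the same value?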
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
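A finding like the ZIP code example can be turned into a reusable rule. The sketch below assumes that valid values are strings of five digits with an optional four-digit extension (ZIP+4); the pattern and function names are illustrative, not a complete validator for US postal codes.

```python
import re

# Assumed rule: five digits, optionally followed by a hyphen and four digits
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")

def is_valid_zip(value):
    """True if the value is a string matching the assumed ZIP format."""
    return isinstance(value, str) and ZIP_PATTERN.fullmatch(value) is not None

def find_invalid_zips(values):
    """Return the values that break the rule, for documentation or remediation."""
    return [v for v in values if not is_valid_zip(v)]

# Hypothetical values found during profiling
zips = ["94061", "94 061", "94061-1234", "9406"]
print(find_invalid_zips(zips))
```

Running the rule on each new profile of the data makes it easy to tell whether the issue you documented has come back.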
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur as a result of both system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.