We’re excited to announce that the new eBook, “Data Observability, The Reality” from @Ravit Jain of “The Ravit Show”, featuring Telmai's chapter on "Understanding Data Quality and Data Observability", is now available.
Here is an excerpt from the book, contributed by Mona Rakibe, CEO/Co-Founder of Telmai:
Understanding Data Quality and Data Observability
Quality is a term that signals the value we are willing to place in something, be it the food we eat or the cars we drive. High quality means high trust and, therefore, high value, and data is no different.
Today, data is the backbone of every modern business. With the increasing amount of data generated and consumed daily, companies must ensure their data is of high quality.
Data quality indicators give organizations a way to measure the quality of their data. Data Observability, in turn, helps businesses gain faster visibility into those data quality KPIs.
The exponential growth in scale and volume of data has made it extremely hard to assess the health of the data.
Data Observability is a set of techniques that can help continuously monitor data to detect anomalies. This approach goes beyond the traditional capabilities as it leverages statistical analysis and machine learning techniques to monitor various data metrics, time, and sources of the data issues and aggregate those. Such a system aims to reduce the mean time to detect (MTTD) and mean time to resolve (MTTR) data issues.
Data Observability is an approach designed for modern data architecture that scales with the data.
Data Quality Key Performance Indicators (KPIs)
Data quality indicators are a set of metrics used to measure the quality of data. These metrics include completeness, accuracy, consistency, uniqueness, timeliness, and validity.
The goal of data quality indicators is to provide an objective and consistent way to measure the quality of data. Organizations most commonly evaluate their data quality against the six indicators below.
Completeness
Data completeness refers to the degree to which a data set contains the required and relevant information without missing or incomplete data. In other words, it measures whether all critical data points have been captured for a particular dataset.
Complete data is essential for accurate and reliable data analysis and decision-making. Incomplete data may lead to biased or erroneous results, particularly in statistical analysis. Additionally, missing data points can make it difficult to identify patterns or relationships within the data, which can compromise the validity of the analysis.
There are various reasons why data might be incomplete, including data entry errors, partial data loads, data loss, or data extraction issues.
With Data Observability, users automatically benefit from continuous monitoring of data completeness. Unlike traditional DQ tools that measure completeness against pre-defined criteria, data observability leverages historical insights to predict completeness thresholds. Completeness can be measured using the size of the data, row counts, or the rate of populated attribute values.
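As a minimal sketch of what such a completeness check might look like, consider the following. The field names, records, and the 90% threshold are all hypothetical; a real observability tool would learn the threshold from historical runs rather than hardcode it.

```python
# Minimal completeness check: compute the populated-value rate per attribute
# and compare it against a threshold (hardcoded here for the sketch).

def completeness(records, fields):
    """Return the fraction of non-missing values per field."""
    total = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / total
        for f in fields
    }

# Hypothetical batch of customer records with one missing email.
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]

rates = completeness(batch, ["id", "email"])
# Flag any attribute whose populated rate falls below an assumed 90% floor.
alerts = [f for f, rate in rates.items() if rate < 0.9]
print(alerts)  # ['email']
```

The same per-attribute rates can be tracked over time, so a sudden drop in completeness stands out against the historical baseline.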
Uniqueness
Uniqueness is an important data quality indicator that measures whether data contains duplicates. This indicator is particularly relevant when dealing with data that includes customer IDs, product codes, or other unique identifiers.
The uniqueness indicator evaluates whether the same identifier appears more than once in the dataset. For example, if a customer ID is supposed to be unique, but there are duplicate IDs in the dataset, the uniqueness indicator would be low. Sometimes even if the identifiers are unique, the data set might contain multiple records for the same customer.
Non-unique identifiers can lead to a range of issues. For example, if duplicate customer IDs exist in a customer database, it can be challenging to identify which transactions or interactions belong to which customer. This can result in inaccurate reporting, poor customer service, and incorrect billing.
Traditionally, this was solved by applying table constraints like primary keys and by manual stewardship to identify and remediate duplicate records. These techniques do not scale to the volume and formats (semi-structured, streaming, columnar, etc.) of modern data.
Data Observability tools not only automatically classify attributes as categorical or non-categorical, but also automatically detect attributes that hold unique values and flag any drift in an attribute's uniqueness.
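A hedged sketch of such a uniqueness-drift check follows. The ID values and the 10% tolerance are made up for illustration; in practice the baseline would come from historical profiling runs.

```python
# Hypothetical sketch: track the uniqueness ratio of an identifier column
# and flag a drift against a historical baseline.

def uniqueness_ratio(values):
    """Fraction of non-null values that are distinct; 1.0 means no duplicates."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 1.0

yesterday = [101, 102, 103, 104]   # fully unique customer IDs
today = [101, 102, 102, 103]       # one duplicate slipped in

baseline = uniqueness_ratio(yesterday)  # 1.0
current = uniqueness_ratio(today)       # 0.75

# Flag a drift when uniqueness drops by more than a (made-up) 10% tolerance.
drifted = current < baseline * 0.9
print(drifted)  # True
```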
Timeliness and Freshness
The timeliness of data refers to the degree to which data is available within an appropriate timeframe to support timely decision-making or analysis. In other words, it measures whether data is current and up-to-date.
The Freshness of data is closely related to the timeliness of data, but it specifically refers to how recently the data was collected or updated. Fresh data is data that has been collected or updated recently and is, therefore, more likely to be accurate and relevant.
Freshness can be categorized into:
- Table-level Freshness, i.e., how often a table is being updated and whether the time since the last update is anomalous
- Record-level Freshness, i.e., what is the percentage of outdated records (entities) and whether this percentage increases abnormally
- Entity-level Freshness, i.e., similar to record-level, except it takes into account that multiple versions of an entity may exist in the same table
There are different options for implementing table-level Freshness. Some implementations use database query logs to determine the write rate; others look at the last-update timestamp in the table's metadata; still others look at the most recent timestamp within the table (when a record-update attribute is part of the schema).
Using Data Observability, users can automatically get insights on the frequency of updates to tables and the data and configure freshness metrics relevant to business needs.
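A table-level freshness check of the kind described above can be sketched as follows. The load interval, timestamps, and staleness rule are illustrative assumptions, not a specific tool's behavior.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, expected_interval, now=None):
    """Table-level freshness: True when the time since the last update
    exceeds the expected load interval."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated > expected_interval

# Hypothetical metadata: the table is loaded daily, but the last write
# happened 26 hours ago.
now = datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc)
last_write = now - timedelta(hours=26)
print(is_stale(last_write, timedelta(hours=24), now))  # True
```

An observability tool would additionally learn the typical update cadence from history instead of relying on a fixed interval.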
Validity
Validity is an indicator that measures whether the data conforms to predefined expectations or rules.
Often data teams ensure validity by checking for data types, formats, and ranges. Data is valid if it conforms to a particular syntax or rules.
This is often the first step in ensuring data quality.
Often data teams will write validation checks for every system in the data pipeline, like mobile apps, dbt, data loaders, landing storage, etc.
For example, let's say a company collects customer data through an online form. Data validity means ensuring that the data entered by the customers is accurate and complete and represents the customers' precise information. This may involve validating data against predefined rules and criteria, such as ensuring that email addresses are correctly formatted, that phone numbers are valid and complete, and that age is within an acceptable range.
Open-source tools like Great Expectations have played a key role in accelerating pipeline data validation.
Data Observability takes this approach further by first predicting and recommending validations based on the data stored in the system (statistical analysis) and then enabling users to write scalable validations.
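A minimal sketch of the kind of validation rules described above, checking an email format and an age range. The field names, rules, and bounds are hypothetical examples, not rules recommended by any particular tool.

```python
import re

# Hypothetical validity rules of the kind a data team might define
# (or an observability tool might recommend from observed value patterns).
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "") is not None,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}

def validate(record):
    """Return the names of fields that fail their validity rule."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]

print(validate({"email": "a@example.com", "age": 34}))   # []
print(validate({"email": "not-an-email", "age": 240}))   # ['email', 'age']
```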
Accuracy
Data Accuracy refers to the extent to which organizational data reflects the actual values or attributes of the objects or events it represents.
In other words, data accuracy is a measure of how closely the data matches reality. Ensuring data accuracy involves verifying data validity and checking for errors, inconsistencies, and, most importantly, anomalies. This can be achieved through various methods, such as manual data entry verification, automated data validation algorithms, and data profiling techniques.
Maintaining data accuracy also requires ongoing efforts to monitor and correct any errors or issues that arise over time.
A classic example of inaccurate data is an out-of-range value: such data can pass all the tests for validity, completeness, and uniqueness, yet still fail to be accurate.
Accuracy is the most challenging data quality metric as it needs context. Sometimes the context lies with source owners, and other times with data consumers.
Traditionally, this was solved using business rules and manual stewardship, and both these approaches fail to scale for the modern data ecosystem.
Data observability tools provide a sophisticated approach by aggregating historical data and data dependencies and flagging anomalies within data, like outliers and drifts(over time). Data teams can further validate/remediate anomalies using a human-in-loop approach.
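As an illustrative sketch, a drift check over a historical metric series might resemble a simple z-score test. The metric values and the threshold of 3 standard deviations below are made up; real tools use richer statistical and ML models.

```python
import statistics

def is_anomalous(history, value, z_threshold=3.0):
    """Flag a metric value that sits far outside its historical distribution."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold

# Hypothetical daily order totals: a stable history, then a suspicious spike.
history = [1000, 1020, 980, 1010, 990]
print(is_anomalous(history, 1500))  # True  (outlier)
print(is_anomalous(history, 1005))  # False (within normal range)
```

Flagged values would then go to a human-in-the-loop review, as described above, to decide whether the anomaly is a real data issue or a legitimate business change.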
Consistency
Data consistency refers to the degree to which data is uniform, coherent, and accurate across different sources or instances. In other words, data consistency means that data is the same across different systems, databases, or applications. Consistency can be evaluated by comparing the data to other sources, checking for duplicates, and evaluating naming conventions. To ensure data consistency, it is essential to establish data standards and guidelines, implement data governance controls, and regularly monitor data quality. For example, if sales data is inconsistent across different regions, it can be challenging to compare performance accurately or identify trends.
Users who implement data observability across multiple systems can easily compare data metrics to find drifts (inconsistencies) without writing code or SQL.
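Such a cross-system comparison can be sketched as follows. The system names, region counts, and 5% tolerance are illustrative assumptions.

```python
# Illustrative sketch: compare the same metric (row count per region)
# as computed in two systems, and flag regions that disagree.
warehouse = {"us-east": 12000, "eu-west": 8000}
crm = {"us-east": 12000, "eu-west": 7200}

def inconsistent(a, b, tolerance=0.05):
    """Return the keys whose metric values differ by more than the tolerance."""
    return [
        k for k in a
        if abs(a[k] - b.get(k, 0)) > tolerance * max(a[k], b.get(k, 0))
    ]

print(inconsistent(warehouse, crm))  # ['eu-west']
```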
Role of Data Lineage
Data lineage is closely related to data quality: it is a crucial part of understanding and managing data.
Data lineage refers to the ability to track the data as it flows through the system, including where it came from, how it was transformed, where it is stored, and how it is used. This information is essential for understanding the data and its quality, as it allows you to trace any issues or errors back to their source and understand how the data has been used.
Data lineage can also be an essential aspect of data governance. It allows organizations to understand the data and its quality and ensure that it is used correctly and effectively.
Data teams use lineage to improve data quality by identifying and correcting errors and ensuring that data is entered and stored in a consistent format. Additionally, understanding data lineage can help organizations identify and remove duplicate or inconsistent data and ensure that data is of sufficient quality.
Data lineage is not a data quality indicator. Still, it is a crucial aspect of overall data governance. Most observability tools leverage lineage to find the root cause and impact of an issue.
Why do traditional data quality tools fail in the modern data stack?
Traditional data quality tools satisfy the needs of the conventional data stack, which is monolithic and designed to work with structured data sources. These tools focus on data validation, cleansing, and standardization to ensure accuracy, completeness, and reliability. Writing validation rules in SQL or using open-source tools is one way to implement data quality. Traditional ETL tools often have data quality rules embedded in their user interface to transform data into higher quality, and they work best for structured data at limited scale.
However, traditional data quality tools cannot rise to meet the needs of modern data stacks, and here are the main reasons why they fail:
- Data Stewardship and Sampling: In some cases, having data stewards analyze, remediate, or make decisions from a sample of data rather than the entire population can be helpful. But sampling has several drawbacks and limitations, including the risk of bias if the sampling method is not random or if the sample size is too small to be representative of the population. This can lead to inaccurate or misleading conclusions. Sampling also limits the scope of analysis to the specific sample taken, losing granularity and the unusual data points necessary for analysis. The inability to account for outliers is another limitation of this method.
- Inability to handle semi-structured and unstructured data: Traditional data quality tools were designed to work with structured data stored in relational data sources. They cannot handle the semi-structured and unstructured data that makes up a huge component of modern data stacks, including social media posts, audio and video recordings, and free-form text.
- Lack of scalability: The modern data stack is designed to generate and handle massive amounts of data, which traditional data quality tools struggle to manage; they cannot keep up with the data volume, leading to slow processing times and potential data quality issues. Most of these tools were not designed for distributed computing and face performance bottlenecks.
- The complexity of data sources: Modern data stacks often consume data from multiple sources, including internal and external systems, third-party APIs, and cloud-based data stores. Traditional data quality tools often struggle to handle the complexity of these data sources, leading to data quality issues and delays in processing.
- Limited data governance capabilities: Traditional data quality tools lack the advanced data governance capabilities needed to manage data in modern data stacks, such as data lineage tracking, data cataloging, and data privacy management.
Traditionally, multiple techniques and tools were available to measure, monitor, and improve data quality. They include data profiling, data cleansing, validation, standardization, and governance. Data quality is an ongoing process that requires continuous monitoring, updating, and maintenance to ensure that the data remains accurate, complete, and consistent over time.
This path can be accelerated using self-learning tools like Data Observability.
Data Observability as the Foundation of Data Quality
Data Observability tools are natively designed to scan the entire dataset to identify anomalies like outliers and drift. These outcomes map easily onto the data quality indicators defined above.
Data observability tools are perfect candidates to deliver superior data quality to the modern data stack as they bring versatile capabilities, as follows, to the table.
Handle Semi-structured Data: Data Observability tools are designed for modern data stacks that require processing and analyzing large volumes of data from various sources, including semi-structured data.
Operate in the Cloud: Data Observability is designed to operate and scale efficiently in the cloud and provide capabilities essential for managing the scale of data in modern data stacks.
Work with Streaming Data: Sophisticated data observability tools can handle both batch and streaming data. They are well-integrated platforms that can span and serve complex data pipelines in ways traditional data quality tools cannot.
Leverage Machine Learning: While data observability tools can validate data against predefined metrics from an available set of policies, they leverage ML and statistical analysis to learn about data and predict future thresholds and drifts.
Provide Interactive Visualization: Given their constant monitoring and self-tuning nature, data observability tools can automatically curate issues and signals in the data into data quality KPIs and dashboards to showcase the ongoing data quality trends visually. These tools are equipped with interactive visualizations to enable investigation and further analysis.
In summary, Data Observability is a set of techniques, and Data Quality (DQ) is the state that's defined using DQ metrics.
One of the outcomes of Data Observability is automatic identification and alerting on DQ metrics. Still, there is so much more to it, like business metric drift, which could be a true positive or false positive. Data Quality is one of the many use cases of Data Observability.
For a complete copy of the book, click here.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, data quality issues could impact critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If data content or structure discovery is important for your use case, make sure you collect and profile your data in its entirety rather than using samples, as sampling will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
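The per-column statistics that feed this kind of discovery and visualization can be sketched as follows. The field names and records are illustrative.

```python
from collections import Counter

def profile(records, field):
    """Minimal column profile: counts that feed discovery and visualization."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "count": len(values),               # total rows
        "nulls": len(values) - len(present),  # missing values
        "distinct": len(set(present)),      # cardinality
        "top": Counter(present).most_common(1),  # most frequent value
    }

# Hypothetical rows with one missing country value.
rows = [{"country": "US"}, {"country": "US"}, {"country": "FR"}, {"country": None}]
print(profile(rows, "country"))
```

Plotting these counts per column (null rates, cardinality, top values) is exactly where graphs make outliers easier to spot than tables.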
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
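The ZIP-code finding above could be captured as a rule like the following sketch, which accepts only five contiguous digits (ZIP+4 formats are out of scope for this illustration).

```python
import re

# Rule derived from the documented finding: a US ZIP code must be
# five digits with no embedded whitespace.
ZIP_RE = re.compile(r"^\d{5}$")

def valid_zip(value):
    return bool(ZIP_RE.match(value))

print(valid_zip("94061"))   # True
print(valid_zip("94 061"))  # False (the typo caught during profiling)
```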
4. Data Quality Monitoring
Now that you know what you have, the next step is to correct these issues. Some issues you may be able to fix yourself; others you may need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur, both as a result of system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, and streaming data have no place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed