Data Observability. What is it?


The data management landscape has changed dramatically over the last decade with the evolution and massive adoption of big data and ML/AI, an ever-increasing number of data sources, and ever-growing data volumes. Ensuring high quality and completeness of data is critical for driving valuable business decisions in enterprises. Yet statistics show that 87% of machine learning projects never make it into production and that data engineers spend 80% of their time cleaning data.
Around five to eight years ago, enterprises started seeing similar patterns around cloud infrastructure. Organizations were investing heavily in cloud architecture with low returns on their investments. There was a lack of monitoring and minimal predictability about anomalies and service failures. This need gave rise to Cloud Infrastructure Observability.
Running SaaS operations at an ever-growing scale, along with the need for efficiency and reliability, put the focus on observability products, and today there are several major players in this space, like Splunk, DataDog, NewRelic, and Dynatrace. I have noticed various interpretations of what observability is, but recently the definition has converged on three key pillars. In the world of Cloud Infrastructure Observability, these pillars are metrics, traces, and logs, and they try to answer the following questions:
- Do I have a problem, and how bad is it?
- Where is my problem, and what is the impact?
- What went wrong?
Initially, the first pillar was addressed by metrics monitoring tools, the second one by tracing, and the third one by logs. The observability area is very dynamic and experiencing explosive growth, so we see many new tools emerging and addressing the needs of each pillar.
However, data problems are hidden from these tools: even when all the metrics, traces, and logs look normal for a data pipeline, it can still produce garbage data. This is a significant problem that not only keeps data engineers in constant escalation mode, burning out as they troubleshoot discrepancies in reports, but also hurts the business in a big way.
Hence a similar concept of observability has now emerged in data management for data quality use cases. More and more companies realize they need to focus on addressing the data issues or what is sometimes referred to as “data downtime.” It is not surprising to see the growing interest in Data Observability.
Just like Cloud Observability, Data Observability Suites are trying to get answers to the same three questions. However, there is no established consensus on the naming. Let me offer my take on the Data Observability pillars:
- Monitoring of Data Quality detects a variety of problems or anomalies in the data. There are numerous things that could go wrong, but we can break them down into three high-level categories: missing/incomplete data, incorrect data, and stale data (a minimal sketch of such checks follows this list).
- Data Lineage helps understand the impact and source of a data anomaly. You need to know how various sources of data relate to each other and how they contribute to downstream systems and reports.
- Data Troubleshooting helps find the root cause of an issue. In application observability, this is the job of logs. In the case of Data Observability, logs are of minimal help, since the pipeline or application is still operating as expected; it is simply processing the wrong data and ultimately causing wrong decisions.
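To make the first pillar concrete, here is a minimal sketch of simple checks for the three anomaly categories. It assumes a hypothetical pandas DataFrame of orders with `amount` and `updated_at` columns; the column names and thresholds are illustrative, and a real monitoring tool would learn them from the data rather than hard-code them.

```python
# Minimal sketch of checks for the three anomaly categories in the first pillar.
# The DataFrame, column names, and thresholds are illustrative assumptions.
import pandas as pd


def check_orders(orders: pd.DataFrame) -> dict:
    now = pd.Timestamp.now(tz="UTC")

    # 1. Missing / incomplete data: null rate per column.
    null_rates = orders.isna().mean()

    # 2. Incorrect data: values outside an expected business range.
    bad_amounts = ((orders["amount"] <= 0) | (orders["amount"] > 100_000)).sum()

    # 3. Stale data: how long since the most recent record was updated.
    last_update = pd.to_datetime(orders["updated_at"], utc=True).max()
    hours_stale = (now - last_update).total_seconds() / 3600

    return {
        "null_rates": null_rates.to_dict(),
        "incorrect_amount_rows": int(bad_amounts),
        "hours_since_last_update": hours_stale,
    }
```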
Given the complexity of the domain, I anticipate a wide range of tools being introduced for each pillar, far greater than what we saw for Cloud Observability.
So what is Data Observability in the end? In short, it is a new discipline that tries to fill the gaps where traditional data management practices like data quality, data profiling, and lineage fall short, helping data engineers achieve operational excellence and deliver business results.
#dataquality #dataobservability #dataops
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project: without it, poor data quality can impact critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process. We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data.
1. Data Collection
Start with data collection. Gather data from your various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect to and analyze all your data without requiring any prep work.
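As an illustration, here is a minimal sketch of pulling several sources into one place for profiling. The CSV path, connection string, and table name are hypothetical; any real setup would substitute its own sources.

```python
# Minimal sketch of consolidating several sources into one place for profiling.
# The CSV path, connection string, and table name are hypothetical.
import pandas as pd
from sqlalchemy import create_engine


def collect_sources() -> dict[str, pd.DataFrame]:
    sources: dict[str, pd.DataFrame] = {}

    # Flat-file export from an operational system.
    sources["crm_contacts"] = pd.read_csv("exports/crm_contacts.csv")

    # Table pulled from an analytical database.
    engine = create_engine("postgresql://user:password@warehouse-host/analytics")
    sources["orders"] = pd.read_sql_table("orders", engine)

    return sources
```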
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure you collect and profile your data in its entirety rather than using samples, as sampling will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
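For example, a basic profiling pass over a pandas DataFrame might combine structure discovery, content discovery, and a quick visual check. This is only a sketch; the DataFrame and its columns are assumed, and no sampling is applied.

```python
# Minimal sketch of structure and content discovery over the full dataset.
# The DataFrame and its columns are hypothetical; no sampling is applied.
import pandas as pd
import matplotlib.pyplot as plt


def profile(df: pd.DataFrame) -> None:
    # Structure discovery: column types and row count.
    print(df.dtypes)
    print(f"rows: {len(df)}")

    # Content discovery: summary statistics and null counts over all rows.
    print(df.describe(include="all"))
    print(df.isna().sum())

    # Visual check: distributions make outliers easier to spot than tables.
    df.select_dtypes(include="number").hist(bins=50)
    plt.tight_layout()
    plt.show()
```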
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
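A finding like that translates directly into a reusable validation rule. Below is a minimal sketch of such a rule; the `zip_code` column name and the regex are illustrative, not a prescribed implementation.

```python
# Minimal sketch of a data quality rule derived from a profiling finding:
# flag US ZIP codes that are not five digits (optionally ZIP+4),
# such as "94 061" typed with a stray space. The column name is illustrative.
import pandas as pd

ZIP_PATTERN = r"^\d{5}(-\d{4})?$"


def invalid_zip_codes(df: pd.DataFrame, column: str = "zip_code") -> pd.DataFrame:
    valid = df[column].astype(str).str.match(ZIP_PATTERN)
    return df[~valid]


sample = pd.DataFrame({"zip_code": ["94061", "94 061", "10001-1234"]})
print(invalid_zip_codes(sample))  # returns the "94 061" row for review
```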
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure the issues get corrected. Some may be problems you can fix yourself; others you will need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. Left unchecked, data quality defects will continue to occur as a result of both system changes and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
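To illustrate what ongoing monitoring can look like in its simplest form, here is a sketch that recomputes metrics, compares them to a stored baseline, and alerts on drift. The metric names, tolerance, and baseline file are illustrative assumptions.

```python
# Minimal sketch of ongoing monitoring: recompute metrics on a schedule,
# compare them to a stored baseline, and alert on drift beyond a tolerance.
# Metric names, the tolerance, and the baseline file are illustrative.
import json
import pandas as pd

TOLERANCE = 0.05  # alert when a metric drifts more than 5% from its baseline


def compute_metrics(df: pd.DataFrame) -> dict:
    return {
        "row_count": float(len(df)),
        "overall_null_rate": float(df.isna().mean().mean()),
    }


def check_against_baseline(df: pd.DataFrame, baseline_path: str = "baseline.json") -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)

    for name, value in compute_metrics(df).items():
        expected = baseline[name]
        drift = abs(value - expected) / max(abs(expected), 1e-9)
        if drift > TOLERANCE:
            # In a real pipeline this would page on-call or post to a channel.
            print(f"ALERT: {name} drifted {drift:.1%} from its baseline")
```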
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
| Data Observability | Data Quality |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects, investigates the root cause of issues, and helps remediate | Detects and helps remediate |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing of data quality rules adds to its costs |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open source data validation software |
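To illustrate the "alerting on anomalies or drifts" example in the table above, here is a minimal sketch of a statistical check on a daily data quality metric, in contrast to a hand-written rule. The window size and z-score threshold are illustrative choices.

```python
# Minimal sketch of statistical anomaly detection on a daily data quality
# metric (for example, row count per day), as opposed to a predefined rule.
# The window size and z-score threshold are illustrative choices.
import pandas as pd


def flag_anomalies(metric: pd.Series, window: int = 14, z_threshold: float = 3.0) -> pd.Series:
    rolling = metric.rolling(window, min_periods=window)
    mean = rolling.mean().shift(1)  # baseline built only from past days
    std = rolling.std().shift(1)
    z_scores = (metric - mean) / std
    return z_scores.abs() > z_threshold  # True where the metric looks anomalous
```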