Data observability has been the talk of the data community for the last couple of years, but it is often treated as a pure alerting system. In this post, I want to share why data observability is not a bystander to your data pipeline but its most crucial part. It's the goddamn data pipeline control plane!
Let's start with our hero, the data metric. A data metric is an indicator that measures data reliability: for example, completeness, accuracy, uniqueness, frequency, length, and distribution. At Telmai, we support monitoring 40+ data metrics and business rules, and we will cover individual data metrics in a separate blog post.
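As a quick sketch of what two of the simplest metrics mean in practice, here is how completeness and uniqueness could be computed over a toy column. This is illustrative only; real observability tools compute many more metrics (distributions, patterns, drift, and so on):

```python
# Two of the simplest data metrics, computed over a toy column.
# Illustrative only; not any particular tool's implementation.

def completeness(values):
    """Fraction of non-null values in the column."""
    return sum(v is not None for v in values) / len(values)

def uniqueness(values):
    """Fraction of distinct values among the non-null values."""
    non_null = [v for v in values if v is not None]
    return len(set(non_null)) / len(non_null)

emails = ["a@x.com", "b@x.com", None, "a@x.com"]
print(completeness(emails))  # prints "0.75"
print(uniqueness(emails))    # 2 distinct values out of 3 non-null
```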
So what happens when a data metric is below an agreed threshold?
The obvious answer is that the user gets alerted, but often that's not enough. Users also need to act on the alert, which means making a change in the pipeline workflow. This is known as the "pipeline circuit breaker" pattern. To be fair, it has existed in data engineering for a while. What has changed is how data observability tooling can enable better orchestration of these patterns.
Let's look at an example of a data pipeline:
A pipeline consists of multiple transformation steps, and your data monitoring should be plugged into each such step.
Each pipeline step can monitor multiple data metrics, and depending on the outcome of that reliability check, it leads to two things: alerting and orchestration.
Almost always, a data issue leads to an alert. However, depending on the severity of the issue and its impact, the alert can be one of three types:
- A soft alert, recorded in a monitoring system like Telmai for later analysis and investigation.
- A user notification, if the issue requires attention so the user can investigate and fix the problem. Notifications can be sent via email, Slack, or a ticketing system so that recipients can prioritize their responses.
- A pager alert, if the issue is urgent and requires immediate attention.
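The three alert types above could be modeled as a simple severity-based router. The thresholds and channel names here are invented for illustration and are not any tool's actual API:

```python
# Illustrative severity-based alert routing; the gap thresholds and
# channel names are hypothetical, not tied to any specific tool.

def route_alert(metric_name: str, value: float, threshold: float) -> str:
    """Return the alert channel for a metric checked against its threshold."""
    gap = threshold - value  # how far below the agreed threshold we are
    if gap <= 0:
        return "none"    # metric is healthy, nothing to do
    if gap < 0.05:
        return "soft"    # record in the monitoring system for later review
    if gap < 0.20:
        return "notify"  # email/Slack/ticket so someone can investigate
    return "page"        # urgent: needs immediate attention

print(route_alert("completeness", 0.97, 0.95))  # prints "none"
print(route_alert("completeness", 0.80, 0.95))  # prints "notify"
```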
Another outcome of monitoring is a change in the pipeline workflow. However, this depends on the type of data metric and the downstream impact of breaching its threshold.
Circuit Open, or block the pipeline:
Let's take an example of a pipeline step where a data observability tool identifies incomplete data in an attribute that is used as a join key.
This should mandate a repair job that is launched automatically, followed by reprocessing the pipeline step where the problem was detected.
If you are using Airflow + Telmai, the DAG can leverage the response from Telmai API for its flow orchestration.
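One way that orchestration decision could look: the DAG's branch callable parses the observability API's check result and decides whether to open the circuit. The response schema and task ids below are invented for illustration and are not Telmai's actual API; in Airflow, a function like this would typically be wrapped in a `BranchPythonOperator`, which runs whichever task id the callable returns:

```python
# Hypothetical branching callable for an Airflow BranchPythonOperator.
# The shape of `check_result` (a parsed observability-API response) and
# the task ids are illustrative, not an actual vendor schema.

def choose_next_task(check_result: dict) -> str:
    """Open or close the circuit based on a data reliability check."""
    failed = [m for m in check_result["metrics"] if not m["passed"]]
    blocking = [m for m in failed if m.get("blocking", False)]
    if blocking:
        # Circuit open: launch the repair job, then reprocess this step.
        return "repair_and_reprocess"
    # Circuit closed: continue the pipeline (non-blocking issues
    # have already been routed to alerting).
    return "continue_pipeline"

result = {"metrics": [
    {"name": "join_key_completeness", "passed": False, "blocking": True},
    {"name": "title_accuracy", "passed": True},
]}
print(choose_next_task(result))  # prints "repair_and_reprocess"
```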
The pro of this approach is that it prevents low-quality data from propagating to downstream steps. The tradeoff is that it adds a small amount of latency to the pipeline step.
Circuit Close, and the pipeline continues:
Now take an example of a pipeline step where a data observability tool like Telmai identifies inaccurate job titles in an attribute that is used for targeted marketing campaigns.
This should not become a pipeline-blocking step. Instead, the outcome can be remediation: a tool like Telmai makes it easy to identify the affected records to remediate, or the DataOps team can segment and query only the records with complete titles.
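For this non-blocking case, the downstream consumer can simply segment out the bad records instead of halting the flow. A toy sketch, with field names invented for illustration:

```python
# Toy segmentation of records with usable job titles; the field names
# and the validity rule are invented for illustration.

def has_complete_title(record: dict) -> bool:
    title = (record.get("job_title") or "").strip()
    return bool(title) and title.lower() != "unknown"

records = [
    {"email": "a@example.com", "job_title": "VP of Data"},
    {"email": "b@example.com", "job_title": ""},
    {"email": "c@example.com", "job_title": "unknown"},
]

# The campaign targets only records that pass the check; the rest are
# queued for remediation rather than blocking the pipeline.
campaign = [r for r in records if has_complete_title(r)]
remediate = [r for r in records if not has_complete_title(r)]
print(len(campaign), len(remediate))  # prints "1 2"
```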
The biggest pro of this approach is that the pipeline flow incurs no added latency, ensuring timely access to fresh data.
These two simple examples make it quite clear that your pipeline workflow can be completely different based on the outcome of data observability, and that outcome depends on the data itself. So as a pipeline owner, you should leverage data observability to add intelligence to the pipeline flow.
In summary, don't treat data observability as a boring BI tool that you only check in on every once in a while. If you set up the data observability workflow right, it becomes the center of all your data pipeline workflows.
If you want to learn more about customer case studies or want a demo from an expert, sign up here.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, data quality issues can impact critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect to and analyze all your data without requiring any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure you collect and profile your data in its entirety; do not use samples, as sampling will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
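Even without a chart, the idea that outliers jump out of aggregates can be sketched numerically, for instance by flagging values far from the mean in standard-deviation terms. The 2.5-sigma cutoff below is a common rule of thumb, not a universal recommendation:

```python
import statistics

# Simplistic z-score outlier flagging over a numeric column.
# The 2.5-sigma cutoff is a rule of thumb chosen for this example.

def flag_outliers(values, cutoff=2.5):
    """Return values more than `cutoff` population stdevs from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # constant column: nothing to flag
    return [v for v in values if abs(v - mean) / stdev > cutoff]

order_totals = [52, 48, 55, 50, 49, 51, 47, 53, 50, 980]  # one bad record
print(flag_outliers(order_totals))  # prints "[980]"
```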
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
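A finding like that translates directly into a validation rule. A minimal sketch for the US ZIP example, using the standard five-digit (optionally ZIP+4) pattern:

```python
import re

# US ZIP codes: five digits, optionally followed by "-XXXX" (ZIP+4).
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def is_valid_zip(value: str) -> bool:
    """Check a value against the US ZIP code pattern."""
    return bool(ZIP_RE.match(value))

print(is_valid_zip("94061"))       # prints "True"
print(is_valid_zip("94 061"))      # prints "False": stray space, flag it
print(is_valid_zip("94061-1234"))  # prints "True"
```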
4. Data Quality Monitoring
Now that you know what you have, the next step is to correct these issues. Some you may be able to fix yourself; others you will need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. Left unchecked, data quality defects will continue to occur as a result of both system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
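The ongoing part can start as simply as comparing each run's metrics against a recorded baseline and flagging drift beyond a tolerance. The metric names, baseline values, and 5% tolerance below are made up for the example:

```python
# Illustrative drift check against a stored baseline; metric names,
# values, and the 5% tolerance are invented for this sketch.

def drifted_metrics(baseline: dict, current: dict, tolerance: float = 0.05):
    """Return metric names whose current value deviates from the
    baseline by more than `tolerance` (missing metrics count as 0.0)."""
    return sorted(
        name for name, base in baseline.items()
        if abs(current.get(name, 0.0) - base) > tolerance
    )

baseline = {"completeness": 0.99, "uniqueness": 0.97}
current = {"completeness": 0.91, "uniqueness": 0.965}
print(drifted_metrics(baseline, current))  # prints "['completeness']"
```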
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed