5 Reasons to Consider Centralized Data Observability for Your Modern Data Stack

Farnaz Erfan

With the rush towards a modern data stack, organizations are increasing their ability to execute faster at a reduced engineering cost.

As data teams lay the foundation for their new modern stack, they are also looking at Data Observability as a critical component to monitor the stack and ensure data reliability.

In this article, we will discuss five reasons why those in search of a modern data stack should implement centralized data observability. First, let's define the term.

What is centralized data observability?

A centralized data observability platform monitors data across the entire data pipeline, from ingestion to consumption. Such a platform not only supports structured and semi-structured data across a myriad of systems such as data warehouses, data lakes, message queues, and streaming sources; it is also capable of supporting all the common data formats like JSON, CSV, and Parquet.

A centralized data observability platform is used to define Data Quality KPIs like Completeness, Uniqueness, Accuracy, Validity, and Freshness across all DataOps systems, and becomes the single, common platform to view, manage, and monitor these KPIs.
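To make these KPIs concrete, here is a minimal sketch of how a few of them could be computed over a batch of records. The field names ("user_id", "email", "updated_at") and the record structure are illustrative assumptions, not part of any specific platform:

```python
# Illustrative sketch: computing Completeness, Uniqueness, and Freshness
# over a small batch of records. Field names are hypothetical examples.
from datetime import datetime, timezone

def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows) if rows else 0.0

def uniqueness(rows, field):
    """Fraction of non-null values that are distinct."""
    values = [r.get(field) for r in rows if r.get(field) is not None]
    return len(set(values)) / len(values) if values else 0.0

def freshness_hours(rows, ts_field):
    """Hours elapsed since the most recent timestamp in the batch."""
    latest = max(datetime.fromisoformat(r[ts_field]) for r in rows)
    return (datetime.now(timezone.utc) - latest).total_seconds() / 3600

rows = [
    {"user_id": 1, "email": "a@x.com", "updated_at": "2023-05-01T00:00:00+00:00"},
    {"user_id": 2, "email": None,      "updated_at": "2023-05-02T00:00:00+00:00"},
    {"user_id": 3, "email": "a@x.com", "updated_at": "2023-05-02T06:00:00+00:00"},
]

print(completeness(rows, "email"))   # 2 of 3 rows have an email
print(uniqueness(rows, "email"))     # 1 distinct value among 2 non-null
```

The point of centralizing these definitions is that every team reads the same numbers, rather than each team re-deriving its own version of "complete" or "fresh".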

Why do modern data stacks need centralized data observability?

While there are many approaches to data observability, and the vendor landscape in this space has grown crowded, there are core reasons why a centralized Data Observability platform best fits a modern, evolving data stack. Here are five:

1. Replacing redundant data quality efforts with automation and ML

It's not uncommon for different teams to independently write rules and queries to understand the health of their data. This not only causes inefficiencies in process and code, but also leads to infrastructure cost overhead as thousands of investigative queries pile up over the years.

There is a reason for moving to a modern data stack. As data engineering resources have spread thinner and thinner across various tools to piece legacy platforms together and maintain what has been built over the years, one thing is clear: there is no room for mundane work in the new stack.

Automation and self-maintained platforms are replacing legacy. A centralized data observability platform is able to keep track of and oversee data as it moves through the new stack. You no longer need to write code to create checks and balances across every stage of the data transformation. Centralized data observability built with machine learning and automation runs in parallel, in the background, to give you peace of mind when you are not looking and notify you if something is wrong.
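As a hedged sketch of the kind of statistical check such a platform might run automatically, consider flagging a day's row count that drifts far from its recent baseline, instead of maintaining a hand-coded threshold query per table. The data, window size, and z-score threshold here are all illustrative assumptions:

```python
# Sketch: a learned baseline check on daily row counts, replacing a
# hand-written threshold rule. Numbers and threshold are hypothetical.
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Flag `today` if it sits more than z_threshold standard
    deviations away from the mean of the historical window."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

daily_row_counts = [10_120, 9_980, 10_050, 10_200, 9_900, 10_110, 10_030]
print(is_anomalous(daily_row_counts, 10_080))  # within baseline -> False
print(is_anomalous(daily_row_counts, 2_300))   # sharp drop -> True
```

A real platform would learn such baselines per metric and per segment; the value of centralizing it is that nobody has to write or maintain this check by hand for every table.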

2. Supporting all new data types - structured or semi-structured 

As we have seen with emerging technologies such as data streaming and reverse ETL, data is constantly sourced and activated across various shapes and formats. While many data observability platforms can monitor structured data such as data warehouses or databases, they are not capable of monitoring data that doesn't have well-defined metadata.

A centralized data observability tool is able to monitor and detect issues and anomalies in all data types, including structured and semi-structured sources. Because this platform relies on data patterns, and not just metadata, centralized observability is flexible enough to observe data across various systems, without forcing the data to be shaped into a structured format before it can be observed.
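One way to observe data patterns without a predeclared schema is to profile the raw records themselves. The sketch below, with hypothetical event fields, infers the types observed per field across newline-delimited JSON, which is enough to surface a type drift no schema would have caught:

```python
# Sketch: profiling semi-structured JSON by observed patterns, not
# metadata. Field names and records are illustrative assumptions.
import json
from collections import Counter, defaultdict

def profile_json_lines(lines):
    """Count the value types observed for each field across records."""
    field_types = defaultdict(Counter)
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            field_types[field][type(value).__name__] += 1
    return {f: dict(c) for f, c in field_types.items()}

events = [
    '{"id": 1, "amount": 9.99}',
    '{"id": 2, "amount": "12.50"}',   # amount drifted to a string
    '{"id": 3, "amount": 4.25, "coupon": "SAVE10"}',
]
profile = profile_json_lines(events)
print(profile)  # "amount" observed as float twice and str once
```

Because the profile is derived from the data itself, the same approach applies whether the records arrive from a message queue, a data lake file, or a streaming source.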

3. Running data observability at every step of the pipeline 

Data pipelines used to be simple: ETL processes cleaned, shaped, and transformed data from legacy databases into normalized formats and data warehouses for BI reporting. Today, data pipelines are ingesting mixed-type data into data lakes and using modern transformation and in-database processes to shape the data; delta lakes and cloud data warehouses have become the centralized source of information, while feature engineering and feature stores are adopted by modern data science projects.

Point data observability tools are often built for data warehouse monitoring or data science and AIOps. They are good systems for monitoring the landing zone or the last mile of a data pipeline, but for data that moves through numerous hops and stops, a centralized data observability platform is crucial to monitor the data at every step: at ingest to detect source-system issues, at the transformation point to ensure ETL jobs performed correctly, and at the data warehouse or consumption layer to detect any anomalies or drift in business KPIs.

4. Creating a central understanding of data metrics across data teams

With metric formulas scattered across BI tools and buried in dashboards, the industry decided that there is a need for a separate metrics layer. One that eliminates recreating and rewriting KPIs in each dashboard, and instead provides a centralized location where KPIs and their definitions are shared, reused, and collaborated on. This metrics layer centralizes key business definitions and metrics to improve the efficiency of data teams. However, it does not ensure the accuracy of those metrics.

A centralized data observability platform deployed on this metrics layer will ensure that the metrics commonly used by downstream systems are tracked to meet quality standards. After all, how could you create reusability without reusable pieces that pass basic quality controls? Centralizing the quality of the metrics is just as important as centralizing the metrics themselves.

5. Ability to change the underlying data stack without impacting observability 

Lastly, consider the amount of innovation we have seen in the data space in the last two years. Teams are moving to a new modern stack, and long gone are the days of SQL interfaces on Hadoop. We will see even more data, analytics, and ML platforms emerge at a rapid pace in the years to come, each solving a particular problem.

A Data Observability platform that is source-specific (that is, dependent on the metadata and logs of a specific system) doesn't port to another system easily. As you onboard new systems into your data pipeline, or migrate from one to another, you don't want to redefine your data quality rules again and again. Your Data Observability should be able to move easily with your stack. A centralized Data Observability platform can do that: it is agnostic to the systems it monitors and uses its own computation engine to calculate metrics, without relying on each underlying data store's metadata or SQL dialect to examine the data at hand.

Closing thoughts

As modern data stacks have become more and more popular, data observability has also gained momentum. 

In the past, simple checks and balances, pre-defined rules, and metadata monitoring solutions were sufficient. Today, data pipelines are more complex, and many more systems and platforms have been added to the stack to either capture more data, or to make it more consumable and actionable. 

A centralized data observability platform is capable of running in parallel to this modern data stack, ensuring trust in data at every step, and across a variety of sources and transformations. This data observability platform is capable of monitoring the modern data stack as it is today, and is also architecturally designed in a way that future-proofs it as new systems and sources get added to the stack as the industry evolves.

Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project, and without it, data quality issues could impact critical business decisions, customer trust, sales, and financial opportunities.

To get started, there are four main steps in building a complete and ongoing data profiling process:

  1. Data Collection
  2. Discovery & Analysis
  3. Documenting the Findings
  4. Data Quality Monitoring

We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.

What are the different kinds of data profiling?

Data profiling falls into three major categories: structure discovery, content discovery, and relationship discovery. While they all help in gaining a deeper understanding of the data, the types of insights they provide differ:


Structure discovery analyzes whether data is consistent, formatted correctly, and well structured. For example, if you have a ‘Date’ field, structure discovery helps you see the various date patterns (e.g., YYYY-MM-DD or YYYY/DD/MM) so you can standardize your data into one format.


Structure discovery also examines simple and basic statistics in the data, for example, minimum and maximum values, means, medians, and standard deviations.
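A minimal sketch of the pattern side of structure discovery: summarizing which date formats appear in a field so that inconsistencies stand out. The regexes below cover only the two formats in this toy sample; a real profiler would recognize many more:

```python
# Sketch of structure discovery: tallying the date patterns present in
# a field. Patterns and sample values are illustrative assumptions.
import re
from collections import Counter

PATTERNS = {
    "YYYY-MM-DD": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "YYYY/MM/DD": re.compile(r"^\d{4}/\d{2}/\d{2}$"),
}

def date_pattern_counts(values):
    """Return how many values match each known pattern."""
    counts = Counter()
    for v in values:
        label = next((name for name, rx in PATTERNS.items() if rx.match(v)),
                     "unrecognized")
        counts[label] += 1
    return dict(counts)

dates = ["2023-01-05", "2023/01/06", "2023-01-07", "Jan 8, 2023"]
print(date_pattern_counts(dates))
```

A summary like this tells you immediately that the field mixes formats and which one dominates, which is the input you need to standardize it.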


Content discovery looks more closely into the individual attributes and data values to check for data quality issues. This can help you find null values, empty fields, duplicates, incomplete values, outliers, and anomalies.


For example, if you are profiling address information, content discovery helps you see whether your ‘State’ field contains two-letter abbreviations, fully spelled-out state names, both, or potentially some typos.


Content discovery can also be a way to validate databases with predefined rules. This process helps find ways to improve data quality by identifying instances where the data does not conform to predefined rules. For example, a transaction amount should never be less than $0.
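The checks above can be sketched in a few lines. The transaction records, field names, and the "amount never below $0" rule from the text are the only inputs; everything else here is an illustrative assumption:

```python
# Sketch of content discovery: scanning values for nulls, duplicates,
# and violations of a predefined rule (amounts must not be negative).
def content_issues(transactions):
    """Return (issue_type, record_id) pairs found in the batch."""
    issues = []
    seen_ids = set()
    for t in transactions:
        if t.get("amount") is None:
            issues.append(("null_amount", t["id"]))
        elif t["amount"] < 0:
            issues.append(("negative_amount", t["id"]))
        if t["id"] in seen_ids:
            issues.append(("duplicate_id", t["id"]))
        seen_ids.add(t["id"])
    return issues

txns = [
    {"id": "t1", "amount": 25.0},
    {"id": "t2", "amount": -5.0},   # violates "never less than $0"
    {"id": "t3", "amount": None},
    {"id": "t1", "amount": 25.0},   # duplicate record id
]
print(content_issues(txns))
```

Each finding pairs the kind of issue with the offending record, which is the raw material for the remediation step that follows profiling.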


Relationship discovery identifies how different datasets are related to each other, for example, key relationships between database tables, or lookup cells in a spreadsheet. Understanding relationships is most critical when designing a new database schema, a data warehouse, or an ETL flow that joins tables and datasets based on those key relationships.

Data Observability vs. Data Quality

Approach
- Data Observability: Leverages ML and statistical analysis to learn from the data and identify potential issues; can also validate data against predefined rules.
- Data Quality: Uses predefined metrics from a known set of policies to understand the health of the data.

Detection and remediation
- Data Observability: Detects issues, investigates their root cause, and helps remediate.
- Data Quality: Detects and helps remediate.

Examples
- Data Observability: Continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows.
- Data Quality: Data validation, data cleansing, data standardization.

Cost
- Data Observability: Low-code / no-code to accelerate time to value and lower cost.
- Data Quality: Ongoing maintenance, tweaking, and testing of data quality rules adds to its cost.

Audience
- Data Observability: Enables both business and technical teams to participate in data quality and monitoring initiatives.
- Data Quality: Designed mainly for technical teams who can implement ETL workflows or open-source data validation software.


Start your data observability today

Connect your data and start generating a baseline in less than 10 minutes. 

No sales call needed


Telmai is a platform for data teams to proactively detect and investigate anomalies in real time.
© 2023 Telm.ai. All rights reserved.