How to Build a Data Monitoring System

Farnaz Erfan

As organizations become more data-driven, decision-makers increasingly use analytical and predictive systems to aid them in understanding their data. However, to ensure trust in these systems, a data quality strategy and set of standards should be in place. Many organizations use data monitoring systems to ensure their data meets a set of predefined standards and also to detect new data anomalies and outliers.

Building a data monitoring system from the ground up can be very challenging, and significant effort and resources are needed to design, develop, and maintain such a system. In this article, you'll learn about the implications of creating a data monitoring system and the various aspects that should be considered in this process.

Before You Start

Before building a system to monitor data, you need to make sure that the underlying data pipeline is observable, meaning that the state of your system can be deduced from its outputs. Although this is a step that is automated in data observability tools like Telmai, if you are building your own system, you'll need to implement changes to make that possible.

For example, if you often deal with data in a semi-structured format, you won't be able to monitor its individual attributes directly, since the data doesn't follow a particular schema or known column definitions. In these cases, you would need to transform your data into an observable format before you can proceed.
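As an illustration, here is a minimal sketch of flattening semi-structured JSON records into a flat, columnar shape that a monitor can profile. The record fields are hypothetical, and pandas' `json_normalize` is just one convenient way to do the flattening:

```python
import pandas as pd

# Hypothetical semi-structured events; real payloads would come from your pipeline.
events = [
    {"id": 1, "user": {"name": "Ada", "address": {"zip": "94105"}}, "amount": 12.5},
    {"id": 2, "user": {"name": "Lin", "address": {}}, "amount": 7.0},
]

# Flatten nested fields into dotted column names (such as user.address.zip)
# so each attribute can be observed and profiled individually.
df = pd.json_normalize(events)
print(df.columns.tolist())
print(df.isna().sum())  # missing nested fields surface as NaN and become measurable
```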

1. Use the Right Data Metrics

A data metric is a standard of measurement used to assess the data being monitored. Metrics are typically defined relative to a reference value, which could represent the best-case (or worst-case) scenario. These metrics must be selected carefully and be appropriate for your particular use case. For data quality, common metrics include completeness, uniqueness, freshness, validity, accuracy, and consistency.
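As a rough sketch, assuming tabular data in a pandas DataFrame with a hypothetical `updated_at` timestamp column, a few of these metrics can be computed like this:

```python
import pandas as pd

def basic_quality_metrics(df: pd.DataFrame, timestamp_col: str = "updated_at") -> dict:
    """Compute a handful of simple data quality metrics for a non-empty DataFrame."""
    metrics = {
        # Completeness: share of non-null cells per column.
        "completeness": (1 - df.isna().mean()).to_dict(),
        # Uniqueness: share of distinct values per column.
        "uniqueness": (df.nunique(dropna=True) / len(df)).to_dict(),
    }
    if timestamp_col in df.columns:
        # Freshness: hours since the most recent record.
        latest = pd.to_datetime(df[timestamp_col], utc=True).max()
        metrics["freshness_hours"] = (pd.Timestamp.now(tz="UTC") - latest).total_seconds() / 3600
    return metrics

df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "updated_at": ["2024-05-01T10:00:00Z", "2024-05-01T11:00:00Z", None, "2024-05-02T09:00:00Z"],
})
print(basic_quality_metrics(df))
```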

The common metric known as the anomaly score is a good example of how you can set your metrics. The anomaly score can be in a range from 0 to 100, where the closer to 100 the value is, the more anomalous the data being monitored is, and the more likely that an error is occurring. 
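One simple way to produce such a score (a generic sketch, not Telmai's scoring algorithm) is to map a robust z-score of a metric's latest value, computed against its recent history, onto a 0-100 scale:

```python
import numpy as np

def anomaly_score(history: list[float], latest: float) -> float:
    """Map how far `latest` deviates from recent history onto a 0-100 scale.

    Uses a median/MAD-based z-score so a few past outliers don't skew the baseline.
    """
    values = np.asarray(history, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median)) or 1e-9  # avoid division by zero
    z = abs(latest - median) / (1.4826 * mad)          # roughly "standard deviations from normal"
    return float(min(100.0, z * 20.0))                 # five or more "sigmas" maps to 100

# Example: row counts per daily load, followed by a sudden drop.
print(anomaly_score([10_200, 10_150, 10_300, 10_180], 6_000))
```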

2. Set Appropriate Thresholds for Data Metrics

After you've selected the right metrics for your use case, your data managers should receive updates about the state of the system so they can monitor it effectively and efficiently. One way to do this is to define thresholds and set up notifications that let data managers know when data falls outside expected ranges.

For instance, if a metric is set to validate a discount code but transactions fall outside the discounted range, the data managers are notified so they can take appropriate steps to handle the situation. In addition to predetermined, expected behavior in the data, thresholds can also be created in response to trends and seasonal variations. For instance, if your system's data volume is known to peak during festive periods, you can build a predictive model that forecasts those trends and measure your data quality against the predicted threshold.

To detect anomalies, you would establish thresholds at which you want alerts to trigger. These thresholds represent the point where the problem is severe enough to require an alert. Multiple thresholds can be set to communicate different degrees of severity and when these thresholds are reached, the appropriate data owner is immediately notified.
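A minimal sketch of multi-level thresholds (the metric name, the levels, and the notification hand-off are illustrative assumptions, not a prescribed design):

```python
# Severity thresholds for a completeness metric (share of non-null values).
# Dropping below a lower bound triggers an alert of the corresponding severity.
THRESHOLDS = [
    ("critical", 0.80),
    ("warning", 0.95),
]

def evaluate(metric_name: str, value: float) -> str | None:
    """Return the most severe level breached by `value`, or None if healthy."""
    for level, lower_bound in THRESHOLDS:   # ordered most to least severe
        if value < lower_bound:
            return level
    return None

severity = evaluate("orders.customer_id.completeness", 0.82)
if severity:
    # Hand off to whatever notification channel the data owner prefers.
    print(f"[{severity.upper()}] orders.customer_id.completeness = 0.82")
```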

Designing thresholds requires proper knowledge of the data, possibly from business stakeholders, profiling the data, or deploying machine learning techniques that learn from the data and its trends to impose the right thresholds at the right time.

Improperly designed thresholds can lead to an overwhelming number of alerts for errors that may not be serious or even need any attention. Inevitably, this would lead to your alerts becoming useless.

3. Integrate with Every Step of Your Data Pipeline

As you design your data monitoring system, make sure you can integrate it with every step of your data pipeline.

This integration should tie into your data ingestion and onboarding, every step of your data transformations, your data warehouse or data lake systems, and, depending on the nature of your data stack, any application APIs or data streaming queues along the way.

Your data monitoring logic should be able to measure the quality of data from complex and semi-structured sources as easily as it monitors data in warehouses and analytical databases. The sooner and further upstream data quality issues are identified, the lower their downstream cost. Given that the structure and format of the data vary from system to system, all of this comes at a high cost in engineering work.
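One lightweight way to wire monitoring into each step, sketched here with a hypothetical `record_metrics` sink and a made-up `clean_orders` step, is to wrap every transformation so its output is profiled before moving downstream:

```python
import functools
import pandas as pd

def record_metrics(step: str, metrics: dict) -> None:
    # Placeholder sink: in practice this would write to your metrics store or monitoring API.
    print(step, metrics)

def monitored(step_name: str):
    """Decorator that profiles the DataFrame returned by a pipeline step."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            df = func(*args, **kwargs)
            record_metrics(step_name, {
                "row_count": len(df),
                "null_fraction": float(df.isna().mean().mean()),
            })
            return df
        return inner
    return wrap

@monitored("clean_orders")
def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation step; the key column name is illustrative.
    return raw.dropna(subset=["order_id"])
```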

Additionally, consider whether to host these services on cloud servers or on-premises servers, along with the implications of backup, recovery, security, and other infrastructure concerns.

4. Detect Issues Before They Become Real Problems

Once your monitoring system is integrated with your data pipeline, computes its data quality metrics, and has alerting thresholds defined, your data owners should be able to see alerts within the system or get notified through their preferred channels.

To communicate data quality issues to the broader team of data owners and data consumers, set up notifications and integrate alerts into everyday communication and productivity tools such as Slack, email applications, and so on. This helps you reach the intended recipients early, before data quality issues cause severe downstream implications.
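For example, here is a minimal sketch of pushing an alert into a Slack channel through an incoming webhook; the webhook URL below is a placeholder you would create in your own Slack workspace:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify_slack(message: str) -> None:
    """Post a plain-text alert to a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack responds with "ok" on success

notify_slack(":rotating_light: completeness of orders.customer_id dropped to 82% (threshold 95%)")
```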

5. Create an Interactive Dashboard

Having a central dashboard for the data monitoring and alerting system is crucial for ease of use and for the daily tracking, analyzing, investigating, and remediating of data quality issues. Without it, you'd have to write code just to retrieve the raw data points that measure the various metrics within your system.

For example, you would have to query a zip code field to see the number of rows that match a specific pattern vs. another. With hundreds of tables and thousands of fields in question this can easily become unmanageable, and without visualizations and interactive investigations, this data is not easily understandable, particularly when the system has been running for a long time.
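As a sketch of what that manual querying looks like (the column and the patterns are illustrative):

```python
import pandas as pd

# Illustrative sample; in practice this is one field among thousands across hundreds of tables.
df = pd.DataFrame({"zip_code": ["94105", "94105-1111", "9410", None, "SW1A 1AA"]})

patterns = {"5-digit": r"^\d{5}$", "ZIP+4": r"^\d{5}-\d{4}$"}
matched = pd.Series(False, index=df.index)
for label, pattern in patterns.items():
    hits = df["zip_code"].str.match(pattern, na=False)
    matched |= hits
    print(f"{label}: {int(hits.sum())} rows")
print(f"other or missing: {int((~matched).sum())} rows")
```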

Interactive visualizations help communicate important points succinctly and in an intuitive way. For example, drilling into an anomaly in the data can narrow it down to the specific data points that are contributing to that particular anomaly. Additionally, with dashboards, all recorded metrics can be put into time series to provide a one-stop overview of the state of a system through time, with the ability to drill down into the details and root causes behind the scenes.

6. Make it Scalable

As should be the case with every system, data monitoring systems need to be scalable. With the growth of any organization, its volumes of data typically increase. To monitor this increasing volume over time and to analyze its historical trends, the data monitoring system should be equipped to scale without an impact on performance. 

If a monitoring system is not scalable, it could easily run out of compute capacity during an unexpected increase in volume or a complex calculation. Such a halt in the system could be costly for the organization.

As you design, develop, and deploy your monitoring system, scalability must always be in the foreground of your mind to ensure continuous and consistent operation when the system is live.

Don’t Forget Maintenance

Once you've built your system, it also needs continuous maintenance and management to ensure that it's operating as desired. For example, every time a schema changes or an outlier occurs that you didn't account for before, you need to consider how the new data is observed and ensure that the changed schema isn't breaking your data monitoring logic.
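A small sketch of one such maintenance check, comparing a table's current columns against an expected schema snapshot (the column names and types are hypothetical):

```python
import pandas as pd

# Expected schema recorded when the monitor was set up.
EXPECTED_SCHEMA = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}

def schema_drift(df: pd.DataFrame, expected: dict) -> dict:
    """Report columns that were added, dropped, or changed type since the snapshot."""
    current = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return {
        "added": sorted(set(current) - set(expected)),
        "removed": sorted(set(expected) - set(current)),
        "retyped": {c: (expected[c], current[c])
                    for c in set(expected) & set(current) if expected[c] != current[c]},
    }

df = pd.DataFrame({"order_id": [1], "customer_id": ["42"], "amount": [9.99], "channel": ["web"]})
print(schema_drift(df, EXPECTED_SCHEMA))
# e.g. {'added': ['channel'], 'removed': [], 'retyped': {'customer_id': ('int64', 'object')}}
```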

Additionally, as you add more people to your team and new projects get created, you may end up recreating the same data quality checks and balances over and over. It's not uncommon for different teams to independently write rules or queries to understand the health of data. This not only causes inefficiencies and duplicated work but also leads to infrastructure cost overhead as thousands of queries pile up over the years.

Are You Sure You Want to Build a Data Monitoring System from Scratch?

The build-out of your data monitoring system is the first cost to consider. Even greater, though, is the cost of more engineers for maintenance and management. 

Let Telmai do the heavy lifting involved in both, with a no-code interface you can use to set up data monitoring in your pipeline and integrate with your data regardless of its structure or format.

  • Telmai comes out of the box with 40+ different metrics, along with machine learning and intelligent algorithms to detect previously unknown data quality issues. 
  • Telmai provides an intuitive and interactive visual interface for monitoring your data, and it also integrates with common communication channels such as Slack or email. 
  • Telmai's architecture is Spark-based, so it scales as your data grows.

Try out Telmai today.


Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality could impact critical business decisions, customer trust, and sales and financial opportunities.

To get started, there are four main steps in building a complete and ongoing data profiling process:

  1. Data Collection
  2. Discovery & Analysis
  3. Documenting the Findings
  4. Data Quality Monitoring

We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.

What are the different kinds of data profiling?

Data profiling falls into three major categories: structure discovery, content discovery, and relationship discovery. While they all help in gaining more understanding of the data, the types of insights they provide are different:

 

Structure discovery checks that data is consistent, correctly formatted, and well structured. For example, if you have a ‘Date’ field, structure discovery helps you see the various patterns of dates (e.g., YYYY-MM-DD or YYYY/DD/MM) so you can standardize your data into one format.
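One simple structure-discovery technique, sketched below, maps each value to a pattern signature so the distinct formats in a field become countable (the sample values are illustrative):

```python
import re
from collections import Counter

dates = ["2023-01-15", "2023/15/01", "2023-02-03", "15 Jan 2023"]

def pattern_signature(value: str) -> str:
    """Replace digits with '9' and letters with 'A' to expose the format, not the content."""
    sig = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", sig)

print(Counter(pattern_signature(v) for v in dates))
# Counter({'9999-99-99': 2, '9999/99/99': 1, '99 AAA 9999': 1})
```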

 

Structure discovery also examines simple and basic statistics in the data, for example, minimum and maximum values, means, medians, and standard deviations.

 

Content discovery looks more closely into the individual attributes and data values to check for data quality issues. This can help you find null values, empty fields, duplicates, incomplete values, outliers, and anomalies.

 

For example, if you are profiling address information, content discovery helps you see whether your ‘State’ field contains two-letter abbreviations, fully spelled out state names, both, or potentially some typos.

 

Content discovery can also be used to validate data against predefined rules. This helps improve data quality by identifying instances where the data does not conform to those rules; for example, a transaction amount should never be less than $0.
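A sketch of such rule-based content checks (the columns, the rules, and the truncated state list are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "California", "NY", "N.Y.", None],
    "amount": [19.99, -5.00, 0.0, 42.10, 7.25],
})

US_STATE_ABBREVIATIONS = {"CA", "NY", "TX", "WA"}  # truncated for the example

violations = {
    # Content rule: state must be a known two-letter abbreviation.
    "bad_state": df[~df["state"].isin(US_STATE_ABBREVIATIONS)],
    # Content rule: a transaction amount should never be less than $0.
    "negative_amount": df[df["amount"] < 0],
}
for rule, rows in violations.items():
    print(rule, "->", len(rows), "violating rows")
```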

 

Relationship discovery identifies how different datasets are related to each other, for example, key relationships between database tables or lookup cells in a spreadsheet. Understanding relationships is most critical when designing a new database schema, a data warehouse, or an ETL flow that requires joining tables and datasets based on those key relationships.
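A sketch of a basic relationship check, verifying that a hypothetical foreign key in one table is fully covered by the keys of another:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13], "customer_id": [1, 2, 2, 7]})

# Referential integrity: every orders.customer_id should exist in customers.customer_id.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
coverage = 1 - len(orphans) / len(orders)
print(f"key coverage: {coverage:.0%}, orphaned rows: {len(orphans)}")
# key coverage: 75%, orphaned rows: 1
```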

Data Observability vs. Data Quality

| Data Observability | Data Quality |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects issues, investigates their root cause, and helps remediate | Detects and helps remediate |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing of data quality rules adds to its costs |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open source data validation software |
