Telmai monitors any source in data pipelines, but how do we do it?

Max Lukichev

Abstract

Telmai is designed to monitor data at any pipeline step, be it a source or a destination, and it does so at scale and with low latency. Our system is built on architectural blocks that differ significantly from a typical homegrown system designed to monitor SQL databases and engines: Telmai uses distributed compute via Spark and is highly optimized for monitoring data at large scale.

This blog highlights some architectural decisions that helped us achieve scale, versatility, and efficiency.

Data monitoring requirements

Over the last few decades, companies have embraced many new technologies in the data ecosystem for two reasons: to unlock new use cases (revenue) or to lower costs.

The former depends heavily on the reliability of the data, and the latter is always a consideration for anyone evaluating the tools that support the use case.

So when we were designing Telmai, the critical consideration was to build a system that stays a step ahead in the speed, scale, and cost required by these new data use cases. In this blog, we’d like to share some of the architectural decisions we made.

Our requirements for building a robust monitoring system came down to:

  • Shifting data monitoring to the left, in other words finding issues as close to their source as possible in order to reduce MTTD and MTTR (mean time to detect and mean time to resolve)
  • Scale and workload isolation, to support a vastly increased number and size of data sources to be analyzed
  • Performance - finding problems quickly enough to fit within the strict time windows for remediation activities
  • Deployment flexibility, to accommodate a wide variety of data sources and the fact that data is not supposed to leave the client's security perimeter
  • Cost efficiency, to avoid incurring high infrastructure costs for monitoring, be it ingress/egress or compute.

With all these considerations, we outlined two options: build Telmai by connecting directly to a data warehouse/SQL engine and running SQL queries to calculate the metrics, or use an external processing engine, like Spark, to perform all the calculations that generate the metrics.

There were pros and cons to both. While the SQL approach works well for monitoring a data warehouse or BI tools, the Spark option was more compelling for our requirements, and below we explain why.

SQL versus Spark based monitoring system

Shifting data monitoring to the left

This means quickly plugging into any source that can contribute to data issues. In other words, reading anything: various file formats (Parquet, CSV, JSON, archives, etc.), databases like HBase and Cassandra, SSTables, subscription topics such as Kafka, or even Spark dataframes while data is still being processed by Spark-based ETL.

When data comes from such a diverse group of sources, it is a problem in itself to first extract and then load that data into a system. A data warehouse was too restrictive, as load jobs often fail during schema validation. It requires additional effort and transformation jobs to run before loading the data, introduces additional delays and failure points, and defeats the purpose of monitoring as early (i.e., as far left) as possible. So if we wanted our clients to shift their monitoring to the left, we had to choose an architecture that would support this easily.
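To make this concrete, here is a minimal PySpark sketch of reading several very different sources with the same engine. The paths, Kafka topic, and Cassandra keyspace/table names are illustrative placeholders rather than Telmai's actual configuration, and the Kafka and Cassandra reads assume the corresponding Spark connectors are on the classpath.

```python
# Minimal PySpark sketch: one engine reading several very different sources.
# Paths, the Kafka topic, and the Cassandra keyspace/table are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source-agnostic-monitoring").getOrCreate()

# Files in object storage or HDFS, whatever the format
parquet_df = spark.read.parquet("s3a://raw-zone/events/")
csv_df = spark.read.option("header", "true").csv("s3a://raw-zone/exports/")
json_df = spark.read.json("s3a://raw-zone/api_dumps/")

# A Kafka topic, read in batch (assumes the spark-sql-kafka connector is available)
kafka_df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

# A Cassandra table (assumes the spark-cassandra-connector is available)
cassandra_df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="orders")
    .load()
)

# Every source ends up as a DataFrame, so the same downstream profiling logic applies.
```

Whatever the source, the result is a DataFrame, so the same monitoring logic can run downstream without a separate load step into a warehouse.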

Scale

When monitoring is shifted to the left, there are more sources to analyze and much more data in general. This data is often raw, without any cleanup or aggregation. Doing this without creating a bottleneck in the pipeline was a critical consideration for us. Spark is a great help here, as it allows many auto-scaling clusters to be launched and run in parallel, ensuring excellent overall throughput. Such flexibility is hard to achieve even with managed data warehouses.
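Scaling here happens at two levels: more clusters can be launched in parallel, and each cluster can grow and shrink with its workload. As a small illustration of the latter, this is roughly what elastic scaling looks like at the Spark session level; the executor bounds below are made-up values, not Telmai's production settings.

```python
# Sketch of a Spark session configured for elastic scaling; the executor bounds
# below are illustrative, not Telmai's production values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("monitoring-job")
    # Let the cluster manager add and remove executors as the workload changes
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "200")
    # Allows dynamic allocation without an external shuffle service (Spark 3.x)
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```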

Pipeline performance

Calculating metrics is compute-intensive, and monitoring requires a lot of them. Each attribute requires numerous aggregations to understand value distributions, completeness, mean frequency, value ranges, standard deviations, etc. This gets even more complicated when multivalued attributes are present and the data grows to terabyte or petabyte scale.

Having hundreds of attributes and calculating dozens of metrics for every attribute requires thousands of queries (or jobs). Every analytical SQL engine carries significant per-query overhead due to planning, which can quickly get out of hand in real-life situations.

By having deeper control over the execution flow with Spark, we can apply highly optimized logic for very wide datasets (with a large number of attributes and metrics) and avoid the bottlenecks of the general-purpose SQL approach. Transforming the data into a long, skinny (key-value) representation lets us run a constant number of analytics jobs regardless of how many metrics there are, which comes in very handy with an exploding number of custom metrics and attributes to monitor.
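Here is a rough sketch of that idea (not Telmai's actual implementation): unpivot a wide table into (attribute, value) pairs with Spark's stack expression, then compute every per-attribute metric in a single grouped aggregation. The path, column handling, and choice of metrics are illustrative.

```python
# Sketch of the wide-to-skinny idea: unpivot N attribute columns into
# (attribute, value) rows, then compute every per-attribute metric with a single
# grouped aggregation, so the number of Spark jobs stays constant as attributes
# and metrics grow. The path, columns, and metrics here are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skinny-metrics").getOrCreate()

wide_df = spark.read.parquet("s3a://raw-zone/events/")  # hundreds of columns

cols = wide_df.columns
# stack(n, 'name1', col1, 'name2', col2, ...) emits one (attribute, value) row per column
stack_expr = "stack({n}, {pairs}) as (attribute, value)".format(
    n=len(cols),
    pairs=", ".join("'{c}', cast(`{c}` as string)".format(c=c) for c in cols),
)
skinny_df = wide_df.select(F.expr(stack_expr))

# One aggregation pass covers all attributes, however many there are
metrics_df = skinny_df.groupBy("attribute").agg(
    F.count("value").alias("non_null_count"),
    F.countDistinct("value").alias("distinct_count"),
    F.min("value").alias("min_value"),
    F.max("value").alias("max_value"),
    F.avg(F.length("value")).alias("avg_length"),
)
metrics_df.show(truncate=False)
```

Because the unpivoted table has a fixed (attribute, value) schema, adding a new metric only adds one more aggregation expression to the same job rather than another query per attribute.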

Deployment flexibility

We knew our clients needed deployment flexibility for data security.

Using Spark as the processing engine allowed us to decouple data extraction and processing and bring both closer to the monitored data. That means processing can run anywhere in the network topology, for example on EMR clusters in the same VPC as the monitored HBase or Cassandra clusters, or in a peered one. This eliminates unnecessary security vulnerabilities and challenging questions from infosec teams.

In short, Spark helps us to enable SaaS via public or private cloud options and extend coverage for on-prem Spark/Hadoop setups.

Cost considerations and budgeting

Being a decoupled processing infrastructure, Spark was our choice: it helps us cleanly separate workloads, avoid unnecessary load on data lake and data warehouse engines, and quickly control and isolate monitoring costs from other business-driven initiatives.

Summary

To securely monitor any step in the data pipeline at scale without compromising product features, we needed to future-proof our architecture. We relied heavily on distributed compute via Spark to achieve this goal.

Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality can impact critical business decisions, customer trust, sales, and financial opportunities.

To get started, there are four main steps in building a complete and ongoing data profiling process:

  1. Data Collection
  2. Discovery & Analysis
  3. Documenting the Findings
  4. Data Quality Monitoring

We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.

What are the different kinds of data profiling?

Data profiling falls into three major categories: structure discovery, content discovery, and relationship discovery. While they all help in gaining a deeper understanding of the data, the types of insights they provide are different:

Structure discovery analyzes whether data is consistent, formatted correctly, and well structured. For example, if you have a ‘Date’ field, structure discovery helps you see the various date patterns (e.g., YYYY-MM-DD or YYYY/DD/MM) so you can standardize your data into one format.


Structure discovery also examines simple and basic statistics in the data, for example, minimum and maximum values, means, medians, and standard deviations.
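As a simple illustration of structure discovery, the sketch below reduces raw values to format patterns and collects basic statistics; the sample values are invented for the example.

```python
# Illustrative structure-discovery sketch: reduce raw values to format patterns
# and collect basic statistics. The sample values are invented for the example.
import re
import statistics

dates = ["2023-01-15", "2023/15/01", "2023-02-28", "15-01-2023"]
amounts = [19.99, 5.00, 42.50, 7.25]

def pattern(value: str) -> str:
    """Map every digit to 9 and every letter to A, keeping separators as-is."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

date_patterns = {}
for d in dates:
    p = pattern(d)
    date_patterns[p] = date_patterns.get(p, 0) + 1

print(date_patterns)
# {'9999-99-99': 2, '9999/99/99': 1, '99-99-9999': 1}  -> three competing date formats

# Basic statistics for a numeric attribute
print(min(amounts), max(amounts), statistics.mean(amounts), statistics.stdev(amounts))
```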


Content discovery looks more closely into the individual attributes and data values to check for data quality issues. This can help you find null values, empty fields, duplicates, incomplete values, outliers, and anomalies.

For example, if you are profiling address information, content discovery helps you see whether your ‘State’ field contains the two-letter abbreviation or the fully spelled-out state name, both, or potentially some typos.

Content discovery can also be a way to validate data against predefined rules. This helps improve data quality by identifying instances where the data does not conform to those rules; for example, a transaction amount should never be less than $0.
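A minimal content-discovery sketch over a small invented transactions sample might look like this: count nulls and duplicates, surface mixed ‘State’ representations, and flag rows violating the predefined rule that an amount can never be negative.

```python
# Illustrative content-discovery sketch over a small invented transactions sample:
# find nulls, duplicates, mixed 'State' representations, and violations of the
# predefined rule that a transaction amount should never be less than $0.
transactions = [
    {"id": 1, "state": "CA", "amount": 25.00},
    {"id": 2, "state": "California", "amount": 10.50},
    {"id": 2, "state": "California", "amount": 10.50},  # exact duplicate row
    {"id": 3, "state": None, "amount": -4.00},          # null state + rule violation
]

null_states = [t for t in transactions if t["state"] is None]
rule_violations = [t for t in transactions if t["amount"] < 0]
state_formats = {t["state"] for t in transactions if t["state"] is not None}

seen, duplicates = set(), []
for t in transactions:
    key = (t["id"], t["state"], t["amount"])
    if key in seen:
        duplicates.append(t)
    else:
        seen.add(key)

print(len(null_states), len(rule_violations), len(duplicates))  # 1 1 1
print(state_formats)  # {'CA', 'California'} -> mixed abbreviations and full names
```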


Relationship discovery identifies how different datasets are related to each other, for example key relationships between database tables or lookup cells in a spreadsheet. Understanding relationships is most critical when designing a new database schema, a data warehouse, or an ETL flow that joins tables and datasets based on those key relationships.
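As a small illustration of relationship discovery, the sketch below checks a foreign-key style relationship between two invented datasets by finding orders that reference a non-existent customer.

```python
# Illustrative relationship-discovery sketch: verify a foreign-key style
# relationship between two invented datasets before relying on it for joins.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 2},
    {"order_id": 102, "customer_id": 7},  # references a customer that does not exist
]

known_ids = {c["customer_id"] for c in customers}
orphan_orders = [o for o in orders if o["customer_id"] not in known_ids]

print(orphan_orders)
# [{'order_id': 102, 'customer_id': 7}] -> the relationship is not yet safe to join on
```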

| Data Observability | Data Quality |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects, investigates the root cause of issues, and helps remediate | Detects and helps remediate |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing data quality rules adds to its costs |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open source data validation software |
