Announcing Telmai's Data Observability for your Delta Lake
Overview
We are excited to announce Telmai's native support for Delta Lake. With this new integration, Telmai users have end-to-end data observability across the entire data pipeline, including Data Lake and Lakehouse environments, Data Warehouses, Delta Lake, and even streaming sources.
What is Delta Lake?
Open-sourced in April 2019, Delta Lake is a Databricks project that brings reliability, performance, and lifecycle management to data lakes.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Designed to solve the data reliability gaps in the Data Lake architecture, Delta Lake has gained rapid adoption since its launch in 2019. At Telmai, this integration was designed for an existing Delta and Unity Catalog customer looking to enhance their data reliability further.

So how is Telmai enabling data reliability and data quality for Delta Lake?
As a central data observability tool for your entire data pipeline, Telmai can now easily integrate with Delta Lake to analyze data inside Delta Lake for anomalies like outliers and drifts.
Our no-code integration will enable Delta users to automatically monitor close to 40 data metrics for their Delta tables and views within hours of getting started.
Some of these metrics include:
- Schema drifts: Schema changes like new attributes added or removed
- Record count: The volume of received data, calculated as row counts
- Completeness: Incomplete data received like null values, empty strings, NA, etc.
- Uniqueness: Count of unique values to track duplicates
- Distribution: Distribution drift for categorical data
- Pattern drifts: Unexpected syntax patterns, useful for well-formatted attributes like codes, phone numbers, SSN, Zip Code, etc.
- Controlled lists of values: Controlled lists of values (LOV) like ISO codes, ICD codes, Gender, Address_Type, etc.
- Accuracy: Data accuracy is calculated based on multiple metrics like numeric values, is_email, is_URL, length of strings, tokens, etc. Telmai can flag outliers based on these metrics.
- Business metrics: Track specific metrics derived from data. For example, taking an average of all values from an attribute like a credit_score and tracking sudden changes in the aggregated value over time.
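To make a few of these metrics concrete, here is a minimal Python sketch of how record count, completeness, and uniqueness could be computed for a single attribute. It is an illustration only, not Telmai's implementation. The function names `is_missing` and `profile_column`, the set of null-like tokens, and the sample values are all assumptions made for this example.

```python
from typing import Any, Dict, List

# Assumption: which values count as "missing" is a choice made for this sketch.
NULL_LIKE = {"", "na", "n/a", "null", "none"}


def is_missing(value: Any) -> bool:
    """Treat None and null-like strings (case-insensitive) as missing."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in NULL_LIKE


def profile_column(values: List[Any]) -> Dict[str, float]:
    """Compute record count, completeness, and uniqueness for one attribute."""
    record_count = len(values)
    present = [v for v in values if not is_missing(v)]
    # Completeness: share of records that carry a usable value.
    completeness = len(present) / record_count if record_count else 0.0
    # Uniqueness: share of distinct values among the usable ones.
    uniqueness = len(set(present)) / len(present) if present else 0.0
    return {
        "record_count": record_count,
        "completeness": completeness,
        "uniqueness": uniqueness,
    }


if __name__ == "__main__":
    # Hypothetical sample: one duplicate, one None, one "NA" placeholder.
    emails = ["a@x.com", "b@x.com", "a@x.com", None, "NA"]
    print(profile_column(emails))
```

A uniqueness value below 1.0 signals duplicates among the populated values, and a completeness value below 1.0 signals missing data. Tracking both over time is what turns a one-off profile into a monitored metric.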
With Telmai's notifications, your team will get alerted on unexpected drifts in these metrics. Additionally, users can set expectations/rules using our UI to fine-tune these metrics and thresholds for specific business needs.
Telmai will also automatically classify these metrics into Data Quality KPIs like freshness, completeness, accuracy, validity, uniqueness, etc.
Our Delta Lake integration is designed to natively process and monitor the changed records and analyze only those. Delta integration differs from other integrations like BigQuery and Snowflake, which don't natively track changed data. Telmai will leverage a timestamp-based column in those sources to identify and track the changed records.
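The timestamp-based approach mentioned for sources without native change tracking can be sketched as a simple watermark filter. This is an assumed illustration of the general technique, not Telmai's code. The function `select_changed` and the column name `updated_at` are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def select_changed(
    rows: List[Row], ts_column: str, watermark: datetime
) -> Tuple[List[Row], datetime]:
    """Return the rows modified after the watermark, plus the new watermark."""
    changed = [r for r in rows if r[ts_column] > watermark]
    # Advance the watermark to the latest timestamp seen; keep it if nothing changed.
    new_watermark = max((r[ts_column] for r in changed), default=watermark)
    return changed, new_watermark


if __name__ == "__main__":
    # Hypothetical table contents with an "updated_at" timestamp column.
    table = [
        {"id": 1, "updated_at": datetime(2023, 1, 1)},
        {"id": 2, "updated_at": datetime(2023, 1, 5)},
        {"id": 3, "updated_at": datetime(2023, 1, 9)},
    ]
    changed, watermark = select_changed(table, "updated_at", datetime(2023, 1, 3))
    print([r["id"] for r in changed], watermark)
```

Each run analyzes only the rows newer than the previous watermark, so the cost of monitoring scales with the volume of changes and not with the size of the table.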
Moreover, we have made all this easy, so Delta Lake users can focus on building great data products instead of burning out on pipeline health issues.
How does our Delta integration work?
It is a simple five-step process that's documented here:
- Collect JDBC connection information from your Databricks cluster.
- Create an API token that would allow Telmai to connect to your cluster.
- Create a source in Telmai to connect to your Delta table.
- Enable **delta flag on the Telmai source/connection to allow monitoring of changed data.
- Enable the schedule on the Telmai source to run jobs on a scheduled period.
** Telmai's delta flag works across all sources (not just Delta Lake). The naming is coincidental: the flag enables a change data capture mode that monitors and observes only the changed records.
Additionally, Telmai's REST-based integrations will enable Databricks users to enrich their Unity Catalog functionality with data reliability insights such as open alerts on tables and data quality scores for freshness, completeness, accuracy, etc., giving a full 360-degree view of overall data health.
We are excited about this new feature as it has been a highly requested one, enabling our customers to accelerate their data reliability on their Delta Lakes. And we hope that you find it exciting as well!
Reach out to us if you want to read the use case study, have any questions, or would like to schedule a demo here.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project and without it data quality could impact critical business decisions, customer trust, sales and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process: data collection, discovery and analysis, documenting the findings, and data quality monitoring.
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect to and analyze all your data without requiring any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If data content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety and do not use samples, as sampling can skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
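As a sketch of what such a rule might look like once written down, the ZIP code example can be expressed as a pattern check. The pattern below (five digits, with an optional ZIP+4 suffix) and the function names are assumptions for illustration. A real rule set would depend on your data.

```python
import re
from typing import Iterable, List

# Assumed rule: five digits, optionally followed by a hyphen and four digits (ZIP+4).
US_ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_us_zip(value: str) -> bool:
    """Return True only when the whole string matches the ZIP pattern."""
    return US_ZIP_PATTERN.fullmatch(value) is not None


def find_invalid_zips(values: Iterable[str]) -> List[str]:
    """Collect the values that break the rule so they can be reported or fixed."""
    return [v for v in values if not is_valid_us_zip(v)]


if __name__ == "__main__":
    print(find_invalid_zips(["94061", "94 061", "94061-1234"]))
```

Running the check on the documented case flags `94 061` while accepting `94061`, which is the kind of rule that can be reapplied the next time the data is profiled.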
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur as a result of both system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
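One simple way to picture ongoing monitoring is to compare each new value of a metric against its own history. The sketch below uses a plain z-score check from the Python standard library. It is a deliberately simplified stand-in for illustration, not the ML-based detection an observability tool would use, and the function name `is_drifting`, the threshold of 3.0, and the sample values are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_drifting(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest value when it sits far from the historical baseline."""
    if len(history) < 2:
        return False  # Not enough history to estimate a spread.
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly constant history: any change at all is unexpected.
        return latest != baseline
    # Distance from the baseline, measured in standard deviations.
    return abs(latest - baseline) / spread > threshold


if __name__ == "__main__":
    # Hypothetical daily averages of a metric such as a credit score.
    daily_average = [700.0, 702.0, 698.0, 701.0, 699.0]
    print(is_drifting(daily_average, 650.0))
    print(is_drifting(daily_average, 701.0))
```

The same check can be run on any tracked metric, such as record count or completeness, each time new data arrives.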
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
| Data Observability | Data Quality |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects, investigates the root cause of issues, and helps remediate | Detects and helps remediate |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing of data quality rules adds to its costs |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open source data validation software |
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed