7 Critical Capabilities Telmai Adds to the Databricks Lakehouse Platform
Farnaz Erfan

This week, we announced our Databricks partnership. A few months in the making, this partnership is the result of our shared customers and the capabilities they have asked for. Here are seven capabilities that Telmai brings to the Databricks platform.

1. Monitoring batch and streaming pipelines

Enterprises rely on the Databricks Lakehouse Platform to bring analytics, advanced ML models, and data products into production. This, in turn, raises the importance of data quality, since the adoption of these applications relies on good data.

Telmai’s Data Observability provides the tooling for continuously monitoring the health of lakehouses and Delta Live Tables (DLT). Telmai automatically and proactively detects, investigates, and monitors the quality of batch workloads and streaming data pipelines.

2. Investigating semi-structured and structured data alike

As issues arise, data teams need tooling to investigate and find the root cause of their data quality problems. Because the Databricks lakehouse architecture has the flexibility to house both structured and semi-structured data, any monitoring and observability layer needs to detect and investigate data quality issues regardless of the type and format of the data. However, semi-structured data has nested attributes and multi-value fields (arrays), so conventional approaches require flattening the data and calculating aggregations one attribute at a time. The accumulated overhead on each query quickly becomes unmanageable.

Telmai, on the other hand, is designed from the ground up to support very large semi-structured schemas such as JSON (NDJSON), Parquet, and data warehouse tables containing JSON. Data teams can use Telmai to investigate the root cause of issues across all of their data.
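To illustrate the overhead of that per-attribute approach, here is a minimal sketch of flattening nested JSON by hand in PySpark on Databricks; the file path and field names are made up for the example.

```python
# Illustrative only: computing quality metrics one nested attribute at a time.
# The path and field names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Newline-delimited JSON with nested structs and arrays
orders = spark.read.json("/mnt/raw/orders.ndjson")

# Completeness of one nested attribute requires its own pass
city_completeness = orders.agg(
    (F.count("customer.address.city") / F.count(F.lit(1))).alias("city_completeness")
)

# Multi-value fields (arrays) must be exploded before they can be aggregated
line_item_stats = (
    orders
    .select(F.explode("line_items").alias("item"))
    .agg(F.min("item.amount"), F.max("item.amount"), F.avg("item.amount"))
)

city_completeness.show()
line_item_stats.show()
```

Repeating this for every attribute in a wide, deeply nested schema is what makes the per-query overhead add up.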

3. Anomaly detection and predictions

Machine learning and time series analysis of observed data can give data teams a predictive outlook on the data quality issues that may arise and empower them to set notifications and alerts for when future data drifts from historical ranges or expected values. For example, if the completeness of a zip code field has historically stayed above 95%, any future incomplete data that pulls this KPI down could signal the need for root cause analysis.

Telmai uses time series analysis to learn about the values in your data and create a baseline for normal behavior. With that baseline in place, Telmai sets thresholds for every monitored data metric, which automates anomaly detection. When the data crosses a threshold or falls outside historical patterns, alerts and notifications surface the changes in the data and its key KPIs.
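As a rough illustration of the idea (this is not Telmai’s actual model), a baseline-and-threshold check on a completeness KPI might look like the following sketch.

```python
# Illustrative sketch: learn a "normal" range from history and alert on drift.
# The values are made up; real systems use richer time series models.
import statistics

# Daily completeness of the zip_code field over the past week (hypothetical)
history = [0.97, 0.96, 0.98, 0.97, 0.95, 0.97, 0.96]
todays_value = 0.88

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Baseline band: mean +/- 3 standard deviations
lower, upper = mean - 3 * stdev, mean + 3 * stdev

if not (lower <= todays_value <= upper):
    print(f"ALERT: zip_code completeness {todays_value:.2%} is outside "
          f"the expected range [{lower:.2%}, {upper:.2%}]")
```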

4. Pipeline control using a circuit breaker 

In many cases, identifying data quality issues is only the first step; fully operationalizing data quality monitoring into the pipeline is often the ultimate goal. To fix data quality issues, data teams implement remediation pipelines that need to run at the right time and against the right data validation rule.

Telmai can be integrated programmatically into Databricks data pipelines. Using Telmai’s API, data teams can invoke remediation workflows based on the output of data quality monitoring. Repair actions can be taken based on the alert type and details provided through Telmai’s APIs. For example, incomplete data alerts can be flagged for upstream teams to fix, while duplicate data can go through a de-duplication workflow.
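As a sketch of that pattern, the snippet below gates a pipeline on data quality alerts and branches into different remediation paths. The endpoint, payload, and alert fields are hypothetical placeholders, not Telmai’s actual API.

```python
# Hypothetical circuit-breaker step: halt or reroute the pipeline based on
# data quality alerts. The URL and response shape are placeholders.
import requests

DQ_API = "https://example.telm.ai/api/v1"  # placeholder base URL

resp = requests.get(f"{DQ_API}/alerts", params={"dataset": "orders"}, timeout=30)
resp.raise_for_status()
alerts = resp.json().get("alerts", [])

def continue_pipeline():
    print("Data looks healthy - continuing the downstream job")

def run_dedup_workflow():
    print("Routing the batch through a de-duplication workflow")

def notify_upstream_team(alert):
    print(f"Flagging incomplete data for the upstream team: {alert}")

if not alerts:
    continue_pipeline()
else:
    for alert in alerts:
        if alert.get("type") == "duplicates":
            run_dedup_workflow()
        elif alert.get("type") == "incomplete":
            notify_upstream_team(alert)
        else:
            # Unknown issue: break the circuit and stop the pipeline
            raise RuntimeError(f"Unhandled data quality alert: {alert}")
```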

5. Cataloging high-quality data assets

After automating checks and balances on the data housed inside the lakehouse, data health metrics are published to the Unity Catalog so that catalog users get additional insights about their datasets, such as data quality KPIs, open issues, certified data labels, and more.

Telmai provides continuous data quality checks and creates a series of data quality KPIs such as accuracy, completeness, and freshness. These KPIs can be used to label datasets in the Unity Catalog so data consumers can easily discover and use high-quality data in their analytics, ML/AI, and data-sharing initiatives.
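As an example of what such labeling can look like, the sketch below attaches quality KPIs to a table as Unity Catalog tags from a Databricks notebook. The table name and KPI values are placeholders, and this is one possible way to publish the labels, not a prescribed integration.

```python
# Sketch: publish data quality KPIs as Unity Catalog table tags so catalog
# users can filter for certified, high-quality datasets.
# Assumes a Unity Catalog-enabled Databricks notebook where `spark` exists;
# the table name and KPI values are placeholders.
completeness = 0.97        # e.g., produced by an upstream quality check
freshness_hours = 2

spark.sql(f"""
    ALTER TABLE main.sales.orders
    SET TAGS (
        'dq_completeness' = '{completeness:.2f}',
        'dq_freshness_hours' = '{freshness_hours}',
        'dq_certified' = 'true'
    )
""")
```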

6. No-code, low-code UI for collaboration

While Databricks’ primary users remain highly technical, the Databricks Unity Catalog has opened up the access, discovery, and usage of data quality assets to a broader set of users.

With Telmai, data teams can take this further and invite their business counterparts and SMEs to collaborate on setting data quality rules and expectations and finding the root cause of issues before downstream business impact. Telmai provides over 40 out-of-the-box data quality metrics and has an intuitive, visual UI to make data quality accessible to those who actually have the business context for it.

7. Data profiling and analysis for migrating to Databricks 

For those migrating to the Databricks lakehouse architecture, this capability may well be the first one they use. Such migrations need easy, automated profiling and auditing tools to understand data structures and content pre-migration and to avoid duplication, data loss, or inconsistencies post-migration.

Telmai’s low-code/no-code Data Observability product enables profiling and understanding of data quality issues before migration, as well as testing and validation after migration, to ensure a well-designed, high-quality data lake environment.

To learn more and see Telmai in action, request a demo with us here.

Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality can impact critical business decisions, customer trust, and sales and financial opportunities.

To get started, there are four main steps in building a complete and ongoing data profiling process:

  1. Data Collection
  2. Discovery & Analysis
  3. Documenting the Findings
  4. Data Quality Monitoring

We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.

What are the different kinds of data profiling?

Data profiling falls into three major categories: structure discovery, content discovery, and relationship discovery. While they all help in gaining a better understanding of the data, the types of insights they provide are different:


Structure discovery analyzes whether data is consistent, correctly formatted, and well structured. For example, if you have a ‘Date’ field, structure discovery helps you see the various patterns of dates (e.g., YYYY-MM-DD or YYYY/DD/MM) so you can standardize your data into one format.


Structure discovery also examines basic statistics in the data, for example, minimum and maximum values, means, medians, and standard deviations.
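A minimal, illustrative sketch of structure discovery on made-up data might surface date-format patterns and basic statistics like this:

```python
# Structure discovery sketch: reveal the format patterns in a date column and
# compute basic statistics for a numeric column. The data is made up.
import re
import pandas as pd

df = pd.DataFrame({
    "order_date": ["2023-01-15", "2023/20/01", "2023-02-03", "02/07/2023"],
    "amount": [120.0, 85.5, 240.0, 19.99],
})

def date_pattern(value: str) -> str:
    # Replace digits with a placeholder so values collapse into format patterns
    return re.sub(r"\d", "D", value)

# e.g. DDDD-DD-DD vs. DDDD/DD/DD vs. DD/DD/DDDD
print(df["order_date"].map(date_pattern).value_counts())

# Basic statistics for a numeric field
print(df["amount"].agg(["min", "max", "mean", "median", "std"]))
```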


Content discovery looks more closely into the individual attributes and data values to check for data quality issues. This can help you find null values, empty fields, duplicates, incomplete values, outliers, and anomalies.


For example, if you are profiling address information, content discovery helps you see whether your ‘State’ field contains two-letter abbreviations, fully spelled-out state names, both, or potentially some typos.


Content discovery can also be a way to validate data against predefined rules, helping improve data quality by identifying records that do not conform. For example, a transaction amount should never be less than $0.
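As an illustration, a simple content-discovery pass over made-up transaction data could check for nulls, duplicates, inconsistent state values, and the non-negative amount rule:

```python
# Content discovery sketch: nulls, duplicates, mixed value formats, and a
# predefined validation rule. The DataFrame is made-up example data.
import pandas as pd

txns = pd.DataFrame({
    "txn_id": [1, 2, 2, 3, 4],
    "state":  ["CA", "California", None, "NY", "TX"],
    "amount": [25.0, -10.0, 40.0, 0.0, 99.0],
})

print(txns.isna().sum())                                    # null / empty counts per column
print(txns["txn_id"].duplicated().sum(), "duplicate ids")   # duplicate keys
print(txns["state"].str.len().value_counts())               # abbreviations vs. full names

# Predefined rule: a transaction amount should never be less than $0
violations = txns[txns["amount"] < 0]
print(f"{len(violations)} rows violate the non-negative amount rule")
```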


Relationship discovery identifies how different datasets are related to each other, for example, key relationships between database tables or lookup cells in a spreadsheet. Understanding relationships is most critical when designing a new database schema, a data warehouse, or an ETL flow that joins tables and datasets based on those key relationships.
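A small, illustrative relationship check on made-up data verifies that a foreign key in one dataset resolves to a key in another before any joins are designed:

```python
# Relationship discovery sketch: confirm referential integrity between two
# datasets before relying on the relationship in a schema or ETL flow.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 5]})

# Orders whose customer_id has no match in the customers dataset
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(f"{len(orphans)} orders reference a customer_id that does not exist")
print(orphans)
```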

| Data Observability | Data Quality |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects, investigates the root cause of issues, and helps remediate | Detects and helps remediate |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing of data quality rules adds to its costs |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open source data validation software |
