This week, we announced our partnership with Databricks. A few months in the making, it grew out of the capabilities our shared customers have asked for. Here are 7 capabilities that Telmai brings to the Databricks platform.
1. Monitoring batch and streaming pipelines
Enterprises rely on the Databricks Lakehouse platform to bring analytics, advanced ML models, and data products into production. That makes data quality critical, because the adoption of these applications depends on good data.
Telmai’s Data Observability provides the tooling for continuously monitoring the health of data lakehouses and Delta Live Tables (DLT). Telmai automatically and proactively detects, investigates, and monitors the quality of batch workloads and streaming data pipelines.
2. Investigating semi-structured and structured data alike
As issues arise, data teams need tooling to investigate and find the root cause of their data quality problems. Because the Databricks lakehouse architecture can house both structured and semi-structured data, any monitoring and observability layer needs to detect and investigate data quality issues regardless of the type and format of the data. However, semi-structured data has nested attributes and multi-value fields (arrays), so conventional approaches must flatten the data and calculate aggregations one attribute at a time. The overhead this adds to each query quickly becomes unmanageable.
Telmai, on the other hand, is designed from the ground up to support very large semi-structured schemas such as JSON (NDJSON), Parquet, and data warehouse tables with JSON files. Data teams can use Telmai to investigate the root cause of issues across all their data.
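To make the flattening problem concrete, here is a minimal sketch in plain Python of what "one attribute at a time" means for a nested record. The `flatten` helper and the sample `order` record are illustrative assumptions, not part of Telmai or Databricks.

```python
from typing import Any, Dict


def flatten(record: Any, prefix: str = "") -> Dict[str, list]:
    """Collect every leaf value of a nested record under its dotted path."""
    paths: Dict[str, list] = {}
    if isinstance(record, dict):
        for key, value in record.items():
            child = f"{prefix}.{key}" if prefix else key
            for path, values in flatten(value, child).items():
                paths.setdefault(path, []).extend(values)
    elif isinstance(record, list):
        # Multi-value field: every element contributes to the same path.
        for item in record:
            for path, values in flatten(item, prefix + "[]").items():
                paths.setdefault(path, []).extend(values)
    else:
        paths[prefix] = [record]
    return paths


# Hypothetical semi-structured record with a nested object and an array.
order = {
    "id": 17,
    "customer": {"name": "Ada", "zip": None},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}
flat = flatten(order)
```

Each key in `flat` (for example `customer.zip` or `items[].qty`) is an attribute that would need its own aggregation, which is why the cost grows with the width and depth of the schema.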
3. Anomaly detection and predictions
Machine learning and time series analysis of observed data can help data teams anticipate the data quality issues that may arise, and set notifications and alerts for when future data drifts from historical ranges or expected values. For example, if the completeness of a zip code field has historically been above 95%, incomplete data that pulls this KPI down is a signal that the root cause needs to be investigated.
Telmai uses time series analysis to learn about values in your data and create a baseline for normal behavior. With that baseline established, Telmai sets thresholds for every data metric it monitors, which automates anomaly detection. When the data crosses a threshold or falls outside historical patterns, alerts and notifications surface the changes in the data and key KPIs.
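As a rough illustration of the baseline-and-threshold idea, the sketch below flags a metric value that falls outside the historical mean plus or minus `k` standard deviations. This is a deliberately simple stand-in, not Telmai's actual model; the function names, the `k=3.0` default, and the sample completeness history are all assumptions.

```python
from statistics import mean, stdev


def baseline_thresholds(history, k=3.0):
    """Return (low, high) bounds derived from historical metric values."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    return mu - k * sigma, mu + k * sigma


def is_anomaly(value, history, k=3.0):
    """True when the new value falls outside the learned baseline range."""
    low, high = baseline_thresholds(history, k)
    return value < low or value > high


# Hypothetical daily % completeness of a zip code column.
zip_completeness = [96.1, 95.8, 96.4, 95.9, 96.2, 96.0, 95.7]

drop_detected = is_anomaly(91.3, zip_completeness)    # sharp drop
normal_day = is_anomaly(96.3, zip_completeness)       # within range
```

A production system would also account for trend and seasonality, which a fixed mean-and-deviation band ignores.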
4. Pipeline control using a circuit breaker
In many cases, identifying data quality issues is the first step in the process. Fully operationalizing data quality monitoring into the pipeline is often the ultimate goal. To fix data quality issues, data teams implement remediation pipelines that need to run and process at the right time and for the right data validation rule.
Telmai can programmatically be integrated into Databricks data pipelines. Using Telmai’s API, data teams can invoke remediation workflows based on the output of data quality monitoring. Repair actions can be taken based on the alert type and details provided through Telmai’s APIs. For example, incomplete data alerts can be flagged for upstream teams to fix, while duplicate data can go through a de-duplication workflow.
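The routing logic described above might look like the following sketch. The `Alert` shape, the alert kinds, the severity values, and the action names are hypothetical and chosen only for illustration; they do not reflect the actual payloads returned by Telmai's API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    kind: str       # hypothetical values: "duplicates", "incomplete", ...
    table: str
    severity: str   # hypothetical values: "warning", "critical"


def plan_actions(alerts):
    """Map each alert to a remediation step based on its kind."""
    actions = []
    for alert in alerts:
        if alert.kind == "duplicates":
            actions.append(("run_dedup_workflow", alert.table))
        elif alert.kind == "incomplete":
            actions.append(("notify_upstream_owner", alert.table))
        else:
            actions.append(("open_ticket", alert.table))
    return actions


def circuit_is_open(alerts, blocking_severity="critical"):
    """Circuit breaker: stop downstream steps if any alert is blocking."""
    return any(a.severity == blocking_severity for a in alerts)


alerts = [
    Alert("duplicates", "orders", "warning"),
    Alert("incomplete", "customers", "critical"),
]
actions = plan_actions(alerts)
halt_pipeline = circuit_is_open(alerts)
```

In a pipeline, a step like this would run after the quality check and before the load, so that a blocking alert prevents bad data from moving downstream.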
5. Cataloging high-quality data assets
After automating checks and balances on the data housed inside the lake house, data health metrics are published into the Unity Catalog so catalog users get additional insights about their data sets like Data Quality KPIs, open issues, certified data labels, and more.
Telmai provides continuous data quality checks and creates a series of data quality KPIs such as accuracy, completeness, and freshness. These KPIs can label data sets in the Unity Catalog for data consumers to easily discover and utilize high-quality data in their analytics, ML/AI, and data-sharing initiatives.
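For readers who want a feel for what two of these KPIs measure, here is a small sketch of completeness and freshness over in-memory rows. The function names, the column names, and the treatment of empty strings as missing are assumptions made for this example; publishing the results to a catalog is out of scope here.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, column):
    """Percentage of rows where the column is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return 100.0 * filled / len(rows)


def freshness(rows, ts_column, now):
    """Time elapsed since the most recent record was written."""
    latest = max(row[ts_column] for row in rows)
    return now - latest


# Hypothetical data set with one missing zip code.
rows = [
    {"zip": "94061", "updated_at": datetime(2024, 1, 1, 9, tzinfo=timezone.utc)},
    {"zip": "10001", "updated_at": datetime(2024, 1, 1, 12, tzinfo=timezone.utc)},
    {"zip": None, "updated_at": datetime(2024, 1, 1, 18, tzinfo=timezone.utc)},
    {"zip": "60601", "updated_at": datetime(2024, 1, 1, 15, tzinfo=timezone.utc)},
]
now = datetime(2024, 1, 2, 0, tzinfo=timezone.utc)

zip_completeness_pct = completeness(rows, "zip")
data_age = freshness(rows, "updated_at", now)
```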
6. No-code, low-code UI for collaboration
While Databricks’ primary users remain highly technical, Databricks Unity Catalog has opened access, discovery, and usage of data quality assets to a broader set of users.
With Telmai, data teams can take this further and invite their business counterparts and SMEs to collaborate on setting data quality rules and expectations and finding the root cause of issues before downstream business impact. Telmai provides over 40 out-of-the-box data quality metrics and has an intuitive, visual UI to make data quality accessible to those who actually have the business context for it.
7. Data profiling and analysis for migrating to Databricks
For those migrating to the Databricks data lake architecture, this capability is perhaps the first they will need. These migrations require easy, automated profiling and auditing tools to understand data structures and content before migration and to avoid duplication, data loss, or inconsistencies afterward.
Telmai’s low-code no-code Data Observability product enables profiling and understanding of data quality issues before migration, and testing and validation after migration, to ensure a well-designed and high-quality data lake environment.
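A pre- and post-migration check can be sketched as profiling both sides and comparing the results. The `profile` and `compare` helpers and the sample rows below are illustrative assumptions, not a description of how Telmai performs the validation.

```python
from collections import Counter


def profile(rows, key):
    """Summarize row count, key uniqueness, and nulls per column."""
    counts = Counter(row[key] for row in rows)
    columns = sorted({col for row in rows for col in row})
    return {
        "row_count": len(rows),
        "distinct_keys": len(counts),
        "duplicate_keys": sorted(k for k, n in counts.items() if n > 1),
        "null_counts": {
            col: sum(1 for row in rows if row.get(col) is None)
            for col in columns
        },
    }


def compare(before, after):
    """Return only the profile metrics that changed, as (before, after)."""
    return {
        name: (before[name], after.get(name))
        for name in before
        if before[name] != after.get(name)
    }


# Hypothetical source and target: the target lost id 3 and duplicated id 2.
source = [
    {"id": 1, "zip": "94061"},
    {"id": 2, "zip": None},
    {"id": 3, "zip": "10001"},
]
target = [
    {"id": 1, "zip": "94061"},
    {"id": 2, "zip": None},
    {"id": 2, "zip": None},
]
diff = compare(profile(source, "id"), profile(target, "id"))
```

Note that the row counts match in this example, so a count-only check would pass even though one record was lost and another duplicated.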
To learn more and see Telmai in action, request a demo with us here.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project and without it data quality could impact critical business decisions, customer trust, sales and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process: data collection, discovery and analysis, documenting the findings, and data quality monitoring.
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety rather than relying on samples, which can skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
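As a small example of structure discovery, the sketch below counts the value types observed in each column across every row. The `discover_structure` helper and the sample rows are hypothetical; the point is that a mixed-type column like `zip` only shows up if the offending rows are actually scanned.

```python
from collections import Counter, defaultdict


def discover_structure(rows):
    """Count the Python types observed for each column across all rows."""
    types = defaultdict(Counter)
    for row in rows:
        for column, value in row.items():
            types[column][type(value).__name__] += 1
    return {column: dict(counts) for column, counts in types.items()}


# Hypothetical rows where "zip" arrives as text, a number, and a null.
rows = [
    {"id": 1, "zip": "94061"},
    {"id": 2, "zip": 94061},
    {"id": 3, "zip": None},
]
structure = discover_structure(rows)
```

A sample that happened to miss the second and third rows would report `zip` as a clean string column.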
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
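A documented finding like this can be turned into a reusable rule. The sketch below assumes the rule is "a US ZIP code is five digits, optionally followed by a hyphen and four digits"; the pattern and helper names are illustrative choices, not a prescribed standard for your data.

```python
import re

# Assumed rule: 5 digits, optionally followed by "-" and 4 more digits.
ZIP_RULE = re.compile(r"\d{5}(-\d{4})?")


def is_valid_zip(value: str) -> bool:
    """True when the whole value matches the ZIP rule."""
    return ZIP_RULE.fullmatch(value) is not None


def violations(values):
    """Return the values that break the rule, in their original order."""
    return [value for value in values if not is_valid_zip(value)]


bad_zips = violations(["94061", "94 061", "94061-1234", "9406"])
```

Running the rule on the next profiling pass would catch `94 061` automatically instead of relying on someone spotting it by eye.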
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure these issues get corrected. Some you can correct yourself; others you will need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur as a result of both system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects face today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Data observability: Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules.
Data quality: Uses predefined metrics from a known set of policies to understand the health of the data.
Data observability: Detects issues, investigates their root cause, and helps remediate them.
Data quality: Detects issues and helps remediate them.
Data observability: Examples include continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows.
Data quality: Examples include data validation, data cleansing, and data standardization.
Data observability: Low-code / no-code to accelerate time to value and lower cost.
Data quality: Ongoing maintenance, tweaking, and testing of data quality rules add to its costs.
Data observability: Enables both business and technical teams to participate in data quality and monitoring initiatives.
Data quality: Designed mainly for technical teams who can implement ETL workflows or open source data validation software.
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed