I’m excited to share some of the latest features we have been working on at Telmai.
This release adds the following functionality to our product:
- Automatic ML-based thresholds
- Support of Semi-Structured Data (JSON)
- New Integrations: Snowflake, Firebolt
- Change Data Capture (CDC) for SQL sources and cloud data storage
- Data metric segmentation
- New Data Metrics: Table level metrics + distribution drifts
Support of Semi-Structured Data
Most modern cloud data warehouses and data lakes now support semi-structured schemas. Data architects are leveraging this structure to design the most efficient data model for storage and querying. Providing quality metrics and KPIs that are aware of such systems is crucial for establishing accurate observability outcomes.
Telmai can now monitor not only flat data but also files and data warehouse tables with semi-structured schemas (i.e., nested and multi-valued attributes). Telmai is designed to support complete analysis of complex data with thousands of attributes without any impact on performance.
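To make "nested and multi-valued attributes" concrete, here is a minimal sketch of how a completeness metric could be computed per attribute path over JSON-like records. It is an illustration only and not Telmai's implementation; the function names `flatten` and `completeness`, the `[]` path notation, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict

def flatten(value, path=""):
    """Yield (attribute_path, leaf_value) pairs for a nested JSON-like value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        # Multi-valued attribute: every element is reported under the same path.
        for child in value:
            yield from flatten(child, f"{path}[]")
    else:
        yield path, value

def completeness(records):
    """Return {attribute_path: share of leaf values that are non-null and non-empty}."""
    total = defaultdict(int)
    filled = defaultdict(int)
    for record in records:
        for path, leaf in flatten(record):
            total[path] += 1
            if leaf not in (None, ""):
                filled[path] += 1
    return {path: filled[path] / total[path] for path in total}

# Hypothetical sample data: one of the three email values is empty.
records = [
    {"id": 1, "contact": {"emails": ["a@example.com", ""]}},
    {"id": 2, "contact": {"emails": ["b@example.com"]}},
]
print(completeness(records))
```

Note that empty lists and empty objects contribute no leaf values in this sketch, so a real metric would need to decide how to count them.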
New Integrations: Snowflake, Firebolt
We added support for Azure Blob, Snowflake, and Firebolt in addition to BigQuery, CloudStorage, S3 and local files.
All SQL sources now support both flat and semi-structured schemas and can be configured for Change Data Capture (CDC).
Stay tuned for a separate blog on these integrations.
Change Data Capture (CDC) support
Telmai provides a way to schedule runs at a specific interval (hourly, daily, or weekly) for any source. Additionally, you can configure Telmai to process and monitor only the portion of the data (the delta) that changed between runs. For cloud storage, the delta is determined via file metadata. For SQL sources, users can specify an attribute holding record creation/update timestamps, which is then used to read freshly changed records.
CDC support is in addition to full database and table analysis, i.e., we can provide metrics and alerts on the total data as well as on the changed data.
This functionality enables users to review trends and drifts both holistically for the entire data set and for the changed data only.
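The timestamp-based delta for SQL sources described above can be sketched as follows. This is a generic watermark pattern shown with Python's built-in `sqlite3` module, not Telmai's actual mechanism; the `read_delta` helper, the `orders` table, and the `updated_at` column are hypothetical names chosen for the example.

```python
import sqlite3

def read_delta(conn, table, ts_column, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    # Table and column names are interpolated into the SQL text,
    # so they must come from trusted configuration, never from user input.
    query = f"SELECT * FROM {table} WHERE {ts_column} > ? ORDER BY {ts_column}"
    cursor = conn.execute(query, (last_watermark,))
    columns = [d[0] for d in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    # If nothing changed, keep the previous watermark.
    new_watermark = rows[-1][ts_column] if rows else last_watermark
    return rows, new_watermark

# Hypothetical table with ISO-8601 timestamps, which sort correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2022-03-01T08:00:00"),
        (2, 25.5, "2022-03-02T09:30:00"),
        (3, 7.25, "2022-03-03T11:15:00"),
    ],
)

# Only rows updated after the previous run are read.
delta, watermark = read_delta(conn, "orders", "updated_at", "2022-03-01T23:59:59")
print([row["id"] for row in delta], watermark)
```

Storing the returned watermark and passing it to the next scheduled run is what keeps each run limited to freshly changed records.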
Data Metric and threshold segmentation
Data in the same table often needs to be analyzed and monitored separately as the trends may vary based on specific dimensions, like different customers or geographic regions.
Starting with this release, Telmai allows users to specify such a dimension for a data source, enabling both holistic and segmented metric analysis.
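As a rough illustration of why segmentation matters, the sketch below computes the same metric (share of missing values) once for the whole data set and once per segment. It is a simplified stand-in, not Telmai's implementation; `null_rate_by_segment` and the `region` and `email` attributes are hypothetical.

```python
from collections import defaultdict

def null_rate_by_segment(records, attribute, segment_by):
    """Return (overall, per_segment) share of records where `attribute` is missing."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [missing, total]
    for record in records:
        bucket = counts[record.get(segment_by)]
        bucket[1] += 1
        if record.get(attribute) in (None, ""):
            bucket[0] += 1
    missing = sum(m for m, _ in counts.values())
    total = sum(t for _, t in counts.values())
    overall = missing / total if total else 0.0
    per_segment = {segment: m / t for segment, (m, t) in counts.items()}
    return overall, per_segment

# Hypothetical records: the overall rate hides that all gaps are in one region.
records = [
    {"region": "US", "email": "a@example.com"},
    {"region": "US", "email": "b@example.com"},
    {"region": "US", "email": "c@example.com"},
    {"region": "EU", "email": "d@example.com"},
    {"region": "EU", "email": None},
]
overall, per_segment = null_rate_by_segment(records, "email", "region")
print(overall, per_segment)
```

In this sample the holistic metric looks mild, while the per-segment view shows that half of the records in one region are incomplete.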
Automatic ML-based Thresholds
With our low-code no-code approach, Telmai calculates thresholds for each data metric on your dataset. These ML-based thresholds are now enhanced to evolve with your data without any configuration.
Telmai automatically establishes and predicts trends for key metrics, such as the percentage of non-null/non-empty values, the number of records, the percentage of unique values, and many more. When an observed metric value falls outside the prediction boundaries, Telmai issues an alert and sends a notification to subscribers.
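The idea of alerting when an observed value leaves its prediction boundaries can be sketched with a deliberately simple statistical stand-in: bounds of the historical mean plus or minus `k` standard deviations. Telmai's ML-based thresholds are not described in this post, so this is an assumption-laden illustration, not the product's algorithm; `prediction_bounds`, `check_metric`, and the default `k=3.0` are hypothetical.

```python
from statistics import mean, stdev

def prediction_bounds(history, k=3.0):
    """Return (lower, upper) as mean +/- k sample standard deviations of past values."""
    center = mean(history)
    spread = stdev(history)  # requires at least two historical values
    return center - k * spread, center + k * spread

def check_metric(history, observed, k=3.0):
    """Return an alert message if `observed` falls outside the bounds, else None."""
    lower, upper = prediction_bounds(history, k)
    if lower <= observed <= upper:
        return None
    return f"ALERT: observed {observed} outside [{lower:.1f}, {upper:.1f}]"

# Hypothetical daily record counts from previous runs.
daily_record_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]
print(check_metric(daily_record_counts, 1008))  # close to the usual level
print(check_metric(daily_record_counts, 400))   # far below the usual level
```

Because the bounds are recomputed from the history on every call, they move as new observations are appended, which is the "evolving threshold" behavior in miniature. A production approach would also need to account for trend and seasonality.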
New Metrics: Distribution Drifts
In addition to drifts in data metrics such as record count and completeness, Telmai can automatically detect unexpected changes in the distributions of categorical data or in the distributions of value patterns.
Categorical data is often used to understand the segmentation of a business, and a sudden drift in such distributions can have a direct business impact. With this automatic alert, data teams become aware of such drifts early and can investigate before the business is affected.
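One common way to quantify drift between two categorical distributions is the total variation distance, which is 0 for identical distributions and 1 for completely disjoint ones. The sketch below uses it as an example measure; the post does not say which measure Telmai uses, and the `payment_method` values and the 0.1 alert threshold are hypothetical.

```python
from collections import Counter

def distribution(values):
    """Return {category: relative frequency} for a non-empty list of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(baseline, current):
    """Return half the sum of absolute frequency differences across all categories."""
    p, q = distribution(baseline), distribution(current)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Hypothetical payment_method values from a previous run and the current run.
baseline = ["card"] * 50 + ["paypal"] * 30 + ["wire"] * 20
current = ["card"] * 20 + ["paypal"] * 30 + ["wire"] * 50

score = total_variation_distance(baseline, current)
DRIFT_THRESHOLD = 0.1  # assumed value; in practice this would be tuned or learned
if score > DRIFT_THRESHOLD:
    print(f"Distribution drift detected: {score:.2f}")
```

The same function could be applied to value patterns instead of raw values by first mapping each value to a pattern string before comparing distributions.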
All of the above and much more has been added to our product. If you would like to learn more about these features and how they apply to your use case, feel free to schedule a demo using this link.
To get started, there are four main steps in building a complete and ongoing data profiling process: data collection, discovery and analysis, documenting the findings, and data quality monitoring.
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality could affect critical business decisions, customer trust, and sales and financial opportunities.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect to and analyze all your data without requiring any prep work from you.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety rather than using samples, as sampling can skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
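A documented finding like the ZIP code example can be turned into a reusable rule. The sketch below checks values against the standard five-digit ZIP format, with the optional ZIP+4 extension; `ZIP_RULE` and `invalid_zip_codes` are hypothetical names, and the rule assumes the values are strings.

```python
import re

# Five digits, optionally followed by a hyphen and four more digits (ZIP+4).
ZIP_RULE = re.compile(r"\d{5}(-\d{4})?")

def invalid_zip_codes(values):
    """Return the values that do not fully match the ZIP code rule."""
    return [value for value in values if not ZIP_RULE.fullmatch(value)]

# "94 061" contains a space and "9406" is too short, so both are flagged.
print(invalid_zip_codes(["94061", "94 061", "94061-1234", "9406"]))
```

Running this rule the next time the data is profiled catches the same class of error without anyone having to rediscover it by hand.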
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If left unchecked, data quality defects will continue to occur as a result of both system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects face today. Its advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Data observability tools:
- Leverage ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules
- Detect issues, investigate their root cause, and help remediate them
- Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows
- Low-code / no-code to accelerate time to value and lower cost
- Enable both business and technical teams to participate in data quality and monitoring initiatives
Traditional data quality tools:
- Use predefined metrics from a known set of policies to understand the health of the data
- Detect issues and help remediate them
- Examples: data validation, data cleansing, data standardization
- Ongoing maintenance, tweaking, and testing of data quality rules add to the cost
- Designed mainly for technical teams who can implement ETL workflows or open source data validation software
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed