Data Freshness in the context of observability takes many forms, which are vastly different depending on the intended use. Outdated data is just as bad as incorrect data. But in addition, understanding how often tables are updated plays a critical role in building self-service around data. It provides important context to the users on what to expect and how to integrate with the data optimally.
Freshness can be categorized into:
- Table-level Freshness, i.e., how often a table is being updated and whether the time since the last update is anomalous
- Record-level Freshness, i.e., what is the percentage of outdated records (entities) and whether this percentage increases abnormally
- Entity-level Freshness, i.e. similar to record-level, except it takes into account that multiple versions of an entity may exist in the same table
Understanding how often a table is being updated or used serves a dual purpose
- Augmenting data catalogs with data quality insights
- Proactive monitoring and alerting
Data Catalog Augmentation: Understanding when the data is being updated and read is very valuable when deciding how the data can be used and integrated. For example: If data is constantly updated, it makes sense to invest in streaming integration to save on operational costs, but if the data is updated infrequently, then batch is the way to go; or even if the data is relevant at all, the table which is rarely used for querying may not be the right one even if it has the correct schema.
All these considerations are critical for empowering self-service, and it's also not uncommon to have unused legacy tables that need constant purging.
These use cases collectively serve data discoverability use cases via robust data catalogs.
Proactive Monitoring and Alerting: By analyzing historical trends on how often the data was delivered to a table, a system can predict the update rate and if for whatever reason such update doesn’t happen on time, it can indicate a serious issue. By proactively notifying the data pipeline owners, failures like that can be mitigated and repaired before the damage is done and hence this is a very valuable insight in increasing data reliability.
How to implement Table level freshness ?
There are different options for implementing table-level freshness. Some use database query logs to determine the write rate; others look at the last update timestamp of the table in its metadata. In contrast, some others look at the most recent timestamp within the table (in case the record update attribute is part of the schema). Each has its pros and cons, depending on the available APIs of the underlying systems.
Because query logs are different in each system and may not even be available in some, they are not ideal for defining data freshness. Therefore, a simpler option like last updated timestamp metadata would make more sense, even though it may not be as accurate.
Unlike table-level freshness, record-level freshness is related to the quality of the data itself rather than the quality of the pipeline. Table-level freshness shows us whether the pipeline delivers data but won’t tell you how much outdated data is within the table. For example, a customer may consider a record fresh and valuable if it was updated within the last month, and anything else is considered outdated. These insights can not be gathered by looking at table level freshness.
Tracking the percent of outdated records is very important in the context of the data catalog, too; a table with 50% of records out of date may not be suitable for an accurate BI report. However, it is ok to develop and test ML models. It is also a valuable metric to track and be alerted when the percentage unexpectedly grows.
Record-level freshness is a common request in data warehousing scenarios. Most modern data warehouses are designed for the high write throughput; therefore, it is a common practice to create multiple instances of the same record when updating some fields, i.e. instead of updating a record a new one is created with a newer timestamp. Deduplication, in this case, happens at a query time.
Creating additional records for every update also has the benefit of being able to look at the historical data state. That means it’s expected to have outdated records in the table. However, since multiple records belong to the same entity (ex., a customer with the same unique id has a history for every time it was updated), only the latest version of the entity needs to be analyzed for freshness. In this context, it’s better to talk of entity-level freshness rather than record-level freshness.
Supporting entity-level freshness requires implicit deduplication of the records and hence needs to be considered when building the solution.
Another consideration is that unlike table freshness, which is quickly established automatically via historical trends, entity-level freshness is tied to the business context and often requires users to provide definitions for each use case. For example, in some instances, <1 month is excellent; in others, data can’t be older than a couple of days.
In practice, record-level freshness can be calculated only if the table has an attribute - record update time:
In the example above, the Freshness Expectation can be set via tools like Telmai to be no more than one month old. Telmai uses this to estimate the expected percentage of outdated records and alerts when that ratio drastically changes.
Telmai also visualizes this in a historical perspective:
As we demonstrated in this article Freshness is a very broad and a very important area of Data Observability and also a key metric for Data Quality. Tracking different kinds of freshness addresses different use cases. If you’d like to learn how we approach it at Telmai, please reach out to us: https://www.telm.ai/contact-us.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project and without it data quality could impact critical business decisions, customer trust, sales and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourself of what is data profiling.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If data content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety and do not use samples as it will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur, both as a result of system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules
Uses predefined metrics from a known set of policies to understand the health of the data
Detects, investigates the root cause of issues, and helps remediate
Detects and helps remediate.
Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows
Examples: data validation, data cleansing, data standardization
Low-code / no-code to accelerate time to value and lower cost
Ongoing maintenance, tweaking, and testing data quality rules adds to its costs
Enables both business and technical teams to participate in data quality and monitoring initiatives
Designed mainly for technical teams who can implement ETL workflows or open source data validation software
Start your data observibility today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.