Data Freshness: 3 Levels of Freshness and Which One is Right for You
October 21, 2022
Max Lukichev, Co-founder and CTO
Data Freshness in the context of observability takes many forms, which are vastly different depending on the intended use. Outdated data is just as bad as incorrect data. But in addition, understanding how often tables are updated plays a critical role in building self-service around data. It provides important context to the users on what to expect and how to integrate with the data optimally.
Freshness can be categorized into:
Table-level Freshness, i.e., how often a table is being updated and whether the time since the last update is anomalous
Record-level Freshness, i.e., what is the percentage of outdated records (entities) and whether this percentage increases abnormally
Entity-level Freshness, i.e. similar to record-level, except it takes into account that multiple versions of an entity may exist in the same table
Understanding how often a table is being updated or used serves a dual purpose
Augmenting data catalogs with data quality insights
Proactive monitoring and alerting
Data Catalog Augmentation: Understanding when the data is being updated and read is very valuable when deciding how the data can be used and integrated. For example: If data is constantly updated, it makes sense to invest in streaming integration to save on operational costs, but if the data is updated infrequently, then batch is the way to go; or even if the data is relevant at all, the table which is rarely used for querying may not be the right one even if it has the correct schema.
All these considerations are critical for empowering self-service, and it's also not uncommon to have unused legacy tables that need constant purging.
These use cases collectively serve data discoverability use cases via robust data catalogs.
Proactive Monitoring and Alerting: By analyzing historical trends on how often the data was delivered to a table, a system can predict the update rate and if for whatever reason such update doesn’t happen on time, it can indicate a serious issue. By proactively notifying the data pipeline owners, failures like that can be mitigated and repaired before the damage is done and hence this is a very valuable insight in increasing data reliability.
How to implement Table level freshness ?
There are different options for implementing table-level freshness. Some use database query logs to determine the write rate; others look at the last update timestamp of the table in its metadata. In contrast, some others look at the most recent timestamp within the table (in case the record update attribute is part of the schema). Each has its pros and cons, depending on the available APIs of the underlying systems.
Because query logs are different in each system and may not even be available in some, they are not ideal for defining data freshness. Therefore, a simpler option like last updated timestamp metadata would make more sense, even though it may not be as accurate.
Unlike table-level freshness, record-level freshness is related to the quality of the data itself rather than the quality of the pipeline. Table-level freshness shows us whether the pipeline delivers data but won’t tell you how much outdated data is within the table. For example, a customer may consider a record fresh and valuable if it was updated within the last month, and anything else is considered outdated. These insights can not be gathered by looking at table level freshness.
Tracking the percent of outdated records is very important in the context of the data catalog, too; a table with 50% of records out of date may not be suitable for an accurate BI report. However, it is ok to develop and test ML models. It is also a valuable metric to track and be alerted when the percentage unexpectedly grows.
Record-level freshness is a common request in data warehousing scenarios. Most modern data warehouses are designed for the high write throughput; therefore, it is a common practice to create multiple instances of the same record when updating some fields, i.e. instead of updating a record a new one is created with a newer timestamp. Deduplication, in this case, happens at a query time.
Creating additional records for every update also has the benefit of being able to look at the historical data state. That means it’s expected to have outdated records in the table. However, since multiple records belong to the same entity (ex., a customer with the same unique id has a history for every time it was updated), only the latest version of the entity needs to be analyzed for freshness. In this context, it’s better to talk of entity-level freshness rather than record-level freshness.
Supporting entity-level freshness requires implicit deduplication of the records and hence needs to be considered when building the solution.
Another consideration is that unlike table freshness, which is quickly established automatically via historical trends, entity-level freshness is tied to the business context and often requires users to provide definitions for each use case. For example, in some instances, <1 month is excellent; in others, data can’t be older than a couple of days.
In practice, record-level freshness can be calculated only if the table has an attribute - record update time:
In the example above, the Freshness Expectation can be set via tools like Telmai to be no more than one month old. Telmai uses this to estimate the expected percentage of outdated records and alerts when that ratio drastically changes.
Telmai also visualizes this in a historical perspective:
As we demonstrated in this article Freshness is a very broad and a very important area of Data Observability and also a key metric for Data Quality. Tracking different kinds of freshness addresses different use cases. If you’d like to learn how we approach it at Telmai, please reach out to us: https://www.telm.ai/contact-us.
On this page
Start your data observibility today
Connect your data and start generating a baseline in less than 10 minutes.