Data Freshness: 3 Levels of Freshness and Which One is Right for You

Max Lukichev, Co-founder and CTO

In the context of observability, data freshness takes many forms, and the right one depends heavily on the intended use. Outdated data is just as harmful as incorrect data. Beyond that, understanding how often tables are updated plays a critical role in building self-service around data: it gives users important context on what to expect and how to integrate with the data optimally.

Freshness can be categorized into:

  • Table-level Freshness, i.e., how often a table is being updated and whether the time since the last update is anomalous
  • Record-level Freshness, i.e., what is the percentage of outdated records (entities) and whether this percentage increases abnormally
  • Entity-level Freshness, i.e., similar to record-level freshness, except that it accounts for multiple versions of an entity existing in the same table

Table-level Freshness

Understanding how often a table is being updated or used serves a dual purpose:

  • Augmenting data catalogs with data quality insights
  • Proactive monitoring and alerting

Data Catalog Augmentation: Knowing when data is updated and read is very valuable when deciding how that data can be used and integrated. For example, if data is updated constantly, it makes sense to invest in a streaming integration to save on operational costs; if it is updated infrequently, batch is the way to go. Usage patterns also signal whether the data is relevant at all: a table that is rarely queried may not be the right one, even if it has the correct schema.

All these considerations are critical for empowering self-service, and it is not uncommon to have unused legacy tables that need to be purged regularly.

Collectively, these insights support data discoverability through robust data catalogs.

Proactive Monitoring and Alerting: By analyzing historical trends in how often data is delivered to a table, a system can predict the expected update rate; if an update does not happen on time, that can indicate a serious issue. By proactively notifying the data pipeline owners, such failures can be mitigated and repaired before the damage is done, which makes this a very valuable insight for increasing data reliability.
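To make the idea concrete, here is a minimal sketch of that alerting logic, not Telmai's implementation: it learns a typical update interval from historical load timestamps and flags the table when the time since the last update exceeds that expectation. All names and the tolerance factor are illustrative assumptions.

```python
from datetime import datetime
from statistics import median

def is_update_overdue(update_history: list[datetime],
                      now: datetime,
                      tolerance: float = 1.5) -> bool:
    """Flag a table whose time since the last update exceeds its typical cadence.

    update_history: timestamps of past successful loads (hypothetical input).
    tolerance: slack allowed beyond the median interval before alerting.
    """
    if len(update_history) < 3:
        return False  # not enough history to establish a trend
    history = sorted(update_history)
    intervals = [(later - earlier).total_seconds()
                 for earlier, later in zip(history, history[1:])]
    expected = median(intervals)           # typical seconds between updates
    elapsed = (now - history[-1]).total_seconds()
    return elapsed > tolerance * expected  # overdue -> notify pipeline owners

# Example: daily loads, the last one 2.5 days ago -> overdue
loads = [datetime(2023, 5, day) for day in range(1, 11)]
print(is_update_overdue(loads, datetime(2023, 5, 12, 12)))  # True
```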

How to implement table-level freshness?

There are different options for implementing table-level freshness. Some use database query logs to determine the write rate; others look at the last-update timestamp in the table's metadata; still others look at the most recent timestamp within the table itself (when a record-update attribute is part of the schema). Each has its pros and cons, depending on the APIs available in the underlying systems.

Because query logs differ between systems and may not be available at all in some, they are not ideal for defining data freshness. A simpler option, such as the last-updated timestamp in the table metadata, usually makes more sense, even though it may not be as accurate.
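For illustration, here is a minimal sketch of the third option from the list above, reading the most recent timestamp within the table. It assumes the table has an `updated_at` column, and `run_query` is a stand-in for whatever client the underlying warehouse provides:

```python
from datetime import datetime, timezone

def table_staleness_hours(run_query, table: str, ts_column: str = "updated_at") -> float:
    """Hours elapsed since the most recent record timestamp in the table."""
    rows = run_query(f"SELECT MAX({ts_column}) AS last_update FROM {table}")
    last_update = rows[0]["last_update"]  # assumed to be a timezone-aware datetime
    return (datetime.now(timezone.utc) - last_update).total_seconds() / 3600
```

The resulting staleness value can then feed the cadence check sketched earlier to decide whether to alert.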

Record-Level Freshness

Unlike table-level freshness, record-level freshness relates to the quality of the data itself rather than the quality of the pipeline. Table-level freshness shows whether the pipeline is delivering data, but it won't tell you how much outdated data sits within the table. For example, a customer may consider a record fresh and valuable only if it was updated within the last month, and anything older to be outdated. Such insights cannot be gathered by looking at table-level freshness alone.

Tracking the percentage of outdated records is also very important in the context of a data catalog: a table with 50% of its records out of date may not be suitable for an accurate BI report, yet it may still be perfectly fine for developing and testing ML models. It is also a valuable metric to track so that you can be alerted when the percentage unexpectedly grows.
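As a minimal sketch of this metric, assuming each record carries an `updated_at` timestamp (the column name and the one-month threshold are illustrative):

```python
from datetime import datetime, timedelta, timezone

def outdated_record_ratio(records: list[dict],
                          max_age: timedelta = timedelta(days=30),
                          ts_field: str = "updated_at") -> float:
    """Share of records older than the freshness expectation."""
    if not records:
        return 0.0
    now = datetime.now(timezone.utc)
    # timestamps are assumed to be timezone-aware datetimes
    outdated = sum(1 for record in records if now - record[ts_field] > max_age)
    return outdated / len(records)
```

A monitor would then track this ratio over time and alert when it drifts far from its historical baseline.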

Entity-Level Freshness

Record-level freshness is a common request in data warehousing scenarios. Most modern data warehouses are designed for high write throughput; therefore, it is common practice to create multiple instances of the same record when some of its fields are updated, i.e., instead of updating a record in place, a new one is created with a newer timestamp. Deduplication, in this case, happens at query time.

Creating an additional record for every update also has the benefit of preserving the historical state of the data. That means outdated records are expected in the table. However, since multiple records belong to the same entity (e.g., a customer with the same unique ID has a record for every time it was updated), only the latest version of the entity needs to be analyzed for freshness. In this context, it is better to talk about entity-level freshness rather than record-level freshness.

Supporting entity-level freshness requires implicit deduplication of the records, which needs to be taken into account when building the solution.

Another consideration is that, unlike table-level freshness, which is quickly established automatically from historical trends, entity-level freshness is tied to the business context and often requires users to provide a definition for each use case. For example, in some cases data less than a month old is excellent; in others, data can't be older than a couple of days.

In practice, record-level and entity-level freshness can only be calculated if the table has a record-update-time attribute:
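As a minimal sketch of such a calculation over a versioned table, assuming hypothetical `customer_id` and `updated_at` columns and a one-month expectation (this is illustrative, not Telmai's implementation):

```python
from datetime import datetime, timedelta, timezone

def outdated_entity_ratio(records: list[dict],
                          entity_key: str = "customer_id",
                          ts_field: str = "updated_at",
                          max_age: timedelta = timedelta(days=30)) -> float:
    """Deduplicate versioned records per entity, then measure freshness
    only on each entity's latest version."""
    latest: dict = {}
    for record in records:
        key = record[entity_key]
        if key not in latest or record[ts_field] > latest[key][ts_field]:
            latest[key] = record  # keep only the newest version per entity
    if not latest:
        return 0.0
    now = datetime.now(timezone.utc)
    outdated = sum(1 for r in latest.values() if now - r[ts_field] > max_age)
    return outdated / len(latest)
```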

In the example above, the Freshness Expectation can be set via tools like Telmai to be no more than one month old. Telmai uses this to estimate the expected percentage of outdated records and alerts when that ratio changes drastically.

Telmai also visualizes this from a historical perspective.

Conclusion

As we demonstrated in this article, freshness is a very broad and very important area of data observability, as well as a key metric for data quality. Tracking different kinds of freshness addresses different use cases. If you'd like to learn how we approach it at Telmai, please reach out to us: https://www.telm.ai/contact-us.

Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality could impact critical business decisions, customer trust, sales, and financial opportunities.

To get started, there are four main steps in building a complete and ongoing data profiling process:

  1. Data Collection
  2. Discovery & Analysis
  3. Documenting the Findings
  4. Data Quality Monitoring

We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves of what data profiling is.

What are the different kinds of data profiling?

Data profiling falls into three major categories: structure discovery, content discovery, and relationship discovery. While they all help in gaining a better understanding of the data, the types of insights they provide are different:

Structure discovery analyzes whether data is consistent, correctly formatted, and well structured. For example, if you have a ‘Date’ field, structure discovery helps you see the various date patterns (e.g., YYYY-MM-DD or YYYY/DD/MM) so you can standardize your data into one format.

Structure discovery also examines basic statistics in the data, for example, minimum and maximum values, means, medians, and standard deviations.
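As a rough illustration of structure discovery (not any specific tool's API), a column can be profiled by normalizing each value into a pattern and collecting basic statistics; the helper below is a hypothetical sketch:

```python
import re
from collections import Counter
from statistics import mean, median, stdev

def value_pattern(value: str) -> str:
    """Map '2023-05-01' -> 'NNNN-NN-NN' so distinct formats surface."""
    pattern = re.sub(r"[0-9]", "N", value)
    return re.sub(r"[A-Za-z]", "A", pattern)

def profile_structure(values: list[str]) -> dict:
    patterns = Counter(value_pattern(v) for v in values)
    numbers = [float(v) for v in values if re.fullmatch(r"-?\d+(\.\d+)?", v)]
    stats = {}
    if len(numbers) >= 2:
        stats = {"min": min(numbers), "max": max(numbers),
                 "mean": mean(numbers), "median": median(numbers),
                 "stdev": stdev(numbers)}
    return {"patterns": dict(patterns), "numeric_stats": stats}

print(profile_structure(["2023-05-01", "2023/01/05", "2023-06-02"]))
# {'patterns': {'NNNN-NN-NN': 2, 'NNNN/NN/NN': 1}, 'numeric_stats': {}}
```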

Content discovery looks more closely into the individual attributes and data values to check for data quality issues. This can help you find null values, empty fields, duplicates, incomplete values, outliers, and anomalies.

For example, if you are profiling address information, content discovery helps you see whether your ‘State’ field contains two-letter abbreviations, fully spelled-out state names, both, or potentially some typos.

Content discovery can also be a way to validate data against predefined rules. This helps improve data quality by identifying instances where the data does not conform to those rules; for example, a transaction amount should never be less than $0.
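A minimal sketch of such rule-based content checks, with made-up field names and rules:

```python
def validate_records(records: list[dict]) -> list[str]:
    """Return human-readable violations of simple content rules."""
    violations = []
    seen_ids = set()
    for i, record in enumerate(records):
        if record.get("transaction_amount") is None:
            violations.append(f"row {i}: missing transaction_amount")
        elif record["transaction_amount"] < 0:
            violations.append(f"row {i}: negative transaction_amount")
        if record.get("id") in seen_ids:
            violations.append(f"row {i}: duplicate id {record['id']}")
        seen_ids.add(record.get("id"))
    return violations

print(validate_records([
    {"id": 1, "transaction_amount": 25.0},
    {"id": 2, "transaction_amount": -3.0},  # violates the >= $0 rule
    {"id": 2, "transaction_amount": 10.0},  # duplicate id
]))
```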

Relationship discovery identifies how different datasets are related to each other, for example, key relationships between database tables or lookup cells in a spreadsheet. Understanding relationships is most critical when designing a new database schema, a data warehouse, or an ETL flow that joins tables and datasets based on those key relationships.
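For instance, one small relationship check is verifying that a foreign-key-like column only references existing keys; the table and column names below are made up for illustration:

```python
def orphaned_keys(child_rows: list[dict], parent_rows: list[dict],
                  child_key: str = "customer_id",
                  parent_key: str = "id") -> set:
    """Values in the child table that reference no row in the parent table."""
    parent_keys = {row[parent_key] for row in parent_rows}
    return {row[child_key] for row in child_rows} - parent_keys

orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 99}]
customers = [{"id": 10, "name": "Acme"}]
print(orphaned_keys(orders, customers))  # {99}
```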

Data Observability

  • Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules
  • Detects issues, investigates their root cause, and helps remediate them
  • Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows
  • Low-code / no-code to accelerate time to value and lower cost
  • Enables both business and technical teams to participate in data quality and monitoring initiatives

Data Quality

  • Uses predefined metrics from a known set of policies to understand the health of the data
  • Detects issues and helps remediate them
  • Examples: data validation, data cleansing, data standardization
  • Ongoing maintenance, tweaking, and testing of data quality rules adds to its costs
  • Designed mainly for technical teams who can implement ETL workflows or open-source data validation software
