A chef preparing a gourmet meal might use a thermometer to check the temperature of the meat. A mechanic working on a car might use a dipstick to check the engine's oil level. For data teams, how do we specify data quality, and how do we measure it?
Data quality is more important than ever, but as technology evolves, the way we talk about data quality is struggling to keep up.
Historically many data quality dimensions have been adopted, like Accuracy, Validity, Completeness, Consistency, Reliability, Timeliness, Uniqueness, Accessibility, Confidentiality, Relevance, Integrity, … etc.
However, there is no standardization of their names or descriptions.
A comprehensive survey of over 60 dimensions was conducted by DAMA NL Foundations and published in DDQ-Research-2020 as an attempt to move towards more standardization. Among the many dimensions of data quality, customers typically choose a small subset of the most critical ones for their needs.
These are referred to as the primary or critical dimensions.
Although the primary dimensions are not universally agreed on, even amongst data quality professionals, six dimensions are widely accepted as primary dimensions:
This blog post focuses on these six widely used dimensions and their measurements, which we refer to as data quality indicators (DQI).
Table of Contents
One of the most basic but crucial indicators of data quality is completeness, often defined as not-null values.
So, complete compared to what? Since data is populated at different levels, there’s different ways to calculate this.
- Dataset/Database: Total data present across all the tables in the entire dataset or database. Expected values are usually calculated against historical data or against other data sources in the pipeline.
- Table/schema: Total data populated in specific tables like Customers, Orders, or Transactions. Expected values are usually calculated against historical data or against other data sources in the pipeline.
- Metadata: The degree to which both technical and business metadata information is populated. This is usually measured as not-null values for metadata like descriptions, security tags, creation date, etc. in your data catalog.
- Records: Total number records in the dataset. Expected values can be measured against historic trends or lineage across various data sources in the pipeline. For example, based on historical trends, users expect 1M records but receive only 500k, causing incomplete records. Or BigQuery has 1M records whereas Cassandra has only 800k records, indicating the records are incomplete.
- Attributes: Total number of attributes in a schema definition. Expected values can be specified by business requirements. This is a useful indicator to ensure schema is complete across multiple data sources in your data pipeline and can be tracked through lineage of schema. For Example, the Customers table in BigQuery has 18 attributes, whereas Cassandra has only 16 attributes; often classified as schema mismatch too.
- Value: Total numbers of populated attribute values across all rows. For example, if a table has 10 rows with 5 attributes, this will indicate how many values are populated across these 50 values.
- An attribute value: Total numbers of populated attribute values for a specific attribute.
Completeness can be calculated using simple null checks at your datastore level. Data is populated with values that exclude the template/null proxies. Examples of null proxies would be 000-000-000, N/A, none, not-defined, Not-applicable , and NA.
The standard unit of measure is percentage(%).
For example, the Customers table of 3M customers has 2.94M populated emails.
Then the completeness of emails is 2.94/3 x 100 = 98%.
The percentage of records that can be identified uniquely based on a predefined key is its uniqueness.
Assuming your table has a primary key defined, the formula for measuring uniqueness would be:
1 – primary_key_count / total_row_count
⚠️ Good to know: Uniqueness is the inverse of duplicates
Here are a few examples of how uniqueness can affect data quality:
Duplicate with identical key values
One record with one key value occurs more than once in a dataset. The two records are not unique.
Oftentimes datastore constraints can easily help avoid this issue.
Duplicate with different key values
Multiple records with the same values occur more than once in a dataset. Object John is not unique in the dataset.
A record has the same key as another record, and both occur in a dataset. Key 22 is not unique.
Most often users will need to use sophisticated master data management and Identify resolution systems for resolving duplicates like these.
Freshness is the delay after which data is considered complete. Data is fresh when it is up-to-date and available on time.
Conversely, data is not fresh when it is stale and unreliable.
Freshness can be associated with various levels including metadata. Is the metadata available and up to date at the time someone is attempting to use it to understand the data that he or she is accessing?
You measure freshness by the length of time between data availability and time of the event or phenomenon they describe.
While many think of freshness as a simple metric, it has several layers, namely table-level freshness, record-level freshness, and finally entity-level freshness.
For data to be valid, it should be collected according to predefined rules and parameters, and should conform to the right format and be within the right range.
For example, if the data is dates and needs to be in DD-MM-YYYY format, but the data is in MM-DD-YY format, then it’s invalid.
Data could also be invalid if it doesn’t conform to defined business rules. For example, if cancelation date > travel date
All data can typically be measured for validity. Validity applies at the data item level, record level (for combinations of valid values), as well as metadata.
It’s typically measured as the percentage of data items valid.
Note that data may also be valid only for a specific length of time. For example, data that is generated from RFID or scientific datasets.
As one of the most complex yet critical data quality indicators, accuracy is the degree to which the data item correctly describes the “real world” object or event being described.
Specifications of the “real world” are usually based on business context.
Accuracy can be associated with various levels including metadata. Is the metadata from a verifiable source, and does it describe the data it is attached to correctly? For example, if the metadata is a definition, does the definition correctly define the data element in question?
You measure accuracy by the degree to which the data mirrors the characteristics of the real world object or objects it represents. The higher the percentage of data entries that pass the data accuracy rules, the more accurate your data is.
Here’s an example of data accuracy: A European school is receiving applications for its annual September intake and requires students to be aged 5 before the 31st August of the intake year.
In this scenario, the parent, a US citizen, applying to a European school completes the Date of Birth (D.O.B) on the application form in the US date format, MM/DD/YYYY rather than the European DD/MM/YYYY format, causing the representation of days and months to be reversed.
As a result, 09/08/YYYY really meant 08/09/YYYY causing the student to be accepted at the age of 5 on the 31st August in YYYY.
The representation of the student’s D.O.B.–whilst valid in its US context–means that in Europe the age was not derived correctly and the value recorded was consequently not accurate.
Consistency can refer to ensuring that data values are consistently represented within a dataset and between datasets, and consistently associated across datasets. Or the degree to which data values of two sets of attributes within a record comply with a rule. Or the degree to which the data values of a set of attributes of a dataset at different points in time comply with a rule.
In consistency two or more data values are compared.
There are various types of consistency:
- Consistency within a record: The degree to which data values of two sets of attributes within a record comply with a rule.
- Consistency between records within a dataset: The degree to which data values of two sets of attributes between records comply with a rule.
- Consistency across datasets: The degree to which data values of two sets of attributes between datasets comply with a rule.
- Consistency over time (temporal consistency): The degree to which the data values of a set of attributes of a dataset at different points in time comply with a rule.
- Consistency across metadata: Is all the metadata consistent, or does it conflict with other metadata? For example, does a usage business rule demand that data be used when the creation business rules state it hasn’t been created yet? Plausibility of data values can be seen of a type of consistency.
A rule can be an equation with a =, > or < symbol. For example, profit = return – costs.
A rule can also be a business rule. For example, one might expect that the number of transactions each day does not exceed 105% of the running average number of transactions for the previous 30 days.
A data item could be measured against itself or its counterpart in another dataset or database.
An example of data consistency would be whether a student’s date of birth has the same value and format in the school register as that stored within the Student database.
Consistency is often measured as a percentage.
Do not confuse consistency with accuracy or correctness.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project and without it data quality could impact critical business decisions, customer trust, sales and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourself of what is data profiling.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If data content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety and do not use samples as it will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur, both as a result of system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Start your data observibility today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.