Did you know that you can do more with SQL than just retrieve and manipulate data?
By periodically querying your database to track critical metrics - such as row count, percentage of unique values in a column, and other basic data quality KPIs - you can check and maintain the health of your tables.
This post will cover 9 data quality checks you can perform with SQL. You will also learn about Telmai, a high-performance alternative to SQL-based tools that can automate data quality checks with a low-code, no-code interface, and without slowing down your databases.
Table of Contents
- Duplicate Records
- NULL Value
- Data Type
- Why SQL-based Data Quality Checks Aren’t Enough
- An Alternative Approach
First we’ll identify duplicate records in a table.
A duplicate records check in SQL can be performed by using the GROUP BY clause along with the HAVING clause. Here's an example in SQL Server:
In this code, the GROUP BY clause groups together records with identical values in the specified columns, and the HAVING clause filters out groups with a count of one, leaving only the duplicates.
Next, let’s ensure that mandatory fields have values rather than NULLs.
A null value check in SQL can be performed by using the IS NULL operator in a WHERE clause. Here's an example:
This will give you a result set of all the records in the table where the value in column X is NULL.
Now let’s verify that data in each field is of the correct data type.
A data type check in SQL can be performed by using the CASE statement and the IS OF operator. Here's an example:
This will return all the records in the table, along with a new column, DataType, indicating the data type of the value in columnX. The CASE statement checks the data type of columnX and returns a string value indicating the data type. In this example, DATATYPE1 and DATATYPE2 are placeholders for the actual data types you want to check for, such as INT, VARCHAR, DATE, etc. If the data type of columnX is neither DATATYPE1 nor DATATYPE2, it will be categorized as "Other."
Here we validate that values in a field fall within a specified range.
A range check in SQL can be performed by using the BETWEEN operator in a WHERE clause. Here's how:
This will give you all the records in the table where the value in columnX is between lower_bound and upper_bound. The BETWEEN operator is inclusive, meaning that the result set will include records where the value in columnX is equal to lower_bound or upper_bound.
A domain check in SQL can be performed by using a subquery in a WHERE clause to check if the value in a column is in a list of valid values, known as a domain.
Here's the example query:
This gives you all the records in the table where the value in columnX is in the list of valid values stored in the domain_table.
This helps to prevent duplicates from being inserted into the database and ensures that each value in the column is unique.
The most common way to perform a uniqueness check in SQL is to use a unique constraint. This is a restriction placed on a column or set of columns that ensures that the values in those columns are unique across all records in the table.
Here's an example of how to create one:
This will create a unique constraint uc_column_name on the column_name in the table_name. If you try to insert a record into the table with a value that already exists in the column_name, the database engine will raise an error and reject the insertion.
A format check is a way to enforce a particular pattern or format for the values in a particular column, helping to ensure that the data in the column is consistent.
Here’s an example:
This will create a check constraint ck_column_name on the column_name in the table_name. The constraint ensures that the value in column_name starts with an uppercase letter and is followed by two digits. Knowing regex will help you form the constraint.
A length check is a way to enforce a specific length or range of lengths for the values in a particular column.
Here’s an example using a check constraint:
This will create a check constraint ck_column_name on the column_name in the table_name. The constraint ensures that the length of the value in column_name is between 5 and 10 characters.
Last but not least, a completeness check helps to identify missing or incomplete data and prevent it from affecting your analysis or reports.
There are a few ways to perform a completeness check in SQL:
Use the NULL value
You can check for missing data by looking for NULL values in a particular column.
This query returns all the rows where the column_name is NULL.
Use the COUNT function
You can also use the COUNT function to check the completeness of your data.
This query returns the number of rows in the table_name. You can compare this count to the expected number of rows to determine if there is any missing data.
Use the GROUP BY clause
You can also use the GROUP BY clause to check for missing data.
This query returns the number of rows for each unique value in the group_by_column. You can use this information to determine if there is any missing data for a particular group.
Developing a tool in-house sounds appealing since you know your domain and can easily define all the SQL queries you want to monitor your data. In addition, it seems relatively simple to implement an alerting application that runs queries against your databases and notifies you based on your defined rules.
However, an in-house tool comes with its own difficulties. The following are the five main drawbacks to this approach:
- They only work with data sources that can be queried in SQL. If you have to monitor streaming data, data coming from cloud file storage, flat files, APIs, or NoSQL databases, you’re out of luck. Limiting monitoring to only SQL sources often means that you are not able to monitor the data upstream of that SQL source.
- They will slow down your databases. Building a scalable SQL-based monitoring application is difficult because such applications work by executing queries at constant time intervals. As your database gets bigger, the entire monitoring process slows down due to the extra time needed to query hundreds or thousands of attributes and millions of records. You might instead choose to run queries more sparingly, but this can prevent you from finding out about problems until they’ve grown large enough to threaten your workflow.
- They add processing costs to your infrastructure. Your costs will go up if you need to execute a lot of queries on your database. To avoid performance problems, as discussed in the previous point, you need to increase the capacity of your infrastructure and dedicate more resources to it, which in turn increases your costs.
- They are hard to maintain over time. Data is inherently subject to change. As your data changes, you would need to change your data quality checks. This means having an inventory of all your rules along with proper documentation and downstream impacts they may have to be able to tweak and adjust them without implications.
- They miss the unknown unknowns. You can only validate or check what you know. That means anomalies, outliers, and drifts in your data may get passed the rules you set up without you knowing.
The real key to improving your data monitoring process is to begin monitoring before the data lands in your data warehouse. As an example of an alternative option, let's take a look at what Telmai offers:
Telmai gives you the ability to monitor data across your entire pipeline regardless of whether your data structures are in SQL or nonSQL, so you can track and analyze your data upstream or as it’s being transferred to your database or data warehouse. This means you can monitor incoming data without impacting the performance of your data warehouse. You’ll also learn about potential problems with the data sooner in the process.
Telmai’s data quality analysis and observability platform is designed to help you detect and solve issues in real time, in every stage of your data transformation. Unlike using SQL to assess the health of data when data is at rest in the database or data warehouse, with Telmai, you won’t need to write any code or rules to set it up. Its user-friendly UI allows you to define what correct data should look like, and its machine learning algorithms proactively monitor the data and alert you to deviations that you may not have accounted in the past.
To see how Telmai can speed up and improve your data monitoring, request a demo.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project and without it data quality could impact critical business decisions, customer trust, sales and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourself of what is data profiling.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If data content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety and do not use samples as it will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur, both as a result of system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules
Uses predefined metrics from a known set of policies to understand the health of the data
Detects, investigates the root cause of issues, and helps remediate
Detects and helps remediate.
Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows
Examples: data validation, data cleansing, data standardization
Low-code / no-code to accelerate time to value and lower cost
Ongoing maintenance, tweaking, and testing data quality rules adds to its costs
Enables both business and technical teams to participate in data quality and monitoring initiatives
Designed mainly for technical teams who can implement ETL workflows or open source data validation software
Start your data observibility today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.