Telmai named one of the most promising data startups by Business Insider!  Read More

How to Build Data Validation

How to Build Data Validation
Farnaz Erfan

Imagine you're at a gas station filling up your car and you notice that the numbers on the pump aren't matching with what you're paying at the payment terminal. This is an example of poor data validation - the data being collected (the amount of gas being dispensed) isn't matching up with the data being recorded (the cost of the gas).

Beyond the gas pump, data validation is a critical step in machine learning, migrating data from one system to another, and really any data product. Since information is constantly being updated, deleted, entered manually, or moved around, having valid data is a must.

In this blog post, we'll dive into how to do data validation, the pros and cons of different data validation methods, and how a data observability platform can help automate and streamline the data validation process.

Four ways to do data validation 

There are many different ways to approach data validation, and the best method will depend on the specific needs and resources of your project. 

The four most common are writing your own scripts, using Excel’s built-in data validation tool, using an ETL tool, or leveraging a data observability tool. Here’s what you need to know about each:

1. Coding data validation

Depending on your ability to code, validating data by writing a script may be an option for you. You can write Spark jobs, Python scripts, DBT assertions, or Great Expectations to compare data values and metadata against your defined rules and schemas to verify that all the necessary information meets your required data quality parameters. 

The most popular of these kinds of data validation is using SQL. SQL can easily validate data records in a database based on a specific condition. Let’s say you want to exclude the customers whose email address includes “gmail” to exclude them from your marketing campaigns. You need to use the following SQL statement: Select * from Customer where Email like ‘%gmail%’ and subsequently, write the results into a table that you can use to upload to your marketing platforms. 

Although this type of data validation is pretty popular among database administrators and programmers, depending on the complexity of your data and the size of the data set you are validating, this method of data validation can be quite time-consuming. It also can send unnecessary queries into your database, and as a result, slow it down. 

2. Using Excel

Among the most basic ways of data validation is using Microsoft Excel or Google Sheets. Of course, this method of data validation requires you to export your data into Excel or Sheets before you can start, which granted is neither scalable nor easily maintainable. In both Excel and Sheets, the data validation process is straightforward. Excel and Sheets both have a menu item listed as Data > Data Validation. By selecting the Data Validation menu, a user can choose the specific data type or constraint validation required for a given file or data range.

However, this type of data validation is limited to the type of data or the values. For example, whether data is in a drop-down list, or has a particular type such as date, time, decimal, and text values. And let’s not forget, you can only fit a limited number of records to either of these tools before your data validation process becomes unwieldy.

3. Using ETL, ELT, or data integration tools

ETL (Extract, Transform and Load,) ELT (Extract, Load, Transform,) and data integration tools typically integrate data validation policies as part of their workflow. These validation rules are executed as data is extracted from one source and loaded into another, or, in the case of ELT, after extract and load are completed and transformation happens in the database. Popular tools include dbt or FiveTran.

While these tools are popular for integrating data from multiple sources together, often for integrating into a data warehouse, they can’t handle incoming real-time data streams, they have higher data infrastructure costs and more engineering time effort to build and maintain data validation rules that are subject to change as data architectures, systems, schemas, and processes change. 

4. Leveraging Data Observability tools

Data observability tools like Telmai enable you to customize data validation workflows precisely for your needs. You can use the ML-based learning from these solutions to detect data consistency issues off the bat, and you can create workflows that are specific to your needs. Additionally, you can automatically run any data validation workflow on a schedule (or on-demand) and get alerted if your data falls outside historical ranges or your predefined conditions.

Stay in touch

Stay updated with our progress. Sign up now

By subscribing you agree to with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Stay in touch

Stay updated with our progress. Sign up now

By subscribing you agree to with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Data Observability
Data Quality

Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules

Partial
Freshness volume, and schema monitoring is visual, but Field Health Monitors require SQL coding.

Detects, investigates the root cause of issues, and helps remediate

Detects and helps remediate.

Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows

Examples: data validation, data cleansing, data standardization

Low-code / no-code to accelerate time to value and lower cost

Ongoing maintenance, tweaking, and testing data quality rules adds to its costs

Enables both business and technical teams to participate in data quality and monitoring initiatives

Designed mainly for technical teams who can implement ETL workflows or open source data validation software

Stay in touch

Stay updated with our progress. Sign up now

By subscribing you agree to with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

Start your data observibility today

Connect your data and start generating a baseline in less than 10 minutes. 

No sales call needed

Stay in touch

Stay updated with our progress. Sign up now

By subscribing you agree to with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.

More like this

How to Test Data Pipelines: Approaches, Tools, and Tips

How to Build a Data Monitoring System

Data Observability vs. Data Quality

Telmai is a platform for the Data Teams to proactively detect and investigate anomalies in real-time.
© 2022 Telm.ai All right reserved.