Imagine you're at a gas station filling up your car and you notice that the numbers on the pump don't match what you're paying at the payment terminal. This is an example of poor data validation: the data being collected (the amount of gas dispensed) doesn't match the data being recorded (the cost of the gas).
Beyond the gas pump, data validation is a critical step in machine learning, system migrations, and really any data product. Since information is constantly being updated, deleted, or moved around, having valid data is a must.
While there are many different ways to approach data validation, the four most common are writing your own scripts, using Excel’s built-in data validation tool, using an ETL tool, or leveraging a data observability tool.
In this blog post we'll dive into the pros and cons of each method and how the data observability platform Telmai can help automate and streamline the data validation process.
1. Coding data validation
Depending on your ability to code, validating data by writing a script may be an option for you. You can write Spark jobs, Python scripts, dbt assertions, or Great Expectations suites to compare data values and metadata against your defined rules and schemas, verifying that all the necessary information meets your required data quality parameters.
The most popular language for this kind of data validation is SQL. SQL can easily validate records in a database based on a specific condition. Let's say you want to exclude customers whose email address includes "gmail" from your marketing campaigns. You could use a SQL statement such as SELECT * FROM Customer WHERE Email NOT LIKE '%gmail%' and then write the results into a table that you can upload to your marketing platforms.
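As a minimal sketch of script-based validation (the table and column names here are illustrative, not from a real system), a Python script can apply this kind of rule against a small in-memory SQLite table, keeping only the customers eligible for upload:

```python
import sqlite3

# Hypothetical Customer table for illustration; in practice this would
# be your production database or an extract from it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Name TEXT, Email TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [("Ada", "ada@gmail.com"), ("Grace", "grace@example.org")],
)

# Keep only customers whose email does NOT include "gmail" so the
# remaining rows can be uploaded to the marketing platform.
rows = conn.execute(
    "SELECT Name, Email FROM Customer WHERE Email NOT LIKE '%gmail%'"
).fetchall()
print(rows)  # [('Grace', 'grace@example.org')]
```

The same filter could just as easily run inside the database; running it from a script simply makes it repeatable and schedulable.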
Although this type of data validation is popular among database administrators and programmers, depending on the complexity of your data and the size of the data set you are validating, it can be quite time-consuming. It can also send unnecessary queries to your database and, as a result, slow it down.
2. Using Excel
One of the most basic ways to validate data is using Microsoft Excel or Google Sheets. Of course, this method requires you to export your data into Excel or Sheets before you can start, which, granted, is neither scalable nor easily maintainable. In both tools the process is straightforward: each has a menu item under Data > Data Validation. From that menu, a user can choose the specific data type or constraint validation required for a given file or data range.
However, this type of data validation is limited to checking data types and value constraints: for example, whether a cell value comes from a drop-down list, or has a particular type such as date, time, decimal, or text. And let's not forget, you can only fit a limited number of records into either of these tools before your data validation process becomes unwieldy.
3. Using ETL, ELT, or data integration tools
ETL (Extract, Transform, Load), ELT (Extract, Load, Transform), and data integration tools typically integrate data validation policies as part of their workflow. These validation rules are executed as data is extracted from one source and loaded into another or, in the case of ELT, after extract and load are completed and transformation happens in the database. Popular tools include dbt and Fivetran.
While these tools are popular for integrating data from multiple sources, often into a data warehouse, they can't handle incoming real-time data streams. They also carry higher data infrastructure costs and demand more engineering time to build and maintain validation rules, which are subject to change as data architectures, systems, schemas, and processes change.
4. Leveraging Data Observability tools
Data observability tools like Telmai enable you to customize data validation workflows precisely for your needs. You can automatically run any data validation workflow on a schedule (or on demand) and get alerted if your data falls outside historical ranges or your predefined conditions. Plus, Telmai's ML-based learning enables you to detect data quality issues you couldn't even have predicted would be a problem!
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality can impact critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery matters for your use case, make sure you collect and profile your data in its entirety; do not rely on samples, as sampling will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
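As a minimal sketch of structure and content discovery (the column names and values are invented for illustration), a few lines of pandas surface types, null counts, and outliers:

```python
import pandas as pd

# Hypothetical customer extract for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "zip_code": ["94061", "94 061", None, "10001"],
    "order_total": [25.0, 30.0, 27.5, 9999.0],
})

# Structure discovery: column names, dtypes, and null counts.
print(df.dtypes)
print(df.isna().sum())

# Content discovery: summary statistics make outliers such as the
# suspicious 9,999.00 order total easy to spot.
print(df["order_total"].describe())
```

On a real data set you would plot these distributions rather than read them from tables, which is exactly why visualizations help at this stage.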
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
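A documented finding like the ZIP code example above can be turned directly into a reusable rule. Here is one possible sketch in Python (the five-digit constraint is an assumption; ZIP+4 codes would need a different pattern):

```python
import re

# A rule derived from profiling: a US ZIP code is exactly five digits
# with no embedded whitespace. (ZIP+4 codes are out of scope here.)
ZIP_RE = re.compile(r"^\d{5}$")

def is_valid_zip(value: str) -> bool:
    return bool(ZIP_RE.match(value))

print(is_valid_zip("94061"))   # True
print(is_valid_zip("94 061"))  # False
```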
4. Data Quality Monitoring
Now that you know what you have, the next step is to correct these issues. Some you may be able to fix yourself; others you will need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. Left unchecked, data quality defects will continue to occur as a result of both system changes and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
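One way to sketch ongoing monitoring (the metric names, baseline values, and tolerance here are assumptions for illustration, not a Telmai API) is a scheduled job that compares current metrics against a stored baseline and alerts on drift:

```python
# Hypothetical baseline metrics captured during the initial profiling run.
baseline = {"zip_null_rate": 0.01, "row_count": 100000}

def check_drift(current, baseline, tolerance=0.25):
    """Return alert messages for metrics drifting beyond the tolerance."""
    alerts = []
    for metric, expected in baseline.items():
        observed = current[metric]
        if expected and abs(observed - expected) / expected > tolerance:
            alerts.append(f"{metric}: expected ~{expected}, observed {observed}")
    return alerts

# A scheduled job would compute current metrics and compare them:
# here the null rate has jumped, while the row count is within tolerance.
print(check_drift({"zip_null_rate": 0.20, "row_count": 98000}, baseline))
```

An observability platform automates exactly this loop: learning the baseline, recomputing the metrics, and routing the alerts.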
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed