
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project: without it, poor data quality can undermine critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
What are the different kinds of data profiling?
Data profiling falls into three major categories: structure discovery, content discovery, and relationship discovery. All three deepen your understanding of the data, but each provides a different type of insight:
Structure discovery checks that data is consistent, correctly formatted, and well structured. For example, if you have a ‘Date’ field, structure discovery shows you the various date patterns in use (e.g., YYYY-MM-DD or YYYY/DD/MM) so you can standardize your data into one format.
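One way to surface the date patterns described above is to reduce each value to a "shape" and count the shapes. The sketch below is only an illustration: the function names `value_pattern` and `profile_patterns` and the sample dates are assumptions, not part of any particular profiling tool.

```python
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Reduce a value to its shape: digits become 9, letters become A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_patterns(values):
    """Count how often each shape occurs in a column of string values."""
    return Counter(value_pattern(v) for v in values)


# Hypothetical sample of a 'Date' column with mixed formats.
dates = ["2023-01-15", "2023/15/01", "2023-02-03", "15 Jan 2023"]
pattern_counts = profile_patterns(dates)

for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

A shape count like this only shows which separators and lengths are in use. It cannot tell YYYY/DD/MM from YYYY/MM/DD, so distinguishing those still requires looking at the actual values.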
Structure discovery also examines basic statistics in the data, such as minimum and maximum values, means, medians, and standard deviations.
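These summary statistics can be sketched with Python's standard `statistics` module. The helper name `numeric_summary` and the sample values are assumptions made for illustration. Note that `statistics.stdev` computes the sample standard deviation, while `statistics.pstdev` would give the population version.

```python
import statistics


def numeric_summary(values):
    """Return basic profiling statistics for a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical numeric column.
order_totals = [10, 20, 30, 40, 100]
summary = numeric_summary(order_totals)

for name, value in summary.items():
    print(name, value)
```

A mean well above the median, as in this sample, is a quick hint that a few large values are skewing the column.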
Content discovery looks more closely into the individual attributes and data values to check for data quality issues. This can help you find null values, empty fields, duplicates, incomplete values, outliers, and anomalies.
For example, if you are profiling address information, content discovery helps you see whether your ‘State’ field contains two-letter abbreviations, fully spelled-out state names, both, or potentially some typos.
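The ‘State’ example can be sketched as a small classifier that buckets each value and counts the buckets. Everything here is illustrative: `STATE_NAMES` is a deliberately tiny lookup (a real check would cover every state), and `classify_state` and the sample values are assumed names and data.

```python
from collections import Counter

# Assumption: a three-entry lookup for illustration only.
STATE_NAMES = {"CA": "California", "NY": "New York", "TX": "Texas"}


def classify_state(value):
    """Bucket a 'State' value as missing, abbreviation, full name, or unrecognized."""
    if value is None or not value.strip():
        return "missing"
    cleaned = value.strip()
    if cleaned.upper() in STATE_NAMES:
        return "abbreviation"
    if cleaned.title() in STATE_NAMES.values():
        return "full name"
    return "unrecognized"  # possible typo or out-of-scope value


# Hypothetical column containing nulls, empties, mixed formats, and a typo.
states = ["CA", "California", "ny", None, "", "Texsa", "TX", "CA"]
state_profile = Counter(classify_state(s) for s in states)

# Non-empty values that occur more than once.
duplicates = {v: n for v, n in Counter(s for s in states if s).items() if n > 1}

print(dict(state_profile))
print(duplicates)
```

Repeated values are expected in a ‘State’ column, so the duplicate count matters more for fields that should be unique, such as an email address or an ID.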
Content discovery can also be a way to validate databases with predefined rules. This process helps find ways to improve data quality by identifying instances where the data does not conform to predefined rules. For example, a transaction amount should never be less than $0.
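Rule-based validation can be sketched as a set of named checks applied to every row. The non-negative amount rule comes from the example above. The `currency_present` rule, the function name `find_rule_violations`, and the sample transactions are hypothetical additions for illustration.

```python
def find_rule_violations(rows, rules):
    """Return (row index, rule name) pairs for every failed check."""
    violations = []
    for index, row in enumerate(rows):
        for rule_name, check in rules.items():
            if not check(row):
                violations.append((index, rule_name))
    return violations


rules = {
    # From the text: a transaction amount should never be less than $0.
    "amount_non_negative": lambda row: row["amount"] >= 0,
    # Hypothetical second rule, added only to show several rules together.
    "currency_present": lambda row: bool(row.get("currency")),
}

# Hypothetical transaction records.
transactions = [
    {"id": 1, "amount": 25.0, "currency": "USD"},
    {"id": 2, "amount": -5.0, "currency": "USD"},
    {"id": 3, "amount": 12.5, "currency": ""},
]

violations = find_rule_violations(transactions, rules)
for index, rule_name in violations:
    print(f"row {index} (id={transactions[index]['id']}) fails {rule_name}")
```

Keeping the rules in a dictionary separates the rule definitions from the scan itself, so new rules can be added without changing the loop.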
Relationship discovery identifies how different datasets are related to each other, for example through key relationships between database tables or lookup cells in a spreadsheet. Understanding these relationships is most critical when designing a new database schema, a data warehouse, or an ETL flow that joins tables and data sets on those keys.
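A minimal sketch of checking a candidate key relationship between two tables follows. It tests whether the key is unique on the parent side and whether every child value has a match. The function name `profile_key_relationship`, the table names, and the column names are all assumptions for illustration.

```python
def profile_key_relationship(parent_rows, child_rows, parent_key, child_key):
    """Check whether child_key values resolve to a unique parent_key."""
    parent_values = [row[parent_key] for row in parent_rows]
    parent_set = set(parent_values)
    child_values = [row[child_key] for row in child_rows]
    matched = sum(v in parent_set for v in child_values)
    return {
        # A usable key should not repeat on the parent side.
        "parent_key_is_unique": len(parent_values) == len(parent_set),
        # Child values with no parent would be dropped by an inner join.
        "orphan_values": sorted({v for v in child_values if v not in parent_set}),
        "match_rate": matched / len(child_values) if child_values else 0.0,
    }


# Hypothetical tables: orders reference customers by customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 4},
    {"order_id": 13, "customer_id": 2},
]

relationship = profile_key_relationship(customers, orders, "customer_id", "customer_id")
print(relationship)
```

Running a check like this before building an ETL join shows how many rows the join would drop or duplicate.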