9 Data Quality Checks You Can Do with Pandas
Duplicates, null values, data types, and more. Learn all the critical data quality checks you can do with Pandas and discover Telmai – a revolutionary tool that can automate these checks for you.
Did you know that Pandas offers more than simple data manipulation and transformation? You can leverage its robust capabilities to maintain the integrity and health of your datasets by performing various quality checks.
In this guide, we will walk you through nine essential data quality checks that you can conduct with Pandas. You’ll also discover Telmai, an advanced platform that allows you to automate these essential checks, enhancing your data quality control.
1. Duplicate Records Check
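Identifying duplicate records is essential to prevent redundant information. This check flags rows whose values in the specified columns repeat an earlier row. By default, duplicated() leaves the first occurrence unflagged; pass keep=False to flag every copy.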
duplicates = df.duplicated(subset=['column_1', 'column_2'])
print("Duplicate Rows:")
print(df[duplicates])
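To see how the keep argument changes the result, here is a minimal sketch on made-up data. The column names and values are placeholders, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sample data; column names are placeholders.
df = pd.DataFrame({
    "column_1": ["a", "a", "b", "a"],
    "column_2": [1, 1, 2, 1],
    "value": [10, 20, 30, 40],
})

key_columns = ["column_1", "column_2"]

# Default keep="first": the first occurrence is NOT flagged, only later copies.
later_copies = df[df.duplicated(subset=key_columns)]

# keep=False: every row that belongs to a duplicate group is flagged.
all_copies = df[df.duplicated(subset=key_columns, keep=False)]

print("Later copies only:")
print(later_copies)
print("All rows in duplicate groups:")
print(all_copies)
```

The keep=False view is usually the more useful one when you need to inspect the duplicates side by side before deciding which row to retain.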
2. NULL Value Check
NULL values can indicate missing or undefined data. This check identifies how many NULL or missing values exist in a specific column.
missing_values = df['column_name'].isnull().sum()
print(f"Number of missing values in column_name: {missing_values}")
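The same idea extends to every column at once. The sketch below, on invented sample data, counts missing values per column, expresses them as a share of rows, and separately counts blank strings, which isnull() does not treat as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; note the empty string in "name".
df = pd.DataFrame({
    "name": ["Ann", None, "", "Dee"],
    "age": [31, np.nan, 45, np.nan],
})

# Missing values per column, as a count and as a fraction of rows.
missing_counts = df.isnull().sum()
missing_share = df.isnull().mean()

# Empty or whitespace-only strings are not null, so count them separately.
blank_names = (df["name"].str.strip() == "").sum()

print(missing_counts)
print(missing_share)
print(f"Blank strings in name: {blank_names}")
```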
3. Data Type Check
Data types define how each column's values are stored and interpreted. Verifying that every column has the expected type keeps the data consistent and prevents errors in analysis.
data_types = df.dtypes
print("Data Types:")
print(data_types)
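Printing the types only shows what they are; it does not say whether they are right. One possible way to turn this into an actual check is to compare each column against an expected kind. In the sketch below, the expected mapping and the sample data are assumptions made for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical sample data: "price" holds numbers stored as text.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "price": ["9.99", "12.50", "3.00"],
})

# Assumed schema: column name -> predicate that the column's dtype must satisfy.
expected = {
    "id": ptypes.is_integer_dtype,
    "price": ptypes.is_float_dtype,
}

# Columns whose actual dtype fails the expected predicate.
mismatched = [col for col, check in expected.items() if not check(df[col])]
print("Columns with unexpected types:", mismatched)

# One way to repair a text column that should be numeric;
# values that cannot be parsed become NaN instead of raising.
fixed_price = pd.to_numeric(df["price"], errors="coerce")
print(fixed_price.dtype)
```

Using predicates rather than comparing dtype names avoids false alarms from platform differences such as int32 versus int64.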
4. Range Check
A range check ensures that values fall within a specific interval. It helps in identifying outlier values that might be errors.
out_of_range = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
print("Out of Range Values:")
print(out_of_range)
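One detail worth knowing: comparisons against NaN evaluate to False, so the filter above silently skips missing values. A small sketch with invented bounds and data, using between() and reporting missing values separately:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and bounds.
df = pd.DataFrame({"age": [25, -3, 140, np.nan, 67]})
lower_bound, upper_bound = 0, 120

# between() includes both bounds by default and returns False for NaN.
in_range = df["age"].between(lower_bound, upper_bound)

# Present but outside the interval.
out_of_range = df[~in_range & df["age"].notna()]

# Missing values, reported on their own rather than mixed in.
missing = df[df["age"].isna()]

print("Out of Range Values:")
print(out_of_range)
print("Missing Values:")
print(missing)
```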
5. Domain Check
A domain check verifies that values adhere to a predefined set of valid values, ensuring consistency in categorization.
invalid_domain = df[~df['column_name'].isin(valid_domain)]
print("Invalid Domain Values:")
print(invalid_domain)
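Domain checks are sensitive to casing and stray whitespace, and isin() treats missing values as not in the set. The sketch below, with a made-up status column and valid set, shows how normalizing first can separate genuinely invalid values from formatting noise.

```python
import pandas as pd

# Hypothetical sample data and valid set.
df = pd.DataFrame({"status": ["active", "Active", "inactive", None, "pending "]})
valid_domain = ["active", "inactive", "pending"]

# Raw comparison: case and whitespace differences count as invalid.
invalid_raw = df[~df["status"].isin(valid_domain)]

# Normalize before comparing; missing values stay missing and are still flagged.
normalized = df["status"].str.strip().str.lower()
invalid_normalized = df[~normalized.isin(valid_domain)]

print("Invalid before normalizing:")
print(invalid_raw)
print("Invalid after normalizing:")
print(invalid_normalized)
```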
6. Uniqueness Check
Uniqueness checks ensure that values in a column are unique, particularly in columns that should contain exclusive data, like IDs.
non_unique = df['column_name'].duplicated()
print("Non-Unique Values:")
print(df[non_unique])
7. Format Check
Format checks validate the structure or pattern of values. They are especially useful for fields such as emails and phone numbers.
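# fullmatch requires the whole value to fit the pattern; na=False treats missing emails as invalid
invalid_format = df[~df['email'].str.fullmatch(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', na=False)]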
print("Invalid Formats:")
print(invalid_format)
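When validating formats, it matters whether the pattern must cover the whole value. str.match() only anchors at the start of the string, while str.fullmatch() requires the entire string to fit. The sketch below contrasts the two on invented addresses; the pattern is a simple illustration, not a complete email validator.

```python
import pandas as pd

# Hypothetical sample data, including trailing junk and a missing value.
df = pd.DataFrame({
    "email": ["ann@example.com", "bob@example", "cy@example.com extra", None],
})

pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# match(): anchored at the start only, so trailing junk slips through.
prefix_ok = df["email"].str.match(pattern, na=False)

# fullmatch(): the entire value must fit the pattern.
whole_ok = df["email"].str.fullmatch(pattern, na=False)

invalid_format = df[~whole_ok]

print(prefix_ok.tolist())
print(whole_ok.tolist())
print("Invalid Formats:")
print(invalid_format)
```

Passing na=False makes missing values evaluate to False, so they show up as invalid instead of causing an error when the mask is negated.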
8. Length Check
Length checks ensure that the length of string values meets specific requirements. This is useful for constraints like passwords or usernames.
invalid_length = df[df['column_name'].str.len() != required_length]
print("Invalid Lengths:")
print(invalid_length)
9. Completeness Check
Completeness checks confirm that all required fields are present and non-null, ensuring that essential information is not missing.
incomplete_rows = df[df[['column_1', 'column_2']].isnull().any(axis=1)]
print("Incomplete Rows:")
print(incomplete_rows)
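The snippet above assumes every required column exists; if one is absent, the column selection raises a KeyError. A minimal sketch that checks for missing columns first, using placeholder column names and made-up data:

```python
import pandas as pd

# Hypothetical sample data: "column_2" is required but absent.
df = pd.DataFrame({
    "column_1": ["x", None, "z"],
    "column_3": [1, 2, 3],
})

required = ["column_1", "column_2"]

# Step 1: required columns that do not exist at all.
missing_columns = [col for col in required if col not in df.columns]

# Step 2: among the required columns that do exist, rows with any null.
present = [col for col in required if col in df.columns]
incomplete_rows = df[df[present].isnull().any(axis=1)]

print("Missing Columns:", missing_columns)
print("Incomplete Rows:")
print(incomplete_rows)
```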
Why Pandas-based Data Quality Checks Aren’t Enough
While Pandas offers flexibility and robust functions to perform these quality checks, it's not without its limitations. Pandas holds data in memory, so handling large datasets can be memory-intensive. Code-based checks require constant maintenance as schemas and rules change, and the lack of integration with other data sources may pose challenges. Additionally, Pandas has no built-in scheduler, so running these checks automatically depends on external tooling.
Telmai as an Alternative Approach
Telmai presents a streamlined alternative to traditional Pandas-based data quality checks. With its high-performance, low-code/no-code interface, Telmai not only automates the checks but also offers easy integrations with various data sources. Without slowing down your databases, it ensures consistent, timely, and scalable data quality control. Explore Telmai’s platform to elevate your data quality management and focus on drawing insights from your data.