9 Data Quality Checks You Can Do with Pandas

Duplicates, null values, data types, and more. Learn all the critical data quality checks you can do with Pandas and discover Telmai – a revolutionary tool that can automate these checks for you.

Anoop Gopalam

August 22, 2023

Did you know that Pandas offers more than simple data manipulation and transformation? You can leverage its robust capabilities to maintain the integrity and health of your datasets by performing various quality checks.

In this guide, we will walk you through nine essential data quality checks that you can conduct with Pandas. You’ll also discover Telmai, an advanced platform that allows you to automate these essential checks, enhancing your data quality control.

1. Duplicate Records Check

Identifying duplicate records is essential to prevent redundant information. This check flags rows whose values in the specified columns repeat those of an earlier row.

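# Boolean mask: True for every row that repeats an earlier row on these columns
duplicates = df.duplicated(subset=['column_1', 'column_2'])
print("Duplicate Rows:")
print(df[duplicates])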
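To make the duplicate check concrete, here is a minimal, self-contained sketch. The sample DataFrame and the "other" column are made up for illustration. It also shows the `keep` parameter of `duplicated`, which controls whether only the repeats are flagged or every member of a duplicate group.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "column_1": ["a", "b", "a", "c", "a"],
    "column_2": [1, 2, 1, 3, 1],
    "other": [10, 20, 30, 40, 50],
})

# Default keep="first": the first occurrence is not flagged, later repeats are.
repeats = df.duplicated(subset=["column_1", "column_2"])

# keep=False: every row that belongs to a duplicate group is flagged.
all_members = df.duplicated(subset=["column_1", "column_2"], keep=False)

print("Repeated rows:")
print(df[repeats])
print("All rows in duplicate groups:")
print(df[all_members])
print(f"Number of repeated rows: {repeats.sum()}")
```

Use the default when you want to know which rows you could drop, and `keep=False` when you want to review every conflicting record side by side.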

2. NULL Value Check

NULL values can indicate missing or undefined data. This check identifies how many NULL or missing values exist in a specific column.

missing_values = df['column_name'].isnull().sum()
print(f"Number of missing values in column_name: {missing_values}")
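The same idea extends to the whole DataFrame. The sketch below, built on a made-up two-column dataset, counts missing values per column and also reports them as a share of all rows, which is often easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "city" is an illustrative extra column.
df = pd.DataFrame({
    "column_name": [1.0, np.nan, 3.0, np.nan],
    "city": ["Oslo", None, "Lima", "Pune"],
})

# Count of missing values in every column at once.
missing_counts = df.isnull().sum()

# Fraction of rows that are missing, per column (mean of a boolean mask).
missing_share = df.isnull().mean()

print("Missing values per column:")
print(missing_counts)
print("Share of missing values per column:")
print(missing_share)
```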

3. Data Type Check

Data types define the nature of the data. Verifying that each column has the expected type keeps the data consistent and prevents errors in analysis.

# dtypes returns one entry per column
data_types = df.dtypes
print("Data Types:")
print(data_types)
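Printing the types is only half of the check; you usually want to compare them with what you expect. The sketch below is one way to do that, assuming a hypothetical `expected` mapping from column names to the type predicates in `pandas.api.types`. It also uses `pd.to_numeric` with `errors="coerce"` to locate the values that prevent a column from being numeric.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

# Hypothetical sample data: "amount" holds numbers stored as text.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "amount": ["1.5", "2.0", "n/a"],
})

# Assumed expectations for this example: one predicate per column.
expected = {"id": is_integer_dtype, "amount": is_numeric_dtype}

# Columns whose actual dtype fails the expected predicate.
mismatched = [col for col, check in expected.items() if not check(df[col])]
print("Columns with unexpected types:", mismatched)

# Values that cannot be parsed as numbers become NaN under errors="coerce".
converted = pd.to_numeric(df["amount"], errors="coerce")
unparseable = df.loc[converted.isnull() & df["amount"].notnull(), "amount"]
print("Values that are not numeric:")
print(unparseable)
```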

4. Range Check

A range check ensures that values fall within a specific interval. It helps in identifying outlier values that might be errors.

# lower_bound and upper_bound are the limits you define for this column
out_of_range = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
print("Out of Range Values:")
print(out_of_range)
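An equivalent way to write the range check is with `Series.between`, which is inclusive on both ends by default. The data and bounds below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data and bounds.
df = pd.DataFrame({"column_name": [5, 17, 42, 101, -3]})
lower_bound, upper_bound = 0, 100

# True where lower_bound <= value <= upper_bound.
in_range = df["column_name"].between(lower_bound, upper_bound)

# Negate the mask to keep only the offending rows.
out_of_range = df[~in_range]

print("Out of Range Values:")
print(out_of_range)
```

Note that `between` returns False for missing values, so negating the mask reports them as out of range as well. Filter them out first if you track missing values separately.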

5. Domain Check

A domain check verifies that values adhere to a predefined set of valid values, ensuring consistency in categorization.

# valid_domain is the list or set of values you accept for this column
invalid_domain = df[~df['column_name'].isin(valid_domain)]
print("Invalid Domain Values:")
print(invalid_domain)
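Here is a small runnable sketch of the domain check. The status values are invented for the example. It separates missing values from genuinely unexpected ones and summarizes the unexpected values with `value_counts`, which helps spot casing or spelling variants.

```python
import pandas as pd

# Hypothetical sample data and allowed values.
df = pd.DataFrame(
    {"column_name": ["active", "inactive", "Active", "pending", None]}
)
valid_domain = {"active", "inactive"}

is_valid = df["column_name"].isin(valid_domain)

# Exclude missing values here; they belong to the NULL value check.
invalid_domain = df[~is_valid & df["column_name"].notnull()]

# How often each unexpected value occurs.
unexpected = invalid_domain["column_name"].value_counts()

print("Invalid Domain Values:")
print(invalid_domain)
print("Counts of unexpected values:")
print(unexpected)
```

The comparison is case-sensitive, which is why "Active" is reported even though "active" is allowed.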

6. Uniqueness Check

Uniqueness checks ensure that values in a column are unique, particularly in columns that should contain exclusive data, like IDs.

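# keep=False flags every occurrence of a repeated value, not just the later ones
non_unique = df[df['column_name'].duplicated(keep=False)]
print("Non-Unique Values:")
print(non_unique)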
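For a quick pass/fail answer before digging into individual rows, `Series.is_unique` returns a single boolean. The sketch below uses made-up ID values and then lists the distinct values that occur more than once.

```python
import pandas as pd

# Hypothetical sample data: an ID column that should be unique.
df = pd.DataFrame({"column_name": [101, 102, 103, 102, 104, 101]})

# Single boolean: True only if no value repeats.
is_unique = df["column_name"].is_unique

# Distinct values that appear more than once.
repeated_mask = df["column_name"].duplicated(keep=False)
repeated_values = df.loc[repeated_mask, "column_name"].unique()

print(f"Column is unique: {is_unique}")
print("Values that appear more than once:", sorted(repeated_values.tolist()))
```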

7. Format Check

Format checks validate the structure or pattern of values. They are especially useful for fields such as emails and phone numbers.

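# fullmatch requires the whole string to match; na=False treats missing values as invalid
email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
invalid_format = df[~df['email'].str.fullmatch(email_pattern, na=False)]
print("Invalid Formats:")
print(invalid_format)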
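The choice between `str.match` and `str.fullmatch` matters for format checks: `match` only anchors the pattern at the start of the string, so trailing junk slips through. The sketch below illustrates the difference on made-up addresses. The pattern is a simplified email shape for demonstration, not a complete validator.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "email": ["ana@example.com", "bad.address", "x@y", None, "bob@mail.org extra"]
})

# Simplified email pattern, for illustration only.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# fullmatch: the entire string must fit the pattern; missing values count as invalid.
is_valid = df["email"].str.fullmatch(EMAIL_PATTERN, na=False)

# match: only the beginning of the string has to fit the pattern.
prefix_only = df["email"].str.match(EMAIL_PATTERN, na=False)

invalid_format = df[~is_valid]
print("Invalid Formats:")
print(invalid_format)
print("Accepted by match but rejected by fullmatch:")
print(df[prefix_only & ~is_valid])
```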

8. Length Check

Length checks ensure that the length of string values meets specific requirements. This is useful for constraints like passwords or usernames.

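# required_length is the exact length you expect; missing values are also flagged
invalid_length = df[df['column_name'].str.len() != required_length]
print("Invalid Lengths:")
print(invalid_length)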
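Requirements for usernames or passwords are often a minimum and maximum length rather than one exact value. The sketch below shows that variant with assumed limits of 3 and 12 characters and made-up usernames.

```python
import pandas as pd

# Hypothetical sample data and limits.
df = pd.DataFrame({"column_name": ["alice", "bo", "charlotte_the_third", None]})
min_length, max_length = 3, 12

# Length of each string; missing values stay missing.
lengths = df["column_name"].str.len()

# between is inclusive and returns False for missing lengths.
invalid_length = df[~lengths.between(min_length, max_length)]

print("Invalid Lengths:")
print(invalid_length)
```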

9. Completeness Check

Completeness checks confirm that all required fields are present and non-null, ensuring that essential information is not missing.

# A row is incomplete if any of the required columns is null
incomplete_rows = df[df[['column_1', 'column_2']].isnull().any(axis=1)]
print("Incomplete Rows:")
print(incomplete_rows)
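Completeness has two parts: the required columns must exist, and their values must be filled in. The sketch below covers both, assuming a hypothetical `required_columns` list in which "column_3" is deliberately absent from the sample data.

```python
import pandas as pd

# Hypothetical sample data; "column_3" is required but not present.
df = pd.DataFrame({
    "column_1": [1, None, 3],
    "column_2": ["x", "y", None],
})
required_columns = ["column_1", "column_2", "column_3"]

# Part 1: required columns that do not exist at all.
missing_columns = [col for col in required_columns if col not in df.columns]

# Part 2: rows with a null in any required column that does exist.
present = [col for col in required_columns if col in df.columns]
incomplete_rows = df[df[present].isnull().any(axis=1)]

# Share of filled-in values per required column.
completeness = df[present].notnull().mean()

print("Missing columns:", missing_columns)
print("Incomplete Rows:")
print(incomplete_rows)
print("Completeness per column:")
print(completeness)
```

Checking for the columns first avoids the `KeyError` that selecting a non-existent column would raise.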

Why Pandas-based Data Quality Checks Aren’t Enough

While Pandas offers flexibility and robust functions to perform these quality checks, it has limitations. Pandas loads data into memory, so handling large datasets can be memory-intensive. Code-based checks require constant maintenance, and the lack of integration with other data sources may pose challenges. Pandas also has no built-in scheduler, so running these checks on a recurring basis requires external tooling or manual intervention.

Telmai as an Alternative Approach

Telmai presents a streamlined alternative to traditional Pandas-based data quality checks. With its high-performance, low-code/no-code interface, Telmai not only automates the checks but also offers easy integrations with various data sources. Without slowing down your databases, it ensures consistent, timely, and scalable data quality control. Explore Telmai’s platform to elevate your data quality management and focus on drawing insights from your data.
