After personnel, data is the most valuable asset for any business.
Industries depend on data to make significant decisions, leaving no room for untrustworthy data.
In my career as a Data Scientist, I have experienced firsthand that data is only valuable if it is reliable.
The first step towards building trust in the quality of data is understanding it, i.e., monitoring, the very first pillar in the Data Observability architecture at Telm.ai. At Telm.ai, we believe that profiling datasets is not enough; true data quality monitoring can be achieved when you observe trends of anomalies at both the syntactic and semantic levels.
Too many things can go wrong with enterprise data, especially when the volume and velocity of data are high. We will group data quality into three categories:
- Completeness
- Correctness
- Freshness or Timeliness
Completeness is all about detecting missing data. It can be at the source (table) level, i.e., you received only a fraction of the expected data, or at the attribute level, when some attributes are missing values. However, missing data is not the same as no data: in some cases, not having data is a valid situation. Being able to distinguish between these scenarios will save data owners a lot of time, effort, and money.
An example of such a scenario is two days' worth of missing flight data for a certain geographical area. Instead of a pipeline failure, this could be due to a valid reason such as airport closures caused by severe weather. It is important to detect and isolate such scenarios to avoid expensive and unnecessary troubleshooting efforts.
Another example is when only partial records make it in, either because of entry errors by sales representatives or because of failures in the data pipeline even though all the data was entered. Some amount of broken or partial records will almost certainly exist in any large-scale dataset, but it is critical to be alerted when the trend changes so that mitigation actions can be taken.
As you can see, there are many different use cases that apply just to the realm of Completeness.
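The two levels of completeness described above can be illustrated with a minimal Python sketch. The function names, the sample flight records, and the expected row count are hypothetical and chosen only for illustration; a real monitor would also need context (such as known airport closures) to tell a valid absence of data from a failure.

```python
def attribute_null_rates(records: list[dict], attributes: list[str]) -> dict[str, float]:
    """Attribute-level completeness: fraction of records missing each attribute."""
    total = len(records)
    rates = {}
    for attr in attributes:
        # Treat both None and the empty string as "missing".
        missing = sum(1 for r in records if r.get(attr) in (None, ""))
        rates[attr] = missing / total if total else 0.0
    return rates


def source_completeness(received_rows: int, expected_rows: int) -> float:
    """Source-level completeness: share of the expected volume that arrived."""
    return received_rows / expected_rows if expected_rows else 1.0


# Hypothetical sample data.
records = [
    {"flight": "UA100", "origin": "SFO", "dest": "JFK"},
    {"flight": "UA101", "origin": "SFO", "dest": None},
    {"flight": "UA102", "origin": "", "dest": "ORD"},
    {"flight": "UA103", "origin": "SFO", "dest": "SEA"},
]

rates = attribute_null_rates(records, ["flight", "origin", "dest"])
print(rates)  # {'flight': 0.0, 'origin': 0.25, 'dest': 0.25}
print(source_completeness(received_rows=4, expected_rows=10))  # 0.4
```

Tracking these numbers per load, rather than checking them once, is what makes a change in trend visible.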
Correctness: Meaningful data largely depends on how correct the data is, in terms of its accuracy, consistency, uniqueness and validity.
A wide variety of approaches can be used to detect problems, depending on the complexity of the domain. For example, if the goal is to detect non-unique SSNs in the dataset, then cardinality-based statistics can be applied to detect outliers. However, if a lot of additional evidence must be evaluated to determine duplicates, then it might require very sophisticated matching systems, often found at the heart of good Master Data Management systems like Reltio.
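As a rough illustration of the simple, cardinality-based end of that spectrum, the sketch below counts how often each value occurs and reports a uniqueness ratio. The helper names and the sample SSNs are made up for this example; it does not attempt the evidence-based matching that an MDM system performs.

```python
from collections import Counter


def duplicate_values(values: list[str]) -> dict[str, int]:
    """Return every non-empty value that appears more than once, with its count."""
    counts = Counter(v for v in values if v)
    return {v: n for v, n in counts.items() if n > 1}


def uniqueness_ratio(values: list[str]) -> float:
    """Distinct non-empty values divided by total non-empty values (1.0 = all unique)."""
    non_empty = [v for v in values if v]
    return len(set(non_empty)) / len(non_empty) if non_empty else 1.0


# Hypothetical sample data.
ssns = ["123-45-6789", "987-65-4321", "123-45-6789", "555-12-3456"]

print(duplicate_values(ssns))  # {'123-45-6789': 2}
print(uniqueness_ratio(ssns))  # 0.75
```

For an attribute that is expected to be unique, any ratio below 1.0 is worth a look.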
Below are a few examples of incorrect data.
- Invalid data: The validity of data is often defined by business teams such as marketing ops, analytics, or security. For example, as part of GDPR compliance, security teams request anonymization of PII by masking, so an SSN is stored as XXX-XX-XXXX. However, when onboarding data from a new source, if the masking rule is not properly implemented, the result is data that is not only incorrect or unexpected but also out of compliance. Only automatic monitoring at the semantic level can proactively alert on such anomalies.
- Dummy data: Sometimes when data is entered into the system, dummy or template values are used. This can lead to all kinds of problems for analytics. Imagine a thousand records with the same phone number: (800)-111-1111. Phone number is often a key field used in record matching along with other evidence, so an error like that can lead to many incorrectly merged records or even inefficient sales efforts, which are very expensive to fix after the fact.
- Schema mismatch: Due to either entry mistakes or pipeline errors, data may end up in an attribute it was not meant for, such as a first name in place of a last name, an SSN in place of a phone number, or a state in place of a country.
- Non-standard formats: Sometimes data does not follow the expected format. In this case, even if it is correct, it may cause significant problems for downstream analytics. For example, I have observed a full state name used instead of the expected two-letter state abbreviation, as well as unexpected formats for phone numbers and SSNs.
There are so many different and unforeseen ways for data to become incorrect and untrustworthy that catching all the anomalies with a static, rule-based system is inherently limiting. It requires constant user intervention after the fact, which is already too late for critical industries that rely on real-time data.
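Several of the problems in this list can be surfaced without hand-written rules by profiling the shape and frequency of values. The following is a minimal sketch of that idea, not a description of how any particular product works; `shape`, `shape_profile`, `dominant_values`, the 0.5 threshold, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def shape(value: str) -> str:
    """Reduce a value to its pattern: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation and spaces as-is
    return "".join(out)


def shape_profile(values: list[str]) -> Counter:
    """Count how many values share each pattern; rare patterns are suspects."""
    return Counter(shape(v) for v in values)


def dominant_values(values: list[str], threshold: float = 0.5) -> dict[str, float]:
    """Values that make up at least `threshold` of a column (possible dummy data)."""
    counts = Counter(values)
    total = len(values)
    return {v: n / total for v, n in counts.items() if n / total >= threshold}


# Hypothetical sample data.
states = ["CA", "NY", "California", "TX"]
phones = ["(800)-111-1111"] * 3 + ["(415)-555-0134"]

print(shape_profile(states))    # Counter({'AA': 3, 'AAAAAAAAAA': 1})
print(dominant_values(phones))  # {'(800)-111-1111': 0.75}
print(shape("XXX-XX-XXXX"), shape("123-45-6789"))  # AAA-AA-AAAA 999-99-9999
```

A masked SSN column whose profile suddenly contains the `999-99-9999` shape, or a state column with a long all-letter shape, stands out immediately, even though no rule for that specific case was written in advance.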
Timeliness of the data is as important as its correctness. A report using old data is just as bad as one using incorrect data. Below are some examples I had to resolve that affected data quality, causing additional manual overhead and delays.
- Event data gets updated: cancellations, new locations, updated times, added celebrity appearances, or changes in ticket availability
- Airlines want to see bookings made through third-party booking agencies as they happen, so they can make timely flight-related adjustments
- Ride-sharing companies (Lyft, Uber) want the latest expected attendance at a major venue to adjust pricing and direct drivers accordingly
- Timeliness is particularly critical in the financial sector, among others, for checking borrowers' credit, since a loan has to be approved or rejected on the spot
Monitoring, analyzing, and reporting on data in real time goes a long way toward reducing operational risk due to low-quality data, and more data-driven companies are realizing this.
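A basic freshness check compares the timestamp of the most recent record with the current time and flags the source when the lag exceeds what the use case tolerates. The sketch below assumes timezone-aware timestamps; the function names, the sample times, and the 15-minute tolerance are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(latest_event: datetime, now: datetime) -> timedelta:
    """How far behind the newest record is."""
    return now - latest_event


def is_stale(latest_event: datetime, now: datetime, max_lag: timedelta) -> bool:
    """True when the newest record is older than the tolerated lag."""
    return freshness_lag(latest_event, now) > max_lag


# Hypothetical sample data: the last booking arrived 20 minutes ago.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
latest_booking = datetime(2024, 1, 1, 11, 40, tzinfo=timezone.utc)

print(freshness_lag(latest_booking, now))                          # 0:20:00
print(is_stale(latest_booking, now, max_lag=timedelta(minutes=15)))  # True
print(is_stale(latest_booking, now, max_lag=timedelta(minutes=30)))  # False
```

The tolerated lag is a business decision: a loan approval flow may need seconds, while a weekly report can live with hours.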
We at Telm.ai are striving hard to build your trust in data that will not fail you.
About the Author:
Lina Khatib is a Lead Data Scientist and founding team member at Telm.ai.
With a PhD in Computer Science, Lina brings with her over 20 years of experience in the fields of AI and Data Science.
Lina has worked for 10 years as a Research Scientist in the Intelligent Autonomy & Robotics area at NASA and as a Data Scientist at VW/Audi, Reltio and PredictHQ.
She is excited to be a part of Telm.ai and believes this is the only AI based attribute non-discriminating solution for tackling major issues in Data Observability.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project, and without it, poor data quality could hurt critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process:
1. Data Collection
2. Discovery & Analysis
3. Documenting the Findings
4. Data Quality Monitoring
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect to and analyze all your data without requiring any prep work.
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety rather than relying on samples, which can skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
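To make content and structure discovery concrete, here is a minimal sketch of a per-column profile computed over every value rather than a sample. The `profile_column` helper and the sample column are hypothetical; dedicated profiling tools compute far richer statistics.

```python
from collections import Counter


def profile_column(values: list) -> dict:
    """Summarize one column: volume, nulls, cardinality, observed types, top values."""
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(counts),
        # More than one type in a column hints at a structure problem.
        "types": sorted({type(v).__name__ for v in non_null}),
        "top": counts.most_common(3),
    }


# Hypothetical sample data: a "state" column with a null and a stray ZIP code.
state_column = ["CA", "NY", "CA", None, 94061]

profile = profile_column(state_column)
print(profile["count"], profile["null_count"], profile["distinct"])  # 5 1 3
print(profile["types"])   # ['int', 'str']
print(profile["top"][0])  # ('CA', 2)
```

The `top` and `types` entries are the kind of output that is easy to chart, which is where the visualizations mentioned above help.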
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
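A documented finding like the ZIP code example can be turned into a reusable rule. The sketch below assumes the rule is "five digits, optionally followed by a hyphen and four digits" (the ZIP+4 form); the pattern, function names, and sample values are illustrative assumptions.

```python
import re

# Assumed rule: 5 digits, optionally followed by "-" and 4 digits (ZIP+4).
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def is_valid_zip(value: str) -> bool:
    """True only when the whole value matches the ZIP pattern."""
    return ZIP_PATTERN.fullmatch(value) is not None


def invalid_zips(values: list[str]) -> list[str]:
    """Return the values that break the rule, in their original order."""
    return [v for v in values if not is_valid_zip(v)]


# Hypothetical sample data.
zips = ["94061", "94 061", "94061-1234", "9406"]

print(invalid_zips(zips))  # ['94 061', '9406']
```

Keeping such rules next to the profiling report means the next profiling run can check them automatically.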
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done; in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur as a result of both system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
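One simple way to monitor a quality metric on an ongoing basis is to compare each new measurement with a baseline built from its own history. The sketch below uses a z-score against the historical mean and standard deviation; this is one common statistical technique chosen for illustration, and the function name, the threshold of 3.0, and the sample null rates are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` when it deviates from the historical baseline by more than z_threshold."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # a perfectly flat history: any change is a deviation
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical sample data: daily null rate of one attribute.
null_rates = [0.020, 0.022, 0.019, 0.021, 0.020]

print(is_anomalous(null_rates, 0.021))  # False
print(is_anomalous(null_rates, 0.150))  # True
```

Because the baseline is learned from the data, the same check can run on every metric without a hand-tuned rule for each one.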
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for newer, more scalable solutions.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
Data observability tools:
- Leverage ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules
- Detect issues, investigate their root cause, and help remediate
- Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows
- Low-code / no-code to accelerate time to value and lower cost
- Enable both business and technical teams to participate in data quality and monitoring initiatives
Traditional data quality tools:
- Use predefined metrics from a known set of policies to understand the health of the data
- Detect issues and help remediate
- Examples: data validation, data cleansing, data standardization
- Ongoing maintenance, tweaking, and testing of data quality rules add to their cost
- Designed mainly for technical teams who can implement ETL workflows or open source data validation software