After personnel, data is the most valuable asset for any business.
Industries depend on data to make significant decisions, leaving no room for untrustworthy data.
In my career as a Data Scientist, I have experienced firsthand that data is only valuable if it is reliable.
The first step towards building trust in the quality of data is understanding it, i.e., monitoring, the very first pillar in the Data Observability architecture at Telm.ai. At Telm.ai, we believe that profiling datasets is not enough; true data quality monitoring is achieved when you observe trends of anomalies at both the syntactic and semantic levels.
Many things can go wrong with enterprise data, especially when its volume and velocity are high. We will group data quality into three categories:
Completeness
Correctness
Freshness or Timeliness
Completeness is all about detecting missing data. It can occur at the source (table) level, when you receive only a fraction of the expected data, or at the attribute level, when some attributes are missing values. However, missing data is not the same as no data: sometimes not having data is a valid situation. Being able to distinguish between these scenarios saves data owners a lot of time, effort, and money.
An example of such a scenario is two days' worth of missing flight data for a certain geographical area. Instead of a pipeline failure, this could be due to a valid reason such as airport closures caused by severe weather. It is important to detect and isolate such scenarios to avoid expensive and unnecessary troubleshooting.
Another example is when only partial records make it in, either because of entry errors by sales representatives or because of failures in the data pipeline despite all data being entered. Some amount of broken or partial records will almost certainly exist in any large-scale dataset, but it is critical to be alerted when the trend changes so that mitigation actions can be taken.
As you can see, there are many different use cases that apply just to the realm of Completeness.
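The trend-based alerting described above can be sketched in a few lines of Python. This is a minimal illustration, not Telm.ai's method: the function names, the z-score rule, the thresholds, and the sample values are all assumptions made for the example. It compares today's share of missing values for one attribute against its recent history and flags a sharp deviation.

```python
from statistics import mean, stdev

def null_rate(values):
    """Fraction of entries that are missing (None or empty string)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)

def is_null_rate_anomalous(history, current, z_threshold=3.0, min_std=0.005):
    """Flag the current null rate if it deviates strongly from past rates.

    history: null rates observed in earlier batches (hypothetical input).
    z_threshold and min_std are illustrative defaults, not tuned values.
    """
    if len(history) < 2:
        return False  # not enough history to establish a trend
    baseline = mean(history)
    spread = max(stdev(history), min_std)  # floor avoids dividing by ~0
    return abs(current - baseline) / spread > z_threshold

# Hypothetical daily null rates for a "phone" attribute over the past week.
past_rates = [0.020, 0.025, 0.018, 0.022, 0.021, 0.019, 0.023]

# Hypothetical batch received today: 4 of 10 values are missing.
todays_batch = ["555-0100", None, "", "555-0101", None,
                "555-0102", None, "555-0103", "555-0104", "555-0105"]

today = null_rate(todays_batch)
alert = is_null_rate_anomalous(past_rates, today)
```

A small, stable share of missing values does not trigger the alert; only a change in the trend does. A valid "no data" situation, such as the airport closure above, would still need external context to be told apart from a pipeline failure.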
Correctness: Meaningful data largely depends on how correct the data is, in terms of its accuracy, consistency, uniqueness and validity.
A wide variety of approaches can be used to detect problems, depending on the complexity of the domain. For example, if the goal is to detect non-unique SSNs in the data set, then cardinality-based statistics can be applied to detect outliers. However, if a lot of additional evidence must be evaluated to determine duplicates, it might require very sophisticated matching systems, often found at the heart of good Master Data Management systems like Reltio.
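As a rough sketch of the simpler, cardinality-based case, the snippet below computes how unique an attribute is and lists the values that repeat. The function names and sample SSNs are made up for illustration, and this is only one possible statistic; it does not attempt the evidence-based matching that a Master Data Management system performs.

```python
from collections import Counter

def uniqueness_ratio(values):
    """Distinct non-missing values divided by total non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0  # nothing to compare, so treat as trivially unique
    return len(set(present)) / len(present)

def duplicated_values(values):
    """Return each non-missing value that appears more than once, with its count."""
    counts = Counter(v for v in values if v is not None)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical SSN column; an attribute like this is expected to be unique.
ssns = ["123-45-6789", "987-65-4321", "123-45-6789", "555-12-3456", None]

ratio = uniqueness_ratio(ssns)
dups = duplicated_values(ssns)
```

For an attribute that should be unique, a ratio below 1.0 is a signal worth investigating, and tracking the ratio over time shows whether the problem is growing.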
Below are a few examples of incorrect data.
Invalid data: The validity of data is often defined by business teams such as marketing ops, analytics, or security. For example, as part of GDPR compliance, security teams request anonymization of PII data by masking, so an SSN is now stored as XXX-XX-XXXX. However, when onboarding data from a new source, if the masking rule is not properly implemented, the result is data that is not only incorrect or unexpected but also out of compliance. Only automatic monitoring at the semantic level can proactively alert on such anomalies.
Dummy data: Sometimes when data is entered into the system, dummy or template values are used. This can lead to all kinds of problems for analytics. Imagine a thousand records with the same phone number: (800)-111-1111. The phone number is often a key field used in record matching along with other evidence, so an error like that can lead to many incorrectly merged records or even inefficient sales, which are very expensive to fix after the fact.
Schema mismatch: Either due to entry mistakes or due to pipeline errors, data may end up in an attribute it was not meant for, such as a first name in place of a last name, an SSN in place of a phone number, a state in place of a country, and many more such anomalies.
Non-standard formats: Sometimes data does not follow the expected format. In this case, even if it is correct, it may cause significant problems for downstream analytics. For example, I have observed a full state name used instead of the expected two-letter state abbreviation, as well as unexpected formats for phone numbers and SSNs.
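Several of the examples above (unmasked SSNs, values landing in the wrong attribute, non-standard formats) show up as a change in the "shape" of the values. A minimal sketch of shape profiling follows; the helper names, the 'A'/'9' encoding, and the sample data are assumptions for illustration only, not a description of any product's implementation.

```python
import re
from collections import Counter

def shape(value):
    """Reduce a string to its format: letters become 'A', digits become '9'."""
    letters_replaced = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"\d", "9", letters_replaced)

def shape_distribution(values):
    """Share of values having each shape."""
    counts = Counter(shape(v) for v in values)
    total = len(values)
    return {s: n / total for s, n in counts.items()}

def unexpected_shapes(values, expected):
    """Shapes (with their share) that are not in the expected set."""
    return {s: share for s, share in shape_distribution(values).items()
            if s not in expected}

# Hypothetical state column: two-letter abbreviations are expected.
states = ["CA", "NY", "California", "TX"]
odd_states = unexpected_shapes(states, {"AA"})

# Hypothetical SSN column: every value should be masked as XXX-XX-XXXX.
ssns = ["XXX-XX-XXXX", "123-45-6789", "XXX-XX-XXXX", "XXX-XX-XXXX"]
odd_ssns = unexpected_shapes(ssns, {shape("XXX-XX-XXXX")})
```

In this sketch the set of expected shapes is given by hand. In practice it could be learned from historical batches, so that a newly appearing shape, or a shift in the shares, raises an alert without a hand-written rule per attribute.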
There are so many different and unforeseen ways for data to become incorrect and untrustworthy that a static, rule-based system can catch only some of the anomalies. Such a system requires constant user intervention after the fact, which is already too late for critical industries that rely on real-time data.
Timeliness of the data is as important as its correctness. A report using old data is just as bad as one using incorrect data. Below are some examples I had to resolve that affected data quality, causing additional manual overhead and delays.
Event data gets updated: cancellations, new locations, updated times, added celebrity appearances, or changes in ticket availability
Airlines want to know about bookings made through third-party booking agencies as they happen, so they can make timely flight-related adjustments
Ride-sharing companies (Lyft, Uber) want to know the latest on expected attendees at a major venue to adjust pricing and direct drivers accordingly
Timeliness is particularly critical in the financial sector, among others, for borrowers' credit checks. A loan has to be approved or rejected on the spot
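A basic freshness check for cases like these can be sketched as follows. It is a simplified illustration: the function names, the fixed lag threshold, and the timestamps are assumptions, and a real monitor would likely learn the expected arrival pattern rather than rely on a hard-coded limit.

```python
from datetime import datetime, timedelta, timezone

def freshness_lag(latest_event_time, now):
    """Time elapsed since the most recent record arrived."""
    return now - latest_event_time

def is_stale(latest_event_time, now, max_lag):
    """True if the newest record is older than the allowed lag."""
    return freshness_lag(latest_event_time, now) > max_lag

# Hypothetical timestamps: the latest booking record is 45 minutes old.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
latest_booking = datetime(2024, 1, 1, 11, 15, tzinfo=timezone.utc)

# Illustrative tolerance: bookings are expected at least every 30 minutes.
stale = is_stale(latest_booking, now, max_lag=timedelta(minutes=30))
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test and to replay against historical data.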
Monitoring, analyzing, and reporting on data in real time goes a long way toward reducing operational risk due to low-quality data, and more data-driven companies are realizing this.
We at Telm.ai are striving hard to build your trust in data that will not fail you.
About the Author:
Lina Khatib is a Lead Data Scientist and founding team member at Telm.ai.
With a PhD in Computer Science, Lina brings with her over 20 years of experience in the fields of AI and Data Science.
Lina has worked for 10 years as a Research Scientist in the Intelligent Autonomy & Robotics area at NASA and as a Data Scientist at VW/Audi, Reltio and PredictHQ.
She is excited to be a part of Telm.ai and believes this is the only AI-based, attribute non-discriminating solution for tackling major issues in Data Observability.