After personnel, data is the most valuable asset for any business.
Industries depend on data to make significant decisions, leaving no room for untrustworthy data.
In my career as a Data Scientist, I have experienced firsthand that data is only valuable if it is reliable.
The first step towards building trust in the quality of data is understanding it, i.e., monitoring, the very first pillar in the Data Observability architecture at Telm.ai. At Telm.ai, we believe that profiling datasets is not enough; true data quality monitoring is achieved when you observe trends of anomalies at both the syntactic and semantic levels.
Many things can go wrong with enterprise data, especially when its volume and velocity are high. We categorize data quality along three dimensions: completeness, correctness, and timeliness.
Completeness is all about detecting missing data. It can occur at the source (table) level, i.e. you received only a fraction of the expected data, or at the attribute level, when some attributes are missing values. However, missing data is not the same as no data: in some cases, not having data is perfectly valid. Being able to distinguish between these scenarios saves data owners a great deal of time, effort, and money.
An example of such a scenario is two days' worth of missing flight data for a certain geographical area. Rather than a pipeline failure, this could have a valid cause, such as airport closures due to severe weather conditions. It is important to detect and isolate such scenarios to avoid expensive and unnecessary troubleshooting efforts.
Another example could be that only partial records made it in, whether due to entry errors by sales representatives or failures in the data pipeline even though all the data was entered. Some broken or partial records will almost certainly exist in any large-scale dataset, but it is critical to be alerted when the trend changes so mitigation actions can be taken.
As you can see, there are many different use cases that apply just to the realm of Completeness.
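One simple way to flag the kind of volume drop described above is to compare today's record count against a rolling baseline of comparable days. This is an illustrative sketch, not Telm.ai's actual method; the function name, the z-score threshold, and the flight-data counts are all assumptions.

```python
from statistics import mean, stdev

def completeness_alert(daily_counts, today_count, z_threshold=3.0):
    """Flag today's record count if it falls far below the recent baseline.

    daily_counts: record counts from recent comparable days (e.g. same weekday).
    Returns True when today's volume is anomalously low.
    """
    baseline = mean(daily_counts)
    spread = stdev(daily_counts)
    if spread == 0:
        # No variation in history: any shortfall counts as an anomaly.
        return today_count < baseline
    z = (today_count - baseline) / spread
    return z < -z_threshold

# Roughly 1M flight records arrive per day, then a day with only a fraction.
history = [1_002_000, 998_500, 1_005_300, 997_800, 1_001_100]
print(completeness_alert(history, 240_000))  # prints True
```

Note that a check like this only detects *that* volume dropped; distinguishing a pipeline failure from a valid no-data event (such as an airport closure) still requires correlating the drop with external context.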
Correctness: How meaningful data is largely depends on how correct it is, in terms of its accuracy, consistency, uniqueness, and validity.
A wide variety of approaches can be used to detect problems, depending on the complexity of the domain. For example, if the goal is to detect non-unique SSNs in a dataset, cardinality-based statistics can be applied to detect outliers. However, if a lot of additional evidence must be evaluated to determine duplicates, it may require very sophisticated matching systems, often found at the heart of good Master Data Management systems like Reltio.
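As a minimal sketch of the cardinality-based approach, one can compare the number of distinct key values against the row count and alert when the duplicate share drifts above a tolerance. The function names, the 1% tolerance, and the sample SSN strings below are assumptions for illustration only.

```python
from collections import Counter

def duplicate_ratio(values):
    """Fraction of rows whose key value also appears in another row."""
    counts = Counter(values)
    duplicated_rows = sum(c for c in counts.values() if c > 1)
    return duplicated_rows / len(values)

def uniqueness_alert(values, max_ratio=0.01):
    """Alert when the share of duplicated key values exceeds a tolerance."""
    return duplicate_ratio(values) > max_ratio

# Two of these four rows share an SSN, so half the rows are duplicated.
ssns = ["078-05-1120", "219-09-9999", "078-05-1120", "457-55-5462"]
print(duplicate_ratio(ssns))  # prints 0.5
```

In practice the interesting signal is the trend of this ratio over time rather than any single snapshot, since some baseline level of duplicates usually exists.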
Below are a few examples of incorrect data.
So many different and unforeseen conditions can make data incorrect and untrustworthy that catching every anomaly with a static, rule-based system is inherently limiting: it requires constant, ongoing user intervention after the fact, which is already too late for critical industries that rely on real-time data.
Timeliness of the data is as important as its correctness. A report using stale data is just as bad as one using incorrect data. Below are some examples I had to resolve that affected data quality, causing additional manual overhead and delays.
Monitoring, analyzing, and reporting on data in real time will go a long way toward reducing operational risk due to low-quality data, and more data-driven companies are realizing this.
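A basic timeliness check along these lines compares the timestamp of the newest record against a freshness SLA. This is a sketch under assumed names and thresholds; the 6-hour SLA is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

def freshness_alert(latest_record_ts, max_age=timedelta(hours=6)):
    """Alert when the newest record is older than the freshness SLA."""
    age = datetime.now(timezone.utc) - latest_record_ts
    return age > max_age

# A feed whose newest record is a day old violates a 6-hour SLA.
stale = datetime.now(timezone.utc) - timedelta(days=1)
print(freshness_alert(stale))  # prints True
```

Using timezone-aware UTC timestamps throughout avoids false alerts caused by comparing clocks in different time zones.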
We at Telm.ai are striving to build your trust in data that will not fail you.
About the Author:
Lina Khatib is a Lead Data Scientist and founding team member at Telm.ai.
With a PhD in Computer Science, Lina brings with her over 20 years of experience in the fields of AI and Data Science.
Lina has worked for 10 years as a Research Scientist in the Intelligent Autonomy & Robotics area at NASA and as a Data Scientist at VW/Audi, Reltio and PredictHQ.
She is excited to be a part of Telm.ai and believes it is the only AI-based, attribute-agnostic solution for tackling major issues in Data Observability.
#dataobservability #dataquality #dataengineering #dataobservabilityplatform #machinelearning