Whether you onboard third-party data or curate data for your customers, internal or external, data quality is one of the most important considerations. Selling poor-quality data will inevitably damage your reputation and, with it, your business. Onboarding bad data, on the other hand, can derail your other data initiatives, corrupt downstream systems and lead to inaccurate analytics, which in turn leads to business losses.
Ensuring good data quality requires implementing robust practices across several areas.
We’ve spoken extensively about the importance of data monitoring; let’s now take a look at the first step - understanding the data.
For years, one of the handiest tools for data experts has been profiling. A number of free, open-source and commercial data profiling products are available on the market.
Data profiling is the process of analyzing data and summarizing the results in order to assess its quality: for example, finding which attributes a dataset contains, the distribution of its values, the most and least frequent values, and the percentage of populated and unique values.
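To make this concrete, here is a minimal sketch of the kind of per-column statistics a profiler computes - fill rate, uniqueness and top values. The function name and the sample data are illustrative, not part of any particular product:

```python
from collections import Counter

def profile_column(values):
    """Compute basic profiling stats for a single column of values."""
    total = len(values)
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "total": total,
        # Share of rows that are actually populated
        "fill_rate": len(non_null) / total if total else 0.0,
        # Share of populated values that are distinct
        "unique_rate": len(counts) / len(non_null) if non_null else 0.0,
        # Most frequent values and their counts
        "top_values": counts.most_common(3),
    }

# Example: a column of country codes with some gaps
stats = profile_column(["US", "US", "DE", None, "FR", "US", ""])
```

A real profiler computes dozens of such metrics per attribute; the challenge discussed below is that someone then has to read them all.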
Consuming profiling reports is manual, time consuming and, quite honestly, boring. It becomes even more challenging when a report provides too much detail and quickly becomes overwhelming. On the other hand, if it is very high level, it is hardly useful, as it uncovers only a few of the most visible problems (Figure 1).
Now, multiply that by the drastically increased volume and velocity of data in recent years, and it becomes clear that profiling in its classic form is no longer up to the task. So are we doomed? Fortunately not.
It’s clear that just throwing more information at people in the hope that it will solve data quality concerns is not going to work. Fully relying on ML to detect and act on data issues also doesn’t seem feasible, as a tremendous amount of knowledge and context about data lives in the heads of data experts, accumulated over years of working in specific data domains.
So we’ve combined ML with the intuition and experience of data experts to bring you Profiler++. ML does what it does best - crunching through tons of statistical data and surfacing the most valuable information - while the experts shine at what they do best: applying their knowledge to make decisions based on that information.
Telmai’s engine processes data and collects significantly more statistical information than a typical profiling tool would. That information is then fed to the ML engine, analyzed, and brought to the user via a fast, interactive and engaging user experience. So instead of reading a thick folder of statistical reports, it’s like watching a movie that tells you the story of your data.
Profiler++ provides an easy way to explore data by combining multiple perspectives - patterns/masks, various distributions and data anomaly scores - in one interactive user experience. It connects all the statistics and scores together, so when you drill down into a particular aspect, like a pattern or a certain anomaly score, you can see all contributing factors. This, in turn, helps you investigate the root cause of problematic values, share findings with other teams and, most importantly, translate them into data quality expectations.
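The pattern/mask perspective is easy to illustrate. A common technique (sketched here generically, not as Telmai’s implementation) is to replace every letter with `A` and every digit with `9`, so values with the same format collapse into one mask; rare masks then stand out as candidate anomalies:

```python
import re
from collections import Counter

def mask(value: str) -> str:
    """Replace letters with 'A' and digits with '9', keeping punctuation."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))

# Example: phone-number-like values with one odd entry
phones = ["415-555-0100", "650-555-0199", "555 0134", "n/a"]
pattern_counts = Counter(mask(p) for p in phones)
```

Here the dominant mask is `999-999-9999`, while `A/A` (from `"n/a"`) appears once - exactly the kind of low-frequency pattern a reviewer would want to drill into.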
This profiler on steroids is now available for trial and can help you eliminate the blind spots in your data.
The best part: once you understand your data, our platform is designed to take it further by capturing user feedback to fine-tune model training, allowing you to put it on autopilot and let it analyze and monitor your data on an ongoing basis.
At Telmai, we believe that your monitoring and data observability journey must start with understanding, analyzing and investigating your data and specifically data quality anomalies. We have made this first step absolutely easy, engaging and fun!
Just bring in your data and, within minutes and with zero code, start this journey toward monitoring. Start your free account today.
About the Author
Max Lukichev, co-founder and CTO of Telmai, holds a Ph.D. in Computer Science with a focus on database theory. Having spent more than 10 years in the research and development of ML/AI-based products addressing various aspects of detecting and resolving data anomalies, Max has a deep background in observability platforms. In his spare time, Max loves building cars from the ground up!
#dataprofiling #dataobservability #datamonitoring #dataquality