Analyzing & Monitoring Data Quality in Google BigQuery
BigQuery is Google’s fully managed, petabyte-scale, low-cost enterprise data warehouse for analytics, and it is most often used for interactive ad-hoc queries of read-only datasets.
That makes it crucial that the data in BigQuery is always ready for consumption, so it is important to ensure and monitor the quality of the data in your BigQuery tables.
Monitoring the usage and performance of BigQuery
This does not fall into the category of pure data quality monitoring, but it is nonetheless an important part of your overall monitoring strategy. You could leverage Google’s native monitoring capabilities or augment them with your central cloud monitoring platform, such as Datadog.
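One native option is the INFORMATION_SCHEMA jobs views, which record every job run in a project. The sketch below only builds the SQL text, so it runs without credentials; the helper name build_usage_query is hypothetical, and it assumes your jobs run in the US multi-region and that you have permission to list project jobs.

```python
import re


def build_usage_query(region: str = "region-us", days: int = 7) -> str:
    """Return SQL that summarizes query jobs per user over the last `days` days."""
    # Region qualifiers look like "region-us" or "region-europe-west1".
    # Rejecting anything else keeps arbitrary text out of the SQL string.
    if not re.fullmatch(r"region-[a-z0-9-]+", region):
        raise ValueError(f"unexpected region qualifier: {region!r}")
    if days <= 0:
        raise ValueError("days must be positive")
    return f"""
SELECT
  user_email,
  COUNT(*) AS job_count,
  SUM(total_bytes_processed) AS bytes_processed,
  SUM(total_slot_ms) AS slot_ms
FROM `{region}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(days)} DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY bytes_processed DESC
""".strip()


if __name__ == "__main__":
    # Paste the printed SQL into the BigQuery console or pass it to a client.
    print(build_usage_query(days=30))
```

The result could feed a dashboard or be exported to an external monitoring platform; which users or byte volumes count as unusual is left for you to decide.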
Syntactical Validations within BigQuery
Google BigQuery supports many schema-level validations, such as column data types and REQUIRED or NULLABLE column modes. These work well for catching data type and schema issues.
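To make concrete what schema-level validation does and does not catch, here is a rough client-side approximation in plain Python. It is not how BigQuery implements its checks; the schema, column names, and the schema_violations helper are all made up for illustration, and only a few column types are mapped.

```python
from datetime import datetime

# Schema in the same shape as a BigQuery JSON schema file (name/type/mode).
# The columns are hypothetical.
SCHEMA = [
    {"name": "customer_id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "phone", "type": "STRING", "mode": "NULLABLE"},
    {"name": "signup_ts", "type": "TIMESTAMP", "mode": "REQUIRED"},
]

# Python types accepted for each column type in this sketch.
PY_TYPES = {"INTEGER": int, "STRING": str, "FLOAT": float, "TIMESTAMP": datetime}


def schema_violations(row: dict, schema: list = SCHEMA) -> list:
    """Return a list of human-readable schema problems found in one row."""
    problems = []
    for field in schema:
        name = field["name"]
        value = row.get(name)
        if value is None:
            if field["mode"] == "REQUIRED":
                problems.append(f"{name}: missing value for REQUIRED column")
            continue
        expected = PY_TYPES[field["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {field['type']}, got {type(value).__name__}"
            )
    known = {field["name"] for field in schema}
    for name in sorted(set(row) - known):
        problems.append(f"{name}: column not in schema")
    return problems
```

Note that a row such as {"customer_id": 1, "phone": "not a number", "signup_ts": datetime(2023, 1, 1)} passes cleanly: the phone value is a valid STRING, so a check at this level cannot tell that its content is wrong.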
Open Source Tools
Data Validation Tool is a good open-source library that might be helpful if you are migrating data from an existing system to BigQuery.
Airflow (Cloud Composer) BigQuery operators provide another good open-source option for validating data in BigQuery.
Additionally, there are open-source tools such as Great Expectations, TensorFlow Data Validation, Deequ, and Apache Griffin.
These tools work best when you already know which validations have to be programmatically enforced. Most of them need some development work.
ML and statistical monitoring tools
These tools are better suited to automatic anomaly detection for unknown issues in the data. Such issues fall into two groups: trend anomalies and value anomalies. Quite a few observability companies that monitor trends are emerging in the market. Telmai is one such platform, and it can automatically detect anomalies at the row-value level.
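As a rough illustration of what a trend anomaly check involves, the sketch below flags days whose row count strays far from a trailing window, using a plain z-score. This is a simple statistical baseline, not the method used by Telmai or any other vendor; the function name, window size, and threshold are assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_trend_anomalies(daily_counts: list, window: int = 7,
                         threshold: float = 3.0) -> list:
    """Return indexes whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    anomalies = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as an anomaly.
            if daily_counts[i] != mu:
                anomalies.append(i)
            continue
        if abs(daily_counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical daily row counts for one table; day 8 drops sharply.
    counts = [1000, 1010, 990, 1005, 995, 1002, 998, 1001, 400, 1003]
    print(find_trend_anomalies(counts))
```

A baseline like this ignores seasonality, and an outlier stays in the trailing window and inflates the deviation for the following days, which is part of why dedicated tools exist for this problem.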
In order to truly build a high-quality data source, you would start with analysis and then translate what you learn into monitoring metrics and service level indicators (SLIs).
Example: Once you standardize the acceptable patterns for phone numbers, you want to be alerted whenever a value violates them.
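The phone number example can be sketched as a small SLI check. The accepted pattern here is an assumption (an E.164-style number: a plus sign followed by 8 to 15 digits), as are the 99% target and the function names; substitute whatever standard your own analysis arrives at.

```python
import re

# Assumed standard: "+" followed by 8 to 15 digits, no leading zero.
PHONE_PATTERN = re.compile(r"\+[1-9]\d{7,14}")


def phone_sli(values: list) -> float:
    """Fraction of non-null phone values that match the accepted pattern."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to validate, so nothing is in violation.
        return 1.0
    valid = sum(1 for v in present if PHONE_PATTERN.fullmatch(v))
    return valid / len(present)


def should_alert(values: list, target: float = 0.99) -> bool:
    """True when the share of valid phone numbers falls below the target."""
    return phone_sli(values) < target


if __name__ == "__main__":
    # Hypothetical sample of a phone column; the third value breaks the pattern.
    sample = ["+14155550123", "+442071838750", "415-555-0123", None]
    print(phone_sli(sample), should_alert(sample))
```

In practice the values would come from a query against the table, and the alert would be routed to whatever notification channel your team already uses.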
Hope this was helpful. We would love to hear how you are monitoring your BigQuery data today.
We have just added our BigQuery integration for Profiler++ and we will be rolling out our beta-version BigQuery monitoring soon.
Please leave your email if you are interested in being a beta customer for our BigQuery data monitoring.