Introducing the First Semi-Structured Data Observability
Max Lukichev

Why Semi-Structured Data?

As databases evolved and migrated to the cloud, data models changed with them. Most modern cloud data warehouses and data lakes now support semi-structured schemas. Today virtually any big data platform can work with NDJSON (Newline Delimited JSON), where each line is a complete JSON object representing one record, and some even offer a native JSON column type, for example BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/json-data
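
For illustration, an NDJSON file simply contains one complete JSON object per line; the records below are made up:

{"first_name": "Jon", "last_name": "Doe"}
{"first_name": "Jane", "last_name": "Smith"}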

This gave data architects a great opportunity to design the most efficient data model for storage and querying. At the same time, it created a challenge for data observability: multi-value fields, i.e., arrays. Such data requires special logic to correctly calculate basic data quality KPIs like completeness and uniqueness, because a single record may carry no value at all, or several values, for the same attribute.

Example of a single record of such data:

{
  "first_name": "Jon",
  "last_name": "Doe",
  "address": [{
    "street":"123 Main St",
    "city": "Neverland",
    "zip": "94404"
  }],
  "phone":["123-124-23-45","124-234-3467"]
}

Normally the data model is designed so that the most common queries are efficient. For analytical and reporting purposes a flatten operation (sometimes called unnest or explode) is typically used and is sufficient, because such queries only need one or two attributes flattened at a time. Effectively, if a record has an attribute with two values, it is converted into two records, one per value of that attribute.

In the example above that would be:

{"first_name" : "Jon", … , "phone" : "123-124-23-45"}
{"first_name" : "Jon", … , "phone" : "124-234-3467"}

However, when monitoring data metrics and their drifts, this approach becomes very inefficient, because it requires flattening and aggregating one attribute at a time. Real-life cases commonly have hundreds or even thousands of attributes, and the overhead accumulated across that many queries quickly becomes unacceptable.

Solution

Telmai is designed from the ground up to support very large semi-structured schemas. It runs a constant number of aggregation operations (or queries) regardless of the number of attributes, avoiding this overhead, and can process schemas of virtually unlimited complexity.

In the example above, Telmai extracts the individual attributes (first_name, last_name, address.street, address.city, address.zip, phone) and automatically predicts trends for dozens of metrics for each of them to detect unexpected drifts.

These metrics include:

  • Total record count
  • % of unique values
  • % of incomplete/null values
  • Distribution of categorical values and patterns
  • Mean value frequency, length, …
  • % of date, number, … values
  • And many more
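
As a minimal sketch of the single-pass idea (not Telmai's actual engine), per-attribute aggregates for metrics like completeness and uniqueness can be accumulated for all attributes in one scan over NDJSON input, so the number of passes does not grow with the number of attributes:

import json
from collections import defaultdict

def profile(ndjson_lines):
    # One scan over the data; counters for every top-level attribute at once.
    stats = defaultdict(lambda: {"with_value": 0, "values": 0, "distinct": set()})
    total = 0
    for line in ndjson_lines:
        total += 1
        record = json.loads(line)
        for attr, raw in record.items():
            values = raw if isinstance(raw, list) else [raw]
            values = [v for v in values if v not in (None, "")]
            s = stats[attr]
            if values:                      # record has at least one value for this attribute
                s["with_value"] += 1
            s["values"] += len(values)      # multi-value fields contribute several values
            s["distinct"].update(str(v) for v in values)
    return {attr: {"completeness": s["with_value"] / total,
                   "pct_unique": len(s["distinct"]) / s["values"] if s["values"] else 0.0}
            for attr, s in stats.items()}

lines = ['{"first_name": "Jon", "phone": ["123-124-23-45", "124-234-3467"]}',
         '{"first_name": "Ann", "phone": []}']
print(profile(lines))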

Telmai can process semi-structured data in any form: JSON (NDJSON) files, Parquet, or data warehouse tables with JSON columns. For example, one customer needed to monitor product JSON data stored in Snowflake. The challenge was that this data had a very complex schema: a hierarchy of over 8,000 attributes. Such schemas are not unusual in enterprises, especially for data describing products, since each product type can bring its own set of type-specific attributes. Yet the Telmai engine was able to handle this data efficiently and give peace of mind about its quality.

In summary, the Telmai architecture allows the analysis of data of any volume, located anywhere (files, streams, SQL engines), with flat or semi-structured schemas of any complexity. On top of that, it scales horizontally well, so aggressive sampling is not required either.

Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality can affect critical business decisions, customer trust, sales, and financial opportunities.

To get started, there are four main steps in building a complete and ongoing data profiling process:

  1. Data Collection
  2. Discovery & Analysis
  3. Documenting the Findings
  4. Data Quality Monitoring

We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.

What are the different kinds of data profiling?

Data profiling falls into three major categories: structure discovery, content discovery, and relationship discovery. While they all help in gaining a better understanding of the data, the types of insights they provide differ.

Structure discovery checks whether data is consistent, correctly formatted, and well structured. For example, if you have a ‘Date’ field, structure discovery helps you see the various patterns of dates (e.g., YYYY-MM-DD or YYYY/DD/MM) so you can standardize your data into one format.


Structure discovery also examines basic statistics in the data, for example, minimum and maximum values, means, medians, and standard deviations.
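
As a small illustrative sketch (not tied to any particular tool), one common way to surface such format patterns is to reduce every value to a pattern string and count the distinct patterns, assuming Python:

import re
from collections import Counter

def date_pattern(value):
    # Map every digit to '9' so the layout of the value becomes visible.
    return re.sub(r"\d", "9", value)

dates = ["2023-05-17", "2023-06-01", "2023/17/05"]
print(Counter(date_pattern(d) for d in dates))
# Counter({'9999-99-99': 2, '9999/99/99': 1})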


Content discovery looks more closely into the individual attributes and data values to check for data quality issues. This can help you find null values, empty fields, duplicates, incomplete values, outliers, and anomalies.


For example, if you are profiling address information, content discovery helps you see whether your ‘State’ field contains the two-letter abbreviation, the fully spelled out state names, both, or potentially some typos.


Content discovery can also be a way to validate data against predefined rules, helping improve data quality by identifying instances where the data does not conform. For example, a transaction amount should never be less than $0.
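
A minimal sketch of such rule-based content checks, using made-up records and the $0 rule above:

from collections import Counter

transactions = [
    {"id": "t1", "amount": 25.00},
    {"id": "t2", "amount": -3.10},   # violates "a transaction amount should never be less than $0"
    {"id": None, "amount": 10.00},   # missing key field
    {"id": "t1", "amount": 7.50},    # duplicate id
]

negative = [t for t in transactions if t["amount"] < 0]
missing_id = [t for t in transactions if t["id"] is None]
duplicates = [i for i, n in Counter(t["id"] for t in transactions if t["id"]).items() if n > 1]
print(len(negative), len(missing_id), duplicates)  # 1 1 ['t1']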


Relationship discovery identifies how different datasets are related to each other, for example, key relationships between database tables or lookup cells in a spreadsheet. Understanding relationships is most critical when designing a new database schema, a data warehouse, or an ETL flow that joins tables and data sets based on those key relationships.
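
As a small illustration with hypothetical tables, a key relationship can be sanity-checked by testing whether every candidate foreign-key value appears in the referenced key column:

orders = [{"order_id": 1, "customer_id": "c1"},
          {"order_id": 2, "customer_id": "c3"}]
customers = [{"customer_id": "c1"}, {"customer_id": "c2"}]

fk_values = {o["customer_id"] for o in orders}
pk_values = {c["customer_id"] for c in customers}
orphans = fk_values - pk_values
print(orphans)  # {'c3'} -> orders.customer_id does not cleanly reference customers.customer_id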

| Data Observability | Data Quality |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects, investigates the root cause of issues, and helps remediate | Detects and helps remediate |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing of data quality rules adds to its costs |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open source data validation software |


Start your data observability today

Connect your data and start generating a baseline in less than 10 minutes. 

No sales call needed
