Data Wiki

Welcome to Telmai’s Data Wiki: commonly used terms in the data world, all in one place.

What is observability?

Observability is a measure of how well internal states of a system can be inferred from its external outputs. In software systems, observability involves making a system’s internal state visible through metrics, logging, and tracing. This helps developers and operators understand system behavior, diagnose issues, and improve performance.

What is data observability?

Data Observability is the practice of understanding and monitoring the behavior, quality, and performance of data as it flows through a system. It involves real-time tracking and analysis to ensure data reliability, accuracy, and compliance.

Key Aspects:

  • Data Quality: Accuracy, completeness, and consistency.
  • Data Flow: Movement through systems and identification of bottlenecks.
  • Data Dependencies: Relationships and impacts of changes.
  • Data Anomalies: Detection of outliers and errors.
  • Data Compliance: Adherence to regulations and policies.

Organizations achieve data observability using monitoring tools, data pipelines, quality checks, and governance practices. This helps in identifying issues, making informed decisions, and maintaining data integrity.

What is data monitoring?

Data Monitoring involves continuously checking data against predefined or machine learning-calculated metrics to ensure it meets acceptable thresholds. When an issue is detected, it typically triggers an alert or notification, and sometimes initiates automated actions.

Example: Notify a user when a specific policy fails.
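
To make this concrete, here is a minimal sketch of a threshold-based monitor, assuming a pandas DataFrame and a hypothetical notify() hook; the completeness metric and the 98% threshold are illustrative, not any particular product’s implementation.

```python
import pandas as pd

COMPLETENESS_THRESHOLD = 0.98  # hypothetical policy: at most 2% missing emails allowed


def notify(message: str) -> None:
    """Placeholder alert hook; a real system might page, email, or post to a chat channel."""
    print(f"ALERT: {message}")


def check_completeness(df: pd.DataFrame, column: str, threshold: float) -> bool:
    """Return True if the share of non-null values in `column` meets the threshold."""
    completeness = df[column].notna().mean()
    if completeness < threshold:
        notify(f"{column} completeness {completeness:.2%} fell below {threshold:.0%}")
        return False
    return True


orders = pd.DataFrame({"email": ["a@x.com", None, "c@x.com", "d@x.com"]})
check_completeness(orders, "email", COMPLETENESS_THRESHOLD)
```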

Why is data monitoring important?

Data Monitoring is a process that maintains a high, consistent standard of data quality. By routinely monitoring data at the source or at ingestion, organizations can avoid the resource-intensive pre-processing of data before it is moved.

How is data observability different from data quality?

Data Observability refers to the ability to understand and monitor the behavior of data within a system. It involves tracking data flow, detecting errors, and identifying discrepancies in real-time, enabling early problem detection and system performance assessment.

With the integration of machine learning, data observability tools can intelligently monitor anomalies and business KPI drifts, offering deeper insights with lower maintenance and total cost of ownership (TCO).

Data quality, on the other hand, focuses on the accuracy, completeness, and consistency of data. It involves identifying and correcting errors, removing duplicates, and ensuring data is entered and stored consistently.

While both concepts provide visibility into data health and can detect quality issues against predefined metrics, data observability goes further by offering real-time monitoring and intelligent insights, enhancing overall system understanding and performance.

What is the difference between data outliers and data drifts?

Data outliers and data drifts both describe unusual or abnormal data points but in different contexts.

Data outliers are individual data points significantly different from others in a dataset, often due to measurement or data entry errors. They can skew statistics and patterns and are typically identified using statistical methods like mean, standard deviation, and quantiles.

Data drifts refer to changes in the statistical properties of a dataset over time, caused by shifts in the underlying processes, such as data collection methods or system changes. Drifts indicate evolving trends or patterns within the data.
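
As a rough, illustrative sketch of the distinction, the snippet below flags individual outliers with a 3-sigma rule and checks for drift between two time windows with a two-sample Kolmogorov–Smirnov test from SciPy; the data, thresholds, and window sizes are made up.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)

# Outliers: individual points far from the rest of the sample.
values = np.append(rng.normal(100, 5, size=500), [160.0])  # one injected outlier
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]  # illustrative 3-sigma rule
print("outliers:", outliers)

# Drift: the distribution itself shifts between two time windows.
last_week = rng.normal(100, 5, size=500)
this_week = rng.normal(110, 5, size=500)  # simulated upward shift
result = ks_2samp(last_week, this_week)
if result.pvalue < 0.01:  # illustrative significance cutoff
    print(f"drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.2e})")
```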

How to find anomalies in data?

There are several techniques for identifying anomalies in a dataset:

  1. Statistical methods: Use properties like mean, standard deviation, and quantiles to identify outliers. For example, points outside a certain number of standard deviations from the mean are flagged as outliers.
  2. Clustering: Techniques like k-means, density-based clustering, or hierarchical clustering group data points and identify those that don’t belong to any cluster.
  3. Classification: Train a classifier to learn data patterns and identify points that don’t conform.
  4. Rule-based methods: Use predefined rules to flag anomalies, such as points outside a certain range.
  5. Machine Learning: Algorithms like Autoencoder, Isolation Forest, and Local Outlier Factor are designed to detect anomalies.

Finding anomalies is an ongoing process requiring continuous monitoring and maintenance. Data observability tools like Telmai are designed to automatically detect anomalies in data.
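
As one illustration of technique 5 above, here is a minimal sketch using scikit-learn’s IsolationForest on synthetic data; the features and contamination rate are assumptions, and this is not how any particular tool implements detection.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly "normal" rows plus a few injected anomalies (illustrative data).
normal = rng.normal(loc=[50, 1000], scale=[5, 100], size=(500, 2))
anomalies = np.array([[50, 9000], [200, 1000], [0, 0]])
X = np.vstack([normal, anomalies])

# contamination is the assumed share of anomalies; tune it for your data.
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal

print("rows flagged as anomalies:", np.where(labels == -1)[0])
```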

What is data anomaly monitoring?

Data anomaly monitoring is the continuous process of detecting unusual or abnormal data signals, known as anomalies. These anomalies can indicate issues in the data pipeline or collection process and can also provide valuable insights into changes in underlying business processes.

What is data quality?

Data quality refers to the accuracy, completeness, consistency, timeliness, uniqueness, and reliability of data. It measures how well data meets the requirements for its intended use. High data quality enables organizations to make informed decisions based on reliable information. Techniques to measure, monitor, and improve data quality include data profiling, cleansing, validation, standardization, and governance. Ensuring data quality is an ongoing process that requires continuous monitoring and maintenance. Data observability serves as a foundational layer for maintaining high data quality.

What are the characteristics of data quality?

Data quality refers to metrics that measure the degree to which data meets the requirements for its intended use. Key characteristics include:

  • Accuracy: Data is correct and free of errors.
  • Completeness: Data has no missing values.
  • Consistency: Data is uniform across sources or systems.
  • Timeliness: Data is up-to-date and relevant.
  • Validity: Data conforms to business rules or constraints.
  • Uniqueness: Data is unique and non-redundant.

Organizations prioritize these characteristics based on their specific needs, usage, and data lifecycle.
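
To show how a few of these characteristics can be quantified, here is a rough pandas sketch that scores completeness, uniqueness, and validity; the column names and the email pattern are illustrative assumptions.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
})

# Completeness: share of non-null values per column.
completeness = customers.notna().mean()

# Uniqueness: share of distinct values in the key column.
uniqueness = customers["customer_id"].nunique() / len(customers)

# Validity: share of emails matching a simple (illustrative) pattern.
validity = customers["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+", na=False).mean()

print(completeness, uniqueness, validity, sep="\n")
```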

Is data lineage part of data quality?

Data lineage is closely related to data quality. It tracks data’s origins, transformations, storage, and usage, helping to understand and manage data quality. By tracing issues back to their source, data lineage aids in correcting errors, ensuring consistency, and enhancing overall data governance. While not a direct part of data quality, it is essential for maintaining and improving it.

What are data quality checks?

Data quality checks are procedures and methods used to evaluate and improve data quality, ensuring it is accurate, complete, consistent, and reliable. Key types of checks include:

  • Data Validation: Ensures data conforms to rules and constraints like data types, formats, and ranges.
  • Data Profiling: Generates statistics and summaries about the data, such as types, missing values, and outliers.
  • Data Standardization: Ensures consistent data representation and storage, such as standardizing dates and addresses.
  • Data Cleansing: Corrects errors and inconsistencies, removes duplicates, and fills in missing values.
  • Data Governance: Establishes policies and controls for managing data.
  • Data Monitoring: Continuously detects errors and identifies patterns and trends.

These checks can be automated or manual, applied in batch mode or real-time, and tailored to various data types. Data observability tools like Telmai use machine learning and statistical analysis to automate and enhance data quality checks.
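
As a small, hedged sketch of a few of these checks in practice, the snippet below runs a validation rule, a quick profile, and a standardization/cleansing step with pandas; the business rules and column names are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [19.99, -5.00, 250.00, None],
    "country": ["US", "us", "DE", "FR"],
})

# Validation: flag rows violating simple business rules (illustrative ranges).
invalid_amounts = orders[(orders["amount"] < 0) | (orders["amount"] > 10_000)]

# Profiling: quick summary of types, missing values, and duplicates.
profile = {
    "dtypes": orders.dtypes.to_dict(),
    "missing": orders.isna().sum().to_dict(),
    "duplicate_ids": int(orders["order_id"].duplicated().sum()),
}

# Standardization and cleansing: normalize country codes, drop duplicates,
# and fill missing amounts with a placeholder for downstream review.
clean = (
    orders.assign(country=orders["country"].str.upper())
          .drop_duplicates(subset="order_id")
          .fillna({"amount": 0.0})
)

print(invalid_amounts, profile, clean, sep="\n\n")
```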

What causes poor data quality?

Poor data quality can result from various factors, including:

  • Data Entry Errors: Human errors during data input, like typos and omissions.
  • Data Collection Errors: Faulty equipment or inaccurate measurements during data collection.
  • Data Pipeline Issues: Technical problems or bugs in data transfer, storage, or processing.
  • Lack of Data Governance: Absence of policies and controls for managing data.
  • Lack of Data Monitoring: Failure to regularly check for errors and inconsistencies.
  • Lack of Standardization: Inconsistent data formats leading to confusion and errors.
  • Data Integration Issues: Inaccuracies from merging data from multiple sources without proper integration and cleaning.

What impact does poor data quality have?

Poor data quality can significantly impact organizations in various ways:

  • Inaccurate Decision Making: Leads to incorrect or incomplete information, resulting in poor outcomes.
  • Reduced Productivity: Wastes time and resources as employees correct errors or search for missing information.
  • Increased Costs: Raises expenses due to error correction and redoing work.
  • Loss of Trust: Damages reputation and erodes client and stakeholder trust.
  • Compliance Issues: Risks non-compliance with regulations like GDPR, HIPAA, and SOX.
  • Hindered Analytics: Makes it difficult to extract insights and use data effectively.
  • Poor ML Models: Affects the accuracy and maintenance of machine learning models.
  • Integration Challenges: Complicates merging data from different systems, limiting insights.

In summary, poor data quality wastes resources, leads to bad decisions, decreases productivity, damages reputation, and risks regulatory non-compliance.

What is the difference between Data Governance and Data quality?

Data governance and data quality are closely related but distinct concepts in data management.

Data Governance: This is the overall management of data within an organization. It involves establishing policies, procedures, and controls for data throughout its lifecycle—from creation to disposal. It encompasses data management, security, privacy, and compliance.

Data Quality: This refers to the degree to which data meets the requirements for its intended use. It measures how well data is fit for purpose, focusing on accuracy, completeness, consistency, and reliability.

In summary, data governance is about managing and protecting data, while data quality is about ensuring data is accurate and useful.

What is data integrity and why is it important?

Data integrity refers to the accuracy, completeness, consistency, and reliability of data throughout its lifecycle. It ensures that data is protected against unauthorized access or alterations.

Importance of Data Integrity:

  • Informed Decision Making: Ensures accurate and consistent data for reliable decisions.
  • Data Security: Protects against breaches and security threats.
  • Compliance: Meets regulatory requirements like GDPR, HIPAA, and SOX.
  • Business Continuity: Ensures systems run efficiently and protects against data loss or corruption.
  • Reputation: Maintains trust with clients and stakeholders.

In summary, data integrity is crucial for accurate decision-making, security, regulatory compliance, operational efficiency, and maintaining organizational reputation.

What is data quality binning?

Data quality binning involves categorizing data into bins or segments to evaluate its quality. This method helps identify patterns, detect anomalies, and ensure data accuracy and consistency. By grouping data into specific ranges or categories, organizations can effectively monitor and manage data quality, leading to better insights and more informed decision-making.
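
A minimal sketch of the idea using pandas’ pd.cut; the bin edges and column name are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"order_amount": [5, 12, 48, 95, 430, 1200, 8, None, 67, 300]})

# Group amounts into illustrative ranges ("bins") and inspect each bin.
bins = [0, 10, 100, 1000, float("inf")]
labels = ["0-10", "10-100", "100-1000", "1000+"]
df["amount_bin"] = pd.cut(df["order_amount"], bins=bins, labels=labels)

# Row counts per bin (including unbinned nulls) highlight unexpected shifts.
print(df["amount_bin"].value_counts(sort=False, dropna=False))
```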

What is data diff?

Data diff involves comparing two datasets to identify differences and changes. This process highlights discrepancies, updates, and deletions between data versions, ensuring data integrity and accuracy. Data diff is useful for tracking changes over time, debugging data issues, and validating data transformations, helping organizations maintain consistent and reliable data.
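
A simple illustration with pandas: an outer merge keyed on an assumed id column classifies rows as added, removed, or changed between two dataset versions.

```python
import pandas as pd

v1 = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
v2 = pd.DataFrame({"id": [2, 3, 4], "price": [20.0, 35.0, 40.0]})

# Outer merge keyed on id; the indicator column records which side each row came from.
diff = v1.merge(v2, on="id", how="outer", suffixes=("_old", "_new"), indicator=True)

added = diff[diff["_merge"] == "right_only"]    # new in v2
removed = diff[diff["_merge"] == "left_only"]   # dropped from v2
changed = diff[(diff["_merge"] == "both") & (diff["price_old"] != diff["price_new"])]

print("added:", added["id"].tolist())
print("removed:", removed["id"].tolist())
print("changed:", changed["id"].tolist())
```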

What is a data catalog?

A Data Catalog is an organized inventory of data assets, created through the discovery, description, and classification of datasets. It provides context and metadata, enabling data analysts, data scientists, data stewards, and other data consumers to easily find and understand relevant datasets. This facilitates the extraction of business value from the data.

What is a data dictionary?

A Data Dictionary is a collection of names, definitions, and attributes of data elements used in a database, information system, or research project. It describes the meanings and purposes of data elements, providing guidance on their interpretation and representation. Additionally, it includes metadata about these data elements, ensuring clarity and consistency in their use within the context of a project.
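
For illustration, a single data dictionary entry might look something like the following; the fields shown are a common pattern, not a standard.

```python
# One illustrative data dictionary entry for a column in an "orders" table.
order_amount_entry = {
    "name": "order_amount",
    "table": "orders",
    "type": "DECIMAL(10, 2)",
    "description": "Total order value in USD, including tax, excluding shipping.",
    "allowed_range": {"min": 0, "max": 100000},
    "nullable": False,
    "owner": "sales-data-team",        # hypothetical data steward
    "example_values": [19.99, 250.00],
}
```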

What is metadata management?

Metadata Management is an organization-wide agreement on how to describe information assets. With the radical growth in data volumes, metadata management is critical for finding the right data as quickly as possible and increasing the ROI on data.

What is data lineage?

Data lineage traces the transformations that a dataset has gone through since the time of origination. It describes a certain dataset’s origin, movement, characteristics, and quality. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.

What is ETL?

ETL (Extract, Transform, Load) is a data integration process that involves extracting data from various sources, transforming it into a suitable format, and loading it into a target database or data warehouse. This process ensures that data is consolidated, cleaned, and optimized for analysis and reporting.
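
A minimal ETL sketch with pandas and SQLite; the source file, transformations, and target table are hypothetical.

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a source file (path is hypothetical).
raw = pd.read_csv("raw_orders.csv")

# Transform: clean and reshape the data for analysis.
transformed = (
    raw.dropna(subset=["order_id"])
       .assign(order_date=lambda d: pd.to_datetime(d["order_date"]),
               country=lambda d: d["country"].str.upper())
       .drop_duplicates(subset="order_id")
)

# Load: write the curated table into the target database (here, a local SQLite file).
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("orders", conn, if_exists="replace", index=False)
```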

What is data governance?

Data governance is a set of principles and practices that ensure high data quality throughout the complete lifecycle of your data. According to the Data Governance Institute (DGI), it is a practical and actionable framework to help a variety of data stakeholders across any organization identify and meet their information needs.

What is data architecture?

Data architecture describes the structure of an organization’s logical and physical data assets and data management resources, according to The Open Group Architecture Framework (TOGAF).

What is a data warehouse?

A data warehouse is a centralized repository for an organization’s collected data. It involves the processes of extracting, transforming, and loading (ETL) data, or simply extracting and loading data. This curated data repository supports reporting, analytics, and data mining, enabling comprehensive data analysis and decision-making.

What is a data lake?

A Data Lake is a storage repository that holds vast amounts of raw data in its native format until needed. Unlike data warehouses, data lakes support structured, semi-structured, and unstructured data. Additionally, the data schema is applied upon reading rather than during loading or writing, allowing for greater flexibility and agility.

What is a data lakehouse?

A Data Lakehouse combines the best elements of data warehouses and data lakes. It integrates the data structures and management features of data warehouses with the cost-effective storage of data lakes. This hybrid approach supports structured, semi-structured, and unstructured data, enabling machine learning and business intelligence. Data lakehouses offer flexibility, scalability, and cost efficiency, making them valuable for data scientists and analysts.

What is data mesh?

A data mesh is an architectural paradigm that connects data from distributed sources, locations, and organizations, making data from multiple data silos highly available, secure and interoperable by abstracting away the complexities of connecting, managing and supporting access to data.

What is Lambda Architecture?

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
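
A toy sketch of the idea (not a production implementation): the serving layer answers queries by merging a periodically recomputed batch view with a speed-layer view of recent events; the page-view counts and names are made up.

```python
# Batch layer: page-view counts recomputed periodically over all historical events
# (complete but slightly stale).
batch_view = {"home": 10_000, "pricing": 2_500}

# Speed layer: incremental counts for events that arrived after the last batch run
# (fresh but partial).
speed_view = {"home": 42, "signup": 7}


def query(page: str) -> int:
    """Serving layer: combine both views to answer with low latency and full history."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)


print(query("home"))    # 10042
print(query("signup"))  # 7
```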

What is Kappa Architecture?

Kappa architecture is a simplification of Lambda architecture. It is a data processing architecture designed for stream-based processing, in which incoming data is streamed through a real-time layer and the results are placed in the serving layer for queries.

What are data formats?

Data formats refer to the structure and organization of data for storage, processing, and retrieval. Common data formats include structured formats like CSV and SQL databases, semi-structured formats like JSON and XML, and unstructured formats like text and multimedia files.

These formats determine how data is encoded and accessed within data lakes, warehouses, and lakehouses.
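
To illustrate the difference, the snippet below writes the same record as flat CSV and as nested JSON using only the standard library.

```python
import csv
import io
import json

record = {"id": 7, "name": "Ada", "tags": ["analytics", "ml"]}

# Structured/tabular: CSV has flat columns, so the nested list must be flattened.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "tags"])
writer.writeheader()
writer.writerow({**record, "tags": ";".join(record["tags"])})
print(buf.getvalue())

# Semi-structured: JSON keeps the nested structure as-is.
print(json.dumps(record, indent=2))
```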

What are open table formats?

Open table formats are standardized data storage formats that allow data to be easily accessed, shared, and used across different systems and platforms. Examples include Apache Iceberg, Delta Lake, and Apache Hudi. These formats support features like ACID transactions, schema evolution, and time travel, making them ideal for large-scale data processing in data lakes and lakehouses. Open table formats ensure interoperability and flexibility, enhancing data management and analytics capabilities.
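
As one hedged example using Delta Lake with PySpark (assuming the delta-spark dependencies are installed and on the classpath; paths and data are illustrative), a table can be written once and then read back as of an earlier version ("time travel").

```python
from pyspark.sql import SparkSession

# Enable Delta Lake support on the Spark session (requires the Delta jars/package).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a small table in the Delta format (an ACID, versioned open table format).
events = spark.createDataFrame([(1, "signup"), (2, "login")], ["user_id", "event"])
events.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Time travel: read the table as of an earlier version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events_delta")
v0.show()
```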

What is data science?

Data science is the study of data that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Why is data science important?

Many small as well as large enterprises are striving to be data-driven organizations. Data is an invaluable asset on which key strategic and business decisions can be based. This is where data science can provide many answers. With the radical increase in the amount of data, data science can use machine learning algorithms and AI to predict, recommend, and provide actionable insights that deliver high ROI on data initiatives.

Where is data science used?

Some of the ways data science can be used:

  • Prediction
  • Suggestions
  • Forecast
  • Recognition
  • Insights
  • Anomaly detection
  • Pattern detection
  • Decision making