Data Wiki

Welcome to Telmai’s Data Wiki: commonly used terms from the data world, all in one place.

What is Observability?

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

Observability is the ability to infer the internal state of a system from its external outputs. In the context of software systems, observability refers to the practice of making a system’s internal state visible through metrics, logging, and tracing. This allows developers and operators to understand the behavior of a system, diagnose problems, and make informed decisions about how to improve it.
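As a minimal, illustrative sketch (not tied to any particular product), a service can make its internal state visible by emitting structured log events and simple metrics such as latency. The service and event names below are made up.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orders-service")  # hypothetical service name

def process_order(order_id: str) -> None:
    """Illustrative unit of work that emits a structured log event and a latency metric."""
    start = time.perf_counter()
    # ... business logic would run here ...
    duration_ms = (time.perf_counter() - start) * 1000

    # The structured log line is an external output from which internal state can be inferred.
    logger.info(json.dumps({
        "event": "order_processed",
        "order_id": order_id,
        "duration_ms": round(duration_ms, 2),
        "status": "success",
    }))

process_order("A-1001")
```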

What is data Observability?

Data observability refers to the ability to gain insights and understand the behavior, quality, and performance of data as it flows through a system or process. It encompasses the monitoring, tracking, and analysis of data in real-time to ensure its reliability, accuracy, and compliance with desired standards.

Data observability involves capturing and analyzing different aspects of data, including its structure, content, lineage, transformation, and dependencies. It aims to answer questions such as:
Data Quality: Is the data accurate, complete, and consistent? Does it adhere to predefined quality standards and business rules?
Data Flow: How does data move through different systems, processes, and transformations? Are there any bottlenecks or issues affecting the data flow?
Data Dependencies: What are the relationships and dependencies between different data elements or entities? How do changes in one data source or system impact downstream processes?
Data Anomalies: Are there any abnormalities, outliers, or unexpected patterns in the data that need attention? Are there any data-related issues or errors affecting the overall data integrity?
Data Compliance: Does the data comply with relevant regulations, policies, and privacy requirements? Are there any potential data breaches or security vulnerabilities?
To achieve data observability, organizations utilize a combination of monitoring tools, data pipelines, data quality checks, and data governance practices. They may employ techniques such as data profiling, data lineage tracking, data monitoring, and data validation to gain insights into the behavior and quality of their data.
Data observability helps organizations identify and address data issues in real time, enabling them to make informed decisions, troubleshoot problems, and maintain the reliability and integrity of their data assets. It plays a vital role in ensuring that data is trustworthy, actionable, and supports effective data-driven decision-making processes.
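As a small illustration of one of those techniques, data profiling, here is a minimal sketch using pandas; the dataset and column names are made up for the example.

```python
import pandas as pd

# Hypothetical dataset; in practice this would come from a warehouse table or file.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "email": ["a@x.com", "b@x.com", None, "d@x.com", "e@x.com"],
    "amount": [10.0, 25.5, 25.5, -3.0, 100.0],
})

# Basic profile: null rate, distinct count, and type per column.
profile = pd.DataFrame({
    "null_rate": df.isna().mean(),
    "distinct_count": df.nunique(dropna=True),
    "dtype": df.dtypes.astype(str),
})
print(f"rows: {len(df)}")
print(profile)
```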

What is data monitoring?

Data monitoring is a practice in which data is continuously checked against pre-defined or ML-calculated data metrics and acceptable thresholds, so that issues can be surfaced. The output of monitoring is usually an alert or a notification, but it could also be an automated action.
For example: notify a user when a defined policy fails.
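A minimal sketch of such a policy check, assuming an illustrative null-rate metric, a hand-picked threshold, and a stand-in notification function:

```python
def notify(message: str) -> None:
    # Stand-in for an email, Slack, or pager integration.
    print(f"ALERT: {message}")

def check_null_rate(null_rate: float, threshold: float = 0.05) -> None:
    """Alert when the observed null rate for a column exceeds the accepted threshold."""
    if null_rate > threshold:
        notify(f"null rate {null_rate:.1%} exceeds threshold {threshold:.1%}")

check_null_rate(null_rate=0.12)  # exceeds the threshold, fires an alert
check_null_rate(null_rate=0.01)  # within the accepted range, no alert
```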

Why is data monitoring important?

Data monitoring is a process that maintains a high, consistent standard of data quality. Routinely monitoring data at the source or at ingestion allows organizations to avoid resource-intensive pre-processing of data before it is moved.

How is data observability different from data quality?

Data observability refers to the ability to understand and monitor the behavior of data within a system, while data quality refers to the accuracy, completeness, consistency and reliability of the data.
Data observability lets you see how data flows through the system and identify any errors or discrepancies in the data. It enables you to detect problems early, understand the overall health and performance of the system, and take action accordingly.
Data quality, on the other hand, is more focused on ensuring that the data is accurate, complete, and consistent. This includes identifying and correcting errors in the data, removing duplicate or inconsistent information, and ensuring that data is entered and stored in a consistent format. Both provide visibility into the health of the data and can detect data quality issues against predefined metrics and known policies.
Data observability takes data quality further by monitoring for anomalies and business KPI drifts. Employing ML has made data observability tools more intelligent, with lower maintenance and total cost of ownership (TCO) than traditional data quality approaches.

What is the difference between data outliers and data drifts?

Data outliers and data drifts are both terms used to describe unusual or abnormal data points, but they refer to different types of anomalies.
Data outliers are data points that are significantly different from the other data points in the dataset. These data points can be caused by measurement errors, data entry errors, or other issues, and they can skew the overall statistics and patterns of the data.
Outliers are often identified by statistical methods such as mean, standard deviation, and quantiles, and can be removed or handled in different ways.
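For illustration, here is a minimal sketch of the quantile-based (interquartile range) rule using only Python's standard library; the values are made up:

```python
import statistics

# Illustrative measurements; one value is clearly out of line with the rest.
values = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 42.0]

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [42.0]
```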

Data drifts, on the other hand, refer to changes in the statistical properties of a dataset over time. These changes can be caused by changes in the underlying process, such as changes in the data collection methods or changes in the system.
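One common way to detect such drift is to compare a baseline window of data against a more recent window with a statistical test. The sketch below uses the two-sample Kolmogorov-Smirnov test from SciPy (assumed to be installed) on synthetic data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Baseline window vs. a newer window whose distribution has shifted upward.
baseline = rng.normal(loc=100.0, scale=10.0, size=1000)
current = rng.normal(loc=110.0, scale=10.0, size=1000)

# A low p-value suggests the two samples come from different distributions.
stat, p_value = ks_2samp(baseline, current)
if p_value < 0.01:
    print(f"possible drift detected (KS statistic={stat:.3f}, p={p_value:.2e})")
```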

How to find Anomalies in Data?

There are several techniques for finding anomalies or unusual data points within a dataset, some of which include:
Statistical methods: use statistical properties of the data such as mean, standard deviation, and quantiles to identify unusual data points. For example, data points that fall outside of a certain number of standard deviations from the mean are considered outliers.
Clustering: use techniques such as k-means, density-based clustering, or hierarchical clustering to group data points together and identify data points that do not belong to any cluster.
Classification: train a classifier to learn patterns and behaviors in the data, and use it to identify data points that do not conform to those patterns.
Rule-based methods: use a set of predefined rules to identify anomalies, such as looking for data points that fall outside of a certain range or that do not conform to certain constraints.
Machine Learning: use machine learning algorithms such as Autoencoder, Isolation Forest, Local Outlier Factor, which are designed to find anomalies in a dataset.
Finding anomalies in data is not a one-time task; it is an ongoing process that requires continuous monitoring, updating, and maintenance to ensure that the data remains accurate, complete, and consistent over time.
Data Observability tools like Telmai are natively designed to automatically find anomalies in data.
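For illustration, here is a minimal sketch of one of the machine learning approaches listed above, Isolation Forest, using scikit-learn (assumed to be installed) on synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Mostly "normal" two-dimensional points plus a few injected anomalies.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, anomalies])

# contamination is the assumed share of anomalies; it would be tuned per dataset.
model = IsolationForest(contamination=0.03, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal

print(f"flagged {np.sum(labels == -1)} of {len(X)} points as anomalies")
```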

What is Data Anomaly Monitoring?

Data anomaly monitoring is the process of continuously monitoring data to detect unusual or abnormal data signals, also known as anomalies. Anomalies can indicate problems with the data pipeline or data collection process, or they can reveal valuable insights about changes in the underlying business or process.

What is Data Quality?

Data quality refers to the accuracy, completeness, consistency, timeliness, uniqueness, and reliability of data. It is the degree to which data meets the requirements for its intended use. Data quality is important for every organization, as it enables informed decisions based on reliable data. There are multiple techniques and tools available to measure, monitor, and improve data quality, including data profiling, data cleansing, data validation, data standardization, and data governance. Data quality is an ongoing process that requires continuous monitoring, updating, and maintenance to ensure that the data remains accurate, complete, and consistent over time. Data observability has become the foundational layer for data quality.

What are the characteristics of data quality?

Data quality characteristics refer to the metrics that help measure data quality, such as the accuracy, completeness, consistency, timeliness, uniqueness, and reliability of the data.
It is the degree to which data meets the requirements for its intended use.
Data quality is important for every organization, as it enables informed decisions based on reliable data. There are several dimensions of data quality that can be considered, such as:
Accuracy: the degree to which the data is correct and free of errors.
Completeness: the degree to which the data is complete and does not have missing values.
Consistency: the degree to which the data is consistent within or across different sources or systems.
Timeliness: the degree to which the data is up-to-date and relevant.
Validity: the degree to which the data conforms to business rules or constraints.
Uniqueness: the degree to which the data is unique and non-redundant.
Different organizations prioritize the requirements that define data quality based on the need, usage, and the life cycle of the processes that use it.
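As an illustration, several of these dimensions can be expressed as simple metrics over a table. The sketch below uses pandas; the columns, patterns, and thresholds are made up for the example:

```python
import pandas as pd

# Hypothetical customer records; column names are illustrative.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", "b@x.com", None, "not-an-email"],
    "updated_at": pd.to_datetime(["2024-01-02", "2024-01-03", "2022-06-01", "2024-01-04"]),
})

metrics = {
    # Completeness: share of non-null values in a required column.
    "email_completeness": df["email"].notna().mean(),
    # Uniqueness: share of rows whose key is not a duplicate.
    "customer_id_uniqueness": 1 - df["customer_id"].duplicated().mean(),
    # Validity: share of emails matching a simple pattern.
    "email_validity": df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean(),
    # Timeliness: share of rows updated within the last 365 days (as of a fixed date).
    "freshness": ((pd.Timestamp("2024-01-05") - df["updated_at"]) < pd.Timedelta(days=365)).mean(),
}
print(metrics)
```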
More on topic: https://docs.telm.ai/academy/

Is Data Lineage part of data quality?

Data lineage is related to data quality in that it is a key aspect of understanding and managing the data.
Data lineage refers to the ability to track data as it flows through the system, including where it came from, how it was transformed, where it is stored, and how it is used. This information is important for understanding the data and its quality, as it allows you to trace any issues or errors back to their source and understand how the data has been used.
Data lineage is also a key aspect of data governance: it allows organizations to understand the data and its quality, and to ensure that it is being used correctly and effectively.
Data lineage information can also be used to improve data quality by identifying and correcting errors, and by ensuring that data is entered and stored in a consistent format.
Additionally, understanding data lineage can help organizations to identify and remove duplicate or inconsistent data and to ensure that data is of sufficient quality and granularity to meet the needs of the system or application.
In summary, data lineage is not exactly part of data quality but is a key aspect of understanding and managing data, and it is closely related to data quality in that it can help organizations to improve the accuracy, completeness, consistency, and reliability of their data.

What are data quality checks?

Data quality checks are a set of procedures and methods used to evaluate and improve the quality of data.
These checks are used to ensure that data is accurate, complete, consistent, and reliable. There are several types of data quality checks that can be performed, some of which include:
Data validation: checks that data conforms to predefined rules and constraints, such as data types, formats, and ranges.
Data profiling: examines the data and generates statistics and summaries about the data, such as data types, missing values, and outliers.
Data standardization: ensures that data is represented and stored in a consistent format, such as standardizing dates, addresses, and names.
Data cleansing: removes or corrects errors and inconsistencies in the data, such as correcting misspellings, removing duplicate data, and filling in missing values.
Data governance: ensures that data is being used correctly and effectively by establishing policies, procedures, and controls for managing data.
Data monitoring: continuous monitoring of data to detect errors, inconsistencies, and issues, and to identify patterns and trends in the data.
These checks can be automated or manual, performed in batch mode or in real time, and applied to different types of data, such as numerical data, categorical data, and time-series data.
Data observability has enabled the use of ML and statistical analysis to automate the process of defining and maintaining DQ checks. Many DQ checks can now be automatically addressed by data observability tools like Telmai.
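For illustration, a few of these checks can be expressed in a handful of lines with pandas; the columns, rules, and defaults below are made up for the example:

```python
import pandas as pd

# Hypothetical order records; column names and rules are illustrative.
df = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "amount": [25.0, -3.0, 19.99, None],
    "country": [" us", "US", "us ", "DE"],
})

# Data validation: amounts must be present and non-negative.
invalid = df[df["amount"].isna() | (df["amount"] < 0)]
print(f"rows failing the amount rule: {len(invalid)}")

# Data standardization: normalize country codes to one consistent representation.
df["country"] = df["country"].str.strip().str.upper()

# Data cleansing: remove duplicate keys and fill missing amounts with a default.
df = df.drop_duplicates(subset="order_id").fillna({"amount": 0.0})
print(df)
```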

What causes poor data quality?

Poor data quality can be caused by a variety of factors, some of which include:
Data entry errors: human errors during data entry, such as typos, omissions, and incorrect data entry, can lead to inaccuracies in the data.
Data collection errors: errors or inaccuracies in the data collection process, such as faulty equipment or inaccurate measurement, can lead to poor quality data.
Data Pipeline issues: technical problems or bugs in the Data pipeline used to transfer, store or process data can lead to errors or inconsistencies in the data.
Lack of data governance: a lack of policies, procedures, and controls for managing data can lead to poor data quality, as there may be no clear guidelines for how data should be entered, stored, and used.
Lack of data monitoring: failing to monitor data for errors, inconsistencies, and issues can lead to poor data quality over time.
Lack of standardization: data that is not represented and stored in a consistent format can lead to confusion, errors, and inconsistencies.
Data integration issues: merging data from multiple sources can lead to inconsistencies and inaccuracies if the data is not properly integrated and cleaned.

What impact does poor data quality have?

Bad data quality can have a significant impact on organizations in multiple ways, some of which include:
Inaccurate or unreliable decision making: poor data quality can lead to incorrect or incomplete information being used to make decisions, resulting in poor outcomes.
Reduced productivity and efficiency: bad data quality can lead to wasted time and resources, as employees may have to spend time correcting errors or searching for missing information.
Increased costs: poor data quality can lead to increased costs, such as the cost of correcting errors or re-doing work that was based on incorrect information.
Loss of trust and credibility: bad data quality can damage an organization’s reputation and lead to loss of trust from clients, customers, and other stakeholders.
Compliance issues: bad data quality can lead to non-compliance with regulations and laws, such as GDPR, HIPAA, and SOX.
Inability to effectively use data analytics and business intelligence: poor data quality can make it difficult to extract insights from data and to use data to improve decision making.
Poor ML models: bad data quality can have a significant impact on machine learning models, reducing their accuracy, making them more complex and harder to maintain, and making it difficult to train, evaluate, and interpret them. It is important to ensure that data is of good quality, accurate, and unbiased before using it in any ML model.
Difficulty in integrating data from different systems: poor data quality can make it difficult to merge data from multiple sources, which can limit the insights that can be gained from the data.

In summary, bad data quality can have a wide-reaching and negative impact on an organization and can lead to wasted resources, poor decision making, decreased productivity and profitability, damage to reputation, and non-compliance with regulations and laws.

What is difference between Data Governance and Data quality?

Data quality is closely related to data governance, as both are concerned with managing and ensuring the integrity of data.

Data governance is the overall management of data within an organization, including policies, procedures, and controls for managing data. It covers the entire data lifecycle, from data creation, to data storage, and data disposal. Data governance includes data management, data quality, data security, data privacy and data compliance.
Data quality, on the other hand, is the degree to which data meets the requirements for its intended use. It is the measure of how well data is fit for the purpose it was collected for. Data quality includes aspects such as accuracy, completeness, consistency, and reliability of data.

What is data integrity and why is it important?

Data integrity refers to the completeness, accuracy, consistency and reliability of data over its entire lifecycle. It ensures that data is accurate, consistent, and complete and that it is protected against unauthorized access or alteration. Data integrity is important for organizations because it enables them to make informed decisions, to ensure that their systems are running efficiently and effectively, and to protect against data breaches and other security threats.
Here are some reasons why data integrity is important:
Informed decision making: accurate and consistent data is necessary to make informed decisions, and poor data integrity can lead to incorrect or incomplete information being used to make decisions.
Data security: data integrity is necessary to protect against data breaches and other security threats, as well as to ensure compliance with regulations and laws.
Compliance: data integrity is important for compliance with regulations such as GDPR, HIPAA, and SOX, which require organizations to protect personal data and ensure its accuracy.
Business continuity: accurate and consistent data is necessary to ensure that systems are running efficiently and effectively, and to protect against data loss or corruption.
Reputation: poor data integrity can damage an organization’s reputation and lead to loss of trust from clients, customers, and other stakeholders.
In summary, data integrity enables organizations to make informed decisions, to ensure that their systems are running efficiently and effectively, and to protect against data breaches and other security threats.
It also allows organizations to comply with regulations and laws, and to maintain their reputation and trust from clients and customers.

What is a Data warehouse?

A data warehouse is a central repository for all data that is collected in an organization’s business systems. Data can be extracted, transformed, and loaded (ETL) or extracted and loaded into a warehouse which then supports reporting, analytics and mining on this extracted and curated data.

What is a Data Lake?

A data lake is a storage repository that holds large amounts of raw data in native format until it is needed. Data lakes address the shortcomings of data warehouses in two ways. First, the data can be stored in structured, semi-structured, or unstructured format. Second, the data schema is decided when the data is read (schema-on-read) rather than when it is loaded or written, and it can be changed for more agility.

What is a Data Lake House?

A data lakehouse is a data solution concept that combines the best elements of the data warehouse with those of the data lake. Data lakehouses implement the data structures and management features of data warehouses on top of the low-cost storage typically used for data lakes. Data lakehouses have proven quite useful to data scientists, as they enable both machine learning and business intelligence.

What is a data catalog?

Gartner definition: A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value.

What is a Data dictionary?

A Data Dictionary is a collection of names, definitions, and attributes about data elements that are used or captured in a database, information system, or part of a research project. It describes the meanings and purposes of data elements within the context of a project and provides guidance on interpretation, accepted meanings, and representation. A Data Dictionary also provides metadata about data elements.

What is metadata management?

Metadata management is an organization-wide agreement on how to describe information assets. With the radical shift in the amount of data, metadata management is critical to help derive the right data as quickly as possible and to increase the ROI on data.

What is data lineage?

Data lineage traces the transformations that a dataset has gone through since the time of origination. It describes a dataset’s origin, movement, characteristics, and quality. Data lineage gives visibility while greatly simplifying the ability to trace errors back to their root cause in a data analytics process.

What is ETL?

ETL (extract, transform, load) is the process of extracting data from source systems, transforming it into a consistent, curated format, and loading it into a target system such as a data warehouse. A related pattern, ELT, loads the raw data first and performs the transformations inside the target system.
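A minimal sketch of the pattern, using pandas with an in-memory extract and SQLite as a stand-in for a real source and warehouse; the table and column names are illustrative:

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a source (inlined here; think pd.read_csv("source_extract.csv")).
raw = pd.DataFrame({
    "customer": ["  Alice ", "Bob", "Bob"],
    "amount": ["10.5", "20", "20"],
})

# Transform: clean, type-cast, and deduplicate into the target shape.
transformed = (
    raw.assign(
        customer=raw["customer"].str.strip(),
        amount=pd.to_numeric(raw["amount"]),
    )
    .drop_duplicates()
)

# Load: write the curated records into the warehouse (SQLite as a stand-in).
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("orders", conn, if_exists="replace", index=False)
```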

What is data governance?

Data governance is a set of principles and practices that ensure high quality through the complete lifecycle of your data. According to the Data Governance Institute (DGI), it is a practical and actionable framework to help a variety of data stakeholders across any organization identify and meet their information needs.

What is Data Architecture?

Data architecture describes the structure of an organization’s logical and physical data assets and data management resources, according to The Open Group Architecture Framework (TOGAF).

What is Data Mesh?

A data mesh is an architectural paradigm that connects data from distributed sources, locations, and organizations, making data from multiple data silos highly available, secure and interoperable by abstracting away the complexities of connecting, managing and supporting access to data.

What is Lambda Architecture?

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.

What is Kappa Architecture?

Kappa architecture is a simplification of Lambda architecture. It is a data processing architecture designed to handle stream-based processing methods where incoming data is streamed through a real-time layer and the results of which are placed in the serving layer for queries.

What is data science?

Data science is the study of data that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Why is Data Science important?

Many small as well as large enterprises are striving to be data-driven organizations. Data is an unmistakable asset that key strategic and business decisions can be based on. This is where data science can provide many answers. With a radical increase in the amount of data, using machine learning algorithms and AI, data science can predict, recommend, and provide functional insights to deliver high ROI on data initiatives.

Where is Data Science used?

Some of the ways Data Science is used:

  • Prediction
  • Suggestions
  • Forecast
  • Recognition
  • Insights
  • Anomaly detection
  • Pattern detection
  • Decision making