Data Wiki

Welcome to Telmai's Data Wiki: commonly used terms in the data world, all in one place.

Data Observability

01 | What is Observability?

In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

02 | What is data Observability?

Data observability is a set of measures that help predict and identify data issues through their external symptoms. It goes beyond traditional monitoring capabilities: using statistical and machine learning tools, it tracks various data metrics as well as the timing and sources of data issues, helps troubleshoot and investigate them, and so reduces the time that data remains unreliable.

The goal of such a system is to reduce the mean time to detect (MTTD) and mean time to resolve (MTTR) data issues.
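
As a rough illustration (not Telmai-specific), here is a minimal Python sketch of how MTTD and MTTR might be computed from hypothetical incident records; the timestamps and the convention of measuring MTTR from detection to resolution are assumptions made for the example.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when each data issue started, was detected, and was resolved.
incidents = [
    {"started": datetime(2022, 5, 1, 8, 0), "detected": datetime(2022, 5, 1, 9, 30), "resolved": datetime(2022, 5, 1, 12, 0)},
    {"started": datetime(2022, 5, 3, 14, 0), "detected": datetime(2022, 5, 3, 14, 20), "resolved": datetime(2022, 5, 3, 16, 0)},
]

# MTTD: average time from an issue occurring to it being detected.
mttd_hours = mean((i["detected"] - i["started"]).total_seconds() for i in incidents) / 3600
# MTTR: average time from detection to resolution (one common convention).
mttr_hours = mean((i["resolved"] - i["detected"]).total_seconds() for i in incidents) / 3600

print(f"MTTD: {mttd_hours:.1f} h, MTTR: {mttr_hours:.1f} h")
```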

03 | What is data monitoring?

Data monitoring is a practice in which pre-defined data metrics are continuously checked against acceptable thresholds in order to alert on issues, for example notifying a user when an agreed policy fails.
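
As an illustration, here is a minimal sketch of such a threshold-based check in Python with pandas; the column name, the 5% threshold, and alerting via print are hypothetical choices, not a prescribed policy.

```python
import pandas as pd

def check_null_rate(df: pd.DataFrame, column: str, max_null_rate: float) -> bool:
    """Return True if the column's null rate stays within the acceptable threshold."""
    null_rate = df[column].isna().mean()
    if null_rate > max_null_rate:
        print(f"ALERT: {column} null rate {null_rate:.0%} exceeds threshold {max_null_rate:.0%}")
        return False
    return True

# Hypothetical policy: no more than 5% of e-mail addresses may be missing.
customers = pd.DataFrame({"email": ["a@example.com", None, "b@example.com", None, None]})
check_null_rate(customers, "email", max_null_rate=0.05)  # fires the alert (60% nulls)
```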

04 | Why is data monitoring important?

Data monitoring is a process that maintains a high, consistent standard of data quality. Routinely monitoring data at the source or at ingestion allows organizations to avoid resource-intensive pre-processing of data before it is moved.

05 | How is data observability different from data quality?

Data quality is one aspect of monitoring data, whereas data observability is an umbrella term that covers monitoring the quality of data, tracing the sources of data discrepancies, and providing a platform to troubleshoot data issues.

06 | What is Anomaly Detection?

Anomaly detection is a part of data observability that identifies outliers that deviate from a dataset’s normal behavior. Also known as outlier detection, it is the identification of certain data points which raise suspicions by differing significantly from the majority of the data.
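
One of many possible techniques is a simple z-score test; the sketch below flags points that deviate from the mean by more than a chosen number of standard deviations. The daily row counts and the threshold of 2 are made-up values for illustration.

```python
import numpy as np

def zscore_outliers(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag points that deviate from the mean by more than `threshold` standard deviations."""
    mean, std = values.mean(), values.std()
    return np.abs(values - mean) > threshold * std

# Hypothetical daily row counts for a table; the fifth day looks suspicious.
daily_row_counts = np.array([10_020, 9_980, 10_050, 9_990, 2_100, 10_010])
print(zscore_outliers(daily_row_counts, threshold=2.0))
# [False False False False  True False]
```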

Data Quality

01 | What are the characteristics of data quality?

At a very high level, the following characteristics define the quality of data:

  • Completeness: how complete is the data?
  • Accuracy: is it accurate and reliable?
  • Timeliness: is it available when needed, and is it up to date?

Different organizations prioritize the requirements that define data quality based on the need, usage, and the life cycle of the processes that use it.
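
As a rough illustration, the sketch below scores two of these characteristics, completeness and timeliness, for a small pandas table; the column names and the one-day freshness window are assumptions made for the example.

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Share of non-null cells across the whole table."""
    return 1.0 - df.isna().to_numpy().mean()

def is_timely(df: pd.DataFrame, ts_column: str, max_age: pd.Timedelta) -> bool:
    """Is the newest record recent enough to count as up to date?"""
    return (pd.Timestamp.now(tz="UTC") - df[ts_column].max()) <= max_age

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.0, None, 7.5],
    "updated_at": pd.to_datetime(["2022-06-01", "2022-06-02", "2022-06-03"], utc=True),
})
print(completeness(orders))                                   # 8 of 9 cells populated
print(is_timely(orders, "updated_at", pd.Timedelta(days=1)))  # stale relative to "now"
```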

02 | What downstream impact does poor data quality have?

Data is an important asset used to make crucial decisions. If important business decision-making processes use data that is inherently poor in quality, it creates a ripple effect on all processes that consume it. The time, effort, and cost of triaging and cleaning data at that late stage are high, and the return on that effort is low.

03 | What are some of the reasons for poor data quality?

There could be many different factors that contribute to poor data quality:

  • Human Error during data entry
  • Data collated from different sources, contributing to anomalies
  • Missing values
  • Erroneous data

04 | What are some ways data quality can be improved?

Data can get corrupted due to many different factors. With the necessary tools and processes in place, issues can be pre-empted at the beginning of the lifecycle rather than requiring late troubleshooting, which adds time and cost. Some of the ways data quality can be improved are listed below, with a brief sketch of the last two after the list:

  • Implementing a data anomaly detection tool to catch issues that could break the system
  • Unit testing processes
  • Business rules
  • Profiling
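
As an illustration of unit testing and profiling, here is a minimal sketch of a simple column profile and a unit-test-style rule check using pandas; the table, columns, and business rules are hypothetical.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Basic column profile: null rate and distinct-value count per column."""
    return pd.DataFrame({"null_rate": df.isna().mean(), "distinct": df.nunique()})

def check_business_rules(df: pd.DataFrame) -> None:
    """Unit-test-style assertions encoding simple business rules."""
    assert df["order_id"].is_unique, "order_id must be unique"
    assert (df["amount"].dropna() >= 0).all(), "amount must be non-negative"

orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.0, 7.5]})
print(profile(orders))
check_business_rules(orders)  # raises AssertionError if a rule is violated
```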

Data Management

01 | What is a Data warehouse?

A data warehouse is a central repository for all data that is collected in an organization's business systems. Data can be extracted, transformed, and loaded (ETL) or extracted and loaded into a warehouse which then supports reporting, analytics and mining on this extracted and curated data.

02 | What is a Data Lake?

A data lake is a storage repository that holds large amounts of raw data in its native format until it is needed. Data lakes address the shortcomings of data warehouses in two ways. First, the data can be stored in structured, semi-structured, or unstructured format. Second, the schema is decided when the data is read (schema-on-read) rather than when it is loaded or written, and it can be changed for greater agility.
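
A minimal illustration of schema-on-read: the raw JSON records below are stored without an enforced schema, and the column set and types are only decided at read time. The field names and values are hypothetical.

```python
import json
import pandas as pd

# Raw, semi-structured records as they might sit in a data lake; no schema is enforced on write.
raw_lines = [
    '{"user_id": "42", "event": "click", "ts": "2022-06-01T10:00:00"}',
    '{"user_id": "43", "event": "purchase", "ts": "2022-06-01T10:05:00", "amount": "19.99"}',
]

# Schema-on-read: column types are decided here, at read time, and can be changed later.
events = pd.DataFrame([json.loads(line) for line in raw_lines])
events["user_id"] = events["user_id"].astype(int)
events["ts"] = pd.to_datetime(events["ts"])
events["amount"] = pd.to_numeric(events["amount"], errors="coerce")
print(events.dtypes)
```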

03 | What is a Data Lake House?

A data lakehouse is a data solution concept that combines the best elements of the data warehouse with those of the data lake. Data lakehouses implement the data structures and management features of data warehouses on top of the low-cost storage typically used for data lakes. Lakehouses have proven quite useful to data scientists, as they enable both machine learning and business intelligence on the same data.

04 | What is a data catalog?

Gartner definition: A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value.

05 | What is a Data dictionary?

A Data Dictionary is a collection of names, definitions, and attributes about data elements that are used or captured in a database, information system, or part of a research project. It describes the meanings and purposes of data elements within the context of a project and provides guidance on interpretation, accepted meanings, and representation. A Data Dictionary also provides metadata about data elements.
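
For illustration, a couple of hypothetical data dictionary entries might look like the sketch below; the fields shown (name, definition, type, allowed values, source system) are a common but not mandated set.

```python
# Hypothetical data dictionary entries for two elements of an "orders" table.
data_dictionary = [
    {
        "name": "order_id",
        "definition": "Unique identifier assigned to an order at creation",
        "type": "integer",
        "allowed_values": "positive, unique",
        "source_system": "orders_db",
    },
    {
        "name": "order_status",
        "definition": "Current state of the order in its lifecycle",
        "type": "string",
        "allowed_values": "created | paid | shipped | cancelled",
        "source_system": "orders_db",
    },
]
```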

06 | What is metadata management?

Metadata management is an organization-wide agreement on how to describe information assets. With the radical growth in data volumes, metadata management is critical for surfacing the right data as quickly as possible and increasing the ROI on data.

07 | What is data lineage?

Data lineage traces the transformations that a dataset has gone through since its origination. It describes a dataset's origin, movement, characteristics, and quality. Data lineage gives visibility and greatly simplifies tracing errors back to their root cause in a data analytics process.
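
As a toy illustration, lineage can be represented as a graph from each dataset to its upstream sources; the sketch below traces a hypothetical dashboard table back to its raw origin. The dataset names are made up.

```python
# Hypothetical lineage: each dataset maps to the upstream datasets it was derived from.
lineage = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
}

def upstream(dataset: str) -> list[str]:
    """Trace a dataset back through all of its upstream origins."""
    sources = []
    for parent in lineage.get(dataset, []):
        sources.append(parent)
        sources.extend(upstream(parent))
    return sources

print(upstream("revenue_dashboard"))  # ['daily_revenue', 'orders_clean', 'orders_raw']
```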

08 | What is ETL?

ETL (extract, transform, load) is a data integration process in which data is extracted from one or more source systems, transformed (cleaned, standardized, and reshaped) before it reaches the target, and then loaded into a central repository such as a data warehouse to support reporting and analytics.
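
A minimal ETL sketch in Python, using in-memory SQLite databases as stand-ins for a hypothetical source system and warehouse; the table names and cleaning steps are illustrative only.

```python
import sqlite3
import pandas as pd

# In-memory SQLite databases stand in for a hypothetical source system and warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
pd.DataFrame({"order_id": [1, 2, None], "amount": ["10.0", "5.5", "7.0"]}).to_sql(
    "orders", source, index=False
)

# Extract: pull raw records from the source system.
raw = pd.read_sql("SELECT * FROM orders", source)

# Transform: clean and type the data before it reaches the warehouse.
clean = raw.dropna(subset=["order_id"]).assign(amount=lambda d: d["amount"].astype(float))

# Load: write the curated result into the warehouse.
clean.to_sql("orders_clean", warehouse, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM orders_clean", warehouse))
```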

09 | What is ELT?

ELT is a modern alternative to ETL for massive amounts of data, where data is extracted from multiple sources, loaded once into a central data repository like a data lake, and transformed as needed by BI tools, allowing for timely access, scalability and flexibility.
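
For contrast with ETL, a minimal ELT sketch: the raw data is loaded into the repository first, and the transformation is expressed afterwards as a query inside it. The in-memory SQLite database, table names, and deduplication step are illustrative assumptions.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for a central repository such as a data lake.
lake = sqlite3.connect(":memory:")

# Load first: the raw data lands in the repository untransformed.
pd.DataFrame({"order_id": [1, 2, 2, 3], "amount": [10.0, 5.5, 5.5, 7.0]}).to_sql(
    "orders_raw", lake, index=False
)

# Transform later, inside the repository, as consumers need it (here a deduplicated BI view).
lake.execute("CREATE VIEW orders_for_bi AS SELECT DISTINCT order_id, amount FROM orders_raw")
print(pd.read_sql("SELECT * FROM orders_for_bi", lake))
```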

10 | What is data governance?

Data governance is a set of principles and practices that ensure high quality through the complete lifecycle of your data. According to the Data Governance Institute (DGI), it is a practical and actionable framework to help a variety of data stakeholders across any organization identify and meet their information needs.

Data Architecture

01 | What is Data Architecture?

Data architecture describes the structure of an organization's logical and physical data assets and data management resources, according to The Open Group Architecture Framework (TOGAF).

02 | What is Data Mesh?

A data mesh is an architectural paradigm that connects data from distributed sources, locations, and organizations, making data from multiple data silos highly available, secure and interoperable by abstracting away the complexities of connecting, managing and supporting access to data.

03 | What is Lambda Architecture?

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
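
A toy sketch of the idea: a serving layer answers queries by merging a precomputed batch view with a small real-time view of events that arrived since the last batch run. The page-count numbers are made up.

```python
# Batch view: counts precomputed over the full historical dataset (as of the last batch run).
batch_view = {"page_a": 1_200, "page_b": 300}

# Speed layer: counts derived from events streamed in since that batch run.
realtime_view = {"page_a": 15, "page_c": 4}

def serve(page: str) -> int:
    """Serving layer: merge the batch and real-time views to answer a query."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(serve("page_a"))  # 1215
print(serve("page_c"))  # 4
```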

04 | What is Kappa Architecture?

Kappa architecture is a simplification of Lambda architecture. It is a data processing architecture designed around stream-based processing, where incoming data is streamed through a real-time layer and the results are placed in the serving layer for queries.

Data Science

01 | What is data science?

Data science is the study of data that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

02 | Why is Data Science important?

Many small as well as large enterprises are striving to be data-driven organizations. Data is an unmistakable asset that key strategic and business decisions can be based on. This is where data science can provide many answers. With a radical increase in the amount of data, using machine learning algorithms and AI, data science can predict, recommend, and provide functional insights to deliver high ROI on data initiatives.

03 | Where is Data Science used?

Some of the ways data science can be used:

  • Prediction
  • Suggestions
  • Forecast
  • Recognition
  • Insights
  • Anomaly detection
  • Pattern detection
  • Decision making