In pursuit of identifying the team that is accountable for data quality, we interviewed many people across multiple companies. They answered varied questions about how their organizations are structured, who has the most knowledge of the data, who is most impacted by it, and how value can be derived from data to support the business. Summarized in one sentence, our study found that data quality ownership cannot be siloed; it needs to be democratized through a collaborative data culture. Adopting a culture around data creates a solid foundation and drives a paradigm shift in which core business decisions are led by data.
In a data culture, people, tools, and processes are structured around the sourcing, access, quality, and consumption needs of the business and operations, bringing together business owners and data engineers. Extending the core data ownership roles to business units allows for more domain-informed decision making, with the engineering team supporting the larger vision.
This leads naturally to the question of how such teams can coexist to form a solid data team.
The last few years have seen an explosion in frameworks and architectural suggestions for structuring ingestion, consumption, storage, and analytics, all striving to support quick, informed decisions on data that generate business value. Depending on the needs, size, and use cases of the organization, a couple of different team structures and layouts can be implemented. An interesting, in-depth read on this topic is a blog post by Zhamak Dehghani, in which she breaks down models that data-driven organizations could consider to create a solid foundation for all data management.
Domain Agnostic Data Ownership model
More traditional data architectures follow a linear, central, domain-agnostic data ownership approach: a centralized data ownership team manages the ingestion of data from various sources, processes the data, and then provides access to different consumers such as data scientists, analysts, and BI teams.
This may prove to be a good solution for smaller organizations with simpler domains and consumption use cases, where a central team is able to serve the needs of a handful of distinct domains using data warehouse and data lake architectures. The key here is centralized data management, which may become a hurdle for growth and an expansive data vision: the data team can be overwhelmed serving many different needs and fighting fires to produce correct, timely data for business-critical decisions.
Domain Oriented Data Ownership model
For a more complex infrastructure with many different sources of data and equally diverse sets of consumers with different needs and use cases, the centralized data ownership model will fail to keep up. Data Mesh is a strong contender here, shifting the design of how data is managed organizationally.
Data pipelines, ingestion, and storage can still be maintained by a centralized, self-serve data infrastructure platform team. Allowing data to be owned locally by product domains for their specific use cases, however, gives each domain a more independent and targeted data lifecycle: the domain dictates quality measures, metadata structures, and consumer needs, and defines the business KPIs for its product data. This reduces the turnaround time in responding to customers and the overall load on the otherwise centralized data team.
Such teams would continue to include product data owners as well as data engineers who support the needs of the product, bringing together a team with the distinct skills needed to fulfill the wider data-driven vision of the organization.
Data Quality Owners
Business and data analysts are more accustomed than engineering teams to identifying issues with data in the context of its use case. However, as described above, both are an essential part of data ownership. While the operations teams have varied consumption and domain-centered quality needs, the domain data teams can define their correctness Service Level Indicators (SLIs) on data. Here is a high-level breakdown of the needs of various operational units in a typical organization:
It is evident from our conversations that data quality cannot be achieved by technology alone. It needs the right combination of process, culture, and tools, especially tools that empower both technical and non-technical teams.
Data quality needs to be democratized across business and IT, and that democratization can be achieved when organizations focus on building a collaborative data culture.
More technologies need to focus on empowering entire data teams (both technical and non-technical) to improve data quality.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project; without it, poor data quality could affect critical business decisions, customer trust, sales, and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process: data collection, discovery and analysis, documenting the findings, and data quality monitoring.
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect to and analyze all your data without requiring you to do any prep work.
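As a rough sketch of what "a single location" can mean in practice, the example below reads two CSV exports and combines them into one list of rows, tagging each row with where it came from. The source names (`crm`, `billing`), the columns, and the `collect` helper are all hypothetical and stand in for whatever systems you actually have.

```python
import csv
import io

def collect(sources):
    """Read each CSV source and return one combined list of rows tagged with their origin."""
    combined = []
    for name, handle in sources.items():
        for row in csv.DictReader(handle):
            row["_source"] = name  # keep provenance so issues can be traced upstream
            combined.append(row)
    return combined

# In-memory stand-ins for two exported files (hypothetical data).
crm_export = io.StringIO("customer_id,zip\n1,94061\n2,94 061\n")
billing_export = io.StringIO("customer_id,zip\n2,94061\n3,10001\n")

all_rows = collect({"crm": crm_export, "billing": billing_export})
```

Keeping the `_source` tag on every row makes it easier later to flag a problem to the right upstream owner.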
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety rather than using samples, as sampling can skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
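To make structure and content discovery a little more concrete, here is a minimal sketch that profiles every row of a small in-memory table using only the Python standard library. The `infer_type` and `profile_column` helpers, the sample rows, and the metric names are assumptions made for this illustration, not the behavior of any particular profiling tool.

```python
from collections import Counter

def infer_type(value):
    """Classify a raw string value as 'null', 'int', 'float', or 'str'."""
    if value is None or value.strip() == "":
        return "null"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "str"

def profile_column(rows, column):
    """Return basic structure and content metrics for one column."""
    values = [row.get(column) for row in rows]
    types = Counter(infer_type(v) for v in values)
    non_null = [v for v in values if infer_type(v) != "null"]
    return {
        "count": len(values),
        "nulls": types.get("null", 0),
        "distinct": len(set(non_null)),
        "types": {k: n for k, n in types.items() if k != "null"},
    }

# Hypothetical sample data; a real profile would run over the full data set.
rows = [
    {"zip": "94061", "amount": "19.99"},
    {"zip": "94 061", "amount": "5"},
    {"zip": "", "amount": "n/a"},
    {"zip": "94061", "amount": "7.50"},
]

zip_profile = profile_column(rows, "zip")
amount_profile = profile_column(rows, "amount")
```

A column that reports more than one inferred type, as both columns do here, is usually a good place to start investigating.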
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
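The ZIP code case above could be captured as a rule along these lines. This is only a sketch: it assumes a valid value is five digits with an optional four-digit extension, and the `check_zip` helper and its three labels are hypothetical names chosen for the example.

```python
import re

# Assumed rule: five digits, optionally followed by a hyphen and four more digits.
ZIP_RULE = re.compile(r"[0-9]{5}(-[0-9]{4})?")

def check_zip(value):
    """Return 'valid', 'fixable' (valid once whitespace is removed), or 'invalid'."""
    if ZIP_RULE.fullmatch(value):
        return "valid"
    collapsed = re.sub(r"\s+", "", value)
    if ZIP_RULE.fullmatch(collapsed):
        return "fixable"
    return "invalid"

results = {z: check_zip(z) for z in ["94061", "94 061", "9406", "94061-1234"]}
```

Separating "fixable" from "invalid" distinguishes values you can correct yourself from values that need to go back to an upstream owner.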
4. Data Quality Monitoring
Now that you know what you have, the next step is to make sure you correct these issues. This may be something that you can correct or something that you need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If left unchecked, data quality defects will continue to occur as a result of both system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
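One simple way to think about ongoing monitoring is to compare each new measurement of a metric against a baseline built from its recent history. The sketch below flags a value that sits far from the historical mean; the `detect_drift` helper, the three-standard-deviation threshold, and the null-rate figures are illustrative assumptions, and a production platform would likely use more robust methods.

```python
from statistics import mean, stdev

def detect_drift(history, current, threshold=3.0):
    """Flag `current` when it is more than `threshold` standard deviations
    away from the mean of `history` (needs at least two historical points)."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # No historical variation: any change from the baseline counts as drift.
        return current != baseline
    return abs(current - baseline) / spread > threshold

# Hypothetical daily null rates for one column.
null_rate_history = [0.010, 0.012, 0.011, 0.009, 0.010]

ok = detect_drift(null_rate_history, 0.011)     # within the usual range
alert = detect_drift(null_rate_history, 0.080)  # sudden jump in nulls
```

In this example the first check stays quiet and the second raises a flag that could feed an alert.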
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today, organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects face today. Some advantages include centralized profiling for all data types, a low-code/no-code interface, ML insights, easy integration, and scale and performance.
Data observability:
Leverages ML and statistical analysis to learn from the data and identify potential issues, and can also validate data against predefined rules
Detects, investigates the root cause of issues, and helps remediate
Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows
Low-code / no-code to accelerate time to value and lower cost
Enables both business and technical teams to participate in data quality and monitoring initiatives
Traditional data quality tools:
Uses predefined metrics from a known set of policies to understand the health of the data
Detects and helps remediate
Examples: data validation, data cleansing, data standardization
Ongoing maintenance, tweaking, and testing of data quality rules add to the cost
Designed mainly for technical teams who can implement ETL workflows or open source data validation software