Cloud migration and integration projects rely on good quality data to meet their objectives. However, traditional technologies have struggled to manage the volume and complexity of modern cloud computing and storage.
This article explores a use case from one organisation that embraced the next era of cloud data quality technology by incorporating it directly into its delivery approach.
The outcome: better-quality projects delivered in a fraction of the time of conventional approaches, whilst opening up exciting avenues for future project innovation.
The demands of cloud-based data profiling
Myers-Holum, Inc. (MHI) operates at the cutting edge of data engineering and data integration, having helped over 1,000 enterprises streamline their operations, financials, and business processes.
Driven by a desire to reduce maintenance costs, increase computing power, and exploit the massive rise in data volumes, clients reach out to MHI for help in either transitioning legacy analytical systems into Google Cloud or building out entirely new Google Cloud analytical solutions.
MHI therefore has a compelling need to understand the structures, content, and relationships within client system data, both to prevent anomalies or defects from going undetected during migration to Google Cloud and to shape the design of the target platform.
This discovery exercise is what is commonly referred to as 'data profiling'.
Data profiling forms a core component of data management and is crucial to the many complex data engineering initiatives that specialist data firms like MHI undertake.
The typical client of MHI is looking to analyse billions of data elements from sources such as:
- Google and Facebook ad networks
- In-app event streams
- Finance market event data (such as Bloomberg)
- Corporate sales and finance data
Data profiling speeds up the design and development of the analytical cloud platform solutions that will leverage this source data whilst identifying all of the transformations and data cleansing activities required to transition the data safely.
But despite a busy data profiling vendor marketplace, MHI had struggled to find a traditional profiling solution that could cope with the evolution of data volumes and processing performance typified by cloud computing (as well as work well with the new cloud technology stack).
The tools they initially tested originated from the previous ‘on-prem’ data management and engineering era, where volumes and compute speed were far lower. These legacy tools left MHI with a gap when attempting to scale data profiling economically and reliably for a cloud-based technical stack.
Darius Kemeklis, EVP of the MHI Google Cloud Practice, explains:
"We had grappled with the existing data profiling technologies for some time. Many of the tools were limited in their outputs and analytics, making it impossible to share insights with clients.
But the biggest challenge was scale.
Either the legacy profiling architectures meant we were forced to rely on sampling (which didn't address our needs), or they were too cumbersome to support a consulting workflow that requires the analysis of thousands of attributes and billions of data points".
Given these limitations, Telmai began working with Darius' team at MHI to leverage the Telmai cloud-based data profiling and quality solution to improve their cloud migration workflow and reduce the cost/timescales of hand-cranked data profiling activities.
Incorporating data profiling into the data migration consulting workflow
Companies such as MHI realise that when building integration and migration processes, you can't move on to designing and engineering data pipelines until the data profiling and quality assessment work is complete.
For example, you can't build mapping and transformation rules between source and target data stores without:

- A detailed structural analysis of the source schemas
- A data quality analysis to identify problem and risk areas in the data content
- The distribution of data values and patterns, which reveals the standards and rules inherent to the data
- A list of redundant attributes that are empty, incomplete, or no longer maintained
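These baseline checks can be sketched in a few lines of Python, assuming the source extract fits in a pandas DataFrame; the `zip` and `unused` columns here are invented for illustration and are not MHI's actual data:

```python
import re
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column structural profile: completeness, cardinality, dominant pattern."""
    def pattern(v) -> str:
        # Generalise a value: letters -> A, digits -> 9, punctuation kept as-is
        s = re.sub(r"[A-Za-z]", "A", str(v))
        return re.sub(r"\d", "9", s)

    rows = []
    for col in df.columns:
        series = df[col]
        non_null = series.dropna()
        top = non_null.map(pattern).value_counts()
        rows.append({
            "column": col,
            "null_pct": round(100 * series.isna().mean(), 1),
            "distinct": non_null.nunique(),
            "top_pattern": top.index[0] if not top.empty else None,
        })
    return pd.DataFrame(rows)

# Invented example: a ZIP column with a stray space, plus an empty attribute
df = pd.DataFrame({
    "zip": ["94061", "94 061", "10001"],
    "unused": [None, None, None],
})
print(profile(df))
```

Even this crude pattern generalisation immediately separates the dominant `99999` format from the anomalous `99 999` variant, and the 100% null rate flags `unused` as a candidate redundant attribute.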
Skipping these phases dramatically increases the likelihood of failure during a data migration and integration project.
During a recent Telmai interview with Dylan Jones (editor of Data Quality Pro and Data Migration Pro), Dylan explained the research findings that linked data profiling to successful data migration outcomes:
Telmai: How has data profiling improved data migration outcomes?
Dylan Jones: "It's hard to overstate how big a shift we've seen due to improved technologies and practices in data profiling and data quality management.
Historically, data migration projects were high-risk ventures.
Back in 2007, only 16% of data migration projects came in on time and under budget. Given the high failure rate, data migration was thought of as a 'poisoned chalice' if you were in charge of delivery.
But when we researched the industry in 2017, we found that 60% of projects were considered successful.
The adoption of data profiling heavily influenced that success.
For example, the 2007 research showed that only 10% of projects used data profiling tools. But by 2017, the adoption of data profiling tools had risen to 70% (in the US) and is even higher today. There's a clear link between data profiling, data quality and project outcomes."
The link between data profiling and accurate project forecasting
One of the biggest challenges that data profiling addresses is helping to scope and assess the risks associated with a data migration or integration project.
It's impossible to determine how complex or costly the project will be for integration partners and customers without accurately assessing the legacy data sources and the migration path they need to take before reaching the target system.
During the same interview, Dylan Jones expanded on this scoping challenge and its dependency on data profiling:
Telmai: How can data profiling influence the project scoping and forecasting analysis of a complex migration?
Dylan Jones: "One of the reasons many projects still come in over budget or blow their delivery timescales is they lack an effective forecasting strategy that is driven by reality as opposed to guesswork and misplaced assumptions.
For example, our research observed that 50% of projects lacked an effective forecasting and scoping strategy.
Today, the challenge is compounded because so many migration projects are cloud-related, which means the volumes and complexity of data sources are significantly higher than ever before – greatly increasing the risk of budget and timescale overruns if the forecasting is flawed.
The key is to undertake a Pre-Migration Impact Assessment, which is a fancy way of saying: profile your data extensively!
You must understand the data structure, content, and quality before committing investment and planning for your migration.
By profiling your data in advance, you'll have a clearer understanding of the pitfalls that await you, the skills and resources you're going to need and the likely duration and complexity of the project."
How does data profiling feature throughout the rest of the data migration?
We've highlighted the importance of data profiling at the outset of an integration/migration project – but how can it be applied after this?
Drawing again on the MHI use case, we can see how data profiling can support the design of the target solution by sanity checking the quality and content of event data.
Darius Kemeklis explains:
"For many of our clients, we need to process and analyse large amounts of event data. An example could be a client with money transfer systems that record vast quantities of interaction data with its customers.
By understanding this data, we can model the totality of user engagement. This knowledge helps us build the right data analytics and warehousing solution to optimise each client's marketing campaign and transaction performance.
Data profiling plays a critical role because it helps us quickly cut through the noise to identify the most vital information and assess its fitness against the intended purpose of the target system."
This use case demonstrates how cloud-based data profiling accelerates and shapes target system design by assessing and reporting on event data volumes that would have been inconceivable with legacy profiling technologies.
Beyond the data migration – how can cloud-based data profiling play a role after go-live?
At Telmai, we're excited about the implications of cloud computing on the type of data services that solution providers and their clients can support after the migration.
When the migration or integration is finished and the target system goes live, the data quality assurance work is not complete – in fact, it's just getting started.
Systems and applications constantly adapt to the subtle shifts in business models, consumer needs, and competitive pressures shaping application and data design.
If unchecked, data quality defects will continue to occur, both as a result of system and user behaviour changes and through the inherent failure rate that massive volumes of data and user interaction inevitably create.
The ability to measure and monitor data quality remotely was not lost on MHI following their recent shift to cloud-based data profiling with Telmai.
As Darius Kemeklis explains:
"Historically, we would deliver the final migration and hand everything over to the customer. But the challenge is what happens if something like an industry coding standard changes or a particular user starts entering information in a different format?
By applying these new approaches to cloud-based data profiling and data quality monitoring, we'll be able to build alerts that instantly target and report defective data before it leads to problems.
This allows our customers to proactively monitor data quality instead of relying on the business users to notify the technology teams when the data goes bad."
There are many ways that management reports, analytics and operational processes can become defective after the migration if the underlying data quality isn't continuously assessed and monitored.
By building up an earlier profile of the target data that you know to be correct, you can build a 24/7 operational data quality reporting platform by leveraging the data profiling and data quality rules, technologies, and processes delivered during the migration.
The flexible, remote capabilities of cloud computing make it easier than ever to deliver these types of value-added data quality services.
Summary and next steps
Cloud migration and integration projects benefit significantly from data profiling and data quality interventions, but for many years the sheer scale and volume of cloud computing created a barrier to traditional data profiling technologies.
The case story we present in this article introduces the next generation of cloud-based profiling solutions and their potential, now and in the future.
If you would like to personally experience a demonstration of how Telmai can help improve the quality and outcome of your cloud migration or integration project, then reserve your demonstration below:
Book a demo now and see for yourself.
Data profiling helps organizations understand their data, identify issues and discrepancies, and improve data quality. It is an essential part of any data-related project and without it data quality could impact critical business decisions, customer trust, sales and financial opportunities.
To get started, there are four main steps in building a complete and ongoing data profiling process: data collection, discovery and analysis, documenting the findings, and data quality monitoring.
We'll explore each of these steps in detail and discuss how they contribute to the overall goal of ensuring accurate and reliable data. Before we get started, let's remind ourselves what data profiling is.
1. Data Collection
Start with data collection. Gather data from various sources and extract it into a single location for analysis. If you have multiple sources, choose a centralized data profiling tool (see our recommendation in the conclusion) that can easily connect and analyze all your data without having you do any prep work.
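As a rough illustration of this consolidation step, the sketch below lands two invented extracts (`sales` and `events`) in a single SQLite staging table ready for analysis; the source data, table name, and choice of SQLite are all assumptions made for the example:

```python
import sqlite3
import pandas as pd

# Hypothetical extracts from two different sources
sales = pd.DataFrame({"customer_id": [1, 2], "amount": [120.0, 75.5]})
events = pd.DataFrame({"customer_id": [2, 3], "amount": [None, 10.0]})

# Tag each row with its origin, then stack into one frame for profiling
staged = pd.concat(
    [sales.assign(source="sales"), events.assign(source="events")],
    ignore_index=True,
)

# Land everything in a single analysis location (here, an in-memory SQLite DB)
with sqlite3.connect(":memory:") as conn:
    staged.to_sql("staging", conn, index=False, if_exists="replace")
    count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(count)
```

Keeping a `source` column on every row means later findings can always be traced back to the system they came from.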
2. Discovery & Analysis
Now that you have collected your data for analysis, it's time to investigate it. Depending on your use case, you may need structure discovery, content discovery, relationship discovery, or all three. If content or structure discovery is important for your use case, make sure that you collect and profile your data in its entirety and do not rely on samples, as sampling will skew your results.
Use visualizations to make your discovery and analysis more understandable. It is much easier to see outliers and anomalies in your data using graphs than in a table format.
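Even before reaching for charts, a simple distribution check surfaces the outliers a table would hide. This sketch applies a standard interquartile-range fence to invented transfer amounts:

```python
import pandas as pd

# Hypothetical transfer amounts with one suspicious value
amounts = pd.Series([12.0, 15.5, 14.2, 13.8, 950.0, 14.9])

# IQR fence: anything far above the bulk of the distribution is flagged
q1, q3 = amounts.quantile([0.25, 0.75])
fence = q3 + 1.5 * (q3 - q1)
outliers = amounts[amounts > fence]
print(outliers.tolist())
```

The same quartiles that drive this fence are what a box plot draws, which is why a visual check and this numeric check tend to agree.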
3. Documenting the Findings
Create a report or documentation outlining the results of the data profiling process, including any issues or discrepancies found.
Use this step to establish data quality rules that you may not have been aware of. For example, a United States ZIP code of 94061 could have accidentally been typed in as 94 061 with a space in the middle. Documenting this issue could help you establish new rules for the next time you profile the data.
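That ZIP code finding can be captured as an executable rule rather than a note in a document. This is a minimal sketch: the five-digit and ZIP+4 formats are standard US formats, but the function and rule names are our own invention:

```python
import re

# Rule derived from the profiling finding: a US ZIP code is five digits,
# optionally followed by a four-digit extension (ZIP+4). No embedded spaces.
ZIP_RULE = re.compile(r"^\d{5}(-\d{4})?$")

def check_zip(value: str) -> bool:
    """Return True if the value satisfies the documented ZIP code rule."""
    return bool(ZIP_RULE.match(value))

print(check_zip("94061"))       # valid five-digit ZIP
print(check_zip("94 061"))      # the stray space violates the rule
print(check_zip("94061-1234"))  # ZIP+4 is also accepted
```

Rules captured this way can be re-run automatically the next time the data is profiled, instead of the issue being rediscovered by hand.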
4. Data Quality Monitoring
Now that you know what you have, the next step is to address the issues you found. Some you can correct yourself; others you'll need to flag for upstream data owners to fix.
After your data profiling is done and the system goes live, your data quality assurance work is not done – in fact, it's just getting started.
Data constantly changes. If unchecked, data quality defects will continue to occur, both as a result of system and user behavior changes.
Build a platform that can measure and monitor data quality on an ongoing basis.
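A minimal sketch of what such ongoing monitoring involves: record quality metrics from a known-good baseline, recompute them on fresh data, and alert on drift. The metrics, column names, and tolerance here are invented for illustration:

```python
import pandas as pd

def metrics(df: pd.DataFrame) -> dict:
    """An illustrative pair of quality metrics worth tracking over time."""
    return {
        "row_count": len(df),
        "zip_null_pct": df["zip"].isna().mean(),
    }

def drift_alerts(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    """Flag any metric that moved more than `tolerance` (relative) from baseline."""
    alerts = []
    for name, base in baseline.items():
        now = current[name]
        if base == 0:
            if now != 0:          # any movement away from a zero baseline
                alerts.append(name)
        elif abs(now - base) / base > tolerance:
            alerts.append(name)
    return alerts

# Baseline captured at go-live; "current" is a later, degraded extract
baseline = metrics(pd.DataFrame({"zip": ["94061", "10001", "60601", "02139"]}))
current = metrics(pd.DataFrame({"zip": ["94061", None, "60601"]}))
print(drift_alerts(baseline, current))
```

In practice the alert would feed a notification channel rather than a print statement, but the baseline-versus-current comparison is the core of the idea.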
Take Advantage of Data Observability Tools
Automated tools can help you save time and resources and ensure accuracy in the process.
Unfortunately, traditional data profiling tools offered by legacy ETL and database vendors are complex and require data engineering and technical skills. They also only handle data that is structured and ready for analysis. Semi-structured data sets, nested data formats, blob storage types, or streaming data do not have a place in those solutions.
Today organizations that deal with complex data types or large amounts of data are looking for a newer, more scalable solution.
That’s where a data observability tool like Telmai comes in. Telmai is built to handle the complexity that data profiling projects are faced with today. Some advantages include centralized profiling for all data types, a low-code no-code interface, ML insights, easy integration, and scale and performance.
| Data observability (e.g. Telmai) | Traditional data quality tools |
| --- | --- |
| Leverages ML and statistical analysis to learn from the data and identify potential issues; can also validate data against predefined rules | Uses predefined metrics from a known set of policies to understand the health of the data |
| Detects issues, investigates their root cause, and helps remediate | Detects and helps remediate |
| Examples: continuous monitoring, alerting on anomalies or drifts, and operationalizing the findings into data flows | Examples: data validation, data cleansing, data standardization |
| Low-code / no-code to accelerate time to value and lower cost | Ongoing maintenance, tweaking, and testing of data quality rules adds to costs |
| Enables both business and technical teams to participate in data quality and monitoring initiatives | Designed mainly for technical teams who can implement ETL workflows or open-source data validation software |
Start your data observability today
Connect your data and start generating a baseline in less than 10 minutes.
No sales call needed