Enhancing Query Efficiency in Lakehouses


Max Lukichev

April 16, 2024

A critical challenge for organizations today is managing the surging costs associated with data infrastructure. The modern lakehouse architecture magnifies this challenge because it makes extensive volumes of data easy to access, and without proper management this can result in inefficiencies and suboptimal data usage.

A significant portion of these costs arises from unnecessary expenditures on moving and storing low-quality data. Yet the expenses incurred in processing and querying this data far outweigh those storage and movement costs.

It’s no secret that data is never perfect and that Data Quality requires significant investment. Yet understanding how data is queried offers an effective strategy for cutting infrastructure expenses and enhancing the value of data assets.

Analyzing Queries To Optimize Performance

Examining the following aspects of the queries being run can help identify and better understand the root causes of financial inefficiencies and system bottlenecks:

  1. Identify the queries that take a long time to execute or access large volumes of data. It’s crucial to measure both the execution time and the amount of data these queries scan, as these factors typically drive up costs.
  2. Determine the common filter patterns in long-running queries over large datasets, that is, the conditions these queries apply most frequently. Optimizing partition settings to match these patterns can lead to significant improvements.
  3. Recognize the most frequently accessed tables as well as those that are rarely used. The latter are often referred to as “dark data”: data that is collected and stored but rarely, if ever, utilized, leading to unnecessary expenditure.
  4. Analyze the most common aggregation queries to reveal opportunities to pre-aggregate data according to usage patterns. This strategy can reduce the need to query the raw data layer, which is often costly.
  5. Assess the quality of the data that is most frequently queried, right down to the attribute level. In many cases, only a small subset of attributes is regularly accessed out of a much larger set. Querying low-quality data is wasteful and highlights areas where investments in Data Quality can yield significant returns.

To address points 1-4, some Data Warehouse (DW) systems expose query metadata that is readily accessible and queryable for analysis. In Snowflake, for example, the ACCOUNT_USAGE schema provides this history. The following example code retrieves the top 10 longest-running queries:

You can adjust the DATEADD function to focus on a shorter or longer period of time, and replace my_warehouse with the name of your warehouse.

SELECT query_id,
  ROW_NUMBER() OVER(ORDER BY partitions_scanned DESC) AS query_id_int,
  query_text,
  total_elapsed_time/1000 AS query_execution_time_seconds,
  partitions_scanned,
  partitions_total
FROM snowflake.account_usage.query_history Q
WHERE warehouse_name = 'my_warehouse' AND TO_DATE(Q.start_time) > DATEADD(day,-1,TO_DATE(CURRENT_TIMESTAMP()))
  AND total_elapsed_time > 0 --only get queries that actually used compute
  AND error_code IS NULL
  AND partitions_scanned IS NOT NULL
ORDER BY total_elapsed_time DESC
LIMIT 10;
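
For point 2, the same QUERY_HISTORY view can highlight queries whose filters prune partitions poorly. The sketch below assumes the same my_warehouse placeholder and a seven-day window, and ranks queries by the share of partitions they scan; reviewing the WHERE clauses of the top offenders often reveals which columns to partition or cluster on.

-- Sketch: queries that scan a large fraction of their tables' partitions,
-- a sign that filter conditions do not line up with partitioning or clustering.
SELECT query_id,
  query_text,
  partitions_scanned,
  partitions_total,
  partitions_scanned / partitions_total AS scan_ratio
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'my_warehouse'
  AND start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND partitions_total > 0
  AND error_code IS NULL
ORDER BY scan_ratio DESC, partitions_scanned DESC
LIMIT 10;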

Refer to Snowflake’s ACCOUNT_USAGE documentation for detailed insights into historical query and task performance.
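
For point 3, Snowflake’s ACCESS_HISTORY view (also in ACCOUNT_USAGE, and generally requiring Enterprise Edition or higher) records which objects each query touched. The rough sketch below counts reads per table over the last 30 days; tables that never appear in the result, or sit near the bottom of it, are candidates for dark data.

-- Rough sketch: approximate per-table read counts over the last 30 days.
SELECT obj.value:"objectName"::STRING AS table_name,
  COUNT(DISTINCT ah.query_id) AS read_count
FROM snowflake.account_usage.access_history ah,
  LATERAL FLATTEN(input => ah.direct_objects_accessed) obj
WHERE ah.query_start_time > DATEADD(day, -30, CURRENT_TIMESTAMP())
  AND obj.value:"objectDomain"::STRING = 'Table'
GROUP BY 1
ORDER BY read_count ASC;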

This process can be enhanced by integrating monitoring and alerting capabilities using platforms like Datadog or Splunk. By setting up SQL connectors, these systems can trigger alerts when there’s a noticeable increase in p90 response times.
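
As one possible starting point, the sketch below computes hourly p90 query latency from QUERY_HISTORY; an external monitor polling this result through a SQL connector could alert when the value jumps. The my_warehouse name and the one-day window are placeholders to adjust.

-- Sketch: hourly p90 query latency (in seconds) for an external monitor to poll.
SELECT DATE_TRUNC('hour', start_time) AS hour,
  APPROX_PERCENTILE(total_elapsed_time, 0.9) / 1000 AS p90_seconds
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'my_warehouse'
  AND start_time > DATEADD(day, -1, CURRENT_TIMESTAMP())
  AND error_code IS NULL
GROUP BY 1
ORDER BY 1;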

Conclusion

Although points 1 to 4 can be resolved with relative ease and minimal engineering effort, the last point demands a comprehensive approach: it should be treated as an ongoing program rather than a one-time project, and budgeted accordingly. To save on Data Warehouse costs and get the most from data projects, you need a system that does two things well: analyze how queries run and assess the quality of the data being queried. Bringing this information together in a simple way lets you:

  • Minimize querying of raw data by guiding users towards pre-aggregated, enhanced, and cleaned data (a minimal sketch of this idea follows this list).
  • Optimize data storage according to query patterns.
  • Focus Data Quality initiatives and budgeting on the highest data usage and impact areas.
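
To illustrate the first bullet (and point 4 above), frequently repeated aggregations can be served from a pre-aggregated layer instead of the raw data. The hypothetical sketch below uses a Snowflake materialized view (an Enterprise Edition feature; a scheduled CTAS or a dynamic table achieves a similar effect), where raw_events, event_date, customer_id, and amount are placeholder names.

-- Hypothetical example: pre-aggregate a commonly queried daily rollup so that
-- dashboards hit the small aggregate instead of the raw event table.
CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT event_date,
  customer_id,
  COUNT(*) AS event_count,
  SUM(amount) AS total_amount
FROM raw_events
GROUP BY event_date, customer_id;

Queries that previously aggregated raw_events directly can then be pointed at daily_revenue_mv.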
