Apache Iceberg Archives

Driving Reliable Graph Analytics on your Open Lakehouse Data with Telmai and PuppyGraph

Posted on August 11, 2025August 11, 2025 by Anoop Gopalam

As enterprise data ecosystems expand, data pipelines have become increasingly distributed and heterogeneous. Critical business information streams in from numerous sources, landing in diverse cloud systems such as data warehouses and lakehouses. With the rapid adoption of open table formats like Apache Iceberg and Delta Lake, extracting meaningful context from this complex data landscape has grown more challenging than ever.

Graph databases are emerging as a vital tool for enterprises aiming to surface nuanced insights hidden within vast, interconnected datasets. Yet, the reliability of any graph model hinges on the quality of its underlying data. Without clean, trusted data, even the most advanced graph engines can produce misleading insights.

This article examines why graph modeling on open lakehouses demands a heightened focus on data quality and how the combined strengths of Telmai and PuppyGraph deliver a robust, transparent, and scalable solution to ensure your knowledge graphs stand on a foundation of clean, reliable data.

What is a Graph Database?

A graph database models data as nodes and edges rather than rows and tables. This structure makes it ideal for querying complex relationships that include customer behavior, fraud detection, supply chain optimization, or even social behavior graphs.

Graph databases use nodes (representing entities) and edges (representing their relationships) to reflect real-world connections naturally. This model is especially powerful for answering questions like:

How are customers, products, and transactions interrelated?
Which suppliers, shipments, or touchpoints form a risk-prone chain?
What are the shortest paths or networks among organizational entities?

There are two main types of graph databases:

RDF (Resource Description Framework): Schema-driven, commonly used for semantic web applications.
Property graphs: More flexible, allowing arbitrary attributes on both nodes and edges, making them intuitive for a wide range of use cases.

PuppyGraph is a high-performance property graph engine that lets you query structured data as a graph, making it easy to uncover relationships and patterns without moving your data into a separate graph database. It supports Gremlin and Cypher query languages, integrates directly with tabular data sources like Iceberg, and avoids the heavyweight infrastructure typical of traditional graph databases.

Graph Power Without Data Migration

Historically, running advanced analytics meant extracting data from storage and loading it into tightly coupled, often proprietary platforms. This process was slow, risky, and led to fragmentation and vendor lock-in.

Modern compute engines like PuppyGraph break this pattern by enabling direct, in-place graph querying over data in object storage. This creates a centralized source of truth while maintaining architectural flexibility, reducing complexity, and preserving data integrity, future-proofing your analytics stack.

Why Data Quality Must Be Built Into Your Graph Pipeline

Modern lakehouses built on open table formats like Apache Iceberg or Delta Lake promise agility, scale, and interoperability for enterprise data. However, their very openness can mask a new breed of data quality issues that quietly erode the value of downstream analytics, especially in graph modeling.

Key data quality issues common in open table formats and distributed pipelines that could affect graph modeling include:

Schema drift and type inconsistencies: Data evolving over time may introduce mixed data types or missing columns, breaking parsing logic and causing graph construction failures or unexpected node/edge omissions.

Null or missing foreign keys: Missing references between tables can create orphaned nodes or broken edges, fragmenting the graph and skewing relationship metrics.

Inconsistent or mixed timestamp formats: Time-based event relationships rely on accurate event sequencing. Mixed formats disrupt these sequences, making time-based graph queries unreliable.
Out-of-range or anomalous values: Erroneous measurements or outliers can bias graph algorithms, for example by inflating edge weights or misrepresenting geospatial relationships.

Duplicate or partial records: These create redundancy and fragmentation, inflating graph size and complicating pattern detection.

Referential mismatches across distributed datasets: Inaccurate joins lead to false or missing relationships, diluting the reliability of graph analytics.

The distributed and heterogeneous nature of lakehouse pipelines amplifies these challenges, as data flows through multiple ingestion points and transformations before reaching the graph layer. Without systematic, automated data quality validation before graph modeling, these hidden errors remain undetected—leading to delayed insights, costly rework, and even production outages.

Embedding rigorous data quality checks early in the pipeline ensures that your graph analytics start from a clean, consistent, and trusted foundation. This is where the combined strengths of Telmai and PuppyGraph offer a breakthrough.

How Telmai and PuppyGraph Transform Raw Data into Trusted Graph Analytics

Enterprise analytics delivers true value only when the data relationships it relies on are accurate and transparent. Telmai and PuppyGraph offer an integrated solution that validates and models data in real time ensuring every data node, edge, and relationship is trustworthy. This unified approach enables teams to interpret complex datasets with clarity and agility.

Figure: Telmai & PuppyGraph Architecture

To bring the joint value of Telmai and PuppyGraph into sharp focus, let’s walk through a practical example using the Olist dataset — a publicly available e-commerce dataset rich with customer, order, product, and seller information.

In this dataset, we injected common data quality challenges such as:

Null foreign keys (e.g., missing customer_id or seller_id), which break the critical links between customers, orders, and products
Inconsistent timestamp formats : Mixing formats such as MM/DD/YYYY with ISO 8601 timestamps leads to unreliable temporal relationships. For graph analytics that rely on event sequencing, like tracking purchase funnels or supply chain timelines, this inconsistency results in erroneous ordering of events, skewed path analyses, and misleading temporal insights.
Out-of-range values, like unrealistic product weights that skew relationship weighting and analytics
Data type mismatches that lead to processing errors or dropped nodes during graph construction

If these issues remain undetected and uncorrected, the resulting graph will have broken edges, orphan nodes, and inaccurate relationship metrics,ultimately producing misleading insights and undermining trust in your analytics.

This is where Telmai plays a pivotal role. Before the data ever reaches the graph engine, Telmai performs comprehensive, full-fidelity data profiling and validation directly on the raw Iceberg tables in their native cloud storage location.

It automatically detects null keys, inconsistent formats, schema drift, and anomalous values, without resorting to sampling that might miss critical errors. Telmai surfaces these issues early, enabling data teams to correct or flag problematic data before graph modeling begins.

With this validated, clean data in place, PuppyGraph ingests the Iceberg datasets natively—eliminating the need for costly data migrations or fragile ETL processes. PuppyGraph then constructs accurate, high-performance property graphs that faithfully represent the true entity relationships and temporal sequences within your data.

Graph algorithms depend heavily on the correctness of edges and nodes to surface meaningful relationships, identify patterns, and detect anomalies. By integrating Telmai’s rigorous data quality validation with PuppyGraph’s flexible, in-place graph computation, organizations gain confidence that their knowledge graphs are built on solid ground. This ensures faster onboarding, fewer silent errors, and graph analytics that reliably power critical business applications—from customer journey analysis to fraud detection and supply chain optimization.

The old adage “garbage in, garbage out” holds especially true here: graphs built on noisy or inconsistent data risk misleading conclusions, operational disruptions, and lost business opportunities.

Conclusion

Together, Telmai and PuppyGraph offer a seamless, scalable solution that enables enterprises to build trustworthy knowledge graphs on top of open lakehouses. By integrating rigorous data validation with high-performance graph modeling, here are the key benefits that this joint solution can offer:

Faster onboarding: Validated data minimizes back-and-forth between data engineers and graph modelers, speeding up time to value.
Fewer silent errors: Early detection prevents costly rework and avoids customer-facing problems caused by inaccurate graph outputs.
Smarter data products: Reliable, high-quality graphs enable more precise personalization, recommendations, and fraud detection—driving better business outcomes.

Ready to build trusted, scalable graphs? Click here to talk to our team and learn how to turn your lakehouse into a source of clean, reliable insights.

Want to stay ahead on best practices and product insights? Click here to subscribe to our newsletter for expert guidance on building reliable, AI-ready data pipelines.

Embedding AI-Ready Observability in the Lakehouse: Lessons from Bill and ZoomInfo

Posted on July 31, 2025July 31, 2025 by Anoop Gopalam

As enterprises modernize toward AI-first architectures, trustworthy data pipelines have become a foundational requirement. At enterprise scale, the sheer velocity, variety, and complexity of evolving data ecosystems make it essential not just to deliver clean data, but to embed data observability deeply within lakehouse architectures. Without it, even the most sophisticated analytics or AI initiatives risk breaking under the weight of unreliable inputs.

At this year’s CDOIQ Symposium, Hasmik Sarkezians, SVP of Data Engineering at ZoomInfo, and Aindra Misra, Director of Product at Bill, joined Mona Rakibe, CEO of Telmai, for a candid panel discussion. Together, they shared hard-won insights on what it takes to operationalize real-time, proactive data observability in modern lakehouse environments—and why traditional, reactive approaches no longer meet the needs of today’s AI-driven enterprise.

Why Observability Can’t Be an Afterthought

Both Bill and ZoomInfo operate in high-velocity, high-stakes data environments. Bill powers mission-critical financial workflows for over 500,000 small businesses and 9,000+ accounting firms, with products spanning AP, AR, and spend management. ZoomInfo manages a complex pipeline of over 450 million contacts and 250 million companies, delivering enriched, AI-powered go-to-market intelligence to thousands of customers in real time.

In both cases, small data errors often snowball into systemic risks. For instance, at ZoomInfo, A misclassified company description that is used to infer industry, headcount, or revenue, if left unchecked, can ripple through downstream processes and undermine the accuracy of critical data products if left unchecked.As Hasmik Sarkezians, SVP of Data Engineering at ZoomInfo, put it:

A minor data issue can become a massive customer-facing problem if it slips through the cracks. Catching it at the source is 10x cheaper and 100x less painful.

Catching issues at the root, she emphasized, is far less costly than retroactively fixing the consequences after they’ve been exposed to customers.

Moving from Monitoring to Intelligent Action: Making Observability Actionable

Observability is often synonymous with an after-the-fact reporting function. But both Bill and ZoomInfo have pushed well beyond that model toward embedded, actionable observability that actively shapes how data flows through their systems.

At ZoomInfo, this shift has been architectural. Rather than automatically pushing updates from the source of truth to their customer-facing search platform, the data team now holds that data until it passes a battery of automated quality checks powered by Telmai. If anomalies are detected, a failure alert is sent via Slack, and the data is held back from publication until the issue is resolved.

“We prevent the bad data from being exposed to the customer,” explained Hasmik Sarkezians, SVP of Data Engineering at ZoomInfo, “we catch that before it’s even published.” Updated records now undergo anomaly detection and policy checks via DAGs, and only data that passes validation is published. If an issue is found, a failure alert is pushed into Slack, and the data is held for manual review or correction.

In one instance, a faulty proxy once caused a data source to generate null revenue values for a large portion of companies. “We already caught multiple issues,” said Hasmik, referencing one such case involving SEC data, “proxy had an issue [that] generated null values, and we didn’t consume it because we had this alert in place.”

The pipeline, equipped with Telmai rules and micro-batch DAGs, caught the anomaly before it could propagate to customers.

Meanwhile, at Bill, the platform team faced a familiar challenge: lean data engineering resources spread thin managing Great Expectations and ad hoc rule logic. With a growing number of internal and external data consumers—including AI agents, forecasting engines, and fraud models—the cost of manual triage became unsustainable.

At Bill, the driver was slightly different. Their lean engineering team had previously relied on open-source frameworks like Great Expectations, but the overhead of managing rule-based tests across dynamic datasets was increasingly unsustainable.

Our hope with Telmai is that we’ll improve operational efficiency for our teams… and scale data quality to analytics users as well, not just engineering. – Aindra Misra, Director of Product at Bill

By introducing anomaly detection, no-code interfaces, and out-of-the-box integrations, Bill aims to empower not just data engineers but business analysts to assess trustworthiness—without relying on custom rules or engineering intervention.

For both companies, this marks a step toward making observability not just visible, but actionable—and enabling faster, safer data product delivery as a result.

The Role of Open Architectures

Both Bill and ZoomInfo emphasized the centrality of open architectures, anchoring their platforms on Apache Iceberg to support scalable, AI-ready analytics across heterogeneous, rapidly evolving ecosystems.

ZoomInfo, in particular, has leaned into architectural openness to simplify access across its vast and distributed data estate that includes cloud platforms and legacy systems. “We’ve been at GCP, we have presence in AWS. We have data all over,” said Hasmik Sarkezians. To unify this complexity, ZoomInfo adopted Starburst on top of Iceberg. “It kind of democratized how we access the data and made our integration much easier.”

Bill echoed a similar philosophy. “For us, open architecture is a combination of three different components,” explained Aindra Misra. “The first one is… open data format. Second is industry standard protocols. And the third… is modular integration.” He highlighted Bill’s use of Iceberg and adherence to standardized protocols for syncing with external accounting systems—ensuring flexibility both within their stack and across third-party integrations.

This architectural philosophy carries important implications for observability. Rather than relying on closed systems or platform-specific solutions, both teams prioritized composability—selecting tools that integrate natively into their pipelines, query layers, and governance stacks. As Mona pointed out, interoperability was “literally table stakes” in ZoomInfo’s evaluation process: “Would we integrate with their today’s data architecture, future’s data architecture, past data systems?”

Observability, in these environments, must adapt—not disrupt. That means understanding Iceberg metadata natively, connecting easily to orchestration frameworks, and enabling cross-system validation without manual validation. In short, open data architectures demand open observability systems—ones built to meet organizations where their data lives.

This design philosophy lets teams keep pace with changing business and technical needs. In Hasmik’s words: “…for me it’s just democratization of… the quality process, the data itself, the data governance, all of that has to come together to tell a cohesive story.”

By rooting their approaches in open, flexible architectures, both companies have positioned themselves to scale trust and agility—making meaningful, system-wide observability possible as they pursue ever more advanced data and AI outcomes.

Organizational Lessons: Who Owns Data Quality?

Despite making significant technical strides, both panelists acknowledged that data quality ownership and building a culture around it remain a persistent challenge.

ZoomInfo tackled this by forming a dedicated Data Reliability Engineering (DRE) team was initially created to manage observability infrastructure and to onboard new data sets. However, as Hasmik Sarkezians explained, this model soon ran up against bottlenecks and scalability concerns:

“Currently, we have a very small team. We created a team around [Telmai], which is called the DRE, the data reliability engineers… It’s a semi-automatic way of onboarding new datasets… but it’s not really automatic and it’s not really easy to get the direct cause, so there’s a lot of efforts being done to automate all of that process.”

Recognizing these limitations, ZoomInfo is actively working to decentralize data quality responsibilities. The vision is to empower product and domain teams—not only centralized data reliability engineers—to set their own Telmai policies, receive alerts directly, and react quickly via Slack integrations or future natural language interfaces:

“For me, I think we need to make sure that the owner, the data set owner, can set up the Telmai alerts, would be reactive to those alerts, and will take action.”

At Bill, Aindra Misra described a similar challenge. Leaning too heavily on a small, expert engineering team created not just operational drag, but also strained handoffs and trust with analytics and business teams: “With the lean team… things get escalated and the overall trust between the handshake between internal teams like the platform engineering and analytics team—that trust loses.”

Their north star is to build an ecosystem where business analysts, Ops, GTM teams, and other data consumers have the direct context to check, understand, and act on data quality issues—without always waiting for engineering intervention.

In both organizations, it’s clear that tools alone aren’t enough. Ownership must be embedded into culture, process, and structure—with clearly defined SLAs, better cross-team handoffs, and systems that empower the people closest to the data to take accountability for its quality.

Toward AI-Ready Data Products

Both organizations are also preparing for a shift from analytics-driven to autonomous systems.

At Bill, internal applications are increasingly powered by insights and forecasts that must be accurate, explainable, and timely. Use cases like spend policy enforcement, invoice financing, and fraud detection rely on real-time decisions driven by data flowing through modern platforms like Iceberg. As Aindra Misra noted, delivering trust in this context is critical: “Trust is our mission—whether it’s external customers or internal teams, data SLAs need to be predictable and transparent.”

ZoomInfo, meanwhile, is layering AI copilots and signal-driven workflows on top of an extensive enrichment pipeline. As Hasmik Sarkezians explained earlier, a single issue in a base data set can cascade through derived fields—corrupting entity resolution, contact mapping, and ultimately customer-facing outputs.

In both environments, the stakes are rising. Poor data quality no longer just breaks dashboards—it can undermine automation, introduce risk, and erode customer trust. As Aindra put it:

Once the data goes into an AI… if the output of that AI application is not what you expect it to be, it’s very hard to trace it back to the exact data issue at the source… unless you observed it before it broke something.”

That’s why both organizations see observability not as a reporting tool, but as a foundational enabler of AI—instrumenting every stage of the pipeline to catch issues before they scale into system-wide consequences.

Final Thoughts: Trust Is Your Data Moat

As AI models and agentic workflows become commoditized, the true differentiator isn’t your algorithm. It’s the reliability of the proprietary data you feed into it.

For both Bill and ZoomInfo, embedding observability wasn’t just about operational hygiene. It was a strategic move to scale trust, protect business outcomes, and prepare their architectures for the demands of autonomous systems.

Here are a few key points and takeaways from this panel discussion –

Start Early. Shift Left: Observability works best when embedded at the data ingestion and pipeline layer, not added post-facto once problems reach dashboards or AI models.
Automate the Feedback Loop: Use tools that not only detect issues but can orchestrate action—blocking bad data, triggering alerts, and assigning ownership.
Democratize, Don’t Centralize: Give business and analytics teams accessible controls and visibility into data health, instead of relying solely on specialized teams.
Build for Change: Choose data observability platforms that support open standards, multi-cloud, and mixed data ecosystems—future-proofing your investments

Want to learn how Telmai can accelerate your AI initiatives with reliable and trusted data? Click here to connect with our team for a personalized demo.

Want to stay ahead on best practices and product insights? Click here to subscribe to our newsletter for expert guidance on building reliable, AI-ready data pipelines.

What’s new at Telmai in 2025: Key product feature updates so far

Posted on June 27, 2025July 2, 2025 by Hashem Raslan

Reliable data is the foundation for every modern enterprise,from powering AI models to ensuring trusted reporting and customer experiences. But as architectures evolve toward distributed open lakehouse and multi-cloud hybrid data environments, ensuring data quality at the source becomes more critical than ever. That’s why, in the first half of 2025, Telmai continued to double down on enabling observability where the data lives—in the lake itself. From native Iceberg support that eliminates warehouse dependencies to enhanced rule logic and smarter alerting, our latest updates are built to help data teams monitor, validate, and remediate data issues earlier in the pipeline.

The result: greater trust in data-driven initiatives through automated resolution, allowing data reliability to scale alongside your growing ecosystem without adding operational overhead.

Ensuring data quality is made simple for Business and Engineering teams

Enhanced DQ Rule Engine to scale and unify validation across your stack

As data pipelines become increasingly distributed and complex, centrally managing the various data quality rules and validation workflows is no longer optional—it’s essential. Telmai’s enhanced rule engine offers a unified interface for creating, editing, and deploying validation rules across all systems, eliminating fragmented and siloed operations. With reusable templates, and JSON-based rule definitions, teams can standardize validations, ensure version control and further integrate into CI/CD workflows. Decoupled from warehouse compute, Telmai executes validations in its own engine, delivering performance and scalability without additional cost or operational overhead.

Custom free-form SQL metrics and advanced rule logic

Telmai now enables users to define custom metrics using SQL and implement complex validation logic that reflects their unique data domain, all through an intuitive interface. Whether it’s tracking nuanced business KPIs or applying layered rule conditions, teams can build data quality rules that align with real-world expectations, without engineering overhead. This empowers users across technical levels to define what “data quality” means for their organization and catch edge-case anomalies that generic rules often miss.

Monitor data quality trends with custom metric dashboards

image

Telmai’s Metric Inspector equips teams with interactive dashboards to visualize and investigate data quality trends over time with interactive, time-series dashboards that reveal how key data quality metrics behave over time. Users can drill into data quality KPIs such as null rates, freshness lag, or custom-defined KPIs and correlate them with anomaly triggers. This level of transparency helps data teams identify patterns, validate rule thresholds, and continuously refine their data quality strategy.

Observability at the Source for Open Table Formats

As AI applications and advanced analytics become table stakes for modern enterprises, organizations are increasingly adopting open lakehouse architectures powered by table formats like Apache Iceberg, Delta Lake, and Hudi. While these formats offer the flexibility of object storage with the governance and performance of traditional warehouses, they lack native mechanisms for ensuring data quality and reliability.

Without observability at the source, organizations are forced to validate data downstream, driving up cloud costs, delaying issue detection, and undermining the reliability of critical analytics and AI initiatives. Telmai solves this by delivering native, source-level observability for Apache Iceberg on GCP and GCS, with Delta and Hudi support on the roadmap.

Through partition-level profiling and metadata pushdown, Telmai enables full-fidelity validation without scanning entire datasets or triggering warehouse compute. This empowers teams to detect schema drift, freshness issues, and value anomalies early in the pipeline, ensuring AI models and analytical workloads operate on trusted, timely data, at scale and without architectural compromise.

Operational Efficiency Through Smarter Workflows and UX

Streamlined interface for defining and managing DQ policies

Telmai’s updated policy management UI makes it easier for users to define, configure, and review data quality policies at scale. The refreshed layout improves visibility into rule thresholds, affected datasets, and policy logic—reducing the time it takes to author or adjust checks. Designed to support both technical and operational users, this interface lowers the barrier to entry and enables more teams to take ownership of data quality without relying on engineering.

Smarter alerting and workflows for faster resolution

As data ecosystems scale, managing noise becomes just as important as detecting issues. Telmai now includes enhanced alert routing logic that ensures notifications are delivered based on policy type, team ownership, and relevance—so the right people are alerted at the right time. By aligning alerts with organizational context, teams can reduce alert fatigue, prioritize what matters, and accelerate resolution. This makes it easier to operationalize data quality across distributed teams and complex pipelines.

Simplified connection and asset management for faster onboarding

To streamline onboarding and reduce repetitive setup, Telmai has introduced a modular approach to managing connections and assets. Previously, connections were tied directly to asset creation, requiring full configuration for each new asset.

Now, source connections are created once through a centralized interface and can be reused across multiple assets. This separation simplifies setup, reduces duplication, and gives teams better control over how data sources are organized and maintained—leading to faster deployment and easier scaling across environments.

Closing Thoughts

As modern data ecosystems scale to support AI, open table formats, and real-time analytics, ensuring data quality at the source has become a strategic priority. Our latest updates reflect Telmai’s continued commitment to enabling proactive, high-fidelity AI augmented data quality monitoring, tailored to the needs of both business and data teams.We’re excited for you to explore these new capabilities and welcome your feedback as we continue to innovate.

Want to learn how Telmai can accelerate your AI initiatives with reliable and trusted data? Click here to connect with our team for a personalized demo.

Want to stay ahead on best practices and product insights? Click here to subscribe to our newsletter for expert guidance on building reliable, AI-ready data pipelines.

Snowflake Summit 2025 Recap: What It Means for Data Reliability and AI Readiness

Posted on June 5, 2025June 5, 2025 by Anoop Gopalam

Introduction: Data quality is no longer an afterthought

“There is no AI strategy without a data strategy.” This statement from Snowflake CEO Sridhar Ramaswamy wasn’t just a soundbite, it was the central theme surrounding Snowflake Summit 2025. From Cortex AI SQL to Openflow and the Postgres acquisition, one principle became clear: the future of AI and enterprise applications is grounded in the quality, reliability, and observability of data.

In this article, we look at some key product announcements that stood out from the Snowflake Summit 2025.

1. Easy, Connected, Trusted: Snowflake’s AI Data Cloud in three words

Snowflake Co-founder and Head of Product Benoit Dageville opened the Summit by outlining how AI is becoming embedded across all domains and functions within the enterprise. He emphasized Snowflake’s transformation into a unified platform for intelligent data operations and distilled the company’s AI vision into three foundational principles: easy, connected, and trusted.

Easy: AI development should be frictionless. A unified data platform must reduce complexity so teams can build and deploy faster.

Connected: AI systems can’t operate in silos. Data and applications must move freely across organizational boundaries.

Trusted: Governance isn’t an afterthought. Trust must be built into the platform through end-to-end visibility, control, and accountability.

This framework wasn’t just theoretical but laid the groundwork for many of the following product announcements.

2. Open table formats are now first-class citizens

One of the clearest trends from the Summit was Snowflake’s deeper alignment with the open data ecosystem, especially Apache Iceberg. With support for native Iceberg tables and federated catalog access, Snowflake positioned itself as a format-agnostic, interoperable layer, regardless of whether your architecture follows a lakehouse, data mesh, traditional warehouse, or hybrid model.

This move underscores the growing need to unlock data access and analysis across open and managed environments, enabling teams to build, scale, and share advanced insights and AI-powered applications faster. Snowflake’s commitment to open interoperability was also reflected in its expanded contributions to the open-source ecosystem, including Apache Iceberg, Apache NiFi, Modin, Streamlit, and the incubation of Apache Polaris.

“We want to enable you to choose a data architecture that evolves with your business needs,” said Christian Kleinerman, EVP of Product at Snowflake. “

To support this vision in practice, Snowflake also announced deeper interoperability with external Iceberg catalogs such as AWS Glue and Hive Metastore, allowing teams to query data where it lives without moving it.

Enhanced compatibility with Unity Catalog further reflects a broader trend: governance and lineage must now extend across formats, clouds, and tooling ecosystems, not just within a single vendor stack. These updates position Snowflake not only as a data platform but as a flexible control plane for AI-ready architectures—one where open data, external catalogs, and trusted analytics can operate in sync.

3. Eliminating silos with Openflow’s unified and autonomous ingestion

Snowflake OpenFlow marks a significant step in simplifying data ingestion across structured and unstructured sources. Built on Apache NiFi, it offers an open, extensible interface that supports batch, stream, and multimodal pipelines within a unified framework.

Users can choose between a Snowflake-managed deployment or a bring-your-own-cloud setup, offering flexibility for hybrid and decentralized teams. Crucially, OpenFlow applies the same engineering and governance standards to unstructured data pipelines as it does to structured ones, enabling teams to build reliable data products regardless of source format.

During the keynote, EVP of Product Christian Kleinerman also previewed Snowpipe Streaming, a high-throughput ingestion engine (up to 10GB/s) with multi-client support and immediate data query ability.

Together, these advancements aim to eliminate siloed ingestion workflows and reduce operational friction without compromising reliability at the point.

4. Metadata governance for the AI era: Horizon Catalog and Copilots

Snowflake unveiled Horizon Catalog, a federated catalog designed to unify metadata across diverse sources, including Iceberg tables, dbt models, and BI tools like Power BI. This consolidated view provides both lineage and context across structured and semi-structured datasets, which is critical for organizations embracing decentralized data ownership models or a data mesh architecture.

In addition, the new Horizon Copilot brings natural language search, usage analysis, and lineage insights to the forefront, making it easier for teams to discover, understand, and validate data across their stack.

As enterprises shift to more decentralized models of data ownership, this level of federated visibility and governance becomes essential to ensuring reliability at scale, mainly when data flows across pipelines, clouds, and tools.

5. Semantic views and context-aware AI Signals

Snowflake’s introduction of Semantic Views and Cortex Knowledge Extensions marks a strategic shift toward embedding domain logic directly into the data platform. Semantic Views provide a standardized layer for business logic, enabling consistent metrics, definitions, and calculations across tools. This is especially critical when powering AI models that rely on aligned semantics for trustworthy insights.

Cortex Knowledge Extensions allow teams to inject metadata, rules, and domain-specific guidance into their LLMs and copilots, improving accuracy and reducing hallucinations. For data teams building AI-native pipelines, this means more context-aware signal processing, less noise in anomaly detection, and alerts that reflect business impact.

6. Accelerating AI and DataOps without compromising trust

Snowflake doubled down on operationalizing AI across the enterprise with product updates aimed at trust, speed, and precision. Cortex AI SQL brings LLM capabilities to familiar SQL workflows, allowing users to build natural language-driven queries while maintaining governance. Paired with Snowflake Intelligence and Document AI, these tools reflect a growing push toward embedded agents and copilots that enhance productivity without compromising oversight.

These updates underscore a broader trend: enabling faster AI development cycles while preserving the reliability, auditability, and explainability of decisions made downstream. For data teams, this means aligning DataOps with MLOps and building safeguards that scale with velocity.

Final Thoughts: What This Means for Data Teams

The Snowflake Summit 2025 goes far beyond feature releases and reflects a more profound shift in the design of enterprise data architectures and their governance strategies:

Open formats like Apache Iceberg, Delta Lake, etc, are not just supported, they’re foundational to modern, flexible architectures.
Ingestion at scale is now coupled with expectations of real-time validation and trust at the entry point.
Governance is moving from static policies to intelligent automation and embedded lineage.
AI precision demands semantic alignment and metadata context from the start.

From Horizon Catalog to Cortex AI SQL and OpenFlow, Snowflake is designing for a world where AI-powered insights must be fast and dependable. For data teams, this means architecting systems where reliability, explainability, and agility are not trade-offs but baseline requirements.

As Snowflake doubles down on support for open formats and distributed pipelines, AI-powered data observability tools like Telmai ensure that your data quality scales with your architecture. Whether you’re onboarding Iceberg tables, streaming data through OpenFlow, or aligning KPIs via semantic layers, Telmai integrates natively into your existing data architecture to proactively monitor your data for inconsistencies and validate every record before it impacts AI and analytics outcomes.

Are you looking to make your Snowflake pipelines AI-ready? Click here to talk to our team of experts to learn how Telmai can accelerate access to trusted and reliable data.

Passionate about data quality? Get expert insights and guides delivered straight to your inbox – click here to subscribe to our newsletter now.