What Is Write-Audit-Publish in Apache Iceberg and Why It Matters for Data Quality
The Write–Audit–Publish (WAP) pattern in Apache Iceberg offers a practical, production-grade approach to catching and fixing bad data before it reaches production. By isolating changes in an audit branch, running validation checks, and publishing only trusted data, WAP creates a repeatable and auditable quality gate for your open lakehouse. This walkthrough explores how to implement WAP with Iceberg and Spark.
Enterprise data pipelines handle massive volumes of information daily. However, without proper guardrails or controls, poor-quality data can easily make its way into production, leading to skewed analytics and impacting business decisions.
Consider a financial services firm running daily ETL jobs to update a transactions table in its lakehouse. If that incoming data contains duplicates or inconsistencies and they’re not caught in time, those errors get published to production, feeding faulty metrics into BI dashboards and driving misinformed decisions.
Although data quality means different things to various stakeholders, the shared objective is consistent: delivering accurate, reliable data. For data engineering teams, achieving that at scale across diverse sources and evolving pipelines is a complex, high-stakes challenge.
In this blog, we’ll explore the Write-Audit-Publish (WAP) pattern in Apache Iceberg and show how it helps enterprise data teams enforce data quality, isolate changes, and safely publish updates in modern lakehouse environments.
What is Write-Audit-Publish (WAP)?
Write-Audit-Publish (WAP) is a proven approach in data engineering for ensuring data quality before it reaches production. Instead of sending new data straight into live tables, WAP routes it through a controlled staging process where it can be inspected, validated, and corrected—reducing the risk of bad data polluting critical systems. The process unfolds in three key stages:
- Write: New data is first written to a temporary, non-production branch or table. This keeps it isolated from active workloads, so changes won’t disrupt ongoing operations. In most implementations, this staging or audit table serves as a holding area between the data source and the final production destination.
- Audit: In this stage, the staged data undergoes a comprehensive quality review. Teams can run checks for null values, duplicates, referential integrity violations, or other business-specific rules. Any issues found can be fixed here, ensuring only trusted data moves forward.
- Publish: Once the data passes all validations, it’s promoted to production. This step should be atomic, meaning that downstream consumers see either all of the new changes or none, avoiding partial or inconsistent updates.
Advantages of the WAP Pattern in Iceberg
WAP in Iceberg delivers tangible, operational benefits that directly strengthen data quality, reliability, and developer agility:
- Branch isolation without duplication – Maintain multiple states of the same table (e.g., audit, main) in one logical table for parallel development and testing.
- Atomic publishing – Atomic merge operations ensure that either all updates make it into production or none do, eliminating partial updates that can confuse downstream consumers.
- Effortless rollback – If a branch doesn’t pass review, you can simply drop it—no cleanup or complex undo required.
- Concurrency-safe – Iceberg’s ACID guarantees mean multiple jobs can read and write simultaneously without corrupting data.
- Schema evolution – With Spark’s dynamic schema merge, Iceberg tables can accept new columns or adjust schema mismatches without breaking existing data.
Now that we have a general idea of the Write-Audit-Publish pattern, let’s look at how to implement such a data quality framework using the Apache Iceberg table format.
Implementing Write-Audit-Publish in Apache Iceberg
Apache Iceberg, as a modern table format for data lakes, provides the APIs and table semantics required to implement the Write-Audit-Publish (WAP) workflow. Iceberg doesn’t enforce the pattern itself. It’s the compute engine’s job to orchestrate and execute it. Apache Spark currently offers the most mature and well-documented WAP capabilities. That said, other engines, such as Apache Flink, also integrate tightly with Iceberg and can support equivalent WAP-style workflows, particularly in streaming or real-time scenarios. In this walkthrough, we’ll focus on implementing WAP using Iceberg with Spark.
At a high level, the flow is: write new data to an audit branch, validate it there, and publish to main only after it passes. Using Iceberg’s branching capabilities is considered a best practice when adopting WAP, as it lets you isolate, validate, and merge changes with minimal overhead while supporting multiple parallel changes safely.
For this example, we built the workflow on Spark with Iceberg, backed by Google Cloud Storage (GCS). Our setup included:

- Apache Spark with Iceberg runtime and extensions enabled
- Google Cloud Storage (GCS) as the data lake storage layer
- Hive Metastore or similar, for table and branch metadata
- Python/PySpark scripts using Spark’s Iceberg integration
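To make the walkthrough concrete, here’s a minimal sketch of what such a Spark session could look like. The catalog name (lakehouse), metastore URI, and GCS bucket are illustrative placeholders rather than values from our actual environment:

```python
from pyspark.sql import SparkSession

# Minimal Spark session wired up for Iceberg on GCS.
# The iceberg-spark-runtime jar and GCS connector must already be on the classpath.
spark = (
    SparkSession.builder
    .appName("iceberg-wap-demo")
    # Iceberg SQL extensions enable branch DDL and CALL procedures
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog named "lakehouse" backed by a Hive Metastore
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hive")
    .config("spark.sql.catalog.lakehouse.uri", "thrift://metastore-host:9083")
    # GCS bucket used as the warehouse location
    .config("spark.sql.catalog.lakehouse.warehouse", "gs://your-bucket/warehouse")
    .getOrCreate()
)
```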
Step 1 – Create Table and Branch
We start by creating a table in Apache Iceberg with Write-Audit-Publish (WAP) enabled and branching configured. This sets up two environments:
- main – the production branch your workloads will read from
- audit_branch – an isolated staging branch for new data
This separation ensures no unvalidated data ever reaches production.
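A minimal sketch of that DDL, assuming a catalog named lakehouse, a sales namespace, and column names that mirror the order schema described later in this post (all of these names are illustrative):

```python
# Create the namespace and the Iceberg table with write-audit-publish enabled.
# 'write.wap.enabled' tells Iceberg to stage writes for later auditing and publishing.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.sales")

spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        OrderId         STRING,
        OrderDate       STRING,
        CustomerId      STRING,
        ProductId       STRING,
        ProductName     STRING,
        Quantity        INT,
        UnitPrice       DOUBLE,
        TotalAmount     DOUBLE,
        OrderStatus     STRING,
        ShippingAddress STRING
    )
    USING iceberg
    TBLPROPERTIES ('write.wap.enabled' = 'true')
""")
```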
We then seed main with some initial sample data so it behaves like a real production table:
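One simple way to do that is to generate synthetic orders and append them. The helper below is an illustrative sketch, not our production data generator:

```python
from datetime import date, timedelta
import random

from pyspark.sql import functions as F

def make_order(i: int):
    """Build one synthetic order row matching the table schema above."""
    qty = random.randint(1, 5)
    price = 19.99
    return (
        f"ORD-{i:05d}",                                   # OrderId
        str(date(2024, 1, 1) + timedelta(days=i % 30)),   # OrderDate (yyyy-mm-dd)
        f"CUST-{random.randint(1, 40):04d}",              # CustomerId
        f"PROD-{random.randint(1, 20):03d}",              # ProductId
        "Sample product",                                 # ProductName
        qty,                                              # Quantity
        price,                                            # UnitPrice
        round(qty * price, 2),                            # TotalAmount
        "CONFIRMED",                                      # OrderStatus
        "123 Main St",                                    # ShippingAddress
    )

cols = ["OrderId", "OrderDate", "CustomerId", "ProductId", "ProductName",
        "Quantity", "UnitPrice", "TotalAmount", "OrderStatus", "ShippingAddress"]

def orders_df(start: int, n: int):
    """Create a DataFrame of n synthetic orders with ids starting at start."""
    df = spark.createDataFrame([make_order(i) for i in range(start, start + n)], cols)
    # Cast Quantity to match the table's INT column (Python ints infer as bigint).
    return df.withColumn("Quantity", F.col("Quantity").cast("int"))

# Seed the main branch with 100 rows so it behaves like a real production table.
orders_df(0, 100).writeTo("lakehouse.sales.orders").append()
```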
Finally, we create the audit branch. All new writes and validation will happen here until the data passes quality checks:
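With Iceberg’s Spark SQL extensions enabled, creating the branch is a single statement (table name as in the earlier sketches):

```python
# Create the audit branch off the current state of main.
spark.sql("ALTER TABLE lakehouse.sales.orders CREATE BRANCH audit_branch")
```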
Step 2 – Write New Data to the Audit Branch
When new data arrives, whether from daily ETL jobs or ad-hoc batch loads, we don’t write it directly to production. Instead, we target the audit branch, ensuring production (main) remains untouched until validation is complete.
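One way to do that is Spark’s spark.wap.branch session setting, which routes writes to the named branch; you can also target the branch identifier directly on a single write. A sketch, reusing the orders_df helper from Step 1 (that helper, like the table name, is an assumption of these sketches rather than the post’s actual script):

```python
# Route writes in this session to the audit branch instead of main.
# Note: while this is set, reads of the table also resolve to the audit branch.
spark.conf.set("spark.wap.branch", "audit_branch")

# Append a new batch of 50 orders; in a real test you might deliberately
# inject a few invalid records here to exercise the audit checks.
orders_df(100, 50).writeTo("lakehouse.sales.orders").append()

# Alternative: leave the session setting alone and target the branch explicitly.
# orders_df(100, 50).writeTo("lakehouse.sales.orders.branch_audit_branch").append()
```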
Next, we simulate 50 new orders—with the option to inject invalid records for testing data quality checks—and append them to the audit branch. After this step:
- Main branch: 100 rows (unchanged)
- Audit branch: 150 rows (100 original + 50 new)
Here’s a glimpse of the generated order schema so you know what we’re working with:
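Calling printSchema() against the audit branch shows something along these lines (output reproduced as comments; the exact types follow the illustrative DDL above):

```python
# Inspect the schema of the data staged on the audit branch.
spark.table("lakehouse.sales.orders.branch_audit_branch").printSchema()

# root
#  |-- OrderId: string (nullable = true)
#  |-- OrderDate: string (nullable = true)
#  |-- CustomerId: string (nullable = true)
#  |-- ProductId: string (nullable = true)
#  |-- ProductName: string (nullable = true)
#  |-- Quantity: integer (nullable = true)
#  |-- UnitPrice: double (nullable = true)
#  |-- TotalAmount: double (nullable = true)
#  |-- OrderStatus: string (nullable = true)
#  |-- ShippingAddress: string (nullable = true)
```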
Each record includes an OrderId, OrderDate, CustomerId, product details, quantity, unit price, total amount, order status, and shipping address, giving us a realistic dataset to validate in the next stage.
Step 3 – Validate the Audit Branch
Before anything gets published, the pipeline runs automated audits against the staged data. We run checks for:
- Date format validation: ensures all OrderDate values follow the yyyy-mm-dd pattern, catching malformed or missing entries before they cause downstream issues.
- Date value validation: goes beyond formatting to verify that months are within 1–12, days are within 1–31, and all dates are logically valid.
- Business rule checks: enforces critical domain rules:
  - No negative TotalAmount values
  - Quantity between 1 and 100
  - No nulls in key identifiers (OrderId, CustomerId, ProductId)
  - No duplicate OrderIds
- Branch vs. main comparison: using the code below, we confirm exactly what will be added to production, including the number of new orders, total and average order value, and date range.
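Here’s a condensed sketch of a few of these checks and the branch-vs-main comparison. It reuses the illustrative table and column names from earlier and covers only a subset of the rules listed above:

```python
from pyspark.sql import functions as F

# Read the staged data and the current production state explicitly by branch,
# so the comparison is unaffected by the spark.wap.branch session setting.
audit_df = spark.table("lakehouse.sales.orders.branch_audit_branch")
main_df  = spark.table("lakehouse.sales.orders.branch_main")

# Business rule checks: null keys, duplicate OrderIds, invalid amounts/quantities.
null_keys   = audit_df.filter(F.col("OrderId").isNull() |
                              F.col("CustomerId").isNull() |
                              F.col("ProductId").isNull()).count()
dup_orders  = audit_df.groupBy("OrderId").count().filter("count > 1").count()
bad_records = audit_df.filter((F.col("TotalAmount") < 0) |
                              (F.col("Quantity") < 1) |
                              (F.col("Quantity") > 100)).count()
checks_pass = (null_keys == 0) and (dup_orders == 0) and (bad_records == 0)

# Branch vs. main: summarize exactly what would be published.
new_orders = audit_df.join(main_df, on="OrderId", how="left_anti")
new_orders.agg(
    F.count(F.lit(1)).alias("new_order_count"),
    F.sum("TotalAmount").alias("total_value"),
    F.avg("TotalAmount").alias("avg_order_value"),
    F.min("OrderDate").alias("earliest_order"),
    F.max("OrderDate").alias("latest_order"),
).show()
```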
If all checks pass, the branch is marked as safe to publish. If not, the pipeline reports each failing rule with actionable guidance, blocking the merge until issues are resolved.
Step 4 – Publish to Production
Once the audit branch passes all validations, we promote it to the main branch using an atomic branch publish. Iceberg provides several safe publish operations, such as snapshot cherry-pick or fast-forward, ensuring all-or-nothing updates:
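Using Iceberg’s built-in Spark procedures, a fast-forward publish can be as simple as the call below (catalog and table names follow the earlier sketches):

```python
# Atomically fast-forward main to the head of the audit branch.
# Downstream readers see either the old main or the fully published state, never a mix.
spark.sql("""
    CALL lakehouse.system.fast_forward(
        table  => 'sales.orders',
        branch => 'main',
        to     => 'audit_branch'
    )
""")
```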
If validations fail and you try to publish, the script blocks you unless you explicitly override (with a visible warning).
After publishing, the audit branch is typically removed to keep the workspace tidy and prevent accidental re-use.
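With the same naming as before, that cleanup is a one-line statement:

```python
# Drop the audit branch now that its changes live on main.
spark.sql("ALTER TABLE lakehouse.sales.orders DROP BRANCH audit_branch")
```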
Step 5 – Verify Main Branch
As a final safeguard, we re-read the main branch immediately after publishing to validate the post-merge state. This step ensures that the published data matches expectations and that no corruption occurred during the merge.
The script confirms:
- Total record count
- Total revenue
- Date range of all orders
- Integrity check results (e.g., no missing keys, no duplicates)
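A sketch of that verification, again using the illustrative table name and session settings from the earlier steps:

```python
from pyspark.sql import functions as F

# Clear the WAP branch setting (set during Step 2) so reads go against main.
spark.conf.unset("spark.wap.branch")

main_df = spark.table("lakehouse.sales.orders")

# Summary metrics: record count, revenue, and order date range.
main_df.agg(
    F.count(F.lit(1)).alias("total_records"),
    F.sum("TotalAmount").alias("total_revenue"),
    F.min("OrderDate").alias("earliest_order"),
    F.max("OrderDate").alias("latest_order"),
).show()

# Basic integrity checks: no missing keys, no duplicate OrderIds.
assert main_df.filter(F.col("OrderId").isNull()).count() == 0
assert main_df.groupBy("OrderId").count().filter("count > 1").count() == 0
```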
Why This Matters for Data Quality
Whether you’re managing millions of financial transactions, high-volume clickstreams, or sensitive healthcare records, adopting the Write–Audit–Publish (WAP) pattern with Apache Iceberg establishes repeatable, auditable, and scalable quality gates for your data.
The WAP workflow isn’t just about “moving data around.” It’s a production-grade data quality firewall where:
- Bad data never hits production – All incoming changes live in the audit branch until they’ve passed quality checks, keeping live tables clean.
- Controlled schema evolution – Even when upstream data changes unexpectedly, Iceberg can adjust without violating data contracts or dropping critical fields.
- Faster validation cycles – Teams can ingest, validate each update with customizable, business-driven logic, and fix issues in parallel to production workloads, speeding up the path to trusted data.
With a modern data quality platform like Telmai, data reliability in the Write–Audit–Publish process becomes fully automated and deeply integrated into your open lakehouse. Telmai natively supports Iceberg, Delta Lake, and Apache Hudi, allowing you to plug directly into your existing data stack—no code, no re-engineering.
Once connected, Telmai continuously monitors your audit branch with ML-driven and rule-based checks, automatically detecting anomalies, schema changes, and data drift before they impact production. By catching issues at the source, you can resolve them faster, maintain clean, trusted datasets, and publish to production with confidence, ensuring downstream analytics, AI models, and business reports always run on reliable data.
Turn your Iceberg lakehouse into a trusted data source. Click here to talk to our team to learn how Telmai keeps your Iceberg tables production-ready.
Want to stay ahead on best practices and product insights? Click here to subscribe to our newsletter for expert guidance on building reliable, AI-ready data pipelines.