At 10 TB, updating 200k rows can mean rewriting thousands of files. Here’s why, and what to do about it.


The Assumption

Most teams never look inside MERGE. Match rows, update, insert or delete, done — the syntax practically invites you to stop thinking. At small scale, that bargain holds. At 10+ TB with high-frequency ingestion and fragmented file layouts, MERGE becomes the bottleneck. Not because of a bug — because of what it actually does under the hood.

Where It Breaks

At scale, MERGE is dominated by two amplification mechanisms:

A — Wide scan radius. Without partition pruning, Spark cannot narrow the candidate set by partition. Data skipping on file-level statistics can still help, but if the layout is not selective — file-level value ranges too wide to exclude most files — the scan approaches the full table. The benchmark below measures this on a 44 GB table; at production scale the same mechanism means reading terabytes to update megabytes.

B — Small file accumulation. Frequent ingestion (every 5–15 minutes) fragments the target into thousands of small files, multiplying scan cost and scheduling overhead per MERGE operation.

To understand the mechanics behind them, you need to look at what MERGE actually does.

Conceptually, MERGE boils down to five logical phases — candidate selection, scan, row classification, file rewrite, and atomic commit:1 2

Source           Target table
(354 MB)         (44 GB, 58,400 files)
   │                  │
   ▼                  ▼
┌──────────────────────────────────┐
│ 1. Candidate selection           │  partition pruning, data skipping
│    (which files might match?)    │  → reduces scan scope
├──────────────────────────────────┤
│ 2. Scan + join                   │  source ⋈ target on merge key
│    (correlate source ↔ target)   │  → affected data located
├──────────────────────────────────┤
│ 3. Row classification            │  each pair → update / delete / insert
│    (evaluate WHEN clauses)       │
├──────────────────────────────────┤
│ 4. File rewrite                  │  every file with ≥1 match → rewritten
│    (copy-on-write)               │  1 row changed = entire file rewritten
├──────────────────────────────────┤
│ 5. Atomic commit                 │  _delta_log: old files removed,
│    (transaction log)             │  new files added, single atomic update
└──────────────────────────────────┘

Of these five phases, two drive the bulk of the cost: the scan (phase 2) and the file rewrite (phase 4). Both are forms of amplification — the system does more total work than the update strictly requires — and they are largely independent: one is how much you read, the other is how much you write.3

The scan reads every candidate file to find matching rows. Without partition pruning or effective data skipping, the candidate set can approach the full table — reading terabytes to locate megabytes of changes.

The rewrite is a consequence of copy-on-write: Delta does not update rows in place. It rewrites every Parquet file that contains at least one matched row. One modified row in 10,000 files means 10,000 files rewritten.
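The effect is easy to reproduce with a toy layout model (plain Python, illustrative only — not the Delta implementation): any file containing at least one matched row must be rewritten in full.

```python
# Toy model of copy-on-write rewrite amplification (illustrative, not
# the Delta implementation): any file containing at least one matched
# row must be rewritten in full.

def rewritten(files: dict[str, list[int]], updated: set[int]) -> tuple[int, int]:
    """Return (files_rewritten, rows_rewritten) for an update of `updated` keys."""
    touched = [rows for rows in files.values() if any(k in updated for k in rows)]
    return len(touched), sum(len(rows) for rows in touched)

# 1,000 files of 100 rows each; keys assigned round-robin, so consecutive
# keys land in different files (a worst-case layout for point updates).
files = {f"part-{i:04}": list(range(i, 100_000, 1_000)) for i in range(1_000)}

# Updating keys 0..999 touches exactly one row per file:
n_files, n_rows = rewritten(files, updated=set(range(1_000)))
# 1,000 files and 100,000 rows rewritten to change 1,000 rows: a 100x
# write amplification caused by layout alone.
```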

The practical corollary: if your source is small but your MERGE is slow, the problem is not the update — it is the table layout.

Everything described above assumes the default copy-on-write path. Deletion vectors offer an alternative write model that trades immediate rewrite cost for deferred read and maintenance cost.4

The Numbers

To make the amplification concrete, the benchmark below runs the same MERGE against the same table with progressively more optimizations — partition pruning, then Z-ordering, then compaction — so the effect of each on scan and rewrite is isolated. The last two rows show chunked MERGE,5 an operational pattern that does not reduce per-file amplification but can improve stability when the source is too large for a single MERGE.

The two cost drivers are directly observable as file counts: files scanned and files rewritten. File-count ratios are scale-invariant — they depend on layout geometry (how rows are distributed across files and partitions), not on absolute table size. The same 12x scan reduction and 7x rewrite reduction apply whether the table is 44 GB or 10 TB. Wall-clock times are not portable: this benchmark runs on NVMe (3+ GB/s sequential), whereas on cloud storage, where per-file open latency (50–200 ms) dominates, the same file-count reductions translate to much larger time savings. Read the file-count columns first; the durations are secondary.

Measured on a single machine running Spark 4.1 in local mode with Delta Lake 4.1 OSS. The setup:

  • Hardware: AMD Ryzen 9 7950X3D (16 cores / 32 threads), 125 GB RAM, 1.8 TB NVMe
  • Runtime: Spark 4.1.1 local[*], 90 GB driver memory, Delta Lake 4.1.0 OSS, Java 17
  • Target: 44 GB, 58,400 Parquet files (~750 KB avg, fragmented from 5 batch writes simulating repeated ingestion), partitioned by date (365 partitions, ~160 files per partition)
  • Source: 354 MB, 200k rows — 120k updates sampled from hot partitions + 80k inserts. 92% of rows concentrated on 30 dates (80% of inserts target the same 30 dates)
  • Merge key: id (UUID, high-cardinality, uniformly distributed within each partition) — does not coincide with partition key date
  • Match rate: 60% updates (120k matches), 40% inserts (80k new rows)

Full setup, scripts, and raw results: delta-merge-bench.

Repro notes:

  • Data generation is seeded (SEED=42) for deterministic row distribution across partitions
  • maxRecordsPerFile=1000 controls the initial file fragmentation (~58,400 files)
  • In this benchmark, file-level stats were present in _delta_log (add.stats JSON) for the relevant columns; since stats are optional in the protocol, data skipping only helps when those stats are actually recorded

Read:Write is the ratio of files scanned to files rewritten within each scenario. Rewrite fraction is the share of baseline rewrite remaining after optimization (lower is better).6

Scenario                        Files scanned    Files rewritten    Read:Write      Rewrite fraction    Scan time    Rewrite time    Total
──────────────────────────────  ───────────────  ─────────────────  ──────────────  ──────────────────  ──────────   ─────────────   ──────
Baseline (no optimization)      58,400           3,092              19×             1× (ref)            16.0 s       8.7 s           25.0 s
+ Partition pruning             ~4,800           3,092              1.5×            1×                  13.0 s       7.8 s           20.9 s
+ Z-ordering on merge key       ~5,300           441                12×             0.14×               5.4 s        11.2 s*         16.6 s
+ Pre-compaction (OPTIMIZE)     ~5,200           427                12×             0.14×               5.4 s        11.7 s          17.1 s
Chunk 1 (hot dates)             ~4,300           427                10×             0.14×               2.8 s        10.9 s          13.8 s
Chunks 2–7 (cold, insert-only)  ~150 / chunk     0 / chunk          —               —                   2.9 s avg    0.3 s avg       ~3.4 s each

* Rewrite time increases despite fewer files — see the Z-ordering discussion below.

The scenarios are cumulative. Files scanned rises slightly from ~4,800 to ~5,300 after Z-ordering. The partition predicate still selects the same 30 partitions, but Z-ordering rewrites files with different boundaries and slightly different counts per partition, so the same predicate maps to a modestly larger file set. The increase is not operationally significant.
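The derived columns are plain ratios over the measured file counts and can be recomputed directly (values copied from the baseline and Z-ordering rows above):

```python
# Recompute the derived columns from the measured file counts.
baseline_scanned = 58_400
baseline_rewritten = 3_092
zorder_rewritten = 441

read_write = baseline_scanned / baseline_rewritten        # files read per file written
rewrite_fraction = zorder_rewritten / baseline_rewritten  # share of baseline rewrite left

assert round(read_write) == 19             # baseline Read:Write column
assert round(rewrite_fraction, 2) == 0.14  # Z-ordering rewrite fraction
```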

Mitigation Strategies

The two sections below group strategies by the amplification axis they target, as measured in the benchmark above — reducing the scan and reducing the rewrite.7

Reducing the Scan: Pruning and Data Skipping

Three mechanisms reduce the number of files MERGE reads. They differ in reliability and in what they require.

Static partition pruning is deterministic. If the merge condition constrains the partition column — whether via a literal predicate like t.date IN ('2026-01-01', '2026-01-02') or a join condition like t.date = s.date from which the optimizer can extract values at plan time — Spark resolves the matching partitions before execution and never reads the others. The key distinction is plan-time resolution: if the optimizer cannot derive the partition values statically, pruning defers to DPP at runtime. In the benchmark, the baseline scans all 58,400 files because t.id = s.id does not constrain the partition column. Adding a partition constraint reduced the scan to ~4,800 files — a 12x improvement — without changing the rewrite count.
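The pruning arithmetic for this benchmark is simple enough to check directly (file counts taken from the setup above):

```python
# Static pruning reduces the candidate set from all partitions to the
# partitions the source actually touches.
partitions = 365
files_per_partition = 160
hot_partitions = 30           # distinct dates present in the source

without_pruning = partitions * files_per_partition    # 58,400 files
with_pruning = hot_partitions * files_per_partition   # 4,800 files
improvement = without_pruning / with_pruning          # ~12x
```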

Dynamic partition pruning (DPP) is opportunistic. Spark derives runtime filters from the build side of the join and pushes them into the target scan. Whether this actually prunes partitions depends on the plan shape, source size, and the optimizer’s cost model. DPP helps when it fires; you cannot rely on it firing — in practice, many MERGE plans do not activate DPP at all. DPP is most reliable when the source side can be broadcast, although Spark can sometimes derive runtime filters from other join plans. In the benchmark, DPP is enabled by default but its effect is entirely subsumed by the static predicate on date — the explicit constraint already selects exactly the relevant partitions, leaving nothing for DPP to prune further.

Data skipping is a fallback when the merge key lacks locality — though on columns with natural ordering (timestamps, monotonic IDs), it can be the primary pruning mechanism. Even when partition pruning is unavailable, Delta can skip individual files using min/max statistics stored in the transaction log — minimum, maximum, and null count for up to N columns (controlled by the table property delta.dataSkippingNumIndexedCols). If the merge key is among the columns with recorded statistics, Delta eliminates files whose value range does not overlap with the source. If statistics are not collected for the merge key, data skipping is blind.

In the benchmark, stats are present for id, but because id is a high-cardinality UUID uniformly distributed within each partition, the min/max ranges overlap across nearly all files — a worst case for data skipping. On keys with natural locality (timestamps, monotonic IDs), the effect can be much stronger, especially after Z-ordering — not measured here. The confirmation is in the numbers: the source touches 30 distinct dates, so the partition predicate selects 30 of 365 partitions. ~4,800 files after pruning ≈ 30 partitions × 160 files per partition — exactly what static pruning alone would produce. If data skipping were effective, the count would be lower.
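Why UUID keys defeat data skipping can be shown with a minimal overlap check (plain Python with assumed file statistics; reducing the source side to a single [min, max] envelope is a simplification of what Delta actually compares):

```python
def files_to_scan(stats: list[tuple[int, int]], source_keys: list[int]) -> int:
    """Count files whose [min, max] key range overlaps the source key range."""
    lo, hi = min(source_keys), max(source_keys)
    return sum(1 for fmin, fmax in stats if fmin <= hi and fmax >= lo)

# Key with natural locality: each file covers a narrow, disjoint range.
clustered = [(i * 100, i * 100 + 99) for i in range(100)]
# UUID-like key: every file's range spans nearly the whole key space.
uniform = [(0, 9_999)] * 100

source = list(range(4_200, 4_500))   # source touches a narrow key band

files_to_scan(clustered, source)     # 3 files: skipping is highly selective
files_to_scan(uniform, source)       # 100 files: skipping is blind
```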

These three mechanisms compound but are not interchangeable. Partition pruning operates on the directory structure and is always reliable. DPP depends on runtime conditions. Data skipping helps on top of pruning but cannot replace it. The practical consequence: always include the partition column in the merge condition, even if it is redundant with the merge key.

Reducing the Rewrite: Z-Ordering and Compaction

Two mechanisms reduce the rewrite: Delta’s OPTIMIZE command and Z-ordering.

Delta’s OPTIMIZE command merges small files into larger ones. On the benchmark table, compaction with an 8 MB target file size reduced the total file count from 58,400 to ~5,200. (The MERGE’s partition predicate selects a similar count because the 30 hot partitions contain most of the table’s files after compaction.) The config is spark.databricks.delta.optimize.maxFileSize — the databricks prefix is historical and applies to OSS Delta as well. The 8 MB target was chosen for this benchmark specifically: a 1 GB target would drastically reduce within-partition file granularity in this setup — fewer, larger files mean each file spans a wider value range, making min/max statistics less selective and limiting the benefit Z-ordering can provide. In production you typically target larger files; partitioning and clustering are how you regain selectivity.
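The file-count effect of compaction can be approximated with greedy bin packing (a sketch, not the OPTIMIZE algorithm; real compaction packs within partition boundaries, which is why the measured count lands near ~5,200 rather than the idealized figure):

```python
def compact(sizes_kb: list[int], target_kb: int) -> int:
    """File count after greedily merging files up to a target size."""
    bins, current = 0, 0
    for size in sizes_kb:
        if current and current + size > target_kb:
            bins, current = bins + 1, 0   # close the bin, start a new one
        current += size
    return bins + (1 if current else 0)

# 58,400 fragments of ~750 KB packed toward an 8 MB (8,192 KB) target:
compact([750] * 58_400, target_kb=8_192)   # 5,840 files, roughly 10x fewer
```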

Z-ordering attacks the problem from the opposite direction. Instead of reading fewer files, it arranges the data so that matching rows concentrate into fewer files in the first place — using a space-filling curve to co-locate rows with similar values. When applied to the merge key, matching rows that were previously scattered across thousands of files concentrate into a fraction of the original set:

Before Z-ordering (id scattered across files):

File 1:  id=A id=M id=Z id=B id=Q        ← match on id=A → rewrite File 1
File 2:  id=C id=A id=R id=F id=X        ← match on id=A → rewrite File 2
File 3:  id=A id=D id=K id=S id=N        ← match on id=A → rewrite File 3
File 4:  id=G id=H id=P id=T id=V        ← no match     → untouched
...
3,092 files contain at least one matching row

After Z-ordering on id:

File 1:  id=A id=A id=A id=B id=B        ← match on id=A → rewrite File 1
File 2:  id=C id=D id=F id=G id=H        ← no match     → untouched
File 3:  id=K id=M id=N id=P id=Q        ← no match     → untouched
File 4:  id=R id=S id=T id=V id=X        ← no match     → untouched
...
441 files contain at least one matching row (7× fewer)

In the benchmark, Z-ordering reduced files rewritten from 3,092 to 441 — a 7x improvement. The Z-ordering pass cost 135 seconds. Combined with OPTIMIZE, the total investment was 258 seconds (4.3 minutes) and reduced subsequent MERGE rewrite from 3,092 to 427 files. A counterintuitive detail: rewrite time increases from 7.8s to 11.2s despite rewriting 7x fewer files. After Z-ordering, each rewritten file is much larger, so each rewrite task processes more bytes; with fewer tasks, parallelism also drops. On cloud storage, where per-file open latency dominates over per-task byte cost, the 7x file reduction translates directly to less total latency.
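The diagrams above can be reproduced with a toy layout model (plain Python; the file sizes and match distribution are assumptions, not the benchmark data). The key point it illustrates: clustering pays off because the matched keys themselves have locality in key space — here concentrated in one hot band, mirroring the benchmark's hot partitions.

```python
import random

def files_touched(keys: list[int], rows_per_file: int, matched: set[int]) -> int:
    """Count files containing at least one matched key."""
    files = (keys[i:i + rows_per_file] for i in range(0, len(keys), rows_per_file))
    return sum(1 for f in files if matched & set(f))

random.seed(42)
keys = list(range(100_000))                               # 1,000 files of 100 rows
matched = set(random.sample(range(40_000, 45_000), 500))  # matches in a hot key band

scattered = keys[:]
random.shuffle(scattered)        # ingestion order: no locality
clustered = sorted(scattered)    # Z-order on a single key degenerates to a sort

files_touched(scattered, 100, matched)   # hundreds of files contain a match
files_touched(clustered, 100, matched)   # at most 50 files (the hot band)
```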

The investment is not permanent. With ingestion every 10 minutes, small files accumulate within hours and file fragmentation returns. A single OPTIMIZE is not enough: it must either be repeated as a scheduled trailing job after the last ingestion window, or replaced by liquid clustering,8 which supersedes partitioning and Z-ordering for clustered tables. Liquid clustering reorganizes data incrementally when you run OPTIMIZE — only newly written files are re-clustered, not the entire table. On Databricks Runtime with auto-optimize enabled, clustering can also trigger on write above a size threshold.
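How quickly the benefit erodes is back-of-the-envelope arithmetic; the files-per-batch figure below is an assumption for illustration, not a measurement:

```python
batches_per_hour = 6       # ingestion every 10 minutes
files_per_batch = 200      # small files per micro-batch (assumed, workload-dependent)
compacted_files = 5_200    # table file count right after OPTIMIZE

# Hours until freshly written small files outnumber the compacted layout:
hours = compacted_files / (batches_per_hour * files_per_batch)   # ~4.3 hours
```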

Decision Framework

The two amplification axes — scan and rewrite — respond to different interventions. Partition pruning reduces how much you read; Z-ordering and compaction reduce how much you write. Neither substitutes for the other, and combining them is where the largest gains come from. The table below orders the strategies by priority: start with table design and baseline hygiene before investing in layout optimization, because each layer assumes the previous ones are already in place.

                          Solution              When to use                    Cost
                          ────────────────────  ─────────────────────────────  ────────────────────
Table design              Align partition key   New table or schema change      Schema migration
                          with merge key        possible

Baseline hygiene          Partition pruning     Always — prerequisite          Zero (one-line change)
                          Source dedup          Dirty data, duplicate keys     Pre-processing step
                          Data skipping tuning  Key lacks file-level stats     Config change

Layout optimization       Z-order on merge key  Join key ≠ partition key        135 s / 44 GB (measured)
                          Pre-MERGE OPTIMIZE    Frequent ingestion             258 s / 44 GB (measured)

Operational alternatives  Chunked MERGE         Source too large for single     N commits, more conflict
                                                stable MERGE                   windows, idempotency req.
                          Deletion vectors      Rewrite cost dominates,        Deferred read overhead,
                                                periodic OPTIMIZE acceptable   periodic compaction

Source dedup,9 chunked MERGE,5 and deletion vectors4 are detailed in the notes below.

If you control the table schema, aligning the partition key with the merge key is the strongest lever — but it is only viable when the merge key has low-to-moderate cardinality and is meaningful for query patterns beyond MERGE. When that alignment is not possible, the strategies above are how you close the gap between the logical operation (“update these rows”) and the physical one (“rewrite every file that contains at least one of them”). If you cannot prune, MERGE stays scan-heavy. If you cannot cluster, it stays rewrite-heavy. Without either, you are absorbing the full cost of a bad layout, one MERGE at a time.

Notes


  1. The actual implementation compresses some of these into fewer physical steps — candidate selection, scan, and row classification happen within a single join operation. The five-phase model is a logical decomposition that helps reason about where cost accumulates. ↩︎

  2. Concurrent MERGE and conflict resolution. The five phases above describe a single MERGE in isolation. In production, multiple operations often run against the same table simultaneously — two pipelines writing to the same table, or a MERGE running alongside a scheduled OPTIMIZE. Delta Lake handles this through optimistic concurrency control: each operation reads a snapshot of the transaction log, does its work, and attempts to commit. At commit time, Delta checks whether any other operation committed first and modified overlapping files. If the file sets are disjoint — for example, two MERGEs that touch different partitions — both commits succeed. If they overlap, the later commit fails with a ConcurrentModificationException and must be retried from scratch. This has direct implications for the strategies discussed above: partition pruning narrows the file set each MERGE touches, reducing the conflict surface between concurrent writers. Conversely, a full-table MERGE with no pruning conflicts with almost any concurrent write. Chunked MERGE amplifies this risk — N smaller commits means N windows where a concurrent writer can create a conflict. ↩︎

  3. UPDATE and DELETE on Delta suffer from the same scan and rewrite amplification. MERGE makes them more visible because it adds a join with its shuffle cost. ↩︎

  4. Deletion Vectors. When enabled (write support in OSS Delta Lake since 2.4, with MERGE support since 3.1), Delta can avoid immediate copy-on-write rewrites by recording row-level deletions in sidecar metadata and writing only new or updated rows as separate files. This reduces immediate rewrite amplification, but scan amplification is unchanged — the touched files must still be scanned to identify matches and generate the deletion vectors. Subsequent reads must reconstruct the current state by applying deletion vectors, and the layout becomes less compact over time. Periodic OPTIMIZE is needed to materialize a cleaner layout. The amplification does not disappear — it shifts from eager file rewrite to deferred cleanup and read-time resolution. Note that row-level concurrency — where two MERGEs can modify different rows in the same file without conflict — requires row tracking and is a Databricks Runtime feature (DBR 14.2+), not available in OSS Delta. ↩︎ ↩︎

  5. Chunked MERGE. Break the source into temporal or key-based chunks and execute N smaller MERGEs. Each chunk means a smaller working set, fine-grained retry, and less shuffle. The trade-off is transactional: N commits instead of one, more conflict opportunities with concurrent writers, and intermediate table states between chunks. Chunk boundaries should split the heavy partitions — not follow a fixed calendar grid. Follow with OPTIMIZE to compact the small files produced by each commit. ↩︎ ↩︎
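A minimal chunk planner under those constraints (a sketch; the greedy grouping and the row cap are illustrative choices, not a prescribed algorithm):

```python
def plan_chunks(rows_per_date: dict[str, int], cap: int) -> list[list[str]]:
    """Group source dates into chunks capped by row count, heaviest first,
    so hot dates anchor their own chunks instead of a fixed calendar grid."""
    chunks, current, size = [], [], 0
    for date, n in sorted(rows_per_date.items(), key=lambda kv: -kv[1]):
        if current and size + n > cap:
            chunks.append(current)
            current, size = [], 0
        current.append(date)   # a single date may exceed the cap; it still
        size += n              # forms its own chunk (a date cannot be split)
    if current:
        chunks.append(current)
    return chunks

# Hypothetical source distribution: one hot date, several cold ones.
source_rows = {"2026-01-05": 90_000, "2026-01-06": 40_000, "2026-01-07": 5_000,
               "2026-01-08": 3_000, "2026-01-09": 2_000}
plan_chunks(source_rows, cap=50_000)
# The 90k-row hot date gets its own chunk; the cold dates group under the cap.
```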

  6. Read:Write = files scanned / files rewritten — a ratio of read work to write work within each scenario. Rewrite fraction = files rewritten / baseline rewritten — the remaining share of rewrite work after optimization (lower is better). File counts are used as a portable proxy; in bytes the amplification depends on file-size distribution, which changes after Z-ordering and compaction. ↩︎

  7. This benchmark runs on NVMe with 3+ GB/s sequential throughput. On cloud storage, where file-open latency dominates, the 12x scan reduction and 7x rewrite reduction translate to proportionally larger wall-clock improvements. The file counts — not the durations — are the portable metric. ↩︎

  8. Liquid clustering. Available in OSS Delta Lake 3.1+ (preview, GA in 3.2) and Databricks Runtime 13.3+. Enabling liquid clustering upgrades the table protocol version and cannot be rolled back. ↩︎

  9. Source deduplication. In Delta Lake OSS, MERGE raises an error when multiple source rows match the same target row if any MATCHED clause is present. Databricks Runtime has relaxed this constraint for specific clause combinations, but the safe practice remains: deduplicate the source on the merge key before executing. ↩︎