At 10 TB, updating 200k rows can mean rewriting thousands of files. Here’s why, and what to do about it.
The Assumption
Most teams never look inside MERGE. Match rows, update, insert or delete, done — the syntax practically invites you to stop thinking. At small scale, that bargain holds. At 10+ TB with high-frequency ingestion and fragmented file layouts, MERGE becomes the bottleneck. Not because of a bug — because of what it actually does under the hood.
Where It Breaks
At scale, MERGE is dominated by two amplification mechanisms:
A — Wide scan radius. Without partition pruning, Spark cannot narrow the candidate set by partition. Data skipping on file-level statistics can still help, but if the layout is not selective, the scan approaches the full table. The benchmark below measures this on a 44 GB table; at production scale the same mechanism means reading terabytes to update megabytes.
B — Small file accumulation. Frequent ingestion (every 5–15 minutes) fragments the target into thousands of small files, multiplying scan cost and scheduling overhead per MERGE operation.
To understand the mechanics behind them, you need to look at what MERGE actually does.
Conceptually, MERGE boils down to five logical phases.1
1 — Candidate selection. Spark uses the merge condition and table metadata (partitioning and file statistics) to build a candidate target file set — possibly the full table.
2 — Scan. Spark scans those candidate files and correlates source and target rows. Delta can reduce scope via partition pruning (requires predicates on the partition column) and data skipping (file-level min/max statistics, effective when relevant column statistics exist and the layout is selective). Without either, the scan can degenerate toward the full table.
3 — Row classification. Each row pair is labeled matched-update, matched-delete, or not-matched-insert based on the MERGE clauses.
4 — File rewrite. In the default copy-on-write path, Delta does not update rows in place. It rewrites every Parquet file that contains at least one matched row. One modified row in 10,000 files means 10,000 files rewritten. (Deletion vectors, covered below, change this trade-off.)
5 — Commit. An atomic commit to the transaction log (_delta_log): the new files are registered, the old files are marked as removed, and the entire operation becomes visible to readers in a single update. Until the commit succeeds, no reader sees partial results.
      Source                  Target table
      (354 MB)                (44 GB, 58,400 files)
         │                        │
         ▼                        ▼
┌──────────────────────────────────┐
│ 1. Candidate selection │ partition pruning, data skipping
│ (which files might match?) │ → reduces scan scope
├──────────────────────────────────┤
│ 2. Scan + join │ source ⋈ target on merge key
│ (correlate source ↔ target) │ → affected data located
├──────────────────────────────────┤
│ 3. Row classification │ each pair → update / delete / insert
│ (evaluate WHEN clauses) │
├──────────────────────────────────┤
│ 4. File rewrite │ every file with ≥1 match → rewritten
│ (copy-on-write) │ 1 row changed = entire file rewritten
├──────────────────────────────────┤
│ 5. Atomic commit │ _delta_log: old files removed,
│ (transaction log) │ new files added, single atomic update
└──────────────────────────────────┘
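The classification step (phase 3) is simple enough to sketch in plain Python. This is a toy model of the WHEN-clause logic, not Spark's implementation — the names and structure here are illustrative only:

```python
# Toy sketch of phase 3: label each source row the way a MERGE's
# WHEN clauses would. Not Spark code -- just the decision logic.

def classify(source_rows, target_keys):
    """Label each source row as matched-update or not-matched-insert.

    source_rows: dict of merge_key -> new values
    target_keys: set of merge keys present in the target
    """
    actions = {}
    for key in source_rows:
        if key in target_keys:
            actions[key] = "update"   # WHEN MATCHED THEN UPDATE
        else:
            actions[key] = "insert"   # WHEN NOT MATCHED THEN INSERT
    return actions

target = {"k1", "k2", "k3"}
source = {"k2": {"v": 10}, "k4": {"v": 20}}
print(classify(source, target))  # {'k2': 'update', 'k4': 'insert'}
```

The expensive part is not this decision — it is locating the row pairs (phase 2) and rewriting every file a matched row lives in (phase 4).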
Of these five phases, two drive the bulk of the cost: the scan (phase 2) and the rewrite (phase 4). Both are forms of amplification — the system does more total work than the update strictly requires — and they are independent: one is how much you read, the other is how much you write.2
The practical corollary: if your source is small but your MERGE is slow, the update is not your problem. The table is.
Everything described above assumes the default copy-on-write path. Deletion vectors offer an alternative write model that trades immediate rewrite cost for deferred read and maintenance cost.3
Concurrent MERGE and conflict resolution. The five phases above describe a single MERGE in isolation. In production, multiple operations often run against the same table simultaneously — two pipelines writing to the same table, or a MERGE running alongside a scheduled OPTIMIZE. Delta Lake handles this through optimistic concurrency control: each operation reads a snapshot of the transaction log, does its work, and attempts to commit. At commit time, Delta checks whether any other operation committed first and modified overlapping files. If the file sets are disjoint — for example, two MERGEs that touch different partitions — both commits succeed. If they overlap, the later commit fails with a ConcurrentModificationException. The failed operation must be retried from scratch; Delta does not automatically retry the MERGE at the transaction layer. This has direct implications for the strategies below: partition pruning does not just reduce scan cost, it also narrows the file set each MERGE touches, reducing the conflict surface between concurrent writers. Conversely, a full-table MERGE with no pruning conflicts with almost any concurrent write. Chunked MERGE amplifies this risk — N smaller commits means N windows where a concurrent writer can create a conflict.
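The commit-time check reduces to a set-overlap test on touched files. A minimal sketch of that idea (a simplification — Delta's real conflict detection distinguishes several exception types and also considers read predicates):

```python
# Toy model of optimistic concurrency: a later commit conflicts if its
# file set overlaps one committed since its snapshot was taken.

def conflicts(committed_files: set, attempted_files: set) -> bool:
    """A commit fails if it touches files another commit already replaced."""
    return not committed_files.isdisjoint(attempted_files)

# Two MERGEs touching different partitions: disjoint file sets, both succeed.
merge_a = {"date=2026-01-01/part-0.parquet"}
merge_b = {"date=2026-01-02/part-0.parquet"}
assert not conflicts(merge_a, merge_b)

# A MERGE racing an OPTIMIZE that compacted the same files: the later
# commit fails and must retry from a fresh snapshot.
optimize = {"date=2026-01-01/part-0.parquet", "date=2026-01-01/part-1.parquet"}
assert conflicts(merge_a, optimize)
```

This is why narrowing the touched file set pays twice: less work per MERGE, and a smaller conflict surface between writers.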
The Numbers
To make the amplification concrete, the benchmark below runs the same MERGE against the same table with progressively more optimizations — partition pruning, then Z-ordering, then compaction — so the effect of each on scan and rewrite is isolated. The last two rows show chunked MERGE, an operational pattern that does not reduce amplification but is included for completeness.4
The two cost drivers are directly observable as file counts: files scanned and files rewritten. Since the benchmark runs on NVMe (3+ GB/s sequential), the wall-clock times are compressed — the same file-count reductions would translate to much larger time savings on cloud storage, where per-file latency dominates. Read the file-count columns first; the durations are secondary.
Measured on a single-node Spark 4.1 standalone cluster with Delta Lake 4.1 OSS. The setup:
- Hardware: AMD Ryzen 9 7950X3D (16 cores / 32 threads), 125 GB RAM, 1.8 TB NVMe
- Runtime: Spark 4.1.1 local[*], 90 GB driver memory, Delta Lake 4.1.0 OSS, Java 17
- Target: 44 GB, 58,400 Parquet files (~750 KB avg, fragmented from 5 batch writes simulating repeated ingestion), partitioned by date (365 partitions, ~160 files per partition)
- Source: 354 MB, 200k rows — 92% concentrated on 30 dates (120k updates sampled from hot partitions + 80k inserts, 80% on the same 30 dates)
- Merge key: id (UUID, high-cardinality, uniformly distributed within each partition) — does not coincide with partition key date
- Match rate: 60% updates (120k matches), 40% inserts (80k new rows)
Full setup, scripts, and raw results: delta-merge-bench.
Repro notes:
- Data generation is seeded (SEED=42) for deterministic row distribution across partitions
- maxRecordsPerFile=1000 controls the initial file fragmentation (~58,400 files)
- In this benchmark, file-level stats were present in _delta_log (add.stats JSON) for the relevant columns; since stats are optional in the protocol, data skipping only helps when those stats are actually recorded
Scenario Files scanned Files rewritten Scan/Rewrite Rewrite fraction Scan time Rewrite time Total
────────────────────────────── ─────────────── ───────────────── ────────────── ────────────────── ────────── ───────────── ──────
Baseline (no optimization) 58,400 3,092 19× 1× (ref) 16.0 s 8.7 s 25.0 s
+ Partition pruning ~4,800 3,092 1.5× 1× 13.0 s 7.8 s 20.9 s
+ Z-ordering on merge key ~5,300 441 1.7× 0.14× 5.4 s 11.2 s 16.6 s
+ Pre-compaction (OPTIMIZE) ~5,200 427 1.7× 0.14× 5.4 s 11.7 s 17.1 s
Chunk 1 (hot dates) ~4,300 427 1.4× 0.14× 2.8 s 10.9 s 13.8 s
Chunks 2–7 (cold, insert-only) ~150 / chunk 0 / chunk — — 2.9 s avg 0.3 s avg ~3.4 s each
The scenarios are cumulative. Files scanned rises slightly from ~4,800 to ~5,300 after Z-ordering — the scan is already constrained by the partition predicate, so the slight increase is just file-boundary noise after clustering.5
Scan amplification Rewrite amplification
(how much you read) (how much you write)
┌─────────────────────────┐ ┌─────────────────────────┐
│ Candidate files │ scanned │ File A: 1 row changed │ → rewrite entire file
│ ░░░░░░░░░░░░░░░░░░░░░░░ │ │ File B: 3 rows changed │ → rewrite entire file
│ ░░░░░░░░░░░░░░░░░░░░░░░ │ │ File C: 0 rows changed │ → untouched
│ ████ files with matches │ │ File D: 1 row changed │ → rewrite entire file
│ ░░░░░░░░░░░░░░░░░░░░░░░ │ │ ... │
│ ░░░░░░░░░░░░░░░░░░░░░░░ │ │ every matched file │
└─────────────────────────┘ │ rewritten in full │
you read far more └─────────────────────────┘
than you modify 1 changed row = 1 file rewritten
Mitigation Strategies
The benchmark above separates the two costs; the sections below group the mitigation strategies by the axis they target: first reducing the scan, then reducing the rewrite.6
Reducing the Scan: Pruning and Data Skipping
Three mechanisms reduce the number of files MERGE reads. They differ in reliability and in what they require.
Static partition pruning is deterministic. If the merge condition constrains the partition column — whether via a literal predicate like t.date IN ('2026-01-01', '2026-01-02') or a join condition like t.date = s.date from which the optimizer can extract values at plan time — Spark resolves the matching partitions before execution and never reads the others. The key distinction is plan-time resolution: if the optimizer cannot derive the partition values statically, pruning defers to DPP at runtime. In the benchmark, the baseline scans all 58,400 files because t.id = s.id does not constrain the partition column. Adding a partition constraint reduced the scan to ~4,800 files — a 12x improvement — without changing the rewrite count.
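The arithmetic behind plan-time pruning is mechanical: resolve the partition values, keep only those directories. A sketch mirroring the benchmark's layout (the file names and layout here are illustrative, not the benchmark's actual files):

```python
# Sketch of static partition pruning: with partition values known at plan
# time, the candidate file set shrinks before any scan runs.

from datetime import date, timedelta

# 365 partitions x ~160 files, mirroring the benchmark's target layout
all_files = {
    (date(2026, 1, 1) + timedelta(days=d)).isoformat(): [f"part-{i}" for i in range(160)]
    for d in range(365)
}

def prune(files_by_partition, partition_values):
    """Keep only the files in statically resolved partitions."""
    return {p: f for p, f in files_by_partition.items() if p in partition_values}

# The source's 30 hot dates, resolved at plan time from the predicate
hot_dates = {(date(2026, 1, 1) + timedelta(days=d)).isoformat() for d in range(30)}
candidates = prune(all_files, hot_dates)
total = sum(len(f) for f in candidates.values())
print(total)  # 4800 -- 30 partitions x 160 files, matching the ~4,800 in the table
```

Note that nothing about the data itself changed: the same 3,092 files still get rewritten. Pruning only narrows what is read.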
Dynamic partition pruning (DPP) is opportunistic. Spark derives runtime filters from the build side of the join and pushes them into the target scan. Whether this actually prunes partitions depends on the plan shape, source size, and the optimizer’s cost model. DPP helps when it fires; you cannot rely on it firing — in practice, many MERGE plans do not activate DPP at all. DPP is most reliable when the source side can be broadcast, although Spark can sometimes derive runtime filters from other join plans. In the benchmark, DPP is enabled by default but its effect is entirely subsumed by the static predicate on date — the explicit constraint already selects exactly the relevant partitions, leaving nothing for DPP to prune further.
Data skipping is a fallback, not a strategy. Even when partition pruning is unavailable, Delta can skip individual files using min/max statistics stored in the transaction log — minimum, maximum, and null count for up to N columns (controlled by the table property delta.dataSkippingNumIndexedCols). If the merge key is among the columns with recorded statistics, Delta eliminates files whose value range does not overlap with the source. If statistics are not collected for the merge key, data skipping is blind. In the benchmark, stats are present for id, but because id is a high-cardinality UUID uniformly distributed within each partition, the min/max ranges overlap across nearly all files — a worst case for data skipping. On keys with natural locality (timestamps, monotonic IDs), the effect can be much stronger, especially after Z-ordering — not measured here. The confirmation is in the numbers: ~4,800 files after pruning ≈ 30 partitions × 160 files per partition — exactly what static pruning alone would produce. If data skipping were effective, the count would be lower.
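The overlap test at the heart of data skipping is a one-liner, which also makes its failure mode on uniform keys easy to see. A sketch (illustrative ranges, not benchmark data):

```python
# Sketch of file-level data skipping: a file can be skipped when its
# recorded [min, max] range misses the source's key range entirely.

def can_skip(file_min, file_max, source_min, source_max):
    """True iff the file's value range does not overlap the source range."""
    return file_max < source_min or file_min > source_max

# Key with natural locality (e.g. timestamp): tight, disjoint per-file
# ranges, so most files miss the source range and are skipped.
assert can_skip(file_min=100, file_max=199, source_min=300, source_max=320)

# UUID-like key: every file's min/max spans nearly the whole key space,
# so every range overlaps the source and nothing is skipped.
assert not can_skip(file_min=0, file_max=10**9, source_min=300, source_max=320)
```

The benchmark's id column is the second case: stats exist, the mechanism runs, and it eliminates almost nothing.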
These three mechanisms compound but are not interchangeable. Partition pruning operates on the directory structure and is always reliable. DPP depends on runtime conditions. Data skipping helps on top of pruning but cannot replace it. The practical consequence: always include the partition column in the merge condition, even if it is redundant with the merge key.
Reducing the Rewrite: Z-Ordering and Compaction
Two mechanisms reduce the rewrite: Delta’s OPTIMIZE command and Z-ordering.
Delta’s OPTIMIZE command merges small files into larger ones. On the benchmark table, compaction with an 8 MB target file size reduced the file count from 58,400 to ~5,200. (The config is spark.databricks.delta.optimize.maxFileSize — the databricks prefix is historical and applies to OSS Delta as well.) The 8 MB target was chosen for this benchmark specifically: a 1 GB target would drastically reduce within-partition file granularity in this setup, limiting the selectivity that Z-ordering needs to be effective. In production you typically target larger files; partitioning and clustering are how you regain selectivity.
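The post-compaction file count follows directly from the size target — a back-of-the-envelope check, not part of the benchmark:

```python
# Upper bound on compacted file count: table size / target file size.
# Real output is somewhat lower because files cannot span partitions
# and the last file per partition is usually underfull.
table_bytes = 44 * 1024**3          # 44 GB target table
target_file = 8 * 1024**2           # 8 MB OPTIMIZE target
print(table_bytes // target_file)   # 5632 -- in line with the ~5,200 files observed
```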
Z-ordering attacks the problem from the opposite direction. Instead of reading fewer files, it arranges the data so that matching rows concentrate into fewer files in the first place — using a space-filling curve to co-locate rows with similar values. When applied to the merge key, matching rows that were previously scattered across thousands of files concentrate into a fraction of the original set:
Before Z-ordering (id scattered across files):
File 1: id=A id=M id=Z id=B id=Q ← match on id=A → rewrite File 1
File 2: id=C id=A id=R id=F id=X ← match on id=A → rewrite File 2
File 3: id=A id=D id=K id=S id=N ← match on id=A → rewrite File 3
File 4: id=G id=H id=P id=T id=V ← no match → untouched
...
3,092 files contain at least one matching row
After Z-ordering on id:
File 1: id=A id=A id=A id=B id=B ← match on id=A → rewrite File 1
File 2: id=C id=D id=F id=G id=H ← no match → untouched
File 3: id=K id=M id=N id=P id=Q ← no match → untouched
File 4: id=R id=S id=T id=V id=X ← no match → untouched
...
441 files contain at least one matching row (7× fewer)
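The "space-filling curve" doing the co-location is, in its simplest form, bit interleaving (a Morton code). A minimal sketch of the construction — the real OPTIMIZE ZORDER BY works on range-partitioned column values rather than raw bits, but the locality argument is the same:

```python
# Morton-code sketch: interleave the bits of two columns so rows close in
# either dimension sort near each other, and therefore land in the same files.

def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into a single sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions from y
    return z

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered)
# [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]
# Neighbours in (x, y) space stay neighbours in the sort order.
```

Writing files in this order is what concentrates matching rows into few files, which is exactly the 3,092 → 441 reduction above.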
In the benchmark, Z-ordering reduced files rewritten from 3,092 to 441 — a 7x improvement. The Z-ordering pass cost 135 seconds. Combined with OPTIMIZE, the total investment was 258 seconds (4.3 minutes) and reduced subsequent MERGE rewrite from 3,092 to 427 files. A counterintuitive detail: rewrite time increases from 7.8s to 11.2s despite rewriting 7x fewer files. After Z-ordering, each rewritten file is much larger, so each rewrite task processes more bytes; with fewer tasks, parallelism also drops. On cloud storage, where per-file open latency dominates over per-task byte cost, the 7x file reduction translates directly to less total latency. The key result from the benchmark: partition pruning and Z-ordering are orthogonal — the first reduces reads, the second reduces writes.
The investment is not permanent: with ingestion every 10 minutes, small files accumulate within hours and fragmentation returns. OPTIMIZE must therefore run on a schedule, typically as a trailing job after the last ingestion window. The alternative is liquid clustering (OSS Delta Lake 3.1+, Databricks Runtime 13.3+), which replaces both partitioning and Z-ordering for clustered tables and reorganizes data incrementally: when you run OPTIMIZE, only newly written files are re-clustered, not the entire table. On Databricks Runtime with auto-optimize enabled, clustering can also trigger on write above a size threshold.
Decision Framework
The strategies above target different amplification axes. In practice, apply them in order — each layer assumes the previous ones are in place.
Category              Solution                When to use                    Cost
───────────────────── ─────────────────────── ────────────────────────────── ────────────────────────
Table design          Align partition key     New table or schema            Schema migration
                      with merge key          change possible
Baseline hygiene      Partition pruning       Always — prerequisite          Zero (one-line change)
                      Source dedup[^dedup]    Dirty data, duplicate keys     Pre-processing step
                      Data skipping tuning    Key lacks file-level stats     Config change
Layout optimization   Z-order on merge key    Join key ≠ partition key       135 s / 44 GB (measured)
                      Pre-MERGE OPTIMIZE      Frequent ingestion             258 s / 44 GB (measured)
If you control the table schema, aligning the partition key with the merge key can drastically reduce scan scope and improve rewrite locality — but it is only viable when the merge key is a sensible partitioning dimension (low-to-moderate cardinality, meaningful for query patterns). The strategies below assume this alignment is not possible.
When the source is too large for a single MERGE to be stable, an operational alternative is to chunk it into smaller batches.4 When rewrite cost dominates and the runtime supports it, deletion vectors offer a different trade-off.3
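How a source gets chunked matters more than how many chunks there are. A sketch of the boundary logic — heavy partitions isolated, the long tail batched — with made-up dates and sizes:

```python
# Sketch of chunk-boundary selection for a chunked MERGE: give each heavy
# partition its own commit, batch the long tail into few commits.

from itertools import islice

def chunk_source(rows_per_date: dict, hot_threshold: int, cold_batch: int):
    """Yield lists of dates: one chunk per hot date, cold dates batched."""
    hot = [d for d, n in rows_per_date.items() if n >= hot_threshold]
    cold = [d for d in rows_per_date if d not in hot]
    for d in hot:
        yield [d]                            # heavy partition: its own commit
    it = iter(cold)
    while batch := list(islice(it, cold_batch)):
        yield batch                          # long tail: batched commits

rows = {"2026-01-01": 50_000, "2026-01-02": 40_000,
        "2026-03-05": 200, "2026-03-06": 150, "2026-03-07": 90}
print(list(chunk_source(rows, hot_threshold=10_000, cold_batch=2)))
# [['2026-01-01'], ['2026-01-02'], ['2026-03-05', '2026-03-06'], ['2026-03-07']]
```

Each yielded chunk would drive one MERGE with a partition predicate on its dates — keeping every commit's file set small and its conflict window short.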
If you cannot prune, MERGE stays scan-heavy. If you cannot cluster, it stays rewrite-heavy. Without either, you are not tuning a pipeline — you are absorbing the full cost of a bad layout, one MERGE at a time.
1. The actual implementation compresses some of these into fewer physical steps — candidate selection, scan, and row classification happen within a single join operation. The five-phase model is a logical decomposition that helps reason about where cost accumulates. ↩︎
2. UPDATE and DELETE on Delta suffer from the same scan and rewrite amplification. MERGE makes them more visible because it adds a join with its shuffle cost. ↩︎
3. Deletion Vectors. When enabled (write support in OSS Delta Lake since 2.4, with MERGE support since 3.1), Delta can avoid immediate copy-on-write rewrites by recording row-level deletions in sidecar metadata and writing only new or updated rows as separate files. This reduces immediate rewrite amplification, but scan amplification is unchanged — the touched files are still read in full to identify matches and generate the deletion vectors. Subsequent reads must reconstruct the current state by applying deletion vectors, and the layout becomes less compact over time. Periodic OPTIMIZE is needed to materialize a cleaner layout. The amplification does not disappear — it shifts from eager file rewrite to deferred cleanup and read-time resolution. Note that row-level concurrency — where two MERGEs can modify different rows in the same file without conflict — requires row tracking and is a Databricks Runtime feature (DBR 14.2+), not available in OSS Delta. ↩︎ ↩︎
4. Chunked MERGE. Break the source into temporal or key-based chunks and execute N smaller MERGEs. Each chunk means a smaller working set, fine-grained retry, and less shuffle. The trade-off is transactional: N commits instead of one, more conflict opportunities with concurrent writers, and intermediate table states between chunks. Chunk boundaries should split the heavy partitions — not follow a fixed calendar grid. Follow with OPTIMIZE to compact the small files produced by each commit. ↩︎ ↩︎
5. Scan/Rewrite = files scanned divided by the baseline rewrite count (3,092) — a ratio of read work to the baseline write work, not an amplification factor in the strict sense. Rewrite fraction = files rewritten / baseline rewritten — the remaining share of rewrite work after optimization (lower is better). File counts are used as a portable proxy; in bytes the amplification depends on file-size distribution, which changes after Z-ordering and compaction. ↩︎
6. This benchmark runs on NVMe with 3+ GB/s sequential throughput. On cloud storage, where file-open latency dominates, the 12× scan reduction and 7× rewrite reduction translate to proportionally larger wall-clock improvements. The file counts — not the durations — are the portable metric. ↩︎