delta-explain: Making Delta Lake Pruning Visible

Updated July 4, 2026

The problem

For scan-heavy analytical workloads, the dominant cost lever is whether files can be eliminated from the scan set before they are opened. This depends on how the layout interacts with the predicate. Partition pruning and data skipping can eliminate unnecessary file reads before any engine touches the data, using metadata from the transaction log. In library terms: partition pruning tells you which branch of the library to visit; data skipping tells you which shelves cannot possibly contain the book.

Three things break this: partitions that do not match the predicate, statistics that are absent, and data mixed enough that min/max ranges cannot rule anything out. None of them throws an error. There is no warning. The query runs, returns the right answer, and you quietly scan more and pay more.

With small datasets the problem is easy to ignore. Volumes grow, and the same misalignment that costs nothing on 20 files compounds at 20,000. At that point you need to understand what’s happening, and the options are limited. You can investigate by reading the execution plan or the engine’s profile, but that path is both engine-specific and execution-bound: it requires running the query against a particular engine. There is no widely available, engine-independent tool that attributes pruning from Delta log metadata alone.

I wrote delta-explain to fill this gap.

What delta-explain is for

delta-explain is a metadata-only diagnostic for Delta Lake file pruning. Given a table snapshot and a WHERE predicate, it shows which files can be eliminated by partition pruning and data skipping before any query engine reads the data.

Use it when you need to make pruning behavior visible, measurable, or enforceable.

Before a layout change. If you are about to repartition a large table, change its clustering, or rewrite it into a staging copy, run delta-explain with the production predicates against both layouts. The report shows, in file counts, whether the new layout actually improves pruning before the migration touches production.

After a rewrite or suspected regression. If a dashboard got slower after last week’s table rewrite, --at-version lets you run the same analysis against an earlier version of the table:

delta-explain s3://warehouse/events -w "date = '2026-07-01'" --at-version 41   # before the rewrite
delta-explain s3://warehouse/events -w "date = '2026-07-01'"                   # after

Two reports, same predicate, one variable: the layout. A suspicion becomes a measured difference.

When a predicate does not prune. delta-explain splits the predicate into partition-safe, stats-safe, and unsplittable fragments, then labels the run as exact, conservative, or incomplete. Supported fragments are lowered into the pruning analysis; unsupported constructs degrade with diagnostics instead of failing the run. With --verbose, the report lists every kept and dropped file, including partition values and min/max statistics, so you can see which files survived and why.

When writer behavior changes. A pipeline upgrade or configuration change can silently stop emitting useful per-file statistics. Nothing fails, but future data skipping degrades. A post-ETL --assert-stats run turns that silent regression into an exit code the same day it happens.

When pruning is part of the contract. For known tables and known predicates, --min-pruning turns an expectation like “this query should never scan more than a tenth of the table” into a CI gate: an exit code for shell pipelines, or a machine-readable JSON report, formally specified and versioned, whose assertions carry stable names a calling system can branch on.

When you inherit a table you do not fully know. Features like deletion vectors, column mapping, and liquid clustering change what the numbers mean. delta-explain detects them and declares them in a table_features block with warnings, rather than reporting figures that assume every table has the same semantics.

Getting started

There are three ways to use delta-explain: from the shell, from Python, and as a gate in CI. All three run the same binary and produce the same report.

From the shell

delta-explain is a single binary. Rust users get it straight from crates.io with cargo install delta-explain; everyone else picks a channel: Homebrew, Scoop, a Debian package, prebuilt binaries on GitHub Releases, or the Docker image on GHCR. The README collects the install commands.

One required argument (the table, a local path or an object-store URI) and a predicate:

delta-explain ./users -w "country = 'DE' AND age > 40"

The report it prints is the subject of the worked examples below.

From Python

pip install delta-explain ships the same CLI binary inside the wheel and adds a thin wrapper that runs it and returns the JSON report as a Python object:

import delta_explain

report = delta_explain.explain("./users", where="country = 'DE' AND age > 40",
                               min_pruning=60, assert_stats=True)
report.total_files                # 6
report.final_files                # 1
report.total_pruning_pct          # 83.33...
report.passed                     # True: both gates passed
report["analysis"]["confidence"]  # "conservative"

The table and predicate are the same as in the worked example below, which also explains where that 83% comes from. The wrapper is deliberately not a second API: it shells out to the same binary and hands back the same versioned JSON document, so everything the next subsection says about assertions applies unchanged. The gates travel as keyword arguments, report.passed collapses them into one line of a pytest suite or an Airflow task, and the object is also dict-like: every field of the document (assertions, analysis, statistics coverage, table features) is one subscript away.

In CI

The same analysis can run as a contract, and nothing has to live in your repository: delta-explain is on the GitHub Marketplace as a composite action that downloads a pinned release binary, so the gate is one step, with pruning-pct, final-files, and result exposed as outputs for later steps:

- uses: cdelmonte-zg/[email protected]
  with:
    table: s3://warehouse/events
    where: "date = '2026-07-01'"
    min-pruning: "90"
    assert-stats: "true"
    env-creds: "true"

On any other CI system, install the binary in the job (pip install delta-explain is the quickest path, a prebuilt wheel) or run the Docker image, and use the flags directly.

--min-pruning enforces a floor on the total pruning percentage:

delta-explain s3://warehouse/events \
    --env-creds \
    -w "date = '${PARTITION_DATE}'" \
    --min-pruning 90

--env-creds reads cloud credentials from the standard environment variables (AWS_*, AZURE_*, GOOGLE_*). The command exits 1 if total pruning is below 90%, with stderr:

ASSERTION FAILED: total pruning 33.3% is below threshold 90.0%

--assert-stats is the second gate. It fails (exit code 1) if any file in the snapshot is missing per-file statistics in the transaction log. The intended use is a “writer broken” alarm: if your ETL stops emitting stats, future pruning will degrade, and you want to catch this before it reaches production. Long-lived tables do not produce false alarms: statistics are resolved through the kernel’s checkpoint-aware log replay, so a file is flagged only when its add action genuinely carries none (the companion piece explains the mechanism).

--format json produces structured output for downstream parsing. Flags can be combined:

delta-explain s3://warehouse/events \
    --env-creds \
    -w "date = '${PARTITION_DATE}'" \
    --min-pruning 90 \
    --assert-stats \
    --format json

Choosing the threshold, and what the JSON document guarantees to whatever consumes it, have their own section after the worked examples.

delta-explain in practice

An example

Consider a Delta table partitioned by country, with 6 files distributed across 3 partitions (DE, IT, US). Each file has min/max statistics on the age column. The table is intentionally tiny, so every surviving file can be inspected on screen. The percentages that follow are not a benchmark; they expose the mechanism that decides whether a production scan touches tens of files or thousands.

We want to understand what would happen if an engine ran a query with:

WHERE country = 'DE' AND age > 40

Specifically: how many files remain in the candidate scan set, and how much pruning occurred. Without running a query through an engine, delta-explain can answer this:

delta-explain ./users -w "country = 'DE' AND age > 40"

The output:

Delta table: ./users
Version:     5
Predicate:   country = 'DE' AND age > 40

Predicate Analysis:
  partition-safe: country = 'DE'
  stats-safe:     age > 40
  unsplittable:   -
  confidence:     conservative

Files in snapshot: 6

Phase 1: Partition pruning [exact]
  predicate:       country = 'DE'
  files remaining: 2  (-4, 67% pruned)

Phase 2: Data skipping (min/max statistics) [conservative]
  predicate:       age > 40
  files remaining: 1  (-1, 50% pruned)

Total reduction: 6 -> 1 files (83% pruned)

The Predicate Analysis block names the split: country = 'DE' is assigned to partition pruning, while age > 40 is assigned to data skipping. The run is reported as conservative overall because Phase 2 uses file-level statistics; partition pruning alone would have been exact.

The phases are attribution measurements, not a query plan. delta-explain is not saying that Spark, Trino, or any other engine must execute the scan in this order. It is asking a narrower metadata question: over this Delta snapshot, how many files can be eliminated by partition values, and how many more can be eliminated by per-file statistics?

Partition pruning is exact in this model: a file’s partition values either satisfy country = 'DE' or they do not. Data skipping is conservative: min/max statistics can prove that a file cannot match, but they cannot prove that every surviving file actually contains a matching row. A file is kept whenever its range still overlaps the predicate.

That is the main safety rule behind the report: delta-explain may keep too many files, but it must not drop a file that could contain a matching row.

Six files reduce to one, 83% pruned before any Parquet data file is opened.

With that model in mind, --verbose shows the same attribution file by file:

delta-explain ./users -w "country = 'DE' AND age > 40" --verbose

Phase 1: Partition pruning [exact]
  predicate:       country = 'DE'
  files remaining: 2  (-4, 67% pruned)

  [DROPPED] part-00000-48368dae.parquet  (1.1 KB  3 records)  partition(country=IT)  stats(age: 41..65)
  [DROPPED] part-00000-fcf95aac.parquet  (1.1 KB  5 records)  partition(country=IT)  stats(age: 22..38)
  [DROPPED] part-00000-eee5a3ec.parquet  (1.1 KB  3 records)  partition(country=US)  stats(age: 31..55)
  [DROPPED] part-00000-de2ffaef.parquet  (1.1 KB  4 records)  partition(country=US)  stats(age: 18..29)
  [KEPT   ] part-00000-a35083c1.parquet  (1.1 KB  4 records)  partition(country=DE)  stats(age: 40..60)
  [KEPT   ] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

Phase 2: Data skipping (min/max statistics) [conservative]
  predicate:       age > 40
  files remaining: 1  (-1, 50% pruned)

  [KEPT   ] part-00000-a35083c1.parquet  (1.1 KB  4 records)  partition(country=DE)  stats(age: 40..60)
  [DROPPED] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

(In the verbose excerpts throughout this article, file names and per-file statistics are abbreviated for legibility; the real output shows full file names and the statistics of every indexed column.)

The verbose output makes the conservative property concrete. In Phase 2, the file with age: 20..35 is eliminated because its maximum (35) doesn’t satisfy age > 40. No row in that file can match. The remaining file has age: 40..60: its maximum (60) exceeds 40, so the file cannot be ruled out. The lower bound plays no role in the exclusion test here; the maximum alone is enough to show the file may contain values greater than 40. Data skipping is conservative: it keeps any file whose range overlaps the predicate, even if some or all rows ultimately fall below the threshold.

The interactive diagram below steps through the same attribution model.

Comparative scans on ./users

Full scan · done

Files (6 total, partitioned by country)

DE file 1

age: 40..60

DE file 2

age: 20..35

IT file 3

age: 41..65

IT file 4

age: 22..38

US file 5

age: 31..55

US file 6

age: 18..29

Predicate Analysis

partition-safecountry = 'DE'

stats-safeage > 40

unsplittable-

confidenceconservative

Predicate applied

country = 'DE' AND age > 40

Files in snapshot6

Phase 1 (partition pruning)6 → 2 (-4, 67% pruned)

Phase 2 (data skipping)2 → 1 (-1, 50% pruned)

Surviving1 / 6

Step 3 / 3

A counter-example

The same data, the same predicate. Only the layout changes: files now contain records from multiple countries mixed together, with no partitioning at all.

delta-explain ./users-flat -w "country = 'DE' AND age > 40"

Delta table: ./users-flat
Version:     0
Predicate:   country = 'DE' AND age > 40

Predicate Analysis:
  partition-safe: -
  stats-safe:     country = 'DE' AND age > 40
  unsplittable:   -
  confidence:     conservative

Files in snapshot: 6

Phase 1: Data skipping (min/max statistics) [conservative]
  predicate:       country = 'DE' AND age > 40
  files remaining: 4  (-2, 33% pruned)

There is no partition pruning: country is not a partition column. The entire predicate falls into stats-safe; where the relevant statistics are present, as they are here, it is evaluated by data skipping alone, which decides through file-level statistics. For equality and range predicates like these, that mainly means min/max.

On string columns, min/max is still well-defined: lexicographic order is compatible with equality and range predicates. The issue is physical interleaving: when values are mixed across files, every file’s country range spans most of the alphabet, so country = 'DE' cannot be excluded from any of them. Per-file min/max collapses the mixed values into a wide range, and that range is too broad to exclude the file. In library terms: when every shelf label spans most of the alphabet, no shelf can be ruled out.

The detail with --verbose:

delta-explain ./users-flat -w "country = 'DE' AND age > 40" --verbose

Phase 1: Data skipping (min/max statistics) [conservative]
  predicate:       country = 'DE' AND age > 40
  files remaining: 4  (-2, 33% pruned)

  [DROPPED] part-00001.snappy.parquet  (1.1 KB  4 records)  stats(age: 18..29, country: AL..US)
  [DROPPED] part-00002.snappy.parquet  (1.1 KB  5 records)  stats(age: 20..35, country: AT..US)
  [KEPT   ] part-00003.snappy.parquet  (1.1 KB  4 records)  stats(age: 30..55, country: DE..US)
  [KEPT   ] part-00004.snappy.parquet  (1.1 KB  3 records)  stats(age: 22..50, country: AT..IT)
  [KEPT   ] part-00005.snappy.parquet  (1.1 KB  5 records)  stats(age: 40..65, country: CH..US)
  [KEPT   ] part-00006.snappy.parquet  (1.1 KB  3 records)  stats(age: 25..45, country: AT..DE)

In the verbose output, country prunes nothing: every displayed country range still includes DE under lexicographic min/max reasoning. File 6 (country: AT..DE, age: 25..45) is the cleanest illustration: even though DE happens to coincide with the upper bound of its country range, the age range satisfies age > 40 and the file stays.

The overall result: pruning drops from 83% to 33%. On 6 files the difference is negligible, but at production scale the gap between 17% and 67% retention (four times as many candidate files) becomes a metadata and scheduling problem in its own right: each surviving file may imply object-store requests, footer reads, task scheduling, and reader setup, even before the cost of decoding the data itself.

When part of the predicate cannot help

The problem section said that pruning fails silently. Here is what that failure looks like when it finally has a voice. LIKE sits outside the pruning language delta-explain vouches for, so put it in the predicate and watch what the tool does with it:

delta-explain ./users -w "country LIKE 'D%' AND age > 40"

Delta table: ./users
Version:     5
Predicate:   country LIKE 'D%' AND age > 40

Predicate Analysis:
  partition-safe: -
  stats-safe:     age > 40
  unsplittable:   country LIKE 'D%'
  confidence:     incomplete

Files in snapshot: 6

Phase 1: Data skipping (min/max statistics) [incomplete]
  predicate:       age > 40 AND country LIKE 'D%'
  files remaining: 3  (-3, 50% pruned)

Warnings!
[UNSUPPORTED_EXPRESSION]: Unsupported expression: country LIKE 'D%'; the fragment 'country LIKE 'D%'' cannot contribute to pruning and is applied conservatively (keeps all files)

The run does not fail. The supported conjunct still prunes half the table, the warning names exactly the fragment that could not contribute, and the confidence drops to incomplete so nothing downstream mistakes this for a fully attributed result. Note the honesty cuts both ways: some engines can use a prefix pattern like this one for their own pruning, so the real scan may touch even fewer files; delta-explain keeps the files and reports only what it can vouch for.

A prefix pattern is also the one unsupported construct with an obvious way in: for simple string domains, LIKE 'D%' can be rewritten as the lexicographic range country >= 'D' AND country < 'E', which sits squarely inside the pruning language. That rewrite is on the roadmap (#72); until it ships, the report stays on the conservative side you see above.

Thresholds and the JSON contract

The gates from the getting-started section leave two practical questions open: what threshold to set, and what the JSON output guarantees to whatever consumes it.

The recommended gate is not a fixed cutoff at all. Real ETL pipelines have non-uniform file counts across partitions, so a fixed threshold produces false positives when the selected partition is unusually large or when late-arriving backfills distort the baseline. Measure drift against a recent baseline instead, and when the assertion fires, rerun with --verbose --format json to identify which files lacked stats or which fragments landed in unsplittable, and --at-version to compare against the layout that used to pass.

The Python wrapper turns that advice into a few lines. Instead of hard-coding a percentage, compare today’s report against a version that was known to be healthy:

import delta_explain

TABLE = "s3://warehouse/events"
PREDICATE = "date = '2026-07-01'"

baseline = delta_explain.explain(TABLE, where=PREDICATE, at_version=41, env_creds=True)
current = delta_explain.explain(TABLE, where=PREDICATE, env_creds=True)

drift = baseline.total_pruning_pct - current.total_pruning_pct
assert drift < 10, f"pruning dropped {drift:.1f} points against version 41"

The same report supports guards that a percentage cannot express. analysis.confidence degrades to incomplete when part of the predicate could no longer be attributed (an unsupported construct, a mixed OR), which is often the first sign that a query changed shape upstream; and stats.mode moving from exact to partial catches a writer that started skipping statistics on some files before the pruning numbers visibly suffer:

if current["analysis"]["confidence"] == "incomplete":
    raise RuntimeError("part of the predicate no longer contributes to pruning")
if current["stats"]["mode"] != "exact":
    raise RuntimeError("some files are missing statistics")

All of the above assumes a baseline exists. For a brand-new gate, the layout gives you a first floor: with N equal-sized partitions and an equality predicate that selects one of them, partition pruning eliminates (N-1)/N of the files, so 365 daily partitions give 364/365 ≈ 99.7% and four region partitions give 75%. With uniformly distributed file counts this is the exact figure; real workloads deviate in both directions, a sparse selected partition pruning more, a hot one less. Set the initial threshold safely below that floor, then switch to drift once a few healthy runs have accumulated.

That leaves the second question, and it holds for any consumer of the report, not just CI: what does the JSON document guarantee? Three things.

First, the document identifies itself. It declares a schema_version (currently "0.2.0"), changes to it follow SemVer, and every report validates against a JSON Schema shipped in the repository and enforced in CI.

Second, the verdict is machine-stable. Every entry in assertions[] carries a stable name (min_pruning, stats_complete) and a result: pass | fail, mirrored by a top-level result; the stable name is what a calling system branches on.

Third, stdout is all-or-nothing. An assertion failure prints the complete report (with result: "fail") and exits 1; a runtime error prints nothing to stdout, puts the message on stderr, and exits 1; a usage error exits 2. A failed run never emits a partial document, so delta-explain ... | jq cannot parse garbage.

Closing

delta-explain reasons at the protocol layer: the Delta transaction log, its actions, partition values, and file-level statistics. That is not the whole performance story. Large files with many row groups, Parquet reader pushdown, dynamic partition pruning, runtime filters, caching, and engine-specific planning all sit beyond this layer.

But the protocol layer answers an earlier question: before any engine opens a Parquet file, how many files can the metadata already rule out?

That question is often enough to expose the main problem. If a predicate still leaves thousands of files alive, the issue is usually visible in the log: wrong partitioning, missing statistics, or a layout that does not match the query pattern. If the same predicate leaves one file out of a hundred, later row-group pruning may still help, but the largest reduction has already happened upstream.

The report is therefore an opportunity, not a physical plan. delta-explain shows the file-level pruning opportunity available from the table metadata and the supplied predicate. It does not reproduce Spark’s, Trino’s, or Databricks’ execution plan, and it does not claim to explain all of performance.

Its purpose is narrower: to make an invisible physical assumption visible, measurable, and assertable. Query engines explain what they plan to execute. delta-explain asks what the table metadata already proves before execution starts.

The exact guarantees behind that claim are defined in docs/semantics.md. The companion piece, How delta-explain Measures Pruning, explains how those numbers are produced.

The problem#

What delta-explain is for#

Getting started#

From the shell#

From Python#

In CI#

delta-explain in practice#

An example#

A counter-example#

When part of the predicate cannot help#

Thresholds and the JSON contract#

Closing#