delta-explain: Making Delta Lake Pruning Visible

Updated May 8, 2026

The problem

For scan-heavy analytical workloads, the dominant cost lever is whether files can be eliminated from the scan set before they are opened. This depends on how the layout interacts with the predicate. Partition pruning and data skipping eliminate unnecessary file reads before any engine touches them, using metadata from the transaction log. If partitioning is wrong or relevant statistics are missing, neither mechanism can do its job.

With small datasets the problem is easy to ignore. Volumes grow, and the same misalignment that costs nothing on 20 files compounds at 20,000. At that point you need to understand what’s happening, and the options are limited. You can investigate by reading the execution plan or the engine’s profile, but that path is both engine-specific and execution-bound: it requires running the query against a particular engine. There is no widely available, engine-independent tool that attributes pruning from Delta log metadata alone.

I wrote delta-explain to fill this gap. It is published on crates.io and distributed as pre-built binaries through Homebrew, Scoop, Debian .deb, GitHub Releases, and a Docker image on GHCR; the README collects the install commands for each channel.

A pruning attribution model

delta-explain does not simulate the execution of a query. It performs a different and narrower exercise: given a WHERE predicate, it asks how much of the work of eliminating files can be attributed to each of the two mechanisms Delta Lake offers. The output reports the attribution alongside the resulting file count.

The two mechanisms are partition pruning, which eliminates whole directories on the basis of partition values declared in the transaction log, and data skipping, which eliminates individual files on the basis of per-file min/max statistics carried in the same log.

Two orthogonal properties govern the analysis: pruning soundness (no file containing a matching row is ever dropped) and attribution completeness (every pruned file can be credited to a specific mechanism). The first is a property of the elimination itself; the second, of the per-mechanism accounting.

The predicate is therefore split. Each top-level conjunct (an AND fragment) is examined and assigned to one of three buckets, according to which columns it references:

partition-safe: the fragment depends only on partition columns, so it can be evaluated by partition pruning.
stats-safe: the fragment depends only on non-partition columns; coverage by file-level statistics is checked separately later.
unsplittable: a fragment whose column set, after AND-flattening, spans both axes. The canonical case is an OR joining a partition predicate with a non-partition one; a NOT over a mixed-axis AND produces the same effect.

The labels Phase 1 and Phase 2 that appear in the output are then attribution measurements, not stages of a query plan. Phase 1 reports the share of file elimination that can be credited to partition pruning; Phase 2 reports the share that can be credited to data skipping.

At this level, both mechanisms are just filters over the same file set. Applying the partition filter before the statistics filter, or the other way around, leads to the same survivor set: the phases are therefore an analytic decomposition, not an execution order.

Each phase carries a confidence label that records whether the corresponding file elimination is exact, conservative, or only partially attributable. Partition pruning is exact (sound and complete): a directory either matches a partition value or it does not, and no file is kept on uncertain grounds. Data skipping is conservative (sound but incomplete): per-file min/max statistics describe a range, and a range may overlap the predicate even when no row in the file actually satisfies it, so files survive that contribute no rows.

The whole run reports the least informative label involved: exact when only exact pruning is used, conservative when data skipping contributes, and incomplete when unsplittable fragments prevent clean attribution (the pruning itself remains sound).

delta-explain in practice

An example

Consider a Delta table partitioned by country, with 6 files distributed across 3 partitions (DE, IT, US). Each file has min/max statistics on the age column.

We want to understand what would happen if an engine ran a query with:

WHERE country = 'DE' AND age > 40

Specifically: how many files would be read, and how much pruning would occur. Without running a query through an engine, delta-explain can answer this:

delta-explain ./users -w "country = 'DE' AND age > 40"

The output:

Delta table: ./users
Version:     5
Predicate:   country = 'DE' AND age > 40

Predicate Analysis:
  partition-safe: country = 'DE'
  stats-safe:     age > 40
  unsplittable:   -
  confidence:     conservative

Files in snapshot: 6

Phase 1: Partition pruning [exact]
  predicate:       country = 'DE'
  files remaining: 2  (-4, 67% pruned)

Phase 2: Data skipping (min/max statistics) [conservative]
  predicate:       age > 40
  files remaining: 1  (-1, 50% pruned)

Total reduction: 6 -> 1 files (83% pruned)

The Predicate Analysis block names the split: country = 'DE' to Phase 1 (partition pruning), age > 40 to Phase 2 (data skipping). The run is reported as conservative overall because Phase 2 uses data skipping; partition pruning alone would have been exact.

Six files reduce to one, 83% pruned before any Parquet data file is opened.

If we want file-by-file detail, we add --verbose:

delta-explain ./users -w "country = 'DE' AND age > 40" --verbose

Phase 1: Partition pruning [exact]
  predicate:       country = 'DE'
  files remaining: 2  (-4, 67% pruned)

  [DROPPED] part-00000-48368dae.parquet  (1.1 KB  3 records)  partition(country=IT)  stats(age: 41..65)
  [DROPPED] part-00000-fcf95aac.parquet  (1.1 KB  5 records)  partition(country=IT)  stats(age: 22..38)
  [DROPPED] part-00000-eee5a3ec.parquet  (1.1 KB  3 records)  partition(country=US)  stats(age: 31..55)
  [DROPPED] part-00000-de2ffaef.parquet  (1.1 KB  4 records)  partition(country=US)  stats(age: 18..29)
  [KEPT   ] part-00000-a35083c1.parquet  (1.1 KB  4 records)  partition(country=DE)  stats(age: 40..60)
  [KEPT   ] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

Phase 2: Data skipping (min/max statistics) [conservative]
  predicate:       age > 40
  files remaining: 1  (-1, 50% pruned)

  [KEPT   ] part-00000-a35083c1.parquet  (1.1 KB  4 records)  partition(country=DE)  stats(age: 40..60)
  [DROPPED] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

The verbose output makes the conservative property concrete. In Phase 2, the file with age: 20..35 is eliminated because its maximum (35) doesn’t satisfy age > 40. No row in that file can match. The remaining file has age: 40..60: its maximum (60) exceeds 40, so the file cannot be ruled out. Data skipping is conservative: it keeps any file whose range overlaps the predicate, even if some or all rows ultimately fall below the threshold.

A counter-example

The same data, the same predicate. Only the layout changes: files now contain records from multiple countries mixed together, with no partitioning at all.

delta-explain ./users-flat -w "country = 'DE' AND age > 40"

Delta table: ./users-flat
Version:     0
Predicate:   country = 'DE' AND age > 40

Predicate Analysis:
  partition-safe: -
  stats-safe:     country = 'DE' AND age > 40
  unsplittable:   -
  confidence:     conservative

Files in snapshot: 6

Phase 1: Data skipping (min/max statistics) [conservative]
  predicate:       country = 'DE' AND age > 40
  files remaining: 4  (-2, 33% pruned)

There is no partition pruning: country is not a partition column. The entire predicate falls into stats-safe and is evaluated by data skipping alone, which decides through file-level statistics. For equality and range predicates like these, that mainly means min/max.

On string columns, min/max is still well-defined: lexicographic order is compatible with equality and range predicates. The issue is physical interleaving: when values are mixed across files, every file’s country range spans most of the alphabet, so country = 'DE' cannot be excluded from any of them. Per-file min/max collapses the mixed values into a wide range, and that range is too broad to exclude the file.

The detail with --verbose:

delta-explain ./users-flat -w "country = 'DE' AND age > 40" --verbose

Phase 1: Data skipping (min/max statistics) [conservative]
  predicate:       country = 'DE' AND age > 40
  files remaining: 4  (-2, 33% pruned)

  [DROPPED] part-00001.snappy.parquet  (1.1 KB  4 records)  stats(age: 18..29, country: AL..US)
  [DROPPED] part-00002.snappy.parquet  (1.1 KB  5 records)  stats(age: 20..35, country: AT..US)
  [KEPT   ] part-00003.snappy.parquet  (1.1 KB  4 records)  stats(age: 30..55, country: DE..US)
  [KEPT   ] part-00004.snappy.parquet  (1.1 KB  3 records)  stats(age: 22..50, country: AT..IT)
  [KEPT   ] part-00005.snappy.parquet  (1.1 KB  5 records)  stats(age: 40..65, country: CH..US)
  [KEPT   ] part-00006.snappy.parquet  (1.1 KB  3 records)  stats(age: 25..45, country: AT..DE)

In the verbose output, country prunes nothing: every range from AL..US to AT..DE includes DE lexicographically. File 6 (country: AT..DE, age: 25..45) is the cleanest illustration: even though DE happens to coincide with the upper bound of its country range, the age range satisfies age > 40 and the file stays.

The overall result: pruning drops from 83% to 33%. On 6 files the difference is negligible, but at production scale the gap between 17% and 67% retention is the small-files problem in compressed form: each surviving file may imply object-store requests, footer reads, and scheduler work, and even with parallelism this overhead can dominate wall-clock time well beyond the bytes scanned alone.

CI/CD integration

delta-explain exposes assertion flags designed for CI.

--min-pruning enforces a floor on the total pruning percentage:

delta-explain s3://warehouse/events \
    --env-creds \
    -w "date = '${PARTITION_DATE}'" \
    --min-pruning 90

--env-creds reads cloud credentials from the standard environment variables (AWS_*, AZURE_*, GOOGLE_*). The command exits 1 if total pruning is below 90%, with stderr:

ASSERTION FAILED: total pruning 33.3% is below threshold 90.0%

The threshold depends on partitioning granularity. With N equal-sized partitions and an equality predicate that selects one of them, partition pruning eliminates (N-1)/N of the files: 365 daily partitions give 364/365 ≈ 99.7%, four region partitions give 75%. With uniformly distributed file counts, this is the exact figure. Real workloads deviate in both directions: a sparse selected partition prunes more, a hot selected partition prunes less.

Real ETL pipelines have non-uniform file counts across partitions, so a fixed cutoff can produce false positives when the selected partition is unusually large or when late-arriving backfills distort the baseline. In practice, measure drift against a recent baseline rather than enforcing a fixed cutoff. When the assertion fires, rerun with --verbose --format json to identify which files lacked stats or which fragments landed in unsplittable.

--assert-stats is the second gate. It fails (exit code 1) if any file in the snapshot is missing per-file statistics in the transaction log. The intended use is a “writer broken” alarm: if your ETL stops emitting stats, future pruning will degrade, and you want to catch this before it reaches production.

On long-lived tables, --assert-stats is unreliable because some stats live only in checkpoint Parquet files (the implementation section below explains why). The right operational use is a hard gate immediately post-ETL on write targets you control; on long-lived tables, treat it as advisory (run without it, or capture the failure as a warning rather than gating the pipeline). --min-pruning is unaffected.

--format json produces structured output for downstream parsing. Flags can be combined:

delta-explain s3://warehouse/events \
    --env-creds \
    -w "date = '${PARTITION_DATE}'" \
    --min-pruning 90 \
    --assert-stats \
    --format json

The JSON document is versioned independently from the CLI binary: it declares its own schema_version (currently "0.1.0"), independent of the CLI’s tool_version. The schema is pre-1.0; additive changes bump the minor and breaking changes bump the major.

The document includes the table version, the analyzer’s classification (analysis.*, with coded notes such as UNSPLITTABLE_OR for fragments the splitter cannot cleanly separate), the categorical stats.mode (exact / partial / absent), and an assertions[] array with a stable name per entry (e.g. min_pruning or stats_complete), plus result: pass | fail, and a top-level result. The stable name is the contract a calling system can branch on.

How delta-explain measures pruning

The tool’s measurement pipeline has three stages: predicate separation, comparative metadata scans, and statistics coverage.

Predicate separation

First, the predicate is parsed. sqlparser-rs turns a user-written SQL predicate into a source AST, which is then lowered into delta-kernel’s Predicate IR. delta-kernel separates boolean structure (Predicate: AND/OR/NOT/comparisons) from value-producing terms (Expression: column references, literals, arithmetic); the splitter operates on Predicate, with Expression operands appearing inside its comparison nodes.

ScanBuilder::with_predicate accepts the top-level Predicate. The supported syntax is:

Comparisons: =, !=, <, <=, >, >=
Logical operators: AND, OR, NOT
IN lists, BETWEEN, IS [NOT] NULL, parenthesized expressions

Anything else produces an error: subqueries, LIKE, function calls (including LOWER, UPPER, CAST, date_trunc), and references to derived or generated columns. CI predicates must therefore mirror the production query exactly: a CI assertion against country = 'DE' does not validate prod’s LOWER(country) = 'de'.

The flattening splits the predicate at top-level ANDs only: it descends through nested And nodes (which already encode any parenthesization in the source) but stops at any other operator, so operators like OR inside a fragment are left in place. Each fragment can therefore be a simple comparison or a more complex sub-expression. delta-explain deliberately stops at top-level AND flattening; it does not rewrite predicates into CNF or DNF, because that would change the shape of the user-written predicate and can explode the fragment set (CNF and DNF rewriting are worst-case exponential).

Each fragment is then classified by the columns it references, into one of three buckets. If it references only partition columns, it goes to partition_safe. If it references only non-partition columns, it goes to stats_safe; this is sound even when the column lies outside the indexed-stats prefix (Delta’s delta.dataSkippingNumIndexedCols, default 32), in which case the fragment is bucketed correctly but contributes no pruning. After flattening, no fragment is rooted at AND; each fragment is either a single predicate form or a sub-expression rooted at another operator such as OR or NOT. A fragment lands in unsplittable exactly when its column set spans both axes, typically an OR combining a partition predicate with a non-partition one.

Partition columns themselves are identified from the metaData action in the Delta transaction log, the authoritative source per the Delta protocol; this works correctly even on empty tables with no files.

Each bucket maps to the pruning mechanism introduced earlier: partition_safe to partition pruning, stats_safe to data skipping. unsplittable clauses are also evaluated by data skipping, but the kernel reasons over the whole predicate rather than the isolated fragment, and the run is labeled incomplete. Take a predicate like country = 'DE' OR (age > 30 AND age < 60). Its Predicate IR has an Or at the top, with an And nested inside one of its operands:

        OR
       /   \
country='DE'  AND
             /   \
        age > 30  age < 60

The flatten only descends through top-level ANDs, so when the root is an OR it stops. The whole expression becomes a single fragment, classified as unsplittable, and the inner age > 30 AND age < 60 is invisible to the splitter even though, in isolation, it would be a clean stats-safe sub-predicate. The kernel evaluates the predicate over per-file min/max intervals rather than concrete row values: this over-approximates the set of files that contain a matching row, so it may keep files that contain no matching rows but it never drops a file that does. The pruning is therefore sound; what cannot be done cleanly is the per-phase attribution.

Comparative metadata scans

Once the classification is done, delta-explain runs up to three scans against delta-kernel-rs, all reusing the same snapshot, depending on which buckets are populated:

a baseline scan with no predicate (the snapshot file count, used as the reference for total file count);
a partition-only scan with the partition_safe fragment, when present;
a full scan with the original predicate, when at least one fragment is stats_safe or unsplittable.

Pure-partition predicates run only the baseline and the partition-only scan; the full scan is redundant (its result equals the partition-only scan). Pure-stats predicates skip the partition-only scan, which is why the counter-example earlier shows a single Phase 1 row and no Phase 2. Mixed predicates run all three.

The CLI labels the drop from baseline to partition-only as Phase 1 (work attributable to partition pruning) and the drop from partition-only to full as Phase 2 (work attributable to data skipping); these are the attribution measurements introduced earlier, computed as differences between scan results. When part of the predicate is unsplittable, the full scan reasons over the whole expression rather than over a stats-only fragment, so its delta from the partition-only scan is no longer a clean attribution to data skipping; this is what the incomplete label signals.

The interactive diagram below steps through the same attribution model.

Comparative scans on ./users

Ready

Files (6 total, partitioned by country)

Predicate Analysis

partition-safe-

stats-safe-

unsplittable-

confidence-

Predicate applied

(press Step to start)

Files in snapshot--

Phase 1 (partition pruning)--

Phase 2 (data skipping)--

Surviving--

Step 0 / 3

The kernel evaluates the predicate against partition values and the per-file min/max statistics embedded in add actions. The cost of the analysis is proportional to the number of files in the log, not the size of the data, and the tool needs only the _delta_log/ prefix to be readable: the data files themselves are never accessed.

This is where the engine-independence claim is grounded. delta-explain calls delta-kernel’s metadata scan path (which returns the resolved file list with stats already applied) rather than its data-materialization path (which would open Parquet footers, project columns, and materialize Arrow batches).¹

Statistics coverage

Direct reading of statistics from the transaction log runs alongside the kernel scans.

delta-kernel internally uses min/max statistics to filter files when it receives a predicate via ScanBuilder. The kernel exposes the survivor set and the raw parsed stats (minValues, maxValues, nullCount, numRecords are accessible via configuration on the scan builder), but it does not expose a per-file pass/fail explanation. delta-explain needs the per-file explanation to drive the verbose view, and it needs to detect missing-stats-in-log to drive --assert-stats. Both push toward reading the JSON commits directly.

The Delta transaction log is a sequence of JSON files, one per version, where each line is an action. The add actions contain a stats field, a nested JSON string with numRecords, minValues, maxValues, and nullCount per column (the example below is pretty-printed for legibility; real commits are newline-delimited, one action per line):

{
    "add": {
        "path": "country=DE/part-00000-a35083c1.parquet",
        "stats": "{\"numRecords\":4,\"minValues\":{\"age\":40,\"score\":65.5},\"maxValues\":{\"age\":60,\"score\":88.9},\"nullCount\":{\"age\":0,\"score\":0}}"
    }
}

delta-explain iterates over all JSON files in the log in order, accumulates the statistics of every add, and removes those invalidated by a remove action (indicating a deleted or overwritten file). This is straightforward JSON-only log replay: the snapshot is the fold of add and remove actions across all surviving JSON commits. The result is a path -> stats map consistent with the current snapshot for every file whose add lives in a JSON commit. On long-lived tables, some don’t.

Since delta-explain reads only JSON commit files, the statistics map can be incomplete on production tables. The reason is the physical layout of the log:

table/_delta_log/
├── 00000000000000000000.json                  (commit JSON, one per transaction)
├── 00000000000000000001.json
├── ...
├── 00000000000000000010.checkpoint.parquet    (periodic consolidation)
├── 00000000000000000011.json
└── ...

By default, every 10 commits (configurable via delta.checkpointInterval), Delta consolidates the cumulative state into a checkpoint Parquet: a compact snapshot of every action accumulated so far. After the checkpoint, older JSON commits become eligible for log cleanup, retained for delta.logRetentionDuration (default 30 days; commonly tuned per-platform). This is independent of VACUUM, which removes data files no longer referenced by the active table state after the retention window (governed by delta.deletedFileRetentionDuration, default 7 days, and invoked manually rather than scheduled). On a long-lived production table, add actions written long ago survive only inside checkpoint Parquet files; their original JSON commits have been cleaned up. delta-explain doesn’t open the checkpoint, so those files appear as [no stats] and --assert-stats reports them as missing even though the statistics are perfectly intact.

The pruning counts driving --min-pruning are unaffected by this asymmetry. They go through delta-kernel, which performs full log replay: it reads _last_checkpoint, parses the checkpoint Parquet, applies the subsequent JSON deltas, and reconstructs the complete state. The kernel sees the stats inside the checkpoint and uses them for pruning. Only the per-file display (which delta-explain assembles directly from JSON) and the --assert-stats check are limited.

In the JSON output the same coverage is surfaced as a categorical stats.mode: exact when every file has stats in the log, partial when some don’t, absent when none do. The numerical counts (files_with_stats, total_files, pct) remain alongside the mode, but downstream tooling can branch on the label without recomputing the ratio.

Beyond the JSON-vs-checkpoint asymmetry there is a second, independent source of partial and absent: Delta only collects stats for the first delta.dataSkippingNumIndexedCols columns (default 32, configurable per-table), so a predicate on column 33 or beyond will see absent stats even on a healthy writer.

This is what makes the verbose output possible:

[DROPPED] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

The partition(country=DE) comes from delta-kernel (the partition_values returned by the scan). The stats(age: 20..35) comes from the direct log reading. The two sources are independent and are joined at rendering time.

--assert-stats flags any file that appears in delta-kernel’s scan but lacks a stats field in the log (or has an empty one). This catches writers that produce valid Parquet files but no statistics in the log, silently making data skipping ineffective.

Current limitations

Literal typing is the main outstanding case. sqlparser-rs produces literals not yet resolved against the Delta schema, while delta-kernel evaluates file skipping conservatively on type mismatch (keeping the file rather than risking an unsound exclusion). delta-explain resolves the common LONG, DOUBLE, and FLOAT cases from the Delta schema, but DATE, TIMESTAMP, DECIMAL, and some narrow integer types remain unhandled. The symptom is easy to miss: the predicate looks correct and statistics exist, but files are kept because the comparison cannot be evaluated soundly from metadata. For CI assertions on those types, use a predicate with a known-prunable test case before trusting the threshold.

Closing

delta-explain reasons at the protocol layer (the Delta transaction log and its action schema).

A reasonable objection is whether the protocol-layer view is enough to reason about performance. It is not enough when files are very large with many row groups, when most surviving files contain mostly non-matching rows that would be eliminated only by row-group-level skipping inside the Parquet reader, or when the bottleneck is bytes per file rather than file count.

Yet in many practical cases, it is. If a query reads 1 file out of 100 and files are roughly uniform in size, row-group pushdown inside that file is a marginal optimization. The main gain was already achieved upstream. And the most common problems are all visible at this level: wrong partitions, missing statistics, layout misaligned with query patterns.

delta-explain does not try to explain all of performance. It stops at the metadata layer, at the point where the system decides which files are worth reading at all. That decision is where the largest gain usually happens, and delta-explain makes it visible, measurable, and assertable, before the first file is ever opened.

The exact identifier names in the kernel API are not stable across delta-kernel-rs minor versions; delta-explain’s behavior reflects the version pinned to a 0.x minor in its Cargo.toml, per Cargo’s standard SemVer rules. ↩︎

The problem#

A pruning attribution model#

delta-explain in practice#

An example#

A counter-example#

CI/CD integration#

How delta-explain measures pruning#

Predicate separation#

Comparative metadata scans#

Statistics coverage#

Current limitations#

Closing#