The problem

Performance in Delta Lake is not mainly about raw execution speed — it is about the files that never get read. Partition pruning and data skipping eliminate unnecessary file reads before any engine touches them, using metadata from the transaction log. But that only works if the data layout is designed correctly. If partitioning is wrong or relevant statistics are missing, neither mechanism can do its job.

With small datasets, the problem is easy to ignore. But volumes grow, and what was inefficient pruning over 20 files becomes a full scan over 20,000. At that point you need to understand what's happening, and the options are limited. You can investigate by examining the execution plan and the Spark UI, but that requires experience, assumes Spark, and above all requires running the query. There is no direct way to diagnose pruning from metadata alone, without involving an execution engine.

I wrote delta-explain to fill this gap. It’s available on crates.io.

delta-explain in action

An example

Consider a Delta table partitioned by country, with 6 files distributed across 3 partitions (DE, IT, US). Each file has min/max statistics on the age column.

We want to understand what would happen if an engine ran a query with:

WHERE country = 'DE' AND age > 40

How many files would actually be read? How much pruning is at play?

Without running a query through an engine, delta-explain can answer this:

delta-explain ./users -w "country = 'DE' AND age > 40"

The output:

Delta table: ./users
Version:     5
Predicate:   country = 'DE' AND age > 40

Files in snapshot: 6

Phase 1: Partition pruning
    predicate:       country = 'DE'
    files remaining: 2  (-4, 67% pruned)

Phase 2: Data skipping (min/max statistics)
    predicate:       age > 40
    files remaining: 1  (-1, 50% pruned)

Total reduction: 6 -> 1 files (83% pruned)

The predicate is automatically decomposed: country = 'DE' references a partition column, so it's evaluated in Phase 1. age is not a partition column, but it has min/max statistics in the transaction log, so age > 40 is evaluated in Phase 2.

The result: out of 6 files, only 1 remains a candidate for reading. 83% of candidate files are eliminated before any engine touches a Parquet file.

If we want file-by-file detail, we add --verbose:

delta-explain ./users -w "country = 'DE' AND age > 40" --verbose
Phase 1: Partition pruning
    predicate:       country = 'DE'
    files remaining: 2  (-4, 67% pruned)

    [DROPPED] part-00000-48368dae.parquet  (1.1 KB  3 records)  partition(country=IT)  stats(age: 41..65)
    [DROPPED] part-00000-fcf95aac.parquet  (1.1 KB  5 records)  partition(country=IT)  stats(age: 22..38)
    [DROPPED] part-00000-eee5a3ec.parquet  (1.1 KB  3 records)  partition(country=US)  stats(age: 31..55)
    [DROPPED] part-00000-de2ffaef.parquet  (1.1 KB  4 records)  partition(country=US)  stats(age: 18..29)
    [KEPT   ] part-00000-a35083c1.parquet  (1.1 KB  4 records)  partition(country=DE)  stats(age: 40..60)
    [KEPT   ] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

Phase 2: Data skipping (min/max statistics)
    predicate:       age > 40
    files remaining: 1  (-1, 50% pruned)

    [KEPT   ] part-00000-a35083c1.parquet  (1.1 KB  4 records)  partition(country=DE)  stats(age: 40..60)
    [DROPPED] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

Here you can see the exact reasoning. In Phase 1, the 4 files with country=IT and country=US are eliminated: the predicate asks for DE only, so they are not needed. 2 files remain.

In Phase 2, the file with age: 20..35 is eliminated because its maximum (35) doesn’t satisfy age > 40. The only surviving file has age: 40..60 — it potentially contains valid rows. Data skipping is conservative: the file’s minimum age is 40, which does not satisfy age > 40, so it may contain no matching rows at all — but it cannot be safely excluded.
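The conservative keep/drop decision for a single file can be sketched as a min/max test (an illustrative helper, not the tool's actual code):

```rust
/// Conservative data-skipping check for a predicate like `age > 40`:
/// a file may be dropped only when its stats prove no row can match.
fn may_contain_age_gt(_min: i32, max: i32, threshold: i32) -> bool {
    // Only max matters here: even if min fails the predicate,
    // rows near max may still satisfy it.
    max > threshold
}

fn main() {
    // stats(age: 40..60): max 60 > 40, so the file is kept,
    // even though min = 40 itself does not satisfy age > 40.
    assert!(may_contain_age_gt(40, 60, 40));
    // stats(age: 20..35): max 35 <= 40, provably no matching rows: dropped.
    assert!(!may_contain_age_gt(20, 35, 40));
}
```

The asymmetry is the point: statistics can prove the absence of matches, never their presence.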

A counter-example

Now imagine the same query on a table with the same data, but without partitioning. Files contain records from multiple countries mixed together.

delta-explain ./users-flat -w "country = 'DE' AND age > 40"
Delta table: ./users-flat
Version:     0
Predicate:   country = 'DE' AND age > 40

Files in snapshot: 6

Phase 1: Data skipping (min/max statistics)
    predicate:       country = 'DE' AND age > 40
    files remaining: 4  (-2, 33% pruned)

There is no partition pruning: country is not a partition column, so no phase eliminates files at the directory level. The entire predicate falls through to data skipping, which works only with min/max statistics. On a string column like country, with values mixed across files, the ranges are wide. A file with country: AT..US cannot exclude DE: min/max on strings uses lexicographic ordering, so any value alphabetically between AT and US is a potential match, even if the matching rows are extremely sparse.

Only files whose max(age) is below 40 get eliminated. The detail with --verbose:

delta-explain ./users-flat -w "country = 'DE' AND age > 40" --verbose
Phase 1: Data skipping (min/max statistics)
    predicate:       country = 'DE' AND age > 40
    files remaining: 4  (-2, 33% pruned)

    [DROPPED] part-00001.snappy.parquet  (1.1 KB  4 records)  stats(age: 18..29, country: AL..US)
    [DROPPED] part-00002.snappy.parquet  (1.1 KB  5 records)  stats(age: 20..35, country: AT..US)
    [KEPT   ] part-00003.snappy.parquet  (1.1 KB  4 records)  stats(age: 30..55, country: DE..US)
    [KEPT   ] part-00004.snappy.parquet  (1.1 KB  3 records)  stats(age: 22..50, country: AT..IT)
    [KEPT   ] part-00005.snappy.parquet  (1.1 KB  5 records)  stats(age: 40..65, country: CH..US)
    [KEPT   ] part-00006.snappy.parquet  (1.1 KB  3 records)  stats(age: 25..45, country: AT..DE)

File 4 has country: AT..IT — a range that includes “DE” alphabetically, so it cannot be eliminated even if it contains no German records at all. File 6 has country: AT..DE and age: 25..45 — it might contain valid records, so it stays.
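The string-range check reduces to lexicographic comparison. A minimal sketch (hypothetical helper, not the tool's code):

```rust
/// Data skipping for `country = 'DE'` on string min/max statistics:
/// the file survives whenever the value falls inside [min, max]
/// under lexicographic ordering.
fn may_contain_country(min: &str, max: &str, value: &str) -> bool {
    min <= value && value <= max
}

fn main() {
    // File 4, stats(country: AT..IT): "DE" is in range, kept.
    assert!(may_contain_country("AT", "IT", "DE"));
    // File 6, stats(country: AT..DE): "DE" equals max, kept.
    assert!(may_contain_country("AT", "DE", "DE"));
    // A hypothetical range FR..US excludes "DE": droppable.
    assert!(!may_contain_country("FR", "US", "DE"));
}
```

This is why mixed string values are so damaging to data skipping: wide ranges keep almost every file, regardless of how few rows actually match.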

Result: from 83% pruning down to 33%. On 6 files the difference is negligible. On 60,000 files it means reading 30,000 extra files per query.

The point is not that the table is “wrong” in an absolute sense. It’s that the data layout is not aligned to the query pattern. delta-explain makes this misalignment visible before it becomes a production performance problem.

CI/CD integration

Once pruning is visible and measurable, the next step is to automate it. delta-explain supports assertion flags designed for CI:

delta-explain s3://warehouse/events \
    --env-creds \
    -w "date = '2024-01-15'" \
    --min-pruning 90

--min-pruning 90 fails the command (exit code 1) if total pruning is below 90%. The predicate should reflect a representative query pattern for the table — hardcoded date literals drift over time, so consider using a recent partition value or a dynamically computed date. The threshold itself depends on partitioning granularity: a table partitioned by day with 365 partitions has a natural baseline around 99.7% for a single-day predicate, while a table with 4 region partitions tops out at 75%.
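The natural baseline mentioned above is just (N - 1)/N for a predicate selecting one of N equally populated partitions. A quick sanity check (illustrative arithmetic, not part of the tool):

```rust
/// Best-case pruning percentage for a predicate that selects exactly
/// one of `n` equally populated partitions.
fn baseline_pruning_pct(n: u32) -> f64 {
    100.0 * (n as f64 - 1.0) / n as f64
}

fn main() {
    // Daily partitions over a year: ~99.7% is the ceiling.
    assert!((baseline_pruning_pct(365) - 99.726).abs() < 0.01);
    // Four region partitions: pruning tops out at 75%.
    assert_eq!(baseline_pruning_pct(4), 75.0);
}
```

A threshold should sit below the table's own ceiling, with some slack for skew across partitions.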

The failure output:

ASSERTION FAILED: total pruning 33.3% is below threshold 90.0%

--assert-stats verifies that all files have statistics in the transaction log. --format json produces structured output for downstream parsing. They can be combined:

delta-explain s3://warehouse/events \
    --env-creds \
    -w "date = '2024-01-15'" \
    --min-pruning 90 \
    --assert-stats \
    --format json

Together they turn delta-explain into a quality gate: after every ETL, you verify that the resulting table has the expected pruning and complete statistics. If someone changes the partitioning or the writer stops producing statistics, the pipeline stops before the problem reaches users.

An example integration with GitHub Actions:

- name: Verify pruning after ETL
  run: |
    delta-explain s3://warehouse/events \
        --env-creds \
        -w "date = '2024-01-15'" \
        --min-pruning 90 \
        --assert-stats

The architecture

The tool is written in Rust and built around two main components: delta-kernel-rs and sqlparser-rs.

delta-kernel-rs is the official library for reading Delta table metadata without depending on Spark or any specific engine. delta-explain uses it to build multiple scans over the same snapshot, each with a different predicate, and to count how many files survive each pass. The first scan has no predicate (baseline); the second applies only partition column clauses; the third applies the combined partition and data-skipping predicate. The difference between scans 1 and 2 isolates partition pruning; the difference between scans 2 and 3 isolates data skipping.
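The differencing step can be sketched as plain arithmetic over the three scan counts (an illustrative stand-in, not the tool's actual code):

```rust
/// File counts from the three scans over the same snapshot.
struct ScanCounts {
    baseline: usize,        // scan 1: no predicate
    after_partition: usize, // scan 2: partition clauses only
    after_full: usize,      // scan 3: partition + data-skipping clauses
}

/// Per-phase pruning derived by differencing consecutive scans.
fn phase_pruning(c: &ScanCounts) -> (usize, usize) {
    (
        c.baseline - c.after_partition,   // files pruned by partitions
        c.after_partition - c.after_full, // files pruned by min/max stats
    )
}

fn main() {
    // Counts from the worked example: 6 -> 2 -> 1.
    let c = ScanCounts { baseline: 6, after_partition: 2, after_full: 1 };
    assert_eq!(phase_pruning(&c), (4, 1));
}
```

Attributing files to phases by subtraction is what lets the report credit each mechanism separately without ever opening a data file.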

The important point is what doesn’t happen: no Parquet file is opened. ScanBuilder with a predicate filters candidate files using only transaction log metadata — partition values and min/max statistics. delta-explain calls scan_metadata(), not scan_data(). The cost of the analysis is proportional to the number of files in the log, not the size of the data.

sqlparser-rs handles the other side of the problem: turning a user-written SQL predicate into a structure that delta-kernel can evaluate. The WHERE clause is parsed into a standard SQL AST, then recursively converted into delta-kernel’s Predicate and Expression types — the types ScanBuilder expects.

But the translation is not direct. There is an intermediate step that is the core of the tool: predicate separation. Top-level AND clauses are classified based on the columns they reference. If all columns in a clause are partition columns, the clause goes to Phase 1 (partition pruning). Otherwise it goes to Phase 2 (data skipping). (Partition columns are currently inferred from the partition_values keys of the first file in the snapshot, not from protocol metadata — on an empty table with no files, the tool cannot detect them.)

This separation works only for top-level AND clauses. An OR across partition and non-partition columns (e.g., country = 'DE' OR age > 40) cannot be split — the entire expression falls to data skipping, and partition pruning contributes nothing.

This is what allows delta-explain to show the two phases separately and measure each one’s contribution.

"country = 'DE' AND age > 40"
         |
    sqlparser-rs
         |
    +----+----+
    | flatten |  (split top-level ANDs)
    |   AND   |
    +----+----+
         |
    +----+--------------------+
    |                         |
  country = 'DE'          age > 40
    |                         |
    v                         v
 partition columns?    partition columns?
   YES                       NO
    |                         |
    v                         v
 Phase 1                 Phase 2
(partition pruning)     (data skipping)
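The classification step in the diagram can be sketched as follows, assuming the parser has already flattened top-level ANDs and extracted the columns each clause references (simplified types, not the tool's actual ones):

```rust
use std::collections::HashSet;

/// A top-level AND clause and the columns it references.
struct Clause<'a> {
    text: &'a str,
    columns: Vec<&'a str>,
}

/// Route each clause: Phase 1 iff every referenced column is a
/// partition column, otherwise Phase 2 (data skipping).
fn separate<'a>(
    clauses: Vec<Clause<'a>>,
    partition_cols: &HashSet<&str>,
) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut phase1 = Vec::new();
    let mut phase2 = Vec::new();
    for c in clauses {
        if c.columns.iter().all(|col| partition_cols.contains(col)) {
            phase1.push(c.text);
        } else {
            phase2.push(c.text);
        }
    }
    (phase1, phase2)
}

fn main() {
    let partition_cols: HashSet<&str> = ["country"].into();
    let clauses = vec![
        Clause { text: "country = 'DE'", columns: vec!["country"] },
        Clause { text: "age > 40", columns: vec!["age"] },
    ];
    let (p1, p2) = separate(clauses, &partition_cols);
    assert_eq!(p1, vec!["country = 'DE'"]);
    assert_eq!(p2, vec!["age > 40"]);
}
```

Note the "all columns" rule: a clause mixing a partition column with a non-partition column cannot be evaluated from directory structure alone, so it falls to Phase 2.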

A detail on type conversion: sqlparser produces generic literals (Number("42")). But delta-kernel expects precise types — an Expression::literal(42_i32) is not the same as Expression::literal(42_i64). delta-explain resolves the ambiguity by looking at the column type in the Delta schema. If age is an integer, the literal 42 becomes an i32. If it’s a long, it becomes i64. If it’s a double, 42 becomes 42.0_f64. Without this step, data skipping comparisons wouldn’t work.
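The schema-driven resolution can be sketched with simplified stand-ins for delta-kernel's types (the enum names here are hypothetical, not the library's):

```rust
/// Delta column types relevant to numeric literal resolution (simplified).
enum DeltaType { Integer, Long, Double }

/// A typed literal, standing in for delta-kernel's Expression literals.
#[derive(Debug, PartialEq)]
enum Literal { I32(i32), I64(i64), F64(f64) }

/// Resolve sqlparser's generic Number("42") against the column's
/// type in the Delta schema.
fn resolve(raw: &str, col_type: &DeltaType) -> Option<Literal> {
    match col_type {
        DeltaType::Integer => raw.parse().ok().map(Literal::I32),
        DeltaType::Long => raw.parse().ok().map(Literal::I64),
        DeltaType::Double => raw.parse().ok().map(Literal::F64),
    }
}

fn main() {
    assert_eq!(resolve("42", &DeltaType::Integer), Some(Literal::I32(42)));
    assert_eq!(resolve("42", &DeltaType::Long), Some(Literal::I64(42)));
    assert_eq!(resolve("42", &DeltaType::Double), Some(Literal::F64(42.0)));
}
```

The schema, not the literal's spelling, decides the type: the same "42" becomes three different values depending on the column it is compared against.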

Reading statistics from the transaction log

A third component works alongside the first two: direct reading of statistics from the transaction log.

delta-kernel internally uses min/max statistics to filter files when it receives a predicate via ScanBuilder. But it doesn’t expose them. It returns the list of surviving files, not the reason each file was kept or dropped. For delta-explain this is not enough: the tool’s value lies precisely in showing why a file is eliminated — and that requires the raw statistics of every file.

The solution is to read them independently, parsing the JSON files in _delta_log/. The Delta transaction log is a sequence of JSON files, one per version, where each line is an action. The add actions contain a stats field — a nested JSON string with numRecords, minValues, maxValues, and nullCount per column:

{
    "add": {
        "path": "country=DE/part-00000-a35083c1.parquet",
        "stats": "{\"numRecords\":4,\"minValues\":{\"age\":40,\"score\":65.5},\"maxValues\":{\"age\":60,\"score\":88.9},\"nullCount\":{\"age\":0,\"score\":0}}"
    }
}

delta-explain iterates over all JSON files in the log in order, accumulates the statistics of every add, and removes those invalidated by a remove action (indicating a deleted or overwritten file). The result is a path -> stats map consistent with the current snapshot.
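The replay is a fold over the action stream. A minimal sketch with simplified action types (illustrative, not the tool's code):

```rust
use std::collections::HashMap;

/// A minimal view of the actions found while replaying _delta_log/*.json.
enum Action {
    Add { path: String, stats: Option<String> },
    Remove { path: String },
}

/// Fold the actions in commit order into a path -> stats map
/// consistent with the current snapshot.
fn replay(actions: Vec<Action>) -> HashMap<String, Option<String>> {
    let mut map = HashMap::new();
    for action in actions {
        match action {
            Action::Add { path, stats } => { map.insert(path, stats); }
            Action::Remove { path } => { map.remove(&path); }
        }
    }
    map
}

fn main() {
    let actions = vec![
        Action::Add { path: "a.parquet".into(), stats: Some("{\"numRecords\":4}".into()) },
        Action::Add { path: "b.parquet".into(), stats: None },
        Action::Remove { path: "a.parquet".into() }, // overwritten file
    ];
    let map = replay(actions);
    assert!(!map.contains_key("a.parquet")); // removed by the later action
    assert!(map.contains_key("b.parquet")); // present, but without stats
}
```

Keeping the stats as an Option also preserves the distinction between "file missing from the snapshot" and "file present but without statistics", which --assert-stats relies on.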

An important limitation: delta-explain reads only JSON commit files, not checkpoint Parquet files. If the table has been checkpointed and older JSON files cleaned up — standard practice in production — the statistics map will be incomplete. Files whose add actions exist only in the checkpoint will appear with [no stats] in verbose output even if statistics exist. This affects --assert-stats as well.

This is what makes the verbose output possible:

[DROPPED] part-00000-c34f1417.parquet  (1.1 KB  5 records)  partition(country=DE)  stats(age: 20..35)

The partition(country=DE) comes from delta-kernel (the partition_values returned by the scan). The stats(age: 20..35) comes from the direct log reading. The two sources are independent and are joined at rendering time.

A useful side effect: the same statistics reading feeds the --assert-stats flag. If a file appears in delta-kernel’s scan but has no stats field in the log — or the field exists but is empty — delta-explain flags it. This catches a specific class of problems: writers that produce valid Parquet files but without statistics in the log, silently making data skipping ineffective.

The full flow is a straightforward pipeline: sqlparser transforms the WHERE clause into an AST, and the predicate is separated into partition clauses and statistics clauses. delta-kernel then runs one scan per phase counting surviving files, raw statistics are read directly from the log as a separate step, and the report joins everything: counts from kernel, stats from the log, original predicate text from sqlparser.

The architectural boundary

delta-explain stops at the Delta protocol. It shows how many files are eliminated by partition pruning and data skipping. It does not look inside the surviving files.

In a real execution, there is a third filtering level: predicate pushdown. When an engine opens a Parquet file, it doesn’t necessarily read all rows. Parquet files are organized into row groups, and each row group has its own min/max statistics in the footer. An engine can use these statistics to skip entire row groups without decompressing them.

This level is deliberately out of scope. Not a gap — an architectural choice.

Partition pruning and data skipping operate on transaction log metadata: they are Delta protocol mechanisms, independent of the engine and the physical format. Predicate pushdown operates on the Parquet footer: it is a file format mechanism, managed by the engine at read time.

Closing

The open question is whether the view delta-explain provides is sufficient to reason about performance.

It is not enough when files are very large with many row groups, when predicates are highly selective but survive partition pruning and can only be filtered inside the Parquet file, or when the bottleneck is not file count but the volume of data read within each file.

Yet in many practical cases, it is. If a query reads 1 file out of 100, row-group pushdown inside that file is a marginal optimization. The main gain was already achieved upstream. And above all: the most common problems — wrong partitions, missing statistics, layout misaligned with query patterns — are all visible at this level.

delta-explain does not try to explain all of performance. It stops earlier, at the point where the system decides which files are worth reading at all.

That decision is where the largest gain usually happens. And as long as it remains invisible, performance in Delta Lake remains harder to reason about than it should be.