Where Spark Changes Shape: UnsafeRow, the JVM, and the Relocation of Portability

Run df.explain on a Parquet scan and you find ColumnarToRow, an operator whose only job is to change the data’s shape. It is the seam where two architectural eras meet, and a record of how portability in analytical systems has relocated from the JVM runtime to the data format and the query plan.

May 26, 2026 · 18 min
The Hidden DSL in Catalyst

How Spark’s internal rewriting framework, Catalyst, exposes an embedded DSL with a public extension surface, the same one Delta Lake and Iceberg use to plug into the optimizer pipeline.

April 26, 2026 · 19 min
Anti-patterns in Catalyst rules

Six concrete anti-patterns I encountered building a real Catalyst extension: from the wrong rule type for throws, to mutable state under AQE and Spark Connect, to JVM bootstrap traps in PySpark.

April 26, 2026 · 16 min
Where Data System Abstractions Break: A Semiotic Reading

Many of the most surprising performance pathologies in modern data systems are semiotic failures — structural divergences between what an interface signifies and what the underlying system does.

March 4, 2026 · 13 min
Delta Lake MERGE Is Not a Simple Upsert. What Actually Happens at Scale.

At 10 TB, updating 200k rows can mean rewriting thousands of files. Here’s why, and what to do about it.

March 2, 2026 · 15 min
Spark Is Not Just Lazy. Spark Compiles Dataflow.

Why calling Spark ‘lazy’ is technically reductive, and how thinking of it as a dataflow compiler changes the way you design pipelines.

November 3, 2025 · 12 min