Spark

Where Spark Changes Shape: UnsafeRow, the JVM, and the Relocation of Portability

Run df.explain on a Parquet scan and you find ColumnarToRow, an operator whose only job is to change the data’s shape. It is the seam where two architectural eras meet, and a record of how portability in analytical systems has relocated from the JVM runtime to the data format and the query plan.

The Hidden DSL in Catalyst

How Spark’s internal rewriting framework, Catalyst, exposes an embedded DSL with a public extension surface, the same one Delta Lake and Iceberg use to plug into the optimizer pipeline.

Anti-patterns in Catalyst rules

Six concrete anti-patterns I encountered building a real Catalyst extension: from the wrong rule type for throws, to mutable state under AQE and Spark Connect, to JVM bootstrap traps in PySpark.

Where Data System Abstractions Break: A Semiotic Reading

Many of the most surprising performance pathologies in modern data systems are semiotic failures: structural divergences between what an interface signifies and what the underlying system does.

Delta Lake MERGE Is Not a Simple Upsert. What Actually Happens at Scale.

At 10 TB, updating 200k rows can mean rewriting thousands of files. Here’s why, and what to do about it.

Spark Is Not Just Lazy. Spark Compiles Dataflow.

Why calling Spark ‘lazy’ is technically reductive, and how thinking of it as a dataflow compiler changes the way you design pipelines.