The Hidden DSL in Catalyst
How Spark’s internal rewriting framework, Catalyst, exposes an embedded DSL with a public extension surface, the same one Delta Lake and Iceberg use to plug into the optimizer pipeline.
Partition pruning and data skipping are invisible by default. delta-explain reads the Delta log directly and shows, step by step, how a WHERE predicate narrows down candidate files, with no engine required.
Six concrete anti-patterns I encountered building a real Catalyst extension: from the wrong rule type for throws, to mutable state under AQE and Spark Connect, to JVM bootstrap traps in PySpark.
How Delta Kernel, Arrow, and pluggable execution are disaggregating the lakehouse stack. The lakehouse stack is not converging on a new dominant engine — it is converging on a layered architecture in which protocol, data representation, and query execution are increasingly isolated behind stable interfaces.
Many of the most surprising performance pathologies in modern data systems are semiotic failures — structural divergences between what an interface signifies and what the underlying system does.
At 10 TB, updating 200k rows can mean rewriting thousands of files. Here’s why, and what to do about it.