cdelmonte.dev

Where Data System Abstractions Break: A Semiotic Reading

Many of the most surprising performance pathologies in modern data systems are semiotic failures: structural divergences between what an interface signifies and what the underlying system does.

Delta Lake MERGE Is Not a Simple Upsert. What Actually Happens at Scale.

At 10 TB, updating 200k rows can mean rewriting thousands of files. Here’s why, and what to do about it.

A Small Compiler for Explaining Delta Lake Pruning

How delta-explain turns one predicate and the Delta log into a trustworthy explanation of file pruning, through shared representation, safe rewrites, and conservative measurement.

Spark Is Not Just Lazy. Spark Compiles Dataflow.

Why calling Spark ‘lazy’ is technically reductive, and how thinking of it as a dataflow compiler changes the way you design pipelines.

A Shared Kernel Is a Shared Trust Domain

Containers isolate processes, not trust boundaries. When your platform runs untrusted code, the architectural question is where you place the kernel boundary, and what that costs in memory, latency, and operational complexity.

Fixing Skewed Nested Joins in Spark with Asymmetric Salting

In large-scale Spark pipelines, skew can occur when a single key carries a disproportionately large nested payload. Asymmetric salting offers a targeted solution: explode, salt, join in parallel, and optionally re-aggregate.