This analysis reflects the state of the ecosystem as of early 2026.

For the better part of a decade, the standard way to build a reliable data platform on object storage was: use Spark, add Delta, deploy on Databricks. It worked. It still works. But underneath that answer lies an architectural assumption that is quietly being dismantled — the assumption that the protocol, the data representation, the query engine, and the runtime should live in the same monolithic stack.

This article traces the disaggregation of that monolith — bottom-up, layer by layer — and maps what is replacing it.


The Monolith and Its Problems

Delta Lake as it was designed

Delta Lake was open-sourced by Databricks in 2019, but it was not designed in the abstract. It had been incubated internally before the open-source release — designed for Spark, written in Scala, tightly coupled to SparkSession, SparkContext, and Catalyst internals. The DeltaLog reads from a Spark-accessible filesystem. The OptimisticTransaction commits through Spark-configured storage. Schema evolution, partition pruning, and time travel were all surfaced through Spark-integrated APIs and planner internals.

This was not negligence. It was pragmatism. The problem Delta was solving — ACID semantics on object storage, schema enforcement, time travel — needed to be solved now, for Spark, where the actual workloads were running. Tight coupling was the fastest path to production.

As the ecosystem expanded, the cost of that coupling became harder to ignore.

The connector proliferation problem

Engines and access layers outside Spark — Trino, Flink, Hive, Presto, Redshift Spectrum, and a growing set of Python-native libraries — all needed to read and write Delta tables. But the authoritative implementation of Delta’s protocol semantics was tightly coupled to Spark. Every other engine had to reimplement the transaction log interpretation on its own.

The consequences were unsurprising. New protocol features would ship in the Spark connector and take months to appear in other engines. Subtle differences in how each connector interpreted the transaction log led to correctness bugs that were difficult to diagnose. And every connector team bore the maintenance cost of a parallel implementation of the same protocol logic.

This is not an accident of Delta’s design. It is a predictable consequence of embedding protocol semantics inside an engine-coupled implementation. The question was how to separate that logic from the engine.


The Architectural Response

Extracting the protocol

The answer was not to build better connectors one by one. It was to separate what the protocol says from how the data is processed.

Delta Kernel is the initiative that emerged from this insight — a set of libraries, developed under the delta-io umbrella, whose goal is to extract protocol logic into a shared layer independent of any engine. The division of responsibility is precise: snapshot construction, log interpretation, feature handling, and schema semantics live in the kernel; file reading, expression evaluation, and execution — including distributed execution where relevant — remain the responsibility of the host engine. The kernel exists in two implementation lines: a Java kernel for JVM-based engines, and a Rust kernel for non-JVM and cross-language environments, with C FFI bindings for embedding.1

Before Delta Kernel:

┌─────────────────────────────────┐
│         Delta Scala             │
│  ┌───────────────────────────┐  │
│  │  Protocol logic (Delta)   │  │
│  │  + Spark runtime coupling │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘

After Delta Kernel:

┌─────────────────────────────────┐
│   Spark / Flink / Trino /       │  ← engine-specific execution
│   DataFusion / custom engine    │
└────────────────┬────────────────┘
                 │ Engine interface
┌────────────────▼────────────────┐
│          Delta Kernel           │  ← protocol logic only
│  • transaction log parsing      │
│  • snapshot resolution          │
│  • partition pruning metadata   │
│  • column mapping               │
│  • schema enforcement           │
│  • deletion vector application  │
└─────────────────────────────────┘
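One kernel-side responsibility in the diagram, partition pruning metadata, can be made concrete with a deliberately simplified sketch: given the partition values the log records for each data file, the protocol layer can exclude files before the engine reads a single byte. All names below are invented for illustration; they are not delta-kernel-rs APIs.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an entry in a resolved snapshot:
// a data file plus the partition values recorded in the transaction log.
struct FileEntry {
    path: String,
    partition_values: HashMap<String, String>,
}

// Keep only files whose partition value for `column` matches `value`.
// Real pruning evaluates arbitrary typed predicates against partition
// values and file-level statistics; string equality keeps the sketch small.
fn prune<'a>(files: &'a [FileEntry], column: &str, value: &str) -> Vec<&'a FileEntry> {
    files
        .iter()
        .filter(|f| f.partition_values.get(column).map(String::as_str) == Some(value))
        .collect()
}

fn main() {
    let files = vec![
        FileEntry {
            path: "date=2026-01-01/part-0.parquet".into(),
            partition_values: HashMap::from([("date".into(), "2026-01-01".into())]),
        },
        FileEntry {
            path: "date=2026-01-02/part-1.parquet".into(),
            partition_values: HashMap::from([("date".into(), "2026-01-02".into())]),
        },
    ];
    let survivors = prune(&files, "date", "2026-01-02");
    assert_eq!(survivors.len(), 1);
    println!("{}", survivors[0].path);
}
```

The point of the sketch is the division of labor: deciding which files can be skipped is protocol knowledge and lives below the engine interface; actually reading the surviving files is the engine's job.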

delta-rs and delta-kernel-rs

delta-rs started in 2020 as a full Rust implementation of Delta Lake, with DataFusion integration and Python bindings. Today, while delta-rs retains everything operational — DML, conflict resolution, retry logic — delta-kernel-rs, developed as part of the Delta Kernel initiative starting in 2023, is emerging as the shared protocol foundation: snapshot resolution, log interpretation, schema enforcement, and metadata semantics. Its write path is still experimental (currently limited to blind appends), but delta-rs already delegates its read-side protocol semantics to the kernel.
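Snapshot resolution, the most fundamental of these kernel responsibilities, can be illustrated with a toy model: replay the add and remove actions from the transaction log, in log order, to compute the set of live files at a version. This is a sketch of the idea only; the real log also carries metadata, protocol, and transaction actions, and none of these names come from delta-kernel-rs.

```rust
use std::collections::BTreeSet;

// Toy model of the two log actions that determine snapshot membership.
enum Action {
    Add(String),    // data file added to the table
    Remove(String), // data file logically deleted (tombstoned)
}

// Replay actions in log order: adds insert, removes delete.
// The result is the set of data files that constitute the snapshot.
fn resolve_snapshot(log: &[Action]) -> BTreeSet<String> {
    let mut files = BTreeSet::new();
    for action in log {
        match action {
            Action::Add(path) => {
                files.insert(path.clone());
            }
            Action::Remove(path) => {
                files.remove(path);
            }
        }
    }
    files
}

fn main() {
    let log = vec![
        Action::Add("part-0.parquet".into()),
        Action::Add("part-1.parquet".into()),
        Action::Remove("part-0.parquet".into()), // e.g. rewritten by compaction
        Action::Add("part-2.parquet".into()),
    ];
    let snapshot = resolve_snapshot(&log);
    assert_eq!(snapshot.len(), 2);
}
```

Everything else the kernel does (scans, schema reads, statistics) starts from a resolved snapshot, which is why this logic is the first thing worth sharing across engines.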

The kernel is engine-agnostic. It knows what needs to happen — which log entries to read, which predicates to evaluate, which files belong to a snapshot — but not how. The how is defined by the caller through the Engine interface (a Rust trait): how to access storage, how to evaluate expressions, how to parse JSON and Parquet.2

pub trait Engine: AsAny {
    fn evaluation_handler(&self) -> Arc<dyn EvaluationHandler>;
    fn storage_handler(&self) -> Arc<dyn StorageHandler>;
    fn json_handler(&self) -> Arc<dyn JsonHandler>;
    fn parquet_handler(&self) -> Arc<dyn ParquetHandler>;
}

This is the seam where the kernel stops and the host engine begins.3 The EngineData abstraction is nominally format-agnostic, but in practice the shipped reference path is Arrow-based, making Arrow the de facto data exchange format in the default implementation path today.
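To show what this seam looks like from the host engine's side, here is a toy version of the handler pattern: kernel-side logic that needs a capability (here, parsing) receives it from the caller rather than implementing it. The trait bodies, the NaiveJson "parser", and read_commit are all invented for illustration; only the shape mirrors the real Engine interface.

```rust
use std::sync::Arc;

// Toy capability trait: how to turn raw bytes into parsed values.
// (Stands in for the kernel's JsonHandler; the real one produces EngineData.)
trait JsonHandler {
    fn parse(&self, raw: &str) -> Vec<String>;
}

// Toy engine trait: bundles the capabilities a host engine supplies.
trait Engine {
    fn json_handler(&self) -> Arc<dyn JsonHandler>;
}

// A minimal host implementation: whitespace splitting as a fake "parse".
struct NaiveJson;
impl JsonHandler for NaiveJson {
    fn parse(&self, raw: &str) -> Vec<String> {
        raw.split_whitespace().map(str::to_owned).collect()
    }
}

struct MyEngine;
impl Engine for MyEngine {
    fn json_handler(&self) -> Arc<dyn JsonHandler> {
        Arc::new(NaiveJson)
    }
}

// "Kernel-side" logic: knows *what* to do (count entries in a commit),
// but delegates *how* (parsing) to the engine-supplied handler.
fn read_commit(engine: &dyn Engine, raw_commit: &str) -> usize {
    engine.json_handler().parse(raw_commit).len()
}

fn main() {
    let n = read_commit(&MyEngine, "add remove add");
    assert_eq!(n, 3);
}
```

The design choice worth noticing is the direction of dependency: the protocol layer never links against a particular storage client, JSON parser, or expression evaluator, so a Spark-sized engine and a single-binary embedding can sit behind the same trait.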


The Disaggregation

The Engine trait of delta-kernel-rs reveals something larger than API design. It is not only the engine being decoupled from protocol semantics — the entire monolithic stack is coming apart into independent layers with explicit boundaries:

  • Protocol: Delta Kernel — shared semantics, embeddable in any engine
  • In-memory format: Apache Arrow — identical columnar layout across languages4
  • Query planning and execution: pluggable — DataFusion (planning + execution) in the Rust ecosystem, Calcite (planning framework) in the JVM ecosystem (Flink, Drill, and others), Catalyst in Spark, Trino’s own cost-based optimizer5

Each layer is now an independent library, replaceable without disturbing its neighbors. This is what disaggregation means in this context: not the disappearance of the monolith, but the modularization of what used to be inseparable.

Convergence, not parity

In practice, the clean separation described above holds only partially. And that is exactly the point.

Delta Kernel is an explicit attempt to create a shared protocol layer, but the ecosystem remains uneven. The Flink connector still relies on Delta Standalone — the pre-kernel Java library, no longer the strategic foundation for new connector work — and a kernel-based Flink 2.0 connector has only been proposed, not built.6 On the Rust side, delta-rs and delta-kernel-rs are advancing quickly, but public issue trackers still show open gaps around several newer table features — liquid clustering, row tracking, and v2 checkpoints among them.789

What is converging is not feature parity — it is the boundary. The old world was one in which connectors reimplemented Delta protocol behavior in parallel, with divergence accumulating silently. The emerging world is one in which engines may still differ in maturity and operational surface, but are no longer forced, by design, to diverge at the level of protocol interpretation.

The pattern is moving upward

Apache Comet is the clearest signal that the disaggregation is moving beyond protocol and format into the engine itself. Comet operates at the Spark execution layer and is format-agnostic — it works equally with Delta, Iceberg, or plain Parquet. It keeps Spark’s distributed runtime and Catalyst’s logical planning intact, but replaces physical execution operators with Rust implementations built on DataFusion and Arrow. Plans cross the JVM/native boundary via Protobuf; data crosses via Arrow FFI — zero-copy columnar exchange across the language boundary.10 Comet’s operator coverage is expanding, though not all operators are covered, and unsupported ones fall back to JVM execution transparently. Meta’s Velox pursues a similar pattern with a C++ execution engine and its own columnar format, interoperable with Arrow but not Arrow-native.11 Together, these projects demonstrate that the separation of protocol from execution, which Delta Kernel introduced at the storage layer, is now being applied one level up: between planning and execution within the query engine itself.
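The transparent fallback Comet relies on can be sketched as a plan-rewrite rule: walk the physical plan, rewrite the operators the native engine supports, and leave the rest on the JVM path, so a single plan can mix both. The enums below are invented to illustrate the pattern; Comet's actual rule lives in CometExecRule.scala.

```rust
// Toy physical operators; "native support" is a property of the operator kind.
#[derive(Debug, PartialEq)]
enum Operator {
    Scan,
    Filter,
    CustomUdf, // stands in for anything the native engine cannot run
}

#[derive(Debug, PartialEq)]
enum Placement {
    Native, // rewritten to the DataFusion-backed implementation
    Jvm,    // transparent fallback to the original Spark operator
}

// The rewrite rule: supported operators go native, everything else falls back.
fn place(op: &Operator) -> Placement {
    match op {
        Operator::Scan | Operator::Filter => Placement::Native,
        Operator::CustomUdf => Placement::Jvm,
    }
}

fn main() {
    let plan = [Operator::Scan, Operator::Filter, Operator::CustomUdf];
    let placements: Vec<Placement> = plan.iter().map(place).collect();
    assert_eq!(
        placements,
        vec![Placement::Native, Placement::Native, Placement::Jvm]
    );
}
```

In the real system the interesting cost sits at the boundary between the two placements, which is exactly where the Protobuf plan exchange and Arrow FFI described above come in.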

The disaggregation is moving upward through the stack — but it stops at the distributed runtime. The next two sections examine why.


What the Architecture Does Not (Yet) Cover

The layered picture described above is architecturally coherent, but several areas remain outside its reach.

Streaming: the protocol layer provides primitives for exactly-once sinks and change data feeds, but the orchestration layer — micro-batch triggering, offset tracking, watermark management — remains tightly coupled to Spark Structured Streaming or Flink. The Rust ecosystem does not yet offer a streaming equivalent with comparable production maturity.

Catalog integration: the disaggregated stack has no portable, vendor-neutral catalog. Hive Metastore and AWS Glue integrations exist at the engine level (Spark, Trino) but are not part of the shared protocol layer — and Unity Catalog was open-sourced in mid-2024 but remains operationally centered on the Databricks ecosystem.

Operational complexity: the monolithic stack handles maintenance and governance implicitly; a disaggregated deployment requires building that infrastructure explicitly. In practice, this means teams must own VACUUM scheduling, access control policies, and incident response across heterogeneous engines — for a stack whose failure modes no longer fit a single vendor’s documentation. Dependency coordination across independently released components is a concrete cost — version skew is a real operational burden that the monolithic stack did not expose as visibly. These are solvable problems, but the engineering effort is real.

These are real gaps, but they are gaps of coverage. The distributed runtime is a different kind of problem.


The Missing Layer

The distributed runtime is the layer most structurally resistant to disaggregation.

Why this layer is different

The distributed runtime layer is harder to disaggregate for a structural reason: its components are deeply entangled with each other. The shuffle service, memory manager, task scheduler, and DAG scheduler are tightly interdependent. Cluster state, partial failure handling, scheduling under resource pressure — these are not independent concerns that can be extracted one at a time. Spark has spent more than a decade solving them together.

This is why you cannot replace Spark with DataFusion + Calcite + delta-kernel and get equivalent behavior at scale. The single-node path is real: the Embucket team demonstrated DataFusion completing all 22 TPC-H queries at SF1000 (~1 TB) on a single node.12 But once your data outgrows a single machine — or your workload requires stateful fault tolerance, as streaming does — you need a distributed runtime.13

This is also why newer systems — Trino, Flink SQL, Ray Data, Dask — still end up rebuilding substantial portions of scheduling, shuffle, and fault-tolerance infrastructure.

The pressure building on this layer

The distributed runtime is harder to disaggregate, but it is not immune to the same forces. The pressure builds more slowly and comes from a different direction.

Managed shuffle is the first crack. LinkedIn’s Magnet (a push-based shuffle optimization for Spark), Apache Celeborn (external shuffle service), and AWS Glue’s cloud shuffle storage plugin all extract or externalize shuffle — the most expensive and most Spark-specific component of the distributed runtime — into a standalone service.

Ray attacks the problem from the opposite direction: provide a general-purpose distributed computing framework on which data systems can be built. But Ray was designed with different priorities — fine-grained actor and task parallelism, not coarse-grained stage-based SQL execution with optimized shuffle. Ray Data’s analytical path is evolving, but as of early 2026 its shuffle and fault-tolerance model has not yet reached the same level of SQL-oriented maturity and production hardening that Spark provides.


Conclusion

What began as a monolith is separating into layers with explicit boundaries. The pattern is consistent: each layer, once extracted behind a stable interface, becomes independently replaceable. The bottom layers — protocol and in-memory representation — are already disaggregated behind stable interfaces. Execution is now the layer where disaggregation is most visibly advancing. The distributed runtime remains the last intact monolithic layer, and the most structurally resistant to disaggregation — which is why Spark is still the center of Databricks’ commercial offering even as the layers around it come apart.

The common framing — “will DataFusion replace Spark?”, “is DuckDB the new Spark?” — misses the structural point. DataFusion and DuckDB are execution engines. They compete with Spark’s execution operators, not with Spark’s distributed runtime. Comet makes this explicit by embedding DataFusion inside Spark’s runtime rather than replacing it. As the protocol and in-memory layers become commoditized, the actual competition is between distributed runtimes — Spark, Flink, Ray, and their respective ecosystems — and that is where the next chapter of this story will be written.

For engineers building on this stack, the practical consequence is composability that would have been difficult to achieve even a few years ago: mixing engines against shared Delta tables, validating protocol correctness locally with pip install deltalake instead of standing up a cluster, embedding native execution inside a JVM runtime.14

The lakehouse stack is not converging on a new dominant engine. It is converging on a layered architecture.


Notes


  1. The delta-kernel-rs FFI layer (ffi/ crate) exposes a C API using extern "C" functions and repr(C) types. Iterators are modeled as C-compatible structs with function pointers (EngineIterator), and all kernel handles use a Handle type that manages ownership across the FFI boundary. This enables embedding in C, C++, Python (via ctypes/cffi), and JVM (via JNI) without requiring Rust on the consumer side. ↩︎

  2. “Snapshot resolution” is the process of computing the current state of a Delta table at a given version: reading the transaction log (JSON commit files and checkpoint Parquet files), replaying add/remove actions, and producing the set of data files that constitute the table at that point in time. This is the most fundamental operation the kernel performs — everything else (scans, schema reads, statistics) depends on having a resolved snapshot. ↩︎

  3. The kernel ships a DefaultEngine (behind feature flags default-engine-native-tls / default-engine-rustls) that bundles Arrow, Parquet, cloud object store support, and async I/O via tokio — lighter than a full query engine but not trivial. delta-rs uses DefaultEngine for all protocol operations. Apache DataFusion is an optional integration layer (behind a datafusion feature flag, though always enabled in the Python distribution) that provides SQL and DataFrame query capabilities on top. ↩︎

  4. Apache Arrow defines a columnar in-memory format with an identical memory layout across languages — Rust, Python, Java, C++ all represent Arrow buffers the same way. Data crosses component boundaries (kernel to engine, engine to connector, connector to caller) with little or no reshaping. Delta Lake stores data on disk as Parquet, which is also columnar. Parquet and Arrow share the same logical data model (typed columns, stored contiguously), so reading Parquet into Arrow is a columnar decode rather than a format conversion — substantially cheaper than materializing row-oriented structures. The kernel’s built-in DefaultEngine uses arrow-rs and parquet-rs directly; DataFusion also operates natively on Arrow RecordBatch structures. ↩︎

  5. Apache Calcite provides SQL parsing, relational algebra, and an optimizer framework used by Flink, Drill, Phoenix, and Beam. The analogy with DataFusion is architectural, not literal: Calcite is primarily a planning framework that delegates execution to external engines, while DataFusion combines planning and execution in a single library. The relevance to the disaggregation thesis is that Flink’s Delta connector runs through Calcite’s planning layer — though as of early 2026, it still uses Delta Standalone rather than the JVM kernel for protocol semantics. ↩︎

  6. See Feature Request: Flink 2.0 connector built on Delta Kernel. The existing Flink connector depends on Delta Standalone, which is no longer released in the 4.x line and receives only critical fixes in 3.x. A design document for a kernel-based Flink connector exists, but no implementation work has started as of early 2026. ↩︎

  7. See delta-rs issue #3249: Features — Support needed, which tracks liquid clustering, row tracking, and v2 checkpoints as features still requiring implementation or completion. The distinction between delta-kernel-rs (protocol-level read support) and delta-rs (full operational support including writes) matters here: the kernel may recognize a protocol feature for reading purposes while delta-rs has not yet implemented the corresponding write-path operations. ↩︎

  8. delta-rs implements DML operations in crates/core/src/operations/: merge/, update/, delete.rs, optimize.rs (including Z-order), vacuum.rs, load_cdf.rs, constraints.rs, restore.rs. Each uses DataFusion for query planning and execution, and the kernel for protocol-level commit. The kernel provides the primitives (add files, remove files, update deletion vectors); delta-rs provides the semantics (which rows to delete, how to merge, when to compact). The full list of delta-rs operations above the kernel boundary includes MERGE, UPDATE, DELETE, OPTIMIZE (including Z-order), VACUUM, Change Data Feed reads, CHECK constraints, additive schema evolution, and table restore — all gated behind the datafusion feature flag and not available through delta-kernel-rs alone. ↩︎

  9. The kernel’s TableChanges implementation requires identical schemas across the queried version range — compatible but non-identical schemas are not yet supported. Column mapping (both name and id modes) is supported for CDF reads as of v0.20. ↩︎

  10. Comet serializes Spark physical plan nodes to Protobuf (defined in native/proto/src/proto/operator.proto). The serialized plan crosses the JNI bridge via CometExecIteratorNative.createPlan(). On the Rust side, the Protobuf is converted to DataFusion physical plan nodes via PhysicalPlanner. Data flows via the Arrow C Data Interface: raw memory pointers to Arrow arrays and schemas cross JNI, enabling zero-copy data sharing between JVM and native memory. Unsupported operators fall back to the original Spark JVM operator transparently (CometExecRule.scala). ↩︎

  11. Meta’s Velox is a C++ execution engine that operates on its own columnar in-memory format (RowVector/BaseVector), not directly on Arrow. It provides Arrow import/export via the Arrow C Data Interface, meaning data crosses the Velox boundary through a (typically zero-copy) conversion step. Velox’s relationship to Arrow is thus more like “interoperable peer” than “native consumer,” unlike DataFusion which operates directly on Arrow RecordBatch structures. ↩︎

  12. The Embucket team demonstrated DataFusion completing all 22 TPC-H queries at SF1000 (≈1 TB) on a single node in November 2025, illustrating the performance envelope of the Rust/Arrow path on hardware-optimized single-node configurations. The DataFusion benchmarks repository (apache/datafusion-benchmarks) provides tooling to reproduce TPC-H derived benchmarks at various scale factors. These numbers should be understood as directional indicators, not as controlled head-to-head benchmarks. ↩︎

  13. The threshold is workload-dependent rather than a fixed data size. Scan-heavy analytics can run on a single node well into the terabyte range (as the Embucket SF1000 result demonstrates), but shuffle-heavy workloads — large joins, high-cardinality aggregations — exhaust single-node capacity much sooner. The commonly cited 500 GB to 1 TB range is typical for in-memory-dominant execution, not a hard wall — DuckDB spills aggressively to disk and can push beyond it, while DataFusion’s spill support is less mature. The point is not a number but a shape. ↩︎

  14. In practice, the gap extends beyond expression evaluation. Concurrent write resolution — the behavior when two writers conflict — differs substantially: Delta Spark and delta-rs both implement conflict resolution with isolation levels (Serializable and WriteSerializable at the protocol level, with SnapshotIsolation as an internal conflict-checking mode), while delta-kernel-rs reports conflicts without resolving them. Table maintenance behavior (statistics quality after OPTIMIZE, file sizing, Z-order effectiveness) also varies. For a local-development workflow, these differences are unlikely to surface. For integration testing against production-like concurrency patterns, they matter. ↩︎