It is often said that Spark is lazy. The statement is not wrong, but it is imprecise. What defines the performance of a Spark job is not when execution happens, but what the optimizer can see. Thinking of Spark as a dataflow compiler — rather than a lazy engine — changes the questions you ask and the code you write.


Three Kinds of “Lazy”

The word “lazy” is used to describe at least three different ideas: lazy evaluation, deferred execution, and what we might call DAG compilation. They are related, but they are not the same thing.

Lazy evaluation is a property of a programming language. Values are not computed until they are needed. Languages like Haskell implement this model through thunks, memoization, and demand-driven evaluation. This is a semantic property of the language itself.

Deferred execution is a property of an API. Operations are recorded but not executed immediately. Execution starts only when a terminal signal is received. Java’s Stream API works this way: intermediate operations like filter and map are registered, and execution begins only when a terminal operation like collect is invoked. This is not about evaluation semantics; it is about when execution is triggered.
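A plain-Python analogue makes the distinction concrete: a generator pipeline records its operations but processes nothing until a terminal operation consumes it. A minimal sketch:

```python
# A generator pipeline: nothing runs until a terminal operation consumes it.
log = []

def traced(xs):
    for x in xs:
        log.append(x)   # record that this element was actually processed
        yield x

pipeline = (x * 2 for x in traced(range(5)) if x % 2 == 0)

assert log == []         # building the pipeline executed nothing
result = list(pipeline)  # list() plays the role of the terminal operation
assert result == [0, 4, 8]
assert log == [0, 1, 2, 3, 4]
```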

DAG compilation is something else entirely. It is a property of the runtime. Operations are collected, analyzed, reordered, rewritten, and optimized before execution begins. SQL query engines follow this model: a query planner rewrites and optimizes a query before running it, yet no one calls PostgreSQL “lazy” for doing so.

What Spark Actually Does

Spark combines deferred execution and DAG compilation. At the API level, transformations are recorded. At runtime, Catalyst analyzes the logical plan, optimizes it, produces a physical plan, and hands it to the scheduler. The fact that execution is deferred is not the point; it is a consequence of global compilation. Spark is not postponing work for the sake of laziness. It is delaying execution to enable whole-plan optimization.

This pattern is not unique to Spark. No one would say that LLVM is lazy because it transforms source code into IR, optimizes it, and emits machine code. Spark is closer to LLVM than to Haskell.

So far, this may read like a semantic distinction. It is not. The compiler model changes what you optimize for, and it explains failures that the “lazy” model cannot.

Where the Compiler Model Bites

Consider a simple filter-then-aggregate pipeline:

from pyspark.sql.functions import col

df = spark.read.parquet("events/")
result = df.filter(col("country") == "IT").groupBy("city").count()
result.show()

Catalyst pushes the filter down into the Parquet scan. The physical plan reads only the rows and columns it needs. This is predicate pushdown, and it happens because Catalyst can see the entire plan before execution begins.

Now consider the same logic, routed through a Python UDF:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

is_italian = udf(lambda c: c == "IT", BooleanType())
result = df.filter(is_italian(col("country"))).groupBy("city").count()
result.show()

Functionally equivalent. But Catalyst cannot see inside the UDF. It is an opaque box. The predicate is not pushed down. The scan reads every row from every partition, serializes it to Python, evaluates the function, and sends the result back. You can verify this with explain():

# Native expression
== Physical Plan ==
*(2) HashAggregate(keys=[city], functions=[count(1)])
+- Exchange hashpartitioning(city, 200)
   +- *(1) HashAggregate(keys=[city], functions=[partial_count(1)])
      +- *(1) Filter (country = IT)
         +- *(1) ColumnarToRow
            +- FileScan parquet [city,country]
                 PushedFilters: [EqualTo(country,IT)]

# UDF
== Physical Plan ==
*(2) HashAggregate(keys=[city], functions=[count(1)])
+- Exchange hashpartitioning(city, 200)
   +- *(1) HashAggregate(keys=[city], functions=[partial_count(1)])
      +- *(1) Filter UDF(country)
         +- *(1) ColumnarToRow
            +- FileScan parquet [city,country]
                 PushedFilters: []

Notice PushedFilters. In the first plan, the predicate reaches the storage layer. In the second, it does not. The “lazy” model cannot explain this difference — both plans defer execution equally. The compiler model explains it directly: Catalyst optimizes what it can see. A UDF is opaque; a native expression is transparent.

This is not specific to Python. Scala UDFs are equally opaque to Catalyst. The issue is not the language; it is that any UDF is a ScalaUDF or PythonUDF node in the expression tree — not a native Catalyst expression, and therefore not rewritable.

Nested structures create a subtler blind spot. Suppose your Parquet schema has a struct column:

root
 |-- event_id: string
 |-- address: struct
 |    |-- country: string
 |    |-- city: string
 |-- timestamp: long

You filter on the nested field:

df = spark.read.parquet("events/")
df.filter(col("address.country") == "IT").select("event_id", "address.city")

This looks entirely native. No UDF, no opaque function — just dot notation on a Column. Catalyst resolves it, places it in the logical plan, and optimizes around it. But predicate pushdown to the storage layer is a different matter.

There are two distinct optimizations at play, and they do not behave the same way on nested fields:

Nested column pruning works. Spark can project only the struct subfields it needs, address.country and address.city, rather than reading the entire address struct. This has been available since Spark 2.4 (behind spark.sql.optimizer.nestedSchemaPruning.enabled, on by default since Spark 3.0) and works reliably with Parquet.

Nested predicate pushdown often does not. Whether the predicate address.country = "IT" reaches the Parquet reader depends on the data source implementation and the Spark version. Before Spark 3.0, nested predicates were generally not pushed down to Parquet at all; Spark 3.0 added nested pushdown for Parquet and ORC (controlled by spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources), and support still varies across connectors. When the data source declines the predicate, the filter stays above the scan:

== Physical Plan ==
*(1) Filter (address.country = IT)
+- *(1) ColumnarToRow
   +- FileScan parquet [event_id,address]
        PushedFilters: []

Compare with the equivalent schema flattened to top-level columns:

== Physical Plan ==
*(1) Filter (country = IT)
+- *(1) ColumnarToRow
   +- FileScan parquet [event_id,city,country]
        PushedFilters: [EqualTo(country,IT)]

The logical plan is optimized in both cases. The difference is in what reaches the storage layer. This is not a Catalyst limitation; it is a boundary between the optimizer and the data source API. Catalyst hands off a set of predicates; the data source decides which ones it can handle. For nested fields, the answer is often “none.”

This catches people off guard precisely because the compiler model predicts it and the lazy model does not. If Spark is merely lazy, there is no reason to expect a difference — both queries defer execution equally. If Spark is a compiler with a pushdown interface to storage, the question becomes: what can the storage layer accept? And the answer depends on schema design.

This is not an argument against UDFs or nested schemas in general. It is a statement about what you give up when you use them: you exit the compiler’s optimization scope, wholly or partially. Sometimes that tradeoff is worth it. But you should know you are making it.

Caching as Materialization Boundary

The same perspective reframes caching. If you think in terms of lazy versus eager execution, cache() becomes a way to “force Spark to compute here.” If you think in terms of compilation, caching becomes a materialization boundary in the dataflow graph.

Each action generates a separate QueryExecution — a distinct logical plan, optimized plan, and physical plan. Catalyst optimizes within that boundary. It cannot optimize across actions. When two jobs share a common subgraph, Catalyst recomputes it independently for each. Caching materializes an intermediate result so that subsequent jobs do not recompute it from scratch.

base = spark.read.parquet("events/").filter(col("year") == 2024)

# Without cache: the filter + scan is compiled and executed twice
italian = base.filter(col("country") == "IT").count()
german  = base.filter(col("country") == "DE").count()

# With cache: the shared subgraph is materialized once
base.cache()
italian = base.filter(col("country") == "IT").count()
german  = base.filter(col("country") == "DE").count()

This is analogous to a temporary table in SQL or an intermediate target in a build system. You introduce a materialization point where the plan structure requires it — where multiple downstream jobs depend on the same computed result — not where you feel the urge to force execution.

Misplaced caching is a common symptom of the “lazy” model. If Spark is merely lazy, caching seems like a way to control when it works. If Spark is a compiler, caching is a structural decision about where the plan should be split.

Shuffle Boundaries and Plan Shape

Another area where the compiler model pays off is shuffle-awareness. Consider these two logically equivalent pipelines:

# A: filter, then join
orders_it = orders.filter(col("country") == "IT")
result_a = orders_it.join(products, "product_id")

# B: join, then filter
joined = orders.join(products, "product_id")
result_b = joined.filter(col("country") == "IT")

For an inner join with a deterministic filter on a column from orders, Catalyst will produce the same physical plan for both — it pushes the filter before the join regardless of how you wrote it. This is the benefit of global compilation: the optimizer sees the whole plan, not the sequence of API calls.

But this equivalence is not universal. With a left outer join, a post-join filter on a column from the non-preserved (right) side cannot be pushed below the join: applied after the join, the filter rejects the null-extended rows, but applied before the join it would leave those rows in place. Catalyst knows this and will not reorder. Similarly, if the filter is a UDF or a non-deterministic expression, the optimizer's hands are tied. The order you wrote matters again, because you have forced the compiler to take your code literally.

The practical lesson: write declarative transformations. Use native Column expressions. Let Catalyst reorder, push down, and prune. The less you constrain the plan, the more room the optimizer has.

What Compilation Cannot Reach

The compiler model also clarifies Spark’s limits. Catalyst compiles and optimizes the logical plan of a single job because that plan is fully known before execution begins. You declare the entire dataflow, then trigger it. Within that job, you do not read intermediate results to dynamically decide the next step. Adaptive Query Execution (AQE) may refine the physical plan at runtime — coalescing partitions, switching join strategies based on actual data sizes — but it still operates within the boundary of a single query plan. It is runtime refinement, not interactive replanning.

Where the flow is interactive and sequential — an OLTP transaction, a step-by-step notebook that inspects results and branches — global compilation is not possible. This is not a limitation of Spark’s implementation; it is a structural constraint. A compiler needs the whole program. If you feed it one statement at a time, it becomes an interpreter.

This explains why chaining multiple actions in a loop, each depending on the result of the previous one, tends to produce suboptimal plans. It is not because Spark is slow; it is because the compiler scope is too narrow to optimize across iterations.


Calling Spark lazy is not wrong. But it is reductive, and the reduction has a cost. If Spark is a dataflow compiler, the way you write transformations changes. You stop asking “when does Spark execute?” and start asking “what can Catalyst see?” The answer to the second question determines the quality of your physical plan — and, ultimately, the performance of your job.