There is a class of production incident that feels uniquely frustrating. The system did exactly what it was supposed to do. The API was used correctly. The documentation, consulted afterward, turns out to have been accurate all along. And yet something went badly wrong — a job that rewrites ten times more data than expected, a container that provides no real isolation, a join that degrades catastrophically at scale.

The standard diagnosis is “leaky abstraction,” a term Joel Spolsky popularized in 2002. All abstractions leak, he argued. The implementation details eventually surface. This is true, but it explains very little. It doesn’t tell you which abstractions leak, where they leak, or why the leaks are so consistently surprising. It names the phenomenon without explaining its structure.

I want to propose a more precise diagnosis — one that draws on semiotics rather than software engineering, and specifically on the work of Charles Sanders Peirce and Umberto Eco. The claim is this: many of the most persistent and surprising performance pathologies in modern data systems are semiotic failures — predictable, recurring, and invisible to the model reader the interface itself constructs.


A Fifty-Year-Old Promise

The contract embedded in these interfaces is not new. Edgar F. Codd articulated it in 1970, in the paper that founded the relational model: A Relational Model of Data for Large Shared Data Banks. One of its central principles was data independence — the idea that queries should be written against a logical model, entirely insulated from the physical layout of the data. The user thinks in relations. The system handles storage, access paths, and execution strategy.

It was a radical and largely successful idea. The cost-based optimizer that Patricia Selinger and her colleagues built for IBM’s System R, described in their 1979 paper on access path selection, was its practical realization: given a declarative SQL query, the system would enumerate possible physical execution plans and choose the cheapest one. The user never needed to know which plan was chosen. That was the point.

But the System R team immediately recognized a tension they could not quite resolve. If the physical layer is fully hidden, users cannot understand why queries are slow. They cannot diagnose anomalies. They cannot make informed decisions about schema design or query structure. The eventual solution, EXPLAIN and later EXPLAIN ANALYZE, was already an admission: pure logical transparency is not enough. At some point, the physical layer has to become visible.
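That admission survives in every modern engine. As a minimal, runnable illustration (using SQLite rather than System R, purely for convenience), EXPLAIN QUERY PLAN exposes the physical access path hiding behind an unchanged logical query:

```python
import sqlite3

# The query text is identical in both cases; only the physical
# access path changes, and only EXPLAIN makes that visible.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Without an index on name, the planner falls back to a full scan.
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = 'codd'"
).fetchall()
print(plan_scan[0][3])   # e.g. "SCAN users"

# Adding an index changes the physical plan; the query is untouched.
conn.execute("CREATE INDEX idx_name ON users (name)")
plan_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = 'codd'"
).fetchall()
print(plan_index[0][3])  # e.g. "SEARCH users USING INDEX idx_name (name=?)"
```

The exact wording of the plan strings varies across SQLite versions; the point is only that the physical layer becomes visible on request, never by default.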

Spark, Delta Lake, and container runtimes have re-enacted this same drama at a different scale and with a fundamentally different physical architecture. The promise is Codd’s promise, restated fifty years later. The tension is the same tension. What has changed is the distance between the logical interface and the physical reality — and that distance, in distributed systems, is vast.


The Interface as Sign

Consider df.join(other, on="key"). The syntax denotes a logical operation — a relational join with clear set-theoretic semantics. But what the syntax produces in the mind of an experienced engineer is something richer: an expectation of uniform cost, transparent execution, and predictable scaling. Peirce called this the interpretant — the meaning the sign produces in its interpreter, not just the formal definition of the operation but the practical expectations accumulated through prior systems. The practitioner brings to the interface not just its definition but the full weight of past experience with databases and dataframe APIs that behaved consistently.

The problem is that the interpretant is wrong. Or rather: it is right often enough to be trusted, and wrong in precisely the cases that matter most.


Dictionary and Encyclopedia

Eco’s distinction between dictionary and encyclopedia is essential here. A dictionary definition is minimal and denotative: it captures the necessary and sufficient conditions for a term’s application. An encyclopedic entry is open and connotative: it captures the surrounding web of knowledge, assumptions, and inferential pathways that practitioners actually rely on.

Documentation behaves like a dictionary: it defines operations precisely but minimally. Engineers inevitably read it as an encyclopedia, filling in the surrounding operational assumptions themselves.

In Spark, a join can execute as a broadcast hash join, a shuffle hash join, a sort-merge join, or a skew join with AQE rewrites. These are not minor variations — they have radically different cost profiles, memory requirements, and failure modes. The representamen, Peirce’s term for the sign vehicle itself, here the unchanged join syntax, is identical; the physical realization differs fundamentally. The interface provides no signal that anything has changed. This is not a documentation failure but a semiotic one: the interface activates an interpretant of simplicity and uniformity while concealing the physical variation underneath.
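To make the divergence concrete, here is a toy sketch, in plain Python rather than Spark, of two join algorithms with identical logical semantics. Spark’s planner chooses among strategies like these behind the same unchanged df.join call:

```python
# Two physical realizations of one logical operation. Hash join builds
# an in-memory table (the shape of Spark's broadcast hash join); sort-
# merge join sorts both sides first (and, distributed, implies a shuffle).

def hash_join(left, right, key):
    # Build a hash table on one side, probe with the other.
    table = {}
    for row in right:
        table.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]

def sort_merge_join(left, right, key):
    # Sort both sides, then advance a cursor through the right side.
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        while j < len(right) and right[j][key] < l[key]:
            j += 1
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})
            k += 1
    return out

left = [{"key": 1, "a": "x"}, {"key": 2, "a": "y"}]
right = [{"key": 2, "b": "p"}, {"key": 1, "b": "q"}]

# Identical logical result; radically different cost, memory, and
# failure profiles at scale. The caller's syntax cannot tell them apart.
assert sorted(map(str, hash_join(left, right, "key"))) == \
       sorted(map(str, sort_merge_join(left, right, "key")))
```

The semiotic point is the assertion at the end: equality of results is exactly what the interface signifies, and exactly what conceals the divergence in everything else.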


The Constructed Reader and the Breakdown

Eco’s concept of the Model Reader adds a further dimension. Every text, Eco argued, constructs a model reader — it presupposes a specific set of competencies, knowledge, and interpretive habits. The text inscribes a particular reader into its structure.

Software interfaces do the same thing. The designers of Spark’s DataFrame API constructed a model user: someone who thinks in terms of logical transformations, who trusts the optimizer to make good decisions. That model user is well-served by the API in most cases.

But data systems at scale require a different reader — someone who understands file layout, partition statistics, shuffle cost, and optimizer visibility. That reader was not constructed by the interface. They had to construct themselves, through failures and performance autopsies and the gradual accumulation of knowledge that no documentation quite captures.

Terry Winograd and Fernando Flores, in Understanding Computers and Cognition (1986), gave a precise name to this moment: breakdown. A system functions smoothly as long as the user and the system share an implicit interpretive horizon. When that horizon fractures, the user is forced to confront the machinery they had been trusting invisibly. The breakdown is not a system failure — it is a failure of the shared interpretive frame.

This is exactly what happens when a Spark job runs for six hours instead of twenty minutes, or when a MERGE rewrites a terabyte of data to update a thousand rows. The system has not failed. The interpretive frame has. The engineer must reconstruct, from query plans and executor logs, a physical model the interface was designed to hide.

The gap between the reader the interface assumes and the reader the system actually requires is the central source of performance pathology in modern data engineering.


Four Fracture Points

Each of the following cases shares a structure: an interface that activates the wrong expectation, a physical reality that diverges from it, and a failure mode that becomes predictable once the divergence is named.

Spark and the compiler you do not see. The phrase “Spark is lazy” is the dictionary entry. The encyclopedic reading it activates is: execution is deferred, but the structure of execution is what you wrote. In reality, Spark is a dataflow compiler. Between your code and physical execution sits Catalyst, which transforms, rewrites, and optimizes the logical plan in ways that are invisible to the caller. The interpretant (“I know what my job will do”) is wrong not because Spark misbehaves, but because “lazy evaluation” conceals the degree to which the optimizer’s decisions depend on information that may or may not be available at plan time.
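A toy model, with invented node names and a single rewrite rule, shows what it means for a compiler to sit between your code and execution: the program you wrote is a tree, and the optimizer restructures it before anything runs.

```python
# Not Catalyst, just its skeleton: a plan tree and one rewrite rule.
# The caller builds the "logical" tree; the rule produces a different
# tree that executes, without any signal at the call site.

from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Project:
    cols: tuple
    child: object

@dataclass
class Filter:
    pred: str
    child: object

def push_filter_below_project(plan):
    # Rule: Filter(Project(x)) => Project(Filter(x)), assuming the
    # predicate only references columns the projection keeps.
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(proj.cols,
                       Filter(plan.pred, push_filter_below_project(proj.child)))
    return plan

logical = Filter("key = 42", Project(("key", "a"), Scan("events")))
physical = push_filter_below_project(logical)

# The tree that runs filters before projecting, so the scan can skip
# data earlier. The tree you wrote did the opposite.
assert isinstance(physical, Project) and isinstance(physical.child, Filter)
```

Catalyst applies dozens of such rules, and under AQE some of them fire only after partial execution, which is precisely why the interpretant “I know what my job will do” cannot be repaired by reading the source you wrote.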

Delta Lake MERGE and the upsert that rewrites everything. MERGE INTO looks like a row-level operation. The dictionary entry says: match rows, update or insert as appropriate. The encyclopedic reading is: only affected rows are touched. Partition pruning and data skipping can narrow the candidate set substantially, but the real constraint is that the storage layer operates at file granularity: once a file contains a modified row, the entire file must be rewritten. The result is that MERGE often rewrites far more data than the logical operation requires. The representamen evokes surgical precision. The referent is a sledgehammer.
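A back-of-envelope model, with invented numbers, shows the shape of the amplification, assuming a table stored as immutable files of one million rows each:

```python
# Write amplification when storage is file-granular: one modified row
# dirties its entire file. All numbers here are illustrative.

rows_per_file = 1_000_000   # e.g. roughly one Parquet file per million rows
updated_rows = 1_000        # the "surgical" logical update

# Worst case for an update key uncorrelated with file layout: every
# updated row lands in a different file.
files_rewritten = updated_rows
rows_rewritten = files_rewritten * rows_per_file

amplification = rows_rewritten // updated_rows
print(f"rows rewritten: {rows_rewritten:,} (amplification {amplification:,}x)")
```

Clustering the update key with the file layout (partitioning, Z-ordering) collapses the number of touched files, which is why layout decisions, invisible at the MERGE statement itself, dominate its cost.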

Asymmetric salting and the skew that statistics cannot see. The standard model of data skew — a small number of keys that appear disproportionately often — is encoded in every explanation of the problem and in the catalog statistics that optimizers use to detect it. But skew can also arise from keys that appear exactly once but carry payloads of radically different sizes — payload skew, as opposed to key frequency skew. This form of skew is invisible to cardinality statistics. No interface signals the difference, and no optimizer compensates for it. This is the simplest case in the set: the abstraction doesn’t mislead so much as go silent.
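A small sketch with invented numbers makes the blindness concrete: frequency statistics see a perfectly uniform distribution while one key carries nearly all the bytes.

```python
# Payload skew vs key frequency skew. Every key appears exactly once,
# so any cardinality statistic reports "no skew". The bytes say otherwise.
# All sizes are made up for illustration.

payload_bytes = {f"key_{i}": 10_000 for i in range(1_000)}
payload_bytes["key_whale"] = 5_000_000_000   # one ~5 GB payload

# What frequency-based statistics see: every key has count 1.
counts = {k: 1 for k in payload_bytes}
assert len(set(counts.values())) == 1        # uniform by key frequency

# What the shuffle actually moves: one partition gets almost everything.
total = sum(payload_bytes.values())
whale_share = payload_bytes["key_whale"] / total
print(f"largest key carries {whale_share:.1%} of all bytes")
```

Salting, the standard remedy, splits hot keys into multiple synthetic keys; applied asymmetrically, the same idea can split a single oversized payload, but nothing in the optimizer will suggest it, because nothing in its statistics can see the problem.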

Containers and the isolation that is not there. “Container” is perhaps the most semiotically loaded term in modern infrastructure. The word activates a powerful interpretant of strong isolation — reinforced not just by the metaphor (the shipping container, the sealed box) but by the operational interface: docker run produces what looks like a separate machine, with its own filesystem, process tree, and network stack. The physical reality is simpler: Linux containers share a kernel. They are namespace and cgroup boundaries, not trust boundaries. A container escape exploits not a bug in the abstraction but a structural gap between what the interface signifies (isolation) and what the kernel provides (separation). The representamen constructs a model reader who is building a deployment pipeline. It does not construct the model reader who is running untrusted code. This is not a bug in containers — it is a semiotic mismatch between the metaphor the interface evokes and the guarantees the kernel actually provides.



Why Interfaces Are Designed This Way

The logical transparency that Spark, Delta, and container runtimes offer is genuinely valuable — the same value Codd identified in 1970. Data independence allows practitioners to reason about complex systems without holding the entire physical implementation in their heads. The model reader these interfaces construct is not wrong; it is simply incomplete. And that incompleteness concentrates at the boundaries of the ordinary case: at scale, under skew, in adversarial environments. The interface is most misleading precisely where the stakes are highest.

In classical relational systems, the logical-physical distance was relatively contained — a handful of join algorithms, a single node, stable statistics — and failure modes tended to be gradual. In modern distributed systems, that distance is vastly larger: between a logical operation and its execution sit optimizers, adaptive planners, distributed schedulers, shuffle mechanisms, storage layouts, and incomplete statistics. The abstraction does not degrade smoothly. It holds — until it doesn’t. A Spark job runs reliably for months, then becomes ten times slower because the data distribution crossed a threshold the planner had not anticipated. The competencies required to diagnose this — the conditions under which AQE intervenes, the difference between logical and physical partitioning, the cost structure of shuffles — are not encoded in any interface. They are learned through production incidents and the slow reading of query plans.

The semiotic reading does not make these problems easier to fix. An experienced systems engineer will recognize every failure mode described here without needing Peirce to explain them. What the framework provides is not diagnosis but taxonomy: a structural account of why the same class of failure recurs across systems that share nothing except the decision to separate the logical interface from the physical reality. That is a modest claim. I think it is also a correct one.


Postscript

I studied semiotics under Umberto Eco at the University of Bologna. I work on distributed data systems. The connection between these two bodies of work did not occur to me for a long time. When it did, it seemed obvious — not as a metaphor, but as a structural identity. The problems are the same problem. The vocabulary is different.

If you have found the semiotic framing useful, the primary texts are Eco’s A Theory of Semiotics (1976) and The Role of the Reader (1979). Peirce’s collected papers are harder going, but the triadic sign model is accessible through secondary literature. Winograd and Flores’s Understanding Computers and Cognition (1986) is the most direct bridge between this tradition and software systems — it remains underread in engineering circles and does not deserve to be.

For the historical background on data independence and the optimizer, Codd’s 1970 paper is short and still worth reading directly. Selinger et al.’s 1979 paper on System R is where the logical/physical tension becomes fully visible as an engineering problem. The connection to compiler-style optimization becomes particularly clear in these database papers, where equivalent logical expressions are explicitly mapped to alternative physical execution plans.