There is a class of production incident that feels uniquely frustrating. The system did exactly what it was supposed to do. The API was used correctly. The documentation, consulted afterward, turns out to have been accurate all along. And yet something went badly wrong — a job that rewrites ten times more data than expected, a container that provides no real isolation, a join that degrades catastrophically at scale.

The standard diagnosis is “leaky abstraction,” a term Joel Spolsky popularized in 2002. All abstractions leak, he argued. The implementation details eventually surface. This is true, but it explains very little. It doesn’t tell you which abstractions leak, where they leak, or why the leaks are so consistently surprising. It names the phenomenon without explaining its structure.

I want to propose a more precise diagnosis — one that draws on semiotics rather than software engineering, and specifically on the work of Charles Sanders Peirce and Umberto Eco. The claim is this: the most systematic and surprising performance pathologies in modern data systems are semiotic failures. They arise not from bugs, resource exhaustion, or misconfiguration, but from a structural divergence between what an interface signifies and what the underlying system does — a divergence that is predictable, recurring, and invisible to the model reader the interface itself constructs.


A Fifty-Year-Old Promise

The contract embedded in these interfaces is not new. Edgar F. Codd articulated it in 1970, in the paper that founded the relational model: A Relational Model of Data for Large Shared Data Banks. One of its central principles was data independence — the idea that queries should be written against a logical model, entirely insulated from the physical layout of the data. The user thinks in relations. The system handles storage, access paths, and execution strategy.

It was a radical and largely successful idea. The cost-based optimizer that Selinger and her colleagues built for IBM’s System R in 1979 was its practical realization: given a declarative SQL query, the system would enumerate possible physical execution plans and choose the cheapest one. The user never needed to know which plan was chosen. That was the point.

But the System R team immediately recognized a tension they could not quite resolve. If the physical layer is fully hidden, users cannot understand why queries are slow. They cannot diagnose anomalies. They cannot make informed decisions about schema design or query structure. The eventual solution — EXPLAIN, and later EXPLAIN ANALYZE — was already an admission: pure logical transparency is not enough. At some point, the physical layer has to become visible.

Spark, Delta Lake, and container runtimes have re-enacted this same drama at a different scale and with considerably more physical complexity. The promise is Codd’s promise, restated fifty years later. The tension is the same tension. What has changed is the distance between the logical interface and the physical reality — and that distance, in distributed systems, is vast.


The Interface as Sign

Peirce’s triadic model of the sign is the starting point. For Peirce, a sign is not a binary relationship between a word and a thing. It is a three-way relationship between a representamen (the sign in its material form), an object (what the sign refers to), and an interpretant (the meaning the sign produces in the mind of its receiver).

Applied to software interfaces, the mapping is surprisingly precise.

df.join(other, on="key") is a representamen — a token in a formal language. Its object is the logical operation of relational join: a well-defined set-theoretic operation with clear semantics. But the interpretant — what this sign actually produces in the mind of an experienced engineer — is something richer and more specific: an expectation of uniform cost, transparent execution, and behavior that scales predictably with data volume.

That interpretant is not naive. It is built from years of working with databases, SQL, and dataframe APIs that behave consistently across contexts. It is, in Eco’s terms, an encyclopedic reading of the sign — not just the formal denotation, but the full network of associations, analogies, and inferences that an experienced practitioner brings to it.

The problem is that the interpretant is wrong. Or rather: it is right often enough to be trusted, and wrong in precisely the cases that matter most.


Dictionary and Encyclopedia

Eco’s distinction between dictionary and encyclopedia is essential here. A dictionary definition is minimal and denotative: it captures the necessary and sufficient conditions for a term’s application. An encyclopedic entry is open and connotative: it captures the web of knowledge, cultural assumptions, and inferential pathways that surround a concept in actual use.

Documentation is a dictionary. Engineers read it as an encyclopedia.

The Apache Spark documentation accurately describes join as a transformation that combines two datasets based on a key. This is the dictionary entry. But when an engineer reads that entry, they activate an encyclopedia built from relational databases, from years of working with SQL optimizers, from the intuition that “join” is an operation whose cost is primarily a function of the size of the inputs and the selectivity of the predicate. That encyclopedia is the product of legitimate experience. It is also systematically misleading in a distributed execution context.

In Spark, a join can execute as a broadcast hash join, a shuffle hash join, a sort-merge join, or a skew join with AQE rewrites. These are not minor variations of the same operation — they have radically different cost profiles, different memory requirements, and different failure modes. The representamen is identical across all cases. The physical realization differs fundamentally. And the interface provides no signal that anything has changed.
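
The divergence can be made concrete with a deliberately toy sketch. The numbers and the row-count threshold below are illustrative assumptions, not Spark's actual cost model (which reasons in bytes, via spark.sql.autoBroadcastJoinThreshold); the point is only that one representamen maps to physical strategies with different cost shapes.

```python
import math

# Illustrative stand-in for the planner's choice between two physical
# strategies hiding behind the single representamen df.join(other, on="key").
BROADCAST_LIMIT_ROWS = 100_000  # assumed row-count proxy for the byte threshold

def choose_join_plan(left_rows, right_rows):
    small, large = sorted((left_rows, right_rows))
    if small <= BROADCAST_LIMIT_ROWS:
        # Ship the small side to every executor: roughly one pass over each side.
        return "broadcast-hash", small + large
    # Otherwise shuffle both sides by key and sort-merge: ~ n log n per side.
    return "sort-merge", sum(n * math.log2(n) for n in (left_rows, right_rows))

for left, right in [(10_000_000, 1_000), (10_000_000, 10_000_000)]:
    plan, cost = choose_join_plan(left, right)
    print(f"{left:,} x {right:,} rows -> {plan} (relative cost ~ {cost:,.0f})")
```

The calling code is identical in both iterations of the loop; only the input sizes move the execution from linear-ish to n log n with a full shuffle.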

This is not a documentation failure. It is a semiotic one: the interface has been designed to activate a specific interpretant — simplicity, uniformity, logical transparency — while systematically concealing the physical variation underneath.


The Constructed Reader and the Breakdown

Eco’s concept of the Model Reader adds a further dimension. Every text, Eco argued, constructs an ideal reader — it presupposes a specific set of competencies, knowledge, and interpretive habits, and it is designed to be read by someone who possesses them. The text is not neutral: it inscribes a particular reader into its structure.

Software interfaces do the same thing. The designers of Spark’s DataFrame API constructed a model user: someone who thinks in terms of logical transformations, who is comfortable leaving physical execution to the runtime, who trusts the optimizer to make good decisions. That model user is well-served by the API in most cases. The interface was designed for them.

But data systems at scale require a different reader — someone who understands file layout, partition statistics, shuffle cost, and optimizer visibility. That reader was not constructed by the interface. They had to construct themselves, through failures and performance autopsies and the gradual accumulation of knowledge that no documentation quite captures.

Terry Winograd and Fernando Flores, in Understanding Computers and Cognition (1986), gave a precise name to what happens at the moment the mismatch becomes visible: breakdown. A system functions smoothly as long as the user and the system share an implicit interpretive horizon — a set of assumptions so well-aligned they are never consciously examined. When that horizon fractures, the user is suddenly forced to confront the machinery they had been trusting invisibly. The tool that was transparent becomes opaque. The breakdown is not a system failure. It is a failure of the shared interpretive frame.

This is exactly what happens when a Spark job runs for six hours instead of twenty minutes, or when a MERGE rewrites a terabyte of data to update a thousand rows. The system has not failed. The interpretive frame has. And the engineer is left trying to reconstruct, from query plans and executor logs and storage metrics, a model of physical reality that the interface was specifically designed to hide.

The gap between the model reader the interface constructs and the reader the system actually requires is, I would argue, the central source of performance pathology in modern data engineering.


Four Fracture Points

This framework maps cleanly onto a set of failure modes I have documented in previous articles. Each one involves the same structure: a representamen that activates a misleading interpretant, a physical reality that diverges from the encyclopedic expectation, and a failure mode that is predictable once the semiotic structure is visible.

Spark and the compiler you do not see. The phrase “Spark is lazy” is the dictionary entry. The encyclopedic reading it activates is: execution is deferred, but the structure of execution is what you wrote. In reality, Spark is a dataflow compiler. Between your code and physical execution sits Catalyst, which transforms, rewrites, and optimizes the logical plan in ways that are invisible to the caller. The interpretant (“I know what my job will do”) is wrong not because Spark misbehaves, but because the sign “lazy evaluation” does not encode the degree to which the optimizer’s decisions depend on information that may or may not be available at plan time.
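
A minimal sketch of the point, in plain Python rather than Catalyst's actual rule framework (the plan node classes here are hypothetical): a single rewrite rule is enough to make the executed plan structurally different from the plan that was written.

```python
from dataclasses import dataclass

# Toy logical plan nodes (illustrative, not Catalyst's classes).
@dataclass
class Scan:
    table: str

@dataclass
class Project:
    child: object
    columns: tuple

@dataclass
class Filter:
    child: object
    predicate: str

def push_filter_below_project(plan):
    # One rewrite rule: Filter(Project(x)) -> Project(Filter(x)).
    # Safe here because we assume the predicate uses only projected columns.
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(child=Filter(child=proj.child, predicate=plan.predicate),
                       columns=proj.columns)
    return plan

written = Filter(child=Project(child=Scan("events"), columns=("user", "ts")),
                 predicate="user = 42")
executed = push_filter_below_project(written)

# The caller wrote filter-on-top; the system runs project-on-top.
print(type(written).__name__, "->", type(executed).__name__)
```

Catalyst applies dozens of such rules, plus adaptive ones at runtime, which is why "the structure of execution is what you wrote" fails as an interpretant.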

Delta Lake MERGE and the upsert that rewrites everything. MERGE INTO looks like a row-level operation. The dictionary entry says: match rows, update or insert as appropriate. The encyclopedic reading is: only affected rows are touched. The physical reality is that the Delta Lake writer cannot determine with certainty, at planning time, which Parquet files contain the rows that will be matched — partition pruning and data skipping narrow the candidate set, but do not eliminate the uncertainty. The writer resolves that uncertainty conservatively, often rewriting far more data than the logical operation requires. The representamen evokes surgical precision. The object is a sledgehammer.
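
The conservatism is easy to simulate. The sketch below uses hypothetical file statistics, not Delta's implementation: a file is kept for rewrite whenever its min/max key range could contain an updated key, because that is the only safe move when the statistics cannot prove absence.

```python
# Hypothetical Parquet file statistics: name -> (min_key, max_key, size_mb).
files = {
    "part-000.parquet": (0, 9_999, 128),
    "part-001.parquet": (10_000, 19_999, 128),
    "part-002.parquet": (0, 99_999, 128),   # wide key range overlaps everything
    "part-003.parquet": (50_000, 59_999, 128),
}

updated_keys = {5, 17, 123}  # a "surgical" three-row update

def must_rewrite(min_key, max_key, keys):
    # Conservative: rewrite unless the stats prove no key can be inside.
    return any(min_key <= k <= max_key for k in keys)

rewritten = [name for name, (lo, hi, _) in files.items()
             if must_rewrite(lo, hi, updated_keys)]
print(rewritten, "->", sum(files[f][2] for f in rewritten), "MB rewritten for 3 rows")
```

Note that part-002, with its wide key range, gets rewritten no matter which keys are updated; poorly clustered data makes the conservative set balloon.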

Asymmetric salting and the skew that statistics cannot see. The standard model of data skew — a small number of keys that appear disproportionately often — is encoded in every explanation of the problem and in the catalog statistics that optimizers use to detect and compensate for it. But skew can also arise from keys that appear exactly once but carry payloads of radically different sizes. This form of skew is invisible to cardinality statistics. The optimizer’s model reader is someone whose data is skewed in the standard way. Engineers whose data is skewed asymmetrically are not the model reader, and the system has nothing to offer them.
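
The two kinds of skew are easy to contrast with made-up numbers. A frequency histogram catches the first immediately and reports the second as perfectly uniform:

```python
from collections import Counter

# Classic skew: one key dominates by row count.
rows = ["hot"] * 9_000 + [f"k{i}" for i in range(1_000)]
print("most frequent key:", Counter(rows).most_common(1))

# Asymmetric skew: every key appears exactly once, so the same
# histogram sees a perfectly uniform distribution...
payload_bytes = {f"k{i}": 1_000 for i in range(1_000)}  # ~1 KB payloads
payload_bytes["whale"] = 500_000_000                    # one 500 MB payload
assert max(Counter(payload_bytes.keys()).values()) == 1  # "no skew", per cardinality

# ...while a single key still carries almost all of the bytes,
# and whichever partition receives it does almost all of the work.
whale_share = payload_bytes["whale"] / sum(payload_bytes.values())
print(f"bytes behind one single-occurrence key: {whale_share:.1%}")
```

Salting by key frequency does nothing here, because every key has the same frequency; only a byte-aware strategy can see the whale.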

Containers and the isolation that is not there. “Container” is perhaps the most semiotic term in modern infrastructure. The word itself — from the shipping container, from the physical box that separates its contents from the world — activates a powerful interpretant of strong isolation. The physical reality is simpler: Linux containers share a kernel. They are namespace and cgroup boundaries, not trust boundaries. The representamen constructs a model reader who is building a deployment pipeline. It does not construct the model reader who is running untrusted code.


Why Interfaces Are Designed This Way

It would be easy to read this as a critique of bad API design. That would miss the point. These interfaces are not designed carelessly — they are designed with a specific communicative intention, and they succeed at it most of the time.

The logical transparency that Spark, Delta, and container runtimes offer is genuinely valuable. It is, in fact, the same value Codd identified in 1970: it allows practitioners to reason about complex systems without holding the entire physical implementation in their heads. Data independence is a real achievement. The model reader these interfaces construct is not wrong — it is simply incomplete.

The problem is that incompleteness is not uniformly distributed. It concentrates at the boundaries of the ordinary case: at scale, under skew, in adversarial environments, with unusual data distributions. These are exactly the cases where the cost of a wrong interpretant is highest — and where the breakdown, in Winograd and Flores’s sense, is most violent. The interface is most misleading precisely where the stakes are highest.

There is a further property worth naming. In classical relational systems, the logical-physical distance was relatively contained — a handful of join algorithms, a single node, stable statistics. The failure modes, when they appeared, tended to be gradual and legible. In modern distributed systems, that distance is vastly larger: between a logical operation and its execution sit optimizers, adaptive planners, distributed schedulers, shuffle mechanisms, storage layouts, and incomplete statistics. The result is that the abstraction does not degrade smoothly. It holds — until it doesn’t. A Spark job runs reliably for months, then becomes ten times slower because the join plan changed, or the cardinality estimate shifted, or the data distribution crossed a threshold the planner had not anticipated. The transition is rarely gradual. The interface was not designed to signal it.


Visibility and Optimization

There is a theoretical structure underneath all of this that connects semiotic analysis to compiler theory. In dataflow analysis — the formal framework that underlies optimizers like Catalyst — the set of transformations that are safe and useful at any given point in the program is determined by what information is visible to the compiler at that point. Optimization is bounded by the horizon of the compiler’s knowledge.

The same principle applies to practitioners. The quality of the decisions an engineer can make about a data system is bounded by the accuracy of their model of that system — by the precision of their interpretants. An engineer whose interpretant of df.join() includes the cost model of each physical join strategy, the conditions under which AQE will intervene, and the effect of partition statistics on plan selection can make good decisions. An engineer whose interpretant stops at the logical level cannot.
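
The bound can be illustrated with assumed numbers (the real configuration knob is spark.sql.autoBroadcastJoinThreshold, roughly 10 MB by default): a planner, mechanical or human, that chooses from stale size statistics makes a locally rational and globally wrong physical choice.

```python
BROADCAST_THRESHOLD_MB = 10  # in the spirit of spark.sql.autoBroadcastJoinThreshold

def choose_plan(estimated_small_side_mb):
    """Physical choice made from an *estimate*, not from reality."""
    if estimated_small_side_mb <= BROADCAST_THRESHOLD_MB:
        return "broadcast-hash"
    return "sort-merge"

actual_mb = 4_000       # the dimension table has grown since stats were gathered
stale_estimate_mb = 8   # what the statistics still claim

plan = choose_plan(stale_estimate_mb)
correct = choose_plan(actual_mb)
# The decision is bounded by the horizon of what the planner could see.
print(f"chosen: {plan}, correct: {correct}")
```

The same structure describes the engineer: a decision made from a logical-only interpretant is exactly a decision made from stale statistics.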

Eco called the ideal reader someone who activates the full set of competencies that a text requires. For data systems at scale, those competencies are not encoded in the interface. They have to be learned elsewhere — through production incidents, through reading query plans, through the kind of analysis I have tried to do in this series of articles.

That, in the end, is what this series has been about. Not how to use these systems. How to read them — and specifically, how to read the points where their abstractions stop describing reality.


Postscript

I studied semiotics under Umberto Eco at the University of Bologna. I work on distributed data systems. The connection between these two bodies of work did not occur to me for a long time. When it did, it seemed obvious — not as a metaphor, but as a structural identity. The problems are the same problem. The vocabulary is different.

If you have found the semiotic framing useful, the primary texts are Eco’s A Theory of Semiotics (1976) and The Role of the Reader (1979). Peirce’s collected papers are harder going, but the triadic sign model is accessible through secondary literature. Winograd and Flores’s Understanding Computers and Cognition (1986) is the most direct bridge between this tradition and software systems — it remains underread in engineering circles and does not deserve to be.

For the historical background on data independence and the optimizer, Codd’s 1970 paper is short and still worth reading directly. Selinger et al.’s 1979 paper on System R is the place where the logical/physical tension first becomes fully visible as an engineering problem. For the compiler theory side, the connection to optimizer design is more visible in those two database papers than in any compiler textbook.