The Problem
Multi-tenant arbitrary code execution — managed notebooks, hosted IDEs, serverless functions, playground environments. The pattern is widespread, and the implicit assumption is always the same: containers provide sufficient isolation. They do isolate processes. They do not isolate trust boundaries.
What Containers Actually Guarantee
What Linux primitives underpin container isolation?
| Primitive | What it isolates |
|---|---|
| PID namespace | process visibility |
| Mount namespace | filesystem view |
| Network namespace | network stack |
| UTS namespace | hostname and domain name |
| IPC namespace | System V IPC, POSIX message queues |
| User namespace | UID/GID mapping, capability scope |
| Cgroup namespace | cgroup root view |
| Time namespace | CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets (since kernel 5.6) |
| cgroups | resource limits (CPU, memory, I/O) |
| Linux capabilities | privilege decomposition (~41 discrete privileges) |
| Seccomp | syscall filtering |
| Kernel | shared |
Every primitive above operates on top of a single shared kernel — the only component with full privilege over memory, scheduling, and I/O, and the ultimate shared attack surface.[^1]
Threat Model
What does a shared kernel mean operationally? Four classes of risk — but not all of them change with the execution model.
Kernel-dependent (mitigated by moving the kernel boundary):
- Container escape: kernel exploit (CVE-2022-0185 — requires `CAP_SYS_ADMIN` in a user namespace; CVE-2024-1086 — nf_tables use-after-free, actively exploited). A micro-VM confines the blast radius to the guest kernel; a shared kernel exposes the host.
Orthogonal (present regardless of isolation model):
- Lateral movement: service account tokens, metadata endpoints. These are Kubernetes control plane risks, not kernel risks. A micro-VM does not prevent a misconfigured service account from reaching the API server.
- Data exfiltration: DNS tunneling, unrestricted egress. Network-level controls (egress policies, DNS filtering) are required independently of the runtime.
- Resource abuse: cryptomining, noisy neighbor. Cgroup limits apply within any execution model.
The distinction matters: adopting micro-VMs addresses the first class but not the other three. Additionally, microarchitectural side-channel attacks (Spectre, MDS) can cross even hardware virtualization boundaries — a risk that Confidential Computing (SEV-SNP, TDX) partially addresses but that standard VMs do not eliminate. Standard hardening layers — AppArmor or SELinux mandatory access control, dropping capabilities, restricting user namespaces — raise the cost of exploitation but do not change the fundamental trust model. Kernel exploits are rare, but in a multi-tenant environment the blast radius is existential: full host compromise, lateral access to every tenant’s data and credentials.
The orthogonal risks — lateral movement, data exfiltration, resource abuse — require their own controls (RBAC, network policy, egress filtering) and are out of scope here. What follows focuses on the kernel-dependent class: where the trust boundary sits, and what it costs to move it.
Where the Trust Boundary Really Is
Namespaces provide isolation properties, but they are not hard security boundaries against hostile code — the kernel attack surface they expose is substantially larger than a minimal hypervisor’s. User namespaces come closest to a true security primitive (they gate capability sets within a confined scope), but they also expand the reachable kernel surface, which is why unprivileged user namespace creation is itself a hardening target.
Seccomp narrows the interface further but cannot close it. A minimal Python HTTP server typically requires dozens of distinct syscalls at steady state (readily verified via `strace -c` on any standard library HTTP server). The container runtime requires additional privileged syscalls (`mount`, `pivot_root`, namespace setup) during initialization, but those execute before the seccomp filter is installed and are not part of the application’s profile.
Mainstream hardening guidance and sandboxing projects converge on the same conclusion: Linux containers provide process and resource isolation, not tenant isolation comparable to a VM.[^2] Containers are excellent at what they were designed for. They are insufficient when the threat model includes hostile tenants — not because of a defect in containers, but because of a mismatch between tool and problem.
If tenants are not mutually trusted, the question is not whether containers are hardened enough, but whether the host kernel is still part of the shared trust domain.
The Design Space: Three Execution Models
There are three approaches to moving, or reinforcing, the kernel boundary. Each makes a different trade-off between isolation strength, performance overhead, and operational complexity.
 Hardened Container       Micro-VM (Kata)          VM via KubeVirt

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   App process    │     │   App process    │     │   App process    │
│   (namespaced)   │     │   (namespaced)   │     │   (namespaced)   │
├──────────────────┤     ├──────────────────┤     ├──────────────────┤
│  gVisor Sentry   │     │   Guest kernel   │     │   Guest kernel   │
│ OR seccomp filter│     │  (own instance)  │     │  (own instance)  │
│  (reduces host   │     ├──────────────────┤     ├──────────────────┤
│  syscall surface)│     │   VMM (QEMU /    │     │  QEMU + full OS  │
│                  │     │   Firecracker)   │     │ (init, userland) │
╞══════════════════╡     ╞══════════════════╡     ╞══════════════════╡
│ ██ Host kernel ██│     │ ██ Host kernel ██│     │ ██ Host kernel ██│
│ ██   (shared)  ██│     │ ██   (shared)  ██│     │ ██   (shared)  ██│
└──────────────────┘     └──────────────────┘     └──────────────────┘

 Trust boundary:          Trust boundary:          Trust boundary:
 ▲ host kernel (shared)   ▲ guest kernel (own)     ▲ guest kernel (own)

 Attack surface:          Attack surface:          Attack surface:
 host syscall surface,    virtualization boundary  virtualization boundary
 narrowed but shared      + VMM interface          + VMM interface
A: Hardened Containers (gVisor / strict seccomp)
The kernel remains shared, but the interface to it is narrowed. gVisor interposes a user-space kernel (the Sentry) that reimplements the Linux syscall interface. Application syscalls are trapped and serviced entirely within the Sentry; only a narrow set of host syscalls made by the Sentry itself reach the host kernel, further restricted by a dedicated seccomp profile. A strict seccomp profile takes a different approach: instead of reimplementing, it blocks syscalls at the filter level. Both reduce the attack surface. Neither eliminates it. With gVisor, the host kernel is still reachable through the Sentry’s own syscall surface. With strict seccomp, it is reachable through whichever syscalls the profile allows. Compatibility gaps are real: not all syscalls are supported by gVisor, and aggressive seccomp profiles break workloads that depend on them.
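For the seccomp path, the profile is a JSON allowlist in the OCI/Docker seccomp format. A deliberately truncated sketch: the syscall list below is illustrative only, and a real profile for a Python HTTP server needs the full set observed under load (for example via `strace -c`).

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "read", "write", "close", "fstat", "mmap", "munmap", "brk",
        "rt_sigaction", "rt_sigreturn", "futex", "epoll_wait",
        "epoll_ctl", "accept4", "exit_group"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

The choice of `SCMP_ACT_ERRNO` over `SCMP_ACT_KILL` is itself a trade-off: failing blocked syscalls with an error is easier to debug, while killing the process is stricter.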
B: Micro-VMs (Kata Containers via RuntimeClass)
The workload gets its own kernel. The guest kernel is the security boundary, not the namespace. A kernel exploit inside the micro-VM compromises the guest, not the host. Cold start is higher, density is lower, but the isolation is stronger: it leverages hardware virtualization (VT-x, EPT), though the VMM software (QEMU, cloud-hypervisor, or Firecracker) remains part of the trusted computing base — VMM escape vulnerabilities exist (e.g., CVE-2020-14364 — QEMU USB emulation out-of-bounds read/write; CVE-2015-3456/VENOM — QEMU floppy controller), so the attack surface shrinks but does not vanish. Kata supports multiple VMM backends. The benchmarks below use QEMU, which has the highest memory overhead; Firecracker and cloud-hypervisor reduce that cost substantially. Note that Firecracker does not support virtiofsd — it uses a block device for the guest rootfs instead of filesystem passthrough, which changes the storage architecture and eliminates one component from the overhead breakdown but introduces different I/O characteristics. Kubernetes RuntimeClass allows mixed scheduling on the same cluster — standard containers for trusted workloads, micro-VMs for untrusted code, same control plane. RuntimeClass defines a scheduling stanza (nodeSelector, tolerations) that constrains Kata pods to nodes with hardware virtualization support (VT-x enabled).
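In practice, the RuntimeClass wiring is compact. A hedged sketch (the handler name, node label, and overhead value are illustrative; the handler must match the runtime name in the node's containerd configuration):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata                         # must match the containerd runtime name
overhead:
  podFixed:
    memory: "419Mi"                   # benchmark-derived; lets the scheduler budget VMM overhead
scheduling:
  nodeSelector:
    example.io/kata-capable: "true"   # hypothetical label for VT-x-enabled nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-sandbox
spec:
  runtimeClassName: kata              # opt this pod into the micro-VM tier
  containers:
    - name: app
      image: python:3.12-slim
      command: ["python", "-m", "http.server", "8080"]
```

Standard pods on the same cluster simply omit `runtimeClassName` and keep running under runc.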
For regulated workloads that require removing the host operator from the trusted computing base, Confidential Computing extensions (AMD SEV-SNP, Intel TDX) add hardware-attested memory encryption per VM — an orthogonal hardening layer enabled at the hypervisor/firmware level, not configured within Kubernetes. Not benchmarked here, but increasingly relevant for compliance in financial and healthcare environments.
C: Full VMs (KubeVirt)
A complete virtual machine managed as a Kubernetes resource. Dedicated kernel, dedicated userspace, dedicated init system. Maximum isolation, maximum overhead. Organizations adopt KubeVirt when regulatory frameworks mandate kernel-level separation (common in financial services and government), when tenants need different operating systems, or when consolidating existing VM-based workloads into a Kubernetes control plane without re-architecting them. For greenfield multi-tenant code execution, it is usually overkill — but for brownfield migration it can be the pragmatic path.
For workloads that can compile to WebAssembly, Wasm runtimes (Wasmtime, WasmEdge) offer a fundamentally different trade-off: capability-based sandboxing with sub-millisecond cold starts and negligible memory overhead. The hard constraint is scope: Wasm cannot run arbitrary Linux binaries, Python interpreters, JVM applications, or native executables — only code compiled to Wasm (Rust, Go, C/C++ via WASI, AssemblyScript). For workloads that fit this model, the isolation-to-overhead trade-off is superior to every other option discussed here. For general-purpose code execution, the three models above remain the practical design space.
The Numbers
The three models differ in theory; the question is how much they differ in practice. To quantify the trade-offs — cold start, memory overhead, density — we benchmarked all three on the same cluster, same workload, same methodology.
Measured on a dedicated 3-node cluster: AMD Ryzen 9 7950X, 4 vCPU / 16 GB per node, Kubernetes 1.29, Proxmox VE with KVM passthrough. 50 runs per runtime, zero timeouts. Identical workload across all three: a Python HTTP server with a 40 MiB pre-allocated ballast and a readiness probe on /ready.
Full setup, scripts, and raw results: execution-boundary-bench
Note: Kata and KubeVirt numbers reflect nested virtualization (Proxmox VE). On bare-metal KVM, expect substantially lower cold starts — particularly for KubeVirt (30–90s vs 2.9 min). See methodological notes below for details.
| Metric | Hardened Container (runc + seccomp) | Micro-VM (Kata/QEMU, nested virt) | VM via KubeVirt (nested virt) |
|---|---|---|---|
| Cold start P50 | 1,873 ms | 2,887 ms | 171,303 ms (~2.9 min) |
| Cold start P95 | 1,918 ms | 3,892 ms | 176,788 ms |
| App RSS | 61.0 MiB | 56.3 MiB | 57.1 MiB |
| Host memory | 53 MiB | 419 MiB | 580 MiB |
| Memory amplification | 0.87x | 7.44x | 10.16x |
| Idle CPU | 1.8m (~0.05%) | 1.0m (~0.03%) | 2.0m (~0.05%) |
| Max concurrent / node | 37 | 10 | N/A |
These numbers are hardware- and kernel-dependent. The absolute values will vary. The shape of the trade-off is remarkably stable.
Host memory footprint per sandbox (host-side view, benchmark workload)
 Hardened Container       Micro-VM (Kata)
 53 MiB total             419 MiB total
┌──────────────┐         ┌──────────────┐
│              │         │ QEMU process │
│ App (53 MiB) │         │ (incl. guest │   ~256 MiB guest RAM
│              │         │  memory)     │   + ~87 MiB VMM overhead
│              │         ├──────────────┤
│              │         │ kata-shim    │   ~42 MiB
│              │         ├──────────────┤
│              │         │ virtiofsd    │   ~30 MiB
│              │         ├──────────────┤
│              │         │ App (56 MiB) │   (inside guest)
└──────────────┘         └──────────────┘
 0.87x                    7.44x memory amplification
The real cost of Kata is not cold start, it is memory. The cold start delta is about one second: 2.9s vs 1.9s. Manageable. But 419 MiB of host memory for an application that consumes 56 MiB is a 7.44x amplification factor. The extra 363 MiB breaks down into three components: QEMU (~343 MiB, mostly the guest’s 256 MiB RAM allocation), the Kata shim (~42 MiB), and virtiofsd (~30 MiB, the daemon that provides the container rootfs to the guest). Every micro-VM carries an entire virtualization stack. On this benchmark setup, 100 tenants would mean roughly 35 GiB of pure isolation overhead.
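The arithmetic is easy to sanity-check. The component values are rounded, so they sum a few MiB under the measured total:

```python
# Kata per-sandbox memory accounting, values (MiB) from the benchmark above.
host_total = 419.0                     # measured host-side footprint
app_rss = 56.3                         # application RSS inside the guest
components = {
    "QEMU (incl. 256 MiB guest RAM)": 343,
    "kata-shim": 42,
    "virtiofsd": 30,
}

print(sum(components.values()))        # 415 — within rounding of the 419 MiB measurement
print(round(host_total / app_rss, 2))  # 7.44x amplification
print(round(host_total - app_rss))     # 363 MiB of fixed isolation overhead per sandbox
```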
KubeVirt is effectively unusable for interactive workloads without pre-warming. 2.9 minutes of cold start under nested virtualization (30–90 seconds on bare metal — still too slow for an interactive launch). 10x memory amplification.
With pre-warming, KubeVirt becomes a pool management problem, not a cold start problem.
The isolation cost is memory-dominated, not CPU-dominated. Idle CPU is virtually identical across all three runtimes. You pay once at boot, not on every tick. This is a favorable data point for Kata — the overhead is a fixed cost, not a recurring one. This holds for the benchmark workload (a lightly active Python HTTP server); for syscall-intensive or storage-heavy workloads, virtiofsd and the VMM can add measurable CPU cost at runtime.
Density tells the cost story. 37 containers vs 10 Kata micro-VMs per node, a 3.7:1 ratio. To serve the same number of tenants with Kata, you need almost four times the nodes. These caps are derived from available node memory (16 GiB) minus system and kubelet reservation (~1.5 GiB), divided by per-sandbox host memory footprint (53 MiB for containers, 419 MiB for Kata), conservatively rounded down to avoid swap pressure under burst allocation.
Kata exhibits bimodal cold start behavior. The first ~15 runs land around 3.9s, then drop to ~2.9s as the guest kernel and QEMU pages are cached on the host. On warm nodes (nodes that have already run Kata pods), expect the P50. On cold nodes (fresh from autoscaling or node replacement), expect the P95.
Methodological notes.
- Nested virtualization. Kata and KubeVirt run under Proxmox VE. On bare-metal KVM, expect KubeVirt cold starts in the 30–90 second range rather than 2.9 minutes. Kata cold starts would also improve, though more modestly.
- Stability. Memory ratios remain consistent across three separate runs (10, 10, and 50 iterations). P95/P50 ratios (container: 1.02x, Kata: 1.35x, KubeVirt: 1.03x) all below 2.0x — no outlier-driven skew. Full distributions including standard deviation and interquartile ranges are available in the benchmark repository.
- Memory metric. Amplification below 1.0x for hardened containers is expected: `kubectl top` reports `working_set_bytes`, which excludes shared library pages that `VmRSS` includes.
- Cold start definition. End-to-end from the user’s perspective: scheduling + runtime boot + Python startup + 40 MiB allocation + readiness probe success.
Cost, SLO, and Operational Surface Area
What matters for a platform team is what the numbers mean for infrastructure budget, user-facing latency contracts, and the operational surface area that each runtime adds.
100 tenants on 16 GiB nodes (node count driven by density, not memory total)
 Hardened Containers (37/node)        Kata Micro-VMs (10/node)

┌────────┐ ┌────────┐ ┌────────┐    ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ node 1 │ │ node 2 │ │ node 3 │    │ node 1 │ │ node 2 │ │ node 3 │ │ node 4 │ │ node 5 │
│   37   │ │   37   │ │   26   │    │   10   │ │   10   │ │   10   │ │   10   │ │   10   │
└────────┘ └────────┘ └────────┘    └────────┘ └────────┘ └────────┘ └────────┘ └────────┘
                                    ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
 3 nodes                            │ node 6 │ │ node 7 │ │ node 8 │ │ node 9 │ │ node10 │
 ~5.2 GiB total memory              │   10   │ │   10   │ │   10   │ │   10   │ │   10   │
                                    └────────┘ └────────┘ └────────┘ └────────┘ └────────┘
                                     10 nodes, 3.7:1 density ratio
                                     ~40.9 GiB total memory
Cost model. The benchmarks above translate directly into infrastructure budget. The binding constraint is density, not raw memory. 100 tenants on hardened containers at 37 per node: 3 nodes. The same 100 tenants on Kata at 10 per node: 10 nodes. The 3.7:1 density ratio is the cost multiplier. Memory tells a consistent story — 100 × 53 MiB = ~5.2 GiB for containers vs. 100 × 419 MiB = ~40.9 GiB for Kata — but it is the per-node concurrency cap, not the memory total, that determines how many nodes you provision. With KubeVirt, density drops further and high-density multi-tenancy becomes cost-prohibitive. A secondary cost: more nodes means more east-west traffic traversing the network fabric, which matters for data-intensive workloads (e.g., Spark shuffle) where inter-node bandwidth is a bottleneck.
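The cost model above reduces to a few lines. A minimal sketch using the benchmark-derived densities and footprints (assumed uniform across nodes):

```python
import math

TENANTS = 100
DENSITY = {"hardened container": 37, "kata micro-vm": 10}         # sandboxes per node
FOOTPRINT_MIB = {"hardened container": 53, "kata micro-vm": 419}  # host memory per sandbox

for runtime, per_node in DENSITY.items():
    nodes = math.ceil(TENANTS / per_node)                # density is the binding constraint
    total_gib = TENANTS * FOOTPRINT_MIB[runtime] / 1024  # memory total, for comparison
    print(f"{runtime}: {nodes} nodes, ~{total_gib:.1f} GiB sandbox memory")
# hardened container: 3 nodes, ~5.2 GiB sandbox memory
# kata micro-vm: 10 nodes, ~40.9 GiB sandbox memory
```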
SLO implications. Cold start is an implicit contract with the user.
| Target | Viable runtimes |
|---|---|
| < 5 s | Hardened containers, Kata on warm nodes |
| < 30 s | Containers, Kata, KubeVirt with pre-warming pool |
| Batch / non-interactive | Any option |
Kata’s P95 at 3.9s sits dangerously close to the 5-second threshold. After autoscaling or node replacement, expect the P95 — and a likely breach. This makes the cluster autoscaler an SLO factor: newly provisioned nodes are cold nodes, and Kata pods scheduled on them will hit the P95. Mitigations include dedicated Kata node pools with overprovisioning, or pre-warming pods that keep guest kernel and QEMU pages in the host page cache.
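One pre-warming approach is a DaemonSet of idle Kata pods, which keeps guest kernel and QEMU pages resident in the page cache of every node, including freshly autoscaled ones. A sketch (names are illustrative; the `kata` RuntimeClass must already exist in the cluster):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kata-prewarm
spec:
  selector:
    matchLabels:
      app: kata-prewarm
  template:
    metadata:
      labels:
        app: kata-prewarm
    spec:
      runtimeClassName: kata       # boots one idle micro-VM per node
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              memory: "8Mi"        # keep the warmer itself cheap
              cpu: "5m"
```

Note the trade-off: the pause container is tiny, but each warmer still carries a full VMM footprint, so this spends roughly one sandbox's worth of memory per node to buy a predictable P50.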
Operational complexity. The memory and latency numbers are only half the story. The rest is operational burden.
Runtime path:
- Runtime components. containerd alone: one component in the critical path. containerd + containerd-shim-kata-v2 + QEMU + virtiofsd + kata-agent (inside the guest): five. Each is a failure point and a log stream.
- Image management. Hardened containers and Kata use the same container image, same build pipeline. KubeVirt requires a VM image plus cloud-init — a separate pipeline and a separate patching lifecycle.
Observability and debugging:
- Debug difficulty. `kubectl exec` works on Kata: the shim proxies it. But host-level troubleshooting requires direct node access. KubeVirt requires `virtctl console`, an entirely different toolchain.
- Observability gap. `kubectl top` for Kata reports host-side cgroup metrics that include the QEMU process, but without granularity to distinguish VMM overhead from application consumption. The 419 MiB shows up in the cgroup, but you cannot tell how much is guest RAM, how much is VMM overhead, and how much is actual application use. Closing this gap requires layered instrumentation: per-QEMU-process metrics via standard node-level tooling, kata-monitor or kata-shim Prometheus metrics for VMM lifecycle events, and — if needed — an in-guest agent or VSOCK-based metrics pipeline for guest-internal visibility. This is a structural trade-off: a stronger isolation boundary necessarily reduces observability from outside that boundary unless an explicit, authenticated channel is added across it.
- Network policy. Regardless of runtime class, Kubernetes network policy enforcement happens at the host level (CNI plugin), not inside the guest. The isolation boundary and the data flow boundary are not the same.
Governance:
- Incident response. Kata requires virtualization expertise, not just Kubernetes skills. Every isolation tier is a staffing decision.
- Patching lifecycle. Kata introduces a separate patching surface: the guest kernel, QEMU, kata-agent, and virtiofsd each have their own release cadence and CVE exposure. Updating these components on a live cluster requires rolling restarts of Kata pods — a coordination cost that grows with tenant count and SLO stringency.
- Supply chain. Each runtime adds dependencies to the trusted computing base. QEMU is a large C codebase with a long CVE history; gVisor is maintained by Google; Kata is a CNCF project. These are not equivalent supply chain risks, and organizations subject to NIS2 or operating in critical infrastructure sectors should evaluate accordingly.
Evolution Path
The Decision Framework at the end of this section is a lookup table for teams that already know their constraints. The Evolution Path is a narrative for greenfield platforms: it lays out the order in which to adopt isolation tiers as a platform matures. The starting point is the threat model, not the isolation technology.
Phase 1: MVP. Standard containers with a custom seccomp profile, network policies, and resource limits. 1.9s cold start, 37 sandboxes per node, memory amplification below 1x. Sufficient for semi-trusted workloads and internal teams.
Phase 2: Hardening. This phase has two distinct paths.
Phase 2a: Strict seccomp. A stricter seccomp profile, egress filtering, service mesh for lateral movement mitigation. Same density, same operational complexity, no new runtime components. The kernel remains shared, but the reachable syscall surface is minimized.
Phase 2b: gVisor. gVisor (runsc) replaces the host kernel’s syscall interface with a user-space reimplementation (the Sentry). This is architecturally different from seccomp filtering: it adds a new runtime component, introduces reported 10–30% CPU overhead on I/O- and syscall-heavy workloads (CPU-bound workloads with infrequent syscalls see negligible impact), and has its own compatibility limitations. It provides stronger isolation than seccomp alone — the host kernel sees only the Sentry’s narrow syscall surface — but at the cost of an additional runtime dependency and potential workload incompatibilities. The choice between 2a and 2b depends on whether the threat model justifies the operational cost of a new runtime component.
Phase 3: Micro-VM tier. RuntimeClass for high-risk workloads. Mixed scheduling on the same cluster: standard containers for internal workloads, Kata for untrusted user code. 2.9s cold start, 10 sandboxes per node, 7.44x memory amplification. The cost is significant, but it is quantified, not guessed.
Each phase has a concrete trigger. Phase 1 → 2: a penetration test demonstrates that user-submitted code can reach kernel surface beyond the seccomp profile, or the platform begins accepting code from external users. Phase 2 → 3: a regulatory audit requires kernel-level tenant separation, or the threat model includes adversaries capable of kernel exploitation.
Advance when the threat model demands it, not before.
Decision Framework
Given your threat model, budget, SLO, and operational capacity, here is where you land.
| Threat model | Budget | SLO | Ops capacity | Runtime | Cost |
|---|---|---|---|---|---|
| Semi-trusted code | High density | < 5s | K8s ops (no virt expertise needed) | Hardened container | 53 MiB/tenant, 37/node |
| Untrusted code, Wasm-compatible | High density | < 1s | K8s ops + Wasm runtime familiarity | Wasm runtime | Not benchmarked here; expect sub-ms cold start, negligible memory overhead (for supported workloads only) |
| Untrusted code, general-purpose | Moderate | < 5s (warm nodes) | K8s ops + virtualization expertise; separate patching lifecycle for guest kernel/QEMU | Micro-VM (Kata) | 419 MiB/tenant, 10/node |
| Hostile tenants OR regulatory kernel isolation | Less constrained | < 3 min or pre-warming | Dedicated platform team; VM image pipeline, virtctl tooling, virt-specific on-call | VM via KubeVirt | 580 MiB/tenant |
Postscript: Spark Is Also a Multi-Tenant Code Execution Engine
One application of this framework deserves explicit mention. Spark on Kubernetes is also multi-tenant arbitrary code execution: every executor runs user-defined functions — Python UDFs, arbitrary JARs, native libraries via JNI — in a pod that shares the host kernel. On YARN the picture is similar: cgroup-based isolation (v1 or v2 depending on Hadoop version), limited namespace support, no dedicated security boundary.
The execution trust boundary is not the DAG or the JVM. It is the kernel.
Two cases are worth distinguishing. Internal clusters serving known teams with RBAC, network policy, and code review are semi-trusted environments — hardened containers (Phase 1–2) are usually sufficient, and moving the kernel boundary is not urgent. Self-serve platforms open to external users or partners are a different threat model: arbitrary code from untrusted sources, where micro-VM isolation (Phase 3) becomes relevant.
The isolation overhead is a fixed cost per sandbox, independent of workload size. For a Spark executor consuming 4 GiB of heap, the Kata overhead (~363 MiB) adds ~9% — far less dramatic than the 7.44x amplification measured on the 56 MiB benchmark workload. But the per-node density constraint remains the same shape: far fewer Kata sandboxes per node than hardened containers, regardless of executor memory.
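The fixed-cost scaling is easy to make concrete. A quick calculation with the ~363 MiB per-sandbox overhead measured earlier (the 56 MiB row approximates the benchmark workload; 7.48x vs the reported 7.44x is rounding of the app RSS):

```python
KATA_OVERHEAD_MIB = 363   # fixed per-sandbox host overhead from the benchmark

# Benchmark app, a mid-size service, and a 4 GiB Spark executor heap.
for app_mib in (56, 512, 4096):
    amplification = (app_mib + KATA_OVERHEAD_MIB) / app_mib
    overhead_pct = KATA_OVERHEAD_MIB / app_mib * 100
    print(f"{app_mib:>5} MiB app: {amplification:.2f}x total, +{overhead_pct:.0f}% overhead")
#    56 MiB app: 7.48x total, +648% overhead
#   512 MiB app: 1.71x total, +71% overhead
#  4096 MiB app: 1.09x total, +9% overhead
```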
[^1]: Since Kubernetes 1.27, the seccomp `RuntimeDefault` profile is GA. Cluster-wide enforcement requires the kubelet `--seccomp-default` flag; whether the profile is actually active depends on the container runtime and node-level configuration.

[^2]: See the NSA/CISA Kubernetes Hardening Guide and Google’s gVisor documentation for representative positions.