The Problem

Multi-tenant arbitrary code execution — managed notebooks, hosted IDEs, serverless functions, playground environments. The pattern is widespread, and the implicit assumption is identical: containers are enough. They do isolate processes. They do not isolate trust boundaries.

What Containers Actually Guarantee

What exactly does “container isolation” consist of?

Primitive            What it isolates
PID namespace        process visibility
Mount namespace      filesystem view
Network namespace    network stack
UTS namespace        hostname and domain name
IPC namespace        System V IPC, POSIX message queues
User namespace       UID/GID mapping, capability scope
Cgroup namespace     cgroup root view
cgroups              resource limits (CPU, memory, I/O)
Linux capabilities   privilege decomposition (~41 discrete privileges)
Seccomp              syscall filtering
Kernel               shared

Every primitive above operates on top of a single shared kernel — the only component with full privilege over memory, scheduling, and I/O, and the ultimate shared attack surface.[1]

Threat Model

What does a shared kernel mean operationally? Four classes of risk.

  • Container escape: kernel exploit (CVE-2022-0185 — requires CAP_SYS_ADMIN in a user namespace; CVE-2024-1086 — nf_tables use-after-free, actively exploited).
  • Lateral movement: service account tokens, metadata endpoints
  • Data exfiltration: DNS tunneling, unrestricted egress
  • Resource abuse: cryptomining, noisy neighbor

Standard hardening layers — AppArmor or SELinux mandatory access control, dropping capabilities, restricting user namespaces — raise the cost of exploitation but do not change the fundamental trust model: the host kernel remains shared. Kernel exploits are rare, but in a multi-tenant environment the blast radius is existential: full host compromise, lateral access to every tenant’s data and credentials.

Where the Trust Boundary Really Is

Namespaces provide isolation properties, but they are not hard security boundaries against hostile code — the kernel attack surface they expose is substantially larger than a minimal hypervisor’s. User namespaces come closest to a true security primitive (they gate capability sets within a confined scope), but they also expand the reachable kernel surface, which is why unprivileged user namespace creation is itself a hardening target.

Seccomp narrows the interface further but cannot close it. A minimal Python HTTP server requires 40–60 distinct syscalls at steady state. The container runtime requires additional privileged syscalls (mount, pivot_root, namespace setup) during initialization, but those execute before the seccomp filter is installed and are not part of the application’s profile.
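
To make the filter concrete, here is a minimal sketch of a deny-by-default profile in the OCI/Docker seccomp JSON format. The allowlist below is illustrative, not a complete profile for a real Python server; in practice the list is generated by tracing the workload, and ends up in the 40-60 entry range described above.

```python
import json

# Illustrative (and deliberately incomplete) allowlist for a minimal
# HTTP server workload. A production profile is derived by tracing the
# actual process, not written by hand.
ALLOWED_SYSCALLS = [
    "accept4", "bind", "brk", "close", "epoll_create1", "epoll_ctl",
    "epoll_wait", "exit_group", "fcntl", "fstat", "futex",
    "getsockname", "listen", "mmap", "munmap", "openat", "read",
    "recvfrom", "rt_sigaction", "sendto", "setsockopt", "socket",
    "write",
]

# OCI/Docker-style seccomp profile: everything not listed fails with
# EPERM instead of reaching the host kernel.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {"names": ALLOWED_SYSCALLS, "action": "SCMP_ACT_ALLOW"},
    ],
}

print(json.dumps(profile, indent=2))
```

Loaded via `--security-opt seccomp=profile.json` in Docker, or a `Localhost`-type `seccompProfile` in a Kubernetes securityContext, this turns "whichever syscalls the profile allows" into an explicit, auditable list.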

Mainstream hardening guidance and sandboxing projects converge on the same conclusion: Linux containers provide process and resource isolation, not tenant isolation comparable to a VM.[2] Containers are excellent at what they were designed for. They are insufficient when the threat model includes hostile tenants — not because of a defect in containers, but because of a mismatch between tool and problem.

If tenants are not mutually trusted, the question is not whether containers are hardened enough, but whether the host kernel is still part of the shared trust domain.

The Design Space: Three Execution Models

There are three approaches to moving, or reinforcing, the kernel boundary. Each makes a different trade-off between isolation strength, performance overhead, and operational complexity.

Hardened Container           Micro-VM (Kata)              VM via KubeVirt
┌──────────────────┐         ┌──────────────────┐         ┌──────────────────┐
│   App process    │         │   App process    │         │   App process    │
│   (namespaced)   │         │   (namespaced)   │         │   (namespaced)   │
├──────────────────┤         ├──────────────────┤         ├──────────────────┤
│ gVisor Sentry    │         │  Guest kernel    │         │  Guest kernel    │
│ OR seccomp filter│         │  (own instance)  │         │  (own instance)  │
│ (reduces host    │         ├──────────────────┤         ├──────────────────┤
│  syscall surface)│         │  VMM (QEMU /     │         │  QEMU + full OS  │
│                  │         │  Firecracker)    │         │  (init, userland)│
╞══════════════════╡         ╞══════════════════╡         ╞══════════════════╡
│ ██ Host kernel ██│         │ ██ Host kernel ██│         │ ██ Host kernel ██│
│ ██  (shared)   ██│         │ ██  (shared)   ██│         │ ██  (shared)   ██│
└──────────────────┘         └──────────────────┘         └──────────────────┘

  Trust boundary:             Trust boundary:             Trust boundary:
  ▲ host kernel (shared)      ▲ guest kernel (own)        ▲ guest kernel (own)
  Attack surface:             Attack surface:             Attack surface:
  host syscall surface,       virtualization boundary      virtualization boundary
  narrowed but shared         + VMM interface              + VMM interface

A: Hardened Containers (gVisor / strict seccomp)

The kernel remains shared, but the interface to it is narrowed. gVisor interposes a user-space kernel (the Sentry) that reimplements the Linux syscall interface. Application syscalls are trapped and serviced entirely within the Sentry; only a narrow set of host syscalls made by the Sentry itself reach the host kernel, further restricted by a dedicated seccomp profile. A strict seccomp profile takes a different approach: instead of reimplementing, it blocks syscalls at the filter level. Both reduce the attack surface. Neither eliminates it. With gVisor, the host kernel is still reachable through the Sentry’s own syscall surface. With strict seccomp, it is reachable through whichever syscalls the profile allows. Compatibility gaps are real: not all syscalls are supported by gVisor, and aggressive seccomp profiles break workloads that depend on them.

B: Micro-VMs (Kata Containers via RuntimeClass)

The workload gets its own kernel. The guest kernel is the security boundary, not the namespace. A kernel exploit inside the micro-VM compromises the guest, not the host. Cold start is higher, density is lower, but the isolation is stronger: it leverages hardware virtualization (VT-x, EPT), though the VMM software (QEMU, cloud-hypervisor, or Firecracker) remains part of the trusted computing base — VMM escape vulnerabilities exist (e.g., VENOM, CVE-2015-3456), so the attack surface shrinks but does not vanish. Kata supports multiple VMM backends. The benchmarks below use QEMU, which has the highest memory overhead; Firecracker and cloud-hypervisor reduce that cost substantially. Kubernetes RuntimeClass allows mixed scheduling on the same cluster — standard containers for trusted workloads, micro-VMs for untrusted code, same control plane.
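
Mixed scheduling is a small amount of configuration. A sketch, assuming Kata is installed with a containerd runtime handler named `kata` (the handler name depends on the deployment; `kata-qemu` and `kata-fc` are also common):

```yaml
# RuntimeClass maps a scheduling name to the configured containerd handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata            # must match the handler name in containerd's config
---
# Untrusted workload opting into micro-VM isolation. Trusted pods simply
# omit runtimeClassName and run as standard containers on the same cluster.
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-sandbox   # hypothetical tenant pod
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: python:3.12-slim
      command: ["python3", "-m", "http.server", "8080"]
```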

For regulated workloads that require removing the host operator from the trusted computing base, Confidential Computing extensions (AMD SEV-SNP, Intel TDX) add hardware-attested memory encryption per VM — an orthogonal hardening layer, not benchmarked here but increasingly relevant for compliance in financial and healthcare environments.

C: Full VMs (KubeVirt)

A complete virtual machine managed as a Kubernetes resource. Dedicated kernel, dedicated userspace, dedicated init system. Maximum isolation, maximum overhead. Organizations adopt KubeVirt when regulatory frameworks mandate kernel-level separation (common in financial services and government), when tenants need different operating systems, or when consolidating existing VM-based workloads into a Kubernetes control plane without re-architecting them. For greenfield multi-tenant code execution, it is usually overkill — but for brownfield migration it can be the pragmatic path.

For workloads that can compile to WebAssembly, Wasm runtimes (Wasmtime, WasmEdge) offer a fundamentally different trade-off: capability-based sandboxing with sub-millisecond cold starts and negligible memory overhead. For general-purpose code execution — arbitrary Python, JVM, native binaries — the three models above remain the practical design space.

The Numbers

The three models differ in theory; the question is how much they differ in practice. To quantify the trade-offs — cold start, memory overhead, density — we benchmarked all three on the same cluster, same workload, same methodology.

Measured on a dedicated 3-node cluster: AMD Ryzen 9 7950X, 4 vCPU / 16 GB per node, Kubernetes 1.29, Proxmox VE with KVM passthrough. 50 runs per runtime, zero timeouts. Identical workload across all three: a Python HTTP server with a 40 MiB pre-allocated ballast and a readiness probe on /ready.

Full setup, scripts, and raw results: execution-boundary-bench

Metric                  Hardened Container    Micro-VM (Kata)    VM via KubeVirt
Cold start P50          1,873 ms              2,887 ms           171,303 ms (~2.9 min)
Cold start P95          1,918 ms              3,892 ms           176,788 ms
App RSS                 61.0 MiB              56.3 MiB           57.1 MiB
Host memory             53 MiB                419 MiB            580 MiB
Memory amplification    0.87x                 7.44x              10.16x
Idle CPU                1.8m (~0.05%)         1.0m (~0.03%)      2.0m (~0.05%)
Max concurrent / node   37                    10                 N/A

These numbers are hardware- and kernel-dependent. The absolute values will vary. The shape of the trade-off is remarkably stable.

Host memory footprint per sandbox (host-side view, benchmark workload)

Hardened Container            Micro-VM (Kata)
  53 MiB total                  419 MiB total
┌──────────────┐              ┌──────────────┐
│              │              │ QEMU process │
│ App (53 MiB) │              │ (incl. guest │  ~256 MiB guest RAM
│              │              │  memory)     │  + ~87 MiB VMM overhead
└──────────────┘              ├──────────────┤
  0.87x                       │ kata-shim    │  ~42 MiB
                              ├──────────────┤
                              │ virtiofsd    │  ~30 MiB
                              ├──────────────┤
                              │ App (56 MiB) │  (inside guest)
                              └──────────────┘
                                7.44x memory amplification

The real cost of Kata is not cold start, it is memory. The cold start delta is about one second: 2.9s vs 1.9s. Manageable. But 419 MiB of host memory for an application that consumes 56 MiB is a 7.44x amplification factor. The extra 363 MiB breaks down into three components: QEMU (~343 MiB, mostly the guest’s 256 MiB RAM allocation, which itself contains the app’s 56 MiB), the Kata shim (~42 MiB), and virtiofsd (~30 MiB, the daemon that provides the container rootfs to the guest). Every micro-VM carries an entire virtualization stack. On this benchmark setup, 100 tenants would mean over 36 GiB of pure isolation overhead.

KubeVirt is effectively unusable for interactive workloads without pre-warming. 2.9 minutes of cold start under nested virtualization (30–90 seconds on bare metal — still too slow for an interactive launch). 10x memory amplification.

With pre-warming, KubeVirt becomes a pool management problem, not a cold start problem.

The isolation cost is memory-dominated, not CPU-dominated. Idle CPU is virtually identical across all three runtimes. You pay once at boot, not on every tick. This is a favorable data point for Kata — the overhead is a fixed cost, not a recurring one. This holds for the benchmark workload (a lightly active Python HTTP server); for syscall-intensive or storage-heavy workloads, virtiofsd and the VMM can add measurable CPU cost at runtime.

Density tells the cost story. 37 containers vs 10 Kata micro-VMs per node, a 3.7:1 ratio. To serve the same number of tenants with Kata, you need almost four times the nodes.

Kata exhibits bimodal cold start behavior. The first ~15 runs land around 3.9s, then drop to ~2.9s as the guest kernel and QEMU pages are cached on the host. On warm nodes (nodes that have already run Kata pods), expect the P50. On cold nodes (fresh from autoscaling or node replacement), expect the P95.

Methodological notes.

  • Nested virtualization. Kata and KubeVirt run under Proxmox VE. On bare-metal KVM, expect KubeVirt cold starts in the 30–90 second range rather than 2.9 minutes. Kata cold starts would also improve, though more modestly.
  • Stability. Memory ratios remain consistent across three separate runs (10, 10, and 50 iterations). P95/P50 ratios (container: 1.02x, Kata: 1.35x, KubeVirt: 1.03x) all below 2.0x — no outlier-driven skew.
  • Memory metric. Amplification below 1.0x for hardened containers is expected: kubectl top reports working_set_bytes, which excludes shared library pages that VmRSS includes.
  • Cold start definition. End-to-end from the user’s perspective: scheduling + runtime boot + Python startup + 40 MiB allocation + readiness probe success.
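
The stability note above can be checked directly from the table; a small sketch, with the P50/P95 values copied from the benchmark results:

```python
# Cold start (P50, P95) in milliseconds, from the benchmark table.
cold_start = {
    "container": (1873, 1918),
    "kata":      (2887, 3892),
    "kubevirt":  (171303, 176788),
}

for runtime, (p50, p95) in cold_start.items():
    ratio = p95 / p50
    # All ratios stay below 2.0x: container ~1.02x, kata ~1.35x,
    # kubevirt ~1.03x, i.e. no outlier-driven skew.
    print(f"{runtime:9s} P95/P50 = {ratio:.2f}x")
```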

Cost, SLO, and Operational Surface Area

What matters for a platform team is what the numbers mean for infrastructure budget, user-facing latency contracts, and the operational surface area that each runtime adds.

100 tenants on 16 GiB nodes (node count driven by density, not memory total)

Hardened Containers (37/node)        Kata Micro-VMs (10/node)
┌────────┐ ┌────────┐ ┌────────┐    ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ node 1 │ │ node 2 │ │ node 3 │    │ node 1 │ │ node 2 │ │ node 3 │ │ node 4 │ │ node 5 │
│ 37     │ │ 37     │ │ 26     │    │ 10     │ │ 10     │ │ 10     │ │ 10     │ │ 10     │
└────────┘ └────────┘ └────────┘    └────────┘ └────────┘ └────────┘ └────────┘ └────────┘
                                    ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
  3 nodes                           │ node 6 │ │ node 7 │ │ node 8 │ │ node 9 │ │ node10 │
  ~5.2 GiB total memory             │ 10     │ │ 10     │ │ 10     │ │ 10     │ │ 10     │
                                    └────────┘ └────────┘ └────────┘ └────────┘ └────────┘

                                     10 nodes     3.7:1 density ratio
                                     ~40.9 GiB total memory

Cost model. The benchmarks above translate directly into infrastructure budget. The binding constraint is density, not raw memory. 100 tenants on hardened containers at 37 per node: 3 nodes. The same 100 tenants on Kata at 10 per node: 10 nodes. The 3.7:1 density ratio is the cost multiplier. Memory tells a consistent story — 100 × 53 MiB = ~5.2 GiB for containers vs. 100 × 419 MiB = ~40.9 GiB for Kata — but it is the per-node concurrency cap, not the memory total, that determines how many nodes you provision. With KubeVirt, density drops further and high-density multi-tenancy becomes cost-prohibitive.
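
The node math above can be written down directly; a sketch using the benchmark figures as inputs (they are hardware-dependent, so treat them as illustrative constants, not universal ones):

```python
import math

TENANTS = 100

# Per-sandbox figures from the benchmark: max concurrency per node and
# host-side memory per tenant.
runtimes = {
    "hardened-container": {"per_node": 37, "host_mib": 53},
    "kata-microvm":       {"per_node": 10, "host_mib": 419},
}

for name, r in runtimes.items():
    # Density, not total memory, is the binding constraint for node count.
    nodes = math.ceil(TENANTS / r["per_node"])
    total_gib = TENANTS * r["host_mib"] / 1024
    # containers: 3 nodes, ~5.2 GiB; kata: 10 nodes, ~40.9 GiB
    print(f"{name:19s} nodes={nodes:2d}  host memory={total_gib:5.1f} GiB")
```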

SLO implications. Cold start is an implicit contract with the user.

Target                    Viable runtimes
< 5 s                     Hardened containers, Kata on warm nodes
< 30 s                    Containers, Kata, KubeVirt with pre-warming pool
Batch / non-interactive   Any option

Kata’s P95 at 3.9s sits dangerously close to the 5-second threshold. After autoscaling or node replacement, expect the P95 — and a likely breach.

Operational complexity. The memory and latency numbers are only half the story. The rest is operational burden.

Runtime path:

  • Runtime components. containerd alone: one component in the critical path. containerd + containerd-shim-kata-v2 + QEMU + virtiofsd + kata-agent (inside the guest): five. Each is a failure point and a log stream.
  • Image management. Hardened containers and Kata use the same container image, same build pipeline. KubeVirt requires a VM image plus cloud-init — a separate pipeline and a separate patching lifecycle.

Observability and debugging:

  • Debug difficulty. kubectl exec works on Kata: the shim proxies it. But host-level troubleshooting requires direct node access. KubeVirt requires virtctl console, an entirely different toolchain.
  • Observability gap. kubectl top for Kata reports guest-side metrics only. The real 419 MiB on the host are invisible without a privileged probe. Standard cluster metrics will underreport actual host-side memory consumption.
  • Network policy. Regardless of runtime class, Kubernetes network policy enforcement happens at the host level (CNI plugin), not inside the guest. The isolation boundary and the data flow boundary are not the same.

Governance:

  • Incident response. Kata requires virtualization expertise, not just Kubernetes skills. Every isolation tier is a staffing decision.
  • Supply chain. Each runtime adds dependencies to the trusted computing base. QEMU is a large C codebase with a long CVE history; gVisor is maintained by Google; Kata is a CNCF project. These are not equivalent supply chain risks, and organizations subject to NIS2 or operating in critical infrastructure sectors should evaluate accordingly.

Evolution Path

The Decision Framework at the end of this section is a lookup table for teams that already know their constraints. The Evolution Path is a narrative for greenfield platforms — it answers the question in what order to adopt isolation tiers as a platform matures. The starting point is the threat model, not the isolation technology.

Phase 1: MVP. Standard containers with a custom seccomp profile, network policies, and resource limits. 1.9s cold start, 37 sandboxes per node, memory amplification below 1x. Sufficient for semi-trusted workloads and internal teams.

Phase 2: Hardening. A stricter seccomp profile, egress filtering, service mesh for lateral movement mitigation. Same density, same operational complexity, no new runtime components. gVisor is also an option at this tier but changes the picture: it adds a new runtime component (runsc), introduces reported 10–30% CPU overhead on syscall-heavy workloads, and has its own compatibility limitations. It provides stronger isolation than seccomp alone but is not operationally equivalent.
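
As a sketch of the egress-filtering step: a default-deny egress policy that allows only DNS, assuming the CNI plugin enforces NetworkPolicy; the `sandboxes` namespace name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-egress-lockdown
  namespace: sandboxes        # hypothetical tenant namespace
spec:
  podSelector: {}             # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    # Allow DNS to kube-system only; all other egress is denied by the
    # presence of this Egress policy.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```

Note that, per the observability section above, this is enforced at the host level by the CNI plugin regardless of runtime class.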

Phase 3: Micro-VM tier. RuntimeClass for high-risk workloads. Mixed scheduling on the same cluster: standard containers for internal workloads, Kata for untrusted user code. 2.9s cold start, 10 sandboxes per node, 7.44x memory amplification. The cost is significant, but it is quantified, not guessed.

Each phase has a concrete trigger. Phase 1 → 2: a penetration test demonstrates that user-submitted code can reach kernel surface beyond the seccomp profile, or the platform begins accepting code from external users. Phase 2 → 3: a regulatory audit requires kernel-level tenant separation, or the threat model includes adversaries capable of kernel exploitation.

Advance when the threat model demands it, not before.

Decision Framework

Given your threat model, budget, SLO, and operational capacity, here is where you land.

Threat model                                    Budget            SLO                     Ops capacity              Runtime              Cost
Semi-trusted code                               High density      < 5s                    Small team                Hardened container   53 MiB/tenant, 37/node
Untrusted code, Wasm-compatible                 High density      < 1s                    Small team                Wasm runtime         Very low overhead (for supported workloads)
Untrusted code, general-purpose                 Moderate          < 5s (warm nodes)       Moderate + virt skills    Micro-VM (Kata)      419 MiB/tenant, 10/node
Hostile tenants OR regulatory kernel isolation  Less constrained  < 3 min or pre-warming  Dedicated platform team   VM via KubeVirt      580 MiB/tenant

Postscript: Spark Is Also a Multi-Tenant Code Execution Engine

One application of this framework deserves explicit mention. Spark on Kubernetes is also multi-tenant arbitrary code execution: every executor runs user-defined functions — Python UDFs, arbitrary JARs, native libraries via JNI — in a pod that shares the host kernel. On YARN the picture is similar: cgroup-based isolation (v1 or v2 depending on Hadoop version), limited namespace support, no dedicated security boundary.

The execution trust boundary is not the DAG or the JVM. It is the kernel.

Two cases are worth distinguishing. Internal clusters serving known teams with RBAC, network policy, and code review are semi-trusted environments — hardened containers (Phase 1–2) are usually sufficient, and moving the kernel boundary is not urgent. Self-serve platforms open to external users or partners are a different threat model: arbitrary code from untrusted sources, where micro-VM isolation (Phase 3) becomes relevant.

The isolation overhead is a fixed cost per sandbox, independent of workload size. For a Spark executor consuming 4 GiB of heap, the Kata overhead (~363 MiB) adds ~9% — far less dramatic than the 7.44x amplification measured on the 56 MiB benchmark workload. But the per-node density constraint remains the same shape: far fewer Kata sandboxes per node than hardened containers, regardless of executor memory.
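
The fixed-cost claim is easy to verify; a sketch using the ~363 MiB per-sandbox overhead from the benchmark (the workload sizes are illustrative):

```python
# Fixed per-sandbox Kata overhead (MiB) measured in the benchmark.
KATA_OVERHEAD_MIB = 363

# Relative overhead shrinks as the workload grows; per-node density does not.
for app_mib in (56, 512, 4096):   # benchmark app, mid-size service, 4 GiB executor heap
    relative = KATA_OVERHEAD_MIB / app_mib
    # The 4096 MiB case comes out at ~9%, matching the estimate above.
    print(f"{app_mib:5d} MiB app -> +{relative:.0%} host memory overhead")
```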


  [1] Since Kubernetes 1.27, the Seccomp RuntimeDefault profile is GA and can be enforced cluster-wide via the kubelet, though actual enforcement depends on distribution and kubelet configuration.

  [2] See the NSA/CISA Kubernetes Hardening Guide and Google’s gVisor documentation for representative positions.