The Problem

Multi-tenant arbitrary code execution — managed notebooks, hosted IDEs, serverless functions, playground environments. The pattern is widespread, and the implicit assumption is identical: containers are enough. They do isolate processes. They do not isolate trust boundaries.

What Containers Actually Guarantee

What exactly does “container isolation” consist of?

Primitive            What it isolates
PID namespace        process visibility
Mount namespace      filesystem view
Network namespace    network stack
UTS namespace        hostname and domain name
IPC namespace        System V IPC, POSIX message queues
User namespace       UID/GID mapping, capability scope
Cgroup namespace     cgroup root view
cgroups              resource limits (CPU, memory, I/O)
Linux capabilities   privilege decomposition (~41 discrete privileges)
Seccomp              syscall filtering
Kernel               shared

Every primitive above operates on top of a single shared kernel — the only component with full privilege over memory, scheduling, and I/O, and the ultimate shared attack surface.[1]

Threat Model

What does a shared kernel mean operationally? Four classes of risk.

  • Container escape: kernel exploit (CVE-2022-0185 — requires CAP_SYS_ADMIN in a user namespace; CVE-2024-1086 — nf_tables use-after-free, actively exploited).
  • Lateral movement: service account tokens, metadata endpoints
  • Data exfiltration: DNS tunneling, unrestricted egress
  • Resource abuse: cryptomining, noisy neighbor

Standard hardening layers — AppArmor or SELinux mandatory access control, dropping capabilities, restricting user namespaces — raise the cost of exploitation but do not change the fundamental trust model: the host kernel remains shared. Kernel exploits are rare, but in a multi-tenant environment the blast radius is existential: full host compromise, lateral access to every tenant’s data and credentials.

Where the Trust Boundary Really Is

Namespaces provide isolation properties, but they are not hard security boundaries against hostile code — the kernel attack surface they expose is substantially larger than a minimal hypervisor’s. User namespaces come closest to a true security primitive (they gate capability sets within a confined scope), but they also expand the reachable kernel surface, which is why unprivileged user namespace creation is itself a hardening target.

Seccomp narrows the interface further but cannot close it. A minimal Python HTTP server requires 40–60 distinct syscalls at steady state. The container runtime requires additional privileged syscalls (mount, pivot_root, namespace setup) during initialization, but those execute before the seccomp filter is installed and are not part of the application’s profile.
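
To make the filter concrete, here is a minimal sketch of a deny-by-default profile in the OCI/Docker seccomp JSON format. The allowlist below is illustrative, not a complete profile for a real Python server; in practice the list is generated by tracing the workload, and ends up in the 40-60 entry range described above.

```python
import json

# Illustrative (and deliberately incomplete) allowlist for a minimal
# HTTP server workload. A production profile is derived by tracing the
# actual process, not written by hand.
ALLOWED_SYSCALLS = [
    "accept4", "bind", "brk", "close", "epoll_create1", "epoll_ctl",
    "epoll_wait", "exit_group", "fcntl", "fstat", "futex",
    "getsockname", "listen", "mmap", "munmap", "openat", "read",
    "recvfrom", "rt_sigaction", "sendto", "setsockopt", "socket",
    "write",
]

# OCI/Docker-style seccomp profile: everything not listed fails with
# EPERM instead of reaching the host kernel.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {"names": ALLOWED_SYSCALLS, "action": "SCMP_ACT_ALLOW"},
    ],
}

print(json.dumps(profile, indent=2))
```

Loaded via `--security-opt seccomp=profile.json` in Docker, or a `Localhost`-type `seccompProfile` in a Kubernetes securityContext, this turns "whichever syscalls the profile allows" into an explicit, auditable list.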

Mainstream hardening guidance and sandboxing projects converge on the same conclusion: Linux containers provide process and resource isolation, not tenant isolation comparable to a VM.[2] Containers are excellent at what they were designed for. They are insufficient when the threat model includes hostile tenants — not because of a defect in containers, but because of a mismatch between tool and problem.

If tenants are not mutually trusted, the question is not whether containers are hardened enough, but whether the host kernel is still part of the shared trust domain.

The Design Space: Three Execution Models

There are three approaches to moving, or reinforcing, the kernel boundary. Each makes a different trade-off between isolation strength, performance overhead, and operational complexity.

Hardened Container           Micro-VM (Kata)              VM via KubeVirt
┌──────────────────┐         ┌──────────────────┐         ┌──────────────────┐
│   App process    │         │   App process    │         │   App process    │
│   (namespaced)   │         │   (namespaced)   │         │   (namespaced)   │
├──────────────────┤         ├──────────────────┤         ├──────────────────┤
│ gVisor Sentry    │         │  Guest kernel    │         │  Guest kernel    │
│ OR seccomp filter│         │  (own instance)  │         │  (own instance)  │
│ (reduces host    │         ├──────────────────┤         ├──────────────────┤
│  syscall surface)│         │  VMM (QEMU /     │         │  QEMU + full OS  │
│                  │         │  Firecracker)    │         │  (init, userland)│
╞══════════════════╡         ╞══════════════════╡         ╞══════════════════╡
│ ██ Host kernel ██│         │ ██ Host kernel ██│         │ ██ Host kernel ██│
│ ██  (shared)   ██│         │ ██  (shared)   ██│         │ ██  (shared)   ██│
└──────────────────┘         └──────────────────┘         └──────────────────┘

  Trust boundary:             Trust boundary:             Trust boundary:
  ▲ host kernel (shared)      ▲ guest kernel (own)        ▲ guest kernel (own)
  Attack surface:             Attack surface:             Attack surface:
  host syscall surface,       virtualization boundary      virtualization boundary
  narrowed but shared         + VMM interface              + VMM interface

A: Hardened Containers (gVisor / strict seccomp)

The kernel remains shared, but the interface to it is narrowed. gVisor interposes a user-space kernel (the Sentry) that reimplements the Linux syscall interface. Application syscalls are trapped and serviced entirely within the Sentry; only a narrow set of host syscalls made by the Sentry itself reach the host kernel, further restricted by a dedicated seccomp profile. A strict seccomp profile takes a different approach: instead of reimplementing, it blocks syscalls at the filter level. Both reduce the attack surface. Neither eliminates it. With gVisor, the host kernel is still reachable through the Sentry’s own syscall surface. With strict seccomp, it is reachable through whichever syscalls the profile allows. Compatibility gaps are real: not all syscalls are supported by gVisor, and aggressive seccomp profiles break workloads that depend on them.

B: Micro-VMs (Kata Containers via RuntimeClass)

The workload gets its own kernel. The guest kernel is the security boundary, not the namespace. A kernel exploit inside the micro-VM compromises the guest, not the host. Cold start is higher, density is lower, but the isolation is stronger: it leverages hardware virtualization (VT-x, EPT), though the VMM software (QEMU, cloud-hypervisor, or Firecracker) remains part of the trusted computing base — VMM escape vulnerabilities exist (e.g., VENOM, CVE-2015-3456), so the attack surface shrinks but does not vanish. Kata supports multiple VMM backends. The benchmarks below use QEMU, which has the highest memory overhead; Firecracker and cloud-hypervisor reduce that cost substantially. Kubernetes RuntimeClass allows mixed scheduling on the same cluster — standard containers for trusted workloads, micro-VMs for untrusted code, same control plane.
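
Mixed scheduling is a small amount of configuration. A sketch, assuming Kata is installed with a containerd runtime handler named `kata` (the handler name depends on the deployment; `kata-qemu` and `kata-fc` are also common):

```yaml
# RuntimeClass maps a scheduling name to the configured containerd handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata            # must match the handler name in containerd's config
---
# Untrusted workload opting into micro-VM isolation. Trusted pods simply
# omit runtimeClassName and run as standard containers on the same cluster.
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-sandbox   # hypothetical tenant pod
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: python:3.12-slim
      command: ["python3", "-m", "http.server", "8080"]
```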

For regulated workloads that require removing the host operator from the trusted computing base, Confidential Computing extensions (AMD SEV-SNP, Intel TDX) add hardware-attested memory encryption per VM — an orthogonal hardening layer, not benchmarked here but increasingly relevant for compliance in financial and healthcare environments.

C: Full VMs (KubeVirt)

A complete virtual machine managed as a Kubernetes resource. Dedicated kernel, dedicated userspace, dedicated init system. Maximum isolation, maximum overhead. Organizations adopt KubeVirt when regulatory frameworks mandate kernel-level separation (common in financial services and government), when tenants need different operating systems, or when consolidating existing VM-based workloads into a Kubernetes control plane without re-architecting them. For greenfield multi-tenant code execution, it is usually overkill — but for brownfield migration it can be the pragmatic path.

For workloads that can compile to WebAssembly, Wasm runtimes (Wasmtime, WasmEdge) offer a fundamentally different trade-off: capability-based sandboxing with sub-millisecond cold starts and negligible memory overhead. For general-purpose code execution — arbitrary Python, JVM, native binaries — the three models above remain the practical design space.

The Numbers

The three models differ in theory; the question is how much they differ in practice. To quantify the trade-offs — cold start, memory overhead, density — we benchmarked all three on the same cluster, same workload, same methodology.

Measured on a dedicated 3-node cluster: AMD Ryzen 9 7950X, 4 vCPU / 16 GB per node, Kubernetes 1.29, Proxmox VE with KVM passthrough. 50 runs per runtime, zero timeouts. Identical workload across all three: a Python HTTP server with a 40 MiB pre-allocated ballast and a readiness probe on /ready.

Full setup, scripts, and raw results: execution-boundary-bench

Metric                  Hardened Container    Micro-VM (Kata)    VM via KubeVirt
Cold start P50          1,873 ms              2,887 ms           171,303 ms (~2.9 min)
Cold start P95          1,918 ms              3,892 ms           176,788 ms
App RSS                 61.0 MiB              56.3 MiB           57.1 MiB
Host memory             53 MiB                419 MiB            580 MiB
Memory amplification    0.87x                 7.44x              10.16x
Idle CPU                1.8m (~0.05%)         1.0m (~0.03%)      2.0m (~0.05%)
Max concurrent / node   37                    10                 N/A

These numbers are hardware- and kernel-dependent. The absolute values will vary. The shape of the trade-off is remarkably stable.

Host memory footprint per sandbox (host-side view, benchmark workload)

Hardened Container            Micro-VM (Kata)
  53 MiB total                  419 MiB total
┌──────────────┐              ┌──────────────┐
│              │              │ QEMU process │
│ App (53 MiB) │              │ (incl. guest │  ~256 MiB guest RAM
│              │              │  memory)     │  + ~87 MiB VMM overhead
└──────────────┘              ├──────────────┤
  0.87x                       │ kata-shim    │  ~42 MiB
                              ├──────────────┤
                              │ virtiofsd    │  ~30 MiB
                              ├──────────────┤
                              │ App (56 MiB) │  (inside guest)
                              └──────────────┘
                                7.44x memory amplification

The real cost of Kata is not cold start, it is memory. The cold start delta is about one second: 2.9s vs 1.9s. Manageable. But 419 MiB of host memory for an application that consumes 56 MiB is a 7.44x amplification factor. The extra 363 MiB breaks down into three components: QEMU (~343 MiB, mostly the guest’s 256 MiB RAM allocation, which itself contains the app’s 56 MiB), the Kata shim (~42 MiB), and virtiofsd (~30 MiB, the daemon that provides the container rootfs to the guest). Every micro-VM carries an entire virtualization stack. On this benchmark setup, 100 tenants would mean over 36 GiB of pure isolation overhead.

KubeVirt is effectively unusable for interactive workloads without pre-warming. 2.9 minutes of cold start under nested virtualization (30–90 seconds on bare metal — still too slow for an interactive launch). 10x memory amplification.

With pre-warming, KubeVirt becomes a pool management problem, not a cold start problem.

The isolation cost is memory-dominated, not CPU-dominated. Idle CPU is virtually identical across all three runtimes. You pay once at boot, not on every tick. This is a favorable data point for Kata — the overhead is a fixed cost, not a recurring one. This holds for the benchmark workload (a lightly active Python HTTP server); for syscall-intensive or storage-heavy workloads, virtiofsd and the VMM can add measurable CPU cost at runtime.

Density tells the cost story. 37 containers vs 10 Kata micro-VMs per node, a 3.7:1 ratio. To serve the same number of tenants with Kata, you need almost four times the nodes.

Kata exhibits bimodal cold start behavior. The first ~15 runs land around 3.9s, then drop to ~2.9s as the guest kernel and QEMU pages are cached on the host. On warm nodes (nodes that have already run Kata pods), expect the P50. On cold nodes (fresh from autoscaling or node replacement), expect the P95.

Methodological notes.

  • Nested virtualization. Kata and KubeVirt run under Proxmox VE. On bare-metal KVM, expect KubeVirt cold starts in the 30–90 second range rather than 2.9 minutes. Kata cold starts would also improve, though more modestly.
  • Stability. Memory ratios remain consistent across three separate runs (10, 10, and 50 iterations). P95/P50 ratios (container: 1.02x, Kata: 1.35x, KubeVirt: 1.03x) all below 2.0x — no outlier-driven skew.
  • Memory metric. Amplification below 1.0x for hardened containers is expected: kubectl top reports working_set_bytes, which excludes shared library pages that VmRSS includes.
  • Cold start definition. End-to-end from the user’s perspective: scheduling + runtime boot + Python startup + 40 MiB allocation + readiness probe success.
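
The stability note above can be checked directly from the table; a small sketch, with the P50/P95 values copied from the benchmark results:

```python
# Cold start (P50, P95) in milliseconds, from the benchmark table.
cold_start = {
    "container": (1873, 1918),
    "kata":      (2887, 3892),
    "kubevirt":  (171303, 176788),
}

for runtime, (p50, p95) in cold_start.items():
    ratio = p95 / p50
    # All ratios stay below 2.0x: container ~1.02x, kata ~1.35x,
    # kubevirt ~1.03x, i.e. no outlier-driven skew.
    print(f"{runtime:9s} P95/P50 = {ratio:.2f}x")
```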

Cost, SLO, and Operational Surface Area

What matters for a platform team is what the numbers mean for infrastructure budget, user-facing latency contracts, and the operational surface area that each runtime adds.

100 tenants on 16 GiB nodes (node count driven by density, not memory total)

Hardened Containers (37/node)        Kata Micro-VMs (10/node)
┌────────┐ ┌────────┐ ┌────────┐    ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ node 1 │ │ node 2 │ │ node 3 │    │ node 1 │ │ node 2 │ │ node 3 │ │ node 4 │ │ node 5 │
│ 37     │ │ 37     │ │ 26     │    │ 10     │ │ 10     │ │ 10     │ │ 10     │ │ 10     │
└────────┘ └────────┘ └────────┘    └────────┘ └────────┘ └────────┘ └────────┘ └────────┘
                                    ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
  3 nodes                           │ node 6 │ │ node 7 │ │ node 8 │ │ node 9 │ │ node10 │
  ~5.2 GiB total memory             │ 10     │ │ 10     │ │ 10     │ │ 10     │ │ 10     │
                                    └────────┘ └────────┘ └────────┘ └────────┘ └────────┘

                                     10 nodes     3.7:1 density ratio
                                     ~40.9 GiB total memory

Cost model. The benchmarks above translate directly into infrastructure budget. The binding constraint is density, not raw memory. 100 tenants on hardened containers at 37 per node: 3 nodes. The same 100 tenants on Kata at 10 per node: 10 nodes. The 3.7:1 density ratio is the cost multiplier. Memory tells a consistent story — 100 × 53 MiB = ~5.2 GiB for containers vs. 100 × 419 MiB = ~40.9 GiB for Kata — but it is the per-node concurrency cap, not the memory total, that determines how many nodes you provision. With KubeVirt, density drops further and high-density multi-tenancy becomes cost-prohibitive.
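
The node math above can be written down directly; a sketch using the benchmark figures as inputs (they are hardware-dependent, so treat them as illustrative constants, not universal ones):

```python
import math

TENANTS = 100

# Per-sandbox figures from the benchmark: max concurrency per node and
# host-side memory per tenant.
runtimes = {
    "hardened-container": {"per_node": 37, "host_mib": 53},
    "kata-microvm":       {"per_node": 10, "host_mib": 419},
}

for name, r in runtimes.items():
    # Density, not total memory, is the binding constraint for node count.
    nodes = math.ceil(TENANTS / r["per_node"])
    total_gib = TENANTS * r["host_mib"] / 1024
    # containers: 3 nodes, ~5.2 GiB; kata: 10 nodes, ~40.9 GiB
    print(f"{name:19s} nodes={nodes:2d}  host memory={total_gib:5.1f} GiB")
```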

SLO implications. Cold start is an implicit contract with the user.

Target                    Viable runtimes
< 5 s                     Hardened containers, Kata on warm nodes
< 30 s                    Containers, Kata, KubeVirt with pre-warming pool
Batch / non-interactive   Any option

Kata’s P95 at 3.9s sits dangerously close to the 5-second threshold. After autoscaling or node replacement, expect the P95 — and a likely breach.

Operational complexity. The memory and latency numbers are only half the story. The rest is operational burden.

Runtime path:

  • Runtime components. containerd alone: one component in the critical path. containerd + containerd-shim-kata-v2 + QEMU + virtiofsd + kata-agent (inside the guest): five. Each is a failure point and a log stream.
  • Image management. Hardened containers and Kata use the same container image, same build pipeline. KubeVirt requires a VM image plus cloud-init — a separate pipeline and a separate patching lifecycle.

Observability and debugging:

  • Debug difficulty. kubectl exec works on Kata: the shim proxies it. But host-level troubleshooting requires direct node access. KubeVirt requires virtctl console, an entirely different toolchain.
  • Observability gap. kubectl top for Kata reports guest-side metrics only. The real 419 MiB on the host are invisible without a privileged probe. Standard cluster metrics will underreport actual host-side memory consumption.
  • Network policy. Regardless of runtime class, Kubernetes network policy enforcement happens at the host level (CNI plugin), not inside the guest. The isolation boundary and the data flow boundary are not the same.

Governance:

  • Incident response. Kata requires virtualization expertise, not just Kubernetes skills. Every isolation tier is a staffing decision.
  • Supply chain. Each runtime adds dependencies to the trusted computing base. QEMU is a large C codebase with a long CVE history; gVisor is maintained by Google; Kata is a CNCF project. These are not equivalent supply chain risks, and organizations subject to NIS2 or operating in critical infrastructure sectors should evaluate accordingly.

Evolution Path

The Decision Framework at the end of this section is a lookup table for teams that already know their constraints. The Evolution Path is a narrative for greenfield platforms — it answers the question in what order to adopt isolation tiers as a platform matures. The starting point is the threat model, not the isolation technology.

Phase 1: MVP. Standard containers with a custom seccomp profile, network policies, and resource limits. 1.9s cold start, 37 sandboxes per node, memory amplification below 1x. Sufficient for semi-trusted workloads and internal teams.

Phase 2: Hardening. A stricter seccomp profile, egress filtering, service mesh for lateral movement mitigation. Same density, same operational complexity, no new runtime components. gVisor is also an option at this tier but changes the picture: it adds a new runtime component (runsc), introduces reported 10–30% CPU overhead on syscall-heavy workloads, and has its own compatibility limitations. It provides stronger isolation than seccomp alone but is not operationally equivalent.
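
As a sketch of the egress-filtering step: a default-deny egress policy that allows only DNS, assuming the CNI plugin enforces NetworkPolicy; the `sandboxes` namespace name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-egress-lockdown
  namespace: sandboxes        # hypothetical tenant namespace
spec:
  podSelector: {}             # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    # Allow DNS to kube-system only; all other egress is denied by the
    # presence of this Egress policy.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```

Note that, per the observability section above, this is enforced at the host level by the CNI plugin regardless of runtime class.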

Phase 3: Micro-VM tier. RuntimeClass for high-risk workloads. Mixed scheduling on the same cluster: standard containers for internal workloads, Kata for untrusted user code. 2.9s cold start, 10 sandboxes per node, 7.44x memory amplification. The cost is significant, but it is quantified, not guessed.

Each phase has a concrete trigger. Phase 1 → 2: a penetration test demonstrates that user-submitted code can reach kernel surface beyond the seccomp profile, or the platform begins accepting code from external users. Phase 2 → 3: a regulatory audit requires kernel-level tenant separation, or the threat model includes adversaries capable of kernel exploitation.

Advance when the threat model demands it, not before.

Decision Framework

Given your threat model, budget, SLO, and operational capacity, here is where you land.

Threat model                                    Budget            SLO                     Ops capacity              Runtime              Cost
Semi-trusted code                               High density      < 5s                    Small team                Hardened container   53 MiB/tenant, 37/node
Untrusted code, Wasm-compatible                 High density      < 1s                    Small team                Wasm runtime         Very low overhead (for supported workloads)
Untrusted code, general-purpose                 Moderate          < 5s (warm nodes)       Moderate + virt skills    Micro-VM (Kata)      419 MiB/tenant, 10/node
Hostile tenants OR regulatory kernel isolation  Less constrained  < 3 min or pre-warming  Dedicated platform team   VM via KubeVirt      580 MiB/tenant

Postscript: Spark Is Also a Multi-Tenant Code Execution Engine

One application of this framework deserves explicit mention. Spark on Kubernetes is also multi-tenant arbitrary code execution: every executor runs user-defined functions — Python UDFs, arbitrary JARs, native libraries via JNI — in a pod that shares the host kernel. On YARN the picture is similar: cgroup-based isolation (v1 or v2 depending on Hadoop version), limited namespace support, no dedicated security boundary.

The execution trust boundary is not the DAG or the JVM. It is the kernel.

Two cases are worth distinguishing. Internal clusters serving known teams with RBAC, network policy, and code review are semi-trusted environments — hardened containers (Phase 1–2) are usually sufficient, and moving the kernel boundary is not urgent. Self-serve platforms open to external users or partners are a different threat model: arbitrary code from untrusted sources, where micro-VM isolation (Phase 3) becomes relevant.

The isolation overhead is a fixed cost per sandbox, independent of workload size. For a Spark executor consuming 4 GiB of heap, the Kata overhead (~363 MiB) adds ~9% — far less dramatic than the 7.44x amplification measured on the 56 MiB benchmark workload. But the per-node density constraint remains the same shape: far fewer Kata sandboxes per node than hardened containers, regardless of executor memory.
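
The fixed-cost claim is easy to verify; a sketch using the ~363 MiB per-sandbox overhead from the benchmark (the workload sizes are illustrative):

```python
# Fixed per-sandbox Kata overhead (MiB) measured in the benchmark.
KATA_OVERHEAD_MIB = 363

# Relative overhead shrinks as the workload grows; per-node density does not.
for app_mib in (56, 512, 4096):   # benchmark app, mid-size service, 4 GiB executor heap
    relative = KATA_OVERHEAD_MIB / app_mib
    # The 4096 MiB case comes out at ~9%, matching the estimate above.
    print(f"{app_mib:5d} MiB app -> +{relative:.0%} host memory overhead")
```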


  [1] Since Kubernetes 1.27, the Seccomp RuntimeDefault profile is GA and can be enforced cluster-wide via the kubelet, though actual enforcement depends on distribution and kubelet configuration.

  [2] See the NSA/CISA Kubernetes Hardening Guide and Google’s gVisor documentation for representative positions.