A shared kernel is a shared trust domain. Here’s what that means for your platform.


The Problem

Multi-tenant arbitrary code execution. Managed notebooks, hosted IDEs, serverless functions, playground environments: the pattern is everywhere. The implicit assumption: containers isolate. The problem: they isolate processes, not trust boundaries. This distinction has concrete architectural consequences.

What Containers Actually Guarantee

Containers guarantee multiple layers of isolation, all operating on top of a single shared kernel.

Primitive            What it isolates
PID namespace        process visibility
Mount namespace      filesystem view
Network namespace    network stack
cgroups              resource limits (CPU, memory, I/O)
Linux capabilities   privilege restriction
Seccomp              syscall filtering (opt-in, not enabled by default)
Kernel               shared

The kernel is the only component with full privilege over memory, scheduling, and I/O. If it is shared, privilege is shared.

Threat Model

  • Container escape: kernel exploit (CVE-2022-0185, CVE-2024-1086)
  • Lateral movement: service account tokens, metadata endpoints
  • Data exfiltration: DNS tunneling, unrestricted egress
  • Resource abuse: cryptomining, noisy neighbor

Kernel exploits are rare, but in a multi-tenant environment the impact radius is existential. It is not about probability, it is about blast radius.

Where the Trust Boundary Really Is

A shared kernel is a shared trust domain.

Namespaces and cgroups are not security primitives: they are resource management primitives repurposed for isolation. Seccomp reduces the attack surface but does not eliminate it. Even a minimal Python HTTP server requires on the order of a hundred allowed syscalls, including filesystem and process-management calls such as mount, pivot_root, and namespace setup operations. The floor is higher than most assume.

If your threat model includes untrusted code, the architectural question becomes: where do you place the kernel boundary?

Containers are excellent for process isolation and resource control. They are insufficient when your threat model includes hostile tenants. This is not a defect in containers: it is a mismatch between tool and problem.

The Design Space: Three Execution Models

There are three approaches to moving, or reinforcing, the kernel boundary. Each makes a different trade-off between isolation strength, performance overhead, and operational complexity.

A: Hardened Containers (gVisor / strict seccomp)

The kernel remains shared, but the interface to it is narrowed. gVisor interposes a user-space kernel that intercepts syscalls before they reach the host kernel. A strict seccomp profile takes the opposite approach: instead of intercepting, it blocks. Both reduce the attack surface. Neither eliminates it — the host kernel is still reachable, directly or indirectly. Compatibility gaps are real: not all syscalls are supported by gVisor, and aggressive seccomp profiles break workloads that depend on them. A reasonable compromise for semi-trusted code.
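"Narrowing the interface" has a concrete shape: a deny-by-default allowlist. A minimal sketch in the OCI/Docker seccomp profile format — the syscall list here is illustrative and deliberately incomplete; as noted below, a real Python workload needs on the order of a hundred entries:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "fstat", "mmap", "munmap",
                "brk", "rt_sigaction", "futex", "openat",
                "epoll_wait", "accept4", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Anything not on the list fails with an errno instead of reaching the host kernel, which is exactly how aggressive profiles end up breaking workloads that depend on an unlisted syscall.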

B: Micro-VMs (Firecracker / Kata Containers via RuntimeClass)

The workload gets its own kernel. The guest kernel is the security boundary, not the namespace. A kernel exploit inside the micro-VM compromises the guest, not the host. Cold start is higher, density is lower, but the isolation is real: it is hardware-enforced, not policy-enforced. Kubernetes RuntimeClass allows mixed scheduling on the same cluster — standard containers for trusted workloads, micro-VMs for untrusted code, same control plane.
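The mixed-scheduling mechanism is a one-field change per pod. A sketch, assuming a Kata runtime handler named `kata` is installed on the nodes (handler names vary by installation) and a hypothetical `sandbox:latest` image:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata              # must match the containerd runtime handler on the node
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-sandbox
spec:
  runtimeClassName: kata   # this pod boots inside a micro-VM
  containers:
  - name: sandbox
    image: sandbox:latest  # hypothetical image name
```

Pods without `runtimeClassName` keep running as standard containers on the same cluster, same control plane.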

C: Full VMs (KubeVirt)

A complete virtual machine managed as a Kubernetes resource. Dedicated kernel, dedicated userspace, dedicated init system. Maximum isolation, maximum overhead. Justifiable when regulatory requirements mandate kernel-level separation, or when tenants need different operating systems. For most multi-tenant code execution scenarios, it is overkill.

The Numbers

Measured on a dedicated 3-node cluster: AMD Ryzen 9 7950X, 4 vCPU / 16 GB per node, Kubernetes 1.29, Proxmox VE with KVM passthrough. 50 runs per runtime, zero timeouts. Identical workload across all three: a Python HTTP server with a 40 MiB pre-allocated ballast and a readiness probe on /ready.

Full setup, scripts, and raw results: execution-boundary-bench

Metric                   Hardened Container   Micro-VM (Kata)   Full VM (KubeVirt)
Cold start P50           1,873 ms             2,887 ms          171,303 ms (~2.9 min)
Cold start P95           1,918 ms             3,892 ms          176,788 ms
App RSS                  61.0 MiB             56.3 MiB          57.1 MiB
Host memory              53 MiB               419 MiB           580 MiB
Memory amplification     0.87x                7.44x             10.16x
Idle CPU                 1.8m (1.8%)          1.0m (1%)         2.0m (2%)
Max concurrent / node    37                   10                N/A

These numbers are hardware- and kernel-dependent. The absolute values will vary. The shape of the trade-off is remarkably stable.

The real cost of Kata is not cold start, it is memory. The cold start delta is about one second: 2.9s vs 1.9s. Manageable. But 419 MiB of host memory for an application that consumes 56 MiB is a 7.44x amplification factor. Where do those extra 363 MiB go? QEMU (~343 MiB) + kata-shim (~42 MiB) + virtiofsd (~30 MiB). Every micro-VM carries an entire virtualization toolchain. Multiply by 100 tenants and you are looking at ~40 GiB of pure isolation overhead.
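The amplification arithmetic is simple enough to sanity-check. A quick sketch using the figures from the table above (the per-component QEMU/shim/virtiofsd breakdown is approximate and overlaps with shared pages, so only the totals are computed here):

```python
def memory_amplification(host_mib: float, app_rss_mib: float) -> float:
    """Host-side memory consumed per MiB of application memory."""
    return host_mib / app_rss_mib

# Figures from the benchmark table above.
kata_host_mib, kata_app_mib = 419.0, 56.3

amp = memory_amplification(kata_host_mib, kata_app_mib)  # ~7.44x
overhead_mib = kata_host_mib - kata_app_mib              # ~363 MiB per sandbox

# Fleet-level view: pure isolation overhead for 100 tenants, in GiB.
fleet_overhead_gib = overhead_mib * 100 / 1024           # ~35 GiB
```

The per-sandbox overhead is a fixed cost, so it scales linearly with tenant count — which is why it dominates the budget long before cold start does.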

KubeVirt closes the debate for interactive workloads. 2.9 minutes of cold start. 10x memory amplification. Without pre-warming, it is unusable for notebooks, IDEs, or playgrounds. With pre-warming it becomes a pool management problem, not a cold start problem.

The isolation cost is primarily memory-bound, not CPU-bound. Idle CPU is virtually identical across all three runtimes: 1-2%. You pay once at boot, not on every tick. This is a favorable data point for Kata — the overhead is a fixed cost, not a recurring one.

Density tells the cost story. 37 containers vs 10 Kata micro-VMs per node, a 3.7:1 ratio. To serve the same number of tenants with Kata, you need almost four times the nodes.

Kata exhibits bimodal cold start behavior. The first ~15 runs land around 3.9s, then drop to ~2.9s as the guest kernel and QEMU pages are cached on the host. On warm nodes (nodes that have already run Kata pods), expect the P50. On cold nodes (fresh from autoscaling or node replacement), expect the P95.

Methodological notes. Kata and KubeVirt run under nested virtualization (Proxmox VE). Absolute values would be marginally lower on bare-metal KVM, but the ratios remain consistent across three separate runs (10, 10, and 50 iterations). P95/P50 ratios (container: 1.02x, Kata: 1.35x, KubeVirt: 1.03x) are all below 2.0x, indicating stable measurements with no outlier-driven skew. Memory amplification below 1.0x for hardened containers is expected and correct: kubectl top reports working_set_bytes, which excludes shared library pages that VmRSS includes. Cold start is measured end-to-end from the user’s perspective: scheduling + runtime boot + Python startup + 40 MiB allocation + readiness probe success.
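The stability claim is reproducible from the percentiles in the table. A small sketch:

```python
# P50/P95 cold-start pairs (ms) from the benchmark table above.
# A P95/P50 ratio near 1.0 means a tight distribution; anything
# below 2.0 indicates no outlier-driven skew.
cold_start_ms = {
    "hardened_container": (1_873, 1_918),
    "kata":               (2_887, 3_892),
    "kubevirt":           (171_303, 176_788),
}

ratios = {name: p95 / p50 for name, (p50, p95) in cold_start_ms.items()}
# hardened_container ~1.02x, kata ~1.35x, kubevirt ~1.03x
assert all(r < 2.0 for r in ratios.values())
```

Kata's 1.35x ratio is the largest of the three — the numeric trace of the bimodal warm/cold-node behavior described above.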

Cost, SLO, and Operational Surface Area

Cost model. The benchmarks above translate directly into infrastructure budget. 100 tenants on hardened containers: 100 × 53 MiB = ~5.2 GiB. The same 100 tenants on Kata at 7.44x amplification: ~40.8 GiB. On 16 GiB nodes, that is ~3 nodes vs ~11 nodes for the same workload. With KubeVirt at 10.16x, high-density multi-tenancy becomes cost-prohibitive.
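The node math above can be sketched directly. Per-node density is taken as the binding constraint here; the article's ~11-node figure for Kata additionally budgets headroom on top of the raw quotient:

```python
import math

def nodes_needed(tenants: int, per_node_density: int) -> int:
    """Nodes required when per-node sandbox density is the binding constraint."""
    return math.ceil(tenants / per_node_density)

TENANTS = 100

# Densities from the benchmark: 37 hardened containers vs 10 Kata
# micro-VMs per 4 vCPU / 16 GiB node.
container_nodes = nodes_needed(TENANTS, 37)  # 3
kata_nodes = nodes_needed(TENANTS, 10)       # 10, before operational headroom

# Host memory footprint per fleet, in GiB.
container_mem_gib = TENANTS * 53 / 1024      # ~5.2
kata_mem_gib = TENANTS * 419 / 1024          # ~40.9
```

Swapping in your own tenant count and density figures is the whole point: the trade-off is quantifiable before any migration.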

SLO implications. Cold start is an implicit contract with the user.

  • SLO under 5s: hardened containers or Kata on warm nodes
  • SLO under 30s: containers, Kata, or KubeVirt with a pre-warming pool
  • SLO under 3 min: any option works

Kata’s P95 at 3.9s sits dangerously close to the 5-second threshold. On cold nodes, fresh from autoscaling or replacement, it will breach the threshold.

Operational complexity. Isolation does not just cost memory and latency. It costs operational surface area.

  • Runtime components. containerd alone: one component in the critical path. containerd + kata-shim + QEMU + virtiofsd: four. Each is a failure point, a log stream, and a process to monitor.
  • Debug difficulty. kubectl exec works on Kata: the shim proxies it. But host-level troubleshooting requires direct node access. KubeVirt requires virtctl console, an entirely different toolchain.
  • Observability gap. kubectl top for Kata reports guest-side metrics only. The real 419 MiB on the host are invisible without a privileged probe. Your monitoring is underreporting actual memory consumption. This is a concrete operational risk.
  • Image management. Hardened containers and Kata use the same container image, same build pipeline. KubeVirt requires a VM image plus cloud-init — a separate pipeline and a separate patching lifecycle.
  • Incident response. Who debugs a QEMU crash at 3 AM? Kata requires virtualization expertise, not just Kubernetes skills. Every isolation tier is a staffing decision.

Evolution Path

You do not start with Firecracker on day one. You start with your threat model.

Phase 1: MVP. Standard containers with a custom seccomp profile, network policies, and resource limits. 1.9s cold start, 37 sandboxes per node, memory amplification below 1x. Sufficient for semi-trusted workloads and internal teams.
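Phase 1 is mostly pod spec discipline. A sketch of the relevant fields — the image name, profile path, and limits are illustrative placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tenant-sandbox
spec:
  automountServiceAccountToken: false    # no lateral movement via SA tokens
  containers:
  - name: sandbox
    image: sandbox:latest                # hypothetical image
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: Localhost
        localhostProfile: profiles/sandbox.json   # custom syscall allowlist
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
```

Everything here is policy on the existing runtime — no new components in the critical path, which is what keeps Phase 1 viable for a small team.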

Phase 2: Hardening. gVisor or a more aggressive seccomp profile, egress filtering, service mesh for lateral movement mitigation. Same density, same operational complexity. No new runtime components, just tighter policy on the existing ones.

Phase 3: Micro-VM tier. RuntimeClass for high-risk workloads. Mixed scheduling on the same cluster: standard containers for internal workloads, Kata for untrusted user code. 2.9s cold start, 10 sandboxes per node, 7.44x memory amplification. The cost is significant, but it is quantified, not guessed.

Each phase has a trigger. You do not advance out of caution. You advance when your threat model demands it.

Isolation tiering is an architectural decision, not a feature toggle.

Decision Framework

Given your threat model, budget, SLO, and operational capacity, here is where you land.

Threat model                    Budget            SLO                     Ops capacity             Runtime              Cost
Semi-trusted code               High density      < 5s                    Small team               Hardened container   53 MiB/tenant, 37/node
Untrusted code                  Moderate          < 5s (warm nodes)       Moderate + virt skills   Micro-VM (Kata)      419 MiB/tenant, 10/node
Hostile tenants or regulatory   Less constrained  < 3 min or pre-warming  Dedicated platform team  Full VM (KubeVirt)   580 MiB/tenant
kernel isolation
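The table collapses naturally into a lookup. A hypothetical helper — the tier names and thresholds simply encode the decision table above, not a general-purpose policy engine:

```python
def pick_runtime(untrusted: bool, hostile_or_regulated: bool,
                 cold_start_slo_s: float) -> str:
    """Map a threat model and cold-start SLO to an execution tier.

    Illustrative only: encodes the decision table from this article.
    """
    if hostile_or_regulated:
        # Full VMs cold-start in ~2.9 min; anything tighter needs a pool.
        if cold_start_slo_s < 180:
            return "kubevirt (pre-warmed pool)"
        return "kubevirt"
    if untrusted:
        # Kata P95 (~3.9s) sits close to a 5s SLO on cold nodes.
        return "kata"
    return "hardened-container"

# Example: a notebook platform for internal teams with a 5s SLO.
tier = pick_runtime(untrusted=False, hostile_or_regulated=False,
                    cold_start_slo_s=5)
```

The real version of this function lives in your architecture review, not your codebase — but making the branches explicit is a useful exercise.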

Epilogue: Spark Is Also a Multi-Tenant Code Execution Engine

Most data platform teams do not think of their Spark clusters as multi-tenant arbitrary code execution. But they are.

Every Spark job can contain arbitrary UDFs in Scala, Python, or Java. Untrusted Python code via PySpark. Native libraries through JNI or Arrow. Third-party JARs with no verification. On Kubernetes, every executor runs as a pod. Pods share the host kernel. The threat model described above applies identically.

  • Spark on YARN: cgroup v1, limited namespace isolation
  • Spark on Kubernetes: namespace + cgroup v2, shared kernel
  • Neither provides a dedicated security boundary

The execution trust boundary is not the DAG. It is not the JVM. It is the kernel.

The Decision Framework above applies directly. Executors running trusted code from internal teams need 53 MiB of host memory each, fit 37 per node, and are well served by hardened containers. Spark-as-a-Service platforms accepting arbitrary user code need 419 MiB per executor, fit only 10 per node, and require 3.7x more infrastructure for the same tenant count. The cost of moving the kernel boundary is the same whether the workload is a notebook or a Spark job.
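Applying the micro-VM tier to Spark uses the same RuntimeClass mechanism, attached through an executor pod template. A sketch, assuming a `kata` RuntimeClass already exists on the cluster:

```yaml
# executor-pod-template.yaml -- attach with:
#   --conf spark.kubernetes.executor.podTemplateFile=executor-pod-template.yaml
apiVersion: v1
kind: Pod
spec:
  runtimeClassName: kata   # every executor boots inside its own micro-VM
```

The driver can stay on a hardened container while only the executors — the processes actually running tenant UDFs — pay the micro-VM memory cost.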