Kubernetes User Namespace Isolation for Sandbox Pods #982

@mrunalp

Description

Problem Statement

OpenShell sandbox pods currently run with host-level capabilities (SYS_ADMIN, NET_ADMIN). While the supervisor drops privileges for child processes, a container escape vulnerability would land the attacker as root on the host with these capabilities active. Kubernetes v1.36 graduated user namespace support to GA (spec.hostUsers: false), which maps container UID 0 to an unprivileged host UID and makes capabilities container-scoped. This is a significant defense-in-depth improvement that OpenShell should support.

Proposed Design

Two-layer configuration for enabling user namespaces on sandbox pods:

  • Cluster-wide default: enable_user_namespaces field on the server Config / KubernetesComputeConfig, exposed via the OPENSHELL_ENABLE_USER_NAMESPACES environment variable and the server.enableUserNamespaces Helm value. Defaults to false.
  • Per-sandbox override: optional bool user_namespaces field on the SandboxTemplate proto message. When set, overrides the cluster default. Translated to platform_config.host_users for the Kubernetes driver.
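
The precedence rule above (a per-sandbox override, when set, wins over the cluster-wide default) can be sketched as follows; the function name and signature are illustrative, not the actual OpenShell API:

```rust
/// Illustrative sketch of the two-layer lookup: a per-sandbox
/// `user_namespaces` override, when present, wins over the
/// cluster-wide `enable_user_namespaces` default.
fn effective_user_namespaces(
    cluster_default: bool,          // Config.enable_user_namespaces
    sandbox_override: Option<bool>, // SandboxTemplate.user_namespaces
) -> bool {
    sandbox_override.unwrap_or(cluster_default)
}

fn main() {
    // Cluster default off, sandbox explicitly opts in.
    assert!(effective_user_namespaces(false, Some(true)));
    // Cluster default on, sandbox unset: default applies.
    assert!(effective_user_namespaces(true, None));
    // A sandbox can also opt out of a cluster-wide default.
    assert!(!effective_user_namespaces(true, Some(false)));
}
```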

Pod spec changes when enabled:

  • spec.hostUsers: false is set on sandbox pods, activating Kubernetes user namespace isolation.
  • The capability list is extended with SETUID, SETGID, and DAC_READ_SEARCH (matching the Podman driver). These are needed because the bounding set is reset inside a user namespace: SETUID/SETGID for the supervisor to drop privileges, DAC_READ_SEARCH for cross-UID /proc/<pid>/fd/ access in network policy enforcement.
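
A minimal sketch of the resulting pod spec fragment; `hostUsers` and the capability names are Kubernetes-defined, while the container name is illustrative:

```yaml
# Sketch of a sandbox pod spec with user namespaces enabled.
spec:
  hostUsers: false              # activate Kubernetes user namespace isolation
  containers:
    - name: sandbox             # illustrative container name
      securityContext:
        capabilities:
          add:
            - SYS_ADMIN
            - NET_ADMIN
            - SYS_PTRACE
            - SYSLOG
            - SETUID            # supervisor privilege drop
            - SETGID            # supervisor privilege drop
            - DAC_READ_SEARCH   # cross-UID /proc/<pid>/fd/ access
```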

What stays the same:

  • Seccomp filters (CLONE_NEWUSER block remains — we still don't want nested user namespaces from sandboxed processes).
  • Landlock filesystem restrictions (unprivileged, no capabilities needed).
  • Supervisor privilege-drop logic.
  • Init containers and volume mounts (ID-mapped mounts handle ownership transparently).

Components involved:

  • proto/openshell.proto — SandboxTemplate.user_namespaces field
  • crates/openshell-core/src/config.rs — Config.enable_user_namespaces
  • crates/openshell-driver-kubernetes/src/config.rs — KubernetesComputeConfig.enable_user_namespaces
  • crates/openshell-driver-kubernetes/src/driver.rs — pod spec generation (hostUsers, capabilities), new platform_config_bool helper
  • crates/openshell-server/src/cli.rs — CLI arg / env var
  • crates/openshell-server/src/compute/mod.rs — build_platform_config translation
  • crates/openshell-server/src/lib.rs — config wiring
  • deploy/helm/openshell/values.yaml and templates/statefulset.yaml — Helm plumbing
  • docs/security/best-practices.mdx — user-facing documentation
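
The platform_config_bool helper mentioned above reads an optional boolean out of the opaque platform_config structure. A sketch, with a plain map standing in for the actual prost Struct type:

```rust
use std::collections::HashMap;

/// Stand-in for the opaque `platform_config` Struct: string keys
/// mapped to values (just bools here, for the sketch).
type PlatformConfig = HashMap<String, bool>;

/// Illustrative version of the `platform_config_bool` helper:
/// Some(value) if the key is present, None if absent so the
/// caller can fall back to the cluster-wide default.
fn platform_config_bool(config: &PlatformConfig, key: &str) -> Option<bool> {
    config.get(key).copied()
}

fn main() {
    let mut config = PlatformConfig::new();
    // `host_users: false` requests user namespace isolation.
    config.insert("host_users".to_string(), false);
    assert_eq!(platform_config_bool(&config, "host_users"), Some(false));
    // Missing key: caller falls back to the cluster default.
    assert_eq!(platform_config_bool(&config, "other_knob"), None);
}
```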

Additional changes:

  • Supervisor hostPath volume type changed from DirectoryOrCreate to Directory (the path is always pre-provisioned; DirectoryOrCreate could fail under user namespaces when the mapped UID can't create host directories).
  • A warn! is emitted when GPU and user namespaces are both active on the same sandbox (NVIDIA device plugin compatibility is unverified).
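
The hostPath change amounts to the following volume definition; the volume name and path are illustrative:

```yaml
# Supervisor hostPath volume: `Directory` fails fast if the
# pre-provisioned path is missing, instead of attempting a
# host-side mkdir that the mapped UID may not be permitted to do.
volumes:
  - name: supervisor                        # illustrative volume name
    hostPath:
      path: /var/lib/openshell/supervisor   # illustrative path
      type: Directory                       # was: DirectoryOrCreate
```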

Alternatives Considered

  1. Always-on user namespaces (no opt-in): Rejected because user namespaces require Kubernetes 1.33+ (beta) or 1.36+ (GA), a supporting container runtime (containerd 2.0+, CRI-O 1.25+), and Linux 5.12+ with a filesystem that supports ID-mapped mounts. Forcing it on would break existing deployments on older clusters.

  2. Per-sandbox only (no cluster default): Rejected because operators deploying to a capable cluster should be able to enable user namespaces once for all sandboxes rather than setting it on each sandbox creation request.

  3. Typed field on DriverSandboxTemplate instead of platform_config passthrough: Rejected because host_users is Kubernetes-specific. The existing platform_config opaque Struct is the correct place for platform-specific knobs, matching the pattern used by runtime_class_name and annotations.

Agent Investigation

Explored the full configuration flow from proto through server to K8s driver:

  • Pod spec construction in crates/openshell-driver-kubernetes/src/driver.rs (sandbox_template_to_k8s, apply_supervisor_sideload)
  • Current capability set (SYS_ADMIN, NET_ADMIN, SYS_PTRACE, SYSLOG) and why each is needed
  • Podman driver's user namespace handling in crates/openshell-driver-podman/src/container.rs (adds SETUID, SETGID, DAC_READ_SEARCH — same pattern adopted here)
  • Seccomp filter's CLONE_NEWUSER block in crates/openshell-sandbox/src/sandbox/linux/seccomp.rs (remains active)
  • Network namespace creation in crates/openshell-sandbox/src/sandbox/linux/netns.rs (uses nsenter instead of ip netns exec to avoid sysfs remount, which requires real CAP_SYS_ADMIN in the host user namespace)
  • Helm chart env var wiring pattern in deploy/helm/openshell/templates/statefulset.yaml

Validated end-to-end on:

  • OCP 4.22 (K8s 1.35.3, CRI-O 1.35, RHEL CoreOS, kernel 5.14): full SSH tunnel, workspace init, sandbox command execution with non-identity UID mapping (0 → 3285581824)
  • Native K8s v1.37 (CRI-O 1.36, Fedora, kernel 6.19): pod spec and UID mapping verified
  • mise run cluster (k3s-in-Docker): pod spec verified, runtime fails due to nested overlayfs lacking ID-mapped mount support (expected and documented)
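
The non-identity UID mapping can be confirmed from inside a running sandbox by inspecting the process's uid_map; a sketch of the check:

```shell
# From inside the sandbox container: /proc/self/uid_map shows how
# container UIDs map to host UIDs.
cat /proc/self/uid_map
# With hostUsers: false, the host-side start is a large unprivileged
# UID (e.g. "0 3285581824 65536": container UID 0 maps to host UID
# 3285581824 over a 65536-UID range), rather than the identity
# mapping "0 0 4294967295" seen without user namespaces.
```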

Known limitations:

  • Does not work in Docker-in-Docker / k3s-in-Docker dev clusters (nested overlayfs lacks MOUNT_ATTR_IDMAP support)
  • GPU + user namespaces compatibility is unverified (warning emitted)
  • Requires Linux 5.12+ and a supporting container runtime

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
