Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .agents/skills/debug-openshell-cluster/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -184,7 +184,7 @@ Component images (server, sandbox) can reach kubelet via two paths:

**Local/external pull mode** (default local via `mise run cluster`): Local images are tagged to the configured local registry base (default `127.0.0.1:5000/openshell/*`), pushed to that registry, and pulled by k3s via `registries.yaml` mirror endpoint (typically `host.docker.internal:5000`). The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).

Gateway image builds now stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate (including `openshell-driver-kubernetes` and `openshell-ocsf`) is copied into the staged workspace there.
Gateway image builds now stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate (including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`) is copied into the staged workspace there.

```bash
# Verify image refs currently used by openshell deployment
Expand Down
1 change: 1 addition & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@ These pipelines connect skills into end-to-end workflows. Individual skill files
| `crates/openshell-tui/` | Terminal UI | Ratatui-based dashboard for monitoring |
| `crates/openshell-vm/` | MicroVM runtime | Experimental, work-in-progress libkrun-based VM execution |
| `crates/openshell-driver-kubernetes/` | Kubernetes compute driver | In-process `ComputeDriver` backend for K8s sandbox pods |
| `crates/openshell-driver-docker/` | Docker compute driver | In-process `ComputeDriver` backend for local Docker sandbox containers |
| `crates/openshell-driver-vm/` | VM compute driver | Standalone libkrun-backed `ComputeDriver` subprocess (embeds its own rootfs + runtime) |
| `python/openshell/` | Python SDK | Python bindings and CLI packaging |
| `proto/` | Protobuf definitions | gRPC service contracts |
Expand Down
19 changes: 18 additions & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 16 additions & 2 deletions architecture/gateway.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ graph TD
| Persistence | `crates/openshell-server/src/persistence/mod.rs` | `Store` enum (SQLite/Postgres), generic object CRUD, protobuf codec |
| Compute runtime | `crates/openshell-server/src/compute/mod.rs` | `ComputeRuntime`, gateway-owned sandbox lifecycle orchestration over a compute backend |
| Compute driver: Kubernetes | `crates/openshell-driver-kubernetes/src/driver.rs` | Kubernetes CRD create/delete/watch, pod template translation |
| Compute driver: Docker | `crates/openshell-driver-docker/src/lib.rs` | Local Docker container create/stop/delete/watch |
| Compute driver: VM | `crates/openshell-driver-vm/src/driver.rs` | Per-sandbox microVM create/delete/watch, supervisor-only guest boot |
| Sandbox index | `crates/openshell-server/src/sandbox_index.rs` | `SandboxIndex` -- in-memory name/pod-to-id correlation |
| Watch bus | `crates/openshell-server/src/sandbox_watch.rs` | `SandboxWatchBus` -- in-memory broadcast for persisted sandbox updates |
Expand Down Expand Up @@ -103,6 +104,7 @@ The gateway boots in `cli::run_cli` (`crates/openshell-server/src/cli.rs`) and p
1. Connect to the persistence store (`Store::connect`), which auto-detects SQLite vs Postgres from the URL prefix and runs migrations.
2. Create `ComputeRuntime` with a `ComputeDriver` implementation selected by `OPENSHELL_DRIVERS`:
- `kubernetes` wraps `KubernetesComputeDriver` in `ComputeDriverService`, so the gateway uses the `openshell.compute.v1.ComputeDriver` RPC surface even without transport.
- `docker` constructs `openshell-driver-docker` in-process and manages local containers labeled with the configured sandbox namespace.
- `vm` spawns the standalone `openshell-driver-vm` binary as a local compute-driver process, resolves it from `--driver-dir`, conventional libexec install paths, or a sibling of the gateway binary, connects to it over a Unix domain socket, and keeps the libkrun/rootfs runtime out of the gateway binary.
3. Build `ServerState` (shared via `Arc<ServerState>` across all handlers), including a fresh `SupervisorSessionRegistry`.
4. **Spawn background tasks**:
Expand All @@ -116,7 +118,7 @@ The gateway boots in `cli::run_cli` (`crates/openshell-server/src/cli.rs`) and p

## Configuration

All configuration is via CLI flags with environment variable fallbacks. The `--db-url` and `--ssh-handshake-secret` flags are required.
All configuration is via CLI flags with environment variable fallbacks. The `--db-url` flag is required. The `--ssh-handshake-secret` flag is required for non-Docker drivers; Docker sandboxes do not receive a handshake secret.

| Flag | Env Var | Default | Description |
|------|---------|---------|-------------|
Expand All @@ -132,7 +134,7 @@ All configuration is via CLI flags with environment variable fallbacks. The `--d
| `--sandbox-namespace` | `OPENSHELL_SANDBOX_NAMESPACE` | `default` | Kubernetes namespace for sandbox CRDs |
| `--sandbox-image` | `OPENSHELL_SANDBOX_IMAGE` | None | Default container image for sandbox pods |
| `--grpc-endpoint` | `OPENSHELL_GRPC_ENDPOINT` | None | gRPC endpoint reachable from within the cluster (for supervisor callbacks) |
| `--drivers` | `OPENSHELL_DRIVERS` | `kubernetes` | Compute backend to use. Current options are `kubernetes` and `vm`. |
| `--drivers` | `OPENSHELL_DRIVERS` | `kubernetes` | Compute backend to use. Current options are `kubernetes`, `docker`, and `vm`. |
| `--vm-driver-state-dir` | `OPENSHELL_VM_DRIVER_STATE_DIR` | `target/openshell-vm-driver` | Host directory for VM sandbox rootfs, console logs, and runtime state |
| `--driver-dir` | `OPENSHELL_DRIVER_DIR` | unset | Override directory for `openshell-driver-vm`. When unset, the gateway searches `~/.local/libexec/openshell`, `/usr/local/libexec/openshell`, `/usr/local/libexec`, then a sibling binary. |
| `--vm-krun-log-level` | `OPENSHELL_VM_KRUN_LOG_LEVEL` | `1` | libkrun log level for VM helper processes |
Expand Down Expand Up @@ -600,6 +602,18 @@ The Helm chart template is at `deploy/helm/openshell/templates/statefulset.yaml`

The gateway reaches the sandbox exclusively through the supervisor-initiated `ConnectSupervisor` session, so the driver never returns sandbox network endpoints.

### Docker Driver

The Docker driver (`crates/openshell-driver-docker/src/lib.rs`) is an in-process compute backend for local standalone gateways. It creates one Docker container per sandbox, labels each container with `openshell.ai/managed-by=openshell`, `openshell.ai/sandbox-id`, `openshell.ai/sandbox-name`, and `openshell.ai/sandbox-namespace`, and bind-mounts a Linux `openshell-sandbox` supervisor binary into the container.

- **Create**: Pulls or validates the sandbox image according to `sandbox_image_pull_policy`, creates a labeled container, mounts the supervisor binary and optional TLS material, and starts the container with the supervisor as entrypoint.
- **List/Get/Watch**: Reads labeled containers in the configured sandbox namespace and derives driver-native sandbox status from Docker state plus supervisor relay readiness.
- **Stop**: Stops the matching labeled container without deleting it.
- **Delete**: Force-removes the matching labeled container.
- **Gateway shutdown**: On SIGINT or SIGTERM, `run_server()` leaves the accept loop and calls the Docker shutdown cleanup hook. The hook stops all running, restarting, or paused OpenShell-managed containers in the configured sandbox namespace so local sandboxes do not keep running after the gateway exits.
- **Gateway startup resume**: Before the watch and reconcile loops spawn, `ComputeRuntime::resume_persisted_sandboxes()` walks every sandbox record in the store. For each sandbox whose phase is `Provisioning`, `Ready`, or `Unknown`, it asks the Docker driver to start the labeled container if it is in the `exited` or `created` state (`StartupResume::resume_sandbox`). Containers in `running` or `restarting` are left alone; `paused`, `dead`, and `removing` are skipped. If the matching container has disappeared, the sandbox is moved to phase `Error` with reason `BackendResourceMissing`; if the start call fails, the sandbox moves to `Error` with reason `ResumeFailed`. This is what makes sandboxes survive a graceful gateway restart end-to-end: shutdown stops them, the next startup resumes them, and the store remains the source of truth across the cycle. Drivers that do not need this hook (Kubernetes, Podman, VM) leave `startup_resume = None`, which makes the resume sweep a no-op.
- **Handshake secret**: The Docker driver does not inject `OPENSHELL_SSH_HANDSHAKE_SECRET` or `OPENSHELL_SSH_HANDSHAKE_SKEW_SECS` into containers. Supervisor relay auth relies on the gateway connection rather than a Docker-visible container env secret.

### VM Driver

`VmDriver` (`crates/openshell-driver-vm/src/driver.rs`) is served by the standalone `openshell-driver-vm` process. The gateway spawns that binary on demand and talks to it over the internal `openshell.compute.v1.ComputeDriver` gRPC contract via a Unix domain socket.
Expand Down
2 changes: 1 addition & 1 deletion architecture/sandbox-connect.md
Original file line number Diff line number Diff line change
Expand Up @@ -623,7 +623,7 @@ The sandbox SSH daemon's exit thread waits for the reader thread to finish forwa

### Sandbox environment variables

These are injected into compute-backed sandboxes by the **Kubernetes** driver (`crates/openshell-driver-kubernetes/src/driver.rs`) and the **Podman** driver (`crates/openshell-driver-podman/src/container.rs`). Together they are required for **persistent `ConnectSupervisor` registration and relay** (see [Podman and relay environment](#podman-and-relay-environment) for the Podman-specific fix):
These are injected into compute-backed sandboxes by the **Kubernetes** driver (`crates/openshell-driver-kubernetes/src/driver.rs`), the **Podman** driver (`crates/openshell-driver-podman/src/container.rs`), and the **Docker** driver (`crates/openshell-driver-docker/src/lib.rs`). Together they are required for **persistent `ConnectSupervisor` registration and relay** (see [Podman and relay environment](#podman-and-relay-environment) for the Podman-specific fix):

| Variable | Description |
|---|---|
Expand Down
13 changes: 10 additions & 3 deletions crates/openshell-core/src/config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ pub const CDI_GPU_DEVICE_ALL: &str = "nvidia.com/gpu=all";
pub enum ComputeDriverKind {
Kubernetes,
Vm,
Docker,
Podman,
}

Expand All @@ -57,6 +58,7 @@ impl ComputeDriverKind {
match self {
Self::Kubernetes => "kubernetes",
Self::Vm => "vm",
Self::Docker => "docker",
Self::Podman => "podman",
}
}
Expand All @@ -75,9 +77,10 @@ impl FromStr for ComputeDriverKind {
match value.trim().to_ascii_lowercase().as_str() {
"kubernetes" => Ok(Self::Kubernetes),
"vm" => Ok(Self::Vm),
"docker" => Ok(Self::Docker),
"podman" => Ok(Self::Podman),
other => Err(format!(
"unsupported compute driver '{other}'. expected one of: kubernetes, vm, podman"
"unsupported compute driver '{other}'. expected one of: kubernetes, vm, docker, podman"
)),
}
}
Expand Down Expand Up @@ -450,12 +453,16 @@ mod tests {
"podman".parse::<ComputeDriverKind>().unwrap(),
ComputeDriverKind::Podman
);
assert_eq!(
"docker".parse::<ComputeDriverKind>().unwrap(),
ComputeDriverKind::Docker
);
}

#[test]
fn compute_driver_kind_rejects_unknown_values() {
let err = "docker".parse::<ComputeDriverKind>().unwrap_err();
assert!(err.contains("unsupported compute driver 'docker'"));
let err = "firecracker".parse::<ComputeDriverKind>().unwrap_err();
assert!(err.contains("unsupported compute driver 'firecracker'"));
}

#[test]
Expand Down
28 changes: 28 additions & 0 deletions crates/openshell-driver-docker/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

[package]
name = "openshell-driver-docker"
description = "Docker compute driver for OpenShell"
version.workspace = true
edition.workspace = true
rust-version.workspace = true
license.workspace = true
repository.workspace = true

[dependencies]
openshell-core = { path = "../openshell-core" }

tokio = { workspace = true }
tonic = { workspace = true }
futures = { workspace = true }
tokio-stream = { workspace = true }
tracing = { workspace = true }
bytes = { workspace = true }
url = { workspace = true }
bollard = { version = "0.20" }
tar = "0.4"
tempfile = "3"

[lints]
workspace = true
Loading
Loading