Context
Discovered while finishing a v1→v2 migration on a K8s control-plane node.
Use case
Spawn per-session agent containers as Kubernetes pods on a user-provided cluster instead of local Docker / Apple Container. Today src/container-runtime.ts exports CONTAINER_RUNTIME_BIN = 'docker' and src/container-runner.ts shells out to docker run; on a host that's already a K8s control-plane node with everything else running in-cluster, the local-Docker path is the odd one out — different lifecycle, different observability, different resource controls.
Concretely, my host is swarm0, a control-plane node in a 5-node Cilium cluster. All other personal workloads (convertx, pi-hole, netbird, portainer, etc.) live in-cluster behind nginx-ingress + cert-manager DNS-01. The nanoclaw host process runs as root (it's a dedicated node), and /pods/nanoclaw-v2/data/v2-sessions/ is on NFS. I just finished a v1→v2 migration on this host and discovered a couple of papercuts that a K8s runtime would dissolve:
The agent's USER node (uid 1000) cannot write outbound.db on NFS without an explicit chown step (filed separately as fix(session-manager): chown new session dirs when host runs as root #2353).
The credential proxy binds to docker0 and containers reach it via host.docker.internal — both Docker-specific concepts that don't apply in K8s.
What I think a runtime abstraction would need to handle
This is sketch-level — happy to refine if you want to take it on:
Runtime selection — extend the existing Docker / Apple-Container split with a kubernetes mode (env-toggled, e.g. CONTAINER_RUNTIME=k8s). CONTAINER_RUNTIME_BIN becomes an abstract spawn function rather than a binary name (interface sketch after this list).
Pod template generation — replace args.push('-e', ...) / args.push('-v', ...) with PodSpec generation: env → env, volume mounts → volumes/volumeMounts, --user → securityContext.runAsUser, --add-host → hostAliases, --rm → a Pod with restartPolicy: Never or OnFailure (or a Job). PodSpec sketch after this list.
Volumes — session dirs need to be readable+writable by both the host and the agent pod. NFS PVs / PVCs work if the cluster has a CSI driver matching the host's NFS export, or hostPath if the host process and pods land on the same node (DaemonSet-style affinity). Mount-allowlist entries in ~/.config/nanoclaw/mount-allowlist.json would translate to additional PVCs or hostPath volumes.
Credential proxy — currently http://host.docker.internal:CREDENTIAL_PROXY_PORT. In K8s this becomes either a ClusterIP Service (the proxy running as its own Deployment) or a host-network endpoint plus hostNetwork: true on agent pods.
Heartbeat / DB visibility — /workspace/.heartbeat is currently a host bind-mount the host process polls via fs.statSync. With pods, either keep the same NFS-backed path (works if you have a shared filesystem) or move heartbeat into outbound.db / a CRD / a watch on Pod conditions (watch sketch after this list).
Image build — container/build.sh builds locally and tags nanoclaw-agent-v2-<slug>:latest. With K8s, that image needs to be pushed to a registry the cluster can pull from (most clusters can't pull from the local Docker daemon). I have an in-cluster registry namespace; users without one would need a setup-time prompt to point at Docker Hub / GHCR / etc.
Orphan cleanup — cleanupOrphans() scans docker ps --filter label=.... K8s equivalent: list Pods with the install-slug label and delete completed ones (or rely on Job.spec.ttlSecondsAfterFinished); sketch after this list.
Per-pod logs — --rm wipes container logs on exit today, which already makes debugging hard (docs/... mentions this). K8s pods retain logs until garbage collection; a model without --rm would actually be easier.
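To make the abstraction boundary concrete, here is a minimal TypeScript sketch of the runtime-selection bullet. Every name in it (SpawnOpts, ContainerRuntime, selectRuntime, the mode strings) is hypothetical, not nanoclaw's existing API:

```ts
// Hypothetical runtime abstraction: CONTAINER_RUNTIME_BIN becomes a spawn
// function behind a common interface instead of a binary name.

export interface SpawnOpts {
  image: string;
  env: Record<string, string>;
  mounts: { hostPath: string; containerPath: string; readOnly?: boolean }[];
  runAsUser?: number;                                  // docker --user
  hostAliases?: { ip: string; hostnames: string[] }[]; // docker --add-host
}

export interface ContainerRuntime {
  spawn(opts: SpawnOpts): Promise<{ id: string; wait(): Promise<number> }>;
  cleanupOrphans(installSlug: string): Promise<void>;
}

// Env-toggled selection; implementations are passed in so this stays
// decoupled from the Docker / Apple / K8s modules.
export function selectRuntime(
  impls: Record<string, () => ContainerRuntime>,
): ContainerRuntime {
  const mode = process.env.CONTAINER_RUNTIME ?? 'docker'; // 'docker' | 'apple' | 'k8s'
  const make = impls[mode];
  if (!make) throw new Error(`unknown CONTAINER_RUNTIME: ${mode}`);
  return make();
}
```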
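For the flag → PodSpec translation, a sketch using the types from @kubernetes/client-node. The label key, namespace, PVC wiring, and the credential-proxy Service name are all assumptions:

```ts
import * as k8s from '@kubernetes/client-node';

// Sketch: translate the current docker-run flags into a PodSpec.
export function buildAgentPod(opts: {
  name: string;
  image: string;
  installSlug: string;
  env: Record<string, string>;
  sessionPvc: string;   // PVC bound to the NFS-backed session dir
  runAsUser?: number;   // docker --user
}): k8s.V1Pod {
  return {
    metadata: {
      name: opts.name,
      labels: { 'nanoclaw/install-slug': opts.installSlug }, // hypothetical label key
    },
    spec: {
      restartPolicy: 'Never', // replaces --rm; completed pods keep their logs
      securityContext:
        opts.runAsUser != null ? { runAsUser: opts.runAsUser } : undefined,
      containers: [
        {
          name: 'agent',
          image: opts.image,
          // docker -e → env; the credential-proxy URL would point at e.g.
          // http://credential-proxy.nanoclaw.svc:PORT (hypothetical Service)
          env: Object.entries(opts.env).map(([name, value]) => ({ name, value })),
          volumeMounts: [{ name: 'session', mountPath: '/workspace' }], // docker -v
        },
      ],
      volumes: [
        { name: 'session', persistentVolumeClaim: { claimName: opts.sessionPvc } },
      ],
      // docker --add-host → spec.hostAliases would slot in here as well
    },
  };
}
```

restartPolicy: Never means the completed pod (and its logs) sticks around until cleanup, which is the log-retention win from the per-pod-logs bullet above.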
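For heartbeat, a pod watch could replace the fs.statSync poll. A sketch assuming a nanoclaw namespace and the same hypothetical label key:

```ts
import * as k8s from '@kubernetes/client-node';

// Sketch: watch pod lifecycle instead of polling /workspace/.heartbeat mtimes.
async function watchAgentPods(installSlug: string): Promise<void> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const watch = new k8s.Watch(kc);
  await watch.watch(
    '/api/v1/namespaces/nanoclaw/pods',
    { labelSelector: `nanoclaw/install-slug=${installSlug}` },
    (type, pod: k8s.V1Pod) => {
      // type is ADDED | MODIFIED | DELETED; phase transitions stand in for
      // heartbeat freshness (Succeeded/Failed map to session exit).
      console.log(type, pod.metadata?.name, pod.status?.phase);
    },
    (err) => {
      if (err) console.error('watch closed:', err); // reconnect in real code
    },
  );
}
```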
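And the orphan-cleanup equivalent: list by the install-slug label, delete completed pods. Signatures shown are the 0.x @kubernetes/client-node style (positional args); the 1.x client takes a single object argument instead:

```ts
import * as k8s from '@kubernetes/client-node';

// Sketch of cleanupOrphans() against the K8s API; namespace and label key
// are assumptions, as above.
async function cleanupOrphanPods(installSlug: string): Promise<void> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const core = kc.makeApiClient(k8s.CoreV1Api);
  const { body } = await core.listNamespacedPod(
    'nanoclaw',
    undefined, undefined, undefined, undefined,
    `nanoclaw/install-slug=${installSlug}`, // labelSelector
  );
  for (const pod of body.items) {
    const phase = pod.status?.phase;
    if (phase === 'Succeeded' || phase === 'Failed') {
      await core.deleteNamespacedPod(pod.metadata!.name!, 'nanoclaw');
    }
  }
}
```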
What I'd like out of this issue
Just to flag the use case and architectural shape. Not asking you to take it on — happy to do the work on a fork branch if you don't want it in trunk, but wanted to surface the design considerations first in case you've already thought about it or have opinions on the abstraction boundary.
🤖 Generated with Claude Code