fix(session-manager): chown new session dirs when host runs as root #2353

Open
netadmincmh-hash wants to merge 1 commit into nanocoai:main from netadmincmh-hash:fix/chown-session-dirs-when-host-is-root

@netadmincmh-hash

Summary

Linux installs that run NanoClaw as root with the data directory on a network filesystem hit an unrecoverable container spawn loop. Two constraints collide:

  1. The agent image ships with USER node (uid 1000) and Claude Code refuses to run as root:

    --dangerously-skip-permissions cannot be used with root/sudo privileges
    

    So --user 0:0 on docker run is not a workaround.

  2. The host process writes inbound.db and the session-folder scaffolding as uid 0. On a network filesystem (NFS in my case) the container's uid 1000 cannot write outbound.db or touch the heartbeat file. bun:sqlite surfaces this as:

    Fatal error: attempt to write a readonly database
    

The container exits with code 1 almost immediately after agent-runner startup (about 370 ms after spawn in the reproducer). The host sweep retries a few times, then marks the inbound message completed without ever sending a response.
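
The ownership mismatch in constraint 2 can be illustrated with a classic POSIX mode-bit check. This is only a sketch: the `OwnerInfo` shape and the `canWrite` helper are hypothetical, the 0o755 mode is an assumed typical value (the actual mode is not stated here), and ACLs, supplementary groups, and NFS export options are ignored.

```typescript
// Hypothetical shape: the owner and permission bits of a file or directory.
interface OwnerInfo {
  uid: number;
  gid: number;
  mode: number;
}

// Classic POSIX class selection: owner bits if the uid matches, otherwise
// group bits if the gid matches, otherwise the "other" bits.
export function canWrite(target: OwnerInfo, uid: number, gid: number): boolean {
  if (uid === target.uid) return (target.mode & 0o200) !== 0;
  if (gid === target.gid) return (target.mode & 0o020) !== 0;
  return (target.mode & 0o002) !== 0;
}

// Assumed example: a session dir created by the root host process.
const createdByRoot: OwnerInfo = { uid: 0, gid: 0, mode: 0o755 };
// The same dir after being handed to the container user.
const afterChown: OwnerInfo = { uid: 1000, gid: 1000, mode: 0o755 };

console.log(canWrite(createdByRoot, 1000, 1000)); // false
console.log(canWrite(afterChown, 1000, 1000)); // true
```

Under these assumptions, uid 1000 falls into the "other" class for a root-owned directory and has no write bit, which is consistent with the readonly-database error above.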

What this changes

initSessionFolder now chowns each freshly created session directory to 1000:1000 when process.getuid() === 0. It is a no-op for any non-root host UID; those paths fall through to the existing --user $hostUid:$hostGid mapping in container-runner.ts. The execFileSync('chown', ...) call is best-effort: if it fails, the agent fails later with the clearer SQLite error and the sweep retries until the operator notices.
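
A minimal sketch of the behavior described above, not the actual diff. The helper name `chownSessionDirIfRoot` and the injectable `getUid` and `run` parameters are hypothetical and exist only to make the root-only, best-effort logic easy to exercise without real root privileges.

```typescript
import { execFileSync } from "node:child_process";

// Assumed to match the image's `USER node` directive; hard-coded in this sketch.
const CONTAINER_UID = 1000;
const CONTAINER_GID = 1000;

type ChownRunner = (file: string, args: string[]) => void;

// Hands a freshly created session directory to the container user, but only
// when the host process is root. Returns true if chown ran and succeeded.
export function chownSessionDirIfRoot(
  sessionDir: string,
  getUid: () => number | undefined = () => process.getuid?.(),
  run: ChownRunner = (file, args) => {
    execFileSync(file, args, { stdio: "ignore" });
  },
): boolean {
  // Non-root hosts are left untouched.
  if (getUid() !== 0) return false;
  try {
    run("chown", ["-R", `${CONTAINER_UID}:${CONTAINER_GID}`, sessionDir]);
    return true;
  } catch {
    // Best-effort: swallow the failure and let the later SQLite error surface.
    return false;
  }
}
```

Swallowing the chown error keeps session creation from failing outright, at the cost of deferring the diagnosis to the container's own SQLite error.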

Test plan

  • pnpm run build passes (no new deps, no type changes).
  • On the affected host (root + NFS-backed /pods), reproduced the spawn loop on main. With this patch applied, chown -R 1000:1000 <session-dir> runs at session create, the container's node user can write outbound.db, and Telegram round-trip completes (verified end-to-end).
  • The chown doesn't run when the host UID != 0, so there is no behavior change for the normal Mac/Linux non-root install.

Reproducer

# On a host running as root, with /pods on NFS:
sudo -i
cd /pods/nanoclaw-v2
systemctl start nanoclaw-v2-<slug>
# Send any inbound to the bot. Container exits ~370ms after spawn.
journalctl -u nanoclaw-v2-<slug> | grep "readonly database"

Notes / things to discuss

  • The hard-coded 1000:1000 matches the image's USER node directive but isn't future-proof if the image ever changes UID. If you'd prefer, this could be made configurable via env var (e.g. NANOCLAW_CONTAINER_UID:GID) or read from the image at startup. Happy to fold in either approach.
  • Existing session dirs created before the patch is applied won't be chowned automatically; operators on affected setups will need a one-time chown -R 1000:1000 data/v2-sessions/.
  • This was discovered while finishing a v1→v2 migration; full incident notes are in the migration record (separate from this PR).
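
If the env-var route from the first note is preferred, the parsing could look roughly like the sketch below. Everything here is an assumption: the helper `parseContainerOwner`, the single `uid:gid` value format, and the fallback to 1000:1000 are illustrative, and no variable name or format has been settled.

```typescript
interface ContainerOwner {
  uid: number;
  gid: number;
}

// Fallback assumed to match the image's `USER node` directive.
const DEFAULT_OWNER: ContainerOwner = { uid: 1000, gid: 1000 };

// Parses a "uid:gid" string (for example the value of a hypothetical
// environment variable). Unset or empty input falls back to the default;
// anything else that is not two decimal numbers is rejected.
export function parseContainerOwner(raw: string | undefined): ContainerOwner {
  if (raw === undefined || raw.trim() === "") return { ...DEFAULT_OWNER };
  const match = /^(\d+):(\d+)$/.exec(raw.trim());
  if (!match) {
    throw new Error(`expected "uid:gid", got "${raw}"`);
  }
  return { uid: Number(match[1]), gid: Number(match[2]) };
}
```

Rejecting malformed values instead of silently falling back would make a misconfigured host fail loudly at startup rather than chown to the wrong owner.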

🤖 Generated with Claude Code

Two constraints collide on a Linux install where the NanoClaw host runs as
root and the data directory is on a network filesystem (NFS, etc.):

  1. The agent image ships with USER node (uid 1000) and Claude Code refuses
     to run as root with the error:
       --dangerously-skip-permissions cannot be used with root/sudo privileges
     so we cannot pass --user 0:0 to docker run as a workaround.

  2. The host writes inbound.db and the session-folder scaffolding as uid 0.
     On a network filesystem the container's uid 1000 cannot write outbound.db
     or touch the heartbeat file, and bun:sqlite surfaces this as
       Fatal error: attempt to write a readonly database

The result is an unrecoverable spawn loop: every container exits with code 1
microseconds after agent-runner startup, and the host sweep marks the inbound
message completed after a few retries.

This patch chowns each freshly-created session directory to 1000:1000 when
process.getuid() === 0. No-op when the host already runs as the container UID
(1000) or any other non-root UID — those paths fall through to the existing
--user $hostUid:$hostGid mapping in container-runner.ts. chown is best-effort:
if it fails, the agent will fail later with the clearer SQLite error and the
sweep retries until the operator notices.

Reproducer: run nanoclaw-v2 as root with /pods on NFS, send any inbound
message; container exits code=1 with 'attempt to write a readonly database'.