fix(opencode): kill server process group + configurable IDLE_TIMEOUT_MS #2152

Merged: gavrielc merged 1 commit into nanocoai:providers from glifocat:fix/opencode-process-group-and-timeout on May 1, 2026

@glifocat
Collaborator

Type of Change

  • Feature skill - adds a channel or integration (source code changes + SKILL.md)
  • Utility skill - adds a standalone tool (code files in .claude/skills/<name>/, no source changes)
  • Operational/container skill - adds a workflow or agent skill (SKILL.md only, no source changes)
  • Fix - bug fix or security fix to source code
  • Simplification - reduces or simplifies source code
  • Documentation - docs, README, or CONTRIBUTING changes only

Description

Closes #2148. Closes #2149.

Two related bugs in the OpenCode provider that fire together when a local backend (Ollama, llama.cpp) is slower than the hardcoded 90 s event timeout. Bundled into a single PR because they share container/agent-runner/src/providers/opencode.ts and a small helper.

#2148: proc.kill('SIGKILL') leaks the underlying binary, holding port 4096

spawn('opencode', ...) runs the npm opencode-ai wrapper script, which execs the platform binary opencode-linux-*/bin/opencode; that binary is the actual listener on 127.0.0.1:4096. SIGKILL on the wrapper PID either races with the exec or arrives after the listener has already detached, so the binary survives and the port stays bound. The next spawnOpencodeServer call then fails with Failed to start server on port 4096 / EADDRINUSE.

Fix: spawn detached and signal the whole process group via a new killProcessTree(proc) helper that calls process.kill(-pid, 'SIGKILL') (with a fallback to plain proc.kill('SIGKILL') if the negative-PID call throws — covers the case where the spawn never made it into a process group).

Both call sites updated:

  • startup-timeout cleanup in spawnOpencodeServer
  • destroySharedRuntime
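
For reviewers who want the shape of the helper without opening the diff, here is a minimal sketch of the group-kill pattern described above. It is not the merged code: the body of killProcessTree is reconstructed from this description, and the spawned command is a stand-in for the real opencode invocation. It assumes a POSIX host, since negative-PID signalling does not work on Windows.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Sketch only: reconstructed from the PR description, not copied from the diff.
function killProcessTree(proc: ChildProcess): void {
  try {
    if (proc.pid === undefined) {
      // spawn never produced a PID, so there is no group to signal
      throw new Error("no pid");
    }
    // A negative PID targets every process in the group whose PGID equals pid.
    process.kill(-proc.pid, "SIGKILL");
  } catch {
    // Fallback from the description: signal only the direct child.
    proc.kill("SIGKILL");
  }
}

// Stand-in for spawn('opencode', ...): a Node child that just idles.
// detached: true makes the child the leader of a new process group,
// which is what lets the negative-PID kill reach anything it execs or forks.
const server = spawn(process.execPath, ["-e", "setTimeout(() => {}, 60000)"], {
  detached: true,
  stdio: "ignore",
});
killProcessTree(server);
```

Without detached: true the child stays in the parent's process group, so a negative-PID kill would either miss it or hit the agent runner itself; the two changes only work as a pair.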

#2149 — Configurable idle timeout

IDLE_TIMEOUT_MS = 90_000 was hardcoded. Used as a between-events watchdog, but on a freshly-prompted session it acts as a TTFT ceiling — fine for cloud APIs (sub-second TTFT), too tight for local 30B+ inference on cold start.

Fix: read OPENCODE_IDLE_TIMEOUT_MS from env, default to 300_000 (5 min). Generous for cloud, just enough for slow local. Per-group override via container.json env, e.g. "OPENCODE_IDLE_TIMEOUT_MS": "600000" — no rebuild needed since src/ is bind-mounted.
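
A minimal sketch of the env read, for illustration. The helper name resolveIdleTimeoutMs and the validation rules (falling back to the default on non-numeric or non-positive values) are assumptions; the merged code may simply parse the variable inline.

```typescript
// Default stated in this PR: 300_000 ms (5 min).
const DEFAULT_IDLE_TIMEOUT_MS = 300_000;

// Hypothetical helper name; takes env as a parameter so it can be exercised without touching process.env.
function resolveIdleTimeoutMs(
  env: Record<string, string | undefined> = process.env,
): number {
  const raw = env["OPENCODE_IDLE_TIMEOUT_MS"];
  const parsed = raw === undefined ? NaN : Number(raw);
  // Assumption: unset, non-numeric, zero, or negative values fall back to the default.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_IDLE_TIMEOUT_MS;
}

const IDLE_TIMEOUT_MS = resolveIdleTimeoutMs();
```

With this shape, a container.json env entry of "OPENCODE_IDLE_TIMEOUT_MS": "600000" yields a 10 min ceiling, and leaving the variable unset yields the 5 min default.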

Tests

No behavior-changing additions. Manually verified:

  • Process group kill: docker exec <container> pgrep -af opencode no longer shows orphan [opencode] <defunct> after a forced timeout; 127.0.0.1:4096 is free immediately.
  • Configurable timeout: env override applied; default 300 s confirmed when var unset.

Compounding behavior

Without #2148 fixed, every timeout from the 90 s ceiling (or any idle ceiling) leaks a process and renders the agent container unusable until restarted. Fixing one without the other is half a fix — that's why they're filed together.

For Skills

  • SKILL.md contains instructions, not inline code (code goes in separate files)
  • SKILL.md is under 500 lines
  • I tested this skill on a fresh clone

Not a skill PR — section N/A.

Two bugs in the upstream OpenCode provider that fire together when a
local backend (Ollama, llama.cpp) is slower than the hardcoded 90s
event timeout:

1. proc.kill('SIGKILL') only kills the wrapper process the spawn
   returned, not the opencode-linux-*/bin/opencode child it execs into.
   The child keeps holding port 4096, so the next spawnOpencodeServer()
   fails with "Failed to start server on port 4096" / EADDRINUSE.
   Fix: spawn detached and signal the whole process group via
   process.kill(-pid, 'SIGKILL') in a new killProcessTree() helper.

2. IDLE_TIMEOUT_MS = 90_000 is hardcoded. For a local 31B model the
   first prompt's time-to-first-token routinely exceeds that, tripping
   the timeout. Fix: read OPENCODE_IDLE_TIMEOUT_MS from env, default
   300_000 (5 min) — generous for cloud APIs, just enough for local.

Per-group override goes in container.json env (e.g. "600000" for a
slow local box), no rebuild needed since src/ is bind-mounted.

Same bugs exist on origin/providers — should be ported upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions (bot) added the follows-guidelines and PR: Fix labels on Apr 30, 2026
@gavrielc gavrielc merged commit b429ab3 into nanocoai:providers May 1, 2026
1 check passed
@gavrielc
Collaborator

gavrielc commented May 1, 2026

Thanks!
