fix(opencode): kill server process group + configurable IDLE_TIMEOUT_MS #2152
Merged
gavrielc merged 1 commit on May 1, 2026
Two bugs in the upstream OpenCode provider that fire together when a
local backend (Ollama, llama.cpp) is slower than the hardcoded 90s
event timeout:
1. proc.kill('SIGKILL') only kills the wrapper process the spawn
returned, not the opencode-linux-*/bin/opencode child it execs into.
The child keeps holding port 4096, so the next spawnOpencodeServer()
fails with "Failed to start server on port 4096" / EADDRINUSE.
Fix: spawn detached and signal the whole process group via
process.kill(-pid, 'SIGKILL') in a new killProcessTree() helper.
2. IDLE_TIMEOUT_MS = 90_000 is hardcoded. For a local 31B model the
first prompt's time-to-first-token routinely exceeds that, tripping
the timeout. Fix: read OPENCODE_IDLE_TIMEOUT_MS from env, default
300_000 (5 min) — generous for cloud APIs, just enough for local.
Per-group override goes in container.json env (e.g. "600000" for a
slow local box), no rebuild needed since src/ is bind-mounted.
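The environment override in item 2 could be read along these lines. This is a minimal sketch, not the code in the PR: the helper name resolveIdleTimeoutMs and the choice to fall back to the default on empty, non-numeric, or non-positive values are assumptions made for illustration.

```typescript
// Hypothetical helper: the name and the validation rules are assumptions,
// not the PR's actual implementation.
const DEFAULT_IDLE_TIMEOUT_MS = 300_000; // 5 min, the default named in the commit message

function resolveIdleTimeoutMs(
  env: Record<string, string | undefined>,
  fallback: number = DEFAULT_IDLE_TIMEOUT_MS,
): number {
  const raw = env.OPENCODE_IDLE_TIMEOUT_MS;
  // Unset or blank: keep the default.
  if (raw === undefined || raw.trim() === '') return fallback;
  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, and negatives so a typo cannot disable the watchdog.
  if (!Number.isFinite(parsed) || parsed <= 0) return fallback;
  return Math.floor(parsed);
}

// Resolved once at module load from the container's environment.
const IDLE_TIMEOUT_MS = resolveIdleTimeoutMs(process.env);
```

With this shape, setting the variable to "600000" yields a ten-minute ceiling, while a malformed value silently keeps the five-minute default. Whether a malformed value should instead fail loudly is a design choice the source does not address.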
Same bugs exist on origin/providers — should be ported upstream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Thanks!
This was referenced May 2, 2026
Type of Change
.claude/skills/<name>/, no source changes)
Description
Closes #2148. Closes #2149.
Two related bugs in the OpenCode provider that fire together when a local backend (Ollama, llama.cpp) is slower than the hardcoded 90 s event timeout. Bundled into a single PR because they share container/agent-runner/src/providers/opencode.ts and a small helper.
#2148: proc.kill('SIGKILL') leaks the underlying binary, holding port 4096
spawn('opencode', ...) runs the npm opencode-ai wrapper script that execs the platform binary opencode-linux-*/bin/opencode, which is the actual port listener on 127.0.0.1:4096. SIGKILL on the wrapper PID either races with the exec or the listener has already detached; the binary survives and the port stays bound. The next spawnOpencodeServer call fails with "Failed to start server on port 4096" / EADDRINUSE.
Fix: spawn detached and signal the whole process group via a new killProcessTree(proc) helper that calls process.kill(-pid, 'SIGKILL'), with a fallback to plain proc.kill('SIGKILL') if the negative-PID call throws. The fallback covers the case where the spawn never made it into a process group.
Both call sites updated:
- spawnOpencodeServer
- destroySharedRuntime
#2149: Configurable idle timeout
IDLE_TIMEOUT_MS = 90_000 was hardcoded. It is used as a between-events watchdog, but on a freshly prompted session it acts as a TTFT ceiling: fine for cloud APIs (sub-second TTFT), too tight for local 30B+ inference on cold start.
Fix: read OPENCODE_IDLE_TIMEOUT_MS from env, default to 300_000 (5 min). Generous for cloud, just enough for slow local. Per-group override via container.json env, e.g. "OPENCODE_IDLE_TIMEOUT_MS": "600000". No rebuild is needed since src/ is bind-mounted.
Tests
No behavior-changing additions. Manually verified:
- docker exec <container> pgrep -af opencode no longer shows orphan [opencode] <defunct> after a forced timeout; 127.0.0.1:4096 is free immediately.
Compounding behavior
Without #2148 fixed, every timeout from the 90 s ceiling (or any idle ceiling) leaks a process and renders the agent container unusable until restarted. Fixing one without the other is half a fix — that's why they're filed together.
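The process-group kill that prevents the leak can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in opencode.ts: spawnDetached is a hypothetical name, and the command and arguments are left to the caller. It relies on two Node.js behaviours on POSIX systems: spawning with detached: true makes the child the leader of a new process group, and process.kill() with a negative PID signals every process in that group.

```typescript
import { spawn, ChildProcess } from 'node:child_process';

// Hypothetical wrapper: spawn the server as a process-group leader so that
// anything it later forks or execs stays in a group that can be signalled as a whole.
function spawnDetached(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: true, stdio: 'ignore' });
}

// Sketch of the helper described in the PR; the real one may differ in detail.
function killProcessTree(proc: ChildProcess): void {
  const pid = proc.pid;
  if (pid === undefined) return; // spawn failed, so there is nothing to signal
  try {
    // A negative PID targets the process group whose ID equals `pid` (POSIX only).
    process.kill(-pid, 'SIGKILL');
  } catch {
    // Group signalling failed (for example, no such group): signal the child alone.
    proc.kill('SIGKILL');
  }
}
```

A caller would pair the two, for example `const server = spawnDetached('opencode', [...])` followed by `killProcessTree(server)` on timeout or teardown. The sketch does not wait for the exit event, so code that immediately rebinds the port may still want to await it.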
For Skills
Not a skill PR — section N/A.