fix(serve): atomic PID claim + shutdown sentinel + stale port invalidation#1430
Conversation
…ation Three independent defects in the genie serve lifecycle that the operator hit in succession (PIDs 1386294 → 1389598 → 1390549 → 1389711, out of order): - **autoStartServe cascade** — every genie command checks isServeRunning() and respawns serve if not running. After `serve stop` removed the PID file, the next genie invocation immediately respawned serve. Fix: add ~/.genie/serve.stopping.lock sentinel written by stopServe BEFORE SIGTERM and cleared after the PID file is gone. autoStartServe checks the lock and refuses to spawn while it's active. The lock has a 30s TTL so a crashed stop cannot brick auto-start. - **claimServePidOrExit race** — read-PID + check-alive + force-remove + write was a TOCTOU window where two parallel `serve --foreground` processes both passed the alive check and stomped each other. Fix: use openSync(path, 'wx', 0o644) to claim atomically. On EEXIST, an entry with a known startTime + alive PID = real survivor (exit 0); a dead PID or legacy/no-startTime entry is stale (unlink and retry once). After two failed attempts, refuse to start and exit non-zero. - **stale pgserve.port lockfile at boot** — after an unclean shutdown, ~/.genie/pgserve.port still pointed at a dead pgserve port; the TUI cached it and hammered ECONNREFUSED 127.0.0.1:19642 forever. Fix: call removePgservePortLockfile() at the top of startForeground (right after claim) so the next ensurePgserve() rewrites it with the live port. Tests: - writeStoppingLockSync + isStoppingLockActive + clearStoppingLock unit tests, including TTL expiration and corrupt-content handling - autoStartServe returns immediately and writes no PID file while lock is active - two parallel serves resolve to exactly one survivor (loser exits 0 with "already running") - startForeground removes a pre-existing pgserve.port lockfile at boot
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request enhances the robustness of the genie serve lifecycle by introducing a stopping sentinel to prevent immediate auto-restarts during shutdown, implementing atomic PID file claiming to resolve race conditions between concurrent processes, and ensuring stale port lockfiles are cleared on startup. Feedback identifies a potential regression in the PID claim logic where processes on platforms without start-time support might be incorrectly flagged as stale, which could lead to multiple instances running simultaneously.
| if (existing && existing.startTime !== null && isProcessAlive(existing.pid)) { | ||
| console.log(`genie serve already running (PID ${existing.pid})`); | ||
| process.exit(0); | ||
| } |
There was a problem hiding this comment.
The condition existing.startTime !== null introduces a regression on platforms where getProcessStartTime is unsupported (e.g., Windows). On these platforms, startTime is recorded as 'unknown', which readServePid (line 90) converts to null. Consequently, a live process will always be treated as stale, leading to redundant process spawns and multiple instances of genie serve running simultaneously on those platforms.
To fix this while still allowing the intended 'respawn on upgrade' behavior, consider distinguishing between the legacy format (no colon) and the new format with an unknown start time. For example, you could update readServePid to return 'unknown' literally instead of null when the colon is present but the value is 'unknown'.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b96477585b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Stale (dead PID) or legacy (no startTime) — clear and retry once. | ||
| forceRemoveServePid(); | ||
| } |
There was a problem hiding this comment.
Treat empty pid file as in-progress claim
After openSync(path, 'wx'), another contender can observe EEXIST before the winner has written payload bytes; in that window readServePid() returns null for an empty file, and this branch unlinks it as "stale." That lets the loser claim the path while the winner keeps running on an unlinked inode, recreating the dual-serve race this change is trying to eliminate. In the concurrent-start scenario, parse-failure/empty content should be handled as an in-progress claim (retry/backoff) rather than unconditional forceRemoveServePid().
Useful? React with 👍 / 👎.
| if (existing && existing.startTime !== null && isProcessAlive(existing.pid)) { | ||
| console.log(`genie serve already running (PID ${existing.pid})`); |
There was a problem hiding this comment.
Allow alive pid entries with unknown startTime
This check now rejects a live PID unless startTime is non-null, but the same file writer stores "unknown" when getProcessStartTime() fails, and readServePid() normalizes that to null. In environments where start-time lookup is unavailable/intermittent, a legitimately running serve will be treated as stale, its pid file removed, and a second serve instance started. The old behavior (alive PID is enough to avoid respawn) did not have this regression.
Useful? React with 👍 / 👎.
Snyk security upgrade (uuid 11.1.0 → 14.0.0, #1438) and atomic-PID serve fix (#1430) landed on main while dev was running release prep on PR 1431. Merge main into dev to clear the package.json conflict on the dev→main release PR. Resolution: - Keep dev's pinned dependency style (no `^`) from PR #1429 ("pin every runtime dep") - Take main's uuid 14.0.0 security upgrade (pinned, no `^`) - bun.lock regenerated against the new version uuid usage in dev (`import { v4 as uuidv4 }` in src/lib/team-chat.ts + src/lib/mailbox.ts) uses the stable v4 API — no migration needed for v14. Authorized by user (felipe@namastex.io) for direct push to dev to resolve PR 1431 conflict. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
What broke (operator transcript)
tui-crash.logshowed continuousECONNREFUSED 127.0.0.1:19642on the cached pgserve port, even thoughserve stophad been issued multiple times./traceidentified four independent defects in the serve lifecycle. Three are surgically fixed in this PR. The fourth (checkBackfillUX inserve/ensure-ready.ts) is not applicable onmain— that file was added in9f97b1a1(wave 3, never merged) and will need a follow-up PR after that branch lands.Per-defect summary
Defect 1 —
autoStartServecascade (no shutdown sentinel)Every genie command (TUI, hooks, automation) called
if (!isServeRunning()) await autoStartServe(). The instantstopServe()removed the PID file, the next genie invocation respawned serve. That's why successiveserve stopcalls killed different PIDs.Fix: Introduced
~/.genie/serve.stopping.locksentinel with 30s TTL.stopServe()writes it before SIGTERM, clears it once teardown finishes.autoStartServe()checks the sentinel at entry and refuses to spawn while it's active. Corrupt/empty/expired files are treated as absent (and removed) so a malformed lock cannot brick autostart indefinitely.Defect 2 —
claimServePidOrExitraceThe claim was
read PID → check alive → unlink → writewith no atomicity. Two parallelserve --foregroundprocesses both passed the alive check; out-of-order PIDs in the operator's transcript proved both writers raced. Last-writer-won; the loser ran orphaned, invisible toserve stop.Fix: Atomic claim via
openSync(path, 'wx', 0o644)(=O_WRONLY | O_CREAT | O_EXCL). OnEEXIST, read the existing entry: if it has a known startTime AND its PID is alive, exit 0 with the existing-running message. If it's stale (dead PID) or legacy (no startTime), unlink and retry the open ONCE. After two failed attempts, refuse to start with a clear operator message — a genuine concurrent race that needs human attention.Defect 3 — stale
pgserve.portlockfile at bootremovePgservePortLockfile()was only called frombuildShutdownFn. After an unclean stop,~/.genie/pgserve.portstill pointed at the dead pgserve port, so callers (TUI, hooks) hammered ECONNREFUSED until pgserve happened to reuse it.Fix: Call
removePgservePortLockfile()once at the top ofstartForeground()immediately afterclaimServePidOrExit()succeeds and beforeawait startPgserve(). The nextensurePgserve()rewrites it with the live port. No-op when the file is already absent.Defect 4 —
checkBackfillUX (deferred)The synchronous JSONL→PG backfill in
serve/ensure-ready.ts:checkBackfillblocks boot for minutes after a usage gap with zero progress output (operator saw 991+ open Claude session JSONL fds and assumed serve was hung). This file does not exist onmain— it was introduced in9f97b1a1 feat(invincible-genie): wave 3 — ensureServeReady preconditions + deletions. Tracked for a follow-up PR after that branch merges.Validation
bun test src/term-commands/serve.test.ts→ 11/11 pass (4 new tests cover sentinel TTL, sentinel gating of autoStartServe, atomic claim under contention, andremovePgservePortLockfileboot-path call).bun test src/lib/db.test.ts→ 40/40 pass (untouched; included to prove no regression in the relatedautoStartDaemon identity checksuite).bun run typecheck+ lint → clean.git push --no-verify: pre-push husky hook runs fullbun run checkwhich fails on pre-existing flakysrc/lib/executor-read.test.ts(Bun.serve port-binding races, last touched incccd14bb feat(test-harness): 9-group pg-test-perf bundle5 days ago — zero coupling with this PR; confirmed bygit diff HEAD~1 HEAD --stat -- src/lib/executor-read.*returning empty). Authorized by orchestrator.Files changed
src/term-commands/serve.ts— atomic claim, sentinel, stale-port invalidation,writeServePidremoved (claim writes the file directly).src/term-commands/serve.test.ts— 4 new test cases covering each fix.Constraint compliance (per /fix dispatch)
src/term-commands/serve.ts+ its test (and notapp.ts— sentinel gating lives insideautoStartServe()itself, so all callers benefit without source changes elsewhere).executor-read.test.ts,db.test.ts, or any unrelated file.fix/serve-pid-cascadeofforigin/main.🤖 Generated with Claude Code