Skip to content

fix(serve): atomic PID claim + shutdown sentinel + stale port invalidation#1430

Merged
namastex888 merged 1 commit into
mainfrom
fix/serve-pid-cascade
Apr 28, 2026
Merged

fix(serve): atomic PID claim + shutdown sentinel + stale port invalidation#1430
namastex888 merged 1 commit into
mainfrom
fix/serve-pid-cascade

Conversation

@namastex888
Copy link
Copy Markdown
Contributor

What broke (operator transcript)

$ genie serve stop  (×4)  → killed 4 different PIDs (1386294, 1389598, 1390549, 1389711 — note 1389711 < 1390549, out of order)
$ genie serve stop  (×2)  → "no PID file"
$ genie serve            → "Preconditions:" prints, appears to hang

tui-crash.log showed continuous ECONNREFUSED 127.0.0.1:19642 on the cached pgserve port, even though serve stop had been issued multiple times.

/trace identified four independent defects in the serve lifecycle. Three are surgically fixed in this PR. The fourth (checkBackfill UX in serve/ensure-ready.ts) is not applicable on main — that file was added in 9f97b1a1 (wave 3, never merged) and will need a follow-up PR after that branch lands.

Per-defect summary

Defect 1 — autoStartServe cascade (no shutdown sentinel)

Every genie command (TUI, hooks, automation) called if (!isServeRunning()) await autoStartServe(). The instant stopServe() removed the PID file, the next genie invocation respawned serve. That's why successive serve stop calls killed different PIDs.

Fix: Introduced ~/.genie/serve.stopping.lock sentinel with 30s TTL. stopServe() writes it before SIGTERM, clears it once teardown finishes. autoStartServe() checks the sentinel at entry and refuses to spawn while it's active. Corrupt/empty/expired files are treated as absent (and removed) so a malformed lock cannot brick autostart indefinitely.

Defect 2 — claimServePidOrExit race

The claim was read PID → check alive → unlink → write with no atomicity. Two parallel serve --foreground processes both passed the alive check; out-of-order PIDs in the operator's transcript proved both writers raced. Last-writer-won; the loser ran orphaned, invisible to serve stop.

Fix: Atomic claim via openSync(path, 'wx', 0o644) (= O_WRONLY | O_CREAT | O_EXCL). On EEXIST, read the existing entry: if it has a known startTime AND its PID is alive, exit 0 with the existing-running message. If it's stale (dead PID) or legacy (no startTime), unlink and retry the open ONCE. After two failed attempts, refuse to start with a clear operator message — a genuine concurrent race that needs human attention.

Defect 3 — stale pgserve.port lockfile at boot

removePgservePortLockfile() was only called from buildShutdownFn. After an unclean stop, ~/.genie/pgserve.port still pointed at the dead pgserve port, so callers (TUI, hooks) hammered ECONNREFUSED until pgserve happened to reuse it.

Fix: Call removePgservePortLockfile() once at the top of startForeground() immediately after claimServePidOrExit() succeeds and before await startPgserve(). The next ensurePgserve() rewrites it with the live port. No-op when the file is already absent.

Defect 4 — checkBackfill UX (deferred)

The synchronous JSONL→PG backfill in serve/ensure-ready.ts:checkBackfill blocks boot for minutes after a usage gap with zero progress output (operator saw 991+ open Claude session JSONL fds and assumed serve was hung). This file does not exist on main — it was introduced in 9f97b1a1 feat(invincible-genie): wave 3 — ensureServeReady preconditions + deletions. Tracked for a follow-up PR after that branch merges.

Validation

  • bun test src/term-commands/serve.test.ts11/11 pass (4 new tests cover sentinel TTL, sentinel gating of autoStartServe, atomic claim under contention, and removePgservePortLockfile boot-path call).
  • bun test src/lib/db.test.ts40/40 pass (untouched; included to prove no regression in the related autoStartDaemon identity check suite).
  • bun run typecheck + lint → clean.
  • git push --no-verify: pre-push husky hook runs full bun run check which fails on pre-existing flaky src/lib/executor-read.test.ts (Bun.serve port-binding races, last touched in cccd14bb feat(test-harness): 9-group pg-test-perf bundle 5 days ago — zero coupling with this PR; confirmed by git diff HEAD~1 HEAD --stat -- src/lib/executor-read.* returning empty). Authorized by orchestrator.

Files changed

  • src/term-commands/serve.ts — atomic claim, sentinel, stale-port invalidation, writeServePid removed (claim writes the file directly).
  • src/term-commands/serve.test.ts — 4 new test cases covering each fix.

Constraint compliance (per /fix dispatch)

  • Touched only src/term-commands/serve.ts + its test (and not app.ts — sentinel gating lives inside autoStartServe() itself, so all callers benefit without source changes elsewhere).
  • Did not modify executor-read.test.ts, db.test.ts, or any unrelated file.
  • Single commit on fix/serve-pid-cascade off origin/main.

🤖 Generated with Claude Code

…ation

Three independent defects in the genie serve lifecycle that the operator hit
in succession (PIDs 1386294 → 1389598 → 1390549 → 1389711, out of order):

- **autoStartServe cascade** — every genie command checks isServeRunning()
  and respawns serve if not running. After `serve stop` removed the PID
  file, the next genie invocation immediately respawned serve. Fix: add
  ~/.genie/serve.stopping.lock sentinel written by stopServe BEFORE SIGTERM
  and cleared after the PID file is gone. autoStartServe checks the lock
  and refuses to spawn while it's active. The lock has a 30s TTL so a
  crashed stop cannot brick auto-start.

- **claimServePidOrExit race** — read-PID + check-alive + force-remove +
  write was a TOCTOU window where two parallel `serve --foreground`
  processes both passed the alive check and stomped each other. Fix: use
  openSync(path, 'wx', 0o644) to claim atomically. On EEXIST, an entry
  with a known startTime + alive PID = real survivor (exit 0); a dead PID
  or legacy/no-startTime entry is stale (unlink and retry once). After
  two failed attempts, refuse to start and exit non-zero.

- **stale pgserve.port lockfile at boot** — after an unclean shutdown,
  ~/.genie/pgserve.port still pointed at a dead pgserve port; the TUI
  cached it and hammered ECONNREFUSED 127.0.0.1:19642 forever. Fix: call
  removePgservePortLockfile() at the top of startForeground (right after
  claim) so the next ensurePgserve() rewrites it with the live port.

Tests:
- writeStoppingLockSync + isStoppingLockActive + clearStoppingLock unit
  tests, including TTL expiration and corrupt-content handling
- autoStartServe returns immediately and writes no PID file while lock
  is active
- two parallel serves resolve to exactly one survivor (loser exits 0
  with "already running")
- startForeground removes a pre-existing pgserve.port lockfile at boot
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

Warning

Rate limit exceeded

@namastex888 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 9 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d284158d-0021-4446-8518-53400a9e7c9b

📥 Commits

Reviewing files that changed from the base of the PR and between fd54886 and b964775.

📒 Files selected for processing (2)
  • src/term-commands/serve.test.ts
  • src/term-commands/serve.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/serve-pid-cascade

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the robustness of the genie serve lifecycle by introducing a stopping sentinel to prevent immediate auto-restarts during shutdown, implementing atomic PID file claiming to resolve race conditions between concurrent processes, and ensuring stale port lockfiles are cleared on startup. Feedback identifies a potential regression in the PID claim logic where processes on platforms without start-time support might be incorrectly flagged as stale, which could lead to multiple instances running simultaneously.

Comment on lines +611 to +614
if (existing && existing.startTime !== null && isProcessAlive(existing.pid)) {
console.log(`genie serve already running (PID ${existing.pid})`);
process.exit(0);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The condition existing.startTime !== null introduces a regression on platforms where getProcessStartTime is unsupported (e.g., Windows). On these platforms, startTime is recorded as 'unknown', which readServePid (line 90) converts to null. Consequently, a live process will always be treated as stale, leading to redundant process spawns and multiple instances of genie serve running simultaneously on those platforms.

To fix this while still allowing the intended 'respawn on upgrade' behavior, consider distinguishing between the legacy format (no colon) and the new format with an unknown start time. For example, you could update readServePid to return 'unknown' literally instead of null when the colon is present but the value is 'unknown'.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b96477585b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +615 to +617
// Stale (dead PID) or legacy (no startTime) — clear and retry once.
forceRemoveServePid();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat empty pid file as in-progress claim

After openSync(path, 'wx'), another contender can observe EEXIST before the winner has written payload bytes; in that window readServePid() returns null for an empty file, and this branch unlinks it as "stale." That lets the loser claim the path while the winner keeps running on an unlinked inode, recreating the dual-serve race this change is trying to eliminate. In the concurrent-start scenario, parse-failure/empty content should be handled as an in-progress claim (retry/backoff) rather than unconditional forceRemoveServePid().

Useful? React with 👍 / 👎.

Comment on lines +611 to +612
if (existing && existing.startTime !== null && isProcessAlive(existing.pid)) {
console.log(`genie serve already running (PID ${existing.pid})`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow alive pid entries with unknown startTime

This check now rejects a live PID unless startTime is non-null, but the same file writer stores "unknown" when getProcessStartTime() fails, and readServePid() normalizes that to null. In environments where start-time lookup is unavailable/intermittent, a legitimately running serve will be treated as stale, its pid file removed, and a second serve instance started. The old behavior (alive PID is enough to avoid respawn) did not have this regression.

Useful? React with 👍 / 👎.

@namastex888 namastex888 merged commit cb41029 into main Apr 28, 2026
11 checks passed
namastex888 added a commit that referenced this pull request Apr 28, 2026
Snyk security upgrade (uuid 11.1.0 → 14.0.0, #1438) and atomic-PID serve fix
(#1430) landed on main while dev was running release prep on PR 1431. Merge
main into dev to clear the package.json conflict on the dev→main release PR.

Resolution:
- Keep dev's pinned dependency style (no `^`) from PR #1429 ("pin every runtime dep")
- Take main's uuid 14.0.0 security upgrade (pinned, no `^`)
- bun.lock regenerated against the new version

uuid usage in dev (`import { v4 as uuidv4 }` in src/lib/team-chat.ts +
src/lib/mailbox.ts) uses the stable v4 API — no migration needed for v14.

Authorized by user (felipe@namastex.io) for direct push to dev to resolve
PR 1431 conflict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant