fix(prune): broaden TTL-archive filter to include stale_spawn_dead_pane (PR-B G3 #1) #1636
Reconciler tags two flavors of dead-pane workers:
- dead_pane_zombie (active state → pane died)
- stale_spawn_dead_pane (spawning state → pane died before ready)

`archiveExhaustedZombies` and `listExhaustedZombies` only matched `dead_pane_zombie`, so the spawning-flavor zombies stayed visible in `genie ls` forever even after auto_resume exhaustion.

Broaden the audit_events EXISTS filter in both queries to match either reason. Behavior unchanged for `dead_pane_zombie` rows; pre-existing TTL + auto_resume=false guards still apply.

Wish: cli-noise-and-hygiene-cleanup G3 (deliverable #1). Subsequent PR adds `genie prune --errored` mode (G3 deliverables #2-5).

Smoke: seeded a 2h-old stale_spawn_dead_pane row → listExhaustedZombies(1) now returns it (was 0 before).
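The eligibility rule described above can be sketched as an in-memory predicate. This is only an illustration of the logic, not the SQL in `agent-registry.ts`: the `AgentRow` shape, its field names, and the `isTtlArchiveEligible` helper are all hypothetical, and the exact TTL comparison used by the real queries is an assumption.

```typescript
// Hypothetical in-memory model of the archive filter. Field names are
// illustrative and do not reflect the real agent-registry schema.
interface AgentRow {
  id: string;
  autoResume: boolean;   // assumed to be false once the resume budget is spent
  reasonRecordedAt: Date; // assumed timestamp of the reconciler's audit event
  reason: string;        // stands in for details->>'reason'
}

// The two reconciler reasons this PR treats as TTL-eligible.
const TTL_ELIGIBLE_REASONS: ReadonlySet<string> = new Set([
  'dead_pane_zombie',
  'stale_spawn_dead_pane',
]);

// A row qualifies only if all three guards hold: matching reason,
// auto_resume disabled, and age at or beyond the TTL.
function isTtlArchiveEligible(row: AgentRow, ttlHours: number, now: Date): boolean {
  const ageMs = now.getTime() - row.reasonRecordedAt.getTime();
  const ttlMs = ttlHours * 60 * 60 * 1000;
  return TTL_ELIGIBLE_REASONS.has(row.reason) && !row.autoResume && ageMs >= ttlMs;
}
```

Under this model, a 2h-old `stale_spawn_dead_pane` row with auto_resume disabled qualifies at a 1h TTL, which mirrors the smoke check in the description.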
Code Review
This pull request expands the zombie agent archival criteria to include the 'stale_spawn_dead_pane' reason, updating the documentation, archival and listing logic, and adding a consistency test. The reviewer suggests also including the 'stale_spawn' reason to more comprehensively clean up exhausted agents that failed during the initial spawn phase, noting that documentation and tests should be updated to reflect this addition.
| * by the scheduler's exhaustion branch. Without this TTL, such rows stayed | ||
| * visible in `genie ls` forever (#1293), holding registry slots and confusing | ||
| * users into thinking the agent is still recoverable. | ||
| * `reason IN ('dead_pane_zombie', 'stale_spawn_dead_pane')` AND whose |
| * `reason IN ('dead_pane_zombie', 'stale_spawn_dead_pane')` — | ||
| * both reconciler reasons indicate a dead pane and are TTL-eligible. |
| AND e.entity_id = a.id | ||
| AND e.event_type = 'state_changed' | ||
| AND e.details->>'reason' = 'dead_pane_zombie' | ||
| AND e.details->>'reason' IN ('dead_pane_zombie', 'stale_spawn_dead_pane') |
Consider including the stale_spawn reason in this filter. Agents that fail to spawn initially (Pass 1 of the reconciler) and subsequently exhaust their auto_resume budget are also effectively zombies that clutter the registry. Since Pass 1 already excludes dir: agents, adding stale_spawn here would provide more comprehensive cleanup of exhausted agents without affecting permanent directory placeholders. Additionally, note that this filter string is duplicated in listExhaustedZombies, which increases maintenance risk.
AND e.details->>'reason' IN ('dead_pane_zombie', 'stale_spawn_dead_pane', 'stale_spawn')
| AND e.entity_id = a.id | ||
| AND e.event_type = 'state_changed' | ||
| AND e.details->>'reason' = 'dead_pane_zombie' | ||
| AND e.details->>'reason' IN ('dead_pane_zombie', 'stale_spawn_dead_pane') |
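One way to address the duplication the reviewer mentions is to render the filter from a single list of reasons and interpolate it into both queries. The sketch below is an assumption about how that could look; `TTL_ARCHIVE_REASONS` and `reasonFilterSql` are hypothetical names, not identifiers from the repository.

```typescript
// Hypothetical single source of truth for the TTL-eligible reasons.
const TTL_ARCHIVE_REASONS = ['dead_pane_zombie', 'stale_spawn_dead_pane'] as const;

// Renders the audit_events reason predicate from a list of reasons.
// Single quotes are doubled so each value stays a valid SQL string literal.
function reasonFilterSql(reasons: readonly string[]): string {
  if (reasons.length === 0) {
    throw new Error('at least one reason is required; IN () is not valid SQL');
  }
  const quoted = reasons.map((r) => `'${r.replace(/'/g, "''")}'`).join(', ');
  return `e.details->>'reason' IN (${quoted})`;
}

// Both queries would interpolate this one fragment instead of repeating it.
const REASON_FILTER = reasonFilterSql(TTL_ARCHIVE_REASONS);
```

The trade-off is that the regression test in this PR matches the filter as a literal string in the source file, so moving to a generated fragment would require the test to assert on the rendered SQL instead.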
| // accumulating forever in `genie ls` even after auto_resume exhaustion. | ||
| const source = readFileSync(join(__dirname, '..', 'agent-registry.ts'), 'utf-8'); | ||
| const archiveFilter = "e.details->>'reason' IN ('dead_pane_zombie', 'stale_spawn_dead_pane')"; |
If the stale_spawn reason is added to the registry filters, this test expectation must be updated to match the new SQL string, as the test relies on exact source-code matching.
| const archiveFilter = "e.details->>'reason' IN ('dead_pane_zombie', 'stale_spawn_dead_pane')"; | |
| const archiveFilter = "e.details->>'reason' IN ('dead_pane_zombie', 'stale_spawn_dead_pane', 'stale_spawn')"; |
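The source-matching approach discussed above can be sketched as follows. The test's assertions are not shown in the diff, so this is an assumption about its shape: the `countOccurrences` helper is hypothetical, and an inline string stands in for the `readFileSync` call on `agent-registry.ts`.

```typescript
// Counts non-overlapping occurrences of `needle` in `source`.
function countOccurrences(source: string, needle: string): number {
  if (needle.length === 0) return 0;
  let count = 0;
  let idx = source.indexOf(needle);
  while (idx !== -1) {
    count += 1;
    idx = source.indexOf(needle, idx + needle.length);
  }
  return count;
}

const archiveFilter = "e.details->>'reason' IN ('dead_pane_zombie', 'stale_spawn_dead_pane')";

// Stand-in for the file contents; the real test reads agent-registry.ts.
const sampleSource = [
  '-- archiveExhaustedZombies',
  `AND ${archiveFilter}`,
  '-- listExhaustedZombies',
  `AND ${archiveFilter}`,
].join('\n');

// Both queries should carry the same filter, so two matches are expected here.
const matches = countOccurrences(sampleSource, archiveFilter);
```

Because the match is on exact text, any change to the reason list, spacing, or quoting in either query makes the count drop, which is why the expectation has to be edited in lockstep with the SQL.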
…coped per reviewer

Lands the wish doc that scaffolds PR-A (#1634) and PR-B (#1636/#1637/#1638/#1640/#1642), plus the 2026-05-07 PR-C draft + reviewer FIX-FIRST corrections.

Why this is a separate docs commit:
- The wish file was authored 2026-05-04 but only ever sat in a stash; never committed despite shipping work referencing it. This commit lands the reference document for completed + pending work in one place.
- PR-C as originally drafted had three invalid premises against live 4.260507.1 (G3 amendment already implemented at scheduler-daemon.ts:1296; G9 line is on stderr not stdout; G10 design assumes binary-spawn that the HTTP probe doesn't do). Reviewer corrections folded in.
- Only G8 (kill-path shadow+UUID dedup) survives intact; file path corrected to src/term-commands/agents.ts:2817 (handleWorkerKill).
- G9 reframed as stderr-noise reduction (DEBUG=pgserve gating).
- G10 deferred pending /trace into update.ts:362.

QA dogfooding-72h artifacts (AUDIT.md, QA-PLAN.md) document the 72-h fix-audit sweep that surfaced the bugs and triggered the wish update.

Refs: #1677

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Cli-noise-and-hygiene-cleanup G3 deliverable #1: broaden `archiveExhaustedZombies` + `listExhaustedZombies` reason filter from `'dead_pane_zombie'` (only) to `IN ('dead_pane_zombie', 'stale_spawn_dead_pane')`.

Before this fix, `genie prune --zombies` ignored spawning-flavor dead-pane workers; they accumulated in `genie ls` indefinitely after auto_resume exhausted. After: both flavors are eligible for TTL-based archive once their auto_resume budget is spent.

Behavior for `dead_pane_zombie` rows is unchanged. TTL + auto_resume=false guards still apply.

Files
- `src/lib/agent-registry.ts` (+10 / -8): broaden SQL filter on both queries + update doc comment
- `src/lib/__tests__/zombie-spawns.test.ts` (+14 / -0): source-grep regression test

Test plan
- Smoke: seeded a 2h-old `stale_spawn_dead_pane` row → `listExhaustedZombies(1)` now surfaces it (returned 0 pre-fix)

Wish + sequencing
- `cli-noise-and-hygiene-cleanup` PR-B G3 (deliverable #1 of 5)
- Subsequent PR: `genie prune --errored` mode + 1h default TTL (G3 deliverables #2-5)

🤖 Generated with Claude Code