fix(kanban): detect darwin zombie workers (salvages #20023) #20188

Merged: teknium1 merged 1 commit into main from hermes/hermes-9ddf5187 on May 5, 2026

@teknium1 (Contributor) commented May 5, 2026

Kanban dispatcher now correctly detects zombie worker processes on macOS, so crashed workers get reclaimed on the next tick instead of tying up their task for the full claim_expires window (~15 min).

Salvaged from #20023 (@LeonSGP43).

Root cause: _pid_alive() used os.kill(pid, 0), which succeeds for zombie processes because the process-table entry still exists after exit until the parent reaps it. On Linux the function fell through to /proc/<pid>/status and read State: Z, but macOS has no /proc, so the zombie check there was a documented no-op. A worker that crashed at startup (missing skill, bad credential, import error) stayed "alive" to the dispatcher until the claim TTL expired. With a ~5 min dispatcher cadence against a 15 min TTL, that meant 3+ wasted re-spawn attempts per stuck task, all of which crashed identically.
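
The kill(pid, 0) behaviour described above can be reproduced on any POSIX system. The following is a minimal sketch, not code from this PR: the helper names kill0_sees_process and demo_zombie are hypothetical, and the short sleep is only a rough way to let the child exit before probing.

```python
import os
import time


def kill0_sees_process(pid):
    """Return True if os.kill(pid, 0) reports that the pid exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the pid exists but belongs to another user
    return True


def demo_zombie():
    """Fork a child that exits at once and probe it before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(1)  # child: simulate a worker crashing at startup
    time.sleep(0.2)  # give the child time to exit; it is not reaped yet
    seen_before_reap = kill0_sees_process(pid)  # zombie still has a table entry
    os.waitpid(pid, 0)  # reap the child, removing the process-table entry
    seen_after_reap = kill0_sees_process(pid)
    return seen_before_reap, seen_after_reap


if __name__ == "__main__":
    print(demo_zombie())
```

The probe keeps answering "alive" for the exited child until waitpid() runs, which is why a second signal such as the process state is needed to tell a zombie from a live worker.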

Changes

  • hermes_cli/kanban_db.py: after kill(pid, 0) succeeds on Darwin, shell out to ps -o stat= -p <pid> with a 1s timeout. Return False if ps exits non-zero (no such process) or if the BSD stat field contains Z. If the probe itself errors, keep the optimistic kill(0) answer — conservative default.
  • tests/hermes_cli/test_kanban_core_functionality.py: new test_pid_alive_detects_darwin_zombie covering the Darwin branch with a mocked ps returning Z+.
  • Linux /proc/<pid>/status path unchanged.
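
The Darwin branch described in the list above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code merged into hermes_cli/kanban_db.py: the function name pid_alive_sketch, the injectable platform and run parameters, and the PermissionError handling are all hypothetical, and the Linux /proc path is left out.

```python
import os
import subprocess
import sys


def pid_alive_sketch(pid, platform=sys.platform, run=subprocess.run):
    """Liveness probe with a ps-based zombie check on Darwin (illustrative)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # succeeds for live processes and for zombies
    except ProcessLookupError:
        return False
    except PermissionError:
        pass  # assumption: the pid exists but is owned by another user

    if platform == "darwin":
        try:
            # "stat=" prints the BSD state field with no header line.
            proc = run(
                ["ps", "-o", "stat=", "-p", str(pid)],
                capture_output=True,
                text=True,
                timeout=1,
            )
        except (OSError, subprocess.SubprocessError):
            return True  # the probe itself failed: keep the kill(0) answer
        if proc.returncode != 0:
            return False  # ps found no such process
        if "Z" in proc.stdout:
            return False  # zombie: exited but not yet reaped
    return True


if __name__ == "__main__":
    # Simulate ps reporting a zombie ("Z+") without needing a real one.
    def fake_ps(*args, **kwargs):
        return subprocess.CompletedProcess(args=[], returncode=0, stdout="Z+\n")

    print(pid_alive_sketch(os.getpid(), platform="darwin", run=fake_ps))
```

Passing run as a parameter is only a convenience for exercising the branch on any platform; a test in the style of test_pid_alive_detects_darwin_zombie could equally patch subprocess.run with a mock.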

Validation

| | Before | After |
| --- | --- | --- |
| macOS: worker crashes at startup | _pid_alive → True for ~15 min until claim_expires; dispatcher re-spawns 3+ times | _pid_alive → False on next tick; task reclaimed via crashed-worker path |
| Linux: zombie worker | /proc/<pid>/status peek returns False (unchanged) | unchanged |
| Windows / other POSIX | no zombie check (unchanged) | unchanged |
| ps probe fails (unexpected) | N/A | falls back to kill(0) answer (optimistic) |

Targeted tests: 174/174 pass (test_kanban_core_functionality + test_kanban_db)

Closes #20015

Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com>

@teknium1 merged commit 1a03e3b into main on May 5, 2026 (10 of 11 checks passed)
@teknium1 deleted the hermes/hermes-9ddf5187 branch on May 5, 2026 11:43
@alt-glitch added the type/bug, P3, comp/plugins, and comp/cli labels on May 5, 2026

Labels

  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/plugins: Plugin system and bundled plugins
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

kanban dispatcher: macOS zombie detection is a no-op — _pid_alive returns True for defunct workers
