fix(kanban): detect darwin zombie workers (salvages #20023) by teknium1 · Pull Request #20188 · NousResearch/hermes-agent

teknium1 · 2026-05-05T11:39:35Z

Kanban dispatcher now correctly detects zombie worker processes on macOS, so crashed workers get reclaimed on the next tick instead of tying up their task for the full claim_expires window (~15 min).

Salvaged from #20023 (@LeonSGP43).

Root cause: _pid_alive() used os.kill(pid, 0), which succeeds for zombie processes because the process table entry still exists after exit and before the parent reaps it. On Linux the check fell through to /proc/<pid>/status to read State: Z, but macOS has no /proc, so the zombie check was a documented no-op there. A worker that crashed at startup (missing skill, bad credential, import error) stayed "alive" to the dispatcher until the claim TTL expired. With a ~5 min dispatcher cadence and a 15 min TTL, that meant 3+ wasted re-spawn attempts per stuck task, all of which crashed identically.
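The signal-0 behaviour described above can be reproduced on any POSIX system. The following is a minimal sketch, not code from this PR: `kill0_alive` and `demo_zombie_probe` are hypothetical names used only for illustration. It starts a short-lived child, waits for it to finish without reaping it, and probes the PID before and after the reap.

```python
import os
import subprocess
import sys


def kill0_alive(pid: int) -> bool:
    """Signal-0 probe: True if the PID still has a process-table entry."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def demo_zombie_probe() -> tuple[bool, bool]:
    """Return (alive_before_reap, alive_after_reap) for a finished child."""
    child = subprocess.Popen([sys.executable, "-c", "pass"], stdout=subprocess.PIPE)
    child.stdout.read()              # EOF: the child has closed stdout and is exiting
    before = kill0_alive(child.pid)  # not reaped yet, so the entry still exists
    child.wait()                     # reap the child, releasing its PID
    child.stdout.close()
    after = kill0_alive(child.pid)
    return before, after


if __name__ == "__main__":
    print(demo_zombie_probe())
```

The first probe succeeds even though the child has already finished, which is why a signal-0 check alone cannot distinguish a crashed worker from a running one.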

Changes

- `hermes_cli/kanban_db.py`: after `kill(pid, 0)` succeeds on Darwin, shell out to `ps -o stat= -p <pid>` with a 1s timeout. Return False if `ps` exits non-zero (no such process) or if the BSD stat field contains `Z`. If the probe itself errors, keep the optimistic `kill(0)` answer as the conservative default.
- `tests/hermes_cli/test_kanban_core_functionality.py`: new `test_pid_alive_detects_darwin_zombie` covering the Darwin branch with a mocked `ps` returning `Z+`.
- Linux `/proc/<pid>/status` path unchanged.
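A rough sketch of the Darwin probe described in the list above, written from the PR description rather than copied from `hermes_cli/kanban_db.py`. The function name `ps_says_alive` and the injectable `run` parameter are assumptions made for illustration; the real code may be structured differently.

```python
import subprocess


def ps_says_alive(pid: int, run=subprocess.run) -> bool:
    """Second-stage liveness check, meant to run after kill(pid, 0) succeeded.

    `run` is injectable so the probe can be exercised without a real `ps`
    (an assumption of this sketch, not necessarily how the project tests it).
    """
    try:
        result = run(
            ["ps", "-o", "stat=", "-p", str(pid)],
            capture_output=True,
            text=True,
            timeout=1,  # 1s timeout, as stated in the PR description
        )
    except (OSError, subprocess.SubprocessError):
        # The probe itself failed (ps missing, timeout, ...): keep the
        # optimistic kill(0) answer.
        return True
    if result.returncode != 0:
        return False  # ps found no such process
    # A BSD stat field such as "Z+" marks a zombie.
    return "Z" not in result.stdout
```

A test in the spirit of the one the PR adds could pass a fake `run` that returns an object with `returncode=0` and `stdout="Z+\n"`, then assert that the probe reports the worker as dead.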

Validation

| Scenario | Before | After |
|---|---|---|
| macOS: worker crashes at startup | `_pid_alive` → True for ~15 min until claim_expires; dispatcher re-spawns 3+ times | `_pid_alive` → False on next tick; task reclaimed via crashed-worker path |
| Linux: zombie worker | `/proc/<pid>/status` peek returns False (unchanged) | unchanged |
| Windows / other POSIX | no zombie check (unchanged) | unchanged |
| `ps` probe fails (unexpected) | N/A | falls back to `kill(0)` answer (optimistic) |
| Targeted tests | N/A | 174/174 pass (`test_kanban_core_functionality` + `test_kanban_db`) |
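For comparison, the unchanged Linux path relies on the `State:` line of `/proc/<pid>/status`, which reads `State:	Z (zombie)` for a zombie. The sketch below shows one way such a check could look; `status_reports_zombie` and `linux_pid_is_zombie` are hypothetical names, and the project's actual parsing may differ.

```python
def status_reports_zombie(status_text: str) -> bool:
    """True if a /proc/<pid>/status dump has a State line whose code is Z."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split(":", 1)[1].split()
            return bool(fields) and fields[0] == "Z"
    return False  # no State line found


def linux_pid_is_zombie(pid: int) -> bool:
    """Read /proc/<pid>/status and report whether the process is a zombie."""
    try:
        with open(f"/proc/{pid}/status", encoding="utf-8", errors="replace") as fh:
            return status_reports_zombie(fh.read())
    except OSError:
        # No /proc or no such entry: make no zombie claim.
        return False
```

Keeping the text parsing separate from the file read makes the check testable with a literal status string on any platform.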

Closes #20015

Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com>

fix(kanban): detect darwin zombie workers

93e70f4

teknium1 merged commit 1a03e3b into main May 5, 2026
10 of 11 checks passed

teknium1 deleted the hermes/hermes-9ddf5187 branch May 5, 2026 11:43

teknium1 mentioned this pull request May 5, 2026

fix(kanban): detect darwin zombie workers #20023

Closed

alt-glitch added labels type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), comp/cli (CLI entry point, hermes_cli/, setup wizard) May 5, 2026

BrewTestBot mentioned this pull request May 7, 2026

hermes-agent 2026.5.7 Homebrew/homebrew-core#281437

Merged

1 task

github-actions Bot mentioned this pull request May 8, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.30 to v2026.5.7 Docker-Hub-sirmark/docker-hermes-agent#5

Merged
