Summary
_pid_alive() in hermes_cli/kanban_db.py only implements zombie detection on Linux (parsing /proc/<pid>/status for State: Z). On macOS, os.kill(pid, 0) returns success for defunct/zombie processes, so a worker that crashes immediately stays "alive" to the dispatcher until claim_expires times out (~15 min default).
Where
hermes_cli/kanban_db.py:2158-2173
The docstring at lines 2136-2144 even admits this:
On Linux we additionally peek at /proc/<pid>/status and treat State: Z
as dead. On other POSIX or on Windows the zombie check is a no-op.
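For readers without the source open, the behaviour that docstring describes can be sketched roughly as follows. This is not the code from kanban_db.py: the names is_zombie_linux and pid_alive_sketch are hypothetical, and the sketch is POSIX-only (os.kill with signal 0 means something different on Windows).

```python
import os
import sys


def is_zombie_linux(pid: int) -> bool:
    """Return True if /proc/<pid>/status reports 'State: Z' (Linux only)."""
    try:
        with open(f"/proc/{pid}/status", "r", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if line.startswith("State:"):
                    # The line looks like "State:\tZ (zombie)".
                    parts = line.split()
                    return len(parts) > 1 and parts[1] == "Z"
    except OSError:
        # No readable /proc entry (process gone, or no procfs): not a known zombie.
        return False
    return False


def pid_alive_sketch(pid: int) -> bool:
    """Hypothetical stand-in for the liveness check described in the docstring."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the PID exists but belongs to another user
    if sys.platform.startswith("linux"):
        return not is_zombie_linux(pid)
    # Other platforms: no zombie check, which is the gap this issue is about.
    return True
```

On macOS the last branch is the one that runs, so a defunct worker is reported as alive.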
Reproduction
- Run the kanban dispatcher on macOS with a ~5 min cadence.
- Assign a worker a task that crashes it immediately (e.g. one that requires a skill it doesn't have, or a missing credential that triggers an unhandled exception at startup).
- os.kill(pid, 0) succeeds against the defunct process because its process table entry still exists.
- The dispatcher sees the worker as alive and does NOT re-queue the task until claim_expires (~15 min later).
- This creates a zombie-respawn loop: the dispatcher tries again every N minutes, hits the same crash, and the task stays stuck until manual SQL intervention.
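The os.kill(pid, 0) behaviour can be checked without Hermes at all. The snippet below is a minimal standalone sketch, not the dispatcher's code: it forks a child that exits immediately, leaves it unreaped, and probes it before and after waitpid(). The probe and demo names are made up for illustration, and the short sleep is only there to give the child time to exit.

```python
import os
import time


def probe(pid: int) -> bool:
    """Return True if os.kill(pid, 0) succeeds, i.e. the PID is still in the process table."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def demo() -> tuple:
    pid = os.fork()
    if pid == 0:
        os._exit(1)  # child: simulate a worker that crashes at startup
    time.sleep(0.2)  # let the child exit; it is now a zombie until reaped
    before_reap = probe(pid)  # the zombie still has a process table entry
    os.waitpid(pid, 0)  # reap the child, which removes the entry
    after_reap = probe(pid)
    return before_reap, after_reap


if __name__ == "__main__":
    print(demo())  # (True, False)
```

The first value is what the dispatcher currently sees on macOS for a crashed worker. The second is what it would see once the zombie has been reaped.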
Impact
Tasks stay stuck in the running state for up to 15 minutes on macOS, and breaking the loop requires manual sqlite3 surgery. With a 5-minute dispatcher cadence and the default claim_expires of 15 minutes, users see 3+ wasted spawn attempts per stuck task.
Suggested Fix
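On Darwin, use proc_pidinfo() or kqueue with EVFILT_PROC to detect zombie state. As far as I can tell from the libproc headers, the process status is not part of the proc_taskinfo struct returned by PROC_PIDTASKINFO; it is the pbi_status field of struct proc_bsdinfo (flavor PROC_PIDTBSDINFO), which should read SZOMB for a zombie. A simpler fallback: check if the process group leader is still alive.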
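As a stopgap that avoids ctypes bindings to libproc, the state column from ps could be consulted instead. To be clear, this is a swapped-in technique, not proc_pidinfo or kqueue: it shells out to `ps -o stat= -p <pid>`, which both the BSD ps on macOS and procps on Linux accept, and treats a state starting with Z as a zombie. The function name ps_reports_zombie is hypothetical, and it returns None whenever ps is missing or gives no usable answer.

```python
import shutil
import subprocess
from typing import Optional


def ps_reports_zombie(pid: int) -> Optional[bool]:
    """Return True/False if ps reported a state for pid, None if it could not be determined."""
    ps = shutil.which("ps")
    if ps is None:
        return None  # no ps binary on PATH
    try:
        result = subprocess.run(
            [ps, "-o", "stat=", "-p", str(pid)],
            capture_output=True,
            text=True,
            timeout=5,
            check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    state = result.stdout.strip()
    if result.returncode != 0 or not state:
        return None  # no such PID, or this ps does not support the options
    # Zombie states print as "Z", possibly with modifiers such as "Z+".
    return state.startswith("Z")
```

A caller would only run this after os.kill(pid, 0) has succeeded, and would treat True as "dead". Spawning ps on every liveness check has a cost, so a native proc_pidinfo call is probably still the better long-term fix.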
Environment
- macOS (any version)
- Hermes Agent v0.11.0 (a7fb79e)