fix(kanban): skip dispatch for tasks assigned to non-profile lanes (salvages #20105, #20134) by teknium1 · Pull Request #20165 · NousResearch/hermes-agent

teknium1 · 2026-05-05T11:11:47Z

Kanban dispatcher no longer crash-loops on tasks assigned to names that aren't real Hermes profiles, and the stuck-queue warning only fires when there's genuine spawnable work sitting idle.

Root cause: dispatch_once() claimed any ready+assigned task and shelled out hermes -p <assignee> chat -q .... When <assignee> named a control-plane terminal lane (e.g. orion-cc) rather than a profile on disk, the subprocess died with "Profile 'X' does not exist", was reaped as a zombie, the TTL detector released the claim back to ready, and the next tick re-spawned the same failing worker — forever.

Salvaged from #20105 + #20134 (@Brecht-H).

Changes

* hermes_cli/kanban_db.py: dispatch_once() pre-checks profile_exists(assignee) before claiming; non-matches route into a new DispatchResult.skipped_nonspawnable bucket (separate from skipped_unassigned).
* hermes_cli/kanban_db.py: new has_spawnable_ready(conn) helper returns True only if ≥1 ready+assigned+unclaimed task has an assignee that resolves to a real profile.
* gateway/run.py + hermes_cli/kanban.py: both dispatchers swap their ready_nonempty probe to has_spawnable_ready, so the "dispatcher stuck" WARN no longer fires on multi-lane hosts where the queue is healthy but none of the ready tasks target a spawnable profile.
* tests/hermes_cli/conftest.py: new all_assignees_spawnable fixture monkeypatches profile_exists → True for tests that use synthetic assignees. Threaded through 8 dispatcher tests that the profile-exists guard would otherwise have silently broken.

Defensive import: both profile_exists lookups fall back to legacy "any ready+assigned" behavior if hermes_cli.profiles is unimportable, so degraded installs still surface the original warn.
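The readiness probe and its fallback can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/kanban_db.py: the `tasks` table name, its column layout, the `demo_connection` helper, and passing `profile_exists` in as a parameter are all assumptions made so the example runs standalone (only `claim_lock`, `status`, and `assignee` are names that appear in this PR).

```python
import sqlite3
from typing import Callable, Optional


def has_spawnable_ready(
    conn: sqlite3.Connection,
    profile_exists: Optional[Callable[[str], bool]],
) -> bool:
    """True if at least one ready, assigned, unclaimed task targets a real profile."""
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL AND claim_lock IS NULL"
    ).fetchall()
    if profile_exists is None:
        # Degraded install: fall back to the legacy "any ready+assigned" probe.
        return bool(rows)
    return any(profile_exists(assignee) for (assignee,) in rows)


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory queue using a hypothetical schema (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT, status TEXT, assignee TEXT, claim_lock TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?, ?)",
        [
            ("t1", "ready", "orion-cc", None),    # terminal lane, not a profile
            ("t2", "ready", None, None),          # unassigned
            ("t3", "running", "worker", "lock"),  # already claimed
        ],
    )
    return conn
```

With a checker that only recognises a profile named `worker`, the demo queue reports no spawnable work, because the only ready and assigned task belongs to `orion-cc`. Passing `None` as the checker reproduces the legacy answer for the same queue.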

Validation

| Scenario | Before | After |
|---|---|---|
| Task assigned to `orion-cc` (not a profile) | permanent crash loop, 2 zombies/tick, `spawned=1 crashed=1` every minute | silent skip, `skipped_nonspawnable=1`, no claim, no zombie |
| Multi-lane queue full of terminal-lane assignees | `dispatcher stuck` WARN every 5 min | silent, `has_spawnable_ready=False` |
| Real profile missing PATH/venv/creds | `dispatcher stuck` WARN still fires after 6 ticks | unchanged (safety net intact) |
| Targeted tests | n/a | 246/246 pass (`test_kanban_{db,cli,boards,core_functionality}`) |

Live-verified by @Brecht-H on his Orion multi-lane host: 2-hour crash loop on t_a14dc1d5 + t_646c96f2 terminated on gateway restart; dispatcher silent on every subsequent tick; stale running claims reclaimed cleanly to ready.
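The interaction between the stuck-queue warning and the new probe can be pictured with a small counter. This is a hedged sketch only: the `StuckDetector` class and its `observe` method are hypothetical names, and the threshold of 6 ticks is taken from the validation table rather than from the dispatcher source.

```python
from dataclasses import dataclass


@dataclass
class StuckDetector:
    """Counts consecutive ticks with spawnable work but no spawned workers."""

    threshold: int = 6  # ticks before warning (assumption, from the table above)
    idle_ticks: int = 0

    def observe(self, spawnable_ready: bool, spawned: int) -> bool:
        """Record one tick; return True when the stuck warning should fire."""
        if spawnable_ready and spawned == 0:
            self.idle_ticks += 1
        else:
            # Either nothing spawnable is waiting, or work actually started.
            self.idle_ticks = 0
        return self.idle_ticks >= self.threshold
```

A queue holding only terminal-lane tasks always passes `spawnable_ready=False`, so the counter never advances and the warning stays silent. A real profile whose workers fail to start keeps the counter climbing until the threshold is reached.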

Closes #20054
Closes #20105
Closes #20134
Supersedes #20065 (readiness check lives at a tighter call site — before claim, not before spawn)

Co-authored-by: Brecht-H 73849650+Brecht-H@users.noreply.github.com

…l profile

The kanban dispatcher's ``_default_spawn`` invokes ``hermes -p <task.assignee> chat -q ...``. When ``assignee`` names a control-plane lane (e.g. an interactive Claude Code terminal like ``orion-cc`` / ``orion-research``) instead of a real Hermes profile, the subprocess fails on startup with "Profile 'X' does not exist", gets reaped as a zombie, the TTL/crash detector marks the task back to ``ready``, and the next tick re-spawns the same crashing worker. Result: a permanent crash loop emitting ``spawned=2 crashed=2 every tick`` in the gateway log and burning CPU forever.

Reproduce on a fresh Hermes-agent install:

    # 1. Create a kanban task whose assignee names a non-profile.
    hermes kanban create --assignee orion-cc --status ready \
        --title "Review PR #N" --body "..."

    # 2. Start the gateway with the embedded dispatcher.
    hermes gateway run
    # gateway.log lines every minute:
    #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...

    # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---

``dispatch_once()`` now pre-checks ``hermes_cli.profiles.profile_exists(assignee)`` before claiming. If False, the row is added to ``skipped_unassigned`` (it's effectively "unassigned-to-an-executable-profile") and the dispatcher moves on without claiming, spawning, or counting a crash.

The check is opt-in safe: if the import fails (e.g. test isolation, profile module restructured), ``profile_exists`` falls back to ``None`` and the original behaviour is preserved unchanged.

This addresses the explicit hint in the kanban task body (``t_2bab06e3``): "Should ready-state tasks auto-spawn at all, or only on explicit orion-cc claim? If spurious, gate the auto-spawn behind a config flag (e.g. only assignee=hermes or assignee=auto)."

Profile-existence is a tighter gate than a config flag: it self-documents (the user already knows whether they have an ``orion-cc`` profile), and it doesn't require Mac to maintain an allowlist as new lane names appear. New lanes that ARE real profiles (created via ``hermes profile create``) auto-qualify the moment the profile dir is created.

Validated live
--------------

On Orion's hermes-agent install, two ``orion-research``-assigned tasks (Bug A and Bug C investigations) had been crash-looping since 2026-05-05 06:58 local. After applying the patch + restarting the gateway:

- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and has ticked silently for 2+ minutes: no spawned=, crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``, ``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.

Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick; current code re-imports ``profile_exists`` per row so a freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
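The "opt-in safe" import described in this commit message might look something like the sketch below. It is an illustration under stated assumptions, not the patch itself: `load_profile_checker` and `is_spawnable` are hypothetical helper names, and only the `hermes_cli.profiles.profile_exists` import path comes from the commit text.

```python
from typing import Callable, Optional


def load_profile_checker() -> Optional[Callable[[str], bool]]:
    """Return hermes_cli.profiles.profile_exists, or None if it cannot be imported."""
    try:
        from hermes_cli.profiles import profile_exists  # path named in the commit
    except Exception:
        # Test isolation, a restructured module, or a degraded install.
        return None
    return profile_exists


def is_spawnable(
    assignee: Optional[str],
    checker: Optional[Callable[[str], bool]],
) -> bool:
    """Decide whether the dispatcher should try to spawn a worker for this assignee."""
    if not assignee:
        return False  # unassigned tasks are never spawned
    if checker is None:
        return True   # legacy behaviour: any assigned task is dispatched
    return bool(checker(assignee))
```

Resolving the checker once and passing it in keeps the fallback decision in one place. The commit notes that the real code re-imports per row, so treat that difference as a simplification of this sketch.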

…ly non-spawnable assignees

After PR #20105 (dispatcher skips ready tasks whose assignee fails ``profile_exists()`` to prevent the orion-cc/orion-research crash loop), the gateway and CLI emit a spurious "kanban dispatcher stuck: ready queue non-empty for N consecutive ticks but 0 workers spawned" warning every 5 minutes on multi-lane setups where the queue is steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH, missing venv, credential loss for a real Hermes profile). On a multi-lane host it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn, and there is nothing for the operator to fix.

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field (separate from ``skipped_unassigned``) so callers can distinguish "task missing an owner, operator should route it" from "task owned by a control-plane lane, terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. Falls back to legacy "any ready+assigned" when ``profile_exists`` is unimportable so degraded installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone daemon (``hermes_cli/kanban.py``) both swap their cheap ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee — terminal lane, OK)`` for visibility without alarm.

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket`` verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no cross-leak.
* Added ``all_assignees_spawnable`` fixture in ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher tests that use synthetic assignees ("alice", "bob"). PR #20105 (the parent commit) silently broke 8 such tests by routing those assignees into ``skipped_nonspawnable`` instead of spawning; this PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05 (PR #20105). Reviewer: this PR is meant to merge AFTER #20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
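The bucketing change described in this commit can be sketched as below. This is a simplified stand-in rather than the real ``dispatch_once``: tasks are plain dicts instead of DB rows, the buckets are assumed to be lists of task ids, ``profile_exists`` is passed in as a parameter, and the claim and spawn steps are reduced to a comment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DispatchResult:
    # Bucket types are an assumption; the real fields may be counts or rows.
    spawned: List[str] = field(default_factory=list)
    skipped_unassigned: List[str] = field(default_factory=list)
    skipped_nonspawnable: List[str] = field(default_factory=list)


def dispatch_once(
    ready_tasks: List[Dict[str, Optional[str]]],
    profile_exists: Callable[[str], bool],
) -> DispatchResult:
    """Sort ready tasks into buckets before any claim or spawn happens."""
    result = DispatchResult()
    for task in ready_tasks:
        task_id, assignee = task["id"], task["assignee"]
        if not assignee:
            # No owner: the operator should route it.
            result.skipped_unassigned.append(task_id)
        elif not profile_exists(assignee):
            # Owned by a control-plane lane: the terminal will pull it.
            result.skipped_nonspawnable.append(task_id)
        else:
            # The real dispatcher would claim the task and spawn a worker here.
            result.spawned.append(task_id)
    return result
```

Because the existence check runs before the claim, a terminal-lane task never acquires a claim lock or a worker pid, which is consistent with the live validation notes in the parent commit.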

Brecht-H and others added 2 commits May 5, 2026 04:09

teknium1 merged commit f25d3ec into main May 5, 2026
10 of 11 checks passed

teknium1 deleted the hermes/hermes-9ddf5187 branch May 5, 2026 11:13

This was referenced May 5, 2026

fix(kanban): validate worker profile before spawn #20065

Closed

fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile #20105

Closed

fix(kanban): suppress dispatcher stuck-warn for non-spawnable assignees #20134

Closed

alt-glitch added the type/bug, P3, comp/plugins, and comp/cli labels May 5, 2026

teknium1 mentioned this pull request May 5, 2026

chore(kanban): tier-2 batch salvage — doctor, started_at, parent-guard, latest_summary, selects, linked-children (closes #18344 #20022 #19473 #19828 #19743 #20251 #20019) #20448

Merged

BrewTestBot mentioned this pull request May 7, 2026

hermes-agent 2026.5.7 Homebrew/homebrew-core#281437

Merged

1 task

github-actions Bot mentioned this pull request May 8, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.30 to v2026.5.7 Docker-Hub-sirmark/docker-hermes-agent#5

Merged
