fix(kanban): skip dispatch for tasks assigned to non-profile lanes (salvages #20105, #20134) #20165
Merged
…l profile
The kanban dispatcher's `_default_spawn` invokes
``hermes -p <task.assignee> chat -q ...``. When ``assignee``
names a control-plane lane (e.g. an interactive Claude Code
terminal like ``orion-cc`` / ``orion-research``) instead of a
real Hermes profile, the subprocess fails on startup with
"Profile 'X' does not exist", gets reaped as a zombie, the
TTL/crash detector marks the task back to ``ready``, and the
next tick re-spawns the same crashing worker. Result: a
permanent crash loop emitting ``spawned=2 crashed=2`` every tick
in the gateway log and burning CPU forever.
Reproduce on a fresh Hermes-agent install:

    # 1. Create a kanban task whose assignee names a non-profile.
    hermes kanban create --assignee orion-cc --status ready \
        --title "Review PR #N" --body "..."

    # 2. Start the gateway with the embedded dispatcher.
    hermes gateway run
    # gateway.log lines every minute:
    #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...

    # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---
``dispatch_once()`` now pre-checks
``hermes_cli.profiles.profile_exists(assignee)`` before claiming.
If False, the row
is added to ``skipped_unassigned`` (it's effectively
"unassigned-to-an-executable-profile") and the dispatcher
moves on without claiming, spawning, or counting a crash.
The check fails open: if the import fails (e.g. test
isolation, profile module restructured), ``profile_exists``
falls back to ``None``, the guard is skipped, and the original
behaviour is preserved unchanged.
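The guard and its import fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the patched code: ``Task``, ``DispatchResultSketch``, ``dispatch_once_sketch`` and the injected ``spawn`` callable are hypothetical stand-ins; only ``hermes_cli.profiles.profile_exists`` and the ``skipped_unassigned`` bucket name are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional

# Fail-open import: if the profiles module cannot be imported, the guard is
# disabled (None) and every assigned task is dispatched as before.
try:
    from hermes_cli.profiles import profile_exists  # name from the commit message
except Exception:
    profile_exists = None


@dataclass
class Task:  # hypothetical stand-in for a kanban row
    task_id: str
    assignee: Optional[str]


@dataclass
class DispatchResultSketch:  # hypothetical, simplified result object
    spawned: List[str] = field(default_factory=list)
    skipped_unassigned: List[str] = field(default_factory=list)


def dispatch_once_sketch(
    tasks: Iterable[Task],
    spawn: Callable[[Task], None],
    exists: Optional[Callable[[str], bool]] = profile_exists,
) -> DispatchResultSketch:
    result = DispatchResultSketch()
    for task in tasks:
        if not task.assignee:
            result.skipped_unassigned.append(task.task_id)
            continue
        # Pre-check BEFORE claiming: a non-profile assignee is never
        # claimed, never spawned, and never counted as a crash.
        if exists is not None and not exists(task.assignee):
            result.skipped_unassigned.append(task.task_id)
            continue
        spawn(task)  # stands in for claim + subprocess launch
        result.spawned.append(task.task_id)
    return result
```

Passing ``exists=None`` reproduces the pre-patch behaviour, which is the property the fallback is meant to keep.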
This addresses the explicit hint in the kanban task body
(``t_2bab06e3``):
"Should ready-state tasks auto-spawn at all, or only on
explicit orion-cc claim? If spurious, gate the auto-spawn
behind a config flag (e.g. only assignee=hermes or
assignee=auto)."
Profile-existence is a tighter gate than a config flag — it
self-documents (the user already knows whether they have an
``orion-cc`` profile), and it doesn't require Mac to maintain
an allowlist as new lane names appear. New lanes that ARE
real profiles (created via ``hermes profile create``) auto-
qualify the moment the profile dir is created.
Validated live
--------------
On Orion's hermes-agent install, two ``orion-research``-
assigned tasks (Bug A and Bug C investigations) had been
crash-looping since 2026-05-05 06:58 local. After applying
the patch + restarting the gateway:
- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and
has ticked silently for 2+ minutes — no spawned=,
crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``,
``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.
Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick —
current code re-imports ``profile_exists`` per row so a
freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ly non-spawnable assignees

After PR #20105 (dispatcher skips ready tasks whose assignee fails ``profile_exists()`` to prevent the orion-cc/orion-research crash loop), the gateway and CLI emit a spurious "kanban dispatcher stuck: ready queue non-empty for N consecutive ticks but 0 workers spawned" warning every 5 minutes on multi-lane setups where the queue is steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH, missing venv, credential loss for a real Hermes profile). On a multi-lane host it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn, and there is nothing for the operator to fix.

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field (separate from ``skipped_unassigned``) so callers can distinguish "task missing an owner, operator should route it" from "task owned by a control-plane lane, terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. Falls back to legacy "any ready+assigned" when ``profile_exists`` is unimportable so degraded installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone daemon (``hermes_cli/kanban.py``) both swap their cheap ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee — terminal lane, OK)`` for visibility without alarm.

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket`` verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no cross-leak.
* Added ``all_assignees_spawnable`` fixture in ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher tests that use synthetic assignees ("alice", "bob"). PR #20105 (the parent commit) silently broke 8 such tests by routing those assignees into ``skipped_nonspawnable`` instead of spawning; this PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05 (PR #20105). Reviewer: this PR is meant to merge AFTER #20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
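A minimal sketch of the probe and the stuck-warn counter it feeds, assuming a hypothetical ``tasks`` table with ``status``, ``assignee`` and ``claim_lock`` columns. The schema, the ``_sketch`` suffix, ``make_demo_db`` and the ``StuckDetector`` class are illustrative inventions; the real ``has_spawnable_ready(conn)`` signature and query may differ.

```python
import sqlite3
from typing import Callable, Iterable, Optional, Tuple


def has_spawnable_ready_sketch(
    conn: sqlite3.Connection,
    profile_exists: Optional[Callable[[str], bool]],
) -> bool:
    """True iff some ready, assigned, unclaimed task targets a real profile."""
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' "
        "AND assignee IS NOT NULL AND assignee != '' "
        "AND claim_lock IS NULL"
    ).fetchall()
    if profile_exists is None:
        # Degraded install: fall back to the legacy "any ready+assigned" probe.
        return bool(rows)
    return any(profile_exists(name) for (name,) in rows)


class StuckDetector:
    """Reports 'stuck' once spawnable work has sat idle for `threshold` ticks."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.idle_ticks = 0

    def observe(self, spawnable_ready: bool, spawned: int) -> bool:
        if spawnable_ready and spawned == 0:
            self.idle_ticks += 1
        else:
            self.idle_ticks = 0  # nothing spawnable, or progress was made
        return self.idle_ticks >= self.threshold


def make_demo_db(rows: Iterable[Tuple]) -> sqlite3.Connection:
    """Build an in-memory DB using the hypothetical schema described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT, status TEXT, assignee TEXT, claim_lock TEXT)"
    )
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?)", list(rows))
    return conn
```

With only terminal-lane tasks in the queue the probe returns False, so the counter never advances and no warning is emitted; a single task assigned to a real profile is enough to re-arm it.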
Kanban dispatcher no longer crash-loops on tasks assigned to names that aren't real Hermes profiles, and the stuck-queue warning only fires when there's genuine spawnable work sitting idle.
Root cause: `dispatch_once()` claimed any ready+assigned task and shelled out `hermes -p <assignee> chat -q ...`. When `<assignee>` named a control-plane terminal lane (e.g. `orion-cc`) rather than a profile on disk, the subprocess died with "Profile 'X' does not exist", was reaped as a zombie, the TTL detector released the claim back to `ready`, and the next tick re-spawned the same failing worker, forever.

Salvaged from #20105 + #20134 (@Brecht-H).

Changes
- `hermes_cli/kanban_db.py`: `dispatch_once()` pre-checks `profile_exists(assignee)` before claiming; non-matches route into a new `DispatchResult.skipped_nonspawnable` bucket (separate from `skipped_unassigned`).
- `hermes_cli/kanban_db.py`: new `has_spawnable_ready(conn)` helper returns True only if ≥1 ready+assigned+unclaimed task has an assignee that resolves to a real profile.
- `gateway/run.py` + `hermes_cli/kanban.py`: both dispatchers swap their `ready_nonempty` probe to `has_spawnable_ready`, so the "dispatcher stuck" WARN no longer fires on multi-lane hosts where the queue is healthy but none of the ready tasks target a spawnable profile.
- `tests/hermes_cli/conftest.py`: new `all_assignees_spawnable` fixture monkeypatches `profile_exists → True` for tests that use synthetic assignees. Threaded through 8 dispatcher tests that the profile-exists guard would otherwise have silently broken.

Defensive import: both `profile_exists` lookups fall back to legacy "any ready+assigned" behavior if `hermes_cli.profiles` is unimportable, so degraded installs still surface the original warn.

Validation
- `orion-cc` (not a profile): `spawned=1 crashed=1` every minute; with the fix, `skipped_nonspawnable=1`, no claim, no zombie
- `dispatcher stuck` WARN every 5 min; with the fix, `has_spawnable_ready=False`
- `dispatcher stuck` WARN still fires after 6 ticks
- … `test_kanban_{db,cli,boards,core_functionality}`)

Live-verified by @Brecht-H on his Orion multi-lane host: 2-hour crash loop on `t_a14dc1d5` + `t_646c96f2` terminated on gateway restart; dispatcher silent on every subsequent tick; stale `running` claims reclaimed cleanly to `ready`.

Closes #20054
Closes #20105
Closes #20134
Supersedes #20065 (readiness check lives at a tighter call site — before claim, not before spawn)
Co-authored-by: Brecht-H <73849650+Brecht-H@users.noreply.github.com>