
fix(kanban): skip dispatch for tasks assigned to non-profile lanes (salvages #20105, #20134) #20165

Merged
teknium1 merged 2 commits into main from hermes/hermes-9ddf5187
May 5, 2026

@teknium1 teknium1 commented May 5, 2026

Kanban dispatcher no longer crash-loops on tasks assigned to names that aren't real Hermes profiles, and the stuck-queue warning only fires when there's genuine spawnable work sitting idle.

Root cause: dispatch_once() claimed any ready+assigned task and shelled out hermes -p <assignee> chat -q .... When <assignee> named a control-plane terminal lane (e.g. orion-cc) rather than a profile on disk, the subprocess died with "Profile 'X' does not exist", was reaped as a zombie, the TTL detector released the claim back to ready, and the next tick re-spawned the same failing worker — forever.

Salvaged from #20105 + #20134 (@Brecht-H).

Changes

  • hermes_cli/kanban_db.py: dispatch_once() pre-checks profile_exists(assignee) before claiming; non-matches route into a new DispatchResult.skipped_nonspawnable bucket (separate from skipped_unassigned).
  • hermes_cli/kanban_db.py: new has_spawnable_ready(conn) helper returns True only if ≥1 ready+assigned+unclaimed task has an assignee that resolves to a real profile.
  • gateway/run.py + hermes_cli/kanban.py: both dispatchers swap their ready_nonempty probe to has_spawnable_ready, so "dispatcher stuck" WARN no longer fires on multi-lane hosts where the queue is healthy but none of the ready tasks target a spawnable profile.
  • tests/hermes_cli/conftest.py: new all_assignees_spawnable fixture monkeypatches profile_exists → True for tests that use synthetic assignees. Threaded through 8 dispatcher tests that the profile-exists guard would otherwise have silently broken.
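
The pre-check and bucketing described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in hermes_cli/kanban_db.py: the `DispatchResult` here carries only three fields, and `dispatch_once_sketch`, the `(task_id, assignee)` tuples, and the injected `spawn` callable are hypothetical stand-ins for the real claim-and-spawn path.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class DispatchResult:
    """Simplified stand-in; the real DispatchResult likely has more fields."""
    spawned: List[str] = field(default_factory=list)
    skipped_unassigned: List[str] = field(default_factory=list)
    skipped_nonspawnable: List[str] = field(default_factory=list)


def dispatch_once_sketch(
    ready_tasks: Iterable[Tuple[str, Optional[str]]],
    profile_exists: Optional[Callable[[str], bool]],
    spawn: Callable[[str, str], None],
) -> DispatchResult:
    """Hypothetical dispatcher loop: check the profile BEFORE claiming."""
    result = DispatchResult()
    for task_id, assignee in ready_tasks:
        if not assignee:
            # No owner at all: the operator should route this task.
            result.skipped_unassigned.append(task_id)
            continue
        if profile_exists is not None and not profile_exists(assignee):
            # Owned by a terminal lane, not a profile: no claim, no spawn.
            result.skipped_nonspawnable.append(task_id)
            continue
        # Assumption: claiming and spawning are folded into one callable here.
        spawn(task_id, assignee)
        result.spawned.append(task_id)
    return result
```

Passing `profile_exists=None` in this sketch reproduces the legacy behaviour, in which every assigned task is spawned regardless of whether the profile exists.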

Defensive import: both profile_exists lookups fall back to legacy "any ready+assigned" behavior if hermes_cli.profiles is unimportable, so degraded installs still surface the original warn.

Validation

| Scenario | Before | After |
| --- | --- | --- |
| Task assigned to orion-cc (not a profile) | permanent crash loop, 2 zombies/tick, spawned=1 crashed=1 every minute | silent skip, skipped_nonspawnable=1, no claim, no zombie |
| Multi-lane queue full of terminal-lane assignees | dispatcher stuck WARN every 5 min | silent (has_spawnable_ready=False) |
| Real profile missing PATH/venv/creds | dispatcher stuck WARN still fires after 6 ticks | unchanged (safety net intact) |

Targeted tests: 246/246 pass (test_kanban_{db,cli,boards,core_functionality})

Live-verified by @Brecht-H on his Orion multi-lane host: 2-hour crash loop on t_a14dc1d5 + t_646c96f2 terminated on gateway restart; dispatcher silent on every subsequent tick; stale running claims reclaimed cleanly to ready.

Closes #20054
Closes #20105
Closes #20134
Supersedes #20065 (readiness check lives at a tighter call site — before claim, not before spawn)

Co-authored-by: Brecht-H <73849650+Brecht-H@users.noreply.github.com>

Brecht-H and others added 2 commits May 5, 2026 04:09
…l profile

The kanban dispatcher's `_default_spawn` invokes
``hermes -p <task.assignee> chat -q ...``. When ``assignee``
names a control-plane lane (e.g. an interactive Claude Code
terminal like ``orion-cc`` / ``orion-research``) instead of a
real Hermes profile, the subprocess fails on startup with
"Profile 'X' does not exist", gets reaped as a zombie, the
TTL/crash detector marks the task back to ``ready``, and the
next tick re-spawns the same crashing worker. Result: a
permanent crash loop emitting ``spawned=2 crashed=2`` every tick
in the gateway log and burning CPU forever.

Reproduce on a fresh Hermes-agent install:

  # 1. Create a kanban task whose assignee names a non-profile.
  hermes kanban create --assignee orion-cc --status ready \
      --title "Review PR #N" --body "..."
  # 2. Start the gateway with the embedded dispatcher.
  hermes gateway run
  # gateway.log lines every minute:
  #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...
  # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---
``dispatch_once()`` now pre-checks
``hermes_cli.profiles.profile_exists(assignee)`` before
claiming. If False, the row is added to ``skipped_unassigned``
(it's effectively "unassigned-to-an-executable-profile") and
the dispatcher moves on without claiming, spawning, or
counting a crash.

The check is opt-in safe: if the import fails (e.g. test
isolation, profile module restructured), ``profile_exists``
falls back to ``None`` and the original behaviour is preserved
unchanged.
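
A rough sketch of the import fallback described above, assuming only that ``hermes_cli.profiles.profile_exists`` is the import path (as named in this commit message). The helper names ``load_profile_exists`` and ``is_spawnable`` are hypothetical and exist only for illustration:

```python
from typing import Callable, Optional


def load_profile_exists() -> Optional[Callable[[str], bool]]:
    """Return profile_exists if it can be imported, else None."""
    try:
        # Import path taken from the commit message above.
        from hermes_cli.profiles import profile_exists
    except Exception:
        # Degraded install or isolated test run: signal "no guard available".
        return None
    return profile_exists


def is_spawnable(assignee: str, profile_exists: Optional[Callable[[str], bool]]) -> bool:
    """Hypothetical gate: with no guard available, keep the original behaviour."""
    if profile_exists is None:
        return True
    return bool(profile_exists(assignee))
```

Catching a broad ``Exception`` rather than only ``ImportError`` is a choice made for this sketch; it also covers a module that imports but fails during its own initialisation.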

This addresses the explicit hint in the kanban task body
(``t_2bab06e3``):

  "Should ready-state tasks auto-spawn at all, or only on
  explicit orion-cc claim? If spurious, gate the auto-spawn
  behind a config flag (e.g. only assignee=hermes or
  assignee=auto)."

Profile-existence is a tighter gate than a config flag — it
self-documents (the user already knows whether they have an
``orion-cc`` profile), and it doesn't require Mac to maintain
an allowlist as new lane names appear. New lanes that ARE
real profiles (created via ``hermes profile create``)
auto-qualify the moment the profile dir is created.

Validated live
--------------
On Orion's hermes-agent install, two
``orion-research``-assigned tasks (Bug A and Bug C
investigations) had been crash-looping since 2026-05-05 06:58
local. After applying the patch + restarting the gateway:

- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and
  has ticked silently for 2+ minutes — no spawned=,
  crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``,
  ``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.

Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick —
current code re-imports ``profile_exists`` per row so a
freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ly non-spawnable assignees

After PR #20105 (dispatcher skips ready tasks whose assignee fails
``profile_exists()`` to prevent the orion-cc/orion-research crash
loop), the gateway and CLI emit a spurious "kanban dispatcher stuck:
ready queue non-empty for N consecutive ticks but 0 workers spawned"
warning every 5 minutes on multi-lane setups where the queue is
steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH,
missing venv, credential loss for a real Hermes profile). On a
multi-lane host it fires forever even though everything is healthy:
the dispatcher correctly chose not to spawn, and there is nothing
for the operator to fix.
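
For context, the stuck-queue warning can be pictured as a counter of consecutive idle ticks. The sketch below is an assumption about its shape, not the gateway's code: the class name ``StuckDetector`` is hypothetical, the default threshold of 6 is borrowed from the PR's validation table, and the rate limiting that spaces real warnings about 5 minutes apart is omitted.

```python
class StuckDetector:
    """Hypothetical counter: warn after N consecutive idle ticks with work waiting."""

    def __init__(self, threshold: int = 6) -> None:
        self.threshold = threshold
        self.idle_ticks = 0

    def observe(self, work_waiting: bool, spawned: int) -> bool:
        """Record one dispatcher tick; return True if a warning is due."""
        if work_waiting and spawned == 0:
            self.idle_ticks += 1
        else:
            # Either nothing was waiting or a worker started: reset the streak.
            self.idle_ticks = 0
        return self.idle_ticks >= self.threshold
```

The spurious warning arises when ``work_waiting`` is fed by a probe that counts every ready task, including those the dispatcher deliberately skipped: the streak never resets even though nothing is broken.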

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field
  (separate from ``skipped_unassigned``) so callers can distinguish
  "task missing an owner — operator should route it" from "task
  owned by a control-plane lane — terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip
  into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least
  one ready+assigned+unclaimed task in the DB has an assignee that
  maps to a real Hermes profile. Falls back to legacy "any
  ready+assigned" when ``profile_exists`` is unimportable so degraded
  installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone
  daemon (``hermes_cli/kanban.py``) both swap their cheap
  ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn
  now fires only when there is genuine spawnable work the dispatcher
  failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee —
  terminal lane, OK)`` for visibility without alarm.
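
As an illustration of the ``has_spawnable_ready(conn)`` semantics listed above, here is a minimal sketch against SQLite. The schema is an assumption: a table named ``tasks`` with ``status``, ``assignee``, and ``claim_lock`` columns is hypothetical (only the field names ``claim_lock`` and the ``ready`` status appear in these commit messages), and ``profile_exists`` is injected as a parameter so the example stays self-contained.

```python
import sqlite3
from typing import Callable, Optional


def has_spawnable_ready_sketch(
    conn: sqlite3.Connection,
    profile_exists: Optional[Callable[[str], bool]] = None,
) -> bool:
    """True iff some ready+assigned+unclaimed task targets a real profile."""
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' "
        "AND assignee IS NOT NULL AND assignee != '' "
        "AND claim_lock IS NULL"
    ).fetchall()
    if profile_exists is None:
        # Legacy fallback: any ready+assigned task counts as work waiting.
        return bool(rows)
    # One lookup per distinct assignee, stopping at the first real profile.
    return any(profile_exists(name) for (name,) in rows)
```

Selecting distinct assignees keeps the number of profile lookups bounded by the number of lanes rather than the number of queued tasks.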

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane
  only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket``
  verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no
  cross-leak.
* Added ``all_assignees_spawnable`` fixture in
  ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher
  tests that use synthetic assignees ("alice", "bob"). PR #20105
  (the parent commit) silently broke 8 such tests by routing those
  assignees into ``skipped_nonspawnable`` instead of spawning; this
  PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05
(PR #20105). Reviewer: this PR is meant to merge AFTER #20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/plugins: Plugin system and bundled plugins
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

Kanban dispatcher should validate assignee profile readiness before spawning workers