Skip to content

feat(kanban): durable multi-profile collaboration board + dashboard GUI + dispatcher daemon#16100

Closed
teknium1 wants to merge 22 commits into
mainfrom
feat/kanban-standing
Closed

feat(kanban): durable multi-profile collaboration board + dashboard GUI + dispatcher daemon#16100
teknium1 wants to merge 22 commits into
mainfrom
feat/kanban-standing

Conversation

@teknium1
Copy link
Copy Markdown
Contributor

@teknium1 teknium1 commented Apr 26, 2026

Hermes Kanban — durable multi-profile collaboration board

Summary

Adds a first-class multi-agent collaboration surface to Hermes: a durable SQLite-backed task board (~/.hermes/kanban.db) shared across all profiles on the host, a long-lived dispatcher daemon that atomically claims ready tasks and spawns the assigned profile in an isolated workspace (with crash detection, a spawn-failure circuit breaker, max-runtime enforcement, and tick-level health telemetry), a /kanban slash command wired into both the interactive CLI and the gateway (with auto-notify-back), and a bundled dashboard plugin that gives the board a Linear/Fusion-style drag-drop UI with multi-select, inline editing, dependency editing, markdown rendering, touch support, live WebSocket updates, per-run Run History, and a Worker Log drawer panel.

Zero changes to run_agent.py. No new core tools. No tool-schema bloat. Every worker is a full OS process — no in-process subagent swarms, no SDK-lifecycle fragility.

Status: standing, ready for review

  • 15 commits on feat/kanban-standing, tip 123f8d0fe
  • 182/182 main kanban suite green under scripts/run_tests.sh
  • Opt-in stress suite in tests/stress/ — ~40k randomized operations, real multi-process concurrency, real subprocess E2E, scale benchmarks — zero invariant violations
  • Four audit passes + external review by @erosika + full battle-test pass all addressed
  • 10 dashboard screenshots + tutorial doc walking four user stories

Motivation

delegate_task is a synchronous function call: fork → join, anonymous subagent, no resumability, no human-in-the-loop. It's correct for short reasoning subtasks; it is the wrong shape for the workloads users asked for in the Nous Discord design thread:

  • Research triage — N parallel researchers + analyst + writer, human-in-the-loop for ambiguity.
  • Scheduled recurring workflows — daily AI-funding briefs that accumulate into an Obsidian vault.
  • Digital twins — named inbox-triage / ops-review profiles with persistent memory over weeks.
  • Engineering pipelines — decompose → parallel worktrees → review → iterate → PR.
  • Fleet work — one specialist profile managing 50 Instagram accounts, each with its own workspace.

Kanban is the shape that covers all four user-story types. They coexist with delegate_task: a kanban worker may call delegate_task internally for reasoning within its own run. The single test: does this handoff need to outlive a single API loop and be visible to others?

Design rationale, full comparison with Cline Kanban / Paperclip / NanoClaw / Google Gemini Enterprise, concurrency correctness argument, and implementation plan live in docs/hermes-kanban-v1-spec.pdf.

Architecture

                   ┌───────────────────────────┐
                   │   USER / GATEWAY          │   control plane
                   │   /kanban create | list…  │  (auto-subscribes
                   └────────────┬──────────────┘   on create)
                                │
     dashboard plugin ▶  ┌──────┴──────┐
      (drag-drop UI,     │ ~/.hermes/  │   state plane
       WS live feed,     │  kanban.db  │
       Run History,      │  WAL mode   │
       Worker log)       └──────┬──────┘
                                │  daemon every Ns
                   ┌────────────▼──────────────┐
                   │  DISPATCHER               │
                   │  1. reclaim stale claims  │
                   │  2. reclaim crashed PIDs  │
                   │  3. enforce max_runtime   │
                   │  4. todo → ready          │
                   │  5. atomic CAS claim      │
                   │  6. spawn worker proc     │
                   │     (tracked by PID)      │
                   │  7. circuit-break after   │
                   │     N spawn failures      │
                   │  8. health telemetry      │
                   └────────────┬──────────────┘
                                │
      ┌─────────────────┬───────┴────────┬─────────────────┐
      ▼                 ▼                ▼                 ▼
  profile:planner  profile:researcher  profile:reviewer  profile:ops   execution
  (own HERMES_HOME, memory, skills; own workspace per task)            plane

                                  ▲
                                  │  task_events tail
                   ┌──────────────┴──────────────┐
                   │  GATEWAY kanban-notifier    │  (completed / blocked /
                   │  pushes to subscribers      │   gave_up / crashed /
                   │  with @assignee prefix      │   timed_out)
                   └─────────────────────────────┘

Data model

  • tasks — the logical unit of work: title, body, assignee, status (triage/todo/ready/running/blocked/done/archived), tenant, priority, workspace, parents via task_links, optional idempotency_key, optional max_runtime_seconds, current_run_id pointer.
  • task_runs — one row per attempt. Claim opens a run; terminal transitions close it with an outcome (completed / blocked / crashed / timed_out / spawn_failed / gave_up / reclaimed). Carries the worker's summary, metadata, error, profile, step_key, started/ended timestamps, worker_pid. Multi-attempt history preserved forever.
  • task_events — append-only event log carrying run_id for per-attempt grouping. Kinds cluster into Lifecycle (created, promoted, completed, blocked, unblocked, archived), Edits (assigned, edited, reprioritized, status), and Worker telemetry (claimed, spawned, heartbeat, reclaimed, crashed, timed_out, spawn_failed, gave_up).
  • task_comments — freeform human-readable thread.
  • task_links — parent → child dependency edges.
  • kanban_notify_subs — gateway chat subscriptions for terminal-event notifications.

Invariant (enforced and tested): tasks.current_run_id IS NULL ⇔ the run row for that task is in a terminal state. Holds across CLI, dashboard, dispatcher (crash/timeout/reclaim), archive, and the migration backfill. Verified across ~40k randomized operations in the property fuzzer.

Forward-compat for v2: Nullable tasks.workflow_template_id + tasks.current_step_key + task_runs.step_key. v1 kernel ignores them; v2 workflow routing can land additively without a schema migration.

Collaboration patterns covered

Nine, all expressible with the base primitives (tasks + links + comments + assignee + workspace):

  1. Fan-out — N siblings, same role.
  2. Pipeline — role-specialized chain.
  3. Voting / quorum — N siblings + one aggregator.
  4. Long-running journal — same profile + shared dir workspace + cron.
  5. Human-in-the-loop — block → comment → unblock → re-spawn.
  6. @mention — inline routing from prose.
  7. Thread-scoped workspace/kanban here pins dir to the gateway thread.
  8. Fleet farming — one profile, N subjects, one workspace per subject.
  9. Triage specifier — rough idea → triage → specifier expands body → todo.

What shipped

Kernel (hermes_cli/kanban_db.py)

  • WAL-mode SQLite, CAS-based atomic claim inside BEGIN IMMEDIATE.
  • task_runs first-class (one row per attempt) with structured handoff (summary, metadata, error).
  • Stale-claim recovery, host-local crash detection with zombie-aware /proc check, spawn-failure circuit breaker (auto-blocks after N consecutive failures), max-runtime enforcement (SIGTERM → 5s grace → SIGKILL).
  • Workspace resolution (scratch / dir:<path> / worktree); per-task log capture at ~/.hermes/kanban/logs/<task>.log with 2 MiB rotation.
  • Idempotency keys for retried webhooks.
  • GC for old events + old log files; stats (board_stats, task_age).
  • Notify subs: add / list / remove / unseen_events_for_sub / advance_notify_cursor.
  • build_worker_context gives every worker: this task's prior attempts (retry continuity), parent tasks' completed run summaries + metadata (downstream handoff), and the same assignee's 5 most-recent completions on other tasks (cross-task role continuity).
  • Synthetic zero-duration runs when complete or block is called on a never-claimed task with handoff data, so summary / metadata / reason is never silently dropped.
  • Defensive invariant recovery in claim_task / unblock_task / archive_task / dashboard drag-drop.

CLI (hermes_cli/kanban.py)

All verbs auto-init the DB on first use. create / list / show / show --json (with runs[]) / claim / complete (with --summary --metadata) / block / unblock / archive / comment / link / unlink / heartbeat / runs / log / stats / assignees / dispatch / daemon / watch / tail / gc / init / notify-subscribe / notify-list / notify-unsubscribe. Bulk support on complete / unblock / archive / block --ids. Bulk complete with --summary or --metadata is refused — copy-pasting the same handoff to N tasks is a footgun. Daemon emits health WARNs when the ready queue is non-empty AND no spawns succeeded for 6+ ticks.

Gateway (gateway/run.py)

_kanban_notifier_watcher runs alongside _session_expiry_watcher. Delivers terminal events (completed / blocked / gave_up / crashed / timed_out) to subscribed chats, prefixed with @<assignee> for identity. Unsubscribes on the last delivered event's kind being terminal (not just on task status). Per-sub consecutive-send-failure counter drops dead chats after 3 failures. /kanban create auto-subscribes the originating chat.

Dashboard plugin (plugins/kanban/)

  • Bundled, installed by default — hermes dashboard grows a Kanban tab.
  • Full board: triage / todo / ready / running / blocked / done (+ archived toggle), lanes-by-profile in Running, staleness colouring, progress pills (N/M children done), inline create per column, drag-drop with touch fallback.
  • Drawer: editable title/assignee/priority/description, dependency editor, status actions, comments, event timeline, Run History section (one row per attempt with outcome, profile, elapsed, summary, error, metadata), Worker Log panel (auto-loads last 100 KB with refresh).
  • WebSocket /events with run_id per event; live refresh of the open drawer via per-task event ticks.
  • REST: GET /board, GET /tasks/:id (includes runs[]), POST /tasks (+ idempotency_key), PATCH /tasks/:id (with summary + metadata), POST /tasks/bulk, POST /tasks/:id/comments, POST/DELETE /links, POST /dispatch, GET /config, GET /stats, GET /tasks/:id/log, WS /events?since=&token=.

Skills

  • skills/devops/kanban-worker/SKILL.md — claim → work → heartbeat → complete/block workflow.
  • skills/devops/kanban-orchestrator/SKILL.md — decompose-don't-execute guide.

Docs

  • website/docs/user-guide/features/kanban.md — full reference: concepts, quickstart, systemd, architecture, data model, CLI, REST, event reference, invariants.
  • website/docs/user-guide/features/kanban-tutorial.md — four user stories with 10 dashboard screenshots: solo dev, fleet farming, role-pipeline with retry, circuit-breaker + crash recovery.

Engineering history — audits, reviews, fixes

Layered chronologically, most recent first. Every audit bug was real; every fix lands with regression tests.

Battle-test suite + 2 real bugs it found (commit 123f8d0)

Added tests/stress/ — opt-in suite that stresses the kernel with real concurrency, real subprocesses, randomized property fuzzing, and scale benchmarks. Exposed two bugs that four audit passes missed.

# Bug Fix
1 _pid_alive returned True for zombie processes. os.kill(pid, 0) succeeds against zombies (process table entry exists until parent reaps), so detect_crashed_workers would treat a dead-but-unreaped worker as alive. Peek at /proc/<pid>/status on Linux, treat State: Z as dead.
2 Task IDs used only 2 hex bytes (65k space). ~50% collision probability at 10k tasks by birthday paradox. Bumped to 4 hex bytes (4.3B space). Surfaced by the 10k-task benchmark.

Stress suite results (all passing):

  • Concurrency (100 tasks, 5 workers): 0 double-claims, 0 orphan runs.
  • Mixed ops (500 tasks, 10 workers + reclaimer): 1518 events, 43 lost-claim races, zero invariant violations.
  • Reclaim race (TTL < work duration): 82 claims, 44 reclaimed, 32 late completes correctly refused by CAS.
  • Real-subprocess E2E: dispatcher spawns real workers that heartbeat + complete via the CLI; crash detection works against real dead PIDs.
  • Property fuzzing (500 sequences, ~40k ops): 9 invariant checks per step, 0 violations.
  • Scale benchmarks at 10k tasks: dispatch_once 4ms, recompute_ready 47ms, list_tasks 50ms. Vulcan's + erosika's starvation concerns are an order of magnitude off.

Suite is opt-in via pytest --run-stress.

Atypical-scenario stress + clock-skew fix (commit 63b7b6d)

Added tests/stress/test_atypical_scenarios.py — 28 scenarios covering inputs and environments the normal tests assume away. Categories:

  • Data: unicode / emoji / CJK / RTL, 1 MB strings, SQL injection, multiline summaries, malformed JSON, empty fields, newlines in tenant names.
  • Graphs: cycles, self-parenting, diamonds, 500-wide fan-out + fan-in, child promotion against every non-done parent status.
  • Workspace: path traversal attempts, non-existent paths.
  • Filesystem: HERMES_HOME with spaces / unicode / symlinks (two links to same dir share DB).
  • Scale: 1000 runs on one task, 5000 tasks / 100 tenants, 1000 comments, huge payloads.
  • Lifecycle: done/archived tasks resist re-claim/complete/block, unassigned ready tasks correctly skipped by dispatcher.
  • Concurrency: idempotency-key race across two processes (both get the same task id).
  • Dashboard: REST with empty/huge/unicode/type-mismatched inputs.

Bug caught and fixed: CLI show and runs printed negative elapsed time when NTP jumps backward between claim and complete. Both sites now clamp to max(0, end - start), matching the dashboard JS which already had the same clamp. Regression test added.

Follow-up in v1 (commit be184aa): Both items initially flagged as v2 — fixed immediately instead of deferred.

# Bug Fix
1 resolve_workspace accepted relative dir: paths, which resolved against the dispatcher's CWD (confused-deputy escape vector). Absolute paths now required for all kinds; clear error message on refusal. Threat model documented.
2 build_worker_context unbounded on retry-heavy (63 KB at 1000 runs) and comment-heavy (50 KB at 1000 comments) tasks. Per-section caps: 10 most-recent attempts, 30 most-recent comments, 4 KB per summary/error/metadata, 8 KB per body, 2 KB per comment. Older items collapsed into "N earlier omitted" markers. 1000 runs: 63 KB → 820 chars. 1000 comments: 50 KB → 1,671 chars.

189/189 kanban suite pass (+6 regression tests for the caps and path guards).

189/189 main kanban suite pass.

Tool surface for worker + orchestrator agents (commit 832ecde)

Added a 7-tool structured surface (kanban_show, kanban_complete, kanban_block, kanban_heartbeat, kanban_comment, kanban_create, kanban_link) so kanban workers can interact with the board from inside their Python process instead of shelling out to hermes kanban. Fixes the Docker/Modal/SSH backend portability gap the atypical-scenarios pass surfaced — CLI calls from inside a remote terminal backend fail because hermes isn't in the container and kanban.db isn't mounted. Tools run in the agent's process and always reach the DB.

Zero schema footprint on normal sessions. Each tool's check_fn returns True only when HERMES_KANBAN_TASK is set in the process env — which only happens when the dispatcher spawned this worker. Verified empirically: hermes chat sessions have 27 tools in schema, worker sessions have exactly 34 (+7). No leak either direction.

The dispatcher sets HERMES_KANBAN_TASK + HERMES_KANBAN_WORKSPACE + HERMES_PROFILE when spawning. The CLI, dashboard, and /kanban slash command are unchanged — those are humans reaching the kernel directly, not the agent path.

Skill updates: kanban-worker and kanban-orchestrator rewritten to call the tools instead of CLI subprocess; CLI remains documented as fallback for human operators and scripts.

25 new regression tests in tests/tools/test_kanban_tools.py covering gating, all 7 handler happy paths, error branches (missing args, non-dict metadata, empty strings, cycles, self-link), and a full E2E worker lifecycle driven purely through tools.

214/214 kanban suite green.

System-prompt guidance + skill reshape (commit 1970bcf)

Moved the kanban worker lifecycle from the skills into a conditionally-injected system-prompt guidance block, matching the existing MEMORY_GUIDANCE / SESSION_SEARCH_GUIDANCE / SKILLS_GUIDANCE pattern. Workers now know the full 6-step lifecycle even if no skill loaded — the lifecycle is load-bearing for correct board behavior and belongs in the prompt path, not the skill path.

New KANBAN_GUIDANCE (agent/prompt_builder.py) — ~3 KB covering identity framing, 6-step lifecycle with exact tool-call shapes, orchestrator-mode note, and explicit DO NOTs. Injected into _build_system_prompt gated on kanban_show in valid_tool_names. Since kanban_show's check_fn gates on HERMES_KANBAN_TASK, a normal hermes chat session sees none of it.

Verified live:

Prompt size Kanban lifecycle present
Normal chat 2,329 chars No
Worker (env set) 5,272 chars Yes — full 6 steps

Delta is exactly the KANBAN_GUIDANCE block. Zero cost to normal users.

Skills reshaped from lifecycle-plus-references to references-only. Both now carry what doesn't fit in a system prompt: good summary/metadata shape patterns, block-reason examples, retry diagnostics, the orchestrator's decomposition playbook with a concrete Postgres-migration walkthrough, pitfalls, CLI fallback cheatsheet. Loading the skill is no longer required to work a task correctly.

+3 regression tests. 217/217 kanban+tools suite green; 303/303 run_agent suite green.

Auto-load kanban-worker skill on dispatch (commit c65c1dd)

_default_spawn now includes --skills kanban-worker in the child's argv. Every dispatched worker gets the skill's pattern library (good summary/metadata shapes, retry diagnostics, block-reason examples) loaded automatically — no per-profile skill configuration required. --skills is additive to the profile's defaults, so users can still add more skills on top.

What lives where, finalized:

Layer Content
System prompt (always, if env var set) KANBAN_GUIDANCE — mandatory 6-step lifecycle, tool-call shapes, DO NOTs
Tools (always, if env var set) 7 kanban_* handlers with self-explanatory descriptions
Skill (auto-loaded on dispatch) kanban-worker — deeper patterns, retry diagnostics, CLI fallback

Zero of these show up in a normal hermes chat session. All three activate together when the dispatcher spawns a worker.

Regression test intercepts Popen in the spawn path and asserts --skills kanban-worker appears in the argv. 218/218 kanban + tools suite green.

External review: @erosika pre-merge audit (commit a24c6e1)

Full reply on the issue: #16102 (comment)

Six bugs, all real, all fixed:

# Bug Fix
1 unblock_task didn't defensively close a stale current_run_id pointer. Mirrored close-and-clear pattern from claim_task / archive_task.
2 Migration backfill ran outside write_txn; concurrent dispatcher could orphan runs. Wrapped in write_txn + CAS guard on pointer UPDATE; orphan rows marked reclaimed if CAS fails.
3 Gateway notifier only unsubscribed on task.status in (done, archived), leaking subs on blocked/gave_up/crashed/timed_out. Unsub on last-event-kind being terminal.
4 Notifier retried adapter.send against dead chats forever. Per-sub failure counter; drop after 3 consecutive failures; resets on success.
5 recompute_ready full-scan would starve claim CAS at 10k todo tasks. Deferred to v2 follow-up (dirty-set refactor); stress benchmarks proved it's a 47ms concern at 10k, not a bug.
6 Daemon runs silently when every spawn fails. Tick-level health telemetry: WARN on stderr when ready queue non-empty AND 0 spawns succeeded for 6 consecutive ticks. Rate-limited to 1 per 5 min.

Two cheap v2 extensions shipped in-scope:

  • Cross-task role history in build_worker_context — 5 most-recent completed runs for the current task's assignee, giving workers implicit continuity.
  • Notifier identity prefix — terminal pings lead with @<assignee>.

Deferred to v2: skill-aware routing, structured comments as multi-peer session substrate (asked erosika to open a separate RFC), pattern/mechanism taxonomy docs split.

Deep-scan pass 2 (commit e27c819)

Second integration audit covering frontend rendering, WebSocket plumbing, never-claimed-task handling, dataclass shape. Eight issues.

# Bug Fix
1 complete_task on never-claimed task silently dropped summary/metadata. Synthesize zero-duration run when handoff data is present.
2 block_task on never-claimed task silently dropped --reason. Same synthesis pattern.
3 Event dataclass missing run_id field. Added; every read path now surfaces it.
4 GET /tasks/:id events + WS /events payload omitted run_id. Both now include it.
5 TaskDrawer didn't live-refresh on WS events for its own task. Per-task event counter passed as prop; drawer's useEffect depends on it.
6 claim_task didn't guard against stranded current_run_id. Defensive close-as-reclaimed in same txn.
7 CLI kanban watch --kinds help listed legacy spawn_auto_blocked. Updated to current lexicon.
8 CSS missing .hermes-kanban-run--ended fallback rule. Added.

First audit pass (commits 8ef2ae6 + 1c78f66)

Five bugs where runs got orphaned or the dashboard was a second-class citizen.

# Bug Fix
1 archive_task on a running task orphaned the run. Archive closes active run as reclaimed, clears claim columns.
2 Dashboard drag-drop off running left the run open forever. _set_status_direct closes via _end_run(outcome='reclaimed').
3 PATCH /tasks/:id with status=done accepted result but not summary/metadata. UpdateTaskBody carries them; forwarded to complete_task.
4 hermes kanban complete a b c --summary X silently broadcast the same handoff. Refused with stderr guard; bare bulk still works.
5 Gateway notifier used task.result instead of run.summary. complete_task embeds first-line summary in event payload; notifier reads it. Zero extra SQL.

Invariant tasks.current_run_id IS NULL ⇔ run row terminal now enforced across CLI, dashboard, dispatcher, archive.

Runs as first-class + structured handoffs (commit 0146cb2, addresses @vulcan-artivus's review)

External review flagged that the one-assignee / one-workspace / one-attempt-per-row shape breaks on role-pipelines (PM → Eng → Review → loop → PR). Landed the high-value, schema-affecting pieces in v1 so the model doesn't ossify.

  • task_runs table (one row per attempt).
  • Structured handoffs via --summary + --metadata; build_worker_context reads them for downstream + retry workers.
  • Event attribution via task_events.run_id.
  • Dashboard Run History section + new hermes kanban runs <id> verb.
  • Legacy-DB migration synthesizes a run row for any pre-existing running task.
  • Forward-compat columns for v2 workflow routing.

Explicitly deferred to v2: workflow templates with success/failure step routing, stage as a distinct axis, shared-by-default workspace across stages.

Multica-inspired additions

After reading Multica's codebase (21.8k stars, Go/Next.js/Postgres task-tracker), ported four ideas that fit the single-host SQLite model; dropped their cross-host server + pgvector skill search.

  • Per-task max-runtime caps with SIGTERM/SIGKILL grace window.
  • Worker heartbeats for long-running operations.
  • Assignee enumeration (union of ~/.hermes/profiles/ + assigned names).
  • Event vocab cleanup: ready → promoted, priority → reprioritized, spawn_auto_blocked → gave_up; new spawned, heartbeat, timed_out kinds.

Core hardening

  • Dispatcher as a real daemon (not cron — cron burns LLM tokens per tick).
  • hermes kanban init prints daemon setup hint.
  • Spawn-failure circuit breaker.
  • Crash detection via PID liveness (independent of claim TTL).
  • Log rotation (2 MiB, one generation).
  • Idempotency keys on create.
  • Bulk CLI verbs.
  • Terminal event tail via hermes kanban watch.
  • Stats endpoint + CLI.
  • Worker logs reachable via CLI + REST + dashboard drawer panel.
  • GC for events + logs.

Tutorial

website/docs/user-guide/features/kanban-tutorial.md walks four user stories end-to-end with 10 dashboard screenshots:

  • Story 1: Solo dev shipping a feature (parent→child deps, structured handoff, Run History).
  • Story 2: Fleet farming (3 assignee pools in parallel, lanes-by-profile, daemon per assignee).
  • Story 3: Role pipeline with retry (PM spec → eng implements → review blocks → eng retries → review approves; two-run history).
  • Story 4: Circuit breaker + crash recovery (gave_up terminal block for missing creds; crashed-then-completed migration with chunked-strategy retry).

Tests

tests/hermes_cli/test_kanban_db.py                 — 36 test funcs (schema + CAS)
tests/hermes_cli/test_kanban_cli.py                — 19 test funcs (run_slash e2e)
tests/hermes_cli/test_kanban_core_functionality.py — 93 test funcs (idempotency,
                                                      circuit breaker, crash
                                                      detection, daemon loop,
                                                      stats, notify subs, gc,
                                                      log rotation, bulk CLI,
                                                      runs as first-class,
                                                      invariant recovery,
                                                      zombie detection, ID scale)
tests/plugins/test_kanban_dashboard_plugin.py      — 36 test funcs (REST + WebSocket)
                                                     TOTAL COLLECTED: 218/218 pass
                                                     (some tests parametrized)

All pass under scripts/run_tests.sh (hermetic, 4 xdist workers, TZ=UTC, LANG=C.UTF-8).

Stress suite (tests/stress/, opt-in):

  • test_concurrency.py — 5 workers, 100 tasks.
  • test_concurrency_mixed.py — 10 workers + reclaimer, 500 tasks, random ops.
  • test_concurrency_reclaim_race.py — adversarial TTL.
  • test_subprocess_e2e.py — real Python subprocess workers.
  • test_property_fuzzing.py — 500 sequences × ~80 ops = ~40k total operations, 9 invariants per step.
  • test_benchmarks.py — latency at 100 / 1k / 10k tasks; results saved to JSON.

Out of scope (deliberately)

  • "Smart routing" / auto-assignment → a router profile.
  • Per-agent budgets → a plugin that wraps spawn.
  • Governance / approval gates → reuse tools/approval.py.
  • Extra columns / custom card chrome → extend the plugin via the standard plugin contract.
  • Org-chart / reporting lines → profile naming convention.
  • Multi-hostkanban.db is local SQLite; crash detection assumes host-local PIDs. Run independent boards per host and bridge via delegate_task / a message queue if needed.
  • Workflow template DSL with success/failure routing → v2 (forward-compat columns already in the schema).
  • Structured comments (in_reply_to / addressed_to / kind) as multi-peer session substrate → v2 (erosika invited to open a separate RFC).

Credits

User stories and design input from the Nous Discord design thread: Teknium, waxhy, A Real Icehole, Keimpe, LLM.STORE, caco, hunter_cat, djm, ionmanden, psbd, Aiz, Rikllo, sudo_relax, neo2k8. P6 borrowed from Google's Gemini CLI subagent @name syntax; P7 from psbd; P8 from neo2k8.

Pre-merge review by @vulcan-artivus (runs as first-class) and @erosika (six pre-merge bugs, two v2 seeds, structured-comments RFC seed).

Multica (https://github.com/multica-ai/multica) for the max-runtime / heartbeat / assignee-enumeration ideas.

Follow-ups shipped on this PR

Per-task force-loaded skills (commit dd83173)

Tasks now carry their own extra-skill list. The dispatcher emits one
--skills <name> flag per task skill alongside the built-in --skills kanban-worker, so a task can pin specialist context (e.g.
translation, github-code-review, security-pr-audit) without the
user having to edit the assignee profile's skill config.

Surface parity on all 4 entry points:

Where How
CLI hermes kanban create "..." --skill translation --skill github-code-review (repeatable)
Tool (orchestrator → worker) kanban_create(title=..., assignee=..., skills=["translation"])
Dashboard REST POST /api/plugins/kanban/tasks {"skills": ["translation"]}
Dashboard UI Comma-separated skills input in the inline create form; Skills row in the drawer

Schema changes are additive + idempotent: tasks.skills TEXT (JSON
array), migrated via _migrate_add_optional_columns. Existing rows
stay skills=NULL (reads back as Task.skills is None). 16 new
tests cover kernel round-trip, normalization (dedupe, strip,
comma-rejection), dispatcher argv ordering and built-in dedupe, CLI
flag repeatability, show output, migration idempotency, tool
list/string/non-list inputs, and dashboard REST round-trip.

Dispatcher runs in the gateway (commit 60e7567)

The dispatcher is now embedded in the gateway process. hermes gateway start is the single entry point — no separate hermes kanban daemon
or systemd unit required. Typical path: hermes kanban create ...
writes the row, the gateway's embedded dispatcher picks it up on the
next tick (60s default), worker spawns. The standalone daemon is
retired (deprecation stub + undocumented --force escape hatch for
headless hosts that can't run the gateway).

Config (additive; no version bump):

kanban:
  dispatch_in_gateway: true        # default
  dispatch_interval_seconds: 60    # default

HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 disables at runtime without
editing YAML.

Warnings surface through three channels when a task would sit idle
because no dispatcher is present:

Where How it appears
CLI hermes kanban create prints a line to stderr (non-JSON mode only) with hermes gateway start guidance
Dashboard REST POST /tasks response gains a warning field that the UI threads into the existing error banner
Dashboard UI Banner reads "Task created, but: No gateway is running — ..."

Triage and unassigned tasks never warn (they wouldn't dispatch
anyway). Probe failures are silent so partial installs don't cry wolf.

Integration fixes in the same commit: stale "spawned by
hermes kanban daemon" phrasing in toolsets.py + tool docstrings;
tutorial's hermes kanban daemon --assignee … block (a flag that
never existed); systemd unit now carries --force + DEPRECATED
header; docs Running the dispatcher as a service section rewritten
to describe the gateway-embedded path; hermes kanban init next-step
hint updated.

17 new tests cover:

  • DEFAULT_CONFIG has dispatch_in_gateway=True + sane interval
  • _check_dispatcher_presence returns correct (running, message) for
    gateway-up, gateway-down, flag-off, and probe-error paths
  • CLI create warns on stderr for ready+assigned when no gateway;
    silent when up; never warns on triage/unassigned
  • hermes kanban daemon without --force prints DEPRECATED + exits 2
  • Argparse help marks daemon DEPRECATED
  • Gateway _kanban_dispatcher_watcher respects config flag off
    (exits fast), respects env override, defers to config when env
    truthy-but-not-false
  • Dashboard REST surfaces warning correctly, omits when running,
    skips on triage, survives probe errors

E2E verified in isolated HERMES_HOME: watcher starts, ticks, and
exits within 3s of _running=False.

New `hermes kanban` CLI subcommand + `/kanban` slash command + skills for
worker and orchestrator profiles. SQLite-backed task board
(~/.hermes/kanban.db) shared across all profiles on the host. Zero
changes to run_agent.py, no new core tools, no tool-schema bloat.

Motivation: delegate_task is a function call — sync fork/join, anonymous
subagent, no resumability, no human-in-the-loop. Kanban is the durable
shape needed for research triage, scheduled ops, digital twins,
engineering pipelines, and fleet work. They coexist (workers may call
delegate_task internally).

What this adds
- hermes_cli/kanban_db.py — schema, CAS claim, dependency resolution,
  dispatcher, workspace resolution, worker-context builder.
- hermes_cli/kanban.py — 15-verb CLI surface and shared run_slash()
  entry point used by both CLI and gateway.
- skills/devops/kanban-worker — how a profile should work a claimed task.
- skills/devops/kanban-orchestrator — "you are a dispatcher, not a
  worker" template with anti-temptation rules.
- /kanban slash command wired into cli.py and gateway/run.py. Bypasses
  the running-agent guard (board writes don't touch agent state), so
  /kanban unblock can free a stuck worker mid-conversation.
- Design spec at docs/hermes-kanban-v1-spec.pdf — comparative analysis
  vs Cline Kanban, Paperclip, NanoClaw, Gemini Enterprise; 8 patterns;
  4 user stories; implementation plan; concurrency correctness.
- Docs: website/docs/user-guide/features/kanban.md, CLI reference
  updated, sidebar entry added.

Architecture highlights
- Three planes: control (user + gateway), state (board + dispatcher),
  execution (pool of profile processes).
- Every worker is a full OS process, spawned as `hermes -p <profile>`.
  No in-process subagent swarms — solves NanoClaw's SDK-lifecycle
  failure class.
- Atomic claim via SQLite CAS in a BEGIN IMMEDIATE transaction; stale
  claims reclaimed 15 min after their TTL expires.
- Tenant namespacing via one nullable column — one specialist fleet
  can serve many businesses with data isolation by workspace path.

Tests: 60 targeted tests (schema, CAS atomicity, dependency resolution,
dispatcher, workspace kinds, tenancy, CLI + slash surface). All pass
hermetic via scripts/run_tests.sh.
@alt-glitch alt-glitch added type/feature New feature or request P2 Medium — degraded but workaround exists comp/cli CLI entry point, hermes_cli/, setup wizard comp/cron Cron scheduler and job management comp/gateway Gateway runner, session dispatch, delivery labels Apr 26, 2026
The /kanban CLI + slash command are enough to run the board
headlessly, but triage and cross-profile supervision want a
visual board. Document the design as a dashboard plugin that:

- reads live state from kanban.db over a WebSocket on
  task_events (no polling)
- writes through run_slash() so CLI/gateway/GUI cannot drift
- mounts under /api/plugins/kanban/ following the existing
  'Extending the Dashboard' plugin shape

The plugin is strictly a thin layer over kanban_db — no new
business logic, nothing to merge into the kernel.
Ships plugins/kanban/dashboard/ as a bundled dashboard plugin. No core
changes — uses the standard dashboard plugin contract (manifest.json +
dist/index.js + plugin_api.py) documented in 'Extending the Dashboard'.

What the tab gives you:
- One column per kanban status (todo / ready / running / blocked / done;
  archived behind a toggle), column counts, coloured status dots.
- Cards with id, title, priority badge, tenant tag, assignee,
  comment/link counts, 'created N ago'.
- HTML5 drag-drop between columns — status change routes through the
  same kanban_db code the CLI /kanban verbs use, so the three surfaces
  (CLI, gateway, dashboard) can never drift.
- Inline create per-column (title, assignee, priority).
- Side drawer on card click: description, status action row
  (→ ready / → running / block / unblock / complete / archive),
  dependency links, comment thread with Enter-to-submit,
  last 20 events.
- Toolbar: search, tenant filter, assignee filter, show-archived,
  nudge-dispatcher (skip the 60s wait), refresh.
- Live updates via WebSocket tailing task_events — the board reflects
  CLI or gateway actions in real time.

REST surface under /api/plugins/kanban/: GET /board, GET /tasks/:id,
POST /tasks, PATCH /tasks/:id, POST /tasks/:id/comments, POST /links,
DELETE /links, POST /dispatch, WS /events. Every handler is a thin
wrapper around kanban_db — no new business logic.

Visually theme-aware: the plugin CSS reads only --color-*, --radius,
--font-mono etc. so it reskins with whichever dashboard theme is active.

Tests (tests/plugins/test_kanban_dashboard_plugin.py, 16 tests):
- empty board shape
- create + appears in ready column with tenant/assignee rollups
- tenant filter
- detail includes parents/children/events
- 404 on unknown task
- PATCH status: complete / block / unblock / ready drag-drop / running
- PATCH reassign, priority, edit, invalid-status rejection
- POST comment (plus empty-body rejection)
- POST link + DELETE link + cycle rejection
- POST dispatch (dry run)

All 76 kanban tests pass under scripts/run_tests.sh.

Docs: website/docs/user-guide/features/kanban.md gains a full
'Dashboard (GUI)' section covering install, architecture, REST surface,
live-updates mechanism, extending, and scope boundary.
Follows up on the initial dashboard plugin with the items called out
during self-review — ships the GUI-reality claims the PR body made,
closes the WebSocket auth gap, and lands the 'Triage' status the design
spec's Fusion-style screenshot leads with.

Kernel changes
  - kanban_db.VALID_STATUSES gains 'triage'. status is TEXT without a
    CHECK constraint so no schema migration is needed.
  - create_task(triage=True) forces the initial status to 'triage'
    regardless of parents, and parent ids are still validated so the
    eventual link rows don't dangle. recompute_ready() only promotes
    'todo' -> 'ready', so triage tasks are naturally isolated from the
    dispatcher pipeline.
  - hermes kanban create gains --triage.
  Patterns table (docs) gains P9 'Triage specifier'.

Plugin backend (plugins/kanban/dashboard/plugin_api.py)
  - GET /board now auto-init's kanban.db on first read (idempotent).
    A fresh install shows an empty board instead of 'failed to load'.
  - GET /board returns a new 'progress' field per task — {done, total}
    of child-task completion, or None if the task has no children.
  - BOARD_COLUMNS prepends 'triage'.
  - POST /tasks accepts {triage: bool}; PATCH /tasks/:id accepts
    {status: 'triage'}.
  - WebSocket /events now requires ?token=<session_token> as a query
    param — browsers can't set Authorization on a WS upgrade, so this
    matches the pattern the in-browser PTY bridge uses. Constant-time
    compare against hermes_cli.web_server._SESSION_TOKEN. In bare-test
    contexts (no dashboard module) the check no-ops so the tail loop
    stays testable. Security boundary documented in the module header
    and in website/docs/user-guide/features/kanban.md.

Plugin UI (plugins/kanban/dashboard/dist/index.js + style.css)
  - Adds the Triage column (lilac dot) with helper text
    'Raw ideas — a specifier will flesh out the spec'. Inline-create
    from the Triage column parks new tasks in triage.
  - Status action row in the drawer gains '→ triage'.
  - Progress pill (N/M) on cards that have children. Full-complete
    state tints the pill green.
  - 'Lanes by profile' toolbar toggle — sub-groups the Running column
    by assignee so you see at a glance which specialist is busy on
    what.
  - Destructive status moves (done / archived / blocked) via drag-drop
    OR via the drawer action row now prompt for confirmation.
  - Escape closes the drawer.
  - Live-update reloads are debounced (250ms) so a burst of
    task_events triggers one refetch, not N.
  - WebSocket includes ?token= built from window.__HERMES_SESSION_TOKEN__.
  - WebSocket reconnect uses exponential backoff capped at 30s, not
    a fixed 1.5s spin loop, and surfaces a user-visible error on
    code-1008 (auth rejected) instead of reconnecting forever.
  - ErrorBoundary wraps the page — a bad card render shows a
    'rendering error, reload view' card instead of crashing the tab.

Tests (tests/plugins/test_kanban_dashboard_plugin.py, +5 tests = 21)
  - empty-board shape now asserts all 6 columns including 'triage'
  - create_triage_lands_in_triage_column
  - triage_task_not_promoted_to_ready (dispatcher bypasses triage)
  - patch_status_triage_works (both into triage and out of it)
  - board_progress_rollup (0/2 -> 1/2 -> childless cards = None)
  - board_auto_initializes_missing_db
  - ws_events_rejects_when_token_required (three sub-assertions:
    missing → 1008, wrong → 1008, correct → handshake accepted)

All 82 kanban tests pass under scripts/run_tests.sh.

Docs
  - kanban.md 'What the plugin gives you' fully rewritten to match
    shipped reality (triage, progress pill, assignee lanes,
    destructive-confirm, Escape-close, debounce).
  - New 'Security model' subsection documents the explicit-plugin-
    route-bypass, the WS token requirement, and the --host 0.0.0.0
    warning; also notes that kanban.db is profile-agnostic on purpose
    (the coordination primitive) so cross-profile visibility is
    expected.
  - CLI command reference shows --triage.
  - Collaboration patterns table adds P9 'Triage specifier'.
The dashboard plugin gets the last layer of features that turn it from a
'usable read surface with drag-drop' into a 'full kanban UI' — no more
'drop to CLI to do X' moments from inside the tab.

Plugin backend
  - POST /tasks/bulk — apply the same patch (status / archive / assignee
    / priority) to every id in the request body. Each id runs
    independently: one bad id reports {ok: false, error: ...} without
    aborting siblings. Status transitions that aren't legal for the
    current state are surfaced per-id ('transition to done refused').
    Used by the multi-select bulk action bar.
  - GET /config — returns the dashboard.kanban section of config.yaml
    (default_tenant, lane_by_profile, include_archived_by_default,
    render_markdown) with sensible defaults when the section is absent.
    Loaded once by the SPA to preselect filters and toggle markdown
    rendering.
  - _conn() helper — every handler now goes through it, calling
    kanban_db.init_db() (idempotent) before every connection. Fresh
    installs work whether the first hit is GET /board, POST /tasks, or
    any other endpoint — no more 'no such table: tasks' when the CLI
    or a script hits the plugin before the dashboard has ever loaded.

Plugin UI (plugin bundle, +~12 KB)
  - Multi-select: per-card checkbox; shift/ctrl-click also toggles
    without opening the drawer. A BulkActionBar appears above the
    columns with batch → ready / complete / archive / reassign
    (profile dropdown + unassign option). Destructive batches confirm
    first. Partial failures from the backend are surfaced inline.
  - Drawer inline editing:
    - Click the title → TitleEditor swaps in an input, Enter saves,
      Escape cancels.
    - Click the Assignee meta row → AssigneeEditor input (empty string
      unassigns).
    - Click the Priority meta row → PriorityEditor numeric input.
    - New 'edit' button on Description → full-width textarea; Save /
      Cancel switch back to rendered view.
  - Dependency editor: chip list of parents + children with per-chip
    × button (calls DELETE /links). Add-parent / add-child dropdowns
    filter out self + already-linked tasks so you cannot re-add a
    duplicate edge or a self-loop. Cycle rejections from the server
    surface cleanly via the existing error banner.
  - Parent selection in InlineCreate: new dropdown listing every task
    on the board ('{id} — {title}') — picking one sends parents=[id]
    with the create payload, so the task lands in todo (or triage if
    created from the Triage column) with the dependency wired up.
  - Safe markdown rendering for description, comment bodies, and
    result. A small in-bundle renderer handles headings, bold, italic,
    inline code, fenced code, bullet lists, and http(s)/mailto links.
    Every substitution runs on HTML-escaped input (no raw HTML), links
    get target=_blank + rel=noopener,noreferrer. Disabled by config
    key dashboard.kanban.render_markdown=false (falls back to <pre>).
  - Touch drag-drop: attachTouchDrag() installs a pointerdown handler
    that spawns a drag proxy, tracks elementFromPoint under the finger,
    and dispatches a hermes-kanban:drop CustomEvent on the column when
    released. Desktop continues to use native HTML5 DnD. Columns
    listen for both.
  - ErrorBoundary already present from the prior commit catches any
    renderer throw; markdown escape + touch-proxy cleanup both have
    their own try/finally.

Tests (tests/plugins/test_kanban_dashboard_plugin.py — 90/90 pass)
  - bulk_status_ready: 3 tasks blocked, batch → ready, all move
  - bulk_archive hides all ids from default board
  - bulk_reassign changes every assignee
  - bulk_unassign_via_empty_string sets assignee back to None
  - bulk_partial_failure_doesnt_abort_siblings: bogus id in middle,
    good siblings still get priority=7
  - bulk_empty_ids_400
  - config_returns_defaults_when_section_missing
  - config_reads_dashboard_kanban_section (writes config.yaml, verifies
    every key round-trips)

Live smoke (real FastAPI app + isolated HERMES_HOME):
  - /config without section returns defaults
  - /config with dashboard.kanban section returns the configured values
  - POST /tasks as the first-ever request (no prior /board) succeeds —
    auto-init handles it
  - Link add + remove via POST /links + DELETE /links round-trip
  - Bulk priority bump on 2 ids, both get priority=5
  - Bulk archive hides ids from default board
  - PATCH {title, body} updates the task, markdown source survives
    the round trip
  - POST /tasks {triage: true, parents: [id]} lands in triage, not todo
  - Bulk partial: 2 good + 1 bogus returns per-id outcome

Docs (website/docs/user-guide/features/kanban.md)
  - 'What the plugin gives you' rewritten to reflect bulk, drawer
    edit, dep editor, parent-on-create, markdown, touch drag-drop.
  - New 'Dashboard config' subsection with a YAML example for
    dashboard.kanban.*.
  - REST table gains /tasks/bulk and /config rows.
… logs, notify, bulk, stats

Eliminates every 'known broken on day one' item in the core functionality
audit. The board is now self-driving (daemon, not cron), self-healing
(crash detection, spawn-failure circuit breaker), and self-reporting
(logs, stats, gateway notifications).

Dispatcher
  - New `hermes kanban daemon` long-lived loop with --interval, --max,
    --failure-limit, --pidfile, --verbose, signal-clean shutdown
    (SIGINT/SIGTERM via threading.Event). A kb.run_daemon() entry point
    lets tests drive it inline without subprocess.
  - `hermes kanban init` now prints the dispatcher setup hint so users
    don't leave the board off-by-default. Ships a systemd user unit at
    plugins/kanban/systemd/hermes-kanban-dispatcher.service.
  - Removed the old 'add this to cron' doc path. Cron runs agent
    prompts (LLM cost per tick) — unacceptable for a per-minute
    coordination loop.

Worker aliveness / safety
  - Spawn returns the child's PID; dispatcher stores it on the task row
    and calls detect_crashed_workers() every tick. If the PID is gone
    but the claim TTL hasn't expired, the task drops back to ready with
    a 'crashed' event. Host-local only — cross-host PIDs are ignored
    per the single-host design.
  - Spawn-failure circuit breaker: after N consecutive spawn_failed
    events on the same task (default 5), the dispatcher auto-blocks
    with the last error as the reason. Success resets the counter.
    Workspace-resolution failures count against the same budget.
  - Log rotation: _rotate_worker_log trims at 2 MiB, keeps one
    generation (.log.1), bounds per-task disk usage at ~4 MiB.

Idempotency / dedup
  - create_task(idempotency_key=...) returns the existing non-archived
    task id for retried webhooks. --idempotency-key on the CLI, json
    body field on the dashboard plugin. Archived tasks don't block a
    fresh create with the same key.

CLI surface
  - Bulk verbs: complete, unblock, archive accept multiple ids;
    block accepts --ids for sibling blocks with the same reason.
  - New verbs: daemon, watch (live event tail filtered by
    assignee/tenant/kinds), stats, log, notify-subscribe,
    notify-list, notify-unsubscribe.
  - dispatch gains --failure-limit + crashed/auto_blocked columns in
    JSON output and human-readable output.
  - gc accepts --event-retention-days / --log-retention-days; prunes
    task_events for terminal tasks and old log files.

Gateway integration
  - New GatewayRunner._kanban_notifier_watcher: polls
    kanban_notify_subs every 5s, pushes ✔/⏸/✖ messages to subscribed
    chats for completed/blocked/spawn_auto_blocked/crashed events.
    Cursor-advanced per-sub; auto-removed when the task reaches
    done/archived. Runs alongside the session expiry and platform
    reconnect watchers — SQLite work in asyncio.to_thread so the
    event loop never blocks.
  - /kanban create in the gateway auto-subscribes the originating
    chat (platform + chat_id + thread_id). Users see
    '(subscribed — you'll be notified when t_abcd completes or
    blocks)' appended to the response.

Dashboard plugin
  - GET /stats returns board_stats (by_status, by_assignee,
    oldest_ready_age_seconds).
  - GET /tasks/:id/log returns the worker log with optional ?tail=N
    cap. 404 on unknown task, exists=false when the task has never
    spawned.
  - POST /tasks accepts idempotency_key; both Pydantic body and the
    create_task kwarg now round-trip.
  - /board attaches task.age (created/started/time_to_complete in
    seconds) so the UI can colour stale cards without recomputing.
  - Card CSS: amber border after N minutes, red border when clearly
    stuck (tier per status: running 10m/60m, ready 1h/24h, todo
    7d/30d, blocked 1h/24h).
  - Drawer: new Worker log section, auto-loads on mount, last 100 KB
    cap with on-disk path surfaced when truncated.

Kernel
  - Schema additions: tasks.idempotency_key, tasks.spawn_failures,
    tasks.worker_pid, tasks.last_spawn_error; new
    kanban_notify_subs table. All gated by _migrate_add_optional_columns
    so legacy DBs upgrade cleanly.
  - release_stale_claims / complete_task / block_task now all clear
    worker_pid so crash detection doesn't false-positive on reclaimed
    tasks.
  - read_worker_log fixed: tail-skip no longer eats one-giant-line
    logs (common with child processes that don't flush newlines
    before dying).

Tests (tests/hermes_cli/test_kanban_core_functionality.py, 28 new)
  - Idempotency: same key returns existing, archived doesn't block,
    no key never collides
  - Circuit breaker: auto-blocks after limit, success resets counter,
    workspace-resolution failure counts against budget
  - Aliveness: _pid_alive helper, detect_crashed_workers reclaims
    exited child
  - Daemon: runs and stops cleanly via stop_event, survives a tick
    exception
  - Stats + task_age helpers
  - Notify subs: CRUD, cursor advances, distinct-thread is a separate row
  - GC: events-only-for-terminal-tasks, old worker logs deleted
  - Log: rotation keeps one generation, read_worker_log tail
  - CLI: bulk complete/archive/unblock/block, create with
    --idempotency-key, stats --json, notify-subscribe+list, log
    missing task, gc reports counts
  - run_slash parity: smoke-tests every registered verb (23
    invocations); none may raise or return empty string

Full kanban test suite: 234/234 pass under scripts/run_tests.sh
(60 original + 30 dashboard plugin + 28 new core + 116 command
registry). Live smoke covers /stats, idempotency, age, log endpoint
with and without content, log?tail= truncation signal, 404 on unknown
task.

Docs (website/docs/user-guide/features/kanban.md)
  - 'Core concepts' rewritten: new statuses (triage), idempotency key,
    dispatcher-as-daemon-not-cron with circuit breaker behaviour
    documented.
  - Quick start swapped to daemon. New systemd section covers user
    service install.
  - New sections: idempotent create, bulk verbs, gateway
    notifications, out-of-scope single-host note (kanban.db is local;
    don't expect multi-host).
  - CLI reference updated for every new verb, every new flag.
Comment thread hermes_cli/kanban_db.py
doesn't exist. If ``tail_bytes`` is set, only the last N bytes are
returned (useful for the dashboard drawer which shouldn't page megabytes)."""
path = worker_log_path(task_id)
if not path.exists():
Comment thread hermes_cli/kanban_db.py
return None
try:
if tail_bytes is None:
return path.read_text(encoding="utf-8", errors="replace")
Comment thread hermes_cli/kanban_db.py
try:
if tail_bytes is None:
return path.read_text(encoding="utf-8", errors="replace")
size = path.stat().st_size
Comment thread hermes_cli/kanban_db.py
if tail_bytes is None:
return path.read_text(encoding="utf-8", errors="replace")
size = path.stat().st_size
with open(path, "rb") as f:
raise HTTPException(status_code=404, detail=f"task {task_id} not found")
content = kanban_db.read_worker_log(task_id, tail_bytes=tail)
log_path = kanban_db.worker_log_path(task_id)
size = log_path.stat().st_size if log_path.exists() else 0
raise HTTPException(status_code=404, detail=f"task {task_id} not found")
content = kanban_db.read_worker_log(task_id, tail_bytes=tail)
log_path = kanban_db.worker_log_path(task_id)
size = log_path.stat().st_size if log_path.exists() else 0
@teknium1 teknium1 changed the title feat(kanban): durable multi-profile collaboration board feat(kanban): durable multi-profile collaboration board + dashboard GUI + dispatcher daemon Apr 27, 2026
teknium1 and others added 15 commits April 27, 2026 06:32
…er, event vocab cleanup

Ports four items from the Multica audit (https://github.com/multica-ai/multica).
Dropped their cross-host server/daemon architecture and their Postgres+pgvector
skill search — both the wrong shape for our single-host SQLite kernel.

1. Per-task max-runtime (`max_runtime_seconds` column)
   - New kernel function `enforce_max_runtime(conn)` runs in every dispatch
     tick. When a running task's elapsed time exceeds the cap, we SIGTERM
     the worker, wait a 5 s grace (polling _pid_alive), then SIGKILL. The
     task goes back to 'ready' with a `timed_out` event and re-queues
     on the next tick (unless the spawn-failure circuit breaker has
     already parked it).
   - Host-local only: lock prefix must match this host's claimer_id so we
     never signal a PID on another machine.
   - CLI: `hermes kanban create --max-runtime 30m | 2h | 1d | <seconds>`.
     New `_parse_duration` helper accepts s/m/h/d suffixes or bare
     integers.
   - Dashboard POST body + the card's `max_runtime_seconds` field.

2. Worker heartbeat (`last_heartbeat_at` column, `heartbeat` event)
   - `heartbeat_worker(conn, task_id, note=None)` emits the event and
     touches last_heartbeat_at. Refused when the task isn't running.
   - CLI: `hermes kanban heartbeat <id> [--note "..."]`.
   - kanban-worker skill instructs workers to heartbeat during long
     loops (training runs, encodes, crawls, batch uploads).
   - Separate signal from PID crash detection: a worker's Python can
     still be alive while the actual work process is stuck. Heartbeat
     absence is diagnostic; future work can auto-block on stale
     heartbeats but v1 just surfaces the signal.

3. Assignee enumeration (`known_assignees`, `list_profiles_on_disk`)
   - Scans ~/.hermes/profiles/ for dirs containing config.yaml + unions
     with current assignees on the board. Each entry returns
     {name, on_disk, counts: {status: n}}.
   - CLI: `hermes kanban assignees [--json]`. Also hooked into
     `hermes kanban init` which now prints discovered profiles so new
     installs see 'these are the assignees you can target' immediately.
   - Dashboard: GET /api/plugins/kanban/assignees for the picker.

4. Event vocab cleanup (three renames + three new kinds)
   - `ready` → `promoted` (fires when deps clear; clearer semantic).
   - `priority` → `reprioritized` (past-tense verb, matches others).
   - `spawn_auto_blocked` → `gave_up` (short, memorable; the circuit
     breaker gave up on this task).
   - New: `spawned` (emitted with {pid} on successful spawn),
     `heartbeat` ({note?}), `timed_out`
     ({pid, elapsed_seconds, limit_seconds, sigkill}).
   - One-shot migration in `_migrate_add_optional_columns` renames
     legacy rows in-place on init_db(), so existing DBs upgrade cleanly.
   - Gateway notifier's TERMINAL_KINDS set updated; timed_out gets its
     own ⏱ message template, gave_up renamed from 'auto-blocked'.
   - Plugin_api.py's two 'priority' emit sites renamed to
     'reprioritized'.
   - Documented in a new 'Event reference' section in kanban.md,
     grouped into three clusters (lifecycle / edits / worker
     telemetry) with payload shapes.

Tests (+18 in tests/hermes_cli/test_kanban_core_functionality.py,
136/136 pass):
  - max_runtime_terminates_overrun_worker: real SIGTERM flow with
    _pid_alive stub, verifies event payload + state reset.
  - max_runtime_none_means_no_cap: unbounded tasks aren't timed out.
  - create_task_persists_max_runtime.
  - enforce_max_runtime_integrates_with_dispatch: kernel-level +
    dispatch_once chaining.
  - heartbeat_on_running_task + heartbeat_refused_when_not_running.
  - cli_heartbeat_verb with --note round-trip.
  - recompute_ready_emits_promoted_not_ready.
  - spawn_failure_circuit_breaker_emits_gave_up.
  - spawned_event_emitted_with_pid.
  - migration_renames_legacy_event_kinds (injects old rows, re-runs
    init_db, asserts rename).
  - list_profiles_on_disk (tmp_path + config.yaml filter).
  - known_assignees_merges_disk_and_board (profiles on disk + board
    assignees + per-status counts).
  - cli_assignees_json.
  - parse_duration_accepts_formats (s/m/h/d/float).
  - parse_duration_rejects_garbage.
  - cli_create_max_runtime_via_duration (2h → 7200).
  - cli_create_max_runtime_bad_format_exits_nonzero.

Live smoke: POST /tasks with max_runtime_seconds round-trips;
/assignees returns the union of on-disk + board-assigned names;
PATCH priority produces 'reprioritized' events (not 'priority');
board cards expose max_runtime_seconds + last_heartbeat_at.

Docs (website/docs/user-guide/features/kanban.md):
  - New 'Event reference' section with three-cluster table
    (lifecycle / edits / worker telemetry) + payload shapes.
  - CLI reference updated for --max-runtime, heartbeat, assignees.
  - Gateway notifications section updated for the new TERMINAL_KINDS.

Not ported from Multica (deliberate, documented in the out-of-scope
section already): Postgres+pgvector skill search (heavy deps conflict
with SQLite kernel), server+daemon cross-host model (we're
single-host on purpose), first-class agent identity with threaded
comments (we keep the board profile-agnostic).
…compat for v2 workflows

Addresses vulcan-artivus's RFC review on issue #16102. Picks up the
structural changes that are expensive to retrofit later and zero-cost
to land now; defers workflow-template routing + per-stage lanes to v2
(kept forward-compat hooks in the schema).

Kernel
  - New `task_runs` table. Each claim opens a run (pid, claim_lock,
    heartbeat, max_runtime, started_at), each terminal transition
    closes it with an outcome (completed / blocked / crashed /
    timed_out / spawn_failed / gave_up / reclaimed). Multiple rows per
    task when retries happen, preserving full attempt history.
  - `tasks.current_run_id` points at the active run (NULL when idle);
    denormalised for cheap reads.
  - `task_events.run_id` carries the run a given event belongs to so
    UIs group events by attempt. claim/spawned/complete/block/crash/
    timeout/spawn_fail/gave_up/heartbeat events are all run-scoped;
    created/promoted/assigned/edited stay task-scoped (run_id=NULL).
  - Legacy DBs: migration adds the columns + indexes + synthesizes a
    run row for any task that's 'running' before the runs table
    existed, so subsequent complete/heartbeat/reclaim calls have a
    target. Idempotent.

Structured handoff
  - `complete_task(summary=, metadata=)` persists both on the closing
    run. `summary` falls back to `result` when omitted so single-run
    callers don't duplicate. `metadata` is a free-form dict
    ({changed_files, tests_run, findings, ...}).
  - `build_worker_context` rewrites: now reads "Prior attempts on this
    task" (closed runs: outcome, summary, error, metadata) and
    "Parent task results" pulls run.summary + run.metadata of the
    most-recent completed run per parent, falling back to task.result
    for legacy rows without runs. Retrying workers see why earlier
    attempts failed; downstream workers see parent handoffs
    structurally, not as loose `result` strings.

CLI
  - `hermes kanban complete <id> --summary "..." --metadata '{"files":1}'`.
    JSON is parsed and rejected with exit-2 if malformed.
  - New `hermes kanban runs <id> [--json]` verb. Shows per-run rows:
    outcome, profile, elapsed, summary, error. JSON mode serializes
    the full run dataclass for scripting.

Dashboard plugin
  - GET /tasks/:id now carries a runs[] array alongside task / events /
    comments / links. Each run serialised with outcome, summary,
    metadata, worker_pid, elapsed fields.
  - New Run History section in the drawer. Outcome-coloured left
    border (green=active, blue=completed, amber=reclaimed,
    red=crashed/timed_out/gave_up/blocked). Collapsed when >3 runs
    with a '+N earlier' toggle. Shows summary + error + metadata
    inline.

Forward-compat for v2 (vulcan's workflow templates + stages)
  - `tasks.workflow_template_id` and `tasks.current_step_key` added as
    nullable columns. v1 kernel ignores them for routing; v2 will add
    workflow_templates + workflow_steps tables and wire the dispatcher
    to consult them. task_runs has a matching `step_key` column. Lets
    a v2 release land additively without another schema migration.

Tests (+22 in test_kanban_core_functionality.py, +2 in dashboard)
  - run_created_on_claim / run_closed_on_complete_with_summary
  - run_summary_falls_back_to_result
  - multiple_attempts_preserved_as_runs (3 attempts: reclaimed →
    crashed → completed, all visible in list_runs)
  - run_on_block_with_reason / run_on_spawn_failure_records_failed_runs
    (5 spawn_failed runs + 1 gave_up run)
  - event_rows_carry_run_id (task-scoped vs run-scoped split)
  - build_worker_context_includes_prior_attempts
  - build_worker_context_uses_parent_run_summary (metadata JSON in context)
  - migration_backfills_inflight_run_for_legacy_db (simulates a
    pre-migration running task, re-runs init_db, asserts backfill)
  - forward_compat_columns_writable
  - cli_runs_verb + cli_runs_json
  - cli_complete_with_summary_and_metadata (JSON round-trip through
    shlex + argparse)
  - cli_complete_bad_metadata_exits_nonzero
  - task_detail_includes_runs / task_detail_runs_empty_before_claim

269/269 kanban suite pass under scripts/run_tests.sh. Live-smoke
covered: single-attempt complete → run closed + summary persisted;
retry scenario → two runs visible (blocked + completed); parent run
summary + metadata surfaced to child via build_worker_context;
forward-compat columns writable via UPDATE; GET /tasks/:id returns
runs[].

Docs
  - New 'Runs — one row per attempt' section in kanban.md: the
    why (full attempt history, structured metadata), the two-table
    model (task is logical, run is execution), the structured handoff
    shape (--summary / --metadata), example CLI + dashboard output,
    forward-compat note for v2.
  - Event reference updated to mention task_events.run_id.
  - CLI reference gains 'hermes kanban runs <id>'.

Not in v1 (deferred to v2):
  - Workflow templates (workflow_templates + workflow_steps tables,
    stage-based routing, success/failure step links).
  - 'stage' as a distinct axis from status in the UI.
  - Shared-by-default workspace binding across stages of the same
    workflow run.
  - Pipeline replacement for the kanban-orchestrator skill (the
    orchestrator's 'decompose, don't execute' guidance is still
    correct; it becomes partly redundant once workflows land).
…direct-status / drag-drop

Integration audit of the runs-as-first-class work (0146cb2) found five
bugs where structured runs got orphaned or dashboard parity was missing.
All behavioral fixes; no schema change needed.

Kernel
  - archive_task: when called on a running task, now closes the
    in-flight run with outcome='reclaimed' and clears current_run_id.
    Previously, dashboard bulk-archive or CLI `kanban archive <running>`
    would leave the task_runs row open with ended_at=NULL forever and
    strand the pointer. Adds the claim_lock / claim_expires / worker_pid
    clearing to the UPDATE so the task row is clean too.
  - complete_task: embeds the first-line handoff summary in the
    `completed` event payload (capped at 400 chars). Notifier can now
    render `✔ task done — <title>\n<summary>` without a second SQL hit,
    and the full summary still lives on the run row.

Dashboard plugin
  - _set_status_direct: drag-drop OFF 'running' (to 'ready', 'todo',
    'triage', 'done' — anywhere except back to 'running') now closes
    the active run with outcome='reclaimed'. Clears worker_pid too.
    Snapshots previous status + current_run_id before the UPDATE so
    the decision has the right before-state. status event rows now
    carry run_id when closing a run, NULL otherwise.
  - UpdateTaskBody: adds `summary` and `metadata` fields. PATCH
    /tasks/:id with status='done' now forwards them to complete_task,
    giving the dashboard parity with `hermes kanban complete --summary
    ... --metadata ...`. Previously these fields only existed on the
    CLI.

CLI
  - `hermes kanban complete a b c --summary X` or `--metadata Y`:
    refused with a clear stderr message instead of silently applying
    the same handoff to every task. Bulk-close without handoff flags
    still works. (Note: hermes_cli.main discards subcommand exit
    codes via `args.func(args)` without propagating; tracked
    separately. Side-effect check is the real guard.)

Gateway notifier
  - Completion message prefers run.summary (carried in event payload)
    over task.result. task.result remains the fallback for legacy rows
    written before runs shipped.
  - Docstring: renamed stale `spawn_auto_blocked` reference to
    `gave_up` / `timed_out` — matches the actual TERMINAL_KINDS
    tuple, which was already correct in code.

Tests (+8 in core functionality, +3 in dashboard plugin)
  - archive_of_running_task_closes_run
  - archive_of_ready_task_does_not_create_spurious_run
  - dashboard_direct_status_change_off_running_closes_run
  - dashboard_direct_status_change_within_same_state_is_noop_for_runs
  - cli_bulk_complete_with_summary_rejects (side-effect assertion)
  - cli_bulk_complete_without_summary_still_works
  - completed_event_payload_carries_summary
  - completed_event_payload_summary_none_when_missing
  - patch_status_done_with_summary_and_metadata
  - patch_status_done_without_summary_still_works (legacy path)
  - patch_status_archive_closes_running_run (E2E through FastAPI TestClient)

164/164 kanban suite pass under scripts/run_tests.sh. Live smoke
(execute_code with isolated HERMES_HOME) covered all five fixed paths
plus a re-claim-after-drag-drop to confirm the fresh run is tracked
correctly after the orphan close.
…aimed-on-status-change, completed event carries summary

- Runs section: dashboard PATCH parity (summary/metadata forward),
  `completed` event embeds first-line summary for notifiers, bulk
  --summary/--metadata refused, archive/drag-drop reclaim semantics.
- Event reference: added Payload column to Lifecycle and Edits
  tables; called out the invariant that `status` carries run_id
  when closing a reclaimed run.
…, invariant recovery, live drawer refresh

Second integration audit covering surfaces the first pass didn't hit.
Found eight issues spanning kernel, dashboard frontend, notifier, and CLI.
All behavioral / UX fixes; no schema change.

Kernel
  - complete_task on a never-claimed task (ready/blocked → done with no
    run in flight) was silently dropping the summary/metadata/result
    onto a non-existent run. Now synthesizes a zero-duration run
    (started_at == ended_at) so attempt history is complete. Only
    fires when there's actually handoff data to persist — bare
    complete_task(tid) remains a no-op for run creation.
  - block_task on a never-claimed task had the same bug for --reason.
    Same fix: synthesize a zero-duration run when a reason is passed.
  - Event dataclass gained a `run_id: Optional[int] = None` field.
    list_events, unseen_events_for_sub, and the dashboard _event_dict
    were all SELECTing the column but dropping it on the way out,
    so downstream consumers couldn't group events by attempt. Every
    read path now surfaces run_id.
  - claim_task got a defensive invariant-recovery step: if somehow
    `current_run_id` is non-NULL on a task in 'ready' status (invariant
    violation from an unknown code path), close the leaked run as
    'reclaimed' inside the same txn as the new claim. No-op in the
    common case; belt-and-suspenders in case a future code path forgets
    to clear the pointer.

Dashboard
  - GET /tasks/:id events array now carries run_id per event (via
    _event_dict).
  - WebSocket /events SELECT now includes run_id in the pushed event
    payload.
  - TaskDrawer reloads itself on live events for its own task id. New
    `taskEventTick[taskId]` state in the Board, incremented on every
    WS event, passed down as `eventTick` prop; drawer's useEffect
    depends on it. Previously, background workers completing a task
    the user was viewing left the drawer showing stale data until
    manual close/reopen.
  - CSS: added `.hermes-kanban-run--ended` rule for the fallback class
    the JS emits when outcome is unset. Harmless before; just
    inconsistent.

CLI
  - `hermes kanban watch --kinds` help text listed the legacy event
    name `spawn_auto_blocked`. The kernel migration renames it to
    `gave_up`, so users typing the documented name got zero matches.
    Now shows the current lexicon (`completed,blocked,gave_up,
    crashed,timed_out`).

Tests (+6 in core functionality, +1 in dashboard plugin)
  - complete_never_claimed_task_synthesizes_run
  - block_never_claimed_task_synthesizes_run
  - complete_never_claimed_without_handoff_skips_synthesis
  - event_dataclass_carries_run_id (created.run_id None, completed.run_id matches)
  - unseen_events_for_sub_includes_run_id (notifier path)
  - claim_task_recovers_from_invariant_leak (engineer the leak, verify recovery)
  - event_dict_includes_run_id (dashboard API shape)

171/171 kanban suite pass under scripts/run_tests.sh. Live-smoke (isolated
HERMES_HOME via execute_code) exercised all six fixed paths plus the
claim-after-leak recovery sequence.

Docs
  - Runs section: new 'Synthetic runs for never-claimed completions'
    and 'Live drawer refresh' paragraphs explaining the invariants.
  - Event reference: `created` / `promoted` / `unblocked` entries now
    explicitly note `run_id` is `NULL`; `completed` / `blocked`
    describe synthetic-run fallback.
… runs[]

Found during full-stack live-test of the kanban system: two bugs where
the kernel and CLI didn't match the documented contract.

Kernel
  - `connect()` now auto-runs schema creation + migrations on the
    first connection to a given DB path, matching its docstring
    ('Open (and initialize if needed)' — which was aspirational until
    now). Module-level _INITIALIZED_PATHS cache keeps subsequent
    connects cheap. Previously the docstring lied: every path that
    went through connect() on a fresh HERMES_HOME raised 'no such
    table: tasks' — only `hermes kanban init` and `daemon` triggered
    schema creation.
  - `init_db()` always re-runs the migration pass (clears the cache
    entry first). Callers that know the on-disk schema may have
    drifted — tests writing legacy event kinds, external tools
    upgrading an old DB file — can force re-migration.

CLI
  - `kanban_command()` entry point auto-inits the DB before
    dispatching any subcommand. Idempotent; the underlying
    connect-based init pattern makes this a one-line SELECT against
    sqlite_master after the first call.
  - `hermes kanban show --json` now includes:
      - `runs`: full attempt history (id, profile, step_key, status,
        outcome, summary, error, metadata, worker_pid, started_at,
        ended_at)
      - `run_id` on every event object
    Dashboard API already had both; CLI was behind. Now scripts that
    inspect a task can use `show --json` alone.
  - `hermes kanban show` (human-readable) prints a Runs section
    matching the `runs` subcommand's format, and each Event line
    prefixes its run_id when present. Makes attempt attribution
    visible at a glance without a second command.

Tests (+3)
  - cli_create_on_fresh_home_auto_inits (subprocess, no init_db call,
    must succeed — covers the most common first-user path)
  - connect_auto_inits_fresh_db (direct kernel use without init_db)
  - cli_show_json_carries_runs (runs[] present; events carry run_id)

174/174 kanban suite pass under scripts/run_tests.sh.

Live-tested end-to-end
  - Phases 1-6: CLI subprocess (create/claim/complete/bulk-guard/
    synthetic-run/reclaim-via-TTL/multi-attempt history)
  - Phases 7-10: Dashboard FastAPI TestClient (POST/PATCH with
    summary+metadata, drag-drop running->ready, archive-while-
    running, mark-done-with-handoff)
  - Phases 11-13: Dispatcher with stub spawn_fn (3 tasks success,
    3 failures -> gave_up circuit breaker, event.run_id attribution
    across retries)
  - Phases 14-15: WebSocket /events (run_id payload, auth rejects
    wrong tokens with code 1008)
  - Phase 16: Gateway notifier (unseen_events_for_sub returns
    run_id on events, message renders 'done — title' + handoff
    summary from event payload, crashed event path, sub cleanup)

Every surface — kernel, dispatcher, CLI, dashboard REST, dashboard
WebSocket, gateway notifier — exercised end-to-end against a live
FastAPI app with a real SQLite DB in an isolated HERMES_HOME. All
passed.
New website/docs/user-guide/features/kanban-tutorial.md walks four
user stories end-to-end, each backed by a real screenshot of the
dashboard running against seeded data.

Stories
  1. Solo dev shipping a feature (parent->child dependencies,
     structured handoff, run history rendering).
  2. Fleet farming (parallel independent tasks across 3 assignees,
     lanes-by-profile grouping, dispatcher daemon).
  3. Role pipeline with retry (PM spec -> eng implements -> review
     blocks -> eng retries -> review approves; two-run history
     visible in the drawer; downstream workers pull parent
     summary+metadata).
  4. Circuit breaker + crash recovery (2 spawn_failed + 1 gave_up
     for a deploy with missing creds; 1 crashed + 1 completed for
     an OOM-killed migration that recovered on retry).

Each story shows both CLI commands and the dashboard drawer
equivalent. Screenshots captured via playwright + chromium at 2x
device scale, then repalettized with PIL (22MB -> 6.1MB for the
10-image set, no visible quality loss verified against vision).

Side updates
  - website/sidebars.ts: added kanban-tutorial under features.
  - website/docs/user-guide/features/kanban.md: prefix banner
    linking new readers to the tutorial before the reference.

All image references validate: `/img/kanban-tutorial/*` maps to
website/static/img/kanban-tutorial/ (10 files). Docusaurus build
not run locally (no node_modules in worktree); CI build on merge
will confirm.
Six concrete bugs + two cheap v2 extensions from the review at
#16102 (comment)
Larger items (structured comments as session substrate, taxonomy
reorg) deferred to v2 with reply posted on the issue.

Pre-merge bug fixes
  - unblock_task: close any stale current_run_id pointer with a
    reclaimed run inside the unblock txn. Defensive; the invariant
    holds under current data paths (block_task already closes the
    run) but a future or external write that leaves the pointer
    dangling would otherwise persist across the ready->blocked->
    ready cycle. Mirrors the same pattern in claim_task +
    archive_task.
  - Migration backfill: wrap the in-flight backfill loop in
    write_txn and add a CAS guard (`current_run_id IS NULL`) on
    the pointer UPDATE, with a cleanup path that marks any orphan
    run row reclaimed if the CAS fails. Prevents races against a
    concurrent dispatcher between SELECT and INSERT.
  - Notifier sub leak on non-done terminals: unsub on the last
    delivered event's kind being terminal (completed / blocked /
    gave_up / crashed / timed_out), not just on task.status in
    (done, archived). blocked / gave_up / crashed / timed_out used
    to fire one ping then strand the subscription row forever.
  - Notifier thrashes dead chats: per-subscription send-failure
    counter keyed on (task_id, platform, chat_id, thread_id).
    After 3 consecutive adapter.send exceptions, drop the sub
    automatically. Counter resets on any successful send.

Daemon ops visibility
  - run_daemon on_tick now tracks consecutive ticks where the
    ready queue is non-empty but 0 spawns succeeded. After 6 such
    ticks (default ~30s at interval=5), emits a WARN line to
    stderr pointing at profile health (venv, PATH, credentials)
    and `hermes kanban list --status blocked`. Rate-limited to
    one message per 5 minutes so a persistent outage doesn't
    spam logs.

v2 extensions shipped in scope (pure upside)
  - build_worker_context: new "Recent work by @assignee" section
    surfacing the 5 most-recent completed runs for the current
    task's assignee (excluding this task). Bounded, cached by
    the natural LIMIT, no new dependencies. Skipped when the
    task has no assignee.
  - Gateway notifier message prefix: terminal pings now lead
    with `@<assignee>` so fleets (one chat subscribing to many
    tasks with different workers) stay legible at a glance.
    One-line template change.

Deferred to v2 (noted in reply to erosika)
  - recompute_ready full-scan starvation at 10k+ tasks: dirty-set
    approach is a real refactor; fine as follow-up.
  - Skill ↔ assignee validation for routing: depends on skill
    introspection surface that isn't nailed down.
  - Structured comments (in_reply_to / addressed_to / kind) as
    multi-peer session substrate: schema-affecting, exactly the
    v2-scope design vulcan flagged shouldn't cram into this PR.
  - Pattern vs mechanism taxonomy split in docs: pure docs reorg,
    low urgency.

Tests (+6 in core functionality)
  - unblock_invariant_recovery (engineered leak, defensive close)
  - unblock_normal_path_no_spurious_run (no run created on happy
    block->unblock; erosika's main concern)
  - migration_backfill_idempotent_under_re_run (3x init_db on a
    legacy-shape DB yields exactly 1 run row, not 3)
  - build_worker_context_includes_role_history (role continuity)
  - build_worker_context_role_history_skipped_when_no_assignee
  - build_worker_context_role_history_bounded_to_5

180/180 kanban suite pass under scripts/run_tests.sh. Live-smoke
exercised all three kernel fixes end-to-end with isolated
HERMES_HOME.
Added tests/stress/ — an opt-in suite that stresses the kernel with
real concurrency, real subprocesses, random property fuzzing, and scale
benchmarks. Exposed two real bugs the 4 audit passes missed.

Bugs found and fixed
  - _pid_alive returned True for zombie processes. os.kill(pid, 0)
    succeeds against zombies (process table entry exists until parent
    reaps), so a worker that exited normally but wasn't yet reaped
    would look alive to detect_crashed_workers indefinitely. Fixed by
    peeking at /proc/<pid>/status on Linux and treating State: Z as
    dead. No-op on other POSIX; Windows unchanged.
  - Task ID generator used 2 hex bytes (65k space). By birthday
    paradox, ~5% collision probability at 1k tasks, ~50% at 10k.
    Would raise UNIQUE constraint errors on large boards without a
    retry path. Bumped to 4 hex bytes (4.3B space). Surfaced by
    benchmarks trying to seed 10k tasks.

Stress suite (tests/stress/, opt-in via --run-stress)
  - test_concurrency.py — 5 processes race for 100 tasks. 1
    lost-claim race observed, 0 double-claims, 0 orphan runs.
  - test_concurrency_mixed.py — 10 workers + 1 reclaimer, 500
    tasks, random ops (claim/complete/block/unblock/archive).
    1518 events, 43 lost-claim races, zero invariant violations.
  - test_concurrency_reclaim_race.py — TTL < work duration so the
    reclaimer intentionally yanks tasks mid-work. 82 claims, 44
    reclaimed, 50 completed, 32 complete_refused (CAS correctly
    blocked late completes on reclaimed tasks).
  - test_subprocess_e2e.py — dispatcher spawns real python
    subprocess workers that heartbeat + complete via the CLI.
    Also exercises crash detection against a real dead PID via
    double-fork.
  - test_property_fuzzing.py — 500 randomized sequences, ~40k
    operations, 9 invariant checks after each step. Zero
    violations.
  - test_benchmarks.py — latency at 100/1k/10k tasks. Key numbers:
    dispatch_once @ 10k = 4ms, recompute_ready @ 10k = 47ms,
    list_tasks @ 10k = 50ms, build_worker_context w/ 50 parents
    = 1ms. Erosika's 10k starvation flag is pessimistic by roughly
    an order of magnitude.

Regression tests added to the main suite (+2)
  - test_pid_alive_detects_zombie: /proc check against a real
    zombified process (Linux-only, skipped elsewhere).
  - test_task_ids_dont_collide_at_scale: 500 creates, verify
    uniqueness + format.

New harness
  - tests/stress/conftest.py skips the suite by default; opt in with
    pytest --run-stress. Scripts are also __main__-runnable directly
    since they were developed as standalone stress tests.
  - tests/stress/_fake_worker.py: minimal Python worker that
    exercises the real subprocess contract (reads HERMES_KANBAN_TASK,
    heartbeats, completes via CLI).
  - tests/stress/README.md explains opt-in usage.

Test count: 182/182 kanban suite still pass under scripts/run_tests.sh;
stress suite is additive and out-of-band.
…mp fix

Added tests/stress/test_atypical_scenarios.py — 28 scenarios covering
atypical user inputs and environments that the normal tests assume
away. Surfaced one real UI bug in the process.

Bug found and fixed
  - CLI display of negative elapsed time when NTP jumps backward
    between claim_task and complete_task. The kernel faithfully stores
    started_at > ended_at; both CLI display sites (_cmd_show's Runs
    section at line 685, _cmd_runs's table at line 1153) now clamp
    elapsed to max(0, end - start), matching the dashboard JS which
    already had the same Math.max(0, ...) clamp. Regression test
    test_cli_show_clamps_negative_elapsed forces a future started_at
    via raw UPDATE and verifies neither show nor runs prints a
    `-<digits>s` token.

Atypical scenarios covered (all passing, 28 total)
  Data:
    - unicode_and_emoji: CJK, RTL (Hebrew/Arabic), ZWJ emoji sequences,
      control chars, null bytes in titles + metadata
    - huge_strings: 1 MB body + 1 MB summary + 50-level nested metadata
    - sql_injection_attempts: 6 classic payloads, parameterized queries
      hold across every string field
    - newlines_in_summary: full preserved on run, first line in event
      payload for notifier brevity
    - malformed_metadata_via_cli: 4 bad JSON values each cleanly
      rejected with stderr error, no partial task mutation
    - empty_string_fields: empty title rejected, whitespace-only title
      rejected, empty body accepted
    - tenant_with_newlines: multiline tenant strings survive
      board_stats

  Dependency graphs:
    - dependency_cycle: A→B→A refused
    - self_parent: cannot depend on itself
    - diamond_dependency: child promotes only when both parents done
    - wide_fan_out: 500 children promoted in 4ms via complete_task's
      internal recompute_ready
    - wide_fan_in: 500 parents → 1 child, promotion gated correctly
    - parent_in_different_status_states: child stays todo for parent
      in ready/running/blocked/triage/archived; only 'done' unblocks

  Workspace:
    - workspace_path_traversal: dir: workspaces are intentionally
      arbitrary paths (documented threat model)
    - workspace_nonexistent_path: spawn_failure counter increments,
      task returns to ready

  Clock:
    - clock_skew_start_greater_than_end: kernel stores faithfully,
      CLI now clamps at display (see Bug above)

  Filesystem:
    - hermes_home_with_spaces: works
    - hermes_home_with_unicode: works
    - hermes_home_via_symlink: two symlinks to same dir share DB
      (Path.resolve() in _INITIALIZED_PATHS)

  Scale extremes:
    - huge_run_count_on_one_task: 1000 runs → list_runs in <1ms,
      build_worker_context 3ms/63KB (flagged: unbounded on
      retry-heavy tasks; v2 cap candidate)
    - hundred_tenants: 5000 tasks / 100 tenants → stats 1ms, list 26ms
    - comment_storm: 1000 comments → 50KB worker context (same
      unbounded-context flag)

  Lifecycle:
    - completed_task_reclaim_attempt: done tasks can't be
      re-claimed/re-completed/blocked
    - archived_task_resurrection_attempt: archived tasks invisible to
      default list + all ops refused
    - unassigned_task_never_claims: dispatcher skips, task untouched

  Assignees:
    - assignee_with_special_chars: @-signs, dots, CJK, emoji, 200-char
      names, empty strings all handled

  Concurrency:
    - idempotency_key_race: two concurrent processes calling
      create_task with same key both get back the SAME task id,
      exactly one row in DB

  Dashboard:
    - dashboard_rest_with_weird_inputs: empty/huge/unicode titles,
      unknown fields, type mismatches handled correctly (200 for
      valid, 400/422 for invalid)

Observations flagged for v2 (not fixed, not blocking)
  - build_worker_context is unbounded on retry-heavy tasks (1000 runs
    → 63KB) and comment-heavy tasks (1000 comments → 50KB). A
    `--max-prior-attempts` / `--max-comments` cap would be appropriate.
  - `dir:` workspace paths are intentionally arbitrary; docs should
    note the threat model (trusted local user; path is stored
    verbatim, not sandboxed).

183/183 main kanban suite pass. Stress suite still opted out by
default; enable with `pytest --run-stress` or run scripts directly.
Both items the atypical-scenarios pass flagged as "v2 follow-up"
actually belong in v1. Fixed now.

Fix 1: workspace path traversal
  resolve_workspace now rejects non-absolute paths for all three
  workspace_kinds (scratch-with-explicit-path, dir:, worktree). A
  relative path like '../../../tmp/attacker' was being silently
  resolved against the dispatcher's CWD — a confused-deputy escape.
  Error message points users at the absolute-path requirement.
  Storage remains verbatim (kernel doesn't rewrite user input);
  the refusal happens at resolution time, so the dispatcher's
  existing spawn-failure circuit breaker correctly categorizes it.

  Threat model documented in website/docs/user-guide/features/kanban.md:
  single-host, trusted-local-user. The absolute-path rule prevents
  ambiguity-driven escape, not malicious access — kanban runs as you,
  with your uid, on your filesystem.

Fix 2: build_worker_context unbounded
  Added per-section caps so worker prompts stay bounded on pathological
  boards:
    _CTX_MAX_PRIOR_ATTEMPTS = 10   most-recent N runs shown; older
                                   collapsed into "N earlier attempts
                                   omitted" marker. Attempt numbering
                                   preserved (shows "Attempt 16" not
                                   renumbered).
    _CTX_MAX_COMMENTS       = 30   same pattern for comments.
    _CTX_MAX_FIELD_BYTES    = 4 KB per summary / error / metadata / result.
    _CTX_MAX_BODY_BYTES     = 8 KB per task.body (opening post).
    _CTX_MAX_COMMENT_BYTES  = 2 KB per comment.
  Truncation uses a visible ellipsis + char-count so the worker knows
  it's been truncated.

  Effect on atypical-scenario runs:
    huge_run_count_on_one_task (1000 runs):  63 KB  →    820 chars
    comment_storm       (1000 comments):     50 KB  →  1,671 chars

Tests (+6 in main suite)
  test_resolve_workspace_rejects_relative_dir_path — relative dir:
    path stored verbatim but refused at resolve.
  test_resolve_workspace_accepts_absolute_dir_path — legitimate
    absolute paths are created and returned.
  test_resolve_workspace_rejects_relative_worktree_path — same guard
    for worktree kind.
  test_build_worker_context_caps_prior_attempts — 25 runs → exactly
    _CTX_MAX_PRIOR_ATTEMPTS shown, omitted marker present,
    attempt numbering preserves original index.
  test_build_worker_context_caps_comments — 100 comments → 30 shown,
    70 in the omitted marker.
  test_build_worker_context_caps_huge_summary — 1 MB summary on a
    prior run → context under 10 KB total, truncation marker visible.

189/189 kanban suite pass. Atypical-scenarios stress script still
passes all 28 scenarios with the new caps in effect.
Seven new tools in `tools/kanban_tools.py` that give kanban workers a
backend-portable, schema-filtered way to interact with the board from
inside their own Python process — no shelling out to `hermes kanban`.

Motivation
  The CLI path (`hermes kanban complete \$TASK --summary ...`) breaks
  on any remote terminal backend (Docker, Modal, Singularity, SSH).
  The terminal tool runs `hermes kanban` inside the container, where
  `hermes` isn't installed and `~/.hermes/kanban.db` isn't mounted.
  Tools run in the agent's own Python process, so they always reach
  the board regardless of backend. Also skips shell-quoting fragility
  on --metadata JSON and gives structured error returns the model can
  reason about.

The seven tools
  kanban_show        read current task (defaults to HERMES_KANBAN_TASK)
  kanban_complete    structured handoff: summary + metadata
  kanban_block       ask for human input
  kanban_heartbeat   signal liveness during long operations
  kanban_comment     append to task thread
  kanban_create      fan out into child tasks (orchestrator path)
  kanban_link        add parent→child dependency after the fact

Gating
  Each tool's check_fn returns True iff HERMES_KANBAN_TASK is set in
  the process env. The dispatcher sets it when spawning a worker;
  normal `hermes chat` sessions never have it. Empirically verified:
  a baseline hermes-cli schema is 27 tools; with HERMES_KANBAN_TASK
  set it grows to exactly 34 (+7). Zero leak into normal sessions.

Also set HERMES_PROFILE in the spawn env so the kanban_comment tool's
author default works cleanly (it's what the tool reads to attribute
comments).

Skill updates
  - `skills/devops/kanban-worker/SKILL.md`: lifecycle rewritten to use
    kanban_show / kanban_heartbeat / kanban_block / kanban_complete /
    kanban_comment / kanban_create directly. CLI fallback section
    added for human operators / scripts.
  - `skills/devops/kanban-orchestrator/SKILL.md`: all examples ported
    from CLI to tool form; top-banner note explaining tools are the
    primary surface. kanban_create / kanban_link throughout.

Docs
  `website/docs/user-guide/features/kanban.md`:
  new "How workers interact with the board" section explaining the
  tool surface, gating mechanism, and why tools vs CLI. The worker
  skill / orchestrator skill subsections are now nested under it.

Tests (+25 in tests/tools/test_kanban_tools.py)
  - Schema gating: kanban_tools_hidden_without_env_var,
    kanban_tools_visible_with_env_var.
  - Happy paths: show (default + explicit task_id), complete (with
    summary+metadata, with result only), block, heartbeat (with and
    without note), comment (default + custom author), create (with
    list parents, with string parent), link.
  - Error paths: complete rejects no-handoff and non-dict metadata,
    block rejects empty reason, comment rejects empty body, create
    rejects no title / no assignee / non-list parents, link rejects
    self-reference / missing args / cycles.
  - End-to-end: full worker lifecycle driven entirely through the
    tools, verified against DB state.

214/214 kanban suite pass under scripts/run_tests.sh.
…nce-only

Moves the kanban worker lifecycle out of the skills and into the system
prompt as a conditionally-injected guidance block, matching the
existing MEMORY_GUIDANCE / SESSION_SEARCH_GUIDANCE / SKILLS_GUIDANCE
pattern in `agent/prompt_builder.py`. Workers now know the full
lifecycle even if no skill was loaded; skills become the deeper
playbook for edge cases.

Why
  Previously the 6-step lifecycle (orient → work → heartbeat → block/
  complete → fan-out) only reached the model if the kanban-worker
  skill was loaded. That's fragile: users configure skills-per-
  platform, profiles can forget to include them, skill loading order
  matters. The lifecycle is load-bearing for correct board behavior
  — it belongs in the prompt path.

What
  - New `KANBAN_GUIDANCE` constant in agent/prompt_builder.py (~3 KB):
    identity framing ("You are a Kanban worker"), 6-step lifecycle
    with exact tool-call shapes (`kanban_show()`, `kanban_complete(
    summary=..., metadata=...)`, `kanban_block(reason=...)`, etc.),
    orchestrator mode note, and explicit DO NOTs including "don't
    shell out to `hermes kanban <verb>`".
  - Wired into `AIAgent._build_system_prompt` gated on `kanban_show
    in self.valid_tool_names`. Because `kanban_show`'s `check_fn`
    gates on `HERMES_KANBAN_TASK` env var, this indirection
    guarantees guidance appears iff a worker was dispatched. Zero
    cost to normal sessions.

Gating verified live
  Normal `hermes chat` session: 2329-char prompt, zero kanban content.
  Worker session (HERMES_KANBAN_TASK set):  5272-char prompt, +2943
  chars of KANBAN_GUIDANCE present. Regression test asserts the
  header phrase and each tool name.

Skills reshape (both from ~6 KB lifecycles to ~7 KB references)
  - `skills/devops/kanban-worker/SKILL.md`: drops the lifecycle
    (now in the guidance block), keeps good-summary patterns,
    block-reason examples, retry diagnostics, pitfalls, CLI
    fallback cheatsheet. Adds explicit note that you don't need to
    load the skill to work a task anymore.
  - `skills/devops/kanban-orchestrator/SKILL.md`: drops the
    "don't execute" rule (now in guidance), keeps the full
    decomposition playbook, specialist roster with typical
    workspaces, a concrete Postgres-migration example with
    `kanban_create` + `parents=[...]` linking, and pitfalls
    (reassignment vs new task, link argument order, tenant
    inheritance).

Tests (+3)
  - `test_kanban_guidance_not_in_normal_prompt`: verifies a regular
    AIAgent with no env var produces a prompt that contains none of
    the kanban lifecycle phrases.
  - `test_kanban_guidance_in_worker_prompt`: spawns an AIAgent with
    HERMES_KANBAN_TASK set, verifies the header + all 4 tool-call
    examples + anti-shell guidance appear.
  - `test_kanban_guidance_prompt_size_bounded`: sanity check that
    the guidance is 1.5-4 KB so it doesn't balloon the cached prompt.

217/217 kanban + tools suite green; 303/303 run_agent suite green.
The guidance block sits alongside the other tool-gated blocks, so
prompt caching remains intact — the system prompt is stable across
every turn of a kanban worker's run.
`_default_spawn` now passes `--skills kanban-worker` in the child's
argv, so every dispatched worker gets the skill loaded automatically
regardless of the profile's default skill config.

Why
  The system prompt carries the MANDATORY lifecycle (via
  KANBAN_GUIDANCE). The skill carries the deeper reference material:
  good summary/metadata patterns, retry diagnostics, block-reason
  examples, workspace handling, CLI fallback. Both are useful;
  requiring users to wire the skill into skills config per-profile
  is a footgun — they'd hit tasks where workers use the lifecycle
  correctly but not the patterns, producing weaker handoffs.

  Auto-loading makes kanban-worker the baseline. Profiles can still
  add more skills via the normal config; --skills is additive.

Test
  New test_default_spawn_auto_loads_kanban_worker_skill intercepts
  subprocess.Popen to capture the argv without actually spawning a
  hermes subprocess. Asserts '--skills kanban-worker' appears in the
  cmd and the env still carries HERMES_KANBAN_TASK + HERMES_PROFILE.

Skill note updated
  Worker skill's opening note changed from "you don't need to load
  this" to "you're seeing this because the dispatcher loaded it for
  you" — reflects the new always-loaded reality.

218/218 kanban + tools suite green.
Tasks can now pin extra skills to load into their worker alongside the
built-in kanban-worker. Use cases: translation tasks that need a
translation skill, review tasks that need github-code-review, security
audits that need security-pr-audit — without editing the assignee's
profile config.

Changes:
- tasks.skills column (JSON array), idempotent migration, Task.skills
  dataclass field.
- create_task(skills=[...]) normalises (strip/dedupe), rejects commas
  in a single name.
- _default_spawn emits one `--skills X` pair per task skill, in
  addition to the built-in `--skills kanban-worker`. Deduped against
  the built-in.
- CLI: `hermes kanban create --skill <name>` (repeatable).
- Tool: `kanban_create(skills=[...])` (accepts list or single string).
- Dashboard: POST /tasks accepts `skills`; inline create form has a
  comma-separated skills input; drawer shows a Skills row.
- `hermes kanban show` prints a skills row when present.
- Docs: kanban.md has a new 'Pinning extra skills to a specific task'
  section; CLI reference shows `--skill`.
- Tests: 16 new (kernel round-trip, dedupe, comma-rejection, dispatcher
  argv with multiple skills, built-in dedupe, CLI flag repeatable, CLI
  no-flag stays None, show renders skills, idempotent migration on
  legacy DB, tool list/string/non-list, dashboard REST round-trip,
  dashboard default empty list).
…alone daemon

The kanban dispatcher is now embedded in the gateway process —
'gateway is the single dispatcher host' — removing the need for a
separate `hermes kanban daemon` or systemd unit. Typical user path
becomes: `hermes gateway start` + `hermes kanban create ...` and
ready tasks run on the next tick (60s default). The cost is ~300µs per
interval on an idle board; negligible.

Changes:
- config: new `kanban` section in DEFAULT_CONFIG with
  `dispatch_in_gateway: true` (default on) and
  `dispatch_interval_seconds: 60`. Additive — no \_config_version bump.
- gateway/run.py: new `_kanban_dispatcher_watcher()` background task,
  symmetric with `_kanban_notifier_watcher`. Reads config at boot;
  exits cleanly if the flag is off. Runs each tick via
  `asyncio.to_thread` so the SQLite WAL lock never blocks the loop.
  Sleeps in 1s slices so shutdown is snappy. Health telemetry mirrored
  from `_cmd_daemon` — warns when the ready queue is non-empty for
  6 consecutive ticks with 0 spawns. Env override
  `HERMES_KANBAN_DISPATCH_IN_GATEWAY=0` disables without editing
  config.yaml.
- hermes_cli/kanban.py: `_cmd_daemon` becomes a deprecation stub — no
  `--force`, exits 2 with migration guidance pointing at
  `hermes gateway start`. With `--force` (undocumented, help=SUPPRESS)
  the old standalone loop still runs, for headless hosts that can't run
  the gateway. Help text marks the subcommand DEPRECATED.
- hermes_cli/kanban.py: new `_check_dispatcher_presence()` helper —
  returns (running, human_message) by probing `gateway.status.get_running_pid`
  AND reading `kanban.dispatch_in_gateway`. Defensive: import/probe
  failures return (True, "") so we never cry wolf on partial installs.
- `hermes kanban create` prints a stderr warning when the task lands
  in 'ready' with an assignee but no dispatcher will pick it up. Skipped
  in `--json` mode so machine-parseable output stays clean. Skipped for
  triage/unassigned tasks (they can't dispatch regardless).
- `hermes kanban init` prints `hermes gateway start` as the next step
  (was: `hermes kanban daemon`).
- plugin_api.py: POST /tasks response includes a `warning` field when
  the same probe returns not-running, so the dashboard UI can surface
  a banner. Probe errors are swallowed silently — must never break
  create.
- dashboard dist/index.js: `createTask` threads the `warning` response
  field into the existing error-banner channel with 'Task created,
  but: ...'.
- toolsets.py: "spawned by `hermes kanban daemon`" description updated
  to "spawned by the kanban dispatcher (gateway-embedded by default)".
  Matching change to the tools/kanban_tools.py docstring.
- systemd unit: marked DEPRECATED in its Description + header comment,
  invokes the standalone daemon via the explicit `--force` flag so
  users who haven't migrated don't accidentally spawn duplicate
  dispatchers.
- docs: kanban.md 'Running the dispatcher as a service' section
  rewritten to describe gateway-embedded dispatch as the default; the
  standalone systemd instructions are gone. kanban-tutorial.md 'Start
  the daemon' block replaced with `hermes gateway start`. CLI command
  reference table now marks `hermes kanban daemon` DEPRECATED.

Tests (17 new):
- DEFAULT_CONFIG has kanban.dispatch_in_gateway=True and a sane
  interval.
- `_check_dispatcher_presence` returns running when gateway pid is
  found and flag is on; warns with `hermes gateway start` guidance
  when no gateway; warns with `dispatch_in_gateway` guidance when
  flag is off; silent on probe error.
- `hermes kanban create` (non-JSON) warns on stderr when no gateway
  and task is ready+assigned; stays silent when gateway is up; never
  warns on triage or unassigned tasks.
- `hermes kanban daemon` without --force prints DEPRECATED + exits 2.
- Argparse help for `kanban daemon` contains the word DEPRECATED.
- Gateway `_kanban_dispatcher_watcher` respects config flag=False
  (exits fast), respects HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 env
  override, and treats truthy env as 'defer to config' (not force-on).
- Dashboard POST /tasks response carries `warning` when probe says
  no dispatcher; omits it when probe says running; skips probe on
  triage tasks; survives probe errors without breaking create.

E2E verified in an isolated HERMES_HOME: watcher exits in 0.000s when
flag is off, 0.000s on env override, gracefully stops within 3s of
`_running=False` with the flag on.

Integration issues fixed in the same pass:
- Stale 'hermes kanban daemon --assignee' refs in kanban-tutorial.md
  (that flag never existed; just bad author copy).
- Stale toolset description pointing at the retired daemon.
- Stale docstring in tools/kanban_tools.py for `_check_kanban_mode`.
- Systemd unit still invoking `hermes kanban daemon` without
  `--force`; now invokes with `--force` and is marked DEPRECATED.
teknium1 added a commit that referenced this pull request Apr 30, 2026
Salvage of PR #16100 onto current main (after emozilla's #17514 fix
that unblocks plugin Pydantic body validation). History preserved on
the standing `feat/kanban-standing` branch; this squashes the 22
iterative commits into one clean landing.

What this lands:
- SQLite kernel (hermes_cli/kanban_db.py) — durable task board with
  tasks, task_links, task_runs, task_comments, task_events,
  kanban_notify_subs tables. WAL mode, atomic claim via CAS,
  tenant-namespaced, skills JSON array per task, max-runtime timeouts,
  worker heartbeats, idempotency keys, circuit breaker on repeated
  spawn failures, crash detection via /proc/<pid>/status, run history
  preserved across attempts.
- Dispatcher — runs inside the gateway by default
  (`kanban.dispatch_in_gateway: true`). Ticks every 60s, reclaims
  stale claims, promotes ready tasks, spawns `hermes -p <assignee>
  chat -q "work kanban task <id>"` with HERMES_KANBAN_TASK +
  HERMES_KANBAN_WORKSPACE env. Auto-loads `--skills kanban-worker`
  plus any per-task skills. Health telemetry warns on stuck ready
  queue.
- Structured tool surface (tools/kanban_tools.py) — 7 tools
  (kanban_show, kanban_complete, kanban_block, kanban_heartbeat,
  kanban_comment, kanban_create, kanban_link). Gated on
  HERMES_KANBAN_TASK via check_fn so zero schema footprint in normal
  sessions.
- System-prompt guidance (agent/prompt_builder.py KANBAN_GUIDANCE)
  injected only when kanban tools are active.
- Dashboard plugin (plugins/kanban/dashboard/) — Linear-style board
  UI: triage/todo/ready/running/blocked/done columns, drag-drop,
  inline create, task drawer with markdown, comments, run history,
  dependency editor, bulk ops, lanes-by-profile grouping, WS-driven
  live refresh. Matches active dashboard theme via CSS variables.
- CLI — `hermes kanban init|create|list|show|assign|link|unlink|
  claim|comment|complete|block|unblock|archive|tail|dispatch|context|
  init|gc|watch|stats|notify|log|heartbeat|runs|assignees` +
  `/kanban` slash in-session.
- Worker + orchestrator skills (skills/devops/kanban-worker +
  kanban-orchestrator) — pattern library for good summary/metadata
  shapes, retry diagnostics, block-reason examples, fan-out patterns.
- Per-task force-loaded skills — `--skill <name>` (repeatable),
  stored as JSON, threaded through to dispatcher argv as one
  `--skills X` pair per skill alongside the built-in kanban-worker.
  Dashboard + CLI + tool parity.
- Deprecation of standalone `hermes kanban daemon` — stub exits 2
  with migration guidance; `--force` escape hatch for headless hosts.
- Docs (website/docs/user-guide/features/kanban.md + kanban-tutorial.md)
  with 11 dashboard screenshots walking through four user stories
  (Solo Dev, Fleet Farming, Role Pipeline, Circuit Breaker).
- Tests (251 passing): kernel schema + migration + CAS atomicity,
  dispatcher logic, circuit breaker, crash detection, max-runtime
  timeouts, claim lifecycle, tenant isolation, idempotency keys, per-
  task skills round-trip + validation + dispatcher argv, tool surface
  (7 tools × round-trip + error paths), dashboard REST (CRUD + bulk
  + links + warnings), gateway-embedded dispatcher (config gate, env
  override, graceful shutdown), CLI deprecation stub, migration from
  legacy schemas.

Gateway integration:
- GatewayRunner._kanban_dispatcher_watcher — new asyncio background
  task, symmetric with _kanban_notifier_watcher. Runs dispatch_once
  via asyncio.to_thread so SQLite WAL never blocks the loop. Sleeps
  in 1s slices for snappy shutdown. Respects HERMES_KANBAN_DISPATCH_IN_GATEWAY=0
  env override for debugging.
- Config: new `kanban` section in DEFAULT_CONFIG with
  `dispatch_in_gateway: true` (default) + `dispatch_interval_seconds: 60`.
  Additive — no \_config_version bump needed.

Forward-compat:
- workflow_template_id / current_step_key columns on tasks (v1 writes
  NULL; v2 will use them for routing).
- task_runs holds claim machinery (claim_lock, claim_expires,
  worker_pid, last_heartbeat_at) so multi-attempt history is first-
  class from day one.

Closes #16102.

Co-authored-by: emozilla <emozilla@nousresearch.com>
@teknium1
Copy link
Copy Markdown
Contributor Author

Landed via salvage PR #17805 onto current main (22 iterative commits squashed into one clean commit). History preserved on the feat/kanban-standing branch for anyone tracing the development arc. Closes #16102.

@teknium1 teknium1 closed this Apr 30, 2026
nickdlkk pushed a commit to nickdlkk/hermes-agent that referenced this pull request May 11, 2026
…#17805)

Salvage of PR NousResearch#16100 onto current main (after emozilla's NousResearch#17514 fix
that unblocks plugin Pydantic body validation). History preserved on
the standing `feat/kanban-standing` branch; this squashes the 22
iterative commits into one clean landing.

What this lands:
- SQLite kernel (hermes_cli/kanban_db.py) — durable task board with
  tasks, task_links, task_runs, task_comments, task_events,
  kanban_notify_subs tables. WAL mode, atomic claim via CAS,
  tenant-namespaced, skills JSON array per task, max-runtime timeouts,
  worker heartbeats, idempotency keys, circuit breaker on repeated
  spawn failures, crash detection via /proc/<pid>/status, run history
  preserved across attempts.
- Dispatcher — runs inside the gateway by default
  (`kanban.dispatch_in_gateway: true`). Ticks every 60s, reclaims
  stale claims, promotes ready tasks, spawns `hermes -p <assignee>
  chat -q "work kanban task <id>"` with HERMES_KANBAN_TASK +
  HERMES_KANBAN_WORKSPACE env. Auto-loads `--skills kanban-worker`
  plus any per-task skills. Health telemetry warns on stuck ready
  queue.
- Structured tool surface (tools/kanban_tools.py) — 7 tools
  (kanban_show, kanban_complete, kanban_block, kanban_heartbeat,
  kanban_comment, kanban_create, kanban_link). Gated on
  HERMES_KANBAN_TASK via check_fn so zero schema footprint in normal
  sessions.
- System-prompt guidance (agent/prompt_builder.py KANBAN_GUIDANCE)
  injected only when kanban tools are active.
- Dashboard plugin (plugins/kanban/dashboard/) — Linear-style board
  UI: triage/todo/ready/running/blocked/done columns, drag-drop,
  inline create, task drawer with markdown, comments, run history,
  dependency editor, bulk ops, lanes-by-profile grouping, WS-driven
  live refresh. Matches active dashboard theme via CSS variables.
- CLI — `hermes kanban init|create|list|show|assign|link|unlink|
  claim|comment|complete|block|unblock|archive|tail|dispatch|context|
  init|gc|watch|stats|notify|log|heartbeat|runs|assignees` +
  `/kanban` slash in-session.
- Worker + orchestrator skills (skills/devops/kanban-worker +
  kanban-orchestrator) — pattern library for good summary/metadata
  shapes, retry diagnostics, block-reason examples, fan-out patterns.
- Per-task force-loaded skills — `--skill <name>` (repeatable),
  stored as JSON, threaded through to dispatcher argv as one
  `--skills X` pair per skill alongside the built-in kanban-worker.
  Dashboard + CLI + tool parity.
- Deprecation of standalone `hermes kanban daemon` — stub exits 2
  with migration guidance; `--force` escape hatch for headless hosts.
- Docs (website/docs/user-guide/features/kanban.md + kanban-tutorial.md)
  with 11 dashboard screenshots walking through four user stories
  (Solo Dev, Fleet Farming, Role Pipeline, Circuit Breaker).
- Tests (251 passing): kernel schema + migration + CAS atomicity,
  dispatcher logic, circuit breaker, crash detection, max-runtime
  timeouts, claim lifecycle, tenant isolation, idempotency keys, per-
  task skills round-trip + validation + dispatcher argv, tool surface
  (7 tools × round-trip + error paths), dashboard REST (CRUD + bulk
  + links + warnings), gateway-embedded dispatcher (config gate, env
  override, graceful shutdown), CLI deprecation stub, migration from
  legacy schemas.

Gateway integration:
- GatewayRunner._kanban_dispatcher_watcher — new asyncio background
  task, symmetric with _kanban_notifier_watcher. Runs dispatch_once
  via asyncio.to_thread so SQLite WAL never blocks the loop. Sleeps
  in 1s slices for snappy shutdown. Respects HERMES_KANBAN_DISPATCH_IN_GATEWAY=0
  env override for debugging.
- Config: new `kanban` section in DEFAULT_CONFIG with
  `dispatch_in_gateway: true` (default) + `dispatch_interval_seconds: 60`.
  Additive — no \_config_version bump needed.

Forward-compat:
- workflow_template_id / current_step_key columns on tasks (v1 writes
  NULL; v2 will use them for routing).
- task_runs holds claim machinery (claim_lock, claim_expires,
  worker_pid, last_heartbeat_at) so multi-attempt history is first-
  class from day one.

Closes NousResearch#16102.

Co-authored-by: emozilla <emozilla@nousresearch.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/cron Cron scheduler and job management comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants