Commit a24c6e1

fix(kanban): address @erosika's pre-merge review (issue #16102)
Six concrete bugs + two cheap v2 extensions from the review at #16102 (comment). Larger items (structured comments as session substrate, taxonomy reorg) deferred to v2 with a reply posted on the issue.

Pre-merge bug fixes

- unblock_task: close any stale current_run_id pointer with a reclaimed run inside the unblock txn. Defensive; the invariant holds under current data paths (block_task already closes the run), but a future or external write that leaves the pointer dangling would otherwise persist across the ready -> blocked -> ready cycle. Mirrors the same pattern in claim_task + archive_task.
- Migration backfill: wrap the in-flight backfill loop in write_txn and add a CAS guard (`current_run_id IS NULL`) on the pointer UPDATE, with a cleanup path that marks any orphan run row reclaimed if the CAS fails. Prevents races against a concurrent dispatcher between SELECT and INSERT.
- Notifier sub leak on non-done terminals: unsub when the last delivered event's kind is terminal (completed / blocked / gave_up / crashed / timed_out), not just when task.status is in (done, archived). blocked / gave_up / crashed / timed_out used to fire one ping and then strand the subscription row forever.
- Notifier thrashes dead chats: per-subscription send-failure counter keyed on (task_id, platform, chat_id, thread_id). After 3 consecutive adapter.send exceptions, drop the sub automatically. Counter resets on any successful send.

Daemon ops visibility

- run_daemon on_tick now tracks consecutive ticks where the ready queue is non-empty but 0 spawns succeeded. After 6 such ticks (default ~30s at interval=5), it emits a WARN line to stderr pointing at profile health (venv, PATH, credentials) and `hermes kanban list --status blocked`. Rate-limited to one message per 5 minutes so a persistent outage doesn't spam logs.

v2 extensions shipped in scope (pure upside)

- build_worker_context: new "Recent work by @assignee" section surfacing the 5 most-recent completed runs for the current task's assignee (excluding this task). Bounded by the natural LIMIT, no new dependencies. Skipped when the task has no assignee.
- Gateway notifier message prefix: terminal pings now lead with `@<assignee>` so fleets (one chat subscribing to many tasks with different workers) stay legible at a glance. One-line template change.

Deferred to v2 (noted in reply to erosika)

- recompute_ready full-scan starvation at 10k+ tasks: the dirty-set approach is a real refactor; fine as a follow-up.
- Skill ↔ assignee validation for routing: depends on a skill introspection surface that isn't nailed down.
- Structured comments (in_reply_to / addressed_to / kind) as multi-peer session substrate: schema-affecting, exactly the v2-scope design vulcan flagged shouldn't be crammed into this PR.
- Pattern vs. mechanism taxonomy split in docs: pure docs reorg, low urgency.

Tests (+6 in core functionality)

- unblock_invariant_recovery (engineered leak, defensive close)
- unblock_normal_path_no_spurious_run (no run created on happy block -> unblock; erosika's main concern)
- migration_backfill_idempotent_under_re_run (3x init_db on a legacy-shape DB yields exactly 1 run row, not 3)
- build_worker_context_includes_role_history (role continuity)
- build_worker_context_role_history_skipped_when_no_assignee
- build_worker_context_role_history_bounded_to_5

180/180 kanban suite pass under scripts/run_tests.sh. Live-smoke exercised all three kernel fixes end-to-end with isolated HERMES_HOME.
1 parent 7206eed commit a24c6e1

4 files changed

Lines changed: 360 additions & 40 deletions

File tree

gateway/run.py

Lines changed: 56 additions & 8 deletions
@@ -2490,6 +2490,23 @@ async def _kanban_notifier_watcher(self, interval: float = 5.0) -> None:
                 return

         TERMINAL_KINDS = ("completed", "blocked", "gave_up", "crashed", "timed_out")
+        # Terminal event kinds trigger automatic unsubscription — the task
+        # is done, blocked, or in a retry-needed state that the human
+        # shouldn't keep pinging a stale chat for. Previously we only
+        # unsubbed when task.status in ('done', 'archived'), which left
+        # subscriptions on 'blocked' / 'gave_up' / 'crashed' / 'timed_out'
+        # tasks stranded forever.
+        TERMINAL_EVENT_KINDS = TERMINAL_KINDS
+        # Per-subscription send-failure counter. Adapter.send raising
+        # means the chat is dead (deleted, bot kicked, etc.) — after N
+        # consecutive send failures the sub is dropped so we don't spin
+        # against a dead chat every 5 seconds forever.
+        MAX_SEND_FAILURES = 3
+        sub_fail_counts: dict[tuple, int] = getattr(
+            self, "_kanban_sub_fail_counts", {}
+        )
+        self._kanban_sub_fail_counts = sub_fail_counts
+
         # Initial delay so the gateway can finish wiring adapters.
         await asyncio.sleep(5)

@@ -2546,6 +2563,11 @@ def _collect():
            title = (task.title if task else sub["task_id"])[:120]
            for ev in d["events"]:
                kind = ev.kind
+               # Identity prefix: attribute terminal pings to the
+               # worker that did the work. Makes fleets (where one
+               # chat subscribes to many tasks) legible at a glance.
+               who = (task.assignee if task and task.assignee else None)
+               tag = f"@{who} " if who else ""
                if kind == "completed":
                    # Prefer the run's summary (the worker's
                    # intentional human-facing handoff, carried

@@ -2563,57 +2585,83 @@ def _collect():
                        r = task.result.strip().splitlines()[0][:160]
                        handoff = f"\n{r}"
                    msg = (
-                       f"✔ Kanban {sub['task_id']} done"
+                       f"✔ {tag}Kanban {sub['task_id']} done"
                        f" — {title}{handoff}"
                    )
                elif kind == "blocked":
                    reason = ""
                    if ev.payload and ev.payload.get("reason"):
                        reason = f": {str(ev.payload['reason'])[:160]}"
-                   msg = f"⏸ Kanban {sub['task_id']} blocked{reason}"
+                   msg = f"⏸ {tag}Kanban {sub['task_id']} blocked{reason}"
                elif kind == "gave_up":
                    err = ""
                    if ev.payload and ev.payload.get("error"):
                        err = f"\n{str(ev.payload['error'])[:200]}"
                    msg = (
-                       f"✖ Kanban {sub['task_id']} gave up "
+                       f"✖ {tag}Kanban {sub['task_id']} gave up "
                        f"after repeated spawn failures{err}"
                    )
                elif kind == "crashed":
                    msg = (
-                       f"✖ Kanban {sub['task_id']} worker crashed "
+                       f"✖ {tag}Kanban {sub['task_id']} worker crashed "
                        f"(pid gone); dispatcher will retry"
                    )
                elif kind == "timed_out":
                    limit = 0
                    if ev.payload and ev.payload.get("limit_seconds"):
                        limit = int(ev.payload["limit_seconds"])
                    msg = (
-                       f"⏱ Kanban {sub['task_id']} timed out "
+                       f"⏱ {tag}Kanban {sub['task_id']} timed out "
                        f"(max_runtime={limit}s); will retry"
                    )
                else:
                    continue
                metadata: dict[str, Any] = {}
                if sub.get("thread_id"):
                    metadata["thread_id"] = sub["thread_id"]
+               sub_key = (
+                   sub["task_id"], sub["platform"],
+                   sub["chat_id"], sub.get("thread_id") or "",
+               )
                try:
                    await adapter.send(
                        sub["chat_id"], msg, metadata=metadata,
                    )
+                   # Reset the failure counter on success.
+                   sub_fail_counts.pop(sub_key, None)
                except Exception as exc:
+                   fails = sub_fail_counts.get(sub_key, 0) + 1
+                   sub_fail_counts[sub_key] = fails
                    logger.warning(
-                       "kanban notifier: send failed for %s on %s: %s",
-                       sub["task_id"], platform_str, exc,
+                       "kanban notifier: send failed for %s on %s "
+                       "(attempt %d/%d): %s",
+                       sub["task_id"], platform_str, fails,
+                       MAX_SEND_FAILURES, exc,
                    )
+                   if fails >= MAX_SEND_FAILURES:
+                       logger.warning(
+                           "kanban notifier: dropping subscription "
+                           "%s on %s after %d consecutive send failures",
+                           sub["task_id"], platform_str, fails,
+                       )
+                       await asyncio.to_thread(self._kanban_unsub, sub)
+                       sub_fail_counts.pop(sub_key, None)
                    # Don't advance cursor on send failure — retry next tick.
                    break
            else:
                # All events delivered; advance cursor + maybe unsub.
                await asyncio.to_thread(
                    self._kanban_advance, sub, d["cursor"],
                )
-               if task and task.status in ("done", "archived"):
+               # Unsubscribe when the LAST delivered event is a
+               # terminal kind (the task hit a "no further updates"
+               # state), not just on task.status in {done, archived}.
+               # Covers blocked / gave_up / crashed / timed_out which
+               # used to leak subs forever.
+               last_kind = d["events"][-1].kind if d["events"] else None
+               task_terminal = task and task.status in ("done", "archived")
+               event_terminal = last_kind in TERMINAL_EVENT_KINDS
+               if task_terminal or event_terminal:
                    await asyncio.to_thread(
                        self._kanban_unsub, sub,
                    )
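The consecutive-failure counter in the diff above reduces to a small pure function. A minimal sketch under assumed names (`record_send_result` and its signature are illustrative, not the gateway's API): any successful send resets the count, and the caller drops the subscription once the threshold is hit.

```python
def record_send_result(
    fail_counts: dict, sub_key: tuple, ok: bool, max_failures: int = 3
) -> bool:
    """Track consecutive send failures per subscription key.

    Returns True when the subscription should be dropped. Mirrors the
    counter logic in the diff: success resets the counter; `max_failures`
    consecutive failures mean the chat is dead, so stop retrying it.
    """
    if ok:
        fail_counts.pop(sub_key, None)
        return False
    fails = fail_counts.get(sub_key, 0) + 1
    fail_counts[sub_key] = fails
    if fails >= max_failures:
        # Forget the counter too — the sub is gone, no state to keep.
        fail_counts.pop(sub_key, None)
        return True
    return False
```

Keying on the full (task_id, platform, chat_id, thread_id) tuple means one dead thread in a chat doesn't poison deliveries to other threads of the same chat.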

hermes_cli/kanban.py

Lines changed: 45 additions & 0 deletions
@@ -924,7 +924,38 @@ def _cmd_daemon(args: argparse.Namespace) -> int:
    print(f"Kanban dispatcher running (interval={args.interval}s, pid={os.getpid()}). "
          f"Ctrl-C to stop.")

+   # Health telemetry: warn when every tick finds ready work but fails to
+   # spawn any worker. Catches broken profiles, PATH drift, missing venv,
+   # credential loss — cases where the per-task circuit breaker auto-blocks
+   # each task quietly but the operator has no signal that the dispatcher
+   # itself is dysfunctional.
+   HEALTH_WINDOW = 6  # ticks (default 30s at interval=5)
+   health_state = {"bad_ticks": 0, "last_warn_at": 0}
+
    def _on_tick(res):
+       ready_pending = bool(res.skipped_unassigned) or _ready_queue_nonempty()
+       spawned_any = bool(res.spawned)
+       if ready_pending and not spawned_any:
+           health_state["bad_ticks"] += 1
+       else:
+           health_state["bad_ticks"] = 0
+       # Emit a warning once per HEALTH_WINDOW bad ticks (not every tick)
+       # so log volume stays bounded while the problem persists.
+       if health_state["bad_ticks"] >= HEALTH_WINDOW:
+           now = int(time.time())
+           # Rate-limit repeats: at most one warning per 5 minutes.
+           if now - health_state["last_warn_at"] >= 300:
+               print(
+                   f"[{_fmt_ts(now)}] WARN dispatcher stuck: "
+                   f"ready queue non-empty for {health_state['bad_ticks']} "
+                   f"consecutive ticks but 0 workers spawned successfully. "
+                   f"Check profile health (venv, PATH, credentials) and "
+                   f"`hermes kanban list --status ready` / "
+                   f"`hermes kanban list --status blocked` for recent "
+                   f"spawn_failed tasks.",
+                   file=sys.stderr, flush=True,
+               )
+               health_state["last_warn_at"] = now
        if not verbose:
            return
        did_work = (

@@ -941,6 +972,20 @@ def _on_tick(res):
            flush=True,
        )

+   def _ready_queue_nonempty() -> bool:
+       """Cheap SELECT — just asks whether there's at least one ready
+       task with an assignee that the dispatcher could have picked up."""
+       try:
+           with kb.connect() as conn:
+               row = conn.execute(
+                   "SELECT 1 FROM tasks "
+                   "WHERE status = 'ready' AND assignee IS NOT NULL "
+                   "  AND claim_lock IS NULL LIMIT 1"
+               ).fetchone()
+               return row is not None
+       except Exception:
+           return False
+
    try:
        kb.run_daemon(
            interval=args.interval,
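The warn decision in `_on_tick` can be isolated as a pure function of the tick outcome and the clock, which is how the rate-limiting is easiest to reason about. A sketch under assumed names (`should_warn` is illustrative, not the CLI's API); the caller would pass the real tick result and `time.time()`.

```python
def should_warn(
    state: dict, ready_pending: bool, spawned_any: bool, now: int,
    window: int = 6, cooldown: int = 300,
) -> bool:
    """Decide whether this tick should emit the stuck-dispatcher warning.

    Counts consecutive ticks with ready work but zero spawns; once the
    streak reaches `window`, warn, but at most once per `cooldown` seconds.
    Any healthy tick (no ready work, or a successful spawn) resets the streak.
    """
    if ready_pending and not spawned_any:
        state["bad_ticks"] += 1
    else:
        state["bad_ticks"] = 0
    if state["bad_ticks"] >= window and now - state["last_warn_at"] >= cooldown:
        state["last_warn_at"] = now
        return True
    return False
```

At interval=5 the streak reaches the window after roughly 30 seconds of stuck ticks, and a persistent outage produces one warning every 5 minutes rather than one per tick.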

hermes_cli/kanban_db.py

Lines changed: 100 additions & 32 deletions
@@ -436,40 +436,56 @@ def _migrate_add_optional_columns(conn: sqlite3.Connection) -> None:
    # One-shot backfill: any task that is 'running' before runs existed
    # had its claim_lock / claim_expires / worker_pid on the task row.
    # Synthesize a matching task_runs row so subsequent end-run / heartbeat
-   # calls have something to write to. Safe to re-run: the check below
-   # skips tasks that already have a current_run_id.
+   # calls have something to write to. Wrapped in write_txn to serialize
+   # against any concurrent dispatcher, and the per-row UPDATE uses
+   # ``current_run_id IS NULL`` as a CAS guard so a racing claim can't
+   # produce an orphaned row if it interleaves with the backfill pass.
    runs_exist = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='task_runs'"
    ).fetchone() is not None
    if runs_exist:
-       inflight = conn.execute(
-           "SELECT id, assignee, claim_lock, claim_expires, worker_pid, "
-           "       max_runtime_seconds, last_heartbeat_at, started_at "
-           "FROM tasks "
-           "WHERE status = 'running' AND (current_run_id IS NULL)"
-       ).fetchall()
-       for row in inflight:
-           started = row["started_at"] or int(time.time())
-           cur = conn.execute(
-               """
-               INSERT INTO task_runs (
-                   task_id, profile, status,
-                   claim_lock, claim_expires, worker_pid,
-                   max_runtime_seconds, last_heartbeat_at,
-                   started_at
-               ) VALUES (?, ?, 'running', ?, ?, ?, ?, ?, ?)
-               """,
-               (
-                   row["id"], row["assignee"], row["claim_lock"],
-                   row["claim_expires"], row["worker_pid"],
-                   row["max_runtime_seconds"], row["last_heartbeat_at"],
-                   started,
-               ),
-           )
-           conn.execute(
-               "UPDATE tasks SET current_run_id = ? WHERE id = ?",
-               (cur.lastrowid, row["id"]),
-           )
+       with write_txn(conn):
+           inflight = conn.execute(
+               "SELECT id, assignee, claim_lock, claim_expires, worker_pid, "
+               "       max_runtime_seconds, last_heartbeat_at, started_at "
+               "FROM tasks "
+               "WHERE status = 'running' AND current_run_id IS NULL"
+           ).fetchall()
+           for row in inflight:
+               started = row["started_at"] or int(time.time())
+               cur = conn.execute(
+                   """
+                   INSERT INTO task_runs (
+                       task_id, profile, status,
+                       claim_lock, claim_expires, worker_pid,
+                       max_runtime_seconds, last_heartbeat_at,
+                       started_at
+                   ) VALUES (?, ?, 'running', ?, ?, ?, ?, ?, ?)
+                   """,
+                   (
+                       row["id"], row["assignee"], row["claim_lock"],
+                       row["claim_expires"], row["worker_pid"],
+                       row["max_runtime_seconds"], row["last_heartbeat_at"],
+                       started,
+                   ),
+               )
+               # CAS: only install the pointer if nothing else claimed
+               # the task between our SELECT and here (shouldn't happen
+               # under the write_txn, but belt-and-suspenders). If the
+               # CAS fails we've got an orphan run row — mark it
+               # reclaimed so it doesn't look in-flight.
+               upd = conn.execute(
+                   "UPDATE tasks SET current_run_id = ? "
+                   "WHERE id = ? AND current_run_id IS NULL",
+                   (cur.lastrowid, row["id"]),
+               )
+               if upd.rowcount != 1:
+                   conn.execute(
+                       "UPDATE task_runs SET status = 'reclaimed', "
+                       "       outcome = 'reclaimed', ended_at = ? "
+                       "WHERE id = ?",
+                       (int(time.time()), cur.lastrowid),
+                   )

    # One-shot event-kind rename pass. The old names ("ready", "priority",
    # "spawn_auto_blocked") still worked but were awkward on the wire;
@@ -1356,10 +1372,36 @@ def block_task(


 def unblock_task(conn: sqlite3.Connection, task_id: str) -> bool:
-   """Transition ``blocked -> ready``."""
+   """Transition ``blocked -> ready``.
+
+   Defensively closes any stale ``current_run_id`` pointer before flipping
+   status. In the common path (``block_task`` closed the run already) this
+   is a no-op. If a future or external write left the pointer dangling,
+   the leaked run is closed as ``reclaimed`` inside the same txn so the
+   runs invariant (``current_run_id IS NULL`` ⇔ run row in terminal
+   state) holds for the rest of this function's lifetime.
+   """
+   now = int(time.time())
    with write_txn(conn):
+       stale = conn.execute(
+           "SELECT current_run_id FROM tasks WHERE id = ? AND status = 'blocked'",
+           (task_id,),
+       ).fetchone()
+       if stale and stale["current_run_id"]:
+           conn.execute(
+               """
+               UPDATE task_runs
+               SET status = 'reclaimed', outcome = 'reclaimed',
+                   summary = COALESCE(summary, 'invariant recovery on unblock'),
+                   ended_at = ?,
+                   claim_lock = NULL, claim_expires = NULL, worker_pid = NULL
+               WHERE id = ? AND ended_at IS NULL
+               """,
+               (now, int(stale["current_run_id"])),
+           )
        cur = conn.execute(
-           "UPDATE tasks SET status = 'ready' WHERE id = ? AND status = 'blocked'",
+           "UPDATE tasks SET status = 'ready', current_run_id = NULL "
+           "WHERE id = ? AND status = 'blocked'",
            (task_id,),
        )
        if cur.rowcount != 1:
@@ -2100,6 +2142,32 @@ def build_worker_context(conn: sqlite3.Connection, task_id: str) -> str:
    lines.extend(body_lines)
    lines.append("")

+   # Cross-task role history: what else has THIS assignee completed
+   # recently? Gives the worker implicit continuity — "I'm the reviewer
+   # and my last three reviews focused on security" — without forcing
+   # the user to wire anything into SOUL.md / MEMORY.md. Bounded to the
+   # most recent 5 completed runs, excluding this task so the retry
+   # section above isn't duplicated. Safe on assignee=None (skipped).
+   if task.assignee:
+       role_rows = conn.execute(
+           "SELECT t.id, t.title, r.summary, r.ended_at "
+           "FROM task_runs r JOIN tasks t ON r.task_id = t.id "
+           "WHERE r.profile = ? AND r.task_id != ? "
+           "  AND r.outcome = 'completed' "
+           "ORDER BY r.ended_at DESC LIMIT 5",
+           (task.assignee, task_id),
+       ).fetchall()
+       if role_rows:
+           lines.append(f"## Recent work by @{task.assignee}")
+           for row in role_rows:
+               ts = time.strftime(
+                   "%Y-%m-%d %H:%M", time.localtime(int(row["ended_at"]))
+               )
+               s = (row["summary"] or "").strip().splitlines()
+               first = s[0][:200] if s else "(no summary)"
+               lines.append(f"- {row['id']}{row['title']} ({ts}): {first}")
+           lines.append("")
+
    comments = list_comments(conn, task_id)
    if comments:
        lines.append("## Comment thread")
