Skip to content

fix(kanban): auto-block workers that exit without completing + shutdown race (#20894)#21214

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-26a2390d
May 7, 2026
Merged

fix(kanban): auto-block workers that exit without completing + shutdown race (#20894)#21214
teknium1 merged 1 commit into
mainfrom
hermes/hermes-26a2390d

Conversation

@teknium1
Copy link
Copy Markdown
Contributor

@teknium1 teknium1 commented May 7, 2026

Kanban workers that exit rc=0 without calling kanban_complete / kanban_block now auto-block on the first occurrence instead of respawn-looping forever.

Closes #20894. Two independent bugs from that report, one commit.

Changes

Gateway shutdown (gateway/run.py)

_notify_active_sessions_of_shutdown iterated self.adapters.items() while adapter.send() could pop the dict via the fatal-error path — raised RuntimeError: dictionary changed size during iteration during gateway shutdown. Snapshot the dict with list(...) before iterating.

Dispatcher worker lifecycle (hermes_cli/kanban_db.py)

Root cause: small local models (e.g. gemma4-e2b q4, reported) answer conversationally and the hermes chat -q subprocess exits rc=0 without calling a terminal-transition tool. The task row stays status='running' with a dead pid, detect_crashed_workers classifies it as a crash, the dispatcher respawns, and the model does the same thing again. Infinite loop.

Three-piece fix:

  1. _recent_worker_exits registry — bounded (TTL 600s, size cap 4096 with age-based pruning) module dict populated by the existing waitpid(WNOHANG) reap loop at the top of dispatch_once. Keys are child pids, values are raw wait() status ints.
  2. _classify_worker_exit(pid) — returns clean_exit / nonzero_exit / signaled / unknown using os.WIFEXITED + WIFSIGNALED + WEXITSTATUS + WTERMSIG.
  3. Branch in detect_crashed_workers — when a dead pid is found:
    • clean_exit → emit protocol_violation event, call _record_task_failure(failure_limit=1) → immediate circuit-breaker trip. Retrying is deterministically futile.
    • nonzero_exit / signaled → existing crashed event + normal counter (existing 2-strike auto-block still applies).
    • unknown (pid reaped by someone else before our tick) → fall back to existing crashed behavior. Safe default.

DispatchResult.auto_blocked now includes protocol-violation trips so telemetry/tests see them. The public detect_crashed_workers(conn) -> list[str] return shape is preserved; auto-block ids are surfaced via a side-channel attribute that dispatch_once picks up.

User-visible: a protocol-violating task now shows

status = blocked
last_failure_error = "worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation"

on the first occurrence instead of looping DEFAULT_FAILURE_LIMIT times first.

Validation

E2E verified live with three subprocess scenarios:

Spawn Classification Task status after 1 tick Events
Popen(["true"]) abandoned (rc=0) clean_exit blocked protocol_violation, gave_up
Popen(["false"]) abandoned (rc=1) nonzero_exit ready, counter=1 crashed
Popen(["true"]) + p.wait() before tick unknown ready, counter=1 crashed

Tests:

  • tests/hermes_cli/test_kanban_core_functionality.py -k "detect_crashed or protocol_violation or nonzero_exit" — 4 passed (2 new regression tests)
  • Full kanban trio (test_kanban_db.py + test_kanban_core_functionality.py + test_kanban_tools.py) — 245 passed (243 existing + 2 new)
  • Gateway restart/shutdown (test_restart_drain.py, test_shutdown_cache_cleanup.py, test_restart_resume_pending.py) — 91 passed

Closes

When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
  reaped child's raw exit status in a bounded module registry
  (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
  signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
  is found dead. clean_exit → protocol_violation event + immediate
  circuit-breaker trip (failure_limit=1). Everything else keeps the
  existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in #20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
  self.adapters with list(...) before iterating. adapter.send() can
  hit a fatal-error path that pops the adapter from the dict, which
  was raising 'RuntimeError: dictionary changed size during iteration'
  during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
  rc=0 + still-running → status=blocked on first occurrence with
  protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
  non-zero exits keep the existing 2-strike behavior.

Closes #20894.
@teknium1 teknium1 merged commit fdb9e0f into main May 7, 2026
10 of 11 checks passed
@teknium1 teknium1 deleted the hermes/hermes-26a2390d branch May 7, 2026 12:24
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

🔎 Lint report: hermes/hermes-26a2390d vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7498 on HEAD, 7487 on base (🆕 +11)

🆕 New issues (3):

Rule Count
invalid-argument-type 1
invalid-assignment 1
unresolved-attribute 1
First entries
hermes_cli/kanban_db.py:3099: [invalid-argument-type] invalid-argument-type: Argument to function `_record_task_failure` is incorrect: Expected `int`, found `Literal[1] | None`
tests/hermes_cli/test_kanban_core_functionality.py:3714: [invalid-assignment] invalid-assignment: Object of type `(p) -> Literal[False]` is not assignable to attribute `_pid_alive` of type `def _pid_alive(pid: int | None) -> bool`
hermes_cli/kanban_db.py:3110: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_last_auto_blocked` on type `def detect_crashed_workers(conn: Connection) -> list[str]`

✅ Fixed issues: none

Unchanged: 3935 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have labels May 7, 2026
bot-ted added a commit to bot-ted/hermes-agent that referenced this pull request May 8, 2026
* fix(kanban): avoid fragile failure-column renames

* chore: follow-up cleanup for Kanban migration fix

- Expand migration comment to name the primary failure mode (missing
  column OperationalError from #20842) ahead of the secondary SQLite
  schema-reparse concern; also document the stale-cols-snapshot invariant
- Add clarifying comments on from_row() legacy fallback branches noting
  they are belt-and-suspenders dead code post-migration
- Add task_events comment in existing test explaining why the table is
  required by the migrator
- Add test_legacy_migration_no_legacy_columns_at_all: Scenario A —
  explicitly asserts the exact #20842 crash no longer occurs and that
  consecutive_failures defaults to 0 on a DB that never had spawn_failures
- Add test_legacy_migration_both_columns_already_present: Scenario D —
  asserts the migration is a no-op when both columns already exist,
  preserving the existing counter value

* fix(tui): bound virtual history offset searches

* ci(docker): don't cancel overlapping builds, guard :latest

Switch top-level concurrency to cancel-in-progress=false so every push
to main gets its own SHA-tagged image published — no more discarded
builds when commits land back-to-back.

Guard the :latest tag with a second job that has its own concurrency
group with cancel-in-progress=true plus a git-ancestor check against
the revision label on the current :latest. Together these guarantee
:latest only ever moves forward in history: a slower run whose commit
isn't a descendant of the current :latest refuses to clobber it, and
a newer push mid-way through the move-latest job preempts the older
one before it can retag.

- Every main push publishes nousresearch/hermes-agent:sha-<commit>
  with an org.opencontainers.image.revision label embedded.
- move-latest job reads that label off :latest, runs merge-base
  --is-ancestor, and only retags (via buildx imagetools create,
  registry-side, no rebuild) if our commit strictly descends.
- fetch-depth bumped to 1000 so merge-base has the history it needs.
- Release tag flow unchanged (unique tag, no race).

* docs(tool-gateway): rewrite as pitch-first marketing page (#20827)

Previous version read like internal API docs \u2014 leading with env var tables,
config YAML, and 'precedence' rules before ever explaining the product.
Complete rewrite inverts the structure so readers see value first,
mechanics last.

Structure now:
- Lede: 'One subscription. Every tool built in.' + pitch paragraph
- CTA: subscribe/manage button styled as a real call-to-action
- What's included: emoji-led table with expanded descriptions per tool.
  Image gen lists all 9 models by name (FLUX 2 Klein/Pro, Z-Image Turbo,
  Nano Banana Pro, GPT Image 1.5/2, Ideogram V3, Recraft V4 Pro, Qwen)
- Why it's here: value bullets \u2014 one bill, one signup, one key, same
  quality, bring-your-own anytime
- Get started: two-command flow (hermes model \u2192 hermes status)
- Eligibility: paid-tier note with upgrade link
- Mix and match: three realistic usage patterns
- Using individual image models: ID reference table for power users
- --- separator ---
- Configuration reference (demoted): use_gateway flag, disabling,
  self-hosted gateway env vars moved below the fold where they belong
- FAQ: streamlined, removed redundant content

Fact-checked against code:
- 9 FAL models confirmed from tools/image_generation_tool.py FAL_MODELS
- Status section output verified against hermes_cli/status.py
- Portal subscription URL preserved
- Self-hosted env vars (TOOL_GATEWAY_DOMAIN etc.) kept accurate

Verified: docusaurus build SUCCESS, page renders, no new broken links.

* fix(auth): fall back to global-root auth.json for providers missing in profile

Profile processes (kanban workers, cron subprocesses, delegated subagents)
read the profile's auth.json only. If a provider was authenticated at the
global root but not inside the profile, the profile's credential_pool
comes back empty and the process fails with 'No LLM provider configured'
— even though the credentials are sitting in ~/.hermes/auth.json. #18594
propagated HERMES_HOME correctly, which is what surfaced this: workers
now land in the right profile, and the profile turns out to shadow global
with no fallback.

Semantics (read-only, per-provider shadowing):
* Profile has any entries for provider X → use profile only (global ignored).
* Profile has zero entries for provider X → fall back to global.
* Writes (write_credential_pool, _save_auth_store) still target the profile.
* Classic mode (HERMES_HOME == global root) skips the fallback entirely —
  _global_auth_file_path() returns None.

Also mirrors the fallback in get_provider_auth_state so OAuth singletons
(nous, minimax-oauth, openai-codex, spotify) inherit cleanly — the Nous
shared-token store (PR #19712) remains the authoritative path for Nous
OAuth rotation, this just makes the read side consistent with it.

Seat belt: _load_global_auth_store() refuses to read the real user's
~/.hermes/auth.json under PYTEST_CURRENT_TEST even when HERMES_HOME points
to a profile-shaped path. Guard uses $HOME (stable across fixtures) rather
than Path.home() (which fixtures often monkeypatch to a tmp root).

Reported by @SeedsForbidden on Twitter as the credential_pool shadowing
follow-up to the #18594 fix.

* feat(gateway): per-platform gateway_restart_notification flag

Adds an opt-out toggle on PlatformConfig that gates both restart
lifecycle pings: the "♻ Gateway restarted" message sent to the chat
that issued /restart, and the "♻️ Gateway online" home-channel
startup notification. Defaults to True so existing deployments are
unaffected.

The motivating split is operator vs. end-user surfaces: a back-channel
like Telegram should keep these pings, while a Slack workspace shared
with end users should not surface gateway lifecycle noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(gateway): also gate pre-restart "Gateway restarting" notification

Extend the gateway_restart_notification flag to cover
_notify_active_sessions_of_shutdown — the message that fires just
before drain ("⚠️ Gateway restarting — Your current task will be
interrupted. Send any message after restart and I'll try to resume
where you left off.") sent to active sessions and home channels.

Same operator/end-user reasoning: on a Slack workspace shared with
end users, "Gateway restarting" reads as "the bot is broken" — the
operator should be able to suppress it consistently with the other
two lifecycle pings rather than having a partial opt-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: add guillaumemeyer to AUTHOR_MAP

For cherry-picked commits in PR #20801.

* fix(cli): submit LF enter in thin PTYs (#20896)

* fix(tui): refresh virtual offsets after row resize (#20898)

* fix(tui): honor skin highlight colors (#20895)

* fix(tui): steady transcript scrollbar (#20917)

* fix(tui): steady transcript scrollbar

Keep the visible scrollbar tied to committed viewport position while virtual history can still prefetch against pending scroll targets, and preserve drag grab offset synchronously for native-feeling scrollbar drags.

* fix(tui): smooth precision wheel scroll

Replace the opt-scroll throttle with frame-sized coalescing so modifier wheel gestures stay line-precise without stepping.

* fix(tui): restore voice push-to-talk parity (#20897)

* fix(tui): restore classic CLI voice push-to-talk parity

(cherry picked from commit 93b9ae301bb89f5b5e01b4b9f8ac91ffa74fbd9d)

* fix(tui): harden voice push-to-talk stop flow

Address review feedback from PR #16189 by stopping the active recorder before background transcription, documenting single-shot voice capture, and covering the TUI gateway flags with regression tests.

* fix(tui): preserve silent voice strike tracking

Keep single-shot voice recording's no-speech counter alive across starts so the TUI can still emit the three-strikes auto-disable event, and bind the auto-restart state at module scope for type checking.

* fix(tui): clean up voice stop failure path

Address follow-up review by naming the TUI flow as single-shot push-to-talk and cancelling the recorder when forced stop cannot produce a WAV.

* fix(tui): report busy voice capture starts

Return explicit start state from the voice wrapper so the TUI gateway does not report recording while forced-stop transcription is still cleaning up.

* fix(tui): handle busy voice record responses

Apply the gateway busy status immediately in the TUI and route forced-stop voice events to the session that sent the stop request.

* fix(tui): clear voice recording on null response

Treat a null voice.record RPC result as a failed optimistic start so the REC badge cannot stick after gateway-side errors.

* fix(tui): count silent manual voice stops

Preserve single-shot voice no-speech strikes through forced stop transcription so empty push-to-talk captures still trigger the three-strikes guard.

---------

Co-authored-by: Montbra <montbra@gmail.com>

* fix(gateway): don't dead-end setup wizard when only system-scope unit is installed

The setup wizard dropped non-root users at a bare shell prompt when
trying to start a system-scope gateway service. Previously
_require_root_for_system_service called sys.exit(1), which the
wizard's `except Exception` guards cannot catch (SystemExit is a
BaseException). Users with a pre-existing /etc/systemd/system unit
(e.g. from an earlier `sudo hermes setup` run) hit this whenever
they re-ran `hermes setup` as a regular user.

- Convert _require_root_for_system_service to raise a typed
  SystemScopeRequiresRootError (RuntimeError subclass) instead of
  sys.exit(1). The direct CLI path (`hermes gateway install|start|stop|
  restart|uninstall` without sudo) still exits 1 cleanly via a new
  catch at the top of gateway_command, matching the existing
  UserSystemdUnavailableError pattern.
- Add _system_scope_wizard_would_need_root() pre-check and
  _print_system_scope_remediation() helper. Both setup wizards
  (hermes_cli/setup.py and hermes_cli/gateway.py::gateway_setup) now
  detect the dead-end before prompting and print actionable guidance:
  either `sudo systemctl start <service>` this time, or uninstall the
  system unit and install a per-user one.
- Defense-in-depth: all 5 wizard prompt sites also catch
  SystemScopeRequiresRootError and fall back to the remediation
  helper if the pre-check is bypassed (race, etc.).

Tests: 12 new tests in TestSystemScopeRequiresRootError,
TestSystemScopeWizardPreCheck, TestSystemScopeRemediationOutput, and
TestGatewayCommandCatchesSystemScopeError covering the exception
contract, pre-check matrix (root vs non-root, system-only vs
user-present vs none vs explicit system=True), remediation output
for each action, and the direct-CLI exit-1 path.

* fix(tui): preserve session when switching personality

Previously, /personality in the TUI called _reset_session_agent() which
destroyed the agent, cleared conversation history, and effectively started
a new session. This made personality switching disruptive — users lost
their entire conversation context.

Now /personality updates the agent's ephemeral_system_prompt in-place and
injects a pivot marker into the conversation history. The marker tells
the model to adopt the new persona from that point forward, which is
necessary because LLMs tend to pattern-match their prior responses and
continue the established tone without an explicit signal.

Changes:
- tui_gateway/server.py: Rewrite _apply_personality_to_session to update
  the agent in-place instead of resetting. Inject a user-role pivot
  marker so the model actually switches style mid-conversation.
- ui-tui/src/app/slash/commands/session.ts: Update help text (no longer
  mentions history reset).
- tests/test_tui_gateway_server.py: Update test to verify history is
  preserved, pivot marker is injected, and ephemeral prompt is set.

* fix(gateway): wait for systemd restart readiness

* fix(discord): narrow rate-limit catch and move sync state under gateway/

Two follow-ups on top of helix4u's slash-command sync hardening:

- Only suppress exceptions that are actually Discord 429 rate limits
  (discord.RateLimited, HTTPException with status 429, or a clearly
  rate-limit-named duck type). Arbitrary failures that happen to expose
  a retry_after attribute now re-raise to the outer handler instead of
  silently swallowing a cooldown.
- Move the sync-state JSON under $HERMES_HOME/gateway/ so the home root
  stops collecting ad-hoc runtime files.

Added a test verifying unrelated exceptions don't get misclassified as
rate limits.

* docs(kanban): fix orchestrator skill setup instructions (#20958)

* docs(kanban): fix worker skill setup instructions too (#20960)

Follow-up to #20958. The worker skill section had the same stale
'hermes skills install devops/kanban-worker' command — kanban-worker
is also bundled, so that command fails with 'Could not fetch from any
source.'

Replace with bundled-skill verification + restore pattern, matching
the orchestrator section. Uses <your-worker-profile> placeholder since
assignees vary (researcher, writer, ops, linguist, reviewer, etc.)
rather than a single fixed 'worker' profile.

* feat(profiles): --no-skills flag for empty profile creation (#20986)

Adds `hermes profile create <name> --no-skills` to create a profile with
zero bundled skills. Writes a `.no-bundled-skills` marker file in the
profile root so `hermes update`'s all-profile skill sync loop also skips
the profile — without the marker, every update would re-seed skills and
the user would have to delete them again.

Use case (from @hiut1u): orchestrator profiles and narrow-task profiles
don't need 100+ bundled skills polluting their system prompt.

- create_profile() gains a `no_skills` param, mutually exclusive with
  `--clone` / `--clone-all` (cloning explicitly copies skills).
- seed_profile_skills() no-ops on opted-out profiles and returns
  `{skipped_opt_out: True}` so callers can report cleanly.
- Web API (POST /api/profiles) accepts `no_skills: bool`.
- Delete `.no-bundled-skills` to opt back in — next `hermes update`
  re-seeds normally.

6 new tests in TestNoSkillsOptOut cover marker write, mutual exclusion
with clone, seed_profile_skills opt-out, fresh profile unaffected, and
delete-marker-re-enables-seeding.

* fix: route Telegram image documents through photo handling

* chore: AUTHOR_MAP entry for mrcoferland

* test(docker): align Dockerfile contract tests with simplified TUI flow

The Dockerfile dropped the manual `@hermes/ink` materialisation gymnastics
in favour of letting npm workspaces resolve the bundled package
naturally. Two contract tests still asserted the older flow:

`test_dockerfile_installs_tui_dependencies` required:
    'ui-tui/packages/hermes-ink/package-lock.json' in dockerfile_text

…but the lockfile is no longer COPIED individually \u2014 the entire
`ui-tui/packages/hermes-ink/` tree is COPIED instead (the workspace
reference from `ui-tui/package.json` is `file:` so npm needs the
real source, not just a manifest stub).

`test_dockerfile_materializes_local_tui_ink_package` required a 7-clause
conjunction matching specific `rm -rf` / `npm install --omit=dev`
`--prefix node_modules/@hermes/ink` / `rm -rf .../react` invocations
that were stripped out when the workspace resolution was simplified.

Update the assertions to pin the *contract* the image actually has to
carry rather than the *exact shell incantations* the old flow used:

* TUI deps install: ui-tui/package.json + ui-tui/package-lock.json +
  ui-tui/packages/hermes-ink/ tree are all COPIED, and an npm
  install/ci step runs in ui-tui.
* Bundled hermes-ink: the workspace package source is COPIED (so
  `await import('@hermes/ink')` resolves at runtime).

This keeps the spirit of #15012 / #16690 (zombie reaping + bundled
workspace materialisation must continue to work) without locking the
Dockerfile into one specific implementation flavour.

Validation:

    $ pytest tests/tools/test_dockerfile_pid1_reaping.py -q
    6 passed in 1.43s

No production code change. Fixes the two failures observed on `main`
(run 25250051126):

`tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_installs_tui_dependencies`
`tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_materializes_local_tui_ink_package`

* test(update): patch isatty on real streams to fix xdist-flaky --yes tests

Two CI tests for the new `--yes` update flag (#18261) flaked under
`pytest-xdist` on Linux/Python 3.11 even though they passed every
local run on macOS/Python 3.14.4:

  FAILED tests/hermes_cli/test_update_yes_flag.py
    ::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty
      `AssertionError: assert <MagicMock 'input'>.called is False`
  FAILED tests/hermes_cli/test_update_yes_flag.py
    ::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting
      `AssertionError: assert <MagicMock '_restore_stashed_changes'>.called is False`

Captured stdout for the first failure shows `cmd_update` taking the
"Non-interactive session \u2014 skipping config migration prompt." branch
\u2014 i.e. the `sys.stdin.isatty() and sys.stdout.isatty()` check at
`hermes_cli/main.py:7118` evaluated to `False` despite the test doing:

    with patch("hermes_cli.main.sys") as mock_sys:
        mock_sys.stdin.isatty.return_value = True
        mock_sys.stdout.isatty.return_value = True

The whole-module mock is fragile under xdist worker reuse: a sibling
test that imports `hermes_cli.main` first can leave another `sys`
reference resolved inside the function (re-import in a helper, etc.),
and the wholesale module replacement never gets consulted.

Switch to `patch.object(_sys.stdin, "isatty", return_value=True)` (and
the same for `stdout`). That patches the *attribute on the real stream
object* \u2014 every call site, no matter how it reached `sys.stdin`,
hits the patched method. Same fix applied to the stash-restore test
(it took the "non-TTY \u2192 skip restore prompt" branch for the same reason).

Validation:

    $ pytest tests/hermes_cli/test_update_yes_flag.py -q
    3 passed in 5.47s

No production code change. Fixes the two failures observed on `main`
(run 25250051126):

`tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty`
`tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting`

Refs: #18261 (added the `--yes` flag + these tests).

* fix(web): force light color-scheme on docs iframe

The Documentation tab embeds the public Hermes Agent docs site via an
<iframe>. On any system where the browser's prefers-color-scheme
resolves to dark — the default on macOS with system dark mode, and
common on Linux/Windows too — the docs body text rendered nearly
invisible against its own background.

Cause: Docusaurus intentionally leaves <html> and <body> transparent
and relies on the browser's Canvas color to fill the viewport. Inside
our iframe, the iframe element had bg-background (the dashboard's dark
canvas) AND inherited the dashboard's dark color-scheme, so the
browser set the iframe's Canvas to a dark value. Docusaurus's
transparent body exposed that dark Canvas, and the docs body text
(tuned for a light Canvas) became near-illegible. Affects every
built-in dashboard theme.

Fix: replace bg-background on the iframe with [color-scheme:light]
(spec-blessed cross-origin override of the inherited color-scheme;
forces the iframe's Canvas to light) and bg-white (belt-and-suspenders
fallback during the brief paint window before content loads). The
docs site's own theme toggle keeps working — Docusaurus stores its
choice in localStorage and applies opaque dark backgrounds to its
layout elements that cover the white Canvas we forced.

* fix(security): close TOCTOU window when saving MCP OAuth credentials

_write_json (the persistence helper used by HermesTokenStorage for both
tokens and client_info) created the temp file via Path.write_text and
only chmod'd it to 0o600 afterward. Between create and chmod the file
existed on disk at the process umask (commonly 0o644 = world-readable),
briefly exposing MCP OAuth access/refresh tokens to other local users.

Use os.open with O_WRONLY|O_CREAT|O_EXCL and an explicit S_IRUSR|S_IWUSR
mode so the file is created atomically at 0o600, plus tighten the parent
dir to 0o700 so siblings can't traverse to the creds file. The temp name
also gains a per-process random suffix to avoid collisions between
concurrent writers and stale leftovers from a crashed prior write.

Mirrors the fix shipped for agent/google_oauth.py in #19673.

Adds a regression test asserting the resulting file mode is 0o600 and
the parent directory is 0o700 (skipped on Windows where POSIX mode bits
aren't enforced).

* chore(release): add Gutslabs to AUTHOR_MAP for PR #21148 salvage

* test(update): teach restart-mocks about the post-update survivor sweep

Issue #17648 added a post-update SIGTERM-survivor sweep to `cmd_update`:
~3s after issuing graceful/SIGTERM restarts, the code re-queries
`find_gateway_pids` and SIGKILLs anything still alive. That's the
right fix for stuck-drain gateways in production, but it broke three
unit tests that assumed `find_gateway_pids` would keep returning the
same PIDs forever:

  FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways
    AssertionError: Expected 'kill' to not have been called. Called 1 times.
    Calls: [call(12345, <Signals.SIGKILL: 9>)].

  FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm
    AssertionError: Expected 'kill' to have been called once. Called 2 times.
    Calls: [call(12345, SIGTERM), call(12345, SIGKILL)].

  FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid
    assert 2 == 1
      manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)]

In each test `os.kill` is mocked, so the simulated PID never actually
exits \u2014 the sweep finds it again and escalates. The production code
is correct; the tests just need to model OS behaviour properly.

Two-test fix (profile-manual restart cases): use
`side_effect=[[12345], []]` so the first `find_gateway_pids` call
returns the live PID and the second (the sweep) returns nothing, as if
the OS had reaped the process.

Service-PID-exclusion fix: track which PIDs got killed in a closure
set, and exclude them on subsequent `fake_find` calls. `os.kill`
gets a `side_effect` that records the kill instead of swallowing it
silently. Now the sweep doesn't re-find the manual PID, no SIGKILL
escalation, `manual_kills == 1`.

Validation:

    $ pytest tests/hermes_cli/test_update_gateway_restart.py -q
    43 passed in 4.13s

No production code change. Fixes the three failures observed on `main`
(run 25250051126):

  test_update_restarts_profile_manual_gateways
  test_update_profile_manual_gateway_falls_back_to_sigterm
  test_update_kills_manual_pid_but_not_service_pid

Refs: #17648 (post-update survivor sweep that the tests didn't model).

* fix(image-routing): expose attached image paths in native multimodal text part

In native image mode (vision-capable models like gpt-4o, claude-sonnet-4),
build_native_content_parts() previously emitted only the user's caption
plus image_url parts. The local file path of each attached image never
appeared in the conversation text, so the model could see the pixels but
had no string handle for tools that take image_url: str (custom MCP
tools, vision_analyze on a re-look, attach-to-tracker workflows).

The text-mode path already injects an equivalent hint via
Runner._enrich_message_with_vision ("...vision_analyze using image_url:
<path>..."). This brings native mode to parity by appending one
"[Image attached at: <path>]" line per successfully attached image to
the user-text part of the multimodal turn. Skipped (unreadable) paths
are NOT advertised, so the model is never told a non-existent file is
attached.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(optional-skills): port Anthropic financial-services skills as optional finance bundle (#21180)

Adds 7 optional skills under optional-skills/finance/ adapted from
anthropics/financial-services (Apache-2.0):

  excel-author        — openpyxl conventions: blue/black/green cells,
                        formulas over hardcodes, named ranges, balance
                        checks, sensitivity tables. Ships recalc.py.
  pptx-author         — python-pptx for model-backed decks (pitch,
                        IC memo, earnings note) that bind every number
                        to a source workbook cell.
  dcf-model           — institutional DCF (49KB skill): projections,
                        WACC, terminal value, Bear/Base/Bull scenarios,
                        5x5 sensitivity tables. Ships validate_dcf.py.
  comps-analysis      — comparable company analysis: operating metrics,
                        multiples, statistical benchmarking.
  lbo-model           — leveraged buyout: S&U, debt schedule, cash
                        sweep, exit multiple, IRR/MOIC sensitivity.
  3-statement-model   — fully-integrated IS/BS/CF with balance-check
                        plugs. Ships references/ for formatting,
                        formulas, SEC filings.
  merger-model        — accretion/dilution analysis for M&A.

All seven are optional (not active by default). Users install via
'hermes skills install official/finance/<skill>'.

Hermesification:
- Stripped every Office JS / Office Add-in / mcp__office__*
  branch — skills assume headless openpyxl only.
- Replaced Cowork MCP data-source instructions with 'MCP first (via
  native-mcp), fall back to web_search/web_extract against SEC EDGAR
  and user-provided data'.
- Swapped Claude tool references (Bash, Read, Write, Edit, mcp__*)
  for Hermes-native equivalents and Python library calls.
- Canonical Hermes frontmatter (name/description/version/author/
  license/metadata.hermes.{tags,related_skills}).
- Descriptions tightened to 187-238 chars, trigger-first.
- Attribution preserved: author field credits 'Anthropic (adapted by
  Nous Research)', license: Apache-2.0, each SKILL.md links back to
  the upstream source directory.

Verification:
- All 7 discovered by OptionalSkillSource with source_id='official'
- Bundle fetch includes support files (scripts, references, troubleshooting)
- related_skills cross-refs all resolve within the bundle
- No Claude product / Cowork / Office JS / /mnt/skills leakage
  remains in body text (bounded mentions only in attribution blocks)

Source: https://github.com/anthropics/financial-services (Apache-2.0)

* test(skills): cover additional rescan paths in skill_commands cache (#14536)

The rescan-on-platform-change fix landed in #18739 ships one regression
test that exercises the HERMES_PLATFORM env-var path. Three other code
paths in get_skill_commands / _resolve_skill_commands_platform have no
direct coverage; this commit adds a regression test for each.

- Gateway session context (HERMES_SESSION_PLATFORM via ContextVar): the
  resolver consults get_session_env after HERMES_PLATFORM, and the
  gateway sets that variable through set_session_vars (a ContextVar),
  not os.environ. The test uses set_session_vars / clear_session_vars
  to drive the actual gateway signal, and the disabled-skill stub reads
  the same value via get_session_env. A regression that swapped
  get_session_env for plain os.getenv would still pass an env-var-based
  test but break concurrent gateway sessions, which is the bug the
  ContextVar plumbing exists to prevent.
- Returning to no-platform-scope (CLI / cron / RL rollouts after a
  gateway session): the cached telegram view must be dropped and the
  unfiltered scan repopulated when HERMES_PLATFORM is unset again.
- Same-platform cache hit: consecutive calls under the same platform
  scope must NOT rescan. The rescan trigger is change in scope, not
  "always re-resolve" — a gateway serving many consecutive telegram
  requests should pay the scan cost once, not per request.

The third test wraps scan_skill_commands with a spy after the cache is
primed, so the assertion is on call_count == 0 across three subsequent
get_skill_commands() calls.

All 39 tests in tests/agent/test_skill_commands.py pass under
scripts/run_tests.sh.

* fix(gateway): translate inbound document host paths to container paths for Docker backend

When terminal.backend is docker, inbound documents uploaded via messaging
platforms (Telegram, Slack, Discord, Feishu, Email, etc.) are cached at a host
path under ~/.hermes/cache/documents, but the container sandbox only sees them
at the auto-mounted /root/.hermes/cache/documents path.

This PR adds to_agent_visible_cache_path() in tools/credential_files.py (the
natural sibling to get_cache_directory_mounts()) and calls it at the
document-context-injection site in gateway/run.py so the agent always receives
a path it can open directly, matching the mount layout already established
by get_cache_directory_mounts() (#4846).

Scope: only Docker backend for now; other backends use different mount
semantics and are left unchanged until verified.

Fixes #18787

* feat(gateway): opt-in cleanup of temporary progress bubbles (#21186)

When display.cleanup_progress (or display.platforms.<plat>.cleanup_progress)
is true, the gateway deletes tool-progress bubbles, long-running '⏳ Still
working...' notices, and status-callback messages after the final response
is delivered successfully. Currently effective on adapters that implement
delete_message (Telegram); silently no-ops elsewhere. Off by default.
Failed runs skip cleanup so bubbles stay as breadcrumbs.

Minimal plumbing: base.py's existing post_delivery_callback slot now chains
new registrations onto any existing callback (with per-callback exception
isolation) rather than clobbering. Stale-generation registrations are
rejected so they can't step on a fresher run's callbacks. This lets the
cleanup callback coexist with the background-review release hook already
registered on the same slot.

Co-authored-by: mrcharlesiv <Mrcharlesiv@gmail.com>

* fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at

The kanban_heartbeat tool called heartbeat_worker but never
heartbeat_claim, so a worker that loops the tool while a single tool
call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got
reclaimed by release_stale_claims. The function name and
heartbeat_claim's own docstring imply otherwise:

  "Workers that know they'll exceed 15 minutes should call this
   every few minutes to keep ownership."

But there was no caller in the worker tool path. Workers couldn't
invoke heartbeat_claim themselves either — it isn't exposed as a tool.

Fix: _handle_heartbeat now calls heartbeat_claim first, reading
HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins
this in _default_spawn). Falls back to _claimer_id() for locally-
driven workers that didn't go through dispatcher spawn.

Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires
rewinds claim_expires into the past, calls the tool, and asserts the
new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to
fail against the unfixed code (claim_expires stays at the rewound
value).

Closes the root cause underlying the symptom in #21141 (15-min
respawns of long-running workers). #21141 separately addresses
post-reclaim cleanup; this fixes the upstream "shouldn't have been
reclaimed in the first place" half.

* chore(release): map stephen0110 noreply email

* fix(kanban): stop reclaimed workers before retry

* fix(kanban): reap completed worker children in dispatch_once

The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway
= true`) is the parent of every spawned kanban worker. `_default_spawn`
calls `subprocess.Popen(..., start_new_session=True)` and returns the
pid — `start_new_session` detaches the controlling tty but does not
reparent to init, so the gateway keeps each worker as a child until it
`wait()`s for them.

Nothing in the dispatch loop ever calls `waitpid`. Result: every
completed worker becomes a `<defunct>` zombie that lingers until the
gateway exits. We hit ~430 zombies on a single hermes-agent container
after ~40 days of steady kanban traffic, approaching process-table
exhaustion on the host.

Fix: add a non-blocking reap loop at the top of `dispatch_once`, so
every dispatcher tick (default 60s) drains zombies that accumulated
since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError
means no children to reap.

Why here, not a SIGCHLD handler:
- signal.signal requires the main thread; gateway threading model makes
  that placement non-trivial.
- Bounded staleness: at default interval=60s the maximum live zombie
  count is one tick's worth of worker completions.
- No interaction with detect_crashed_workers: that function only inspects
  rows where status='running', and rows reach 'done' (and stop being
  inspected) before their workers exit.

* chore(release): map sonic-netizen noreply email

* fix: auto-block repeated kanban retries

* chore(release): map mwnickerson noreply email

* feat(gateway): auto-resume interrupted sessions after restart

* fix(gateway): preserve resume marker on interrupted restart

* refactor(gateway): simplify auto-resume + extend to crash recovery

Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape,
wider coverage.

Changes:
- Drop the synthetic '[System note: ...]' in the internal MessageEvent.
  The existing _is_resume_pending branch in _handle_message_with_agent
  (run.py ~L13738) already injects a reason-aware recovery system note
  on the next turn.  With kyan's text in place the model saw two stacked
  system notes.  Now the event text is empty and the existing injection
  path owns the wording.
- Drop SessionStore.list_resume_pending() as a new public method.  The
  filter is 8 lines inline in _schedule_resume_pending_sessions() —
  one caller, no other pluggability need.
- Add 'restart_interrupted' to the auto-resume reason set.  That's the
  reason SessionStore.suspend_recently_active() stamps on sessions
  recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker).
  Previously those sessions had to wait for a real user message to
  auto-resume; now they continue automatically at startup like
  drain-timeout interruptions do.
- Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so
  future reasons (e.g. 'manual_resume_request') can be opted in with
  one line.

Test coverage added:
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended > resume_pending)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)

E2E verified end-to-end with a real on-disk SessionStore: 2 eligible
sessions scheduled, 2 ineligible skipped, empty-text internal events
delivered to the adapter.

Co-authored-by: Kevin Yan <kevyan1998@gmail.com>

* fix(auth): sync shared Nous refresh tokens

* refactor(auth): dedupe file-lock helper; document Nous lock order

Extract the shared flock/msvcrt boilerplate from _auth_store_lock and
_nous_shared_store_lock into a single _file_lock(lock_path, holder,
timeout, message) helper. Each caller keeps its own threading.local
holder so reentrancy state stays per-lock.

Also document the lock-ordering invariant on both wrappers:
_auth_store_lock is OUTER, _nous_shared_store_lock is INNER for all
runtime refresh paths. The one exception is _try_import_shared_nous_state,
which holds the shared lock alone across the full HTTP refresh+mint
cycle to prevent concurrent sibling imports from racing on the single-
use shared refresh token; that helper must not be called with the auth
lock already held.

* fix(gateway): avoid duplicated responses history

* chore: add AUTHOR_MAP entries for thelumiereguy and counterposition

* fix(gateway): use monotonic deadlines in QR onboarding flows

* fix(oauth,gateway): monotonic deadlines for polling/timeout loops

Widen PR #20314's fix to the other timeout-polling sites in the codebase
that share the same wall-clock-jump bug class. All of these measure elapsed
timeout duration, not civil time, so they belong on time.monotonic().

- hermes_cli/auth.py: auth-store file-lock timeout, Spotify OAuth callback
  wait, Nous portal device-auth token poll.
- hermes_cli/copilot_auth.py: Copilot OAuth device-flow token poll.
- hermes_cli/gateway.py: gateway systemd restart wait.
- hermes_cli/web_server.py: dashboard Codex device-auth user_code wait,
  dashboard Nous device-auth token poll. (sess["expires_at"] stays on
  time.time() — it's a persisted absolute timestamp, not a local
  deadline-polling variable.)
- agent/copilot_acp_client.py: Copilot ACP JSON-RPC request timeout.

* fix(weixin): replace all aiohttp ClientTimeout with asyncio.wait_for()

aiohttp ClientTimeout uses BaseTimerContext which calls
loop.call_later() internally. When invoked via
asyncio.run_coroutine_threadsafe() from cron jobs, this
triggers "Timeout context manager should be used inside a task"
errors, causing message delivery failures.

Replace all direct ClientTimeout usage with asyncio.wait_for():
- _upload_ciphertext: CDN upload (120s timeout)
- _download_bytes: CDN download (configurable timeout)
- _download_remote_media: remote media fetch (30s timeout)

Also set total=None on _send_session to disable aiohttp built-in
timeout, and change trust_env=True to False to bypass proxy for
WeChat CDN connections.

* test(weixin): update timeout assertion for asyncio.wait_for migration

* chore: AUTHOR_MAP entry for chenlinfeng@ruije / @noOne-list

* feat(security): enable secret redaction by default (#17691, #20785) (#21193)

Flip the default for HERMES_REDACT_SECRETS from off to on so the redactor
already wired into send_message_tool, logs, and tool output actually runs
on a fresh install.

- agent/redact.py: env-var default "" → "true"
- hermes_cli/config.py: DEFAULT_CONFIG security.redact_secrets True;
  two config-template comments rewritten
- gateway/run.py + cli.py: startup log / banner warning when the user
  has explicitly opted out, so the downgrade is visible in agent.log
  and at CLI banner time
- docs/reference/environment-variables.md: description reconciled
- tests: flipped the default-pin, restructured the force=True
  regression test to explicit-false instead of unset

Users who need raw credential values (redactor development) can still
opt out via security.redact_secrets: false in config.yaml or
HERMES_REDACT_SECRETS=false in .env.

Closes #17691.
Addresses #20785 (short-term output-pipeline recommendation).

* feat: add Discord message deletion action

* chore: AUTHOR_MAP entry for @likejudy

* fix(security): close TOCTOU window in hermes_cli/auth.py credential writers (#21194)

`_save_auth_store`, `_save_qwen_cli_tokens`, and `_write_shared_nous_state`
all created the temp file via `Path.open('w')` / `Path.write_text` and only
tightened permissions to 0o600 afterward. Between create and chmod the file
existed at the process umask (commonly 0o644 = world-readable on multi-user
hosts), briefly exposing OAuth access/refresh tokens for Nous, Codex,
Copilot, Claude, Qwen, Gemini, and every other native OAuth provider that
flows through auth.json.

Switch all three to `os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen`
+ `fsync` so the file is atomic at 0o600 on creation. Tighten each parent
directory (`~/.hermes/`, Qwen auth dir, Nous shared auth dir) to 0o700 so
siblings can't traverse to the creds. `_save_auth_store` also gains a
per-process random temp suffix to match `agent/google_oauth.py` (#19673)
and `tools/mcp_oauth.py` (#21148).

Adds `tests/hermes_cli/test_auth_toctou_file_modes.py` asserting final
file mode 0o600 and parent dir mode 0o700 across all three writers, plus
an explicit `os.open(flags, mode)` check on the main auth.json writer
that would fail if anyone reintroduces the `Path.open('w')` pattern.
POSIX-only (mode bits skipped on Windows).

* fix(delegate): correct ACP docs — Claude Code CLI has no --acp flag

The delegate_task tool schema descriptions referenced 'claude --acp --stdio'
as an example, but Claude Code CLI does not support --acp or --stdio flags.

The ACP subprocess transport (agent/copilot_acp_client.py) is specifically
built for GitHub Copilot CLI ('copilot --acp --stdio').

Changes:
- Per-task acp_command example: 'claude' → 'copilot'
- Top-level acp_command description: remove 'Claude Code' reference,
  clarify requirement for ACP-compatible CLI (currently Copilot only)
- acp_args description: remove misleading claude-opus-4-6 example

Fixes #19055

* fix: exclude hidden and archive dirs from _find_skill rglob

* fix(gateway): preserve thread routing from cached live session sources

* fix(gateway): cap cached session sources with LRU eviction

Follow-up on top of Zyproth's session-source cache: swap the unbounded
dict for an OrderedDict with a 512-entry LRU cap so long-running
gateways can't accumulate stale entries for dead sessions forever.

- self._session_sources is now an OrderedDict
- _cache_session_source() move_to_end + popitem(last=False) above cap
- _get_cached_session_source() move_to_end on hit (LRU read bump)
- restart_test_helpers.py wires OrderedDict + _session_sources_max

* fix(mcp): give 'mcp add --command' a distinct argparse dest

The --command flag of `hermes mcp add` shared its argparse dest with the
top-level subparser (`dest="command"` in `hermes_cli/_parser.py`). When
the flag was omitted, argparse still wrote `args.command = None`,
clobbering the top-level value of `"mcp"`. The dispatcher then saw
`args.command is None` and fell through to interactive chat, so
`hermes mcp add ...` silently launched chat instead of registering the
server. `cmd_mcp_add` was never reached.

Use `dest="mcp_command"` on the flag and read it from `cmd_mcp_add`.
The user-facing CLI flag `--command` is unchanged; only the in-memory
namespace attribute moves. Also updates the `_make_args` helper in
`tests/hermes_cli/test_mcp_config.py` to populate the new dest, and
adds `tests/hermes_cli/test_mcp_add_command_dest.py` with a parser-
level regression test.

Closes #19785.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: add discodirector email to AUTHOR_MAP

* fix(bedrock): preserve reasoningContent across converse normalization

* feat(gateway): support [[as_document]] directive for skill media routing

Skills that produce large/lossless images (e.g. info-graph, where a
rendered JPG is 1-2 MB) currently lose quality in Telegram delivery
because `_IMAGE_EXTS` membership routes the file through
`send_multiple_images` → `sendMediaGroup`, which Telegram's server
re-encodes to JPEG @ 1280px max edge. The original bytes only survive
when the file goes through `send_document`, which the dispatch tables
in three places (`_process_message_background`, `_deliver_media_from_response`,
and the `send_message` tool's telegram path) only reach for files
whose extension is NOT in `_IMAGE_EXTS`.

This commit adds an `[[as_document]]` directive that mirrors the
existing `[[audio_as_voice]]` shape: a skill emits the directive once
in its response, and every image-extension MEDIA: file in that response
is delivered via `send_document` instead of `send_multiple_images` /
`sendPhoto`. The directive is detected at the dispatch sites (which see
the raw response) and the directive string is stripped from the
user-visible cleaned text in `extract_media` so it never leaks.

Granularity is intentionally all-or-nothing per response, matching
[[audio_as_voice]]'s scope. Skills that need fine control can split into
two responses.

Verified the targeted use case: info-graph emits

    信息图已生成(...)
    [[as_document]]
    MEDIA:/tmp/info-graph-x/infographic.jpg

→ Telegram receives `infographic.jpg` via sendDocument, original 1MB
JPEG bytes preserved, no recompression. Forwarding and download
filenames stay clean (`infographic.jpg`).

Tests: +3 cases in TestExtractMedia covering directive strip, isolation
from voice flag, and coexistence with [[audio_as_voice]]. All
113 pre-existing media/extract/send tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: update send_message_tool mocks for force_document kwarg

* chore: AUTHOR_MAP entry for @leon7609

* fix(model_switch): live model discovery for custom_providers in /model picker

custom_providers entries (section 4 of list_authenticated_providers) only
read the static models: dict from config.yaml, ignoring the live /v1/models
endpoint.  This means gateways like Bifrost that expose hundreds of models
only show the handful explicitly listed in config.

Add live discovery via fetch_api_models() for custom_providers entries
that have api_key + base_url, matching the existing behavior for user
providers: entries (section 3).  When the endpoint is reachable and
returns models, the live list replaces the static subset.

Fixes: /model picker showing only 9 models from a Bifrost gateway that
actually exposes 581.

* fix(memory): support OpenViking local resource uploads

* test(memory): harden OpenViking local upload coverage

* fix(memory): harden OpenViking local path uploads

* chore(release): add AUTHOR_MAP entries for ggnnggez and ehz0ah

Contributors to OpenViking local resource upload fix (#19569).

* docs(readme): drop misleading RL install-extras claim, defer to CONTRIBUTING

README.md:163 said atroposlib and tinker were pulled in by .[all,dev], but
.[all] does not include .[rl] — those dependencies live in pyproject.toml's
[rl] extra (lines 95-101). With the original wording, a contributor running
uv pip install -e ".[all,dev]" would not have atroposlib or tinker
installed.

Rather than swap one extra for another (which paths users to either of two
parallel install conventions — pip [rl] extra vs tinker-atropos submodule —
without saying which the project considers canonical), this PR drops the
specific install command from the README and links to CONTRIBUTING.md,
which already documents the actual development setup.

* fix(kanban): auto-block workers that exit without completing (#20894) (#21214)

When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
  reaped child's raw exit status in a bounded module registry
  (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
  signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
  is found dead. clean_exit → protocol_violation event + immediate
  circuit-breaker trip (failure_limit=1). Everything else keeps the
  existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in #20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
  self.adapters with list(...) before iterating. adapter.send() can
  hit a fatal-error path that pops the adapter from the dict, which
  was raising 'RuntimeError: dictionary changed size during iteration'
  during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
  rc=0 + still-running → status=blocked on first occurrence with
  protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
  non-zero exits keep the existing 2-strike behavior.

Closes #20894.

* fix(dashboard): stabilize embedded chat resume and scrollback

* fix(dashboard): let embedded chat use a single scroll system

* fix(dashboard): route browser wheel into inner TUI scrolling

* chore: AUTHOR_MAP entry for @nouseman666

* fix(cli): honor positive tool preview length

* chore: AUTHOR_MAP entry for @GinWU05

* fix(credential_pool): resolve key mix-up when custom providers share base_url

When multiple custom_providers share the same base_url but have different API keys,

get_custom_provider_pool_key() always returned the first match, causing wrong-key

unauthorized errors. Add provider_name parameter to prefer exact name matches

over base_url-only matching, with fallback for backward compatibility.

Fixes #19083

* feat(cli): show context compression count in status bar

Display the number of context compressions in the CLI status bar when
compressions > 0, helping users understand conversation compression
pressure during long sessions.

- Wide layout (>=76 cols): shows 'cmp N' between context percent and duration
- Medium layout (52-75 cols): shows 'cmp N' between percent and duration
- Narrow layout (<52 cols): omitted to save space
- Color-coded: dim for 1-4, warn for 5-9, bad for 10+
- Hidden when zero to keep the bar clean for new sessions

Closes #18564

* refactor: replace 'cmp' text with 🗜️ emoji in status bar

Address review feedback to use the clamp emoji (��️) instead of
the plain text 'cmp' prefix for the compression count indicator.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tui): surface compression count in Ink status bar

Parity with the classic CLI status bar (PR #18579). The Python backend
already exposes 'compressions' on SessionUsageResponse; this wires it
through the Ink Usage type and renders 'cmp N' next to the duration
segment of StatusRule.

- types.ts Usage: add optional compressions field
- appChrome.tsx StatusRule: render 'cmp N' when > 0, color-tiered by
  pressure (muted <5, warn 5-9, error 10+)
- Plain text 'cmp' token (no emoji) matches PR #18579's original author
  rationale and avoids Ink layout drift from VS16 emoji width

* chore(release): map altriatree@gmail.com -> @TruaShamu

* fix(curator): make manual runs synchronous

* docs(curator): update CLI docs for synchronous-by-default manual run

Follow-up to the previous commit which flipped 'hermes curator run'
default from async to sync. Updates the curator.md feature page and
cli-commands.md reference to show --background as the opt-in async
flag and note that the default now blocks until the LLM pass finishes.

* fix(install): remove uv exclude-newer cutoff

* docs: clarify API server tool execution locality

* fix(kanban): treat dashboard event-stream cancellation as normal shutdown

Stopping `hermes dashboard` with Ctrl-C while the Kanban dashboard is
open prints an ASGI traceback ending in
`plugins/kanban/dashboard/plugin_api.py::stream_events` at the
`asyncio.sleep(_EVENT_POLL_SECONDS)` line. This is a normal shutdown
path: Uvicorn cancels the open websocket task while it is sleeping in
the 300 ms poll loop. `asyncio.CancelledError` is a `BaseException` in
Python 3.8+ — the bare `except Exception:` handler below the existing
`WebSocketDisconnect:` clause does NOT catch it, so the cancellation
surfaces as an application traceback and routine dashboard exit looks
like a runtime failure.

Add an explicit `except asyncio.CancelledError: return` clause beside
the existing `WebSocketDisconnect` handler. Disconnection (client
closed the tab) and shutdown cancellation (dashboard process exiting)
are conceptually different paths but both warrant a quiet return; the
two clauses are kept separate to keep that intent explicit.

`asyncio` is already imported and used in this scope, so no new
import is needed. The bare `except Exception:` handler is preserved
verbatim, so genuine runtime failures still log a warning and close
the socket cleanly.

Closes #20790.

* chore(release): map SandroHub013 email

* test(kanban): regression for CancelledError swallow in stream_events

Drives stream_events directly and cancels the task while it is sleeping
in the poll loop, asserting the coroutine returns cleanly instead of
letting CancelledError bubble. Regression coverage for the Uvicorn
application traceback on dashboard Ctrl-C fixed by the preceding commit.

* fix(model_tools): log plugin hook exceptions instead of silently swallowing them

* feat(gateway): add `hermes gateway list` to show all profiles' gateway status

Add a new `hermes gateway list` subcommand that shows the running
status of gateways across all profiles in a single view:

    Gateways:
      ✓ default (current)        — PID 155469
      ✓ wx1                      — PID 166893
      ✗ dev                      — not running

Also includes `_print_other_profiles_gateway_status()` which appends
an "Other profiles" section to `hermes gateway status` output when
other profile gateways are running.

Both use existing `list_profiles()` and `find_profile_gateway_processes()`
— no new dependencies.

Closes #19127
Related: #19113, #4402, #4587

* fix(mcp-oauth): persist OAuth server metadata across process restarts (#21226)

The MCP SDK discovers OAuth server metadata (token_endpoint, etc.) on
demand and keeps it in memory only. Without disk persistence, a restart
with valid cached refresh tokens forces the SDK to fall back to the
guessed '{server_url}/token' path — which returns 404 on most real
providers (Notion, Atlassian, GitHub remote MCP, etc.) and triggers a
full browser re-authorization even though the refresh token is fine.

Add a .meta.json file next to the existing tokens/client_info files:

  HERMES_HOME/mcp-tokens/<server>.json        -- tokens (existing)
  HERMES_HOME/mcp-tokens/<server>.client.json -- client info (existing)
  HERMES_HOME/mcp-tokens/<server>.meta.json   -- oauth metadata (new)

Changes:
- HermesTokenStorage.save_oauth_metadata / load_oauth_metadata / _meta_path
  — disk layer for the discovered OAuthMetadata.
- HermesTokenStorage.remove() now also clears .meta.json so
  'hermes mcp remove <name>' and the manager's remove() path clean up fully.
- HermesMCPOAuthProvider._initialize cold-restores from disk before the
  existing pre-flight discovery runs. If disk has metadata we skip the
  discovery HTTP round-trips entirely.
- HermesMCPOAuthProvider._prefetch_oauth_metadata now persists ASM as
  soon as it's discovered, so even the first pre-flight run seeds disk.
- HermesMCPOAuthProvider._persist_oauth_metadata_if_changed() is called
  at the end of async_auth_flow so metadata discovered via the SDK's
  lazy 401-branch (not pre-flight) is also saved for next time.

Tests cover the storage roundtrip (save/load/missing/corrupt/remove) and
the manager provider path (cold-load restore, skip-when-in-memory,
persist-on-discover, noop-when-unchanged, end-to-end async_auth_flow).

Co-authored-by: nocturnum91 <50326054+nocturnum91@users.noreply.github.com>

* feat: add SSE transport support for MCP client

Add support for MCP servers using the SSE transport protocol
(SseServerTransport) alongside the existing Streamable HTTP and stdio
transports. Many MCP servers use SSE (GET /sse + POST /messages/)
which was previously unsupported -- the client silently fell back to
Streamable HTTP, causing 10s connection timeouts.

Changes:
- Import mcp.client.sse.sse_client with graceful fallback
- Check config.get('transport') == 'sse' in _run_http() to select
  the SSE transport path with proper timeout handling
- Read transport type from config in get_mcp_status() instead of
  hardcoding 'http' for URL-based servers
- Update docstring, example config, and feature list

* fix(browser): enforce cloud-metadata SSRF floor in hybrid routing (#16234) (#21228)

Cloud metadata endpoints (169.254.169.254 etc.) are now always blocked
by browser_navigate regardless of hybrid routing, allow_private_urls,
or backend.

Bug: commit 42c076d3 (#16136) added hybrid routing that flips
auto_local_this_nav=True for private URLs and short-circuits
_is_safe_url(). IMDS endpoints are technically private (169.254/16
link-local), so the sidecar happily routed them to a local Chromium,
and the agent could read IAM credentials via browser_snapshot. On
EC2/GCP/Azure this is a full SSRF-to-credential-theft.

Fix: new is_always_blocked_url() in url_safety.py — a narrow floor
that checks _BLOCKED_HOSTNAMES, _ALWAYS_BLOCKED_IPS,
_ALWAYS_BLOCKED_NETWORKS only. Applied as an independent gate in
browser_navigate's pre-nav and post-redirect checks, BEFORE
auto_local_this_nav gets a chance to short-circuit. Ordinary private
URLs (localhost, 192.168.x, 10.x, .local, CGNAT) still route to the
local sidecar as the #16136 feature intends.

Secondary fix (reporter's finding): _url_is_private() now explicitly
checks 172.16.0.0/12. ipaddress.is_private only covers that range on
Python ≥3.11 (bpo-40791), so on 3.10 runtimes those URLs were routed
to cloud instead of the local sidecar. No security impact — just a
correctness fix for the hybrid-routing feature.

Closes #16234.

* fix: WhatsApp bridge process leak and disable config asymmetry

- Add PID file mechanism to track bridge processes and kill stale ones on startup
- Improve _kill_port_process() with lsof fallback when fuser is not available
- Support explicit WhatsApp disable via config.yaml (whatsapp.enabled: false)
- Respect WHATSAPP_ENABLED=false env var to disable WhatsApp

Fixes #19124

* docs(contributing): align tool discovery and test runner with AGENTS.md

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(kanban): make dashboard board pin authoritative over server current file (#21230)

When the user created a new board via the dashboard with "switch" checked,
the server-side `current` file was flipped to the new board. Clicking the
original board's tab then showed no cards even though the count badge read
correctly — the REST fetch dropped `?board=` when the selection was
"default" and the backend fell through to `current` (= the new board),
returning a different board's data than the tab the user clicked.

Fix:
- `withBoard()` always appends `?board=<slug>` when a board is selected,
  including "default". The dashboard's tab selection becomes authoritative
  instead of silently deferring to the server's `current` file.
- `writeSelectedBoard()` persists every selection (including "default")
  to localStorage. Previously "default" was stripped, which meant the
  next page load had nothing to pin to and fell through to `current`.
- Same change applied to the WebSocket query builder in `openWs()`.

Contract verified live:
  current_board = "proj2"
  GET /board                  → proj2's tasks   (bug shape: falls through to current)
  GET /board?board=default    → default's tasks (fix: explicit pin wins)
  GET /board?board=proj2      → proj2's tasks

Closes #20879.

* fix: avoid unsupported anthropic context beta by default

* Follow latest child session on dashboard resume

* fix(openviking): add Bearer auth header and omit empty/legacy tenant headers (#21232)

Authenticated remote OpenViking servers derive tenancy from the Bearer
key, but the client was always sending X-OpenViking-Account and
X-OpenViking-User — defaulted to the literal string "default" — which
overrode the key-derived tenant and broke auth.

- _headers(): skip X-OpenViking-Account/-User when blank or "default"
  (treats the legacy default value as unset, so existing installs don't
  need to touch their .env)
- _headers(): send Authorization: Bearer <key> alongside X-API-Key for
  standard HTTP auth compatibility
- health(): include auth headers so /health works against servers that
  require authentication

Tests cover bearer emission, legacy "default" suppression, empty
suppression, real tenant passthrough, and authenticated health checks.

Fixes the same user report as #20695 (from @ZaynJarvis); that PR could
not be merged because its branch was stale against main and would have
reverted recent OpenViking work (#15696, local resource uploads, summary
URI normalization, fs-stat pre-check).

* feat: add transform_llm_output plugin hook

Enables plugins to transform LLM output text after generation,
useful for vocabulary/personality transformation without burning
inference tokens.

Follows same pattern as transform_tool_result and transform_terminal_output:
- First non-empty string result wins
- Fail-open: exceptions logged as warnings, agent continues
- Signature: (response_text, session_id, model, platform)

* test+docs: cover transform_llm_output hook + release author map

- tests/test_transform_llm_output_hook.py: dispatch semantics
  (kwargs contract, first-non-empty-string-wins, empty-string
  pass-through, raising-plugin fail-open, no-plugins = no-op)
- tests/hermes_cli/test_plugins.py: assert the new hook name is in
  VALID_HOOKS alongside the other transform_* hooks
- website/docs/user-guide/features/hooks.md: summary-table entry +
  full section mirroring transform_tool_result / transform_terminal_output
- scripts/release.py: map barnacleboy.jezzahehn@agentmail.to -> JezzaHehn
  (existing entry only covers the gmail address)

* feat(curator): add `hermes curator list-archived` command (#21236)

Lists the skills sitting in ~/.hermes/skills/.archive/ so users have
something to pass to `hermes curator restore`. `curator status` already
shows counts; this fills the name-discovery gap.

Archive layout is flat (`archive_skill` writes to `.archive/<skill>/`),
so the directory name IS the skill name — no frontmatter parsing
needed. Timestamped collision directories (`<skill>-<ts>`) are listed
literally; user can still pass them to `restore`.

Reshape of @EvilDrag0n's #20651, simplified: drop the frontmatter
rglob + preamble/trailer output + duplicate subcommand registration.

Co-authored-by: EvilDrag0n <lxl694522264@gmail.com>

* fix: require memory schema fields by action

* fix(tui): refresh scroll height at cached bottom

* fix(gateway): preserve max turns after env reload

* fix(discord): scope DISCORD_ALLOWED_ROLES to originating guild (CVSS 8.1)

The initial DISCORD_ALLOWED_ROLES implementation (#11608, merged from #9873)
scans every mutual guild when resolving a user's roles. This allows a
cross-guild DM bypass:

1. Bot is in both public server A and private server B.
2. User holds the allowed role in server A only.
3. User DMs the bot. The role check finds the role in A and authorizes the
   DM, granting access as if the user were trusted in server B.

Fix:
- DMs (no guild context) disable role-based auth by default. Opt-in via
  DISCORD_DM_ROLE_AUTH_GUILD=<guild_id> restricts role lookup to one
  explicitly-trusted guild.
- Guild messages check roles only in the originating guild
  (message.guild), never in other mutual guilds.
- Reject cached author.roles when the Member came from a different guild
  than the current message.

Backwards compatibility:
- DISCORD_ALLOWED_USERS behavior is unchanged (still works in both DMs
  and guild messages).
- Deployments that rely on roles in guild channels continue to work;
  role checks are now strictly scoped to that guild.
- Deployments that intentionally want role-based DM auth can opt into a
  single trusted guild via DISCORD_DM_ROLE_AUTH_GUILD.

Tests: 9 new regression guards in
tests/gateway/test_discord_roles_dm_scope.py covering the bypass path,
the opt-in path, cross-guild guild-message bypass, and backwards-compat
user-ID paths. 47/47 discord-auth tests pass.

Refs: #11608 (initial implementation), #7871 (feature request),
  #9873 (PR author credit @0xyg3n)

* fix(discord): extend role-scope fix to slash surface + fixture update

Sibling-site fix: _evaluate_slash_authorization was the fourth
_is_allowed_user caller and didn't pass guild/is_dm through, so slash
interactions would take the DM branch regardless of whether they came
from a guild channel. Now reads interaction.guild + in_dm and forwards.

Also updates test_discord_slash_auth fixture (_make_interaction) so
the SimpleNamespace guild mock has a get_member(uid)->None method —
required by the new guild-scoped fallback path in _is_allowed_user.
Tests exercising positive role paths still work via user.roles.

Three new regression tests in test_discord_roles_dm_scope:
- Slash DM + role in mutual public guild → rejected
- Slash in guild B + role only in guild A → rejected
- Slash in guild B + role in guild B → allowed (positive control)

368 Discord tests pass. test_discord_free_channel_skips_auto_thread
also fails on clean main (pre-existing, unrelated to this fix).

* fix(discord): route DM role-auth opt-in through config.yaml (not env var)

Per repo policy, ~/.hermes/.env is for secrets only. Guild IDs are
behavioral configuration, not secrets. Replacing the
DISCORD_DM_ROLE_AUTH_GUILD env var from the original fix with
discord.dm_role_auth_guild in config.yaml.

- New module-level _read_dm_role_auth_guild() helper reads
  hermes_cli.config.read_raw_config()['discord']['dm_role_auth_guild'].
  Fails closed on any parse error (safe default = DM role-auth off).
- DEFAULT_CONFIG['discord'] gains dm_role_auth_guild: '' with a comment
  documenting the opt-in.
- …
deestax added a commit to T3-Venture-Labs-Limited/hermes-agent that referenced this pull request May 8, 2026
…gin v1.0.0 docs (#26)

* fix(opencode-go): keep users on opencode-go instead of hijacking to native providers (#20802)

OpenCode Go and OpenCode Zen are flat-namespace model resellers — their
/v1/models returns bare IDs (deepseek-v4-flash, minimax-m2.7), and the
inference API rejects vendor-prefixed names with HTTP 401 'Model not
supported'. Two bugs fixed:

1. `switch_model` in hermes_cli/model_switch.py was silently switching the
   user off opencode-go to native deepseek when they typed
   `/model deepseek-v4-flash`. Step d found the model in opencode-go's live
   catalog, but step e (detect_provider_for_model) still ran and matched
   the bare name against deepseek's static catalog. Fix: track whether
   the live catalog resolved it; skip step e when it did.

2. `normalize_model_for_provider` in hermes_cli/model_normalize.py only
   stripped the exact `opencode-zen/` prefix, leaving arbitrary vendor
   prefixes like `minimax/minimax-m2.7` (commonly copied from aggregator
   slugs into fallback_model configs) intact — causing HTTP 401s when
   the fallback chain activated. Fix: opencode-go/opencode-zen strip ANY
   leading vendor prefix because their APIs are flat-namespace.

Tests: 11 new cases in tests/hermes_cli/test_opencode_go_flat_namespace.py
covering both normalization (prefix stripping, regression guards for
opencode-zen Claude hyphenation and openrouter vendor-prepending) and
switch_model (bare-name resolution on opencode-go's live catalog must
not trigger cross-provider hijack).

Reported by @Ufonik via Discord; Kimi K2.6 always worked because moonshotai
has no overlapping entry in a native provider's static catalog. Deepseek
and minimax failed because their v4/v2.7 names existed in the native
deepseek/minimax catalogs.

* feat(dashboard): add 'default-large' built-in theme with 18px base size (#20820)

Same Hermes Teal palette as the default theme, but with baseSize 18px,
lineHeight 1.65, and spacious density so the whole dashboard scales up.
Gives users a one-click bigger-text preset and a copyable reference for
authoring custom YAML themes with their own typography settings.

* refactor(web): per-capability backend selection for search/extract split

Introduce the foundation for independently selecting web search and
extract backends — enabling future combinations like SearXNG for
search + Firecrawl for extract.

Architecture:
- tools/web_providers/base.py: WebSearchProvider and WebExtractProvider
  ABCs with normalized result contracts (mirrors CloudBrowserProvider)
- tools/web_tools.py: _get_search_backend() and _get_extract_backend()
  read per-capability config keys, fall through to shared web.backend
- hermes_cli/config.py: web.search_backend and web.extract_backend in
  DEFAULT_CONFIG (empty = inherit from web.backend)

Behavioral change:
- web_search_tool() now dispatches via _get_search_backend()
- web_extract_tool() now dispatches via _get_extract_backend()
- When per-capability keys are empty (default), behavior is identical
  to before — _get_search_backend() falls through to _get_backend()

This is purely structural — no new backends are added. SearXNG and
other search-only/extract-only providers can now be added as simple
drop-in modules in follow-up PRs.

12 new tests, 49 existing tests pass with zero regressions.

Ref: #19198

* feat(web): add SearXNG as a native search-only backend

Adds SearXNG as a free, self-hosted web search provider.  SearXNG is a
privacy-respecting metasearch engine that requires no API key — just a
running instance and SEARXNG_URL pointing at it.

## What this adds

- `tools/web_providers/searxng.py` — `SearXNGSearchProvider` implementing
  `WebSearchProvider` (search only; no extract capability)
- `_is_backend_available("searxng")` — gates on SEARXNG_URL
- `_get_backend()` — accepts "searxng" as a configured value; adds it to
  auto-detect candidates (lower priority than paid services)
- `web_search_tool` — dispatches to SearXNG when it is the active backend
- `check_web_api_key()` — includes SearXNG in availability check
- `OPTIONAL_ENV_VARS["SEARXNG_URL"]` — registered with tools=["web_search"]
- `tools_config.py` — SearXNG appears in the `hermes tools` provider picker
- `nous_subscription.py` — `direct_searxng` detection, web_active / web_available
- `setup.py` — SEARXNG_URL listed in the missing-credential hint
- 23 tests covering: is_configured, happy-path search, score sorting, limit,
  HTTP/request errors, _is_backend_available, _get_backend, check_web_api_key

## Config

```yaml
# Use SearXNG for search, any paid provider for extract
web:
  search_backend: "searxng"
  extract_backend: "firecrawl"

# Or: SearXNG as the sole backend (web_extract will use the next available)
web:
  backend: "searxng"
```

SearXNG is search-only — it does not implement WebExtractProvider.  Users
who only configure SEARXNG_URL get web_search available; web_extract falls
back to the next available extract provider (or is unavailable if none).

Closes #19198 (Phase 2 Task 4 — SearXNG provider)
Ref: #11562 (original SearXNG PR)

* docs+skill: add searxng-search optional skill and documentation

Closes the remaining gaps from PR #11562 that weren't covered by the
core SearXNG integration landed in #20823.

- optional-skills/research/searxng-search/ — installable skill with
  SKILL.md (curl-based usage, category support, Python example) and
  searxng.sh helper script for health checks and instance queries
- website/docs/user-guide/configuration.md — SearXNG added to the
  Web Search Backends section (5 backends, backend table, per-capability
  split config example, correct search-only note)
- website/docs/reference/environment-variables.md — SEARXNG_URL row
- website/docs/reference/optional-skills-catalog.md — searxng-search entry

The core SearXNG code, OPTIONAL_ENV_VARS, hermes tools picker, and tests
were already on main via #20823.  This commit is purely additive docs +
the optional skill scaffold.

Credits from #11562 salvage:
  @w4rum — original _searxng_search structure
  @nathansdev — tools_config.py integration
  @moyomartin — category support and result formatting
  @0xMihai — config/env var approach
  @nicobailon — skill and documentation structure
  @searxng-fan — error handling patterns
  @local-first — self-hosted-first philosophy and docs

* docs: add Web Search + Extract feature page with SearXNG setup guide

* fix(feishu): keep topic replies in threads

Route Feishu topic progress, status, approval, stream, and fallback messages through threaded replies by preserving the originating message id as the reply target. Add regressions for tool progress topic metadata and Feishu metadata-driven reply routing.

* chore: follow-up cleanup for Feishu topic thread fix

- Remove dead metadata.get('reply_to') fallback in _send_raw_message;
  nothing in the codebase ever sets 'reply_to' inside a metadata dict —
  the key only appears as a top-level send_voice() keyword argument
- Simplify _status_thread_metadata construction in run.py to use a
  single dict literal instead of create-then-mutate pattern; the
  or-{} guard was dead since source.thread_id implies _progress_thread_id
  is also set for Feishu
- Add yuqian@zmetasoft.com to AUTHOR_MAP for contributor attribution

* fix(kanban): avoid fragile failure-column renames

* chore: follow-up cleanup for Kanban migration fix

- Expand migration comment to name the primary failure mode (missing
  column OperationalError from #20842) ahead of the secondary SQLite
  schema-reparse concern; also document the stale-cols-snapshot invariant
- Add clarifying comments on from_row() legacy fallback branches noting
  they are belt-and-suspenders dead code post-migration
- Add task_events comment in existing test explaining why the table is
  required by the migrator
- Add test_legacy_migration_no_legacy_columns_at_all: Scenario A —
  explicitly asserts the exact #20842 crash no longer occurs and that
  consecutive_failures defaults to 0 on a DB that never had spawn_failures
- Add test_legacy_migration_both_columns_already_present: Scenario D —
  asserts the migration is a no-op when both columns already exist,
  preserving the existing counter value

* fix(tui): bound virtual history offset searches

* ci(docker): don't cancel overlapping builds, guard :latest

Switch top-level concurrency to cancel-in-progress=false so every push
to main gets its own SHA-tagged image published — no more discarded
builds when commits land back-to-back.

Guard the :latest tag with a second job that has its own concurrency
group with cancel-in-progress=true plus a git-ancestor check against
the revision label on the current :latest. Together these guarantee
:latest only ever moves forward in history: a slower run whose commit
isn't a descendant of the current :latest refuses to clobber it, and
a newer push mid-way through the move-latest job preempts the older
one before it can retag.

- Every main push publishes nousresearch/hermes-agent:sha-<commit>
  with an org.opencontainers.image.revision label embedded.
- move-latest job reads that label off :latest, runs merge-base
  --is-ancestor, and only retags (via buildx imagetools create,
  registry-side, no rebuild) if our commit strictly descends.
- fetch-depth bumped to 1000 so merge-base has the history it needs.
- Release tag flow unchanged (unique tag, no race).

* docs(tool-gateway): rewrite as pitch-first marketing page (#20827)

Previous version read like internal API docs \u2014 leading with env var tables,
config YAML, and 'precedence' rules before ever explaining the product.
Complete rewrite inverts the structure so readers see value first,
mechanics last.

Structure now:
- Lede: 'One subscription. Every tool built in.' + pitch paragraph
- CTA: subscribe/manage button styled as a real call-to-action
- What's included: emoji-led table with expanded descriptions per tool.
  Image gen lists all 9 models by name (FLUX 2 Klein/Pro, Z-Image Turbo,
  Nano Banana Pro, GPT Image 1.5/2, Ideogram V3, Recraft V4 Pro, Qwen)
- Why it's here: value bullets \u2014 one bill, one signup, one key, same
  quality, bring-your-own anytime
- Get started: two-command flow (hermes model \u2192 hermes status)
- Eligibility: paid-tier note with upgrade link
- Mix and match: three realistic usage patterns
- Using individual image models: ID reference table for power users
- --- separator ---
- Configuration reference (demoted): use_gateway flag, disabling,
  self-hosted gateway env vars moved below the fold where they belong
- FAQ: streamlined, removed redundant content

Fact-checked against code:
- 9 FAL models confirmed from tools/image_generation_tool.py FAL_MODELS
- Status section output verified against hermes_cli/status.py
- Portal subscription URL preserved
- Self-hosted env vars (TOOL_GATEWAY_DOMAIN etc.) kept accurate

Verified: docusaurus build SUCCESS, page renders, no new broken links.

* fix(auth): fall back to global-root auth.json for providers missing in profile

Profile processes (kanban workers, cron subprocesses, delegated subagents)
read the profile's auth.json only. If a provider was authenticated at the
global root but not inside the profile, the profile's credential_pool
comes back empty and the process fails with 'No LLM provider configured'
— even though the credentials are sitting in ~/.hermes/auth.json. #18594
propagated HERMES_HOME correctly, which is what surfaced this: workers
now land in the right profile, and the profile turns out to shadow global
with no fallback.

Semantics (read-only, per-provider shadowing):
* Profile has any entries for provider X → use profile only (global ignored).
* Profile has zero entries for provider X → fall back to global.
* Writes (write_credential_pool, _save_auth_store) still target the profile.
* Classic mode (HERMES_HOME == global root) skips the fallback entirely —
  _global_auth_file_path() returns None.

Also mirrors the fallback in get_provider_auth_state so OAuth singletons
(nous, minimax-oauth, openai-codex, spotify) inherit cleanly — the Nous
shared-token store (PR #19712) remains the authoritative path for Nous
OAuth rotation, this just makes the read side consistent with it.

Seat belt: _load_global_auth_store() refuses to read the real user's
~/.hermes/auth.json under PYTEST_CURRENT_TEST even when HERMES_HOME points
to a profile-shaped path. Guard uses $HOME (stable across fixtures) rather
than Path.home() (which fixtures often monkeypatch to a tmp root).

Reported by @SeedsForbidden on Twitter as the credential_pool shadowing
follow-up to the #18594 fix.

* feat(gateway): per-platform gateway_restart_notification flag

Adds an opt-out toggle on PlatformConfig that gates both restart
lifecycle pings: the "♻ Gateway restarted" message sent to the chat
that issued /restart, and the "♻️ Gateway online" home-channel
startup notification. Defaults to True so existing deployments are
unaffected.

The motivating split is operator vs. end-user surfaces: a back-channel
like Telegram should keep these pings, while a Slack workspace shared
with end users should not surface gateway lifecycle noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(gateway): also gate pre-restart "Gateway restarting" notification

Extend the gateway_restart_notification flag to cover
_notify_active_sessions_of_shutdown — the message that fires just
before drain ("⚠️ Gateway restarting — Your current task will be
interrupted. Send any message after restart and I'll try to resume
where you left off.") sent to active sessions and home channels.

Same operator/end-user reasoning: on a Slack workspace shared with
end users, "Gateway restarting" reads as "the bot is broken" — the
operator should be able to suppress it consistently with the other
two lifecycle pings rather than having a partial opt-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: add guillaumemeyer to AUTHOR_MAP

For cherry-picked commits in PR #20801.

* fix(cli): submit LF enter in thin PTYs (#20896)

* fix(tui): refresh virtual offsets after row resize (#20898)

* fix(tui): honor skin highlight colors (#20895)

* fix(tui): steady transcript scrollbar (#20917)

* fix(tui): steady transcript scrollbar

Keep the visible scrollbar tied to committed viewport position while virtual history can still prefetch against pending scroll targets, and preserve drag grab offset synchronously for native-feeling scrollbar drags.

* fix(tui): smooth precision wheel scroll

Replace the opt-scroll throttle with frame-sized coalescing so modifier wheel gestures stay line-precise without stepping.

* fix(tui): restore voice push-to-talk parity (#20897)

* fix(tui): restore classic CLI voice push-to-talk parity

(cherry picked from commit 93b9ae301bb89f5b5e01b4b9f8ac91ffa74fbd9d)

* fix(tui): harden voice push-to-talk stop flow

Address review feedback from PR #16189 by stopping the active recorder before background transcription, documenting single-shot voice capture, and covering the TUI gateway flags with regression tests.

* fix(tui): preserve silent voice strike tracking

Keep single-shot voice recording's no-speech counter alive across starts so the TUI can still emit the three-strikes auto-disable event, and bind the auto-restart state at module scope for type checking.

* fix(tui): clean up voice stop failure path

Address follow-up review by naming the TUI flow as single-shot push-to-talk and cancelling the recorder when forced stop cannot produce a WAV.

* fix(tui): report busy voice capture starts

Return explicit start state from the voice wrapper so the TUI gateway does not report recording while forced-stop transcription is still cleaning up.

* fix(tui): handle busy voice record responses

Apply the gateway busy status immediately in the TUI and route forced-stop voice events to the session that sent the stop request.

* fix(tui): clear voice recording on null response

Treat a null voice.record RPC result as a failed optimistic start so the REC badge cannot stick after gateway-side errors.

* fix(tui): count silent manual voice stops

Preserve single-shot voice no-speech strikes through forced stop transcription so empty push-to-talk captures still trigger the three-strikes guard.

---------

Co-authored-by: Montbra <montbra@gmail.com>

* fix(gateway): don't dead-end setup wizard when only system-scope unit is installed

The setup wizard dropped non-root users at a bare shell prompt when
trying to start a system-scope gateway service. Previously
_require_root_for_system_service called sys.exit(1), which the
wizard's `except Exception` guards cannot catch (SystemExit is a
BaseException). Users with a pre-existing /etc/systemd/system unit
(e.g. from an earlier `sudo hermes setup` run) hit this whenever
they re-ran `hermes setup` as a regular user.

- Convert _require_root_for_system_service to raise a typed
  SystemScopeRequiresRootError (RuntimeError subclass) instead of
  sys.exit(1). The direct CLI path (`hermes gateway install|start|stop|
  restart|uninstall` without sudo) still exits 1 cleanly via a new
  catch at the top of gateway_command, matching the existing
  UserSystemdUnavailableError pattern.
- Add _system_scope_wizard_would_need_root() pre-check and
  _print_system_scope_remediation() helper. Both setup wizards
  (hermes_cli/setup.py and hermes_cli/gateway.py::gateway_setup) now
  detect the dead-end before prompting and print actionable guidance:
  either `sudo systemctl start <service>` this time, or uninstall the
  system unit and install a per-user one.
- Defense-in-depth: all 5 wizard prompt sites also catch
  SystemScopeRequiresRootError and fall back to the remediation
  helper if the pre-check is bypassed (race, etc.).

Tests: 12 new tests in TestSystemScopeRequiresRootError,
TestSystemScopeWizardPreCheck, TestSystemScopeRemediationOutput, and
TestGatewayCommandCatchesSystemScopeError covering the exception
contract, pre-check matrix (root vs non-root, system-only vs
user-present vs none vs explicit system=True), remediation output
for each action, and the direct-CLI exit-1 path.

* fix(tui): preserve session when switching personality

Previously, /personality in the TUI called _reset_session_agent() which
destroyed the agent, cleared conversation history, and effectively started
a new session. This made personality switching disruptive — users lost
their entire conversation context.

Now /personality updates the agent's ephemeral_system_prompt in-place and
injects a pivot marker into the conversation history. The marker tells
the model to adopt the new persona from that point forward, which is
necessary because LLMs tend to pattern-match their prior responses and
continue the established tone without an explicit signal.

Changes:
- tui_gateway/server.py: Rewrite _apply_personality_to_session to update
  the agent in-place instead of resetting. Inject a user-role pivot
  marker so the model actually switches style mid-conversation.
- ui-tui/src/app/slash/commands/session.ts: Update help text (no longer
  mentions history reset).
- tests/test_tui_gateway_server.py: Update test to verify history is
  preserved, pivot marker is injected, and ephemeral prompt is set.

* fix(gateway): wait for systemd restart readiness

* fix(discord): narrow rate-limit catch and move sync state under gateway/

Two follow-ups on top of helix4u's slash-command sync hardening:

- Only suppress exceptions that are actually Discord 429 rate limits
  (discord.RateLimited, HTTPException with status 429, or a clearly
  rate-limit-named duck type). Arbitrary failures that happen to expose
  a retry_after attribute now re-raise to the outer handler instead of
  silently swallowing a cooldown.
- Move the sync-state JSON under $HERMES_HOME/gateway/ so the home root
  stops collecting ad-hoc runtime files.

Added a test verifying unrelated exceptions don't get misclassified as
rate limits.

* docs(kanban): fix orchestrator skill setup instructions (#20958)

* docs(kanban): fix worker skill setup instructions too (#20960)

Follow-up to #20958. The worker skill section had the same stale
'hermes skills install devops/kanban-worker' command — kanban-worker
is also bundled, so that command fails with 'Could not fetch from any
source.'

Replace with bundled-skill verification + restore pattern, matching
the orchestrator section. Uses <your-worker-profile> placeholder since
assignees vary (researcher, writer, ops, linguist, reviewer, etc.)
rather than a single fixed 'worker' profile.

* feat(profiles): --no-skills flag for empty profile creation (#20986)

Adds `hermes profile create <name> --no-skills` to create a profile with
zero bundled skills. Writes a `.no-bundled-skills` marker file in the
profile root so `hermes update`'s all-profile skill sync loop also skips
the profile — without the marker, every update would re-seed skills and
the user would have to delete them again.

Use case (from @hiut1u): orchestrator profiles and narrow-task profiles
don't need 100+ bundled skills polluting their system prompt.

- create_profile() gains a `no_skills` param, mutually exclusive with
  `--clone` / `--clone-all` (cloning explicitly copies skills).
- seed_profile_skills() no-ops on opted-out profiles and returns
  `{skipped_opt_out: True}` so callers can report cleanly.
- Web API (POST /api/profiles) accepts `no_skills: bool`.
- Delete `.no-bundled-skills` to opt back in — next `hermes update`
  re-seeds normally.

6 new tests in TestNoSkillsOptOut cover marker write, mutual exclusion
with clone, seed_profile_skills opt-out, fresh profile unaffected, and
delete-marker-re-enables-seeding.

* fix: route Telegram image documents through photo handling

* chore: AUTHOR_MAP entry for mrcoferland

* test(docker): align Dockerfile contract tests with simplified TUI flow

The Dockerfile dropped the manual `@hermes/ink` materialisation gymnastics
in favour of letting npm workspaces resolve the bundled package
naturally. Two contract tests still asserted the older flow:

`test_dockerfile_installs_tui_dependencies` required:
    'ui-tui/packages/hermes-ink/package-lock.json' in dockerfile_text

…but the lockfile is no longer COPIED individually \u2014 the entire
`ui-tui/packages/hermes-ink/` tree is COPIED instead (the workspace
reference from `ui-tui/package.json` is `file:` so npm needs the
real source, not just a manifest stub).

`test_dockerfile_materializes_local_tui_ink_package` required a 7-clause
conjunction matching specific `rm -rf` / `npm install --omit=dev`
`--prefix node_modules/@hermes/ink` / `rm -rf .../react` invocations
that were stripped out when the workspace resolution was simplified.

Update the assertions to pin the *contract* the image actually has to
carry rather than the *exact shell incantations* the old flow used:

* TUI deps install: ui-tui/package.json + ui-tui/package-lock.json +
  ui-tui/packages/hermes-ink/ tree are all COPIED, and an npm
  install/ci step runs in ui-tui.
* Bundled hermes-ink: the workspace package source is COPIED (so
  `await import('@hermes/ink')` resolves at runtime).

This keeps the spirit of #15012 / #16690 (zombie reaping + bundled
workspace materialisation must continue to work) without locking the
Dockerfile into one specific implementation flavour.

Validation:

    $ pytest tests/tools/test_dockerfile_pid1_reaping.py -q
    6 passed in 1.43s

No production code change. Fixes the two failures observed on `main`
(run 25250051126):

`tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_installs_tui_dependencies`
`tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_materializes_local_tui_ink_package`

* test(update): patch isatty on real streams to fix xdist-flaky --yes tests

Two CI tests for the new `--yes` update flag (#18261) flaked under
`pytest-xdist` on Linux/Python 3.11 even though they passed every
local run on macOS/Python 3.14.4:

  FAILED tests/hermes_cli/test_update_yes_flag.py
    ::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty
      `AssertionError: assert <MagicMock 'input'>.called is False`
  FAILED tests/hermes_cli/test_update_yes_flag.py
    ::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting
      `AssertionError: assert <MagicMock '_restore_stashed_changes'>.called is False`

Captured stdout for the first failure shows `cmd_update` taking the
"Non-interactive session \u2014 skipping config migration prompt." branch
\u2014 i.e. the `sys.stdin.isatty() and sys.stdout.isatty()` check at
`hermes_cli/main.py:7118` evaluated to `False` despite the test doing:

    with patch("hermes_cli.main.sys") as mock_sys:
        mock_sys.stdin.isatty.return_value = True
        mock_sys.stdout.isatty.return_value = True

The whole-module mock is fragile under xdist worker reuse: a sibling
test that imports `hermes_cli.main` first can leave another `sys`
reference resolved inside the function (re-import in a helper, etc.),
and the wholesale module replacement never gets consulted.

Switch to `patch.object(_sys.stdin, "isatty", return_value=True)` (and
the same for `stdout`). That patches the *attribute on the real stream
object* \u2014 every call site, no matter how it reached `sys.stdin`,
hits the patched method. Same fix applied to the stash-restore test
(it took the "non-TTY \u2192 skip restore prompt" branch for the same reason).

Validation:

    $ pytest tests/hermes_cli/test_update_yes_flag.py -q
    3 passed in 5.47s

No production code change. Fixes the two failures observed on `main`
(run 25250051126):

`tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty`
`tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting`

Refs: #18261 (added the `--yes` flag + these tests).

* fix(web): force light color-scheme on docs iframe

The Documentation tab embeds the public Hermes Agent docs site via an
<iframe>. On any system where the browser's prefers-color-scheme
resolves to dark — the default on macOS with system dark mode, and
common on Linux/Windows too — the docs body text rendered nearly
invisible against its own background.

Cause: Docusaurus intentionally leaves <html> and <body> transparent
and relies on the browser's Canvas color to fill the viewport. Inside
our iframe, the iframe element had bg-background (the dashboard's dark
canvas) AND inherited the dashboard's dark color-scheme, so the
browser set the iframe's Canvas to a dark value. Docusaurus's
transparent body exposed that dark Canvas, and the docs body text
(tuned for a light Canvas) became near-illegible. Affects every
built-in dashboard theme.

Fix: replace bg-background on the iframe with [color-scheme:light]
(spec-blessed cross-origin override of the inherited color-scheme;
forces the iframe's Canvas to light) and bg-white (belt-and-suspenders
fallback during the brief paint window before content loads). The
docs site's own theme toggle keeps working — Docusaurus stores its
choice in localStorage and applies opaque dark backgrounds to its
layout elements that cover the white Canvas we forced.

* fix(security): close TOCTOU window when saving MCP OAuth credentials

_write_json (the persistence helper used by HermesTokenStorage for both
tokens and client_info) created the temp file via Path.write_text and
only chmod'd it to 0o600 afterward. Between create and chmod the file
existed on disk at the process umask (commonly 0o644 = world-readable),
briefly exposing MCP OAuth access/refresh tokens to other local users.

Use os.open with O_WRONLY|O_CREAT|O_EXCL and an explicit S_IRUSR|S_IWUSR
mode so the file is created atomically at 0o600, plus tighten the parent
dir to 0o700 so siblings can't traverse to the creds file. The temp name
also gains a per-process random suffix to avoid collisions between
concurrent writers and stale leftovers from a crashed prior write.

Mirrors the fix shipped for agent/google_oauth.py in #19673.

Adds a regression test asserting the resulting file mode is 0o600 and
the parent directory is 0o700 (skipped on Windows where POSIX mode bits
aren't enforced).

* chore(release): add Gutslabs to AUTHOR_MAP for PR #21148 salvage

* test(update): teach restart-mocks about the post-update survivor sweep

Issue #17648 added a post-update SIGTERM-survivor sweep to `cmd_update`:
~3s after issuing graceful/SIGTERM restarts, the code re-queries
`find_gateway_pids` and SIGKILLs anything still alive. That's the
right fix for stuck-drain gateways in production, but it broke three
unit tests that assumed `find_gateway_pids` would keep returning the
same PIDs forever:

  FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways
    AssertionError: Expected 'kill' to not have been called. Called 1 times.
    Calls: [call(12345, <Signals.SIGKILL: 9>)].

  FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm
    AssertionError: Expected 'kill' to have been called once. Called 2 times.
    Calls: [call(12345, SIGTERM), call(12345, SIGKILL)].

  FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid
    assert 2 == 1
      manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)]

In each test `os.kill` is mocked, so the simulated PID never actually
exits \u2014 the sweep finds it again and escalates. The production code
is correct; the tests just need to model OS behaviour properly.

Two-test fix (profile-manual restart cases): use
`side_effect=[[12345], []]` so the first `find_gateway_pids` call
returns the live PID and the second (the sweep) returns nothing, as if
the OS had reaped the process.

Service-PID-exclusion fix: track which PIDs got killed in a closure
set, and exclude them on subsequent `fake_find` calls. `os.kill`
gets a `side_effect` that records the kill instead of swallowing it
silently. Now the sweep doesn't re-find the manual PID, no SIGKILL
escalation, `manual_kills == 1`.

Validation:

    $ pytest tests/hermes_cli/test_update_gateway_restart.py -q
    43 passed in 4.13s

No production code change. Fixes the three failures observed on `main`
(run 25250051126):

  test_update_restarts_profile_manual_gateways
  test_update_profile_manual_gateway_falls_back_to_sigterm
  test_update_kills_manual_pid_but_not_service_pid

Refs: #17648 (post-update survivor sweep that the tests didn't model).

* fix(image-routing): expose attached image paths in native multimodal text part

In native image mode (vision-capable models like gpt-4o, claude-sonnet-4),
build_native_content_parts() previously emitted only the user's caption
plus image_url parts. The local file path of each attached image never
appeared in the conversation text, so the model could see the pixels but
had no string handle for tools that take image_url: str (custom MCP
tools, vision_analyze on a re-look, attach-to-tracker workflows).

The text-mode path already injects an equivalent hint via
Runner._enrich_message_with_vision ("...vision_analyze using image_url:
<path>..."). This brings native mode to parity by appending one
"[Image attached at: <path>]" line per successfully attached image to
the user-text part of the multimodal turn. Skipped (unreadable) paths
are NOT advertised, so the model is never told a non-existent file is
attached.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(optional-skills): port Anthropic financial-services skills as optional finance bundle (#21180)

Adds 7 optional skills under optional-skills/finance/ adapted from
anthropics/financial-services (Apache-2.0):

  excel-author        — openpyxl conventions: blue/black/green cells,
                        formulas over hardcodes, named ranges, balance
                        checks, sensitivity tables. Ships recalc.py.
  pptx-author         — python-pptx for model-backed decks (pitch,
                        IC memo, earnings note) that bind every number
                        to a source workbook cell.
  dcf-model           — institutional DCF (49KB skill): projections,
                        WACC, terminal value, Bear/Base/Bull scenarios,
                        5x5 sensitivity tables. Ships validate_dcf.py.
  comps-analysis      — comparable company analysis: operating metrics,
                        multiples, statistical benchmarking.
  lbo-model           — leveraged buyout: S&U, debt schedule, cash
                        sweep, exit multiple, IRR/MOIC sensitivity.
  3-statement-model   — fully-integrated IS/BS/CF with balance-check
                        plugs. Ships references/ for formatting,
                        formulas, SEC filings.
  merger-model        — accretion/dilution analysis for M&A.

All seven are optional (not active by default). Users install via
'hermes skills install official/finance/<skill>'.

Hermesification:
- Stripped every Office JS / Office Add-in / mcp__office__*
  branch — skills assume headless openpyxl only.
- Replaced Cowork MCP data-source instructions with 'MCP first (via
  native-mcp), fall back to web_search/web_extract against SEC EDGAR
  and user-provided data'.
- Swapped Claude tool references (Bash, Read, Write, Edit, mcp__*)
  for Hermes-native equivalents and Python library calls.
- Canonical Hermes frontmatter (name/description/version/author/
  license/metadata.hermes.{tags,related_skills}).
- Descriptions tightened to 187-238 chars, trigger-first.
- Attribution preserved: author field credits 'Anthropic (adapted by
  Nous Research)', license: Apache-2.0, each SKILL.md links back to
  the upstream source directory.

Verification:
- All 7 discovered by OptionalSkillSource with source_id='official'
- Bundle fetch includes support files (scripts, references, troubleshooting)
- related_skills cross-refs all resolve within the bundle
- No Claude product / Cowork / Office JS / /mnt/skills leakage
  remains in body text (bounded mentions only in attribution blocks)

Source: https://github.com/anthropics/financial-services (Apache-2.0)

* test(skills): cover additional rescan paths in skill_commands cache (#14536)

The rescan-on-platform-change fix landed in #18739 ships one regression
test that exercises the HERMES_PLATFORM env-var path. Three other code
paths in get_skill_commands / _resolve_skill_commands_platform have no
direct coverage; this commit adds a regression test for each.

- Gateway session context (HERMES_SESSION_PLATFORM via ContextVar): the
  resolver consults get_session_env after HERMES_PLATFORM, and the
  gateway sets that variable through set_session_vars (a ContextVar),
  not os.environ. The test uses set_session_vars / clear_session_vars
  to drive the actual gateway signal, and the disabled-skill stub reads
  the same value via get_session_env. A regression that swapped
  get_session_env for plain os.getenv would still pass an env-var-based
  test but break concurrent gateway sessions, which is the bug the
  ContextVar plumbing exists to prevent.
- Returning to no-platform-scope (CLI / cron / RL rollouts after a
  gateway session): the cached telegram view must be dropped and the
  unfiltered scan repopulated when HERMES_PLATFORM is unset again.
- Same-platform cache hit: consecutive calls under the same platform
  scope must NOT rescan. The rescan trigger is change in scope, not
  "always re-resolve" — a gateway serving many consecutive telegram
  requests should pay the scan cost once, not per request.

The third test wraps scan_skill_commands with a spy after the cache is
primed, so the assertion is on call_count == 0 across three subsequent
get_skill_commands() calls.

All 39 tests in tests/agent/test_skill_commands.py pass under
scripts/run_tests.sh.

* fix(gateway): translate inbound document host paths to container paths for Docker backend

When terminal.backend is docker, inbound documents uploaded via messaging
platforms (Telegram, Slack, Discord, Feishu, Email, etc.) are cached at a host
path under ~/.hermes/cache/documents, but the container sandbox only sees them
at the auto-mounted /root/.hermes/cache/documents path.

This PR adds to_agent_visible_cache_path() in tools/credential_files.py (the
natural sibling to get_cache_directory_mounts()) and calls it at the
document-context-injection site in gateway/run.py so the agent always receives
a path it can open directly, matching the mount layout already established
by get_cache_directory_mounts() (#4846).

Scope: only Docker backend for now; other backends use different mount
semantics and are left unchanged until verified.

Fixes #18787

* feat(gateway): opt-in cleanup of temporary progress bubbles (#21186)

When display.cleanup_progress (or display.platforms.<plat>.cleanup_progress)
is true, the gateway deletes tool-progress bubbles, long-running '⏳ Still
working...' notices, and status-callback messages after the final response
is delivered successfully. Currently effective on adapters that implement
delete_message (Telegram); silently no-ops elsewhere. Off by default.
Failed runs skip cleanup so bubbles stay as breadcrumbs.

Minimal plumbing: base.py's existing post_delivery_callback slot now chains
new registrations onto any existing callback (with per-callback exception
isolation) rather than clobbering. Stale-generation registrations are
rejected so they can't step on a fresher run's callbacks. This lets the
cleanup callback coexist with the background-review release hook already
registered on the same slot.

Co-authored-by: mrcharlesiv <Mrcharlesiv@gmail.com>

* fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at

The kanban_heartbeat tool called heartbeat_worker but never
heartbeat_claim, so a worker that loops the tool while a single tool
call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got
reclaimed by release_stale_claims. The function name and
heartbeat_claim's own docstring imply otherwise:

  "Workers that know they'll exceed 15 minutes should call this
   every few minutes to keep ownership."

But there was no caller in the worker tool path. Workers couldn't
invoke heartbeat_claim themselves either — it isn't exposed as a tool.

Fix: _handle_heartbeat now calls heartbeat_claim first, reading
HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins
this in _default_spawn). Falls back to _claimer_id() for locally-
driven workers that didn't go through dispatcher spawn.

Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires
rewinds claim_expires into the past, calls the tool, and asserts the
new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to
fail against the unfixed code (claim_expires stays at the rewound
value).

Closes the root cause underlying the symptom in #21141 (15-min
respawns of long-running workers). #21141 separately addresses
post-reclaim cleanup; this fixes the upstream "shouldn't have been
reclaimed in the first place" half.

* chore(release): map stephen0110 noreply email

* fix(kanban): stop reclaimed workers before retry

* fix(kanban): reap completed worker children in dispatch_once

The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway
= true`) is the parent of every spawned kanban worker. `_default_spawn`
calls `subprocess.Popen(..., start_new_session=True)` and returns the
pid — `start_new_session` detaches the controlling tty but does not
reparent to init, so the gateway keeps each worker as a child until it
`wait()`s for them.

Nothing in the dispatch loop ever calls `waitpid`. Result: every
completed worker becomes a `<defunct>` zombie that lingers until the
gateway exits. We hit ~430 zombies on a single hermes-agent container
after ~40 days of steady kanban traffic, approaching process-table
exhaustion on the host.

Fix: add a non-blocking reap loop at the top of `dispatch_once`, so
every dispatcher tick (default 60s) drains zombies that accumulated
since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError
means no children to reap.

Why here, not a SIGCHLD handler:
- signal.signal requires the main thread; gateway threading model makes
  that placement non-trivial.
- Bounded staleness: at default interval=60s the maximum live zombie
  count is one tick's worth of worker completions.
- No interaction with detect_crashed_workers: that function only inspects
  rows where status='running', and rows reach 'done' (and stop being
  inspected) before their workers exit.

* chore(release): map sonic-netizen noreply email

* fix: auto-block repeated kanban retries

* chore(release): map mwnickerson noreply email

* feat(gateway): auto-resume interrupted sessions after restart

* fix(gateway): preserve resume marker on interrupted restart

* refactor(gateway): simplify auto-resume + extend to crash recovery

Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape,
wider coverage.

Changes:
- Drop the synthetic '[System note: ...]' in the internal MessageEvent.
  The existing _is_resume_pending branch in _handle_message_with_agent
  (run.py ~L13738) already injects a reason-aware recovery system note
  on the next turn.  With kyan's text in place the model saw two stacked
  system notes.  Now the event text is empty and the existing injection
  path owns the wording.
- Drop SessionStore.list_resume_pending() as a new public method.  The
  filter is 8 lines inline in _schedule_resume_pending_sessions() —
  one caller, no other pluggability need.
- Add 'restart_interrupted' to the auto-resume reason set.  That's the
  reason SessionStore.suspend_recently_active() stamps on sessions
  recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker).
  Previously those sessions had to wait for a real user message to
  auto-resume; now they continue automatically at startup like
  drain-timeout interruptions do.
- Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so
  future reasons (e.g. 'manual_resume_request') can be opted in with
  one line.

Test coverage added:
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended > resume_pending)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)

E2E verified end-to-end with a real on-disk SessionStore: 2 eligible
sessions scheduled, 2 ineligible skipped, empty-text internal events
delivered to the adapter.

Co-authored-by: Kevin Yan <kevyan1998@gmail.com>

* fix(auth): sync shared Nous refresh tokens

* refactor(auth): dedupe file-lock helper; document Nous lock order

Extract the shared flock/msvcrt boilerplate from _auth_store_lock and
_nous_shared_store_lock into a single _file_lock(lock_path, holder,
timeout, message) helper. Each caller keeps its own threading.local
holder so reentrancy state stays per-lock.

Also document the lock-ordering invariant on both wrappers:
_auth_store_lock is OUTER, _nous_shared_store_lock is INNER for all
runtime refresh paths. The one exception is _try_import_shared_nous_state,
which holds the shared lock alone across the full HTTP refresh+mint
cycle to prevent concurrent sibling imports from racing on the single-
use shared refresh token; that helper must not be called with the auth
lock already held.

* fix(gateway): avoid duplicated responses history

* chore: add AUTHOR_MAP entries for thelumiereguy and counterposition

* fix(gateway): use monotonic deadlines in QR onboarding flows

* fix(oauth,gateway): monotonic deadlines for polling/timeout loops

Widen PR #20314's fix to the other timeout-polling sites in the codebase
that share the same wall-clock-jump bug class. All of these measure elapsed
timeout duration, not civil time, so they belong on time.monotonic().

- hermes_cli/auth.py: auth-store file-lock timeout, Spotify OAuth callback
  wait, Nous portal device-auth token poll.
- hermes_cli/copilot_auth.py: Copilot OAuth device-flow token poll.
- hermes_cli/gateway.py: gateway systemd restart wait.
- hermes_cli/web_server.py: dashboard Codex device-auth user_code wait,
  dashboard Nous device-auth token poll. (sess["expires_at"] stays on
  time.time() — it's a persisted absolute timestamp, not a local
  deadline-polling variable.)
- agent/copilot_acp_client.py: Copilot ACP JSON-RPC request timeout.

* fix(weixin): replace all aiohttp ClientTimeout with asyncio.wait_for()

aiohttp ClientTimeout uses BaseTimerContext which calls
loop.call_later() internally. When invoked via
asyncio.run_coroutine_threadsafe() from cron jobs, this
triggers "Timeout context manager should be used inside a task"
errors, causing message delivery failures.

Replace all direct ClientTimeout usage with asyncio.wait_for():
- _upload_ciphertext: CDN upload (120s timeout)
- _download_bytes: CDN download (configurable timeout)
- _download_remote_media: remote media fetch (30s timeout)

Also set total=None on _send_session to disable aiohttp built-in
timeout, and change trust_env=True to False to bypass proxy for
WeChat CDN connections.

* test(weixin): update timeout assertion for asyncio.wait_for migration

* chore: AUTHOR_MAP entry for chenlinfeng@ruije / @noOne-list

* feat(security): enable secret redaction by default (#17691, #20785) (#21193)

Flip the default for HERMES_REDACT_SECRETS from off to on so the redactor
already wired into send_message_tool, logs, and tool output actually runs
on a fresh install.

- agent/redact.py: env-var default "" → "true"
- hermes_cli/config.py: DEFAULT_CONFIG security.redact_secrets True;
  two config-template comments rewritten
- gateway/run.py + cli.py: startup log / banner warning when the user
  has explicitly opted out, so the downgrade is visible in agent.log
  and at CLI banner time
- docs/reference/environment-variables.md: description reconciled
- tests: flipped the default-pin, restructured the force=True
  regression test to explicit-false instead of unset

Users who need raw credential values (redactor development) can still
opt out via security.redact_secrets: false in config.yaml or
HERMES_REDACT_SECRETS=false in .env.

Closes #17691.
Addresses #20785 (short-term output-pipeline recommendation).

* feat: add Discord message deletion action

* chore: AUTHOR_MAP entry for @likejudy

* fix(security): close TOCTOU window in hermes_cli/auth.py credential writers (#21194)

`_save_auth_store`, `_save_qwen_cli_tokens`, and `_write_shared_nous_state`
all created the temp file via `Path.open('w')` / `Path.write_text` and only
tightened permissions to 0o600 afterward. Between create and chmod the file
existed at the process umask (commonly 0o644 = world-readable on multi-user
hosts), briefly exposing OAuth access/refresh tokens for Nous, Codex,
Copilot, Claude, Qwen, Gemini, and every other native OAuth provider that
flows through auth.json.

Switch all three to `os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen`
+ `fsync` so the file is atomic at 0o600 on creation. Tighten each parent
directory (`~/.hermes/`, Qwen auth dir, Nous shared auth dir) to 0o700 so
siblings can't traverse to the creds. `_save_auth_store` also gains a
per-process random temp suffix to match `agent/google_oauth.py` (#19673)
and `tools/mcp_oauth.py` (#21148).

Adds `tests/hermes_cli/test_auth_toctou_file_modes.py` asserting final
file mode 0o600 and parent dir mode 0o700 across all three writers, plus
an explicit `os.open(flags, mode)` check on the main auth.json writer
that would fail if anyone reintroduces the `Path.open('w')` pattern.
POSIX-only (mode bits skipped on Windows).

* fix(delegate): correct ACP docs — Claude Code CLI has no --acp flag

The delegate_task tool schema descriptions referenced 'claude --acp --stdio'
as an example, but Claude Code CLI does not support --acp or --stdio flags.

The ACP subprocess transport (agent/copilot_acp_client.py) is specifically
built for GitHub Copilot CLI ('copilot --acp --stdio').

Changes:
- Per-task acp_command example: 'claude' → 'copilot'
- Top-level acp_command description: remove 'Claude Code' reference,
  clarify requirement for ACP-compatible CLI (currently Copilot only)
- acp_args description: remove misleading claude-opus-4-6 example

Fixes #19055

* fix: exclude hidden and archive dirs from _find_skill rglob

* fix(gateway): preserve thread routing from cached live session sources

* fix(gateway): cap cached session sources with LRU eviction

Follow-up on top of Zyproth's session-source cache: swap the unbounded
dict for an OrderedDict with a 512-entry LRU cap so long-running
gateways can't accumulate stale entries for dead sessions forever.

- self._session_sources is now an OrderedDict
- _cache_session_source() move_to_end + popitem(last=False) above cap
- _get_cached_session_source() move_to_end on hit (LRU read bump)
- restart_test_helpers.py wires OrderedDict + _session_sources_max

* fix(mcp): give 'mcp add --command' a distinct argparse dest

The --command flag of `hermes mcp add` shared its argparse dest with the
top-level subparser (`dest="command"` in `hermes_cli/_parser.py`). When
the flag was omitted, argparse still wrote `args.command = None`,
clobbering the top-level value of `"mcp"`. The dispatcher then saw
`args.command is None` and fell through to interactive chat, so
`hermes mcp add ...` silently launched chat instead of registering the
server. `cmd_mcp_add` was never reached.

Use `dest="mcp_command"` on the flag and read it from `cmd_mcp_add`.
The user-facing CLI flag `--command` is unchanged; only the in-memory
namespace attribute moves. Also updates the `_make_args` helper in
`tests/hermes_cli/test_mcp_config.py` to populate the new dest, and
adds `tests/hermes_cli/test_mcp_add_command_dest.py` with a parser-
level regression test.

Closes #19785.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: add discodirector email to AUTHOR_MAP

* fix(bedrock): preserve reasoningContent across converse normalization

* feat(gateway): support [[as_document]] directive for skill media routing

Skills that produce large/lossless images (e.g. info-graph, where a
rendered JPG is 1-2 MB) currently lose quality in Telegram delivery
because `_IMAGE_EXTS` membership routes the file through
`send_multiple_images` → `sendMediaGroup`, which Telegram's server
re-encodes to JPEG @ 1280px max edge. The original bytes only survive
when the file goes through `send_document`, which the dispatch tables
in three places (`_process_message_background`, `_deliver_media_from_response`,
and the `send_message` tool's telegram path) only reach for files
whose extension is NOT in `_IMAGE_EXTS`.

This commit adds an `[[as_document]]` directive that mirrors the
existing `[[audio_as_voice]]` shape: a skill emits the directive once
in its response, and every image-extension MEDIA: file in that response
is delivered via `send_document` instead of `send_multiple_images` /
`sendPhoto`. The directive is detected at the dispatch sites (which see
the raw response) and the directive string is stripped from the
user-visible cleaned text in `extract_media` so it never leaks.

Granularity is intentionally all-or-nothing per response, matching
[[audio_as_voice]]'s scope. Skills that need fine control can split into
two responses.

Verified the targeted use case: info-graph emits

    信息图已生成(...)
    [[as_document]]
    MEDIA:/tmp/info-graph-x/infographic.jpg

→ Telegram receives `infographic.jpg` via sendDocument, original 1MB
JPEG bytes preserved, no recompression. Forwarding and download
filenames stay clean (`infographic.jpg`).

Tests: +3 cases in TestExtractMedia covering directive strip, isolation
from voice flag, and coexistence with [[audio_as_voice]]. All
113 pre-existing media/extract/send tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: update send_message_tool mocks for force_document kwarg

* chore: AUTHOR_MAP entry for @leon7609

* fix(model_switch): live model discovery for custom_providers in /model picker

custom_providers entries (section 4 of list_authenticated_providers) only
read the static models: dict from config.yaml, ignoring the live /v1/models
endpoint.  This means gateways like Bifrost that expose hundreds of models
only show the handful explicitly listed in config.

Add live discovery via fetch_api_models() for custom_providers entries
that have api_key + base_url, matching the existing behavior for user
providers: entries (section 3).  When the endpoint is reachable and
returns models, the live list replaces the static subset.

Fixes: /model picker showing only 9 models from a Bifrost gateway that
actually exposes 581.

* fix(memory): support OpenViking local resource uploads

* test(memory): harden OpenViking local upload coverage

* fix(memory): harden OpenViking local path uploads

* chore(release): add AUTHOR_MAP entries for ggnnggez and ehz0ah

Contributors to OpenViking local resource upload fix (#19569).

* docs(readme): drop misleading RL install-extras claim, defer to CONTRIBUTING

README.md:163 said atroposlib and tinker were pulled in by .[all,dev], but
.[all] does not include .[rl] — those dependencies live in pyproject.toml's
[rl] extra (lines 95-101). With the original wording, a contributor running
uv pip install -e ".[all,dev]" would not have atroposlib or tinker
installed.

Rather than swap one extra for another (which paths users to either of two
parallel install conventions — pip [rl] extra vs tinker-atropos submodule —
without saying which the project considers canonical), this PR drops the
specific install command from the README and links to CONTRIBUTING.md,
which already documents the actual development setup.

* fix(kanban): auto-block workers that exit without completing (#20894) (#21214)

When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
  reaped child's raw exit status in a bounded module registry
  (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
  signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
  is found dead. clean_exit → protocol_violation event + immediate
  circuit-breaker trip (failure_limit=1). Everything else keeps the
  existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in #20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
  self.adapters with list(...) before iterating. adapter.send() can
  hit a fatal-error path that pops the adapter from the dict, which
  was raising 'RuntimeError: dictionary changed size during iteration'
  during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
  rc=0 + still-running → status=blocked on first occurrence with
  protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
  non-zero exits keep the existing 2-strike behavior.

Closes #20894.

* fix(dashboard): stabilize embedded chat resume and scrollback

* fix(dashboard): let embedded chat use a single scroll system

* fix(dashboard): route browser wheel into inner TUI scrolling

* chore: AUTHOR_MAP entry for @nouseman666

* fix(cli): honor positive tool preview length

* chore: AUTHOR_MAP entry for @GinWU05

* fix(credential_pool): resolve key mix-up when custom providers share base_url

When multiple custom_providers share the same base_url but have different API keys,

get_custom_provider_pool_key() always returned the first match, causing wrong-key

unauthorized errors. Add provider_name parameter to prefer exact name matches

over base_url-only matching, with fallback for backward compatibility.

Fixes #19083

* feat(cli): show context compression count in status bar

Display the number of context compressions in the CLI status bar when
compressions > 0, helping users understand conversation compression
pressure during long sessions.

- Wide layout (>=76 cols): shows 'cmp N' between context percent and duration
- Medium layout (52-75 cols): shows 'cmp N' between percent and duration
- Narrow layout (<52 cols): omitted to save space
- Color-coded: dim for 1-4, warn for 5-9, bad for 10+
- Hidden when zero to keep the bar clean for new sessions

Closes #18564

* refactor: replace 'cmp' text with 🗜️ emoji in status bar

Address review feedback to use the clamp emoji (��️) instead of
the plain text 'cmp' prefix for the compression count indicator.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tui): surface compression count in Ink status bar

Parity with the classic CLI status bar (PR #18579). The Python backend
already exposes 'compressions' on SessionUsageResponse; this wires it
through the Ink Usage type and renders 'cmp N' next to the duration
segment of StatusRule.

- types.ts Usage: add optional compressions field
- appChrome.tsx StatusRule: render 'cmp N' when > 0, color-tiered by
  pressure (muted <5, warn 5-9, error 10+)
- Plain text 'cmp' token (no emoji) matches PR #18579's original author
  rationale and avoids Ink layout drift from VS16 emoji width

* chore(release): map altriatree@gmail.com -> @TruaShamu

* fix(curator): make manual runs synchronous

* docs(curator): update CLI docs for synchronous-by-default manual run

Follow-up to the previous commit which flipped 'hermes curator run'
default from async to sync. Updates the curator.md feature page and
cli-commands.md reference to show --background as the opt-in async
flag and note that the default now blocks until the LLM pass finishes.

* fix(install): remove uv exclude-newer cutoff

* docs: clarify API server tool execution locality

* fix(kanban): treat dashboard event-stream cancellation as normal shutdown

Stopping `hermes dashboard` with Ctrl-C while the Kanban dashboard is
open prints an ASGI traceback ending in
`plugins/kanban/dashboard/plugin_api.py::stream_events` at the
`asyncio.sleep(_EVENT_POLL_SECONDS)` line. This is a normal shutdown
path: Uvicorn cancels the open websocket task while it is sleeping in
the 300 ms poll loop. `asyncio.CancelledError` is a `BaseException` in
Python 3.8+ — the bare `except Exception:` handler below the existing
`WebSocketDisconnect:` clause does NOT catch it, so the cancellation
surfaces as an application traceback and routine dashboard exit looks
like a runtime failure.

Add an explicit `except asyncio.CancelledError: return` clause beside
the existing `WebSocketDisconnect` handler. Disconnection (client
closed the tab) and shutdown cancellation (dashboard process exiting)
are conceptually different paths but both warrant a quiet return; the
two clauses are kept separate to keep that intent explicit.

`asyncio` is already imported and used in this scope, so no new
import is needed. The bare `except Exception:` handler is preserved
verbatim, so genuine runtime failures still log a warning and close
the socket cleanly.

Closes #20790.

* chore(release): map SandroHub013 email

* test(kanban): regression for CancelledError swallow in stream_events

Drives stream_events directly and cancels the task while it is sleeping
in the poll loop, asserting the coroutine returns cleanly instead of
letting CancelledError bubble. Regression coverage for the Uvicorn
application traceback on dashboard Ctrl-C fixed by the preceding commit.

* fix(model_tools): log plugin hook exceptions instead of silently swallowing them

* feat(gateway): add `hermes gateway list` to show all profiles' gateway status

Add a new `hermes gateway list` subcommand that shows the running
status of gateways across all profiles in a single view:

    Gateways:
      ✓ default (current)        — PID 155469
      ✓ wx1                      — PID 166893
      ✗ dev                      — not running

Also includes `_print_other_profiles_gateway_status()` which appends
an "Other profiles" section to `hermes gateway status` output when
other profile gateways are running.

Both use existing `list_profiles()` and `find_profile_gateway_processes()`
— no new dependencies.

Closes #19127
Related: #19113, #4402, #4587

* fix(mcp-oauth): persist OAuth server metadata across process restarts (#21226)

The MCP SDK discovers OAuth server metadata (token_endpoint, etc.) on
demand and keeps it in memory only. Without disk persistence, a restart
with valid cached refresh tokens forces the SDK to fall back to the
guessed '{server_url}/token' path — which returns 404 on most real
providers (Notion, Atlassian, GitHub remote MCP, etc.) and triggers a
full browser re-authorization even though the refresh token is fine.

Add a .meta.json file next to the existing tokens/client_info files:

  HERMES_HOME/mcp-tokens/<server>.json        -- tokens (existing)
  HERMES_HOME/mcp-tokens/<server>.client.json -- client info (existing)
  HERMES_HOME/mcp-tokens/<server>.meta.json   -- oauth metadata (new)

Changes:
- HermesTokenStorage.save_oauth_metadata / load_oauth_metadata / _meta_path
  — disk layer for the discovered OAuthMetadata.
- HermesTokenStorage.remove() now also clears .meta.json so
  'hermes mcp remove <name>' and the manager's remove() path clean up fully.
- HermesMCPOAuthProvider._initialize cold-restores from disk before the
  existing pre-flight discovery runs. If disk has metadata we skip the
  discovery HTTP round-trips entirely.
- HermesMCPOAuthProvider._prefetch_oauth_metadata now persists ASM as
  soon as it's discovered, so even the first pre-flight run seeds disk.
- HermesMCPOAuthProvider._persist_oauth_metadata_if_changed() is called
  at the end of async_auth_flow so metadata discovered via the SDK's
  lazy 401-branch (not pre-flight) is also saved for next time.

Tests cover the storage roundtrip (save/load/missing/corrupt/remove) and
the manager provider path (cold-load restore, skip-when-in-memory,
persist-on-discover, noop-when-unchanged, end-to-end async_auth_flow).

Co-authored-by: nocturnum91 <50326054+nocturnum91@users.noreply.github.com>

* feat: add SSE transport support for MCP client

Add support for MCP servers using the SSE transport protocol
(SseServerTransport) alongside the existing Streamable HTTP and stdio
transports. Many MCP servers use SSE (GET /sse + POST /messages/)
which was previously unsupported -- the client silently fell back to
Streamable HTTP, causing 10s connection timeouts.

Changes:
- Import mcp.client.sse.sse_client with graceful fallback
- Check config.get('transport') == 'sse' in _run_http() to select
  the SSE transport path with proper timeout handling
- Read transport type from config in get_mcp_status() instead of
  hardcoding 'http' for URL-based servers
- Update docstring, example config, and feature list

* fix(browser): enforce cloud-metadata SSRF floor in hybrid routing (#16234) (#21228)

Cloud metadata endpoints (169.254.169.254 etc.) are now always blocked
by browser_navigate regardless of hybrid routing, allow_private_urls,
or backend.

Bug: commit 42c076d3 (#16136) added hybrid routing that flips
auto_local_this_nav=True for private URLs and short-circuits
_is_safe_url(). IMDS endpoints are technically private (169.254/16
link-local), so the sidecar happily routed them to a local Chromium,
and the agent could read IAM credentials via browser_snapshot. On
EC2/GCP/Azure this is a full SSRF-to-credential-theft.

Fix: new is_always_blocked_url() in url_safety.py — a narrow floor
that checks _BLOCKED_HOSTNAMES, _ALWAYS_BLOCKED_IPS,
_ALWAYS_BLOCKED_NETWORKS only. Applied as an independent gate in
browser_navigate's pre-nav and post-redirect checks, BEFORE
auto_local_this_nav gets a chance to short-circuit. Ordinary private
URLs (localhost, 192.168.x, 10.x, .local, CGNAT) still route to the
local sidecar as the #16136 feature intends.

Secondary fix (reporter's finding): _url_is_private() now explicitly
checks 172.16.0.0/12. ipaddress.is_private only covers that range on
Python ≥3.11 (bpo-40791), so on 3.10 runtimes those URLs were routed
to cloud instead of the local sidecar. No security impact — just a
correctness fix for the hybrid-routing feature.

Closes #16234.

* fix: WhatsApp bridge process leak and disable config asymmetry

- Add PID file mechanism to track bridge processes and kill stale ones on startup
- Improve _kill_port_process() with lsof fallback when fuser is not available
- Support explicit WhatsApp disable via config.yaml (whatsapp.enabled: false)
- Respect WHATSAPP_ENABLED=false env var to disable WhatsApp

Fixes #19124

* docs(contributing): align tool discovery and test runner with AGENTS.md

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(kanban): make dashboard board pin authoritative over server current file (#21230)

When the user created a new board via the dashboard with "switch" checked,
the server-side `current` file was flipped to the new board. Clicking the
original board's tab then showed no card…
RationallyPrime pushed a commit to RationallyPrime/hermes-agent that referenced this pull request May 8, 2026
…earch#20894) (NousResearch#21214)

When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
  reaped child's raw exit status in a bounded module registry
  (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
  signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
  is found dead. clean_exit → protocol_violation event + immediate
  circuit-breaker trip (failure_limit=1). Everything else keeps the
  existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in NousResearch#20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
  self.adapters with list(...) before iterating. adapter.send() can
  hit a fatal-error path that pops the adapter from the dict, which
  was raising 'RuntimeError: dictionary changed size during iteration'
  during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
  rc=0 + still-running → status=blocked on first occurrence with
  protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
  non-zero exits keep the existing 2-strike behavior.

Closes NousResearch#20894.
gzsiang pushed a commit to gzsiang/hermes-agent that referenced this pull request May 10, 2026
…earch#20894) (NousResearch#21214)

When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
  reaped child's raw exit status in a bounded module registry
  (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
  signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
  is found dead. clean_exit → protocol_violation event + immediate
  circuit-breaker trip (failure_limit=1). Everything else keeps the
  existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in NousResearch#20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
  self.adapters with list(...) before iterating. adapter.send() can
  hit a fatal-error path that pops the adapter from the dict, which
  was raising 'RuntimeError: dictionary changed size during iteration'
  during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
  rc=0 + still-running → status=blocked on first occurrence with
  protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
  non-zero exits keep the existing 2-strike behavior.

Closes NousResearch#20894.
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
…earch#20894) (NousResearch#21214)

When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
  reaped child's raw exit status in a bounded module registry
  (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
  signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
  is found dead. clean_exit → protocol_violation event + immediate
  circuit-breaker trip (failure_limit=1). Everything else keeps the
  existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in NousResearch#20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
  self.adapters with list(...) before iterating. adapter.send() can
  hit a fatal-error path that pops the adapter from the dict, which
  was raising 'RuntimeError: dictionary changed size during iteration'
  during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
  rc=0 + still-running → status=blocked on first occurrence with
  protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
  non-zero exits keep the existing 2-strike behavior.

Closes NousResearch#20894.
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
…earch#20894) (NousResearch#21214)

When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
  reaped child's raw exit status in a bounded module registry
  (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
  signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
  is found dead. clean_exit → protocol_violation event + immediate
  circuit-breaker trip (failure_limit=1). Everything else keeps the
  existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in NousResearch#20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
  self.adapters with list(...) before iterating. adapter.send() can
  hit a fatal-error path that pops the adapter from the dict, which
  was raising 'RuntimeError: dictionary changed size during iteration'
  during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
  rc=0 + still-running → status=blocked on first occurrence with
  protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
  non-zero exits keep the existing 2-strike behavior.

Closes NousResearch#20894.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Kanban worker enters PID NOT ALIVE respawn loop after successful completion

2 participants