fix(security): close TOCTOU window in hermes_cli/auth.py credential writers #21194
Merged
`_save_auth_store`, `_save_qwen_cli_tokens`, and `_write_shared_nous_state`
all created the temp file via `Path.open('w')` / `Path.write_text` and only
tightened permissions to 0o600 afterward. Between creation and chmod, the
file carried the process-umask default mode (commonly 0o644, i.e.
world-readable on multi-user hosts), briefly exposing OAuth access/refresh
tokens for Nous, Codex, Copilot, Claude, Qwen, Gemini, and every other
native OAuth provider that flows through auth.json.
Switch all three to `os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen`
+ `fsync` so the file is created atomically with mode 0o600. Tighten each
parent directory (`~/.hermes/`, the Qwen auth dir, the Nous shared auth dir)
to 0o700 so other local users can't traverse into the credentials.
`_save_auth_store` also gains a per-process random temp suffix to match
`agent/google_oauth.py` (#19673) and `tools/mcp_oauth.py` (#21148).
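The no-window write pattern can be sketched as follows (illustrative helper, not the actual `hermes_cli/auth.py` code; the real writers keep their own temp-file naming):

```python
import json
import os
import secrets

def write_private_json(path: str, payload: dict) -> None:
    """Create the file already at 0o600: there is no chmod window."""
    # Per-process random temp suffix avoids collisions between writers.
    tmp = f"{path}.{os.getpid()}.{secrets.token_hex(4)}.tmp"
    # O_EXCL guarantees we created the file, and the 0o600 mode is applied
    # atomically at creation (the umask can only clear bits, never widen them).
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename into place
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

On POSIX this leaves no instant at which the token file is readable by other users, which is the whole point of the fix.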
Adds `tests/hermes_cli/test_auth_toctou_file_modes.py` asserting final
file mode 0o600 and parent dir mode 0o700 across all three writers, plus
an explicit `os.open(flags, mode)` check on the main auth.json writer
that would fail if anyone reintroduces the `Path.open('w')` pattern.
POSIX-only (mode bits skipped on Windows).
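The mode assertions boil down to something like this (hypothetical helper name; the real checks live in the new test module):

```python
import os
import stat

def assert_private_layout(auth_file: str) -> None:
    # Hypothetical helper mirroring the new assertions:
    # file at 0o600, parent directory at 0o700 (POSIX-only).
    assert stat.S_IMODE(os.stat(auth_file).st_mode) == 0o600
    parent = os.path.dirname(auth_file)
    assert stat.S_IMODE(os.stat(parent).st_mode) == 0o700
```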
Force-pushed from 24ff681 to e0f736d
🔎 Lint report:

| Rule | Count |
|---|---|
| invalid-argument-type | 28 |
| unresolved-attribute | 6 |
| unsupported-operator | 4 |
| unresolved-import | 1 |

First entries:
run_agent.py:7973: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:2303: [invalid-argument-type] invalid-argument-type: Argument to function `ensure_lmstudio_model_loaded` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:11163: [invalid-argument-type] invalid-argument-type: Argument to function `apply_anthropic_cache_control` is incorrect: Expected `bool`, found `int | str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | dict[Unknown, Unknown]`
run_agent.py:2889: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:2889: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:7973: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:11859: [invalid-argument-type] invalid-argument-type: Argument to function `normalize_usage` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:8993: [unresolved-attribute] unresolved-attribute: Attribute `lower` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown, Unknown] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:10432: [invalid-argument-type] invalid-argument-type: Argument to function `_fixed_temperature_for_model` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
tests/run_agent/test_provider_attribution_headers.py:65: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | dict[str, str]`
run_agent.py:8568: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_profile` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:7972: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
tests/run_agent/test_provider_attribution_headers.py:131: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`
run_agent.py:3661: [invalid-argument-type] invalid-argument-type: Argument to `AIAgent.__init__` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:12549: [invalid-argument-type] invalid-argument-type: Argument to function `_pool_may_recover_from_rate_limit` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:4078: [invalid-argument-type] invalid-argument-type: Argument to function `save_trajectory` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
cli.py:7973: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:8714: [invalid-argument-type] invalid-argument-type: Argument to function `lmstudio_model_reasoning_options` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
tests/run_agent/test_provider_attribution_headers.py:130: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache"]` and `Unknown | str | dict[str, str] | ... omitted 3 union elements`
run_agent.py:5071: [unresolved-attribute] unresolved-attribute: Attribute `split` is not defined on `dict[Unknown | str, Unknown | str | dict[str, str]]`, `int`, `dict[Unknown, Unknown]` in union `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:7889: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `int`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:4626: [invalid-argument-type] invalid-argument-type: Argument to function `parse_rate_limit_headers` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
run_agent.py:12315: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:11877: [invalid-argument-type] invalid-argument-type: Argument to function `save_context_length` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown | str, Unknown | str | dict[str, str]] | int | dict[Unknown, Unknown]`
... and 14 more
✅ Fixed issues (42):

| Rule | Count |
|---|---|
| invalid-argument-type | 31 |
| unresolved-attribute | 7 |
| unsupported-operator | 4 |

First entries:
run_agent.py:11906: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
tests/agent/test_codex_cloudflare_headers.py:181: [unsupported-operator] unsupported-operator: Operator `in` is not supported between objects of type `Literal["originator"]` and `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:8157: [invalid-argument-type] invalid-argument-type: Argument to function `get_transport` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
tests/run_agent/test_provider_attribution_headers.py:65: [unresolved-attribute] unresolved-attribute: Attribute `startswith` is not defined on `dict[str, str]` in union `Unknown | str | Divergent | dict[str, str]`
run_agent.py:7889: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `int`, found `Divergent | Unknown | str | ... omitted 3 union elements`
cli.py:7973: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:8741: [invalid-argument-type] invalid-argument-type: Argument to function `github_model_reasoning_efforts` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:3661: [invalid-argument-type] invalid-argument-type: Argument to `AIAgent.__init__` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:7973: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str | None`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:11859: [invalid-argument-type] invalid-argument-type: Argument to function `normalize_usage` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:12315: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 4 union elements`
tests/run_agent/test_provider_attribution_headers.py:131: [unsupported-operator] unsupported-operator: Operator `not in` is not supported between objects of type `Literal["X-OpenRouter-Cache-TTL"]` and `Unknown | str | dict[str, str] | ... omitted 4 union elements`
run_agent.py:7892: [invalid-argument-type] invalid-argument-type: Argument to bound method `ContextCompressor.update_model` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:2889: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:5071: [unresolved-attribute] unresolved-attribute: Attribute `split` is not defined on `dict[Unknown, Unknown]`, `int`, `dict[Unknown | str, Unknown | str | dict[str, str]]` in union `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:2889: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_stale_timeout` is incorrect: Expected `str | None`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:4078: [invalid-argument-type] invalid-argument-type: Argument to function `save_trajectory` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:7651: [unresolved-attribute] unresolved-attribute: Attribute `strip` is not defined on `dict[Unknown, Unknown] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy` in union `Divergent | (Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:11163: [invalid-argument-type] invalid-argument-type: Argument to function `apply_anthropic_cache_control` is incorrect: Expected `bool`, found `int | Divergent | Unknown | ... omitted 3 union elements`
tests/agent/test_codex_cloudflare_headers.py:163: [unresolved-attribute] unresolved-attribute: Attribute `get` is not defined on `str & ~AlwaysFalsy`, `int & ~AlwaysFalsy` in union `(Unknown & ~AlwaysFalsy) | (str & ~AlwaysFalsy) | (dict[str, str] & ~AlwaysFalsy) | ... omitted 4 union elements`
run_agent.py:8993: [unresolved-attribute] unresolved-attribute: Attribute `lower` is not defined on `dict[Unknown, Unknown] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:11904: [invalid-argument-type] invalid-argument-type: Argument to function `estimate_usage_cost` is incorrect: Expected `str`, found `str | Unknown | dict[Unknown, Unknown] | int | dict[Unknown | str, Unknown | str | dict[str, str]]`
run_agent.py:7973: [invalid-argument-type] invalid-argument-type: Argument to function `get_provider_request_timeout` is incorrect: Expected `str`, found `Divergent | Unknown | str | ... omitted 3 union elements`
run_agent.py:6283: [invalid-argument-type] invalid-argument-type: Argument to function `_codex_cloudflare_headers` is incorrect: Expected `str`, found `Unknown | str | dict[str, str] | ... omitted 4 union elements`
run_agent.py:10492: [unresolved-attribute] unresolved-attribute: Attribute `strip` is not defined on `dict[Unknown, Unknown] & ~AlwaysFalsy`, `int & ~AlwaysFalsy`, `dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy` in union `(str & ~AlwaysFalsy) | (Unknown & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | ... omitted 3 union elements`
... and 17 more
Unchanged: 3893 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
Supersedes #21154 — same TOCTOU fix for …
bot-ted added a commit to bot-ted/hermes-agent that referenced this pull request on May 8, 2026:
* fix(kanban): avoid fragile failure-column renames
* chore: follow-up cleanup for Kanban migration fix
- Expand migration comment to name the primary failure mode (missing
column OperationalError from #20842) ahead of the secondary SQLite
schema-reparse concern; also document the stale-cols-snapshot invariant
- Add clarifying comments on from_row() legacy fallback branches noting
they are belt-and-suspenders dead code post-migration
- Add task_events comment in existing test explaining why the table is
required by the migrator
- Add test_legacy_migration_no_legacy_columns_at_all: Scenario A —
explicitly asserts the exact #20842 crash no longer occurs and that
consecutive_failures defaults to 0 on a DB that never had spawn_failures
- Add test_legacy_migration_both_columns_already_present: Scenario D —
asserts the migration is a no-op when both columns already exist,
preserving the existing counter value
* fix(tui): bound virtual history offset searches
* ci(docker): don't cancel overlapping builds, guard :latest
Switch top-level concurrency to cancel-in-progress=false so every push
to main gets its own SHA-tagged image published — no more discarded
builds when commits land back-to-back.
Guard the :latest tag with a second job that has its own concurrency
group with cancel-in-progress=true plus a git-ancestor check against
the revision label on the current :latest. Together these guarantee
:latest only ever moves forward in history: a slower run whose commit
isn't a descendant of the current :latest refuses to clobber it, and
a newer push mid-way through the move-latest job preempts the older
one before it can retag.
- Every main push publishes nousresearch/hermes-agent:sha-<commit>
with an org.opencontainers.image.revision label embedded.
- move-latest job reads that label off :latest, runs merge-base
--is-ancestor, and only retags (via buildx imagetools create,
registry-side, no rebuild) if our commit strictly descends.
- fetch-depth bumped to 1000 so merge-base has the history it needs.
- Release tag flow unchanged (unique tag, no race).
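The ancestor guard amounts to a `git merge-base --is-ancestor` exit-code check. A sketch in Python (the function name and strict-equality guard are illustrative; the real check runs in the workflow shell, with `current_latest_sha` read from the `org.opencontainers.image.revision` label):

```python
import subprocess

def may_move_latest(current_latest_sha: str, candidate_sha: str) -> bool:
    """Retag :latest only if the candidate strictly descends from it."""
    if current_latest_sha == candidate_sha:
        return False  # same commit: nothing to move forward
    # Exit code 0 means current_latest_sha is an ancestor of candidate_sha.
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", current_latest_sha, candidate_sha],
        check=False,
    )
    return result.returncode == 0
```

This is why the workflow bumps fetch-depth: without enough history, `merge-base` cannot establish the ancestry relation.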
* docs(tool-gateway): rewrite as pitch-first marketing page (#20827)
Previous version read like internal API docs — leading with env var tables,
config YAML, and 'precedence' rules before ever explaining the product.
Complete rewrite inverts the structure so readers see value first,
mechanics last.
Structure now:
- Lede: 'One subscription. Every tool built in.' + pitch paragraph
- CTA: subscribe/manage button styled as a real call-to-action
- What's included: emoji-led table with expanded descriptions per tool.
Image gen lists all 9 models by name (FLUX 2 Klein/Pro, Z-Image Turbo,
Nano Banana Pro, GPT Image 1.5/2, Ideogram V3, Recraft V4 Pro, Qwen)
- Why it's here: value bullets — one bill, one signup, one key, same
quality, bring-your-own anytime
- Get started: two-command flow (hermes model → hermes status)
- Eligibility: paid-tier note with upgrade link
- Mix and match: three realistic usage patterns
- Using individual image models: ID reference table for power users
- --- separator ---
- Configuration reference (demoted): use_gateway flag, disabling,
self-hosted gateway env vars moved below the fold where they belong
- FAQ: streamlined, removed redundant content
Fact-checked against code:
- 9 FAL models confirmed from tools/image_generation_tool.py FAL_MODELS
- Status section output verified against hermes_cli/status.py
- Portal subscription URL preserved
- Self-hosted env vars (TOOL_GATEWAY_DOMAIN etc.) kept accurate
Verified: docusaurus build SUCCESS, page renders, no new broken links.
* fix(auth): fall back to global-root auth.json for providers missing in profile
Profile processes (kanban workers, cron subprocesses, delegated subagents)
read the profile's auth.json only. If a provider was authenticated at the
global root but not inside the profile, the profile's credential_pool
comes back empty and the process fails with 'No LLM provider configured'
— even though the credentials are sitting in ~/.hermes/auth.json. #18594
propagated HERMES_HOME correctly, which is what surfaced this: workers
now land in the right profile, and the profile turns out to shadow global
with no fallback.
Semantics (read-only, per-provider shadowing):
* Profile has any entries for provider X → use profile only (global ignored).
* Profile has zero entries for provider X → fall back to global.
* Writes (write_credential_pool, _save_auth_store) still target the profile.
* Classic mode (HERMES_HOME == global root) skips the fallback entirely —
_global_auth_file_path() returns None.
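The shadowing rules above can be sketched as a small lookup (illustrative names and shapes; the real code operates on the credential_pool structures in auth.json):

```python
def resolve_provider_entries(provider, profile_store, global_store):
    """Per-provider shadowing, read side only.

    Stores map provider name -> list of credential entries. global_store
    is None in classic mode (HERMES_HOME == global root), which disables
    the fallback entirely. Writes always target the profile.
    """
    entries = profile_store.get(provider, [])
    if entries:
        return entries  # profile shadows global completely for this provider
    if global_store is None:
        return []  # classic mode: no fallback
    return global_store.get(provider, [])
```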
Also mirrors the fallback in get_provider_auth_state so OAuth singletons
(nous, minimax-oauth, openai-codex, spotify) inherit cleanly — the Nous
shared-token store (PR #19712) remains the authoritative path for Nous
OAuth rotation, this just makes the read side consistent with it.
Seat belt: _load_global_auth_store() refuses to read the real user's
~/.hermes/auth.json under PYTEST_CURRENT_TEST even when HERMES_HOME points
to a profile-shaped path. Guard uses $HOME (stable across fixtures) rather
than Path.home() (which fixtures often monkeypatch to a tmp root).
Reported by @SeedsForbidden on Twitter as the credential_pool shadowing
follow-up to the #18594 fix.
* feat(gateway): per-platform gateway_restart_notification flag
Adds an opt-out toggle on PlatformConfig that gates both restart
lifecycle pings: the "♻ Gateway restarted" message sent to the chat
that issued /restart, and the "♻️ Gateway online" home-channel
startup notification. Defaults to True so existing deployments are
unaffected.
The motivating split is operator vs. end-user surfaces: a back-channel
like Telegram should keep these pings, while a Slack workspace shared
with end users should not surface gateway lifecycle noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(gateway): also gate pre-restart "Gateway restarting" notification
Extend the gateway_restart_notification flag to cover
_notify_active_sessions_of_shutdown — the message that fires just
before drain ("⚠️ Gateway restarting — Your current task will be
interrupted. Send any message after restart and I'll try to resume
where you left off.") sent to active sessions and home channels.
Same operator/end-user reasoning: on a Slack workspace shared with
end users, "Gateway restarting" reads as "the bot is broken" — the
operator should be able to suppress it consistently with the other
two lifecycle pings rather than having a partial opt-out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
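A minimal sketch of the opt-out (the flag name is from the commits; the config class and send callback are illustrative):

```python
from dataclasses import dataclass

@dataclass
class PlatformConfigSketch:
    # Defaults to True so existing deployments keep all lifecycle pings.
    gateway_restart_notification: bool = True

def notify_lifecycle(cfg: PlatformConfigSketch, send, message: str) -> bool:
    # Gates "restarted", "online", and the pre-restart "restarting" ping alike.
    if not cfg.gateway_restart_notification:
        return False
    send(message)
    return True
```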
* chore: add guillaumemeyer to AUTHOR_MAP
For cherry-picked commits in PR #20801.
* fix(cli): submit LF enter in thin PTYs (#20896)
* fix(tui): refresh virtual offsets after row resize (#20898)
* fix(tui): honor skin highlight colors (#20895)
* fix(tui): steady transcript scrollbar (#20917)
* fix(tui): steady transcript scrollbar
Keep the visible scrollbar tied to committed viewport position while virtual history can still prefetch against pending scroll targets, and preserve drag grab offset synchronously for native-feeling scrollbar drags.
* fix(tui): smooth precision wheel scroll
Replace the opt-scroll throttle with frame-sized coalescing so modifier wheel gestures stay line-precise without stepping.
* fix(tui): restore voice push-to-talk parity (#20897)
* fix(tui): restore classic CLI voice push-to-talk parity
(cherry picked from commit 93b9ae301bb89f5b5e01b4b9f8ac91ffa74fbd9d)
* fix(tui): harden voice push-to-talk stop flow
Address review feedback from PR #16189 by stopping the active recorder before background transcription, documenting single-shot voice capture, and covering the TUI gateway flags with regression tests.
* fix(tui): preserve silent voice strike tracking
Keep single-shot voice recording's no-speech counter alive across starts so the TUI can still emit the three-strikes auto-disable event, and bind the auto-restart state at module scope for type checking.
* fix(tui): clean up voice stop failure path
Address follow-up review by naming the TUI flow as single-shot push-to-talk and cancelling the recorder when forced stop cannot produce a WAV.
* fix(tui): report busy voice capture starts
Return explicit start state from the voice wrapper so the TUI gateway does not report recording while forced-stop transcription is still cleaning up.
* fix(tui): handle busy voice record responses
Apply the gateway busy status immediately in the TUI and route forced-stop voice events to the session that sent the stop request.
* fix(tui): clear voice recording on null response
Treat a null voice.record RPC result as a failed optimistic start so the REC badge cannot stick after gateway-side errors.
* fix(tui): count silent manual voice stops
Preserve single-shot voice no-speech strikes through forced stop transcription so empty push-to-talk captures still trigger the three-strikes guard.
---------
Co-authored-by: Montbra <montbra@gmail.com>
* fix(gateway): don't dead-end setup wizard when only system-scope unit is installed
The setup wizard dropped non-root users at a bare shell prompt when
trying to start a system-scope gateway service. Previously
_require_root_for_system_service called sys.exit(1), which the
wizard's `except Exception` guards cannot catch (SystemExit is a
BaseException). Users with a pre-existing /etc/systemd/system unit
(e.g. from an earlier `sudo hermes setup` run) hit this whenever
they re-ran `hermes setup` as a regular user.
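The failure mode is easy to reproduce in isolation (the error name is from the commit message; the wizard body is an illustrative stand-in):

```python
def old_guard():
    raise SystemExit(1)  # what _require_root_for_system_service used to do

class SystemScopeRequiresRootError(RuntimeError):
    """Typed error the fix raises instead."""

def new_guard():
    raise SystemScopeRequiresRootError("system-scope unit requires root")

def wizard(step):
    # Illustrative stand-in for a wizard prompt site's guard.
    try:
        step()
    except Exception:
        return "remediation shown"  # catches Exception subclasses only
    return "ok"
```

Because SystemExit derives from BaseException rather than Exception, `wizard(old_guard)` terminates the process while `wizard(new_guard)` falls through to the remediation path.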
- Convert _require_root_for_system_service to raise a typed
SystemScopeRequiresRootError (RuntimeError subclass) instead of
sys.exit(1). The direct CLI path (`hermes gateway install|start|stop|
restart|uninstall` without sudo) still exits 1 cleanly via a new
catch at the top of gateway_command, matching the existing
UserSystemdUnavailableError pattern.
- Add _system_scope_wizard_would_need_root() pre-check and
_print_system_scope_remediation() helper. Both setup wizards
(hermes_cli/setup.py and hermes_cli/gateway.py::gateway_setup) now
detect the dead-end before prompting and print actionable guidance:
either `sudo systemctl start <service>` this time, or uninstall the
system unit and install a per-user one.
- Defense-in-depth: all 5 wizard prompt sites also catch
SystemScopeRequiresRootError and fall back to the remediation
helper if the pre-check is bypassed (race, etc.).
Tests: 12 new tests in TestSystemScopeRequiresRootError,
TestSystemScopeWizardPreCheck, TestSystemScopeRemediationOutput, and
TestGatewayCommandCatchesSystemScopeError covering the exception
contract, pre-check matrix (root vs non-root, system-only vs
user-present vs none vs explicit system=True), remediation output
for each action, and the direct-CLI exit-1 path.
* fix(tui): preserve session when switching personality
Previously, /personality in the TUI called _reset_session_agent() which
destroyed the agent, cleared conversation history, and effectively started
a new session. This made personality switching disruptive — users lost
their entire conversation context.
Now /personality updates the agent's ephemeral_system_prompt in-place and
injects a pivot marker into the conversation history. The marker tells
the model to adopt the new persona from that point forward, which is
necessary because LLMs tend to pattern-match their prior responses and
continue the established tone without an explicit signal.
Changes:
- tui_gateway/server.py: Rewrite _apply_personality_to_session to update
the agent in-place instead of resetting. Inject a user-role pivot
marker so the model actually switches style mid-conversation.
- ui-tui/src/app/slash/commands/session.ts: Update help text (no longer
mentions history reset).
- tests/test_tui_gateway_server.py: Update test to verify history is
preserved, pivot marker is injected, and ephemeral prompt is set.
* fix(gateway): wait for systemd restart readiness
* fix(discord): narrow rate-limit catch and move sync state under gateway/
Two follow-ups on top of helix4u's slash-command sync hardening:
- Only suppress exceptions that are actually Discord 429 rate limits
(discord.RateLimited, HTTPException with status 429, or a clearly
rate-limit-named duck type). Arbitrary failures that happen to expose
a retry_after attribute now re-raise to the outer handler instead of
silently swallowing a cooldown.
- Move the sync-state JSON under $HERMES_HOME/gateway/ so the home root
stops collecting ad-hoc runtime files.
Added a test verifying unrelated exceptions don't get misclassified as
rate limits.
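A sketch of the narrowed classifier (matched on type names on purpose so the example runs without discord.py; the real code checks the actual `discord.RateLimited` / `HTTPException` instances):

```python
def is_discord_rate_limit(exc: BaseException) -> bool:
    name = type(exc).__name__
    if name == "RateLimited":
        return True
    if name == "HTTPException" and getattr(exc, "status", None) == 429:
        return True
    # A clearly rate-limit-named duck type still counts, but a bare
    # retry_after attribute on an arbitrary exception no longer does.
    return "ratelimit" in name.lower().replace("_", "") and hasattr(exc, "retry_after")
```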
* docs(kanban): fix orchestrator skill setup instructions (#20958)
* docs(kanban): fix worker skill setup instructions too (#20960)
Follow-up to #20958. The worker skill section had the same stale
'hermes skills install devops/kanban-worker' command — kanban-worker
is also bundled, so that command fails with 'Could not fetch from any
source.'
Replace with bundled-skill verification + restore pattern, matching
the orchestrator section. Uses <your-worker-profile> placeholder since
assignees vary (researcher, writer, ops, linguist, reviewer, etc.)
rather than a single fixed 'worker' profile.
* feat(profiles): --no-skills flag for empty profile creation (#20986)
Adds `hermes profile create <name> --no-skills` to create a profile with
zero bundled skills. Writes a `.no-bundled-skills` marker file in the
profile root so `hermes update`'s all-profile skill sync loop also skips
the profile — without the marker, every update would re-seed skills and
the user would have to delete them again.
Use case (from @hiut1u): orchestrator profiles and narrow-task profiles
don't need 100+ bundled skills polluting their system prompt.
- create_profile() gains a `no_skills` param, mutually exclusive with
`--clone` / `--clone-all` (cloning explicitly copies skills).
- seed_profile_skills() no-ops on opted-out profiles and returns
`{skipped_opt_out: True}` so callers can report cleanly.
- Web API (POST /api/profiles) accepts `no_skills: bool`.
- Delete `.no-bundled-skills` to opt back in — next `hermes update`
re-seeds normally.
6 new tests in TestNoSkillsOptOut cover marker write, mutual exclusion
with clone, seed_profile_skills opt-out, fresh profile unaffected, and
delete-marker-re-enables-seeding.
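The marker check can be sketched as (marker filename from the commit; the seeding body is elided):

```python
from pathlib import Path

NO_SKILLS_MARKER = ".no-bundled-skills"

def seed_profile_skills_sketch(profile_root: Path) -> dict:
    # Opted-out profiles are skipped, so `hermes update` never re-seeds them.
    if (profile_root / NO_SKILLS_MARKER).exists():
        return {"skipped_opt_out": True}
    # ... the real implementation copies bundled skills here ...
    return {"seeded": True}
```

Deleting the marker restores normal seeding on the next update, matching the opt-back-in behavior described above.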
* fix: route Telegram image documents through photo handling
* chore: AUTHOR_MAP entry for mrcoferland
* test(docker): align Dockerfile contract tests with simplified TUI flow
The Dockerfile dropped the manual `@hermes/ink` materialisation gymnastics
in favour of letting npm workspaces resolve the bundled package
naturally. Two contract tests still asserted the older flow:
`test_dockerfile_installs_tui_dependencies` required:
'ui-tui/packages/hermes-ink/package-lock.json' in dockerfile_text
…but the lockfile is no longer COPIED individually — the entire
`ui-tui/packages/hermes-ink/` tree is COPIED instead (the workspace
reference from `ui-tui/package.json` is `file:` so npm needs the
real source, not just a manifest stub).
`test_dockerfile_materializes_local_tui_ink_package` required a 7-clause
conjunction matching specific `rm -rf` / `npm install --omit=dev`
`--prefix node_modules/@hermes/ink` / `rm -rf .../react` invocations
that were stripped out when the workspace resolution was simplified.
Update the assertions to pin the *contract* the image actually has to
carry rather than the *exact shell incantations* the old flow used:
* TUI deps install: ui-tui/package.json + ui-tui/package-lock.json +
ui-tui/packages/hermes-ink/ tree are all COPIED, and an npm
install/ci step runs in ui-tui.
* Bundled hermes-ink: the workspace package source is COPIED (so
`await import('@hermes/ink')` resolves at runtime).
This keeps the spirit of #15012 / #16690 (zombie reaping + bundled
workspace materialisation must continue to work) without locking the
Dockerfile into one specific implementation flavour.
Validation:
$ pytest tests/tools/test_dockerfile_pid1_reaping.py -q
6 passed in 1.43s
No production code change. Fixes the two failures observed on `main`
(run 25250051126):
`tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_installs_tui_dependencies`
`tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_materializes_local_tui_ink_package`
* test(update): patch isatty on real streams to fix xdist-flaky --yes tests
Two CI tests for the new `--yes` update flag (#18261) flaked under
`pytest-xdist` on Linux/Python 3.11 even though they passed every
local run on macOS/Python 3.14.4:
FAILED tests/hermes_cli/test_update_yes_flag.py
::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty
`AssertionError: assert <MagicMock 'input'>.called is False`
FAILED tests/hermes_cli/test_update_yes_flag.py
::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting
`AssertionError: assert <MagicMock '_restore_stashed_changes'>.called is False`
Captured stdout for the first failure shows `cmd_update` taking the
"Non-interactive session \u2014 skipping config migration prompt." branch
\u2014 i.e. the `sys.stdin.isatty() and sys.stdout.isatty()` check at
`hermes_cli/main.py:7118` evaluated to `False` despite the test doing:
with patch("hermes_cli.main.sys") as mock_sys:
mock_sys.stdin.isatty.return_value = True
mock_sys.stdout.isatty.return_value = True
The whole-module mock is fragile under xdist worker reuse: a sibling
test that imports `hermes_cli.main` first can leave another `sys`
reference resolved inside the function (re-import in a helper, etc.),
and the wholesale module replacement never gets consulted.
Switch to `patch.object(_sys.stdin, "isatty", return_value=True)` (and
the same for `stdout`). That patches the *attribute on the real stream
object* — every call site, no matter how it reached `sys.stdin`,
hits the patched method. Same fix applied to the stash-restore test
(it took the "non-TTY \u2192 skip restore prompt" branch for the same reason).
Validation:
$ pytest tests/hermes_cli/test_update_yes_flag.py -q
3 passed in 5.47s
No production code change. Fixes the two failures observed on `main`
(run 25250051126):
`tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty`
`tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting`
Refs: #18261 (added the `--yes` flag + these tests).
* fix(web): force light color-scheme on docs iframe
The Documentation tab embeds the public Hermes Agent docs site via an
<iframe>. On any system where the browser's prefers-color-scheme
resolves to dark — the default on macOS with system dark mode, and
common on Linux/Windows too — the docs body text rendered nearly
invisible against its own background.
Cause: Docusaurus intentionally leaves <html> and <body> transparent
and relies on the browser's Canvas color to fill the viewport. Inside
our iframe, the iframe element had bg-background (the dashboard's dark
canvas) AND inherited the dashboard's dark color-scheme, so the
browser set the iframe's Canvas to a dark value. Docusaurus's
transparent body exposed that dark Canvas, and the docs body text
(tuned for a light Canvas) became near-illegible. Affects every
built-in dashboard theme.
Fix: replace bg-background on the iframe with [color-scheme:light]
(spec-blessed cross-origin override of the inherited color-scheme;
forces the iframe's Canvas to light) and bg-white (belt-and-suspenders
fallback during the brief paint window before content loads). The
docs site's own theme toggle keeps working — Docusaurus stores its
choice in localStorage and applies opaque dark backgrounds to its
layout elements that cover the white Canvas we forced.
* fix(security): close TOCTOU window when saving MCP OAuth credentials
_write_json (the persistence helper used by HermesTokenStorage for both
tokens and client_info) created the temp file via Path.write_text and
only chmod'd it to 0o600 afterward. Between create and chmod the file
existed on disk at the process umask (commonly 0o644 = world-readable),
briefly exposing MCP OAuth access/refresh tokens to other local users.
Use os.open with O_WRONLY|O_CREAT|O_EXCL and an explicit S_IRUSR|S_IWUSR
mode so the file is created atomically at 0o600, plus tighten the parent
dir to 0o700 so siblings can't traverse to the creds file. The temp name
also gains a per-process random suffix to avoid collisions between
concurrent writers and stale leftovers from a crashed prior write.
Mirrors the fix shipped for agent/google_oauth.py in #19673.
Adds a regression test asserting the resulting file mode is 0o600 and
the parent directory is 0o700 (skipped on Windows where POSIX mode bits
aren't enforced).
* chore(release): add Gutslabs to AUTHOR_MAP for PR #21148 salvage
* test(update): teach restart-mocks about the post-update survivor sweep
Issue #17648 added a post-update SIGTERM-survivor sweep to `cmd_update`:
~3s after issuing graceful/SIGTERM restarts, the code re-queries
`find_gateway_pids` and SIGKILLs anything still alive. That's the
right fix for stuck-drain gateways in production, but it broke three
unit tests that assumed `find_gateway_pids` would keep returning the
same PIDs forever:
FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways
AssertionError: Expected 'kill' to not have been called. Called 1 times.
Calls: [call(12345, <Signals.SIGKILL: 9>)].
FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm
AssertionError: Expected 'kill' to have been called once. Called 2 times.
Calls: [call(12345, SIGTERM), call(12345, SIGKILL)].
FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid
assert 2 == 1
manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)]
In each test `os.kill` is mocked, so the simulated PID never actually
exits — the sweep finds it again and escalates. The production code
is correct; the tests just need to model OS behaviour properly.
Two-test fix (profile-manual restart cases): use
`side_effect=[[12345], []]` so the first `find_gateway_pids` call
returns the live PID and the second (the sweep) returns nothing, as if
the OS had reaped the process.
Service-PID-exclusion fix: track which PIDs got killed in a closure
set, and exclude them on subsequent `fake_find` calls. `os.kill`
gets a `side_effect` that records the kill instead of swallowing it
silently. Now the sweep doesn't re-find the manual PID, no SIGKILL
escalation, `manual_kills == 1`.
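Both mocking strategies can be sketched in isolation (names are illustrative, not the actual test fixtures):

```python
from unittest.mock import MagicMock

# 1) side_effect list: the first find_gateway_pids call sees the live PID,
#    the second (the post-update sweep) sees nothing, as if the OS reaped it.
find = MagicMock(side_effect=[[12345], []])
assert find() == [12345]   # initial restart pass
assert find() == []        # survivor sweep finds nothing -> no SIGKILL

# 2) Closure-set variant: record kills and exclude killed PIDs on re-find.
killed = set()

def fake_find():
    return [pid for pid in (42999,) if pid not in killed]

def fake_kill(pid, sig):
    killed.add(pid)        # record instead of swallowing silently

assert fake_find() == [42999]
fake_kill(42999, 15)
assert fake_find() == []   # sweep no longer re-finds the killed PID
```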
Validation:
$ pytest tests/hermes_cli/test_update_gateway_restart.py -q
43 passed in 4.13s
No production code change. Fixes the three failures observed on `main`
(run 25250051126):
test_update_restarts_profile_manual_gateways
test_update_profile_manual_gateway_falls_back_to_sigterm
test_update_kills_manual_pid_but_not_service_pid
Refs: #17648 (post-update survivor sweep that the tests didn't model).
* fix(image-routing): expose attached image paths in native multimodal text part
In native image mode (vision-capable models like gpt-4o, claude-sonnet-4),
build_native_content_parts() previously emitted only the user's caption
plus image_url parts. The local file path of each attached image never
appeared in the conversation text, so the model could see the pixels but
had no string handle for tools that take image_url: str (custom MCP
tools, vision_analyze on a re-look, attach-to-tracker workflows).
The text-mode path already injects an equivalent hint via
Runner._enrich_message_with_vision ("...vision_analyze using image_url:
<path>..."). This brings native mode to parity by appending one
"[Image attached at: <path>]" line per successfully attached image to
the user-text part of the multimodal turn. Skipped (unreadable) paths
are NOT advertised, so the model is never told a non-existent file is
attached.
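The parity behaviour can be sketched as follows (the function below is a simplified stand-in, not the real `build_native_content_parts` signature):

```python
# Append one "[Image attached at: <path>]" hint per successfully attached
# image to the user-text part; skipped (unreadable) paths are never advertised.
def build_parts(caption: str, attached: list, skipped: list) -> list:
    hints = "".join(f"\n[Image attached at: {p}]" for p in attached)
    parts = [{"type": "text", "text": caption + hints}]
    parts += [{"type": "image_url", "image_url": {"url": p}} for p in attached]
    return parts

parts = build_parts("look at this", ["/tmp/a.png"], ["/tmp/missing.png"])
assert "[Image attached at: /tmp/a.png]" in parts[0]["text"]
assert "missing" not in parts[0]["text"]   # never advertise a skipped file
```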
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(optional-skills): port Anthropic financial-services skills as optional finance bundle (#21180)
Adds 7 optional skills under optional-skills/finance/ adapted from
anthropics/financial-services (Apache-2.0):
excel-author — openpyxl conventions: blue/black/green cells,
formulas over hardcodes, named ranges, balance
checks, sensitivity tables. Ships recalc.py.
pptx-author — python-pptx for model-backed decks (pitch,
IC memo, earnings note) that bind every number
to a source workbook cell.
dcf-model — institutional DCF (49KB skill): projections,
WACC, terminal value, Bear/Base/Bull scenarios,
5x5 sensitivity tables. Ships validate_dcf.py.
comps-analysis — comparable company analysis: operating metrics,
multiples, statistical benchmarking.
lbo-model — leveraged buyout: S&U, debt schedule, cash
sweep, exit multiple, IRR/MOIC sensitivity.
3-statement-model — fully-integrated IS/BS/CF with balance-check
plugs. Ships references/ for formatting,
formulas, SEC filings.
merger-model — accretion/dilution analysis for M&A.
All seven are optional (not active by default). Users install via
'hermes skills install official/finance/<skill>'.
Hermesification:
- Stripped every Office JS / Office Add-in / mcp__office__*
branch — skills assume headless openpyxl only.
- Replaced Cowork MCP data-source instructions with 'MCP first (via
native-mcp), fall back to web_search/web_extract against SEC EDGAR
and user-provided data'.
- Swapped Claude tool references (Bash, Read, Write, Edit, mcp__*)
for Hermes-native equivalents and Python library calls.
- Canonical Hermes frontmatter (name/description/version/author/
license/metadata.hermes.{tags,related_skills}).
- Descriptions tightened to 187-238 chars, trigger-first.
- Attribution preserved: author field credits 'Anthropic (adapted by
Nous Research)', license: Apache-2.0, each SKILL.md links back to
the upstream source directory.
Verification:
- All 7 discovered by OptionalSkillSource with source_id='official'
- Bundle fetch includes support files (scripts, references, troubleshooting)
- related_skills cross-refs all resolve within the bundle
- No Claude product / Cowork / Office JS / /mnt/skills leakage
remains in body text (bounded mentions only in attribution blocks)
Source: https://github.com/anthropics/financial-services (Apache-2.0)
* test(skills): cover additional rescan paths in skill_commands cache (#14536)
The rescan-on-platform-change fix that landed in #18739 ships one regression
test that exercises the HERMES_PLATFORM env-var path. Three other code
paths in get_skill_commands / _resolve_skill_commands_platform have no
direct coverage; this commit adds a regression test for each.
- Gateway session context (HERMES_SESSION_PLATFORM via ContextVar): the
resolver consults get_session_env after HERMES_PLATFORM, and the
gateway sets that variable through set_session_vars (a ContextVar),
not os.environ. The test uses set_session_vars / clear_session_vars
to drive the actual gateway signal, and the disabled-skill stub reads
the same value via get_session_env. A regression that swapped
get_session_env for plain os.getenv would still pass an env-var-based
test but break concurrent gateway sessions, which is the bug the
ContextVar plumbing exists to prevent.
- Returning to no-platform-scope (CLI / cron / RL rollouts after a
gateway session): the cached telegram view must be dropped and the
unfiltered scan repopulated when HERMES_PLATFORM is unset again.
- Same-platform cache hit: consecutive calls under the same platform
scope must NOT rescan. The rescan trigger is change in scope, not
"always re-resolve" — a gateway serving many consecutive telegram
requests should pay the scan cost once, not per request.
The third test wraps scan_skill_commands with a spy after the cache is
primed, so the assertion is on call_count == 0 across three subsequent
get_skill_commands() calls.
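The ContextVar plumbing the first test exercises can be sketched like this (names mirror the commit text; the implementation is an assumption about its shape):

```python
# Session-scoped vars via ContextVar: concurrent gateway sessions on one
# process each see their own value, which plain os.environ cannot provide.
import os
from contextvars import ContextVar
from typing import Optional

_session_vars: ContextVar = ContextVar("session_vars", default={})

def set_session_vars(values: dict) -> None:
    _session_vars.set(dict(values))

def get_session_env(name: str) -> Optional[str]:
    # Session-scoped value wins; fall back to the process environment.
    return _session_vars.get().get(name) or os.environ.get(name)

set_session_vars({"HERMES_SESSION_PLATFORM": "telegram"})
assert get_session_env("HERMES_SESSION_PLATFORM") == "telegram"
```

A resolver that swapped `get_session_env` for plain `os.getenv` would still pass an env-var-based test, which is why the test drives the ContextVar path directly.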
All 39 tests in tests/agent/test_skill_commands.py pass under
scripts/run_tests.sh.
* fix(gateway): translate inbound document host paths to container paths for Docker backend
When terminal.backend is docker, inbound documents uploaded via messaging
platforms (Telegram, Slack, Discord, Feishu, Email, etc.) are cached at a host
path under ~/.hermes/cache/documents, but the container sandbox only sees them
at the auto-mounted /root/.hermes/cache/documents path.
This PR adds to_agent_visible_cache_path() in tools/credential_files.py (the
natural sibling to get_cache_directory_mounts()) and calls it at the
document-context-injection site in gateway/run.py so the agent always receives
a path it can open directly, matching the mount layout already established
by get_cache_directory_mounts() (#4846).
Scope: only Docker backend for now; other backends use different mount
semantics and are left unchanged until verified.
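The mapping the helper performs might look like this (a sketch of the shape only; the real `to_agent_visible_cache_path` lives in `tools/credential_files.py`):

```python
# Translate a host cache path to the container-visible mount point for the
# Docker backend; other backends keep host paths until their mounts are verified.
from pathlib import Path

HOST_CACHE = Path.home() / ".hermes" / "cache" / "documents"
CONTAINER_CACHE = Path("/root/.hermes/cache/documents")

def to_agent_visible_cache_path(host_path: Path, backend: str) -> Path:
    if backend != "docker":
        return host_path
    try:
        return CONTAINER_CACHE / host_path.relative_to(HOST_CACHE)
    except ValueError:
        return host_path   # not under the cache dir: leave untouched

p = to_agent_visible_cache_path(HOST_CACHE / "report.pdf", "docker")
assert str(p) == "/root/.hermes/cache/documents/report.pdf"
```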
Fixes #18787
* feat(gateway): opt-in cleanup of temporary progress bubbles (#21186)
When display.cleanup_progress (or display.platforms.<plat>.cleanup_progress)
is true, the gateway deletes tool-progress bubbles, long-running '⏳ Still
working...' notices, and status-callback messages after the final response
is delivered successfully. Currently effective on adapters that implement
delete_message (Telegram); silently no-ops elsewhere. Off by default.
Failed runs skip cleanup so bubbles stay as breadcrumbs.
Minimal plumbing: base.py's existing post_delivery_callback slot now chains
new registrations onto any existing callback (with per-callback exception
isolation) rather than clobbering. Stale-generation registrations are
rejected so they can't step on a fresher run's callbacks. This lets the
cleanup callback coexist with the background-review release hook already
registered on the same slot.
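The chaining-with-isolation idea can be sketched as follows (the real slot lives on the adapter base class; names here are illustrative):

```python
# Chain a new callback onto an existing one instead of clobbering it;
# each callback runs inside its own try so one failure can't starve the rest.
def chain_callbacks(existing, new):
    if existing is None:
        return new
    def chained(*args, **kwargs):
        for cb in (existing, new):
            try:
                cb(*args, **kwargs)
            except Exception:
                pass   # per-callback exception isolation
    return chained

calls = []
cb = None
cb = chain_callbacks(cb, lambda: calls.append("release_hook"))
cb = chain_callbacks(cb, lambda: 1 / 0)            # misbehaving callback
cb = chain_callbacks(cb, lambda: calls.append("cleanup"))
cb()
assert calls == ["release_hook", "cleanup"]        # isolation held
```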
Co-authored-by: mrcharlesiv <Mrcharlesiv@gmail.com>
* fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at
The kanban_heartbeat tool called heartbeat_worker but never
heartbeat_claim, so a worker that loops the tool while a single tool
call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got
reclaimed by release_stale_claims. The function name and
heartbeat_claim's own docstring imply otherwise:
"Workers that know they'll exceed 15 minutes should call this
every few minutes to keep ownership."
But there was no caller in the worker tool path. Workers couldn't
invoke heartbeat_claim themselves either — it isn't exposed as a tool.
Fix: _handle_heartbeat now calls heartbeat_claim first, reading
HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins
this in _default_spawn). Falls back to _claimer_id() for locally-
driven workers that didn't go through dispatcher spawn.
Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires
rewinds claim_expires into the past, calls the tool, and asserts the
new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to
fail against the unfixed code (claim_expires stays at the rewound
value).
Closes the root cause underlying the symptom in #21141 (15-min
respawns of long-running workers). #21141 separately addresses
post-reclaim cleanup; this fixes the upstream "shouldn't have been
reclaimed in the first place" half.
* chore(release): map stephen0110 noreply email
* fix(kanban): stop reclaimed workers before retry
* fix(kanban): reap completed worker children in dispatch_once
The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway
= true`) is the parent of every spawned kanban worker. `_default_spawn`
calls `subprocess.Popen(..., start_new_session=True)` and returns the
pid — `start_new_session` detaches the controlling tty but does not
reparent to init, so the gateway keeps each worker as a child until it
`wait()`s for them.
Nothing in the dispatch loop ever calls `waitpid`. Result: every
completed worker becomes a `<defunct>` zombie that lingers until the
gateway exits. We hit ~430 zombies on a single hermes-agent container
after ~40 days of steady kanban traffic, approaching process-table
exhaustion on the host.
Fix: add a non-blocking reap loop at the top of `dispatch_once`, so
every dispatcher tick (default 60s) drains zombies that accumulated
since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError
means no children to reap.
Why here, not a SIGCHLD handler:
- signal.signal requires the main thread; gateway threading model makes
that placement non-trivial.
- Bounded staleness: at default interval=60s the maximum live zombie
count is one tick's worth of worker completions.
- No interaction with detect_crashed_workers: that function only inspects
rows where status='running', and rows reach 'done' (and stop being
inspected) before their workers exit.
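The non-blocking reap loop might be sketched like this (POSIX-only; function name is illustrative):

```python
# Drain zombie children without blocking: WNOHANG returns (0, 0) when
# children exist but none have exited; ChildProcessError means no children.
import os

def reap_completed_workers():
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break              # no children at all
        if pid == 0:
            break              # children exist but none have exited yet
        reaped.append((pid, status))
    return reaped
```

Called at the top of every dispatcher tick, the maximum live zombie count is bounded by one tick's worth of worker completions.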
* chore(release): map sonic-netizen noreply email
* fix: auto-block repeated kanban retries
* chore(release): map mwnickerson noreply email
* feat(gateway): auto-resume interrupted sessions after restart
* fix(gateway): preserve resume marker on interrupted restart
* refactor(gateway): simplify auto-resume + extend to crash recovery
Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape,
wider coverage.
Changes:
- Drop the synthetic '[System note: ...]' in the internal MessageEvent.
The existing _is_resume_pending branch in _handle_message_with_agent
(run.py ~L13738) already injects a reason-aware recovery system note
on the next turn. With kyan's text in place the model saw two stacked
system notes. Now the event text is empty and the existing injection
path owns the wording.
- Drop SessionStore.list_resume_pending() as a new public method. The
filter is 8 lines inline in _schedule_resume_pending_sessions() —
one caller, no other pluggability need.
- Add 'restart_interrupted' to the auto-resume reason set. That's the
reason SessionStore.suspend_recently_active() stamps on sessions
recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker).
Previously those sessions had to wait for a real user message to
auto-resume; now they continue automatically at startup like
drain-timeout interruptions do.
- Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so
future reasons (e.g. 'manual_resume_request') can be opted in with
one line.
Test coverage added:
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended > resume_pending)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)
Verified end-to-end with a real on-disk SessionStore: 2 eligible
sessions scheduled, 2 ineligible skipped, empty-text internal events
delivered to the adapter.
Co-authored-by: Kevin Yan <kevyan1998@gmail.com>
* fix(auth): sync shared Nous refresh tokens
* refactor(auth): dedupe file-lock helper; document Nous lock order
Extract the shared flock/msvcrt boilerplate from _auth_store_lock and
_nous_shared_store_lock into a single _file_lock(lock_path, holder,
timeout, message) helper. Each caller keeps its own threading.local
holder so reentrancy state stays per-lock.
Also document the lock-ordering invariant on both wrappers:
_auth_store_lock is OUTER, _nous_shared_store_lock is INNER for all
runtime refresh paths. The one exception is _try_import_shared_nous_state,
which holds the shared lock alone across the full HTTP refresh+mint
cycle to prevent concurrent sibling imports from racing on the single-
use shared refresh token; that helper must not be called with the auth
lock already held.
* fix(gateway): avoid duplicated responses history
* chore: add AUTHOR_MAP entries for thelumiereguy and counterposition
* fix(gateway): use monotonic deadlines in QR onboarding flows
* fix(oauth,gateway): monotonic deadlines for polling/timeout loops
Widen PR #20314's fix to the other timeout-polling sites in the codebase
that share the same wall-clock-jump bug class. All of these measure elapsed
timeout duration, not civil time, so they belong on time.monotonic().
- hermes_cli/auth.py: auth-store file-lock timeout, Spotify OAuth callback
wait, Nous portal device-auth token poll.
- hermes_cli/copilot_auth.py: Copilot OAuth device-flow token poll.
- hermes_cli/gateway.py: gateway systemd restart wait.
- hermes_cli/web_server.py: dashboard Codex device-auth user_code wait,
dashboard Nous device-auth token poll. (sess["expires_at"] stays on
time.time() — it's a persisted absolute timestamp, not a local
deadline-polling variable.)
- agent/copilot_acp_client.py: Copilot ACP JSON-RPC request timeout.
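The pattern applied at all these sites can be sketched generically (the loop body is illustrative):

```python
# Deadline polling on time.monotonic(): it measures elapsed duration and
# cannot jump when the wall clock is adjusted (NTP step, suspend/resume).
import time

def poll_until(check, timeout: float, interval: float = 0.1):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("polling deadline exceeded")

ticks = iter([None, None, "ready"])
assert poll_until(lambda: next(ticks), timeout=5.0, interval=0.01) == "ready"
```

Persisted absolute timestamps (like `sess["expires_at"]`) stay on `time.time()`, since they must survive process restarts and be comparable to civil time.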
* fix(weixin): replace all aiohttp ClientTimeout with asyncio.wait_for()
aiohttp ClientTimeout uses BaseTimerContext which calls
loop.call_later() internally. When invoked via
asyncio.run_coroutine_threadsafe() from cron jobs, this
triggers "Timeout context manager should be used inside a task"
errors, causing message delivery failures.
Replace all direct ClientTimeout usage with asyncio.wait_for():
- _upload_ciphertext: CDN upload (120s timeout)
- _download_bytes: CDN download (configurable timeout)
- _download_remote_media: remote media fetch (30s timeout)
Also set total=None on _send_session to disable aiohttp built-in
timeout, and change trust_env=True to False to bypass proxy for
WeChat CDN connections.
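The replacement pattern can be sketched like this (the session usage is illustrative; the real helpers live in the WeChat adapter):

```python
# asyncio.wait_for wraps the awaitable in a task itself, so it is safe even
# when the coroutine was scheduled via run_coroutine_threadsafe from a cron
# thread -- unlike aiohttp's ClientTimeout, whose timer context requires
# already running inside a task.
import asyncio

async def download_bytes(session, url: str, timeout: float = 30.0) -> bytes:
    async def _fetch() -> bytes:
        async with session.get(url) as resp:
            return await resp.read()
    return await asyncio.wait_for(_fetch(), timeout=timeout)
```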
* test(weixin): update timeout assertion for asyncio.wait_for migration
* chore: AUTHOR_MAP entry for chenlinfeng@ruije / @noOne-list
* feat(security): enable secret redaction by default (#17691, #20785) (#21193)
Flip the default for HERMES_REDACT_SECRETS from off to on so the redactor
already wired into send_message_tool, logs, and tool output actually runs
on a fresh install.
- agent/redact.py: env-var default "" → "true"
- hermes_cli/config.py: DEFAULT_CONFIG security.redact_secrets True;
two config-template comments rewritten
- gateway/run.py + cli.py: startup log / banner warning when the user
has explicitly opted out, so the downgrade is visible in agent.log
and at CLI banner time
- docs/reference/environment-variables.md: description reconciled
- tests: flipped the default-pin, restructured the force=True
regression test to explicit-false instead of unset
Users who need raw credential values (redactor development) can still
opt out via security.redact_secrets: false in config.yaml or
HERMES_REDACT_SECRETS=false in .env.
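The default flip amounts to something like this (the env-var name is from the commit; the parsing logic is an assumption about the helper's shape):

```python
# Default changed from "" (off) to "true" (on): a fresh install with the
# variable unset now runs the redactor; explicit opt-out still works.
import os

def redact_secrets_enabled() -> bool:
    raw = os.environ.get("HERMES_REDACT_SECRETS", "true")
    return raw.strip().lower() not in ("", "0", "false", "no", "off")

assert redact_secrets_enabled()                      # unset -> on by default
os.environ["HERMES_REDACT_SECRETS"] = "false"
assert not redact_secrets_enabled()                  # explicit opt-out
del os.environ["HERMES_REDACT_SECRETS"]
```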
Closes #17691.
Addresses #20785 (short-term output-pipeline recommendation).
* feat: add Discord message deletion action
* chore: AUTHOR_MAP entry for @likejudy
* fix(security): close TOCTOU window in hermes_cli/auth.py credential writers (#21194)
`_save_auth_store`, `_save_qwen_cli_tokens`, and `_write_shared_nous_state`
all created the temp file via `Path.open('w')` / `Path.write_text` and only
tightened permissions to 0o600 afterward. Between create and chmod the file
existed at the process umask (commonly 0o644 = world-readable on multi-user
hosts), briefly exposing OAuth access/refresh tokens for Nous, Codex,
Copilot, Claude, Qwen, Gemini, and every other native OAuth provider that
flows through auth.json.
Switch all three to `os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen`
+ `fsync` so the file is atomic at 0o600 on creation. Tighten each parent
directory (`~/.hermes/`, Qwen auth dir, Nous shared auth dir) to 0o700 so
siblings can't traverse to the creds. `_save_auth_store` also gains a
per-process random temp suffix to match `agent/google_oauth.py` (#19673)
and `tools/mcp_oauth.py` (#21148).
Adds `tests/hermes_cli/test_auth_toctou_file_modes.py` asserting final
file mode 0o600 and parent dir mode 0o700 across all three writers, plus
an explicit `os.open(flags, mode)` check on the main auth.json writer
that would fail if anyone reintroduces the `Path.open('w')` pattern.
POSIX-only (mode bits skipped on Windows).
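The whole create-at-0o600 pattern can be sketched as one writer (a hedged illustration of the technique; names do not match the actual `hermes_cli/auth.py` helpers):

```python
# Create the temp file atomically at 0o600 (O_EXCL guarantees we created it,
# and the mode applies at creation -- no world-readable window before a
# later chmod), fsync, then rename over the target.
import json, os, secrets
from pathlib import Path

def write_private_json(path: Path, payload: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    os.chmod(path.parent, 0o700)   # siblings can't traverse to the creds
    tmp = path.with_name(f"{path.name}.{os.getpid()}.{secrets.token_hex(4)}.tmp")
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)      # atomic swap keeps readers consistent
    finally:
        if tmp.exists():
            tmp.unlink()           # clean up only if the swap didn't happen
```

The per-process random suffix avoids collisions between concurrent writers and stale leftovers from a crashed prior write.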
* fix(delegate): correct ACP docs — Claude Code CLI has no --acp flag
The delegate_task tool schema descriptions referenced 'claude --acp --stdio'
as an example, but Claude Code CLI does not support --acp or --stdio flags.
The ACP subprocess transport (agent/copilot_acp_client.py) is specifically
built for GitHub Copilot CLI ('copilot --acp --stdio').
Changes:
- Per-task acp_command example: 'claude' → 'copilot'
- Top-level acp_command description: remove 'Claude Code' reference,
clarify requirement for ACP-compatible CLI (currently Copilot only)
- acp_args description: remove misleading claude-opus-4-6 example
Fixes #19055
* fix: exclude hidden and archive dirs from _find_skill rglob
* fix(gateway): preserve thread routing from cached live session sources
* fix(gateway): cap cached session sources with LRU eviction
Follow-up on top of Zyproth's session-source cache: swap the unbounded
dict for an OrderedDict with a 512-entry LRU cap so long-running
gateways can't accumulate stale entries for dead sessions forever.
- self._session_sources is now an OrderedDict
- _cache_session_source() move_to_end + popitem(last=False) above cap
- _get_cached_session_source() move_to_end on hit (LRU read bump)
- restart_test_helpers.py wires OrderedDict + _session_sources_max
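The OrderedDict-based cap works roughly like this (method names follow the commit text; the class shape is illustrative):

```python
# 512-entry LRU over OrderedDict: move_to_end on write and on read hit,
# popitem(last=False) evicts the least-recently-used entry above the cap.
from collections import OrderedDict

class SessionSourceCache:
    def __init__(self, max_entries: int = 512):
        self._sources = OrderedDict()
        self._max = max_entries

    def cache(self, session_id, source) -> None:
        self._sources[session_id] = source
        self._sources.move_to_end(session_id)
        while len(self._sources) > self._max:
            self._sources.popitem(last=False)      # evict LRU entry

    def get(self, session_id):
        src = self._sources.get(session_id)
        if src is not None:
            self._sources.move_to_end(session_id)  # LRU read bump
        return src

c = SessionSourceCache(max_entries=2)
c.cache("a", 1); c.cache("b", 2)
assert c.get("a") == 1         # bump "a"
c.cache("c", 3)                # evicts "b", the least recently used
assert c.get("b") is None and c.get("a") == 1 and c.get("c") == 3
```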
* fix(mcp): give 'mcp add --command' a distinct argparse dest
The --command flag of `hermes mcp add` shared its argparse dest with the
top-level subparser (`dest="command"` in `hermes_cli/_parser.py`). When
the flag was omitted, argparse still wrote `args.command = None`,
clobbering the top-level value of `"mcp"`. The dispatcher then saw
`args.command is None` and fell through to interactive chat, so
`hermes mcp add ...` silently launched chat instead of registering the
server. `cmd_mcp_add` was never reached.
Use `dest="mcp_command"` on the flag and read it from `cmd_mcp_add`.
The user-facing CLI flag `--command` is unchanged; only the in-memory
namespace attribute moves. Also updates the `_make_args` helper in
`tests/hermes_cli/test_mcp_config.py` to populate the new dest, and
adds `tests/hermes_cli/test_mcp_add_command_dest.py` with a parser-
level regression test.
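The fix reduces to giving the flag a dest that cannot collide with the subparser's (a minimal sketch; the real parser layout in `hermes_cli/_parser.py` is larger):

```python
# With dest="command" on the flag (argparse's default for --command), the
# subparser's None default was copied back over the top-level "add".
# A distinct dest keeps the subcommand name intact.
import argparse

parser = argparse.ArgumentParser(prog="hermes")
sub = parser.add_subparsers(dest="command")
mcp_add = sub.add_parser("add")
mcp_add.add_argument("--command", dest="mcp_command")

args = parser.parse_args(["add"])
assert args.command == "add"        # dispatcher sees the subcommand
assert args.mcp_command is None     # flag omitted, nothing clobbered
```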
Closes #19785.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore: add discodirector email to AUTHOR_MAP
* fix(bedrock): preserve reasoningContent across converse normalization
* feat(gateway): support [[as_document]] directive for skill media routing
Skills that produce large/lossless images (e.g. info-graph, where a
rendered JPG is 1-2 MB) currently lose quality in Telegram delivery
because `_IMAGE_EXTS` membership routes the file through
`send_multiple_images` → `sendMediaGroup`, which Telegram's server
re-encodes to JPEG @ 1280px max edge. The original bytes only survive
when the file goes through `send_document`, which the dispatch tables
in three places (`_process_message_background`, `_deliver_media_from_response`,
and the `send_message` tool's telegram path) only reach for files
whose extension is NOT in `_IMAGE_EXTS`.
This commit adds an `[[as_document]]` directive that mirrors the
existing `[[audio_as_voice]]` shape: a skill emits the directive once
in its response, and every image-extension MEDIA: file in that response
is delivered via `send_document` instead of `send_multiple_images` /
`sendPhoto`. The directive is detected at the dispatch sites (which see
the raw response) and the directive string is stripped from the
user-visible cleaned text in `extract_media` so it never leaks.
Granularity is intentionally all-or-nothing per response, matching
[[audio_as_voice]]'s scope. Skills that need fine control can split into
two responses.
Verified the targeted use case: info-graph emits
信息图已生成(...) ("Infographic generated (...)")
[[as_document]]
MEDIA:/tmp/info-graph-x/infographic.jpg
→ Telegram receives `infographic.jpg` via sendDocument, original 1MB
JPEG bytes preserved, no recompression. Forwarding and download
filenames stay clean (`infographic.jpg`).
Tests: +3 cases in TestExtractMedia covering directive strip, isolation
from voice flag, and coexistence with [[audio_as_voice]]. All
113 pre-existing media/extract/send tests pass.
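Directive detection and stripping might be sketched like this (a simplified stand-in for the real `extract_media`):

```python
# Detect [[as_document]] in the raw response, collect MEDIA: paths, and
# strip the directive from the user-visible cleaned text so it never leaks.
import re

DIRECTIVE = "[[as_document]]"

def extract_media(response: str):
    as_document = DIRECTIVE in response
    media = re.findall(r"^MEDIA:(\S+)$", response, flags=re.MULTILINE)
    cleaned = "\n".join(
        line for line in response.splitlines()
        if line.strip() != DIRECTIVE and not line.startswith("MEDIA:")
    ).strip()
    return cleaned, media, as_document

text, media, as_doc = extract_media(
    "Infographic generated\n[[as_document]]\nMEDIA:/tmp/x/infographic.jpg"
)
assert as_doc and media == ["/tmp/x/infographic.jpg"]
assert "[[as_document]]" not in text   # directive never reaches the user
```

Dispatch sites then route every image-extension MEDIA: file through `send_document` when `as_document` is set, all-or-nothing per response.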
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: update send_message_tool mocks for force_document kwarg
* chore: AUTHOR_MAP entry for @leon7609
* fix(model_switch): live model discovery for custom_providers in /model picker
custom_providers entries (section 4 of list_authenticated_providers) only
read the static models: dict from config.yaml, ignoring the live /v1/models
endpoint. This means gateways like Bifrost that expose hundreds of models
only show the handful explicitly listed in config.
Add live discovery via fetch_api_models() for custom_providers entries
that have api_key + base_url, matching the existing behavior for user
providers: entries (section 3). When the endpoint is reachable and
returns models, the live list replaces the static subset.
Fixes: /model picker showing only 9 models from a Bifrost gateway that
actually exposes 581.
* fix(memory): support OpenViking local resource uploads
* test(memory): harden OpenViking local upload coverage
* fix(memory): harden OpenViking local path uploads
* chore(release): add AUTHOR_MAP entries for ggnnggez and ehz0ah
Contributors to OpenViking local resource upload fix (#19569).
* docs(readme): drop misleading RL install-extras claim, defer to CONTRIBUTING
README.md:163 said atroposlib and tinker were pulled in by .[all,dev], but
.[all] does not include .[rl] — those dependencies live in pyproject.toml's
[rl] extra (lines 95-101). With the original wording, a contributor running
uv pip install -e ".[all,dev]" would not have atroposlib or tinker
installed.
Rather than swap one extra for another (which would point users at one of two
parallel install conventions — pip [rl] extra vs tinker-atropos submodule —
without saying which the project considers canonical), this PR drops the
specific install command from the README and links to CONTRIBUTING.md,
which already documents the actual development setup.
* fix(kanban): auto-block workers that exit without completing (#20894) (#21214)
When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.
Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
reaped child's raw exit status in a bounded module registry
(_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
is found dead. clean_exit → protocol_violation event + immediate
circuit-breaker trip (failure_limit=1). Everything else keeps the
existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.
Gateway fix (Bug A in #20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
self.adapters with list(...) before iterating. adapter.send() can
hit a fatal-error path that pops the adapter from the dict, which
was raising 'RuntimeError: dictionary changed size during iteration'
during shutdown.
Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
rc=0 + still-running → status=blocked on first occurrence with
protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
non-zero exits keep the existing 2-strike behavior.
Closes #20894.
* fix(dashboard): stabilize embedded chat resume and scrollback
* fix(dashboard): let embedded chat use a single scroll system
* fix(dashboard): route browser wheel into inner TUI scrolling
* chore: AUTHOR_MAP entry for @nouseman666
* fix(cli): honor positive tool preview length
* chore: AUTHOR_MAP entry for @GinWU05
* fix(credential_pool): resolve key mix-up when custom providers share base_url
When multiple custom_providers share the same base_url but have different API keys,
get_custom_provider_pool_key() always returned the first match, causing wrong-key
unauthorized errors. Add provider_name parameter to prefer exact name matches
over base_url-only matching, with fallback for backward compatibility.
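The name-first lookup might look like this (structure assumed; the real function is `get_custom_provider_pool_key`):

```python
# Prefer an exact provider-name match; fall back to base_url-only matching
# for callers that predate the new parameter.
def get_pool_key(providers: dict, base_url: str, provider_name=None):
    if provider_name and provider_name in providers:
        return providers[provider_name]["api_key"]   # exact name match wins
    for cfg in providers.values():                   # legacy fallback
        if cfg["base_url"] == base_url:
            return cfg["api_key"]
    return None

providers = {
    "bifrost-a": {"base_url": "http://gw:8080/v1", "api_key": "key-a"},
    "bifrost-b": {"base_url": "http://gw:8080/v1", "api_key": "key-b"},
}
assert get_pool_key(providers, "http://gw:8080/v1", "bifrost-b") == "key-b"
assert get_pool_key(providers, "http://gw:8080/v1") == "key-a"   # old behavior
```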
Fixes #19083
* feat(cli): show context compression count in status bar
Display the number of context compressions in the CLI status bar when
compressions > 0, helping users understand conversation compression
pressure during long sessions.
- Wide layout (>=76 cols): shows 'cmp N' between context percent and duration
- Medium layout (52-75 cols): shows 'cmp N' between percent and duration
- Narrow layout (<52 cols): omitted to save space
- Color-coded: dim for 1-4, warn for 5-9, bad for 10+
- Hidden when zero to keep the bar clean for new sessions
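The layout and color rules reduce to a small pure function (a sketch; tier names and the return format are illustrative, not the actual status-bar renderer):

```python
# Compression segment: hidden when zero or in narrow layouts, color-tiered
# by pressure (dim 1-4, warn 5-9, bad 10+).
def compression_segment(count: int, width: int):
    if count == 0 or width < 52:
        return None
    tier = "dim" if count < 5 else "warn" if count < 10 else "bad"
    return f"[{tier}]cmp {count}"

assert compression_segment(0, 100) is None      # clean bar for new sessions
assert compression_segment(3, 100) == "[dim]cmp 3"
assert compression_segment(12, 60) == "[bad]cmp 12"
assert compression_segment(7, 40) is None       # narrow layout omits it
```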
Closes #18564
* refactor: replace 'cmp' text with 🗜️ emoji in status bar
Address review feedback to use the clamp emoji (🗜️) instead of
the plain text 'cmp' prefix for the compression count indicator.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat(tui): surface compression count in Ink status bar
Parity with the classic CLI status bar (PR #18579). The Python backend
already exposes 'compressions' on SessionUsageResponse; this wires it
through the Ink Usage type and renders 'cmp N' next to the duration
segment of StatusRule.
- types.ts Usage: add optional compressions field
- appChrome.tsx StatusRule: render 'cmp N' when > 0, color-tiered by
pressure (muted <5, warn 5-9, error 10+)
- Plain text 'cmp' token (no emoji) matches PR #18579's original author
rationale and avoids Ink layout drift from VS16 emoji width
* chore(release): map altriatree@gmail.com -> @TruaShamu
* fix(curator): make manual runs synchronous
* docs(curator): update CLI docs for synchronous-by-default manual run
Follow-up to the previous commit which flipped 'hermes curator run'
default from async to sync. Updates the curator.md feature page and
cli-commands.md reference to show --background as the opt-in async
flag and note that the default now blocks until the LLM pass finishes.
* fix(install): remove uv exclude-newer cutoff
* docs: clarify API server tool execution locality
* fix(kanban): treat dashboard event-stream cancellation as normal shutdown
Stopping `hermes dashboard` with Ctrl-C while the Kanban dashboard is
open prints an ASGI traceback ending in
`plugins/kanban/dashboard/plugin_api.py::stream_events` at the
`asyncio.sleep(_EVENT_POLL_SECONDS)` line. This is a normal shutdown
path: Uvicorn cancels the open websocket task while it is sleeping in
the 300 ms poll loop. `asyncio.CancelledError` is a `BaseException` in
Python 3.8+ — the bare `except Exception:` handler below the existing
`WebSocketDisconnect:` clause does NOT catch it, so the cancellation
surfaces as an application traceback and routine dashboard exit looks
like a runtime failure.
Add an explicit `except asyncio.CancelledError: return` clause beside
the existing `WebSocketDisconnect` handler. Disconnection (client
closed the tab) and shutdown cancellation (dashboard process exiting)
are conceptually different paths but both warrant a quiet return; the
two clauses are kept separate to keep that intent explicit.
`asyncio` is already imported and used in this scope, so no new
import is needed. The bare `except Exception:` handler is preserved
verbatim, so genuine runtime failures still log a warning and close
the socket cleanly.
Closes #20790.
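The fix above can be sketched as a minimal poll loop; `send`, `get_events`, and the interval are stand-ins for the real plugin code. The key point is the explicit `except asyncio.CancelledError: return` beside the broad handler, since the bare `except Exception:` cannot catch a `BaseException`.

```python
import asyncio

_EVENT_POLL_SECONDS = 0.3  # assumed poll interval

async def stream_events(send, get_events):
    """Minimal sketch of the event-stream loop with the shutdown clause."""
    try:
        while True:
            for ev in get_events():
                await send(ev)
            await asyncio.sleep(_EVENT_POLL_SECONDS)
    except asyncio.CancelledError:
        # Normal shutdown: Uvicorn cancels the task mid-sleep; return quietly.
        return
    except Exception:
        # Genuine runtime failure: log a warning and close the socket
        # (elided in this sketch).
        raise
```

Because the coroutine suppresses the cancellation and returns, awaiting the cancelled task completes cleanly instead of surfacing a traceback.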
* chore(release): map SandroHub013 email
* test(kanban): regression for CancelledError swallow in stream_events
Drives stream_events directly and cancels the task while it is sleeping
in the poll loop, asserting the coroutine returns cleanly instead of
letting CancelledError bubble. Regression coverage for the Uvicorn
application traceback on dashboard Ctrl-C fixed by the preceding commit.
* fix(model_tools): log plugin hook exceptions instead of silently swallowing them
* feat(gateway): add `hermes gateway list` to show all profiles' gateway status
Add a new `hermes gateway list` subcommand that shows the running
status of gateways across all profiles in a single view:
Gateways:
✓ default (current) — PID 155469
✓ wx1 — PID 166893
✗ dev — not running
Also includes `_print_other_profiles_gateway_status()` which appends
an "Other profiles" section to `hermes gateway status` output when
other profile gateways are running.
Both use existing `list_profiles()` and `find_profile_gateway_processes()`
— no new dependencies.
Closes #19127
Related: #19113, #4402, #4587
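The rendering of that view can be sketched as below, assuming a pre-collected list of (profile name, PID-or-None) pairs; in the real command that list comes from `list_profiles()` and `find_profile_gateway_processes()`.

```python
def format_gateway_list(profiles, current):
    """Render a `hermes gateway list` style view (illustrative sketch).

    profiles: list of (profile_name, pid_or_None) tuples.
    """
    lines = ["Gateways:"]
    for name, pid in profiles:
        mark = "✓" if pid else "✗"
        label = f"{name} (current)" if name == current else name
        status = f"PID {pid}" if pid else "not running"
        lines.append(f"  {mark} {label} — {status}")
    return "\n".join(lines)
```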
* fix(mcp-oauth): persist OAuth server metadata across process restarts (#21226)
The MCP SDK discovers OAuth server metadata (token_endpoint, etc.) on
demand and keeps it in memory only. Without disk persistence, a restart
with valid cached refresh tokens forces the SDK to fall back to the
guessed '{server_url}/token' path — which returns 404 on most real
providers (Notion, Atlassian, GitHub remote MCP, etc.) and triggers a
full browser re-authorization even though the refresh token is fine.
Add a .meta.json file next to the existing tokens/client_info files:
HERMES_HOME/mcp-tokens/<server>.json -- tokens (existing)
HERMES_HOME/mcp-tokens/<server>.client.json -- client info (existing)
HERMES_HOME/mcp-tokens/<server>.meta.json -- oauth metadata (new)
Changes:
- HermesTokenStorage.save_oauth_metadata / load_oauth_metadata / _meta_path
— disk layer for the discovered OAuthMetadata.
- HermesTokenStorage.remove() now also clears .meta.json so
'hermes mcp remove <name>' and the manager's remove() path clean up fully.
- HermesMCPOAuthProvider._initialize cold-restores from disk before the
existing pre-flight discovery runs. If disk has metadata we skip the
discovery HTTP round-trips entirely.
- HermesMCPOAuthProvider._prefetch_oauth_metadata now persists ASM as
soon as it's discovered, so even the first pre-flight run seeds disk.
- HermesMCPOAuthProvider._persist_oauth_metadata_if_changed() is called
at the end of async_auth_flow so metadata discovered via the SDK's
lazy 401-branch (not pre-flight) is also saved for next time.
Tests cover the storage roundtrip (save/load/missing/corrupt/remove) and
the manager provider path (cold-load restore, skip-when-in-memory,
persist-on-discover, noop-when-unchanged, end-to-end async_auth_flow).
Co-authored-by: nocturnum91 <50326054+nocturnum91@users.noreply.github.com>
* feat: add SSE transport support for MCP client
Add support for MCP servers using the SSE transport protocol
(SseServerTransport) alongside the existing Streamable HTTP and stdio
transports. Many MCP servers use SSE (GET /sse + POST /messages/)
which was previously unsupported -- the client silently fell back to
Streamable HTTP, causing 10s connection timeouts.
Changes:
- Import mcp.client.sse.sse_client with graceful fallback
- Check config.get('transport') == 'sse' in _run_http() to select
the SSE transport path with proper timeout handling
- Read transport type from config in get_mcp_status() instead of
hardcoding 'http' for URL-based servers
- Update docstring, example config, and feature list
* fix(browser): enforce cloud-metadata SSRF floor in hybrid routing (#16234) (#21228)
Cloud metadata endpoints (169.254.169.254 etc.) are now always blocked
by browser_navigate regardless of hybrid routing, allow_private_urls,
or backend.
Bug: commit 42c076d3 (#16136) added hybrid routing that flips
auto_local_this_nav=True for private URLs and short-circuits
_is_safe_url(). IMDS endpoints are technically private (169.254/16
link-local), so the sidecar happily routed them to a local Chromium,
and the agent could read IAM credentials via browser_snapshot. On
EC2/GCP/Azure this is a full SSRF-to-credential-theft.
Fix: new is_always_blocked_url() in url_safety.py — a narrow floor
that checks _BLOCKED_HOSTNAMES, _ALWAYS_BLOCKED_IPS,
_ALWAYS_BLOCKED_NETWORKS only. Applied as an independent gate in
browser_navigate's pre-nav and post-redirect checks, BEFORE
auto_local_this_nav gets a chance to short-circuit. Ordinary private
URLs (localhost, 192.168.x, 10.x, .local, CGNAT) still route to the
local sidecar as the #16136 feature intends.
Secondary fix (reporter's finding): _url_is_private() now explicitly
checks 172.16.0.0/12. ipaddress.is_private only covers that range on
Python ≥3.11 (bpo-40791), so on 3.10 runtimes those URLs were routed
to cloud instead of the local sidecar. No security impact — just a
correctness fix for the hybrid-routing feature.
Closes #16234.
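The shape of that narrow floor can be sketched as follows. The constant lists here are illustrative and deliberately incomplete; the point is that this check runs as an independent gate before any hybrid-routing short-circuit, and matches only metadata targets, never ordinary private ranges.

```python
import ipaddress

# Illustrative subsets of the real url_safety.py lists.
_BLOCKED_HOSTNAMES = {"metadata.google.internal"}
_ALWAYS_BLOCKED_IPS = {"169.254.169.254"}                     # IMDS
_ALWAYS_BLOCKED_NETWORKS = [ipaddress.ip_network("169.254.0.0/16")]

def is_always_blocked_host(host: str) -> bool:
    """True for cloud-metadata targets, regardless of routing flags."""
    if host.lower() in _BLOCKED_HOSTNAMES:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; hostname list already checked
    if str(addr) in _ALWAYS_BLOCKED_IPS:
        return True
    return any(addr in net for net in _ALWAYS_BLOCKED_NETWORKS)
```

Ordinary private hosts (192.168.x, 10.x, localhost) fall through this gate and remain eligible for local-sidecar routing, as the #16136 feature intends.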
* fix: WhatsApp bridge process leak and disable config asymmetry
- Add PID file mechanism to track bridge processes and kill stale ones on startup
- Improve _kill_port_process() with lsof fallback when fuser is not available
- Support explicit WhatsApp disable via config.yaml (whatsapp.enabled: false)
- Respect WHATSAPP_ENABLED=false env var to disable WhatsApp
Fixes #19124
* docs(contributing): align tool discovery and test runner with AGENTS.md
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(kanban): make dashboard board pin authoritative over server current file (#21230)
When the user created a new board via the dashboard with "switch" checked,
the server-side `current` file was flipped to the new board. Clicking the
original board's tab then showed no cards even though the count badge read
correctly — the REST fetch dropped `?board=` when the selection was
"default" and the backend fell through to `current` (= the new board),
returning a different board's data than the tab the user clicked.
Fix:
- `withBoard()` always appends `?board=<slug>` when a board is selected,
including "default". The dashboard's tab selection becomes authoritative
instead of silently deferring to the server's `current` file.
- `writeSelectedBoard()` persists every selection (including "default")
to localStorage. Previously "default" was stripped, which meant the
next page load had nothing to pin to and fell through to `current`.
- Same change applied to the WebSocket query builder in `openWs()`.
Contract verified live:
current_board = "proj2"
GET /board → proj2's tasks (bug shape: falls through to current)
GET /board?board=default → default's tasks (fix: explicit pin wins)
GET /board?board=proj2 → proj2's tasks
Closes #20879.
* fix: avoid unsupported anthropic context beta by default
* Follow latest child session on dashboard resume
* fix(openviking): add Bearer auth header and omit empty/legacy tenant headers (#21232)
Authenticated remote OpenViking servers derive tenancy from the Bearer
key, but the client was always sending X-OpenViking-Account and
X-OpenViking-User — defaulted to the literal string "default" — which
overrode the key-derived tenant and broke auth.
- _headers(): skip X-OpenViking-Account/-User when blank or "default"
(treats the legacy default value as unset, so existing installs don't
need to touch their .env)
- _headers(): send Authorization: Bearer <key> alongside X-API-Key for
standard HTTP auth compatibility
- health(): include auth headers so /health works against servers that
require authentication
Tests cover bearer emission, legacy "default" suppression, empty
suppression, real tenant passthrough, and authenticated health checks.
Fixes the same user report as #20695 (from @ZaynJarvis); that PR could
not be merged because its branch was stale against main and would have
reverted recent OpenViking work (#15696, local resource uploads, summary
URI normalization, fs-stat pre-check).
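The header rules can be sketched as below; the function shape is illustrative of the described behavior, not the actual client code. Blank or legacy `"default"` tenant values are treated as unset so the server derives tenancy from the Bearer key.

```python
def build_headers(api_key: str, account: str = "", user: str = "") -> dict:
    """Sketch of the OpenViking client header rules (illustrative)."""
    headers = {
        "X-API-Key": api_key,
        # Standard HTTP auth alongside X-API-Key for compatibility.
        "Authorization": f"Bearer {api_key}",
    }
    if account and account != "default":
        headers["X-OpenViking-Account"] = account
    if user and user != "default":
        headers["X-OpenViking-User"] = user
    return headers
```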
* feat: add transform_llm_output plugin hook
Enables plugins to transform LLM output text after generation,
useful for vocabulary/personality transformation without burning
inference tokens.
Follows same pattern as transform_tool_result and transform_terminal_output:
- First non-empty string result wins
- Fail-open: exceptions logged as warnings, agent continues
- Signature: (response_text, session_id, model, platform)
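The dispatch semantics can be sketched as below, assuming hooks are plain callables taking the stated keyword arguments (the dispatcher name is illustrative): first non-empty string wins, exceptions fail open with a logged warning, and no match means the original text passes through.

```python
import logging

log = logging.getLogger(__name__)

def dispatch_transform_llm_output(hooks, response_text, session_id, model, platform):
    """First non-empty string result wins; raising hooks fail open."""
    for hook in hooks:
        try:
            result = hook(
                response_text=response_text,
                session_id=session_id,
                model=model,
                platform=platform,
            )
        except Exception:
            log.warning("transform_llm_output hook failed", exc_info=True)
            continue  # fail-open: agent continues with remaining hooks
        if isinstance(result, str) and result:
            return result  # first non-empty string wins
    return response_text  # no plugins / no transforms = no-op
```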
* test+docs: cover transform_llm_output hook + release author map
- tests/test_transform_llm_output_hook.py: dispatch semantics
(kwargs contract, first-non-empty-string-wins, empty-string
pass-through, raising-plugin fail-open, no-plugins = no-op)
- tests/hermes_cli/test_plugins.py: assert the new hook name is in
VALID_HOOKS alongside the other transform_* hooks
- website/docs/user-guide/features/hooks.md: summary-table entry +
full section mirroring transform_tool_result / transform_terminal_output
- scripts/release.py: map barnacleboy.jezzahehn@agentmail.to -> JezzaHehn
(existing entry only covers the gmail address)
* feat(curator): add `hermes curator list-archived` command (#21236)
Lists the skills sitting in ~/.hermes/skills/.archive/ so users have
something to pass to `hermes curator restore`. `curator status` already
shows counts; this fills the name-discovery gap.
Archive layout is flat (`archive_skill` writes to `.archive/<skill>/`),
so the directory name IS the skill name — no frontmatter parsing
needed. Timestamped collision directories (`<skill>-<ts>`) are listed
literally; user can still pass them to `restore`.
Reshape of @EvilDrag0n's #20651, simplified: drop the frontmatter
rglob + preamble/trailer output + duplicate subcommand registration.
Co-authored-by: EvilDrag0n <lxl694522264@gmail.com>
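Because the archive layout is flat, the listing reduces to a directory scan; a minimal sketch (function name and the skill names in the test are illustrative):

```python
from pathlib import Path

def list_archived_skills(archive_dir: Path):
    """List archived skills; each directory name IS the skill name."""
    if not archive_dir.is_dir():
        return []
    # Timestamped collision dirs (<skill>-<ts>) are listed literally.
    return sorted(p.name for p in archive_dir.iterdir() if p.is_dir())
```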
* fix: require memory schema fields by action
* fix(tui): refresh scroll height at cached bottom
* fix(gateway): preserve max turns after env reload
* fix(discord): scope DISCORD_ALLOWED_ROLES to originating guild (CVSS 8.1)
The initial DISCORD_ALLOWED_ROLES implementation (#11608, merged from #9873)
scans every mutual guild when resolving a user's roles. This allows a
cross-guild DM bypass:
1. Bot is in both public server A and private server B.
2. User holds the allowed role in server A only.
3. User DMs the bot. The role check finds the role in A and authorizes the
DM, granting access as if the user were trusted in server B.
Fix:
- DMs (no guild context) disable role-based auth by default. Opt-in via
DISCORD_DM_ROLE_AUTH_GUILD=<guild_id> restricts role lookup to one
explicitly-trusted guild.
- Guild messages check roles only in the originating guild
(message.guild), never in other mutual guilds.
- Reject cached author.roles when the Member came from a different guild
than the current message.
Backwards compatibility:
- DISCORD_ALLOWED_USERS behavior is unchanged (still works in both DMs
and guild messages).
- Deployments that rely on roles in guild channels continue to work;
role checks are now strictly scoped to that guild.
- Deployments that intentionally want role-based DM auth can opt into a
single trusted guild via DISCORD_DM_ROLE_AUTH_GUILD.
Tests: 9 new regression guards in
tests/gateway/test_discord_roles_dm_scope.py covering the bypass path,
the opt-in path, cross-guild guild-message bypass, and backwards-compat
user-ID paths. 47/47 discord-auth tests pass.
Refs: #11608 (initial implementation), #7871 (feature request),
#9873 (PR author credit @0xyg3n)
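The scoping rule can be sketched with plain data structures; the shapes here are illustrative, since the real code works with discord.py Member/Guild objects. Guild messages check only the originating guild, and DMs check only the explicitly opted-in guild, or nothing at all.

```python
def roles_authorized(allowed_roles, message_guild, member_roles_by_guild,
                     dm_role_auth_guild=None):
    """Sketch of guild-scoped role auth (data shapes illustrative).

    member_roles_by_guild: dict of guild_id -> set of role names.
    """
    if message_guild is not None:
        lookup_guild = message_guild        # originating guild only, never mutuals
    elif dm_role_auth_guild:
        lookup_guild = dm_role_auth_guild   # explicit DM opt-in to one trusted guild
    else:
        return False                        # DMs: role-based auth off by default
    roles = member_roles_by_guild.get(lookup_guild, set())
    return bool(roles & allowed_roles)
```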
* fix(discord): extend role-scope fix to slash surface + fixture update
Sibling-site fix: _evaluate_slash_authorization was the fourth
_is_allowed_user caller and didn't pass guild/is_dm through, so slash
interactions would take the DM branch regardless of whether they came
from a guild channel. Now reads interaction.guild + in_dm and forwards.
Also updates test_discord_slash_auth fixture (_make_interaction) so
the SimpleNamespace guild mock has a get_member(uid)->None method —
required by the new guild-scoped fallback path in _is_allowed_user.
Tests exercising positive role paths still work via user.roles.
Three new regression tests in test_discord_roles_dm_scope:
- Slash DM + role in mutual public guild → rejected
- Slash in guild B + role only in guild A → rejected
- Slash in guild B + role in guild B → allowed (positive control)
368 Discord tests pass. test_discord_free_channel_skips_auto_thread still
fails, but it also fails on clean main (pre-existing, unrelated to this fix).
* fix(discord): route DM role-auth opt-in through config.yaml (not env var)
Per repo policy, ~/.hermes/.env is for secrets only. Guild IDs are
behavioral configuration, not secrets. Replacing the
DISCORD_DM_ROLE_AUTH_GUILD env var from the original fix with
discord.dm_role_auth_guild in config.yaml.
- New module-level _read_dm_role_auth_guild() helper reads
hermes_cli.config.read_raw_config()['discord']['dm_role_auth_guild'].
Fails closed on any parse error (safe default = DM role-auth off).
- DEFAULT_CONFIG['discord'] gains dm_role_auth_guild: '' with a comment
documenting the opt-in.
- …
deestax added a commit to T3-Venture-Labs-Limited/hermes-agent that referenced this pull request (May 8, 2026)
…gin v1.0.0 docs (#26)
* fix(opencode-go): keep users on opencode-go instead of hijacking to native providers (#20802)
OpenCode Go and OpenCode Zen are flat-namespace model resellers — their /v1/models returns bare IDs (deepseek-v4-flash, minimax-m2.7), and the inference API rejects vendor-prefixed names with HTTP 401 'Model not supported'.
Two bugs fixed:
1. `switch_model` in hermes_cli/model_switch.py was silently switching the user off opencode-go to native deepseek when they typed `/model deepseek-v4-flash`. Step d found the model in opencode-go's live catalog, but step e (detect_provider_for_model) still ran and matched the bare name against deepseek's static catalog. Fix: track whether the live catalog resolved it; skip step e when it did.
2. `normalize_model_for_provider` in hermes_cli/model_normalize.py only stripped the exact `opencode-zen/` prefix, leaving arbitrary vendor prefixes like `minimax/minimax-m2.7` (commonly copied from aggregator slugs into fallback_model configs) intact — causing HTTP 401s when the fallback chain activated. Fix: opencode-go/opencode-zen strip ANY leading vendor prefix because their APIs are flat-namespace.
Tests: 11 new cases in tests/hermes_cli/test_opencode_go_flat_namespace.py covering both normalization (prefix stripping, regression guards for opencode-zen Claude hyphenation and openrouter vendor-prepending) and switch_model (bare-name resolution on opencode-go's live catalog must not trigger cross-provider hijack).
Reported by @Ufonik via Discord; Kimi K2.6 always worked because moonshotai has no overlapping entry in a native provider's static catalog. Deepseek and minimax failed because their v4/v2.7 names existed in the native deepseek/minimax catalogs.
* feat(dashboard): add 'default-large' built-in theme with 18px base size (#20820)
Same Hermes Teal palette as the default theme, but with baseSize 18px, lineHeight 1.65, and spacious density so the whole dashboard scales up.
Gives users a one-click bigger-text preset and a copyable reference for authoring custom YAML themes with their own typography settings.
* refactor(web): per-capability backend selection for search/extract split
Introduce the foundation for independently selecting web search and extract backends — enabling future combinations like SearXNG for search + Firecrawl for extract.
Architecture:
- tools/web_providers/base.py: WebSearchProvider and WebExtractProvider ABCs with normalized result contracts (mirrors CloudBrowserProvider)
- tools/web_tools.py: _get_search_backend() and _get_extract_backend() read per-capability config keys, fall through to shared web.backend
- hermes_cli/config.py: web.search_backend and web.extract_backend in DEFAULT_CONFIG (empty = inherit from web.backend)
Behavioral change:
- web_search_tool() now dispatches via _get_search_backend()
- web_extract_tool() now dispatches via _get_extract_backend()
- When per-capability keys are empty (default), behavior is identical to before — _get_search_backend() falls through to _get_backend()
This is purely structural — no new backends are added. SearXNG and other search-only/extract-only providers can now be added as simple drop-in modules in follow-up PRs. 12 new tests, 49 existing tests pass with zero regressions.
Ref: #19198
* feat(web): add SearXNG as a native search-only backend
Adds SearXNG as a free, self-hosted web search provider. SearXNG is a privacy-respecting metasearch engine that requires no API key — just a running instance and SEARXNG_URL pointing at it.
## What this adds
- `tools/web_providers/searxng.py` — `SearXNGSearchProvider` implementing `WebSearchProvider` (search only; no extract capability)
- `_is_backend_available("searxng")` — gates on SEARXNG_URL
- `_get_backend()` — accepts "searxng" as a configured value; adds it to auto-detect candidates (lower priority than paid services)
- `web_search_tool` — dispatches to SearXNG when it is the active backend
- `check_web_api_key()` — includes SearXNG in availability check
- `OPTIONAL_ENV_VARS["SEARXNG_URL"]` — registered with tools=["web_search"]
- `tools_config.py` — SearXNG appears in the `hermes tools` provider picker
- `nous_subscription.py` — `direct_searxng` detection, web_active / web_available
- `setup.py` — SEARXNG_URL listed in the missing-credential hint
- 23 tests covering: is_configured, happy-path search, score sorting, limit, HTTP/request errors, _is_backend_available, _get_backend, check_web_api_key
## Config
```yaml
# Use SearXNG for search, any paid provider for extract
web:
  search_backend: "searxng"
  extract_backend: "firecrawl"

# Or: SearXNG as the sole backend (web_extract will use the next available)
web:
  backend: "searxng"
```
SearXNG is search-only — it does not implement WebExtractProvider. Users who only configure SEARXNG_URL get web_search available; web_extract falls back to the next available extract provider (or is unavailable if none).
Closes #19198 (Phase 2 Task 4 — SearXNG provider)
Ref: #11562 (original SearXNG PR)
* docs+skill: add searxng-search optional skill and documentation
Closes the remaining gaps from PR #11562 that weren't covered by the core SearXNG integration landed in #20823.
- optional-skills/research/searxng-search/ — installable skill with SKILL.md (curl-based usage, category support, Python example) and searxng.sh helper script for health checks and instance queries
- website/docs/user-guide/configuration.md — SearXNG added to the Web Search Backends section (5 backends, backend table, per-capability split config example, correct search-only note)
- website/docs/reference/environment-variables.md — SEARXNG_URL row
- website/docs/reference/optional-skills-catalog.md — searxng-search entry
The core SearXNG code, OPTIONAL_ENV_VARS, hermes tools picker, and tests were already on main via #20823. This commit is purely additive docs + the optional skill scaffold.
Credits from #11562 salvage:
@w4rum — original _searxng_search structure
@nathansdev — tools_config.py integration
@moyomartin — category support and result formatting
@0xMihai — config/env var approach
@nicobailon — skill and documentation structure
@searxng-fan — error handling patterns
@local-first — self-hosted-first philosophy and docs
* docs: add Web Search + Extract feature page with SearXNG setup guide
* fix(feishu): keep topic replies in threads
Route Feishu topic progress, status, approval, stream, and fallback messages through threaded replies by preserving the originating message id as the reply target. Add regressions for tool progress topic metadata and Feishu metadata-driven reply routing.
* chore: follow-up cleanup for Feishu topic thread fix
- Remove dead metadata.get('reply_to') fallback in _send_raw_message; nothing in the codebase ever sets 'reply_to' inside a metadata dict — the key only appears as a top-level send_voice() keyword argument
- Simplify _status_thread_metadata construction in run.py to use a single dict literal instead of create-then-mutate pattern; the or-{} guard was dead since source.thread_id implies _progress_thread_id is also set for Feishu
- Add yuqian@zmetasoft.com to AUTHOR_MAP for contributor attribution
* fix(kanban): avoid fragile failure-column renames
* chore: follow-up cleanup for Kanban migration fix
- Expand migration comment to name the primary failure mode (missing column OperationalError from #20842) ahead of the secondary SQLite schema-reparse concern; also document the stale-cols-snapshot invariant
- Add clarifying comments on from_row() legacy fallback branches noting they are belt-and-suspenders dead code post-migration
- Add task_events comment in existing test explaining why the table is required by the migrator
- Add test_legacy_migration_no_legacy_columns_at_all: Scenario A — explicitly asserts the exact #20842 crash no longer occurs and that consecutive_failures defaults to 0 on a DB that never had spawn_failures
- Add test_legacy_migration_both_columns_already_present: Scenario D — asserts the migration is a no-op when both columns already exist, preserving the existing counter value
* fix(tui): bound virtual history offset searches
* ci(docker): don't cancel overlapping builds, guard :latest
Switch top-level concurrency to cancel-in-progress=false so every push to main gets its own SHA-tagged image published — no more discarded builds when commits land back-to-back. Guard the :latest tag with a second job that has its own concurrency group with cancel-in-progress=true plus a git-ancestor check against the revision label on the current :latest.
Together these guarantee :latest only ever moves forward in history: a slower run whose commit isn't a descendant of the current :latest refuses to clobber it, and a newer push mid-way through the move-latest job preempts the older one before it can retag.
- Every main push publishes nousresearch/hermes-agent:sha-<commit> with an org.opencontainers.image.revision label embedded.
- move-latest job reads that label off :latest, runs merge-base --is-ancestor, and only retags (via buildx imagetools create, registry-side, no rebuild) if our commit strictly descends.
- fetch-depth bumped to 1000 so merge-base has the history it needs.
- Release tag flow unchanged (unique tag, no race).
* docs(tool-gateway): rewrite as pitch-first marketing page (#20827)
Previous version read like internal API docs — leading with env var tables, config YAML, and 'precedence' rules before ever explaining the product. Complete rewrite inverts the structure so readers see value first, mechanics last.
Structure now:
- Lede: 'One subscription. Every tool built in.' + pitch paragraph
- CTA: subscribe/manage button styled as a real call-to-action
- What's included: emoji-led table with expanded descriptions per tool. Image gen lists all 9 models by name (FLUX 2 Klein/Pro, Z-Image Turbo, Nano Banana Pro, GPT Image 1.5/2, Ideogram V3, Recraft V4 Pro, Qwen)
- Why it's here: value bullets — one bill, one signup, one key, same quality, bring-your-own anytime
- Get started: two-command flow (hermes model → hermes status)
- Eligibility: paid-tier note with upgrade link
- Mix and match: three realistic usage patterns
- Using individual image models: ID reference table for power users
- --- separator ---
- Configuration reference (demoted): use_gateway flag, disabling, self-hosted gateway env vars moved below the fold where they belong
- FAQ: streamlined, removed redundant content
Fact-checked against code:
- 9 FAL models confirmed from tools/image_generation_tool.py FAL_MODELS
- Status section output verified against hermes_cli/status.py
- Portal subscription URL preserved
- Self-hosted env vars (TOOL_GATEWAY_DOMAIN etc.) kept accurate
Verified: docusaurus build SUCCESS, page renders, no new broken links.
* fix(auth): fall back to global-root auth.json for providers missing in profile
Profile processes (kanban workers, cron subprocesses, delegated subagents) read the profile's auth.json only. If a provider was authenticated at the global root but not inside the profile, the profile's credential_pool comes back empty and the process fails with 'No LLM provider configured' — even though the credentials are sitting in ~/.hermes/auth.json. #18594 propagated HERMES_HOME correctly, which is what surfaced this: workers now land in the right profile, and the profile turns out to shadow global with no fallback.
Semantics (read-only, per-provider shadowing):
- Profile has any entries for provider X → use profile only (global ignored).
- Profile has zero entries for provider X → fall back to global.
- Writes (write_credential_pool, _save_auth_store) still target the profile.
- Classic mode (HERMES_HOME == global root) skips the fallback entirely — _global_auth_file_path() returns None.
Also mirrors the fallback in get_provider_auth_state so OAuth singletons (nous, minimax-oauth, openai-codex, spotify) inherit cleanly — the Nous shared-token store (PR #19712) remains the authoritative path for Nous OAuth rotation, this just makes the read side consistent with it.
Seat belt: _load_global_auth_store() refuses to read the real user's ~/.hermes/auth.json under PYTEST_CURRENT_TEST even when HERMES_HOME points to a profile-shaped path. Guard uses $HOME (stable across fixtures) rather than Path.home() (which fixtures often monkeypatch to a tmp root).
Reported by @SeedsForbidden on Twitter as the credential_pool shadowing follow-up to the #18594 fix.
* feat(gateway): per-platform gateway_restart_notification flag
Adds an opt-out toggle on PlatformConfig that gates both restart lifecycle pings: the "♻ Gateway restarted" message sent to the chat that issued /restart, and the "♻️ Gateway online" home-channel startup notification. Defaults to True so existing deployments are unaffected. The motivating split is operator vs. end-user surfaces: a back-channel like Telegram should keep these pings, while a Slack workspace shared with end users should not surface gateway lifecycle noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(gateway): also gate pre-restart "Gateway restarting" notification
Extend the gateway_restart_notification flag to cover _notify_active_sessions_of_shutdown — the message that fires just before drain ("⚠️ Gateway restarting — Your current task will be interrupted. Send any message after restart and I'll try to resume where you left off.") sent to active sessions and home channels. Same operator/end-user reasoning: on a Slack workspace shared with end users, "Gateway restarting" reads as "the bot is broken" — the operator should be able to suppress it consistently with the other two lifecycle pings rather than having a partial opt-out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: add guillaumemeyer to AUTHOR_MAP
For cherry-picked commits in PR #20801.
* fix(cli): submit LF enter in thin PTYs (#20896)
* fix(tui): refresh virtual offsets after row resize (#20898)
* fix(tui): honor skin highlight colors (#20895)
* fix(tui): steady transcript scrollbar (#20917)
* fix(tui): steady transcript scrollbar
Keep the visible scrollbar tied to committed viewport position while virtual history can still prefetch against pending scroll targets, and preserve drag grab offset synchronously for native-feeling scrollbar drags.
* fix(tui): smooth precision wheel scroll
Replace the opt-scroll throttle with frame-sized coalescing so modifier wheel gestures stay line-precise without stepping.
* fix(tui): restore voice push-to-talk parity (#20897)
* fix(tui): restore classic CLI voice push-to-talk parity
(cherry picked from commit 93b9ae301bb89f5b5e01b4b9f8ac91ffa74fbd9d)
* fix(tui): harden voice push-to-talk stop flow
Address review feedback from PR #16189 by stopping the active recorder before background transcription, documenting single-shot voice capture, and covering the TUI gateway flags with regression tests.
* fix(tui): preserve silent voice strike tracking
Keep single-shot voice recording's no-speech counter alive across starts so the TUI can still emit the three-strikes auto-disable event, and bind the auto-restart state at module scope for type checking.
* fix(tui): clean up voice stop failure path
Address follow-up review by naming the TUI flow as single-shot push-to-talk and cancelling the recorder when forced stop cannot produce a WAV.
* fix(tui): report busy voice capture starts
Return explicit start state from the voice wrapper so the TUI gateway does not report recording while forced-stop transcription is still cleaning up.
* fix(tui): handle busy voice record responses Apply the gateway busy status immediately in the TUI and route forced-stop voice events to the session that sent the stop request. * fix(tui): clear voice recording on null response Treat a null voice.record RPC result as a failed optimistic start so the REC badge cannot stick after gateway-side errors. * fix(tui): count silent manual voice stops Preserve single-shot voice no-speech strikes through forced stop transcription so empty push-to-talk captures still trigger the three-strikes guard. --------- Co-authored-by: Montbra <montbra@gmail.com> * fix(gateway): don't dead-end setup wizard when only system-scope unit is installed The setup wizard dropped non-root users at a bare shell prompt when trying to start a system-scope gateway service. Previously _require_root_for_system_service called sys.exit(1), which the wizard's `except Exception` guards cannot catch (SystemExit is a BaseException). Users with a pre-existing /etc/systemd/system unit (e.g. from an earlier `sudo hermes setup` run) hit this whenever they re-ran `hermes setup` as a regular user. - Convert _require_root_for_system_service to raise a typed SystemScopeRequiresRootError (RuntimeError subclass) instead of sys.exit(1). The direct CLI path (`hermes gateway install|start|stop| restart|uninstall` without sudo) still exits 1 cleanly via a new catch at the top of gateway_command, matching the existing UserSystemdUnavailableError pattern. - Add _system_scope_wizard_would_need_root() pre-check and _print_system_scope_remediation() helper. Both setup wizards (hermes_cli/setup.py and hermes_cli/gateway.py::gateway_setup) now detect the dead-end before prompting and print actionable guidance: either `sudo systemctl start <service>` this time, or uninstall the system unit and install a per-user one. 
- Defense-in-depth: all 5 wizard prompt sites also catch SystemScopeRequiresRootError and fall back to the remediation helper if the pre-check is bypassed (race, etc.). Tests: 12 new tests in TestSystemScopeRequiresRootError, TestSystemScopeWizardPreCheck, TestSystemScopeRemediationOutput, and TestGatewayCommandCatchesSystemScopeError covering the exception contract, pre-check matrix (root vs non-root, system-only vs user-present vs none vs explicit system=True), remediation output for each action, and the direct-CLI exit-1 path. * fix(tui): preserve session when switching personality Previously, /personality in the TUI called _reset_session_agent() which destroyed the agent, cleared conversation history, and effectively started a new session. This made personality switching disruptive — users lost their entire conversation context. Now /personality updates the agent's ephemeral_system_prompt in-place and injects a pivot marker into the conversation history. The marker tells the model to adopt the new persona from that point forward, which is necessary because LLMs tend to pattern-match their prior responses and continue the established tone without an explicit signal. Changes: - tui_gateway/server.py: Rewrite _apply_personality_to_session to update the agent in-place instead of resetting. Inject a user-role pivot marker so the model actually switches style mid-conversation. - ui-tui/src/app/slash/commands/session.ts: Update help text (no longer mentions history reset). - tests/test_tui_gateway_server.py: Update test to verify history is preserved, pivot marker is injected, and ephemeral prompt is set. * fix(gateway): wait for systemd restart readiness * fix(discord): narrow rate-limit catch and move sync state under gateway/ Two follow-ups on top of helix4u's slash-command sync hardening: - Only suppress exceptions that are actually Discord 429 rate limits (discord.RateLimited, HTTPException with status 429, or a clearly rate-limit-named duck type). 
Arbitrary failures that happen to expose a retry_after attribute now re-raise to the outer handler instead of silently swallowing a cooldown. - Move the sync-state JSON under $HERMES_HOME/gateway/ so the home root stops collecting ad-hoc runtime files. Added a test verifying unrelated exceptions don't get misclassified as rate limits. * docs(kanban): fix orchestrator skill setup instructions (#20958) * docs(kanban): fix worker skill setup instructions too (#20960) Follow-up to #20958. The worker skill section had the same stale 'hermes skills install devops/kanban-worker' command — kanban-worker is also bundled, so that command fails with 'Could not fetch from any source.' Replace with bundled-skill verification + restore pattern, matching the orchestrator section. Uses <your-worker-profile> placeholder since assignees vary (researcher, writer, ops, linguist, reviewer, etc.) rather than a single fixed 'worker' profile. * feat(profiles): --no-skills flag for empty profile creation (#20986) Adds `hermes profile create <name> --no-skills` to create a profile with zero bundled skills. Writes a `.no-bundled-skills` marker file in the profile root so `hermes update`'s all-profile skill sync loop also skips the profile — without the marker, every update would re-seed skills and the user would have to delete them again. Use case (from @hiut1u): orchestrator profiles and narrow-task profiles don't need 100+ bundled skills polluting their system prompt. - create_profile() gains a `no_skills` param, mutually exclusive with `--clone` / `--clone-all` (cloning explicitly copies skills). - seed_profile_skills() no-ops on opted-out profiles and returns `{skipped_opt_out: True}` so callers can report cleanly. - Web API (POST /api/profiles) accepts `no_skills: bool`. - Delete `.no-bundled-skills` to opt back in — next `hermes update` re-seeds normally. 
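The `.no-bundled-skills` opt-out described above can be sketched as a marker-file check. This is a minimal illustration, not the real `create_profile` / `seed_profile_skills` implementations; the marker name is taken from the commit, everything else is simplified.

```python
from pathlib import Path

NO_SKILLS_MARKER = ".no-bundled-skills"  # marker name from the commit above

def create_profile(root: Path, name: str, no_skills: bool = False) -> Path:
    """Create a profile directory, optionally opting out of bundled skills."""
    profile = root / name
    profile.mkdir(parents=True, exist_ok=True)
    if no_skills:
        (profile / NO_SKILLS_MARKER).touch()
    return profile

def seed_profile_skills(profile: Path) -> dict:
    """No-op on opted-out profiles so `hermes update` cannot re-seed them."""
    if (profile / NO_SKILLS_MARKER).exists():
        return {"skipped_opt_out": True}
    # ... real seeding of bundled skills would happen here ...
    return {"skipped_opt_out": False}
```

Deleting the marker opts the profile back in, so the next seeding pass runs normally.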
6 new tests in TestNoSkillsOptOut cover marker write, mutual exclusion with clone, seed_profile_skills opt-out, fresh profile unaffected, and delete-marker-re-enables-seeding. * fix: route Telegram image documents through photo handling * chore: AUTHOR_MAP entry for mrcoferland * test(docker): align Dockerfile contract tests with simplified TUI flow The Dockerfile dropped the manual `@hermes/ink` materialisation gymnastics in favour of letting npm workspaces resolve the bundled package naturally. Two contract tests still asserted the older flow: `test_dockerfile_installs_tui_dependencies` required: 'ui-tui/packages/hermes-ink/package-lock.json' in dockerfile_text …but the lockfile is no longer COPIED individually — the entire `ui-tui/packages/hermes-ink/` tree is COPIED instead (the workspace reference from `ui-tui/package.json` is `file:` so npm needs the real source, not just a manifest stub). `test_dockerfile_materializes_local_tui_ink_package` required a 7-clause conjunction matching specific `rm -rf` / `npm install --omit=dev` `--prefix node_modules/@hermes/ink` / `rm -rf .../react` invocations that were stripped out when the workspace resolution was simplified. Update the assertions to pin the *contract* the image actually has to carry rather than the *exact shell incantations* the old flow used: * TUI deps install: ui-tui/package.json + ui-tui/package-lock.json + ui-tui/packages/hermes-ink/ tree are all COPIED, and an npm install/ci step runs in ui-tui. * Bundled hermes-ink: the workspace package source is COPIED (so `await import('@hermes/ink')` resolves at runtime). This keeps the spirit of #15012 / #16690 (zombie reaping + bundled workspace materialisation must continue to work) without locking the Dockerfile into one specific implementation flavour. Validation: $ pytest tests/tools/test_dockerfile_pid1_reaping.py -q 6 passed in 1.43s No production code change.
Fixes the two failures observed on `main` (run 25250051126): `tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_installs_tui_dependencies` `tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_materializes_local_tui_ink_package` * test(update): patch isatty on real streams to fix xdist-flaky --yes tests Two CI tests for the new `--yes` update flag (#18261) flaked under `pytest-xdist` on Linux/Python 3.11 even though they passed every local run on macOS/Python 3.14.4: FAILED tests/hermes_cli/test_update_yes_flag.py ::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty `AssertionError: assert <MagicMock 'input'>.called is False` FAILED tests/hermes_cli/test_update_yes_flag.py ::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting `AssertionError: assert <MagicMock '_restore_stashed_changes'>.called is False` Captured stdout for the first failure shows `cmd_update` taking the "Non-interactive session — skipping config migration prompt." branch — i.e. the `sys.stdin.isatty() and sys.stdout.isatty()` check at `hermes_cli/main.py:7118` evaluated to `False` despite the test doing: with patch("hermes_cli.main.sys") as mock_sys: mock_sys.stdin.isatty.return_value = True mock_sys.stdout.isatty.return_value = True The whole-module mock is fragile under xdist worker reuse: a sibling test that imports `hermes_cli.main` first can leave another `sys` reference resolved inside the function (re-import in a helper, etc.), and the wholesale module replacement never gets consulted. Switch to `patch.object(_sys.stdin, "isatty", return_value=True)` (and the same for `stdout`). That patches the *attribute on the real stream object* — every call site, no matter how it reached `sys.stdin`, hits the patched method. Same fix applied to the stash-restore test (it took the "non-TTY → skip restore prompt" branch for the same reason).
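The patch-the-real-stream idea above can be sketched in isolation. `is_interactive` is a stand-in for the TTY check in `cmd_update`, not the real function; the point is that `patch.object` on the actual stream object is hit by every call site, however it resolved `sys`.

```python
import sys
from unittest.mock import patch

def is_interactive() -> bool:
    # stand-in for the cmd_update TTY check described above
    return sys.stdin.isatty() and sys.stdout.isatty()

# patch("some_module.sys") only swaps one module's reference to `sys`;
# patching the attribute on the real stream object catches every caller.
with patch.object(sys.stdin, "isatty", return_value=True), \
     patch.object(sys.stdout, "isatty", return_value=True):
    interactive_under_patch = is_interactive()
```

Outside the `with` block the real `isatty` methods are restored, so the patch cannot leak across xdist workers.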
Validation: $ pytest tests/hermes_cli/test_update_yes_flag.py -q 3 passed in 5.47s No production code change. Fixes the two failures observed on `main` (run 25250051126): `tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty` `tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting` Refs: #18261 (added the `--yes` flag + these tests). * fix(web): force light color-scheme on docs iframe The Documentation tab embeds the public Hermes Agent docs site via an <iframe>. On any system where the browser's prefers-color-scheme resolves to dark — the default on macOS with system dark mode, and common on Linux/Windows too — the docs body text rendered nearly invisible against its own background. Cause: Docusaurus intentionally leaves <html> and <body> transparent and relies on the browser's Canvas color to fill the viewport. Inside our iframe, the iframe element had bg-background (the dashboard's dark canvas) AND inherited the dashboard's dark color-scheme, so the browser set the iframe's Canvas to a dark value. Docusaurus's transparent body exposed that dark Canvas, and the docs body text (tuned for a light Canvas) became near-illegible. Affects every built-in dashboard theme. Fix: replace bg-background on the iframe with [color-scheme:light] (spec-blessed cross-origin override of the inherited color-scheme; forces the iframe's Canvas to light) and bg-white (belt-and-suspenders fallback during the brief paint window before content loads). The docs site's own theme toggle keeps working — Docusaurus stores its choice in localStorage and applies opaque dark backgrounds to its layout elements that cover the white Canvas we forced. 
* fix(security): close TOCTOU window when saving MCP OAuth credentials _write_json (the persistence helper used by HermesTokenStorage for both tokens and client_info) created the temp file via Path.write_text and only chmod'd it to 0o600 afterward. Between create and chmod the file existed on disk at the process umask (commonly 0o644 = world-readable), briefly exposing MCP OAuth access/refresh tokens to other local users. Use os.open with O_WRONLY|O_CREAT|O_EXCL and an explicit S_IRUSR|S_IWUSR mode so the file is created atomically at 0o600, plus tighten the parent dir to 0o700 so siblings can't traverse to the creds file. The temp name also gains a per-process random suffix to avoid collisions between concurrent writers and stale leftovers from a crashed prior write. Mirrors the fix shipped for agent/google_oauth.py in #19673. Adds a regression test asserting the resulting file mode is 0o600 and the parent directory is 0o700 (skipped on Windows where POSIX mode bits aren't enforced). * chore(release): add Gutslabs to AUTHOR_MAP for PR #21148 salvage * test(update): teach restart-mocks about the post-update survivor sweep Issue #17648 added a post-update SIGTERM-survivor sweep to `cmd_update`: ~3s after issuing graceful/SIGTERM restarts, the code re-queries `find_gateway_pids` and SIGKILLs anything still alive. That's the right fix for stuck-drain gateways in production, but it broke three unit tests that assumed `find_gateway_pids` would keep returning the same PIDs forever: FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways AssertionError: Expected 'kill' to not have been called. Called 1 times. Calls: [call(12345, <Signals.SIGKILL: 9>)]. FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm AssertionError: Expected 'kill' to have been called once. Called 2 times. Calls: [call(12345, SIGTERM), call(12345, SIGKILL)]. 
FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid assert 2 == 1 manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)] In each test `os.kill` is mocked, so the simulated PID never actually exits — the sweep finds it again and escalates. The production code is correct; the tests just need to model OS behaviour properly. Two-test fix (profile-manual restart cases): use `side_effect=[[12345], []]` so the first `find_gateway_pids` call returns the live PID and the second (the sweep) returns nothing, as if the OS had reaped the process. Service-PID-exclusion fix: track which PIDs got killed in a closure set, and exclude them on subsequent `fake_find` calls. `os.kill` gets a `side_effect` that records the kill instead of swallowing it silently. Now the sweep doesn't re-find the manual PID, no SIGKILL escalation, `manual_kills == 1`. Validation: $ pytest tests/hermes_cli/test_update_gateway_restart.py -q 43 passed in 4.13s No production code change. Fixes the three failures observed on `main` (run 25250051126): test_update_restarts_profile_manual_gateways test_update_profile_manual_gateway_falls_back_to_sigterm test_update_kills_manual_pid_but_not_service_pid Refs: #17648 (post-update survivor sweep that the tests didn't model). * fix(image-routing): expose attached image paths in native multimodal text part In native image mode (vision-capable models like gpt-4o, claude-sonnet-4), build_native_content_parts() previously emitted only the user's caption plus image_url parts. The local file path of each attached image never appeared in the conversation text, so the model could see the pixels but had no string handle for tools that take image_url: str (custom MCP tools, vision_analyze on a re-look, attach-to-tracker workflows). The text-mode path already injects an equivalent hint via Runner._enrich_message_with_vision ("...vision_analyze using image_url: <path>...").
This brings native mode to parity by appending one "[Image attached at: <path>]" line per successfully attached image to the user-text part of the multimodal turn. Skipped (unreadable) paths are NOT advertised, so the model is never told a non-existent file is attached. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(optional-skills): port Anthropic financial-services skills as optional finance bundle (#21180) Adds 7 optional skills under optional-skills/finance/ adapted from anthropics/financial-services (Apache-2.0): excel-author — openpyxl conventions: blue/black/green cells, formulas over hardcodes, named ranges, balance checks, sensitivity tables. Ships recalc.py. pptx-author — python-pptx for model-backed decks (pitch, IC memo, earnings note) that bind every number to a source workbook cell. dcf-model — institutional DCF (49KB skill): projections, WACC, terminal value, Bear/Base/Bull scenarios, 5x5 sensitivity tables. Ships validate_dcf.py. comps-analysis — comparable company analysis: operating metrics, multiples, statistical benchmarking. lbo-model — leveraged buyout: S&U, debt schedule, cash sweep, exit multiple, IRR/MOIC sensitivity. 3-statement-model — fully-integrated IS/BS/CF with balance-check plugs. Ships references/ for formatting, formulas, SEC filings. merger-model — accretion/dilution analysis for M&A. All seven are optional (not active by default). Users install via 'hermes skills install official/finance/<skill>'. Hermesification: - Stripped every Office JS / Office Add-in / mcp__office__* branch — skills assume headless openpyxl only. - Replaced Cowork MCP data-source instructions with 'MCP first (via native-mcp), fall back to web_search/web_extract against SEC EDGAR and user-provided data'. - Swapped Claude tool references (Bash, Read, Write, Edit, mcp__*) for Hermes-native equivalents and Python library calls. 
- Canonical Hermes frontmatter (name/description/version/author/ license/metadata.hermes.{tags,related_skills}). - Descriptions tightened to 187-238 chars, trigger-first. - Attribution preserved: author field credits 'Anthropic (adapted by Nous Research)', license: Apache-2.0, each SKILL.md links back to the upstream source directory. Verification: - All 7 discovered by OptionalSkillSource with source_id='official' - Bundle fetch includes support files (scripts, references, troubleshooting) - related_skills cross-refs all resolve within the bundle - No Claude product / Cowork / Office JS / /mnt/skills leakage remains in body text (bounded mentions only in attribution blocks) Source: https://github.com/anthropics/financial-services (Apache-2.0) * test(skills): cover additional rescan paths in skill_commands cache (#14536) The rescan-on-platform-change fix that landed in #18739 ships one regression test that exercises the HERMES_PLATFORM env-var path. Three other code paths in get_skill_commands / _resolve_skill_commands_platform have no direct coverage; this commit adds a regression test for each. - Gateway session context (HERMES_SESSION_PLATFORM via ContextVar): the resolver consults get_session_env after HERMES_PLATFORM, and the gateway sets that variable through set_session_vars (a ContextVar), not os.environ. The test uses set_session_vars / clear_session_vars to drive the actual gateway signal, and the disabled-skill stub reads the same value via get_session_env. A regression that swapped get_session_env for plain os.getenv would still pass an env-var-based test but break concurrent gateway sessions, which is the bug the ContextVar plumbing exists to prevent. - Returning to no-platform-scope (CLI / cron / RL rollouts after a gateway session): the cached telegram view must be dropped and the unfiltered scan repopulated when HERMES_PLATFORM is unset again. - Same-platform cache hit: consecutive calls under the same platform scope must NOT rescan.
The rescan trigger is change in scope, not "always re-resolve" — a gateway serving many consecutive telegram requests should pay the scan cost once, not per request. The third test wraps scan_skill_commands with a spy after the cache is primed, so the assertion is on call_count == 0 across three subsequent get_skill_commands() calls. All 39 tests in tests/agent/test_skill_commands.py pass under scripts/run_tests.sh. * fix(gateway): translate inbound document host paths to container paths for Docker backend When terminal.backend is docker, inbound documents uploaded via messaging platforms (Telegram, Slack, Discord, Feishu, Email, etc.) are cached at a host path under ~/.hermes/cache/documents, but the container sandbox only sees them at the auto-mounted /root/.hermes/cache/documents path. This PR adds to_agent_visible_cache_path() in tools/credential_files.py (the natural sibling to get_cache_directory_mounts()) and calls it at the document-context-injection site in gateway/run.py so the agent always receives a path it can open directly, matching the mount layout already established by get_cache_directory_mounts() (#4846). Scope: only Docker backend for now; other backends use different mount semantics and are left unchanged until verified. Fixes #18787 * feat(gateway): opt-in cleanup of temporary progress bubbles (#21186) When display.cleanup_progress (or display.platforms.<plat>.cleanup_progress) is true, the gateway deletes tool-progress bubbles, long-running '⏳ Still working...' notices, and status-callback messages after the final response is delivered successfully. Currently effective on adapters that implement delete_message (Telegram); silently no-ops elsewhere. Off by default. Failed runs skip cleanup so bubbles stay as breadcrumbs. Minimal plumbing: base.py's existing post_delivery_callback slot now chains new registrations onto any existing callback (with per-callback exception isolation) rather than clobbering. 
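The chain-don't-clobber behaviour on the post-delivery callback slot described above can be sketched as a small combinator. Names here are illustrative, not the real `base.py` API; the essential parts are the chaining and the per-callback exception isolation.

```python
def chain_post_delivery(existing, new):
    """Chain a new post-delivery callback onto any existing one.

    Each callback runs inside its own try/except so one failing
    registration cannot starve its siblings (a sketch of the slot
    behaviour described above).
    """
    if existing is None:
        return new

    def chained(*args, **kwargs):
        for cb in (existing, new):
            try:
                cb(*args, **kwargs)
            except Exception:
                pass  # real code would log; isolation keeps siblings running

    return chained
```

With this shape, registering a cleanup callback on a slot that already holds a release hook runs both, in registration order.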
Stale-generation registrations are rejected so they can't step on a fresher run's callbacks. This lets the cleanup callback coexist with the background-review release hook already registered on the same slot. Co-authored-by: mrcharlesiv <Mrcharlesiv@gmail.com> * fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at The kanban_heartbeat tool called heartbeat_worker but never heartbeat_claim, so a worker that loops the tool while a single tool call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got reclaimed by release_stale_claims. The function name and heartbeat_claim's own docstring imply otherwise: "Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership." But there was no caller in the worker tool path. Workers couldn't invoke heartbeat_claim themselves either — it isn't exposed as a tool. Fix: _handle_heartbeat now calls heartbeat_claim first, reading HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins this in _default_spawn). Falls back to _claimer_id() for locally- driven workers that didn't go through dispatcher spawn. Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires rewinds claim_expires into the past, calls the tool, and asserts the new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to fail against the unfixed code (claim_expires stays at the rewound value). Closes the root cause underlying the symptom in #21141 (15-min respawns of long-running workers). #21141 separately addresses post-reclaim cleanup; this fixes the upstream "shouldn't have been reclaimed in the first place" half. * chore(release): map stephen0110 noreply email * fix(kanban): stop reclaimed workers before retry * fix(kanban): reap completed worker children in dispatch_once The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway = true`) is the parent of every spawned kanban worker. 
`_default_spawn` calls `subprocess.Popen(..., start_new_session=True)` and returns the pid — `start_new_session` detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it `wait()`s for them. Nothing in the dispatch loop ever calls `waitpid`. Result: every completed worker becomes a `<defunct>` zombie that lingers until the gateway exits. We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. Fix: add a non-blocking reap loop at the top of `dispatch_once`, so every dispatcher tick (default 60s) drains zombies that accumulated since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError means no children to reap. Why here, not a SIGCHLD handler: - signal.signal requires the main thread; gateway threading model makes that placement non-trivial. - Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions. - No interaction with detect_crashed_workers: that function only inspects rows where status='running', and rows reach 'done' (and stop being inspected) before their workers exit. * chore(release): map sonic-netizen noreply email * fix: auto-block repeated kanban retries * chore(release): map mwnickerson noreply email * feat(gateway): auto-resume interrupted sessions after restart * fix(gateway): preserve resume marker on interrupted restart * refactor(gateway): simplify auto-resume + extend to crash recovery Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape, wider coverage. Changes: - Drop the synthetic '[System note: ...]' in the internal MessageEvent. The existing _is_resume_pending branch in _handle_message_with_agent (run.py ~L13738) already injects a reason-aware recovery system note on the next turn. With kyan's text in place the model saw two stacked system notes. 
Now the event text is empty and the existing injection path owns the wording. - Drop SessionStore.list_resume_pending() as a new public method. The filter is 8 lines inline in _schedule_resume_pending_sessions() — one caller, no other pluggability need. - Add 'restart_interrupted' to the auto-resume reason set. That's the reason SessionStore.suspend_recently_active() stamps on sessions recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker). Previously those sessions had to wait for a real user message to auto-resume; now they continue automatically at startup like drain-timeout interruptions do. - Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so future reasons (e.g. 'manual_resume_request') can be opted in with one line. Test coverage added: - drain-timeout + crash-recovery both scheduled - stale entries skipped (outside freshness window) - suspended entries skipped (suspended > resume_pending) - originless entries skipped (no routing target) - disallowed reasons skipped (graceful forward-compat) E2E verified end-to-end with a real on-disk SessionStore: 2 eligible sessions scheduled, 2 ineligible skipped, empty-text internal events delivered to the adapter. Co-authored-by: Kevin Yan <kevyan1998@gmail.com> * fix(auth): sync shared Nous refresh tokens * refactor(auth): dedupe file-lock helper; document Nous lock order Extract the shared flock/msvcrt boilerplate from _auth_store_lock and _nous_shared_store_lock into a single _file_lock(lock_path, holder, timeout, message) helper. Each caller keeps its own threading.local holder so reentrancy state stays per-lock. Also document the lock-ordering invariant on both wrappers: _auth_store_lock is OUTER, _nous_shared_store_lock is INNER for all runtime refresh paths. 
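A POSIX-only sketch of such a shared file-lock helper, with the timeout measured against a monotonic deadline (so wall-clock jumps cannot shorten or extend the wait). Names and the polling strategy are illustrative; the real helper additionally handles Windows `msvcrt` locking and per-lock reentrancy via `threading.local` holders.

```python
import fcntl
import os
import time
from contextlib import contextmanager

@contextmanager
def file_lock(lock_path: str, timeout: float = 10.0, poll: float = 0.05):
    """Acquire an exclusive advisory flock, polling until a monotonic deadline."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + timeout   # immune to civil-time jumps
    try:
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {lock_path}")
                time.sleep(poll)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Callers that need two locks (the auth-store lock and the shared Nous lock) must always take them in the documented outer-then-inner order to avoid deadlock.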
The one exception is _try_import_shared_nous_state, which holds the shared lock alone across the full HTTP refresh+mint cycle to prevent concurrent sibling imports from racing on the single- use shared refresh token; that helper must not be called with the auth lock already held. * fix(gateway): avoid duplicated responses history * chore: add AUTHOR_MAP entries for thelumiereguy and counterposition * fix(gateway): use monotonic deadlines in QR onboarding flows * fix(oauth,gateway): monotonic deadlines for polling/timeout loops Widen PR #20314's fix to the other timeout-polling sites in the codebase that share the same wall-clock-jump bug class. All of these measure elapsed timeout duration, not civil time, so they belong on time.monotonic(). - hermes_cli/auth.py: auth-store file-lock timeout, Spotify OAuth callback wait, Nous portal device-auth token poll. - hermes_cli/copilot_auth.py: Copilot OAuth device-flow token poll. - hermes_cli/gateway.py: gateway systemd restart wait. - hermes_cli/web_server.py: dashboard Codex device-auth user_code wait, dashboard Nous device-auth token poll. (sess["expires_at"] stays on time.time() — it's a persisted absolute timestamp, not a local deadline-polling variable.) - agent/copilot_acp_client.py: Copilot ACP JSON-RPC request timeout. * fix(weixin): replace all aiohttp ClientTimeout with asyncio.wait_for() aiohttp ClientTimeout uses BaseTimerContext which calls loop.call_later() internally. When invoked via asyncio.run_coroutine_threadsafe() from cron jobs, this triggers "Timeout context manager should be used inside a task" errors, causing message delivery failures. 
Replace all direct ClientTimeout usage with asyncio.wait_for(): - _upload_ciphertext: CDN upload (120s timeout) - _download_bytes: CDN download (configurable timeout) - _download_remote_media: remote media fetch (30s timeout) Also set total=None on _send_session to disable aiohttp built-in timeout, and change trust_env=True to False to bypass proxy for WeChat CDN connections. * test(weixin): update timeout assertion for asyncio.wait_for migration * chore: AUTHOR_MAP entry for chenlinfeng@ruije / @noOne-list * feat(security): enable secret redaction by default (#17691, #20785) (#21193) Flip the default for HERMES_REDACT_SECRETS from off to on so the redactor already wired into send_message_tool, logs, and tool output actually runs on a fresh install. - agent/redact.py: env-var default "" → "true" - hermes_cli/config.py: DEFAULT_CONFIG security.redact_secrets True; two config-template comments rewritten - gateway/run.py + cli.py: startup log / banner warning when the user has explicitly opted out, so the downgrade is visible in agent.log and at CLI banner time - docs/reference/environment-variables.md: description reconciled - tests: flipped the default-pin, restructured the force=True regression test to explicit-false instead of unset Users who need raw credential values (redactor development) can still opt out via security.redact_secrets: false in config.yaml or HERMES_REDACT_SECRETS=false in .env. Closes #17691. Addresses #20785 (short-term output-pipeline recommendation). * feat: add Discord message deletion action * chore: AUTHOR_MAP entry for @likejudy * fix(security): close TOCTOU window in hermes_cli/auth.py credential writers (#21194) `_save_auth_store`, `_save_qwen_cli_tokens`, and `_write_shared_nous_state` all created the temp file via `Path.open('w')` / `Path.write_text` and only tightened permissions to 0o600 afterward. 
Between create and chmod the file existed at the process umask (commonly 0o644 = world-readable on multi-user hosts), briefly exposing OAuth access/refresh tokens for Nous, Codex, Copilot, Claude, Qwen, Gemini, and every other native OAuth provider that flows through auth.json. Switch all three to `os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen` + `fsync` so the file is atomic at 0o600 on creation. Tighten each parent directory (`~/.hermes/`, Qwen auth dir, Nous shared auth dir) to 0o700 so siblings can't traverse to the creds. `_save_auth_store` also gains a per-process random temp suffix to match `agent/google_oauth.py` (#19673) and `tools/mcp_oauth.py` (#21148). Adds `tests/hermes_cli/test_auth_toctou_file_modes.py` asserting final file mode 0o600 and parent dir mode 0o700 across all three writers, plus an explicit `os.open(flags, mode)` check on the main auth.json writer that would fail if anyone reintroduces the `Path.open('w')` pattern. POSIX-only (mode bits skipped on Windows). * fix(delegate): correct ACP docs — Claude Code CLI has no --acp flag The delegate_task tool schema descriptions referenced 'claude --acp --stdio' as an example, but Claude Code CLI does not support --acp or --stdio flags. The ACP subprocess transport (agent/copilot_acp_client.py) is specifically built for GitHub Copilot CLI ('copilot --acp --stdio'). 
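The atomic-at-0o600 writer pattern from the auth.py TOCTOU fix above (`os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen` + `fsync`, per-process random temp suffix, 0o700 parent) can be sketched as follows. The helper name is hypothetical; this is an illustration of the technique, not the shipped `_save_auth_store`.

```python
import json
import os

def write_private_json(path: str, payload: dict) -> None:
    """Write a credentials file that is 0o600 from the instant it exists."""
    parent = os.path.dirname(path) or "."
    os.makedirs(parent, mode=0o700, exist_ok=True)
    os.chmod(parent, 0o700)                       # siblings can't traverse
    # per-process random suffix avoids collisions and stale leftovers
    tmp = f"{path}.{os.getpid()}.{os.urandom(4).hex()}.tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        with os.fdopen(fd, "w") as fh:            # fdopen takes ownership of fd
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())                 # durable before the rename
        os.replace(tmp, path)                     # atomic swap into place
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

There is no create-then-chmod window: the mode is applied by `os.open` at creation, and `os.replace` preserves it on the final file (POSIX semantics; mode bits are not enforced on Windows).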
Changes: - Per-task acp_command example: 'claude' → 'copilot' - Top-level acp_command description: remove 'Claude Code' reference, clarify requirement for ACP-compatible CLI (currently Copilot only) - acp_args description: remove misleading claude-opus-4-6 example Fixes #19055 * fix: exclude hidden and archive dirs from _find_skill rglob * fix(gateway): preserve thread routing from cached live session sources * fix(gateway): cap cached session sources with LRU eviction Follow-up on top of Zyproth's session-source cache: swap the unbounded dict for an OrderedDict with a 512-entry LRU cap so long-running gateways can't accumulate stale entries for dead sessions forever. - self._session_sources is now an OrderedDict - _cache_session_source() move_to_end + popitem(last=False) above cap - _get_cached_session_source() move_to_end on hit (LRU read bump) - restart_test_helpers.py wires OrderedDict + _session_sources_max * fix(mcp): give 'mcp add --command' a distinct argparse dest The --command flag of `hermes mcp add` shared its argparse dest with the top-level subparser (`dest="command"` in `hermes_cli/_parser.py`). When the flag was omitted, argparse still wrote `args.command = None`, clobbering the top-level value of `"mcp"`. The dispatcher then saw `args.command is None` and fell through to interactive chat, so `hermes mcp add ...` silently launched chat instead of registering the server. `cmd_mcp_add` was never reached. Use `dest="mcp_command"` on the flag and read it from `cmd_mcp_add`. The user-facing CLI flag `--command` is unchanged; only the in-memory namespace attribute moves. Also updates the `_make_args` helper in `tests/hermes_cli/test_mcp_config.py` to populate the new dest, and adds `tests/hermes_cli/test_mcp_add_command_dest.py` with a parser- level regression test. Closes #19785. 
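The argparse dest collision above reproduces in a few lines. The parser wiring here is illustrative, not the real `hermes_cli/_parser.py`: when the flag shares `dest="command"` with the top-level subparsers, the sub-parser's default (`None`) is copied back over the dispatch key on modern Python, clobbering it.

```python
import argparse

def build_parser(fixed: bool) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hermes")
    sub = parser.add_subparsers(dest="command")        # top-level dispatch key
    mcp = sub.add_parser("mcp").add_subparsers(dest="mcp_action")
    add = mcp.add_parser("add")
    # buggy: the flag's dest defaults to "command", colliding with the
    # dispatch key above; fixed: a distinct dest leaves args.command intact
    add.add_argument("--command", dest="mcp_command" if fixed else "command")
    return parser

buggy = build_parser(fixed=False).parse_args(["mcp", "add"])
fixed = build_parser(fixed=True).parse_args(["mcp", "add"])
```

With the distinct dest, `args.command` stays `"mcp"` and the dispatcher routes to the mcp handler; the user-facing `--command` flag is unchanged.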
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: add discodirector email to AUTHOR_MAP * fix(bedrock): preserve reasoningContent across converse normalization * feat(gateway): support [[as_document]] directive for skill media routing Skills that produce large/lossless images (e.g. info-graph, where a rendered JPG is 1-2 MB) currently lose quality in Telegram delivery because `_IMAGE_EXTS` membership routes the file through `send_multiple_images` → `sendMediaGroup`, which Telegram's server re-encodes to JPEG @ 1280px max edge. The original bytes only survive when the file goes through `send_document`, which the dispatch tables in three places (`_process_message_background`, `_deliver_media_from_response`, and the `send_message` tool's telegram path) only reach for files whose extension is NOT in `_IMAGE_EXTS`. This commit adds an `[[as_document]]` directive that mirrors the existing `[[audio_as_voice]]` shape: a skill emits the directive once in its response, and every image-extension MEDIA: file in that response is delivered via `send_document` instead of `send_multiple_images` / `sendPhoto`. The directive is detected at the dispatch sites (which see the raw response) and the directive string is stripped from the user-visible cleaned text in `extract_media` so it never leaks. Granularity is intentionally all-or-nothing per response, matching [[audio_as_voice]]'s scope. Skills that need fine control can split into two responses. Verified the targeted use case: info-graph emits 信息图已生成(...) [[as_document]] MEDIA:/tmp/info-graph-x/infographic.jpg → Telegram receives `infographic.jpg` via sendDocument, original 1MB JPEG bytes preserved, no recompression. Forwarding and download filenames stay clean (`infographic.jpg`). Tests: +3 cases in TestExtractMedia covering directive strip, isolation from voice flag, and coexistence with [[audio_as_voice]]. All 113 pre-existing media/extract/send tests pass. 
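The `[[as_document]]` detect-and-strip convention above can be sketched as a small helper. This is an illustration of the directive handling only; the real `extract_media` also parses `MEDIA:` lines and the `[[audio_as_voice]]` flag.

```python
import re

AS_DOCUMENT = "[[as_document]]"

def extract_as_document(text: str):
    """Return (cleaned_text, as_document_flag).

    The directive is detected anywhere in the response (all-or-nothing
    per response, like [[audio_as_voice]]) and stripped from the
    user-visible text so it never leaks.
    """
    flag = AS_DOCUMENT in text
    cleaned = text.replace(AS_DOCUMENT, "")
    # collapse the whitespace left behind by the stripped directive
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return cleaned, flag
```

Dispatch sites would consult the flag to route every image-extension `MEDIA:` file through `send_document` instead of `send_multiple_images`.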
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: update send_message_tool mocks for force_document kwarg

* chore: AUTHOR_MAP entry for @leon7609

* fix(model_switch): live model discovery for custom_providers in /model picker

custom_providers entries (section 4 of list_authenticated_providers) only read the static models: dict from config.yaml, ignoring the live /v1/models endpoint. This means gateways like Bifrost that expose hundreds of models only show the handful explicitly listed in config.

Add live discovery via fetch_api_models() for custom_providers entries that have api_key + base_url, matching the existing behavior for user providers: entries (section 3). When the endpoint is reachable and returns models, the live list replaces the static subset.

Fixes: /model picker showing only 9 models from a Bifrost gateway that actually exposes 581.

* fix(memory): support OpenViking local resource uploads

* test(memory): harden OpenViking local upload coverage

* fix(memory): harden OpenViking local path uploads

* chore(release): add AUTHOR_MAP entries for ggnnggez and ehz0ah

Contributors to OpenViking local resource upload fix (#19569).

* docs(readme): drop misleading RL install-extras claim, defer to CONTRIBUTING

README.md:163 said atroposlib and tinker were pulled in by .[all,dev], but .[all] does not include .[rl] — those dependencies live in pyproject.toml's [rl] extra (lines 95-101). With the original wording, a contributor running uv pip install -e ".[all,dev]" would not have atroposlib or tinker installed.

Rather than swap one extra for another (which points users to either of two parallel install conventions — pip [rl] extra vs tinker-atropos submodule — without saying which the project considers canonical), this PR drops the specific install command from the README and links to CONTRIBUTING.md, which already documents the actual development setup.
* fix(kanban): auto-block workers that exit without completing (#20894) (#21214)

When a kanban worker subprocess exits rc=0 but its task is still in status='running', the agent almost certainly answered the task conversationally without calling kanban_complete or kanban_block. The dispatcher used to classify this as a generic crash and respawn, which loops forever on small local models (gemma4-e2b q4 etc.) that keep returning clean but unproductive output.

Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each reaped child's raw exit status in a bounded module registry (_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit / signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker is found dead. clean_exit → protocol_violation event + immediate circuit-breaker trip (failure_limit=1). Everything else keeps the existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.

Gateway fix (Bug A in #20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots self.adapters with list(...) before iterating. adapter.send() can hit a fatal-error path that pops the adapter from the dict, which was raising 'RuntimeError: dictionary changed size during iteration' during shutdown.

Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies rc=0 + still-running → status=blocked on first occurrence with protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies non-zero exits keep the existing 2-strike behavior.

Closes #20894.
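The classification step can be sketched as a standalone function over the raw `waitpid()` status encoding (POSIX-only, like the dispatcher itself; the bucket names follow the commit above):

```python
import os

def classify_worker_exit(raw_status: int) -> str:
    """Bucket the raw status recorded by the waitpid reap loop.
    A clean exit with the task still 'running' is the protocol-violation
    candidate; everything else keeps the existing crash handling."""
    if os.WIFSIGNALED(raw_status):
        return "signaled"
    if os.WIFEXITED(raw_status):
        return "clean_exit" if os.WEXITSTATUS(raw_status) == 0 else "nonzero_exit"
    return "unknown"

# On POSIX the exit code lives in the high byte of the encoded status;
# a terminating signal number lives in the low bits.
print(classify_worker_exit(0))        # worker exited rc=0
print(classify_worker_exit(1 << 8))   # worker exited rc=1
print(classify_worker_exit(9))        # worker killed by SIGKILL
```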
* fix(dashboard): stabilize embedded chat resume and scrollback

* fix(dashboard): let embedded chat use a single scroll system

* fix(dashboard): route browser wheel into inner TUI scrolling

* chore: AUTHOR_MAP entry for @nouseman666

* fix(cli): honor positive tool preview length

* chore: AUTHOR_MAP entry for @GinWU05

* fix(credential_pool): resolve key mix-up when custom providers share base_url

When multiple custom_providers share the same base_url but have different API keys, get_custom_provider_pool_key() always returned the first match, causing wrong-key unauthorized errors. Add provider_name parameter to prefer exact name matches over base_url-only matching, with fallback for backward compatibility.

Fixes #19083

* feat(cli): show context compression count in status bar

Display the number of context compressions in the CLI status bar when compressions > 0, helping users understand conversation compression pressure during long sessions.
- Wide layout (>=76 cols): shows 'cmp N' between context percent and duration
- Medium layout (52-75 cols): shows 'cmp N' between percent and duration
- Narrow layout (<52 cols): omitted to save space
- Color-coded: dim for 1-4, warn for 5-9, bad for 10+
- Hidden when zero to keep the bar clean for new sessions

Closes #18564

* refactor: replace 'cmp' text with 🗜️ emoji in status bar

Address review feedback to use the clamp emoji (🗜️) instead of the plain text 'cmp' prefix for the compression count indicator.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tui): surface compression count in Ink status bar

Parity with the classic CLI status bar (PR #18579). The Python backend already exposes 'compressions' on SessionUsageResponse; this wires it through the Ink Usage type and renders 'cmp N' next to the duration segment of StatusRule.
- types.ts Usage: add optional compressions field
- appChrome.tsx StatusRule: render 'cmp N' when > 0, color-tiered by pressure (muted <5, warn 5-9, error 10+)
- Plain text 'cmp' token (no emoji) matches PR #18579's original author rationale and avoids Ink layout drift from VS16 emoji width

* chore(release): map altriatree@gmail.com -> @TruaShamu

* fix(curator): make manual runs synchronous

* docs(curator): update CLI docs for synchronous-by-default manual run

Follow-up to the previous commit which flipped 'hermes curator run' default from async to sync. Updates the curator.md feature page and cli-commands.md reference to show --background as the opt-in async flag and note that the default now blocks until the LLM pass finishes.

* fix(install): remove uv exclude-newer cutoff

* docs: clarify API server tool execution locality

* fix(kanban): treat dashboard event-stream cancellation as normal shutdown

Stopping `hermes dashboard` with Ctrl-C while the Kanban dashboard is open prints an ASGI traceback ending in `plugins/kanban/dashboard/plugin_api.py::stream_events` at the `asyncio.sleep(_EVENT_POLL_SECONDS)` line.

This is a normal shutdown path: Uvicorn cancels the open websocket task while it is sleeping in the 300 ms poll loop. `asyncio.CancelledError` is a `BaseException` in Python 3.8+ — the bare `except Exception:` handler below the existing `WebSocketDisconnect:` clause does NOT catch it, so the cancellation surfaces as an application traceback and routine dashboard exit looks like a runtime failure.

Add an explicit `except asyncio.CancelledError: return` clause beside the existing `WebSocketDisconnect` handler. Disconnection (client closed the tab) and shutdown cancellation (dashboard process exiting) are conceptually different paths but both warrant a quiet return; the two clauses are kept separate to keep that intent explicit. `asyncio` is already imported and used in this scope, so no new import is needed.
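A minimal sketch of why the extra clause is needed — `asyncio.CancelledError` derives from `BaseException`, so it bypasses `except Exception:` and must be caught explicitly (the poll constant's value is assumed here):

```python
import asyncio

_EVENT_POLL_SECONDS = 0.3  # assumed value; the commit describes a 300 ms poll

async def stream_events(send_event):
    """Sketch of the fixed poll loop: cancellation during the sleep is a
    normal shutdown, not an application error."""
    try:
        while True:
            await send_event()
            await asyncio.sleep(_EVENT_POLL_SECONDS)
    except asyncio.CancelledError:
        return  # Uvicorn cancelled us at shutdown: quiet exit, no ASGI traceback
    except Exception:
        pass  # genuine runtime failures: the real code logs and closes the socket

async def demo():
    sent = []

    async def send_event():
        sent.append("event")

    task = asyncio.create_task(stream_events(send_event))
    await asyncio.sleep(0.05)  # cancel while the loop sleeps in the poll
    task.cancel()
    await task  # returns normally: the coroutine suppressed the cancellation
    return sent

sent = asyncio.run(demo())
```

Because the coroutine suppresses the `CancelledError` and returns, the task completes cleanly instead of surfacing a traceback during shutdown.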
The bare `except Exception:` handler is preserved verbatim, so genuine runtime failures still log a warning and close the socket cleanly.

Closes #20790.

* chore(release): map SandroHub013 email

* test(kanban): regression for CancelledError swallow in stream_events

Drives stream_events directly and cancels the task while it is sleeping in the poll loop, asserting the coroutine returns cleanly instead of letting CancelledError bubble. Regression coverage for the Uvicorn application traceback on dashboard Ctrl-C fixed by the preceding commit.

* fix(model_tools): log plugin hook exceptions instead of silently swallowing them

* feat(gateway): add `hermes gateway list` to show all profiles' gateway status

Add a new `hermes gateway list` subcommand that shows the running status of gateways across all profiles in a single view:

Gateways:
  ✓ default (current) — PID 155469
  ✓ wx1 — PID 166893
  ✗ dev — not running

Also includes `_print_other_profiles_gateway_status()` which appends an "Other profiles" section to `hermes gateway status` output when other profile gateways are running. Both use existing `list_profiles()` and `find_profile_gateway_processes()` — no new dependencies.

Closes #19127
Related: #19113, #4402, #4587

* fix(mcp-oauth): persist OAuth server metadata across process restarts (#21226)

The MCP SDK discovers OAuth server metadata (token_endpoint, etc.) on demand and keeps it in memory only. Without disk persistence, a restart with valid cached refresh tokens forces the SDK to fall back to the guessed '{server_url}/token' path — which returns 404 on most real providers (Notion, Atlassian, GitHub remote MCP, etc.) and triggers a full browser re-authorization even though the refresh token is fine.
Add a .meta.json file next to the existing tokens/client_info files:

HERMES_HOME/mcp-tokens/<server>.json        -- tokens (existing)
HERMES_HOME/mcp-tokens/<server>.client.json -- client info (existing)
HERMES_HOME/mcp-tokens/<server>.meta.json   -- oauth metadata (new)

Changes:
- HermesTokenStorage.save_oauth_metadata / load_oauth_metadata / _meta_path — disk layer for the discovered OAuthMetadata.
- HermesTokenStorage.remove() now also clears .meta.json so 'hermes mcp remove <name>' and the manager's remove() path clean up fully.
- HermesMCPOAuthProvider._initialize cold-restores from disk before the existing pre-flight discovery runs. If disk has metadata we skip the discovery HTTP round-trips entirely.
- HermesMCPOAuthProvider._prefetch_oauth_metadata now persists ASM as soon as it's discovered, so even the first pre-flight run seeds disk.
- HermesMCPOAuthProvider._persist_oauth_metadata_if_changed() is called at the end of async_auth_flow so metadata discovered via the SDK's lazy 401-branch (not pre-flight) is also saved for next time.

Tests cover the storage roundtrip (save/load/missing/corrupt/remove) and the manager provider path (cold-load restore, skip-when-in-memory, persist-on-discover, noop-when-unchanged, end-to-end async_auth_flow).

Co-authored-by: nocturnum91 <50326054+nocturnum91@users.noreply.github.com>

* feat: add SSE transport support for MCP client

Add support for MCP servers using the SSE transport protocol (SseServerTransport) alongside the existing Streamable HTTP and stdio transports. Many MCP servers use SSE (GET /sse + POST /messages/) which was previously unsupported -- the client silently fell back to Streamable HTTP, causing 10s connection timeouts.
Changes:
- Import mcp.client.sse.sse_client with graceful fallback
- Check config.get('transport') == 'sse' in _run_http() to select the SSE transport path with proper timeout handling
- Read transport type from config in get_mcp_status() instead of hardcoding 'http' for URL-based servers
- Update docstring, example config, and feature list

* fix(browser): enforce cloud-metadata SSRF floor in hybrid routing (#16234) (#21228)

Cloud metadata endpoints (169.254.169.254 etc.) are now always blocked by browser_navigate regardless of hybrid routing, allow_private_urls, or backend.

Bug: commit 42c076d3 (#16136) added hybrid routing that flips auto_local_this_nav=True for private URLs and short-circuits _is_safe_url(). IMDS endpoints are technically private (169.254/16 link-local), so the sidecar happily routed them to a local Chromium, and the agent could read IAM credentials via browser_snapshot. On EC2/GCP/Azure this is a full SSRF-to-credential-theft.

Fix: new is_always_blocked_url() in url_safety.py — a narrow floor that checks _BLOCKED_HOSTNAMES, _ALWAYS_BLOCKED_IPS, _ALWAYS_BLOCKED_NETWORKS only. Applied as an independent gate in browser_navigate's pre-nav and post-redirect checks, BEFORE auto_local_this_nav gets a chance to short-circuit. Ordinary private URLs (localhost, 192.168.x, 10.x, .local, CGNAT) still route to the local sidecar as the #16136 feature intends.

Secondary fix (reporter's finding): _url_is_private() now explicitly checks 172.16.0.0/12. ipaddress.is_private only covers that range on Python ≥3.11 (bpo-40791), so on 3.10 runtimes those URLs were routed to cloud instead of the local sidecar. No security impact — just a correctness fix for the hybrid-routing feature.

Closes #16234.
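The floor check can be sketched with an illustrative subset of the constants named above (the real `url_safety.py` lists more hostnames and networks):

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative subset of the url_safety.py constants; not the full lists.
_BLOCKED_HOSTNAMES = {"metadata.google.internal"}
_ALWAYS_BLOCKED_IPS = {ipaddress.ip_address("169.254.169.254")}
_ALWAYS_BLOCKED_NETWORKS = [ipaddress.ip_network("169.254.0.0/16")]

def is_always_blocked_url(url: str) -> bool:
    """Narrow floor checked BEFORE hybrid routing can short-circuit
    _is_safe_url(): metadata endpoints are never navigable on any backend."""
    host = (urlsplit(url).hostname or "").lower()
    if host in _BLOCKED_HOSTNAMES:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # non-literal hosts fall through to the ordinary checks
    return addr in _ALWAYS_BLOCKED_IPS or any(
        addr in net for net in _ALWAYS_BLOCKED_NETWORKS
    )

print(is_always_blocked_url("http://169.254.169.254/latest/meta-data/"))  # blocked
print(is_always_blocked_url("http://192.168.1.10/admin"))                 # not floor-blocked
```

Ordinary RFC1918 addresses fall through this gate untouched, which is what lets #16136's local-sidecar routing keep working for them.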
* fix: WhatsApp bridge process leak and disable config asymmetry

- Add PID file mechanism to track bridge processes and kill stale ones on startup
- Improve _kill_port_process() with lsof fallback when fuser is not available
- Support explicit WhatsApp disable via config.yaml (whatsapp.enabled: false)
- Respect WHATSAPP_ENABLED=false env var to disable WhatsApp

Fixes #19124

* docs(contributing): align tool discovery and test runner with AGENTS.md

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(kanban): make dashboard board pin authoritative over server current file (#21230)

When the user created a new board via the dashboard with "switch" checked, the server-side `current` file was flipped to the new board. Clicking the original board's tab then showed no card…
RationallyPrime pushed a commit to RationallyPrime/hermes-agent that referenced this pull request — May 8, 2026:
…riters (NousResearch#21194) `_save_auth_store`, `_save_qwen_cli_tokens`, and `_write_shared_nous_state` all created the temp file via `Path.open('w')` / `Path.write_text` and only tightened permissions to 0o600 afterward. Between create and chmod the file existed at the process umask (commonly 0o644 = world-readable on multi-user hosts), briefly exposing OAuth access/refresh tokens for Nous, Codex, Copilot, Claude, Qwen, Gemini, and every other native OAuth provider that flows through auth.json. Switch all three to `os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen` + `fsync` so the file is atomic at 0o600 on creation. Tighten each parent directory (`~/.hermes/`, Qwen auth dir, Nous shared auth dir) to 0o700 so siblings can't traverse to the creds. `_save_auth_store` also gains a per-process random temp suffix to match `agent/google_oauth.py` (NousResearch#19673) and `tools/mcp_oauth.py` (NousResearch#21148). Adds `tests/hermes_cli/test_auth_toctou_file_modes.py` asserting final file mode 0o600 and parent dir mode 0o700 across all three writers, plus an explicit `os.open(flags, mode)` check on the main auth.json writer that would fail if anyone reintroduces the `Path.open('w')` pattern. POSIX-only (mode bits skipped on Windows).
gzsiang pushed a commit to gzsiang/hermes-agent that referenced this pull request — May 10, 2026
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request — May 11, 2026
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request — May 11, 2026
Follow-up to #21148 / #21176 — same TOCTOU bug class, three parallel call sites in
`hermes_cli/auth.py` that neither PR touched. Found while auditing siblings during the MCP salvage. Closes the umask exposure window for every native OAuth provider Hermes supports (Nous, Codex, Copilot, Claude, Qwen, Gemini, etc.).

Summary
`_save_auth_store` (`~/.hermes/auth.json`), `_save_qwen_cli_tokens`, and `_write_shared_nous_state` all created the temp file via `Path.open('w')` / `Path.write_text` and chmod'd to 0o600 afterward. Between create and chmod the file existed at process umask (commonly 0o644 = world-readable on multi-user hosts), briefly exposing OAuth access/refresh tokens to other local users. All three now use `os.open(O_WRONLY|O_CREAT|O_EXCL, 0o600)` + `os.fdopen` + `fsync`, matching the pattern already shipped for `agent/google_oauth.py` (#19673) and `tools/mcp_oauth.py` (#21148). Scariest of the three was
`_save_auth_store`: it was the central writer for every native OAuth provider's tokens AND its chmod ran on the final target file AFTER `atomic_replace`, meaning the exposure window spanned the entire write.

Changes
`hermes_cli/auth.py`:
- `_save_auth_store` (line ~954) — `os.open(O_EXCL, 0o600)` + `fdopen` + `fsync`; parent dir tightened to 0o700.
- `_save_qwen_cli_tokens` (line ~1523) — same pattern; per-process random temp suffix replaces fixed `.tmp`; parent dir 0o700.
- `_write_shared_nous_state` (line ~2849) — same pattern; parent dir 0o700.

`tests/hermes_cli/test_auth_toctou_file_modes.py` (new) — POSIX-gated regression tests: `_save_auth_store` specifically calls `os.open` with O_CREAT|O_EXCL and mode=0o600 (catches reintroducing `Path.open('w')`)

Validation
- `scripts/run_tests.sh` over `test_auth_*` + spotify + new regression file
- `os.open` assertion against simulated old (`Path.open('w')`) code (`os.open` was never called for `auth.json.tmp`)
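The shared writer pattern from the Changes section, sketched as a standalone helper (the name `write_private_json` is illustrative; each real writer inlines this logic). POSIX-only, like the fix itself:

```python
import json
import os
import secrets
import tempfile
from pathlib import Path

def write_private_json(target: Path, payload: dict) -> None:
    """TOCTOU-safe credential write: the temp file is born 0o600 via
    os.open (no umask window), fsynced, then atomically renamed."""
    target.parent.mkdir(parents=True, exist_ok=True)
    os.chmod(target.parent, 0o700)  # siblings can't traverse to the creds
    # per-process random suffix, so O_EXCL never collides between processes
    tmp = target.parent / f"{target.name}.{secrets.token_hex(4)}.tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, target)  # atomic: readers see old or new, never partial
    except BaseException:
        tmp.unlink(missing_ok=True)
        raise

# demo under a throwaway directory
root = Path(tempfile.mkdtemp()) / "hermes"
write_private_json(root / "auth.json", {"access_token": "tok"})
print(oct((root / "auth.json").stat().st_mode & 0o777))
```

O_EXCL makes the create fail loudly if the temp path already exists, so the file can never be opened through a pre-planted symlink, and the mode argument applies at creation rather than in a later `chmod`.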