Skip to content

fix(gateway): handle planned service stops#19876

Closed
helix4u wants to merge 1 commit into
NousResearch:mainfrom
helix4u:fix/gateway-planned-stop-marker
Closed

fix(gateway): handle planned service stops#19876
helix4u wants to merge 1 commit into
NousResearch:mainfrom
helix4u:fix/gateway-planned-stop-marker

Conversation

@helix4u
Copy link
Copy Markdown
Contributor

@helix4u helix4u commented May 4, 2026

What does this PR do?

Adds an explicit planned-stop marker for gateway stop paths so a deliberate hermes gateway stop, launchd stop, or profile-scoped stop can be distinguished from an unexpected external SIGTERM.

The gateway currently exits non-zero for signal-initiated shutdowns so systemd can revive it after an unexpected kill. That is useful for real failures, but it also means an intentional service stop can be treated as failure in some user-systemd/WSL setups. The result is that systemctl --user stop hermes-gateway may time out or the gateway may be revived immediately, leaving users with "Gateway already running" and platform identity conflicts such as Feishu app_id already in use.

This follows the existing --replace takeover marker pattern: the CLI writes a short-lived marker naming the target PID and process start time before sending the stop signal, and the gateway consumes that marker during signal handling to exit cleanly. Interactive Ctrl+C is also treated as an intentional foreground stop.

Related but not duplicate fixes:

This PR covers the missing planned-stop path: the gateway should not ask the service manager to revive it after the user explicitly stops it.

Related Issue

Related to #14128, #14176, #17198.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/status.py: adds planned-stop marker helpers and shares marker consumption logic with takeover markers.
  • hermes_cli/gateway.py: writes the planned-stop marker before systemd, launchd, or profile-scoped stop sends SIGTERM.
  • gateway/run.py: consumes the planned-stop marker during SIGTERM handling and treats Ctrl+C as an intentional clean stop.
  • tests/gateway/test_status.py: adds planned-stop marker coverage.
  • tests/hermes_cli/test_gateway_service.py: verifies systemd_stop() marks the target gateway before stopping and makes generated-unit timeout assertions deterministic against the default drain timeout.

How to Test

  1. python -m py_compile gateway/status.py hermes_cli/gateway.py gateway/run.py
  2. .venv/bin/python -m pytest -n 4 tests/gateway/test_status.py tests/hermes_cli/test_gateway_service.py -q153 passed before the final deterministic test hardening.
  3. .venv/bin/python -m pytest -n 4 tests/gateway/test_runner_startup_failures.py tests/gateway/test_clean_shutdown_marker.py -q15 passed.
  4. .venv/bin/python -m pytest -n 4 tests/hermes_cli/test_gateway_service.py tests/gateway/test_status.py tests/gateway/test_runner_startup_failures.py tests/gateway/test_clean_shutdown_marker.py -q168 passed after the final deterministic test hardening.
  5. .venv/bin/python -m pytest -n 4 tests/ -q — full suite attempted on the final tree, but the repo-wide suite is currently red: 116 failed, 19535 passed, 59 skipped, 223 warnings in 617.59s. The failures are outside the planned-stop marker path and are concentrated in existing cron, approval, auxiliary client, Bedrock beta header, gateway config, DingTalk, Discord, update, browser, delegate, and sandbox tests.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Linux/WSL-style local checkout via targeted pytest and full-suite attempt

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

User reports showed hermes gateway stop timing out in systemctl --user stop hermes-gateway after 90s, followed by a still-running gateway PID and platform identity conflicts. The code path also logged Exiting with code 1 (signal-initiated shutdown without restart request) so systemd Restart=on-failure can revive the gateway. during shutdown, which is the behavior this marker prevents for intentional stops.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery comp/cli CLI entry point, hermes_cli/, setup wizard labels May 4, 2026
@helix4u helix4u force-pushed the fix/gateway-planned-stop-marker branch from 6a8f30a to 9630ce2 Compare May 4, 2026 20:00
@helix4u helix4u force-pushed the fix/gateway-planned-stop-marker branch from 9630ce2 to 72a57e9 Compare May 4, 2026 20:03
@helix4u helix4u marked this pull request as ready for review May 4, 2026 20:05
@teknium1
Copy link
Copy Markdown
Contributor

teknium1 commented May 4, 2026

Merged via #19936 — your commit was cherry-picked onto current main with authorship preserved (commit b632290). Thanks @helix4u! Opening a separate follow-up issue for the related drain-hang on wedged adapter sockets (WSL + hung Feishu/Weixin websockets), which your marker correctly left out of scope.

praxstack pushed a commit to praxstack/NousResearch-hermes-agent that referenced this pull request May 6, 2026
ROOT CAUSE
==========

`hermes update` running from a feature branch had two distinct bugs in
`_cmd_update_impl` (hermes_cli/main.py):

1. **Asymmetric branch handling.** At line 7071 the function checks out
   main for the update. The 'already up to date' path at line 7107-7114
   switches back to the original branch, but the 'successful update'
   path (lines 7118-7183) never did. Every successful `hermes update`
   silently left HEAD on main, abandoning the user's feature branch.

2. **Stash restore order in both paths.** Even in the up-to-date path,
   the stash was restored BEFORE the checkout-back. That meant
   `git stash apply` ran while HEAD was still on main, applying
   feature-branch-local edits on top of main. The checkout-back only
   restored the branch tip pointer — the working tree contamination
   had already happened.

REPRODUCTION
============

Visible in the reflog of any session that ran `hermes update`:

    checkout: feat/... → main       ← line 7071
    pull --ff-only origin main       ← line 7140 (fast-forward)
    (nothing — feat/... abandoned)

The 05:24 IST rebase abort this morning was damage control after this
exact bug left HEAD on main. The parity branch was saved by a manual
`git checkout feat/native-bedrock-provider-20260428`, not by the
update command itself.

FIX
===

Swap the order in both finally blocks: checkout original branch FIRST,
then restore the stash. This ensures:

  - HEAD lands back where the user started
  - `git stash apply` runs on the correct branch
  - Detached HEAD / main sessions are unchanged (guarded with
    `current_branch not in ('main', 'HEAD')`)

Does NOT: rebase main into feature branch. That's still user's job
(silent rebase would be worse than this bug).

TESTS
=====

Three regression tests added to tests/hermes_cli/test_update_autostash.py:

  test_cmd_update_restores_feature_branch_after_successful_update
    — asserts `checkout feat/my-work` runs after a successful pull

  test_cmd_update_checkout_back_happens_before_stash_restore
    — asserts ORDERING: checkout runs before _restore_stashed_changes
      in the successful-update path

  test_cmd_update_already_up_to_date_checkout_back_before_restore
    — same ordering invariant for the up-to-date path

Verified: all 3 tests fail without the fix, pass with it.
Total: 26/26 pass in test_update_autostash.py.

Co-discovered by sibling hermes session working on the same problem.

Refs: NousResearch#19876 (bedrock parity branch discovered the bug in use)
praxstack pushed a commit to praxstack/NousResearch-hermes-agent that referenced this pull request May 8, 2026
ROOT CAUSE
==========

`hermes update` running from a feature branch had two distinct bugs in
`_cmd_update_impl` (hermes_cli/main.py):

1. **Asymmetric branch handling.** At line 7071 the function checks out
   main for the update. The 'already up to date' path at line 7107-7114
   switches back to the original branch, but the 'successful update'
   path (lines 7118-7183) never did. Every successful `hermes update`
   silently left HEAD on main, abandoning the user's feature branch.

2. **Stash restore order in both paths.** Even in the up-to-date path,
   the stash was restored BEFORE the checkout-back. That meant
   `git stash apply` ran while HEAD was still on main, applying
   feature-branch-local edits on top of main. The checkout-back only
   restored the branch tip pointer — the working tree contamination
   had already happened.

REPRODUCTION
============

Visible in the reflog of any session that ran `hermes update`:

    checkout: feat/... → main       ← line 7071
    pull --ff-only origin main       ← line 7140 (fast-forward)
    (nothing — feat/... abandoned)

The 05:24 IST rebase abort this morning was damage control after this
exact bug left HEAD on main. The parity branch was saved by a manual
`git checkout feat/native-bedrock-provider-20260428`, not by the
update command itself.

FIX
===

Swap the order in both finally blocks: checkout original branch FIRST,
then restore the stash. This ensures:

  - HEAD lands back where the user started
  - `git stash apply` runs on the correct branch
  - Detached HEAD / main sessions are unchanged (guarded with
    `current_branch not in ('main', 'HEAD')`)

Does NOT: rebase main into feature branch. That's still user's job
(silent rebase would be worse than this bug).

TESTS
=====

Three regression tests added to tests/hermes_cli/test_update_autostash.py:

  test_cmd_update_restores_feature_branch_after_successful_update
    — asserts `checkout feat/my-work` runs after a successful pull

  test_cmd_update_checkout_back_happens_before_stash_restore
    — asserts ORDERING: checkout runs before _restore_stashed_changes
      in the successful-update path

  test_cmd_update_already_up_to_date_checkout_back_before_restore
    — same ordering invariant for the up-to-date path

Verified: all 3 tests fail without the fix, pass with it.
Total: 26/26 pass in test_update_autostash.py.

Co-discovered by sibling hermes session working on the same problem.

Refs: NousResearch#19876 (bedrock parity branch discovered the bug in use)
praxstack pushed a commit to praxstack/NousResearch-hermes-agent that referenced this pull request May 9, 2026
ROOT CAUSE
==========

`hermes update` running from a feature branch had two distinct bugs in
`_cmd_update_impl` (hermes_cli/main.py):

1. **Asymmetric branch handling.** At line 7071 the function checks out
   main for the update. The 'already up to date' path at line 7107-7114
   switches back to the original branch, but the 'successful update'
   path (lines 7118-7183) never did. Every successful `hermes update`
   silently left HEAD on main, abandoning the user's feature branch.

2. **Stash restore order in both paths.** Even in the up-to-date path,
   the stash was restored BEFORE the checkout-back. That meant
   `git stash apply` ran while HEAD was still on main, applying
   feature-branch-local edits on top of main. The checkout-back only
   restored the branch tip pointer — the working tree contamination
   had already happened.

REPRODUCTION
============

Visible in the reflog of any session that ran `hermes update`:

    checkout: feat/... → main       ← line 7071
    pull --ff-only origin main       ← line 7140 (fast-forward)
    (nothing — feat/... abandoned)

The 05:24 IST rebase abort this morning was damage control after this
exact bug left HEAD on main. The parity branch was saved by a manual
`git checkout feat/native-bedrock-provider-20260428`, not by the
update command itself.

FIX
===

Swap the order in both finally blocks: checkout original branch FIRST,
then restore the stash. This ensures:

  - HEAD lands back where the user started
  - `git stash apply` runs on the correct branch
  - Detached HEAD / main sessions are unchanged (guarded with
    `current_branch not in ('main', 'HEAD')`)

Does NOT: rebase main into feature branch. That's still user's job
(silent rebase would be worse than this bug).

TESTS
=====

Three regression tests added to tests/hermes_cli/test_update_autostash.py:

  test_cmd_update_restores_feature_branch_after_successful_update
    — asserts `checkout feat/my-work` runs after a successful pull

  test_cmd_update_checkout_back_happens_before_stash_restore
    — asserts ORDERING: checkout runs before _restore_stashed_changes
      in the successful-update path

  test_cmd_update_already_up_to_date_checkout_back_before_restore
    — same ordering invariant for the up-to-date path

Verified: all 3 tests fail without the fix, pass with it.
Total: 26/26 pass in test_update_autostash.py.

Co-discovered by sibling hermes session working on the same problem.

Refs: NousResearch#19876 (bedrock parity branch discovered the bug in use)
praxstack pushed a commit to praxstack/NousResearch-hermes-agent that referenced this pull request May 10, 2026
ROOT CAUSE
==========

`hermes update` running from a feature branch had two distinct bugs in
`_cmd_update_impl` (hermes_cli/main.py):

1. **Asymmetric branch handling.** At line 7071 the function checks out
   main for the update. The 'already up to date' path at line 7107-7114
   switches back to the original branch, but the 'successful update'
   path (lines 7118-7183) never did. Every successful `hermes update`
   silently left HEAD on main, abandoning the user's feature branch.

2. **Stash restore order in both paths.** Even in the up-to-date path,
   the stash was restored BEFORE the checkout-back. That meant
   `git stash apply` ran while HEAD was still on main, applying
   feature-branch-local edits on top of main. The checkout-back only
   restored the branch tip pointer — the working tree contamination
   had already happened.

REPRODUCTION
============

Visible in the reflog of any session that ran `hermes update`:

    checkout: feat/... → main       ← line 7071
    pull --ff-only origin main       ← line 7140 (fast-forward)
    (nothing — feat/... abandoned)

The 05:24 IST rebase abort this morning was damage control after this
exact bug left HEAD on main. The parity branch was saved by a manual
`git checkout feat/native-bedrock-provider-20260428`, not by the
update command itself.

FIX
===

Swap the order in both finally blocks: checkout original branch FIRST,
then restore the stash. This ensures:

  - HEAD lands back where the user started
  - `git stash apply` runs on the correct branch
  - Detached HEAD / main sessions are unchanged (guarded with
    `current_branch not in ('main', 'HEAD')`)

Does NOT: rebase main into feature branch. That's still user's job
(silent rebase would be worse than this bug).

TESTS
=====

Three regression tests added to tests/hermes_cli/test_update_autostash.py:

  test_cmd_update_restores_feature_branch_after_successful_update
    — asserts `checkout feat/my-work` runs after a successful pull

  test_cmd_update_checkout_back_happens_before_stash_restore
    — asserts ORDERING: checkout runs before _restore_stashed_changes
      in the successful-update path

  test_cmd_update_already_up_to_date_checkout_back_before_restore
    — same ordering invariant for the up-to-date path

Verified: all 3 tests fail without the fix, pass with it.
Total: 26/26 pass in test_update_autostash.py.

Co-discovered by sibling hermes session working on the same problem.

Refs: NousResearch#19876 (bedrock parity branch discovered the bug in use)
praxstack pushed a commit to praxstack/NousResearch-hermes-agent that referenced this pull request May 11, 2026
ROOT CAUSE
==========

`hermes update` running from a feature branch had two distinct bugs in
`_cmd_update_impl` (hermes_cli/main.py):

1. **Asymmetric branch handling.** At line 7071 the function checks out
   main for the update. The 'already up to date' path at line 7107-7114
   switches back to the original branch, but the 'successful update'
   path (lines 7118-7183) never did. Every successful `hermes update`
   silently left HEAD on main, abandoning the user's feature branch.

2. **Stash restore order in both paths.** Even in the up-to-date path,
   the stash was restored BEFORE the checkout-back. That meant
   `git stash apply` ran while HEAD was still on main, applying
   feature-branch-local edits on top of main. The checkout-back only
   restored the branch tip pointer — the working tree contamination
   had already happened.

REPRODUCTION
============

Visible in the reflog of any session that ran `hermes update`:

    checkout: feat/... → main       ← line 7071
    pull --ff-only origin main       ← line 7140 (fast-forward)
    (nothing — feat/... abandoned)

The 05:24 IST rebase abort this morning was damage control after this
exact bug left HEAD on main. The parity branch was saved by a manual
`git checkout feat/native-bedrock-provider-20260428`, not by the
update command itself.

FIX
===

Swap the order in both finally blocks: checkout original branch FIRST,
then restore the stash. This ensures:

  - HEAD lands back where the user started
  - `git stash apply` runs on the correct branch
  - Detached HEAD / main sessions are unchanged (guarded with
    `current_branch not in ('main', 'HEAD')`)

Does NOT: rebase main into feature branch. That's still user's job
(silent rebase would be worse than this bug).

TESTS
=====

Three regression tests added to tests/hermes_cli/test_update_autostash.py:

  test_cmd_update_restores_feature_branch_after_successful_update
    — asserts `checkout feat/my-work` runs after a successful pull

  test_cmd_update_checkout_back_happens_before_stash_restore
    — asserts ORDERING: checkout runs before _restore_stashed_changes
      in the successful-update path

  test_cmd_update_already_up_to_date_checkout_back_before_restore
    — same ordering invariant for the up-to-date path

Verified: all 3 tests fail without the fix, pass with it.
Total: 26/26 pass in test_update_autostash.py.

Co-discovered by sibling hermes session working on the same problem.

Refs: NousResearch#19876 (bedrock parity branch discovered the bug in use)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants