fix(gateway): handle planned service stops #19876
Closed
helix4u wants to merge 1 commit into
Merged via #19936: your commit was cherry-picked onto current main with authorship preserved (commit b632290). Thanks @helix4u! Opening a separate follow-up issue for the related drain-hang on wedged adapter sockets (WSL + hung Feishu/Weixin websockets), which your marker correctly left out of scope.
praxstack pushed a commit to praxstack/NousResearch-hermes-agent that referenced this pull request on May 6, 2026
ROOT CAUSE
==========
`hermes update` running from a feature branch had two distinct bugs in
`_cmd_update_impl` (hermes_cli/main.py):
1. **Asymmetric branch handling.** At line 7071 the function checks out
main for the update. The 'already up to date' path at lines 7107-7114
switches back to the original branch, but the 'successful update'
path (lines 7118-7183) never did. Every successful `hermes update`
silently left HEAD on main, abandoning the user's feature branch.
2. **Stash restore order in both paths.** Even in the up-to-date path,
the stash was restored BEFORE the checkout-back. That meant
`git stash apply` ran while HEAD was still on main, applying
feature-branch-local edits on top of main. The checkout-back only
restored the branch tip pointer — the working tree contamination
had already happened.
REPRODUCTION
============
Visible in the reflog of any session that ran `hermes update`:
checkout: feat/... → main ← line 7071
pull --ff-only origin main ← line 7140 (fast-forward)
(nothing — feat/... abandoned)
The 05:24 IST rebase abort this morning was damage control after this
exact bug left HEAD on main. The parity branch was saved by a manual
`git checkout feat/native-bedrock-provider-20260428`, not by the
update command itself.
FIX
===
Swap the order in both finally blocks: checkout original branch FIRST,
then restore the stash. This ensures:
- HEAD lands back where the user started
- `git stash apply` runs on the correct branch
- Detached HEAD / main sessions are unchanged (guarded with
`current_branch not in ('main', 'HEAD')`)
Does NOT: rebase main into the feature branch. That is still the user's job
(a silent rebase would be worse than this bug).
TESTS
=====
Three regression tests added to tests/hermes_cli/test_update_autostash.py:
test_cmd_update_restores_feature_branch_after_successful_update
— asserts `checkout feat/my-work` runs after a successful pull
test_cmd_update_checkout_back_happens_before_stash_restore
— asserts ORDERING: checkout runs before _restore_stashed_changes
in the successful-update path
test_cmd_update_already_up_to_date_checkout_back_before_restore
— same ordering invariant for the up-to-date path
Verified: all 3 tests fail without the fix, pass with it.
Total: 26/26 pass in test_update_autostash.py.
Co-discovered by sibling hermes session working on the same problem.
Refs: NousResearch#19876 (bedrock parity branch discovered the bug in use)
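The ordering described under FIX can be illustrated with a minimal sketch. This is not the code from `_cmd_update_impl`: `restore_after_update` and the injected `run_git` callable are hypothetical names, used so the order of git commands can be observed without touching a real repository. Only the `current_branch not in ('main', 'HEAD')` guard and the checkout-then-`git stash apply` order come from the commit message.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for whatever the CLI uses to invoke git.
GitRunner = Callable[[List[str]], None]


def restore_after_update(
    run_git: GitRunner,
    current_branch: str,
    stash_ref: Optional[str],
) -> None:
    """Return to the user's branch first, then re-apply their stashed edits."""
    # Guard quoted in the commit message: sessions that started on main or on
    # a detached HEAD do not need a checkout-back.
    if current_branch not in ("main", "HEAD"):
        run_git(["checkout", current_branch])
    # Only at this point is HEAD back on the branch the stash was taken from.
    if stash_ref is not None:
        run_git(["stash", "apply", stash_ref])


# Record the commands instead of running them.
calls: List[List[str]] = []
restore_after_update(calls.append, "feat/my-work", "stash@{0}")
print(calls)  # [['checkout', 'feat/my-work'], ['stash', 'apply', 'stash@{0}']]
```

Recording the commands in a list is also roughly how an ordering assertion like the ones listed under TESTS could be expressed, although the actual tests may be structured differently.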
praxstack pushed further commits with the same commit message to praxstack/NousResearch-hermes-agent that referenced this pull request on May 8, May 9, May 10, and May 11, 2026.
What does this PR do?
Adds an explicit planned-stop marker for gateway stop paths so a deliberate `hermes gateway stop`, launchd stop, or profile-scoped stop can be distinguished from an unexpected external SIGTERM.
The gateway currently exits non-zero for signal-initiated shutdowns so systemd can revive it after an unexpected kill. That is useful for real failures, but it also means an intentional service stop can be treated as a failure in some user-systemd/WSL setups. The result is that `systemctl --user stop hermes-gateway` may time out or the gateway may be revived immediately, leaving users with "Gateway already running" and platform identity conflicts such as Feishu `app_id` already in use.
This follows the existing `--replace` takeover marker pattern: the CLI writes a short-lived marker naming the target PID and process start time before sending the stop signal, and the gateway consumes that marker during signal handling to exit cleanly. Interactive Ctrl+C is also treated as an intentional foreground stop.
Related but not duplicate fixes:
- [...] `TimeoutStopSec` for planned restart drains.
This PR covers the missing planned-stop path: the gateway should not ask the service manager to revive it after the user explicitly stops it.
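As a rough illustration of the marker handshake described above, here is a minimal sketch. It is not the code in `gateway/status.py`: the function names, the JSON layout, the marker path, and the 60-second lifetime are all assumptions made for the example. The PR only states that the marker is short-lived and names the target PID and process start time.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

MARKER_TTL_SECONDS = 60.0  # assumed lifetime; the PR only says "short-lived"


def write_planned_stop_marker(path: Path, pid: int, start_time: float) -> None:
    """CLI side: record which process is about to be stopped on purpose."""
    payload = {"pid": pid, "start_time": start_time, "written_at": time.time()}
    path.write_text(json.dumps(payload))


def consume_planned_stop_marker(
    path: Path, pid: int, start_time: float, now: Optional[float] = None
) -> bool:
    """Gateway side: True only if a fresh marker names this exact process."""
    try:
        payload = json.loads(path.read_text())
    except (OSError, ValueError):
        return False  # missing or unreadable marker: treat the signal as unexpected
    if not isinstance(payload, dict):
        return False
    # PID plus start time guards against a recycled PID matching an old marker.
    if payload.get("pid") != pid or payload.get("start_time") != start_time:
        return False  # marker belongs to another process; leave it in place
    path.unlink(missing_ok=True)  # one-shot: a marker is consumed at most once
    now = time.time() if now is None else now
    return (now - float(payload.get("written_at", 0.0))) <= MARKER_TTL_SECONDS


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "planned-stop.json"  # hypothetical location
    write_planned_stop_marker(marker, pid=os.getpid(), start_time=123.0)
    first = consume_planned_stop_marker(marker, os.getpid(), 123.0)
    second = consume_planned_stop_marker(marker, os.getpid(), 123.0)
    print(first, second)  # True False
```

Matching on both PID and start time, and deleting the marker on first use, is one way to keep a stale marker from turning a later unexpected SIGTERM into a clean exit.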
Related Issue
Related to #14128, #14176, #17198.
Type of Change
Changes Made
- `gateway/status.py`: adds planned-stop marker helpers and shares marker consumption logic with takeover markers.
- `hermes_cli/gateway.py`: writes the planned-stop marker before systemd, launchd, or profile-scoped stop sends SIGTERM.
- `gateway/run.py`: consumes the planned-stop marker during SIGTERM handling and treats Ctrl+C as an intentional clean stop.
- `tests/gateway/test_status.py`: adds planned-stop marker coverage.
- `tests/hermes_cli/test_gateway_service.py`: verifies `systemd_stop()` marks the target gateway before stopping and makes generated-unit timeout assertions deterministic against the default drain timeout.
How to Test
- `python -m py_compile gateway/status.py hermes_cli/gateway.py gateway/run.py`
- `.venv/bin/python -m pytest -n 4 tests/gateway/test_status.py tests/hermes_cli/test_gateway_service.py -q`: `153 passed` before the final deterministic test hardening.
- `.venv/bin/python -m pytest -n 4 tests/gateway/test_runner_startup_failures.py tests/gateway/test_clean_shutdown_marker.py -q`: `15 passed`.
- `.venv/bin/python -m pytest -n 4 tests/hermes_cli/test_gateway_service.py tests/gateway/test_status.py tests/gateway/test_runner_startup_failures.py tests/gateway/test_clean_shutdown_marker.py -q`: `168 passed` after the final deterministic test hardening.
- `.venv/bin/python -m pytest -n 4 tests/ -q`: full suite attempted on the final tree, but the repo-wide suite is currently red: `116 failed, 19535 passed, 59 skipped, 223 warnings in 617.59s`. The failures are outside the planned-stop marker path and are concentrated in existing cron, approval, auxiliary client, Bedrock beta header, gateway config, DingTalk, Discord, update, browser, delegate, and sandbox tests.
Checklist
Code
- [...] `fix(scope):`, `feat(scope):`, etc.)
- [...] `pytest tests/ -q` and all tests pass
Documentation & Housekeeping
- [...] `docs/`, docstrings), or N/A
- [...] `cli-config.yaml.example` if I added/changed config keys, or N/A
- [...] `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows, or N/A
For New Skills
N/A
Screenshots / Logs
User reports showed `hermes gateway stop` timing out in `systemctl --user stop hermes-gateway` after 90s, followed by a still-running gateway PID and platform identity conflicts. The code path also logged `Exiting with code 1 (signal-initiated shutdown without restart request) so systemd Restart=on-failure can revive the gateway.` during shutdown, which is the behavior this marker prevents for intentional stops.
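The exit-code behavior described here can be summarized in a small sketch. It is a simplification under stated assumptions, not the logic in `gateway/run.py`: `exit_code_for_signal` is a hypothetical helper, and the restart-request case mentioned in the log line is deliberately left out because the PR text does not say which exit code it uses.

```python
import signal


def exit_code_for_signal(signum: int, planned_stop: bool) -> int:
    """Pick the process exit code for a signal-initiated shutdown."""
    # Interactive Ctrl+C is treated as an intentional foreground stop.
    if signum == signal.SIGINT:
        return 0
    # A consumed planned-stop marker means the user asked for this stop.
    if planned_stop:
        return 0
    # Anything else looks like an unexpected kill: exit non-zero so that
    # systemd's Restart=on-failure can revive the gateway.
    return 1


print(exit_code_for_signal(signal.SIGTERM, planned_stop=False))  # 1
print(exit_code_for_signal(signal.SIGTERM, planned_stop=True))   # 0
print(exit_code_for_signal(signal.SIGINT, planned_stop=False))   # 0
```

With `Restart=on-failure`, systemd treats exit code 0 as a clean exit and does not restart the unit, which is why the planned-stop path returns 0.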