test(update): teach restart-mocks about the post-update survivor sweep by Sanjays2402 · Pull Request #19031 · NousResearch/hermes-agent

Sanjays2402 · 2026-05-03T00:07:53Z

Summary

Fixes three CI failures observed on main:

FAILED tests/hermes_cli/test_update_gateway_restart.py
  ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways
    AssertionError: Expected 'kill' to not have been called. Called 1 times.
    Calls: [call(12345, <Signals.SIGKILL: 9>)].

FAILED tests/hermes_cli/test_update_gateway_restart.py
  ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm
    AssertionError: Expected 'kill' to have been called once. Called 2 times.
    Calls: [call(12345, SIGTERM), call(12345, SIGKILL)].

FAILED tests/hermes_cli/test_update_gateway_restart.py
  ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid
    assert 2 == 1
      manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)]

Reference run: 25250051126 on 5d3be898a.

Root cause

Issue #17648 added a post-update SIGTERM-survivor sweep to cmd_update. About 3 s after issuing a graceful restart or SIGTERM, the code re-queries find_gateway_pids and SIGKILLs anything still alive. This is the right fix for stuck-drain gateways in production:

# hermes_cli/main.py:7553
# --- Post-restart survivor sweep -----------------------------
# Issue #17648: some gateways ignore SIGTERM (stuck drain, blocked I/O, ...)
_time.sleep(3.0)
_surviving = find_gateway_pids(exclude_pids=_service_pids_after, all_profiles=True)
_stuck = [pid for pid in _surviving if pid in killed_pids]
if _stuck:
    print(f"  ⚠ {len(_stuck)} gateway process(es) ignored SIGTERM — force-killing")
    for pid in _stuck:
        os.kill(pid, _signal.SIGKILL)

But the three unit tests assumed find_gateway_pids would keep returning the same PIDs forever (return_value=[12345]). With os.kill mocked, the simulated PID never actually exits, so the sweep finds it again, escalates to SIGKILL, and the assertion fails.
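
The failure mode can be reproduced in isolation. The sketch below is not the real cmd_update: restart_and_sweep is a hypothetical, simplified stand-in that keeps only the SIGTERM-then-sweep shape of the snippet above, with the sleep stubbed out. It assumes a POSIX platform where signal.SIGKILL exists.

```python
import signal
from unittest.mock import Mock, call


def restart_and_sweep(find_pids, kill, sleep=lambda seconds: None):
    """Hypothetical stand-in for the restart path: SIGTERM, wait, then sweep."""
    killed_pids = set()
    for pid in find_pids():
        kill(pid, signal.SIGTERM)
        killed_pids.add(pid)

    sleep(3.0)  # the real code waits ~3s; stubbed out here

    # Survivor sweep: anything we signalled that is still listed gets SIGKILL.
    stuck = [pid for pid in find_pids() if pid in killed_pids]
    for pid in stuck:
        kill(pid, signal.SIGKILL)
    return stuck


# The old test setup: the same PID on every call, and a kill that kills nothing.
find_pids = Mock(return_value=[12345])
kill = Mock()

stuck = restart_and_sweep(find_pids, kill)
```

Because the mocked kill has no effect on what find_pids reports, the second query still lists 12345 and the sweep escalates, which matches the SIGTERM-then-SIGKILL call list in the failure output.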

The production code is correct; the tests just need to model OS behaviour properly.

Fix

Two profile-manual restart tests: use side_effect=[[12345], []] so the first find_gateway_pids call returns the live PID and the second (the sweep) returns nothing, as if the OS had reaped the process.
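
A minimal illustration of how a list-valued side_effect behaves, independent of the real test file. Only unittest.mock is used; the variable names are illustrative.

```python
from unittest.mock import Mock

# Each call consumes the next item: first the live PID, then an empty list.
find_gateway_pids = Mock(side_effect=[[12345], []])

before_restart = find_gateway_pids(all_profiles=True)  # initial discovery
during_sweep = find_gateway_pids(all_profiles=True)    # post-restart sweep

# A third call would exhaust the list; Mock raises StopIteration in that case.
try:
    find_gateway_pids(all_profiles=True)
    exhausted = False
except StopIteration:
    exhausted = True
```

One consequence of this design is that the test now pins the number of find_gateway_pids calls: if the code under test starts querying a third time, the mock raises instead of silently returning stale data.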

Service-PID-exclusion test: track which PIDs got killed in a closure set, and exclude them on subsequent fake_find calls. Give os.kill a side_effect that records the kill instead of swallowing it silently. The sweep then no longer re-finds the manual PID, so there is no SIGKILL escalation and manual_kills contains exactly one call.
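
The closure-set approach might look roughly like the sketch below. It is an assumption about the shape of the fix, not a copy of the test: SERVICE_PID is a made-up value, 42999 is taken from the failure output, and the fake_find signature mirrors the keyword arguments seen in the production snippet.

```python
import signal
from unittest.mock import Mock

SERVICE_PID = 42000  # hypothetical service-managed gateway PID
MANUAL_PID = 42999   # manual gateway PID from the failure output

killed = set()  # shared state: PIDs the fake os.kill has "terminated"


def fake_find(exclude_pids=None, all_profiles=False):
    """Report running gateways, minus excluded PIDs and ones already killed."""
    excluded = set(exclude_pids or ()) | killed
    return [pid for pid in (SERVICE_PID, MANUAL_PID) if pid not in excluded]


def record_kill(pid, sig):
    """Model the OS reaping the process once it has been signalled."""
    killed.add(pid)


mock_kill = Mock(side_effect=record_kill)

# Simulated update flow: SIGTERM the manual gateways, then run the sweep query.
targets = fake_find(exclude_pids={SERVICE_PID}, all_profiles=True)
for pid in targets:
    mock_kill(pid, signal.SIGTERM)
survivors = fake_find(exclude_pids={SERVICE_PID}, all_profiles=True)

manual_kills = [c for c in mock_kill.call_args_list if c.args[0] == MANUAL_PID]
```

Unlike a fixed side_effect list, this fake stays correct however many times the code under test queries for PIDs, at the cost of a little shared state in the test.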

Validation

$ pytest tests/hermes_cli/test_update_gateway_restart.py -q
43 passed in 4.13s

Scope

✅ No production code change (test-only)
✅ All 43 tests in the file pass
✅ The survivor-sweep contract from [Bug]: Matrix messages returning error #17648 is preserved

Refs

[Bug]: Matrix messages returning error #17648 — post-update survivor sweep that the tests didn't model

Out of scope

The other ~5 failures on main CI are left for separate, focused PRs.

Issue NousResearch#17648 added a post-update SIGTERM-survivor sweep to `cmd_update`: ~3s after issuing graceful/SIGTERM restarts, the code re-queries `find_gateway_pids` and SIGKILLs anything still alive. That's the right fix for stuck-drain gateways in production, but it broke three unit tests that assumed `find_gateway_pids` would keep returning the same PIDs forever:

FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways
    AssertionError: Expected 'kill' to not have been called. Called 1 times.
    Calls: [call(12345, <Signals.SIGKILL: 9>)].

FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm
    AssertionError: Expected 'kill' to have been called once. Called 2 times.
    Calls: [call(12345, SIGTERM), call(12345, SIGKILL)].

FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid
    assert 2 == 1
      manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)]

In each test `os.kill` is mocked, so the simulated PID never actually exits; the sweep finds it again and escalates. The production code is correct; the tests just need to model OS behaviour properly.

Two-test fix (profile-manual restart cases): use `side_effect=[[12345], []]` so the first `find_gateway_pids` call returns the live PID and the second (the sweep) returns nothing, as if the OS had reaped the process.

Service-PID-exclusion fix: track which PIDs got killed in a closure set, and exclude them on subsequent `fake_find` calls. `os.kill` gets a `side_effect` that records the kill instead of swallowing it silently. Now the sweep doesn't re-find the manual PID, no SIGKILL escalation, `manual_kills == 1`.

Validation:

$ pytest tests/hermes_cli/test_update_gateway_restart.py -q
43 passed in 4.13s

No production code change.

Fixes the three failures observed on `main` (run 25250051126):

test_update_restarts_profile_manual_gateways
test_update_profile_manual_gateway_falls_back_to_sigterm
test_update_kills_manual_pid_but_not_service_pid

Refs: NousResearch#17648 (post-update survivor sweep that the tests didn't model).

alt-glitch added the type/test (Test coverage or test infrastructure), P3 (Low: cosmetic, nice to have), and comp/cli (CLI entry point, hermes_cli/, setup wizard) labels May 3, 2026

teknium1 mentioned this pull request May 7, 2026

test(update): teach restart-mocks about the post-update survivor sweep (salvage #19031) #21177

Merged

teknium1 closed this in #21177 May 7, 2026

Sanjays2402 wants to merge 1 commit into NousResearch:main from Sanjays2402:fix/main-ci-update-restart-survivor-sweep-tests
