Skip to content

test(update): teach restart-mocks about the post-update survivor sweep#19031

Closed
Sanjays2402 wants to merge 1 commit into
NousResearch:mainfrom
Sanjays2402:fix/main-ci-update-restart-survivor-sweep-tests
Closed

test(update): teach restart-mocks about the post-update survivor sweep#19031
Sanjays2402 wants to merge 1 commit into
NousResearch:mainfrom
Sanjays2402:fix/main-ci-update-restart-survivor-sweep-tests

Conversation

@Sanjays2402
Copy link
Copy Markdown
Contributor

Summary

Fixes three CI failures observed on main:

FAILED tests/hermes_cli/test_update_gateway_restart.py
  ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways
    AssertionError: Expected 'kill' to not have been called. Called 1 times.
    Calls: [call(12345, <Signals.SIGKILL: 9>)].

FAILED tests/hermes_cli/test_update_gateway_restart.py
  ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm
    AssertionError: Expected 'kill' to have been called once. Called 2 times.
    Calls: [call(12345, SIGTERM), call(12345, SIGKILL)].

FAILED tests/hermes_cli/test_update_gateway_restart.py
  ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid
    assert 2 == 1
      manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)]

Reference run: 25250051126 on 5d3be898a.

Root cause

Issue #17648 added a post-update SIGTERM-survivor sweep to cmd_update. ~3s after issuing graceful restart / SIGTERM, the code re-queries find_gateway_pids and SIGKILLs anything still alive — the right fix for stuck-drain gateways in production:

# hermes_cli/main.py:7553
# --- Post-restart survivor sweep -----------------------------
# Issue #17648: some gateways ignore SIGTERM (stuck drain, blocked I/O, ...)
_time.sleep(3.0)
_surviving = find_gateway_pids(exclude_pids=_service_pids_after, all_profiles=True)
_stuck = [pid for pid in _surviving if pid in killed_pids]
if _stuck:
    print(f"  ⚠ {len(_stuck)} gateway process(es) ignored SIGTERM — force-killing")
    for pid in _stuck:
        os.kill(pid, _signal.SIGKILL)

But the three unit tests assumed find_gateway_pids would keep returning the same PIDs forever (return_value=[12345]). With os.kill mocked, the simulated PID never actually exits → the sweep finds it again → SIGKILL escalation → assertion fires.

The production code is correct; the tests just need to model OS behaviour properly.

Fix

Two profile-manual restart tests: use side_effect=[[12345], []] so the first find_gateway_pids call returns the live PID and the second (the sweep) returns nothing, as if the OS had reaped the process.

Service-PID-exclusion test: track which PIDs got killed in a closure set, and exclude them on subsequent fake_find calls. Give os.kill a side_effect that records the kill instead of swallowing it silently. Now the sweep doesn't re-find the manual PID, no SIGKILL escalation, manual_kills == 1.

Validation

$ pytest tests/hermes_cli/test_update_gateway_restart.py -q
43 passed in 4.13s

Scope

Refs

Out of scope

The other ~5 main-CI failures — separate focused PRs.

Issue NousResearch#17648 added a post-update SIGTERM-survivor sweep to `cmd_update`:
~3s after issuing graceful/SIGTERM restarts, the code re-queries
`find_gateway_pids` and SIGKILLs anything still alive. That's the
right fix for stuck-drain gateways in production, but it broke three
unit tests that assumed `find_gateway_pids` would keep returning the
same PIDs forever:

  FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways
    AssertionError: Expected 'kill' to not have been called. Called 1 times.
    Calls: [call(12345, <Signals.SIGKILL: 9>)].

  FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm
    AssertionError: Expected 'kill' to have been called once. Called 2 times.
    Calls: [call(12345, SIGTERM), call(12345, SIGKILL)].

  FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid
    assert 2 == 1
      manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)]

In each test `os.kill` is mocked, so the simulated PID never actually
exits \u2014 the sweep finds it again and escalates. The production code
is correct; the tests just need to model OS behaviour properly.

Two-test fix (profile-manual restart cases): use
`side_effect=[[12345], []]` so the first `find_gateway_pids` call
returns the live PID and the second (the sweep) returns nothing, as if
the OS had reaped the process.

Service-PID-exclusion fix: track which PIDs got killed in a closure
set, and exclude them on subsequent `fake_find` calls. `os.kill`
gets a `side_effect` that records the kill instead of swallowing it
silently. Now the sweep doesn't re-find the manual PID, no SIGKILL
escalation, `manual_kills == 1`.

Validation:

    $ pytest tests/hermes_cli/test_update_gateway_restart.py -q
    43 passed in 4.13s

No production code change. Fixes the three failures observed on `main`
(run 25250051126):

  test_update_restarts_profile_manual_gateways
  test_update_profile_manual_gateway_falls_back_to_sigterm
  test_update_kills_manual_pid_but_not_service_pid

Refs: NousResearch#17648 (post-update survivor sweep that the tests didn't model).
@alt-glitch alt-glitch added type/test Test coverage or test infrastructure P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/test Test coverage or test infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants