test(update): teach restart-mocks about the post-update survivor sweep#19031
Closed
Sanjays2402 wants to merge 1 commit into
Closed
test(update): teach restart-mocks about the post-update survivor sweep#19031Sanjays2402 wants to merge 1 commit into
Sanjays2402 wants to merge 1 commit into
Conversation
Issue NousResearch#17648 added a post-update SIGTERM-survivor sweep to `cmd_update`: ~3s after issuing graceful/SIGTERM restarts, the code re-queries `find_gateway_pids` and SIGKILLs anything still alive. That's the right fix for stuck-drain gateways in production, but it broke three unit tests that assumed `find_gateway_pids` would keep returning the same PIDs forever: FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways AssertionError: Expected 'kill' to not have been called. Called 1 times. Calls: [call(12345, <Signals.SIGKILL: 9>)]. FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm AssertionError: Expected 'kill' to have been called once. Called 2 times. Calls: [call(12345, SIGTERM), call(12345, SIGKILL)]. FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid assert 2 == 1 manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)] In each test `os.kill` is mocked, so the simulated PID never actually exits \u2014 the sweep finds it again and escalates. The production code is correct; the tests just need to model OS behaviour properly. Two-test fix (profile-manual restart cases): use `side_effect=[[12345], []]` so the first `find_gateway_pids` call returns the live PID and the second (the sweep) returns nothing, as if the OS had reaped the process. Service-PID-exclusion fix: track which PIDs got killed in a closure set, and exclude them on subsequent `fake_find` calls. `os.kill` gets a `side_effect` that records the kill instead of swallowing it silently. Now the sweep doesn't re-find the manual PID, no SIGKILL escalation, `manual_kills == 1`. Validation: $ pytest tests/hermes_cli/test_update_gateway_restart.py -q 43 passed in 4.13s No production code change. Fixes the three failures observed on `main` (run 25250051126): test_update_restarts_profile_manual_gateways test_update_profile_manual_gateway_falls_back_to_sigterm test_update_kills_manual_pid_but_not_service_pid Refs: NousResearch#17648 (post-update survivor sweep that the tests didn't model).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes three CI failures observed on
main:Reference run: 25250051126 on
5d3be898a.Root cause
Issue #17648 added a post-update SIGTERM-survivor sweep to
cmd_update. ~3s after issuing graceful restart / SIGTERM, the code re-queriesfind_gateway_pidsand SIGKILLs anything still alive — the right fix for stuck-drain gateways in production:But the three unit tests assumed
find_gateway_pidswould keep returning the same PIDs forever (return_value=[12345]). Withos.killmocked, the simulated PID never actually exits → the sweep finds it again → SIGKILL escalation → assertion fires.The production code is correct; the tests just need to model OS behaviour properly.
Fix
Two profile-manual restart tests: use
side_effect=[[12345], []]so the firstfind_gateway_pidscall returns the live PID and the second (the sweep) returns nothing, as if the OS had reaped the process.Service-PID-exclusion test: track which PIDs got killed in a closure set, and exclude them on subsequent
fake_findcalls. Giveos.killaside_effectthat records the kill instead of swallowing it silently. Now the sweep doesn't re-find the manual PID, no SIGKILL escalation,manual_kills == 1.Validation
Scope
Refs
Out of scope
The other ~5 main-CI failures — separate focused PRs.