feat(gateway): auto-resume interrupted sessions after restart (salvage #20888) #21192
Merged
Follow-up on top of @kyan12's PR #20888: same feature, cleaner shape, wider coverage.

Changes:
- Drop the synthetic '[System note: ...]' in the internal MessageEvent. The existing _is_resume_pending branch in _handle_message_with_agent (run.py ~L13738) already injects a reason-aware recovery system note on the next turn. With kyan's text in place the model saw two stacked system notes. Now the event text is empty and the existing injection path owns the wording.
- Drop SessionStore.list_resume_pending() as a new public method. The filter is 8 lines inline in _schedule_resume_pending_sessions(): one caller, no other pluggability need.
- Add 'restart_interrupted' to the auto-resume reason set. That's the reason SessionStore.suspend_recently_active() stamps on sessions recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker). Previously those sessions had to wait for a real user message to auto-resume; now they continue automatically at startup like drain-timeout interruptions do.
- Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so future reasons (e.g. 'manual_resume_request') can be opted in with one line.

Test coverage added:
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended takes precedence over resume_pending)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)

Verified end-to-end with a real on-disk SessionStore: 2 eligible sessions scheduled, 2 ineligible skipped, empty-text internal events delivered to the adapter.

Co-authored-by: Kevin Yan <kevyan1998@gmail.com>
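For reviewers who want the eligibility rules in one place, here is a minimal sketch of the inline filter described above. It is not the code in `_schedule_resume_pending_sessions()`: the `Entry` fields, the `drain_timeout` reason spelling, and the 15-minute freshness window are all assumptions made for illustration. Only `_AUTO_RESUME_REASONS` and `restart_interrupted` come from the PR text.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

# 'restart_interrupted' is named in the PR; 'drain_timeout' is an assumed
# spelling for the drain-timeout reason.
_AUTO_RESUME_REASONS = frozenset({"drain_timeout", "restart_interrupted"})

# Hypothetical freshness window; the PR does not state the real value.
FRESHNESS_WINDOW_SECONDS = 15 * 60


@dataclass
class Entry:
    """Hypothetical stand-in for a SessionStore entry."""
    session_key: str
    resume_pending: bool
    resume_reason: Optional[str]
    suspended: bool
    origin: Optional[str]  # routing target; None means nowhere to deliver
    updated_at: float      # epoch seconds of the last activity


def select_auto_resume(entries: List[Entry], now: Optional[float] = None) -> List[Entry]:
    """Return the entries that should be auto-resumed at startup."""
    if now is None:
        now = time.time()
    eligible = []
    for entry in entries:
        if entry.suspended:
            continue  # suspended takes precedence over resume_pending
        if not entry.resume_pending:
            continue
        if entry.resume_reason not in _AUTO_RESUME_REASONS:
            continue  # unknown or disallowed reasons are skipped, not errors
        if entry.origin is None:
            continue  # no routing target, so nothing to deliver to
        if now - entry.updated_at > FRESHNESS_WINDOW_SECONDS:
            continue  # stale: outside the freshness window
        eligible.append(entry)
    return eligible
```

Under these assumptions, opting in a future reason such as `manual_resume_request` is a one-line change to the frozenset.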
🔎 Lint report:
| Rule | Count |
|---|---|
| unresolved-attribute | 4 |
| invalid-argument-type | 1 |
| invalid-assignment | 1 |
First entries
gateway/run.py:2824: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `SessionSource`, found `SessionSource | None`
tests/gateway/test_restart_resume_pending.py:1129: [invalid-assignment] invalid-assignment: Object of type `AsyncMock` is not assignable to attribute `handle_message` of type `def handle_message(self, event: MessageEvent) -> CoroutineType[Any, Any, None]`
tests/gateway/test_restart_resume_pending.py:1134: [unresolved-attribute] unresolved-attribute: Object of type `bound method BasePlatformAdapter.handle_message(event: MessageEvent) -> CoroutineType[Any, Any, None]` has no attribute `assert_not_called`
gateway/run.py:2814: [unresolved-attribute] unresolved-attribute: Attribute `platform` is not defined on `None` in union `SessionSource | None`
tests/gateway/test_restart_resume_pending.py:1006: [unresolved-attribute] unresolved-attribute: Object of type `bound method BasePlatformAdapter.handle_message(event: MessageEvent) -> CoroutineType[Any, Any, None]` has no attribute `assert_awaited_once`
tests/gateway/test_restart_resume_pending.py:965: [unresolved-attribute] unresolved-attribute: Object of type `bound method BasePlatformAdapter.handle_message(event: MessageEvent) -> CoroutineType[Any, Any, None]` has no attribute `await_args`
✅ Fixed issues: none
Unchanged: 3922 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
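The diagnostics above fall into two patterns: attribute access on a `SessionSource | None` value, and mock assertions called on an adapter method that the checker still sees as a bound method. The sketch below shows one common way to address each, assuming simplified stand-ins for `SessionSource` and the adapter. The `deliver` helper is hypothetical and is not the code at `gateway/run.py:2814`.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional
from unittest.mock import AsyncMock, patch


@dataclass
class SessionSource:
    """Hypothetical minimal stand-in for the real SessionSource."""
    platform: str


class Adapter:
    """Hypothetical stand-in for a platform adapter."""
    async def handle_message(self, event: str) -> None:
        raise RuntimeError("the real adapter should not run in this sketch")


async def deliver(adapter: Adapter, source: Optional[SessionSource]) -> Optional[str]:
    # Narrow the Optional first; after this guard the checker knows
    # `source` is a SessionSource, so `.platform` is safe to read.
    if source is None:
        return None
    platform = source.platform
    await adapter.handle_message("")  # empty-text internal event
    return platform


def run_demo():
    adapter = Adapter()
    # patch.object hands back the mock itself, so assertions are made on a
    # mock-typed name rather than on `adapter.handle_message`.
    with patch.object(adapter, "handle_message", new_callable=AsyncMock) as mock:
        skipped = asyncio.run(deliver(adapter, None))
        mock.assert_not_called()
        delivered = asyncio.run(deliver(adapter, SessionSource("telegram")))
        mock.assert_awaited_once()
        return skipped, delivered, mock.await_args.args
```

Whether this silences every rule in the report depends on the checker's stubs for `unittest.mock`, so treat it as a starting point rather than a guaranteed fix.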
Salvage of @kyan12's PR #20888 onto current main with simplification + crash-recovery extension.
Summary
Auto-resume `resume_pending` sessions once gateway adapters are back online, so users don't have to send a placeholder ping.

Changes vs original PR
- Dropped the synthetic system note from the internal event. The existing `_is_resume_pending` branch in `_handle_message_with_agent` already injects a reason-aware recovery system note on the next turn. Kyan's version also put the note in the synthetic event text, so the model saw two stacked system notes. Event is now empty-text; existing path owns wording.
- Dropped `SessionStore.list_resume_pending()`: filter inlined in the single caller (8 lines). No new public API for one consumer.
- Added `restart_interrupted` to the auto-resume reason set. That's the reason `SessionStore.suspend_recently_active()` stamps on in-flight sessions when the previous gateway exit left no `.clean_shutdown` marker. These now continue automatically at startup like drain-timeout interruptions.
- Reasons live in a `_AUTO_RESUME_REASONS` frozenset at class scope, so future reasons opt in with one line.
- Kept `_should_clear_resume_pending_after_turn` + result-dict plumbing from kyan's second commit; his dogfood caught a real bug (soft interrupt looked like "done" and cleared the marker before startup could schedule).

Validation
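- `scripts/run_tests.sh tests/gateway/test_restart_resume_pending.py tests/gateway/test_session.py tests/gateway/test_session_store_prune.py` → 162/162 pass
- `tests/gateway/` → 4684/4685 (1 pre-existing `test_discord_free_channel_skips_auto_thread` failure, unrelated, confirmed on clean main)
- E2E with a real on-disk `SessionStore`: drain-timeout + crash-recovery sessions scheduled; suspended + stale correctly skipped; empty-text internal events delivered.

Test coverage added
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended takes precedence over `resume_pending`)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)
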
Closes #20888. Credit to @kyan12: their two commits are preserved in git log (rebase merge); the follow-up commit has a `Co-authored-by: Kevin Yan` trailer.
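
As a footnote on the soft-interrupt bug mentioned in the change list, the idea behind `_should_clear_resume_pending_after_turn` can be pictured as a small predicate over the turn's result dict. This is only a sketch of that idea: the `interrupted` and `completed` keys are hypothetical, and the real result-dict plumbing may look different.

```python
from typing import Any, Mapping


def should_clear_resume_pending_after_turn(result: Mapping[str, Any]) -> bool:
    """Decide whether the resume_pending marker may be cleared after a turn.

    The 'interrupted' and 'completed' keys are assumptions for illustration.
    """
    # A soft interrupt can look like a finished turn; keep the marker so the
    # startup scheduler still sees the session as pending.
    if result.get("interrupted", False):
        return False
    # Only a turn that genuinely completed clears the marker.
    return bool(result.get("completed", False))
```

The design point is that an interrupted turn must never be treated as done, even if it also reports completion.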