
feat(gateway): auto-resume interrupted sessions after restart (salvage #20888) #21192

Merged
teknium1 merged 3 commits into main from hermes/hermes-fad5cdfa on May 7, 2026

@teknium1 teknium1 commented May 7, 2026

Salvage of @kyan12's PR #20888 onto current main with simplification + crash-recovery extension.

Summary

Changes vs original PR

  • Empty event text: the existing _is_resume_pending branch in _handle_message_with_agent already injects a reason-aware recovery system note on the next turn. Kyan's version also put the note in the synthetic event text, so the model saw two stacked system notes. Event is now empty-text; existing path owns wording.
  • Dropped SessionStore.list_resume_pending(): filter inlined in the single caller (8 lines). No new public API for one consumer.
  • Crash recovery covered: added restart_interrupted to the auto-resume reason set. That's the reason SessionStore.suspend_recently_active() stamps on in-flight sessions when the previous gateway exit left no .clean_shutdown marker. These now continue automatically at startup like drain-timeout interruptions.
  • _AUTO_RESUME_REASONS frozenset at class scope — future reasons opt in with one line.
  • Kept _should_clear_resume_pending_after_turn and the result-dict plumbing from kyan12's second commit; their dogfooding caught a real bug: a soft interrupt looked like "done" and cleared the marker before startup could schedule the resume.
  • AUTHOR_MAP entry for kevyan1998@gmail.com → kyan12.
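
The opt-in reason set described above can be pictured roughly as follows. This is a minimal sketch, not the code in gateway/run.py: the class name and helper method are hypothetical, and "drain_timeout" is a placeholder for whatever string the drain-timeout path actually stamps. Only "restart_interrupted" and the _AUTO_RESUME_REASONS name come from the PR text.

```python
from typing import Optional


class ResumeSchedulerSketch:
    """Illustrative stand-in for the gateway runner; not the real class."""

    # "restart_interrupted" is named in the PR. "drain_timeout" is an assumed
    # placeholder for the reason recorded on drain-timeout interruptions.
    _AUTO_RESUME_REASONS = frozenset({"drain_timeout", "restart_interrupted"})

    def is_auto_resumable(self, reason: Optional[str]) -> bool:
        # Unknown or missing reasons are simply not in the set, so they are
        # skipped rather than raising; that is the forward-compat behaviour.
        return reason in self._AUTO_RESUME_REASONS
```

With this shape, opting in a future reason such as 'manual_resume_request' is a one-line change to the frozenset, and any reason not listed is ignored at startup.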

Validation

  • Targeted: scripts/run_tests.sh tests/gateway/test_restart_resume_pending.py tests/gateway/test_session.py tests/gateway/test_session_store_prune.py → 162/162 pass
  • Broader: full tests/gateway/ → 4684/4685 (1 pre-existing test_discord_free_channel_skips_auto_thread failure, unrelated, confirmed on clean main)
  • E2E with real on-disk SessionStore: drain-timeout + crash-recovery sessions scheduled; suspended + stale correctly skipped; empty-text internal events delivered.

Test coverage added

  • drain-timeout + crash-recovery both scheduled
  • stale entries skipped (outside freshness window)
  • suspended entries skipped (suspended > resume_pending)
  • originless entries skipped (no routing target)
  • disallowed reasons skipped (forward-compat)

Closes #20888. Credit to @kyan12 — their two commits are preserved in git log (rebase merge); the follow-up commit has Co-authored-by: Kevin Yan.

kyan12 and others added 3 commits May 7, 2026 04:52
Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape,
wider coverage.

Changes:
- Drop the synthetic '[System note: ...]' in the internal MessageEvent.
  The existing _is_resume_pending branch in _handle_message_with_agent
  (run.py ~L13738) already injects a reason-aware recovery system note
  on the next turn.  With kyan's text in place the model saw two stacked
  system notes.  Now the event text is empty and the existing injection
  path owns the wording.
- Drop SessionStore.list_resume_pending() as a new public method.  The
  filter is 8 lines inline in _schedule_resume_pending_sessions() —
  one caller, no other pluggability need.
- Add 'restart_interrupted' to the auto-resume reason set.  That's the
  reason SessionStore.suspend_recently_active() stamps on sessions
  recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker).
  Previously those sessions had to wait for a real user message to
  auto-resume; now they continue automatically at startup like
  drain-timeout interruptions do.
- Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so
  future reasons (e.g. 'manual_resume_request') can be opted in with
  one line.
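
The crash-recovery case above hinges on the absence of the .clean_shutdown marker. A minimal sketch of that idea follows, assuming the marker lives in a state directory and is consumed at startup; the function name, the directory layout, and the consume-on-read behaviour are illustrative assumptions, not the behaviour of SessionStore.suspend_recently_active().

```python
from pathlib import Path
from typing import Optional

CLEAN_SHUTDOWN_MARKER = ".clean_shutdown"  # file name taken from the PR text


def crash_recovery_reason(state_dir: Path) -> Optional[str]:
    """Return the reason to stamp on in-flight sessions, or None.

    Hypothetical helper: a present marker means the previous exit was clean,
    a missing marker means crash/OOM/SIGKILL.
    """
    marker = state_dir / CLEAN_SHUTDOWN_MARKER
    if marker.exists():
        # Assumption: the marker is removed once read, so that a later crash
        # is not mistaken for a clean exit.
        marker.unlink()
        return None
    return "restart_interrupted"
```

Sessions stamped with "restart_interrupted" would then be picked up by the same startup scheduling as drain-timeout interruptions.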

Test coverage added:
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended > resume_pending)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)

Verified end-to-end with a real on-disk SessionStore: 2 eligible
sessions scheduled, 2 ineligible skipped, empty-text internal events
delivered to the adapter.

Co-authored-by: Kevin Yan <kevyan1998@gmail.com>
@teknium1 teknium1 merged commit 38b1c7d into main May 7, 2026
10 of 11 checks passed
@teknium1 teknium1 deleted the hermes/hermes-fad5cdfa branch May 7, 2026 12:05

github-actions Bot commented May 7, 2026

🔎 Lint report: hermes/hermes-fad5cdfa vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7479 on HEAD, 7462 on base (🆕 +17)

🆕 New issues (6):

| Rule | Count |
|---|---|
| unresolved-attribute | 4 |
| invalid-argument-type | 1 |
| invalid-assignment | 1 |

First entries:
gateway/run.py:2824: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `SessionSource`, found `SessionSource | None`
tests/gateway/test_restart_resume_pending.py:1129: [invalid-assignment] invalid-assignment: Object of type `AsyncMock` is not assignable to attribute `handle_message` of type `def handle_message(self, event: MessageEvent) -> CoroutineType[Any, Any, None]`
tests/gateway/test_restart_resume_pending.py:1134: [unresolved-attribute] unresolved-attribute: Object of type `bound method BasePlatformAdapter.handle_message(event: MessageEvent) -> CoroutineType[Any, Any, None]` has no attribute `assert_not_called`
gateway/run.py:2814: [unresolved-attribute] unresolved-attribute: Attribute `platform` is not defined on `None` in union `SessionSource | None`
tests/gateway/test_restart_resume_pending.py:1006: [unresolved-attribute] unresolved-attribute: Object of type `bound method BasePlatformAdapter.handle_message(event: MessageEvent) -> CoroutineType[Any, Any, None]` has no attribute `assert_awaited_once`
tests/gateway/test_restart_resume_pending.py:965: [unresolved-attribute] unresolved-attribute: Object of type `bound method BasePlatformAdapter.handle_message(event: MessageEvent) -> CoroutineType[Any, Any, None]` has no attribute `await_args`
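
The two gateway/run.py diagnostics both report a `SessionSource | None` value being used without narrowing. The usual way to satisfy a type checker in that situation is an explicit None guard before attribute access. The sketch below uses a simplified, hypothetical stand-in for SessionSource and is not the project's code or a proposed fix for this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionSourceSketch:
    """Hypothetical stand-in; the real SessionSource has its own fields."""
    platform: str
    chat_id: str


def routing_key(source: Optional[SessionSourceSketch]) -> Optional[str]:
    # Guarding on None narrows the type to SessionSourceSketch below, so
    # attribute access and passing `source` onward both type-check.
    if source is None:
        return None  # originless entry: nothing to route to
    return f"{source.platform}:{source.chat_id}"
```

The guard also matches the runtime behaviour listed under test coverage, where originless entries are skipped because they have no routing target.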

✅ Fixed issues: none

Unchanged: 3922 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.


Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P2: Medium (degraded but workaround exists)
  • type/feature: New feature or request
