revert(gateway): remove stale-code self-check and auto-restart #20156

Open
teknium1 wants to merge 1 commit into main from fix/gateway-remove-stale-code-self-restart

@teknium1 teknium1 commented May 5, 2026

What

Removes the "stale-code self-check" feature that made the gateway auto-restart itself whenever git HEAD moved on the checkout after boot. When triggered, it dropped the user's current message and replied with:

⟳ Gateway code was updated in the background — restarting this gateway so your next message runs on the new code. Please retry in a moment.

Why

The feature proved undesirable in practice. On a long-lived gateway, any ad-hoc git operation on the checkout (branch switch, rebase, pull, even git worktree add in the common dir) flips HEAD, and the user's next message gets hijacked into a forced restart notice. There was no config flag and no way to disable it per profile: if HEAD moved, your message died.

The original motivation (Issue #17648) was to prevent ImportError from stale sys.modules after hermes update. That concern is already handled on the hermes update side by the SIGKILL-survivor sweep in hermes_cli/main.py (same issue number), which forces the supervisor to respawn with fresh code. The gateway-side detection loop was a belt-and-suspenders second mechanism, and the suspenders were cutting off circulation.

What was removed

All in gateway/run.py:

  • Module-level: _STALE_CODE_SENTINELS, _GIT_SHA_CACHE_TTL_SECS, _read_git_head_sha(), _compute_repo_mtime()
  • Class-level defaults: _boot_wall_time, _boot_repo_mtime, _boot_git_sha, _stale_code_restart_triggered
  • __init__ boot-snapshot block (git HEAD read, mtime compute, cache init)
  • Methods: _current_git_sha_cached(), _detect_stale_code(), _trigger_stale_code_restart()
  • The check + user-facing notice at the top of _handle_message()

Also deleted: tests/gateway/test_stale_code_self_check.py (412 lines).

706 lines removed, 0 added.
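
For reviewers who never read the removed code, the pieces listed above combined into roughly the following per-message check. This is a minimal Python sketch reconstructed from the description in this PR, not the deleted implementation: read_head_sha, StaleCodeCheck, and the 5-second TTL are hypothetical stand-ins, and whatever _compute_repo_mtime() and _STALE_CODE_SENTINELS contributed is not modelled here.

```python
import time
from pathlib import Path
from typing import Optional


def read_head_sha(repo_root: Path) -> Optional[str]:
    """Resolve .git/HEAD to a commit SHA without spawning git (hypothetical helper)."""
    git_dir = repo_root / ".git"
    try:
        head = (git_dir / "HEAD").read_text().strip()
        if head.startswith("ref: "):
            # Symbolic ref such as "ref: refs/heads/main"; packed refs are ignored here.
            return (git_dir / head[len("ref: "):]).read_text().strip()
        return head  # a detached HEAD stores the SHA directly
    except OSError:
        return None  # not a plain checkout, or the ref is unreadable


class StaleCodeCheck:
    """Compare the SHA seen at boot with the SHA currently on disk (illustrative only)."""

    def __init__(self, repo_root: Path, ttl_secs: float = 5.0):
        self.repo_root = repo_root
        self.ttl_secs = ttl_secs  # assumed value; the real TTL is not stated in this PR
        self.boot_sha = read_head_sha(repo_root)
        self._cached_sha = self.boot_sha
        self._cached_at = time.monotonic()

    def current_sha(self) -> Optional[str]:
        # Re-read HEAD at most once per TTL window so each message stays cheap.
        now = time.monotonic()
        if now - self._cached_at >= self.ttl_secs:
            self._cached_sha = read_head_sha(self.repo_root)
            self._cached_at = now
        return self._cached_sha

    def is_stale(self) -> bool:
        # Unknown SHAs never count as stale; only a confirmed difference does.
        current = self.current_sha()
        return None not in (self.boot_sha, current) and current != self.boot_sha
```

In this sketch, anything that rewrites the ref behind HEAD, or points HEAD at a different ref, makes is_stale() return True once the cache expires. That is why routine branch switches on the checkout were enough to trip the restart.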

Verification

  • python -c "from gateway import run" → imports clean.
  • ripgrep '_detect_stale_code|_trigger_stale_code_restart|_read_git_head_sha|_compute_repo_mtime|_GIT_SHA_CACHE_TTL_SECS|_STALE_CODE_SENTINELS|_stale_code_restart_triggered|_boot_git_sha|_boot_repo_mtime|_cached_current_sha|_current_git_sha_cached|_repo_root_for_staleness|_stale_code_notified' → zero hits anywhere in the repo.
  • scripts/run_tests.sh tests/gateway/ → 4589 passed. The 3 pre-existing unrelated failures (test_discord_free_channel_skips_auto_thread, test_hydrate_bot_identity_populates_self_ids_from_bot_v3_info, Teams test_send_typing) exist on clean main and are unchanged by this revert.
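
The symbol sweep can also be reproduced without ripgrep. The following is a minimal Python sketch, assuming only .py files matter (ripgrep as invoked above searches every non-ignored file); find_references is a hypothetical helper, not something in the repo, and the pattern is copied from the ripgrep invocation above.

```python
import os
import re
from pathlib import Path

# Every symbol this PR removes, as listed in the ripgrep pattern.
REMOVED_SYMBOLS = re.compile(
    r"_detect_stale_code|_trigger_stale_code_restart|_read_git_head_sha"
    r"|_compute_repo_mtime|_GIT_SHA_CACHE_TTL_SECS|_STALE_CODE_SENTINELS"
    r"|_stale_code_restart_triggered|_boot_git_sha|_boot_repo_mtime"
    r"|_cached_current_sha|_current_git_sha_cached|_repo_root_for_staleness"
    r"|_stale_code_notified"
)


def find_references(root, pattern=REMOVED_SYMBOLS):
    """Return (path, line number, line) for each .py line that matches pattern."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = Path(dirpath) / name
            try:
                text = path.read_text(encoding="utf-8", errors="replace")
            except OSError:
                continue  # unreadable file: skip rather than abort the sweep
            for lineno, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_references("."):
        print(f"{path}:{lineno}: {line}")
```

An empty result corresponds to the "zero hits" outcome reported above, provided the script itself is saved outside the tree being scanned.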

Reverts the behaviour introduced in #17648 / #18409 and the SHA-based follow-up in #19740.

Removes the _detect_stale_code / _trigger_stale_code_restart mechanism
introduced in #17648 and iterated in #19740. On every incoming message
the gateway compared the boot-time git HEAD SHA to the current SHA on
disk, and if they differed it would reply with

    Gateway code was updated in the background --
    restarting this gateway so your next message runs
    on the new code. Please retry in a moment.

and then kick off a graceful restart. This is unwanted behaviour:
users who run a long-lived gateway and do their own ad-hoc git
operations on the checkout end up with their chat interrupted and
the current message dropped every time HEAD moves, with no way to
opt out.

If an operator really needs the old protection against stale
sys.modules after "hermes update", the SIGKILL-survivor sweep in
hermes update (hermes_cli/main.py, also tagged #17648) already
handles the supervisor-respawn case on its own.

Removed:
  gateway/run.py:
    - _STALE_CODE_SENTINELS, _GIT_SHA_CACHE_TTL_SECS
    - _read_git_head_sha(), _compute_repo_mtime() module helpers
    - class-level _boot_wall_time / _boot_repo_mtime / _boot_git_sha /
      _stale_code_restart_triggered defaults
    - __init__ boot-snapshot block (_boot_*, _cached_current_sha*,
      _repo_root_for_staleness, _stale_code_notified)
    - _current_git_sha_cached(), _detect_stale_code(),
      _trigger_stale_code_restart() methods
    - stale-code check + user-facing restart notice at the top of
      _handle_message()
  tests/gateway/test_stale_code_self_check.py (deleted, 412 lines)

No new logic added. Zero remaining references to any removed
symbol. Gateway test suite passes the same 4589 tests it passed
before; the 3 pre-existing unrelated failures (discord free-channel,
feishu bot admission, teams typing) are unchanged by this commit.
alt-glitch added the following labels on May 5, 2026: type/refactor (Code restructuring, no behavior change), P2 Medium (degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery).