
fix(gateway): wait for systemd restart readiness #20544

Closed
helix4u wants to merge 1 commit into NousResearch:main from helix4u:fix/gateway-restart-waits-for-new-pid

helix4u (Contributor) commented May 6, 2026

What does this PR do?

Fixes the systemd-managed gateway restart path so hermes gateway restart does not report success before the replacement gateway process is actually running.

The old path relied on reload-or-restart, but the generated unit has ExecReload=/bin/kill -USR1 $MAINPID. systemd treats that as successful once the signal is delivered, even though the gateway still has to drain, exit with the planned restart code, wait through any RestartSec handoff, and start a new runtime. That made restart output look complete while Discord/Telegram could still be unavailable.
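The readiness problem described above comes down to not trusting the exit status of the reload command and instead polling until systemd reports a different main process. The following is a minimal sketch of that idea, not the code in this PR: the function names, the timeout values, and the injectable `get_pid`, `sleep`, and `clock` parameters are all assumptions made for illustration. It only covers the PID half of the check; the PR additionally waits for matching gateway runtime status.

```python
import subprocess
import time
from typing import Callable, Optional


def systemd_main_pid(unit: str) -> int:
    """Return the user unit's MainPID as reported by systemd (0 if none)."""
    out = subprocess.run(
        ["systemctl", "--user", "show", "--property=MainPID", "--value", unit],
        capture_output=True,
        text=True,
        check=False,
    ).stdout.strip()
    return int(out) if out.isdigit() else 0


def wait_for_replacement_pid(
    old_pid: int,
    get_pid: Callable[[], int],
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Poll until a non-zero PID different from old_pid appears.

    Returns the new PID, or None if the timeout elapses first.
    """
    deadline = clock() + timeout
    while True:
        pid = get_pid()
        # A PID of 0 means the unit has no main process yet (for example
        # while it sits in the RestartSec window), so keep waiting.
        if pid and pid != old_pid:
            return pid
        if clock() >= deadline:
            return None
        sleep(interval)
```

A caller would capture the old PID before requesting the restart, then pass something like `lambda: systemd_main_pid("some-gateway.service")` as `get_pid` (the unit name here is a placeholder) and print success only when a new PID comes back.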

This also hardens Discord slash command reconciliation during reconnect. Without it, a restart can trigger slash-command management writes on every reconnect, and Discord's command-management bucket can rate-limit those writes for minutes. The sync now records the fingerprint of each successfully synced command set, respects Discord retry-after signals, and spaces mutation writes apart instead of issuing them back to back.
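As a rough illustration of the fingerprint and cooldown logic described above, the sketch below hashes a canonical form of the desired command set and decides whether a sync is needed. The state keys (`synced_fingerprint`, `cooldown_until`) and all function names are hypothetical; the actual structure used in gateway/platforms/discord.py may differ.

```python
import hashlib
import json
import time
from typing import Any, Dict, List, Optional


def command_fingerprint(commands: List[Dict[str, Any]]) -> str:
    """Hash the command set so that ordering differences do not matter."""
    canonical = json.dumps(
        sorted(commands, key=lambda c: c.get("name", "")),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_sync(
    state: Dict[str, Any], fingerprint: str, now: Optional[float] = None
) -> bool:
    """Skip while a cooldown is active or when this set already synced."""
    now = time.time() if now is None else now
    if now < state.get("cooldown_until", 0.0):
        return False
    return state.get("synced_fingerprint") != fingerprint


def record_rate_limit(
    state: Dict[str, Any], retry_after: float, now: Optional[float] = None
) -> None:
    """Remember when the retry-after window reported by the API ends."""
    now = time.time() if now is None else now
    state["cooldown_until"] = now + max(0.0, retry_after)


def record_success(state: Dict[str, Any], fingerprint: str) -> None:
    """Store the fingerprint of the set that synced and clear any cooldown."""
    state["synced_fingerprint"] = fingerprint
    state.pop("cooldown_until", None)
```

With this shape, a reconnect that wants the same commands as the last successful sync becomes a no-op, and a rate-limited attempt is not retried until the recorded cooldown has passed.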

Related Issue

N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • hermes_cli/gateway.py

    • Avoids the misleading reload-or-restart path for gateway restarts.
    • Uses the live gateway PID, falling back to systemd MainPID when runtime status is not yet available.
    • Requests graceful SIGUSR1 shutdown, then kicks systemctl restart to avoid sitting in systemd's RestartSec auto-restart window after planned exit code 75.
    • Waits for a replacement PID and matching gateway runtime status before printing restart success.
    • Reports systemd start-limit/rate-limit state in both restart and status output.
  • gateway/platforms/discord.py

    • Records slash-command sync state under Hermes home with an atomic JSON write.
    • Skips reconnect syncs when the desired slash-command fingerprint already synced successfully.
    • Persists Discord retry-after cooldowns when slash-command sync is rate-limited.
    • Adds a fixed delay between Discord slash-command mutation writes.
  • tests/hermes_cli/test_gateway_service.py

    • Covers restart readiness waiting, systemd MainPID fallback, immediate restart kick after graceful exit, and start-limit reporting.
  • tests/gateway/test_discord_connect.py

    • Covers fingerprint-based sync skipping, retry-after cooldown handling, and paced Discord command mutations.
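The atomic JSON write mentioned for the slash-command sync state is commonly done by writing to a temporary file in the same directory and renaming it over the target. The sketch below shows that general pattern under assumed names (`write_json_atomic`, `read_json`); it is not taken from the PR's diff.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_json_atomic(path: Path, data: Dict[str, Any]) -> None:
    """Write JSON so readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on
    # one filesystem, which is what makes the rename atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def read_json(path: Path) -> Dict[str, Any]:
    """Load the state file, treating a missing or corrupt file as empty."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

Treating an unreadable state file as empty means the worst case is one extra sync attempt rather than a crash during reconnect.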

How to Test

  1. Run targeted gateway and Discord tests: .venv/bin/pytest tests/gateway/test_discord_connect.py tests/hermes_cli/test_gateway_service.py -n 4
  2. Run syntax/diff checks: .venv/bin/python -m py_compile gateway/platforms/discord.py hermes_cli/gateway.py and git diff --check
  3. Manually test on a systemd user-service gateway: hermes gateway restart, then hermes gateway status, then verify a messaging-platform slash command responds after the restart message lands.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu/WSL systemd user service

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

Targeted checks before opening the draft PR:

  • .venv/bin/python -m py_compile gateway/platforms/discord.py hermes_cli/gateway.py
  • .venv/bin/pytest tests/gateway/test_discord_connect.py -k "post_connect_initialization or safe_sync" -n 4 — 8 passed
  • .venv/bin/pytest tests/gateway/test_discord_connect.py tests/hermes_cli/test_gateway_service.py -n 4 — 132 passed
  • git diff --check

Full suite after opening the draft PR:

  • scripts/run_tests.sh — failed: 46 failed, 20038 passed, 51 skipped, 229 warnings in 529.48s
  • The failing tests are outside the changed gateway restart / Discord sync test files. Failures include cron prompt handling, gateway approval blocking, auxiliary client/provider selection, Bedrock beta headers, gateway config parsing, DingTalk card lifecycle, browser Chromium checks, update/restart tests, delegation credential resolution, skill provenance, and sandbox environment tests.

Related PRs checked and not duplicative:

teknium1 (Contributor) commented May 7, 2026

Merged via #20949. Your commit was cherry-picked onto current main with your authorship preserved (d797755), plus two small follow-ups on top: narrowed the rate-limit except Exception to actual Discord 429s so unrelated failures don't get swallowed, and moved the sync-state JSON under $HERMES_HOME/gateway/. Thanks for the fix and the rate-limit analysis — this was exactly what Gille was hitting.
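The narrowing described above (handling only real 429 responses instead of a blanket `except Exception`) might look roughly like the sketch below. `FakeHTTPError` is a hypothetical stand-in for the HTTP client's exception type and `sync_with_rate_limit_guard` is an invented name; neither is taken from #20949.

```python
from typing import Callable, Optional


class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP client error with a status code."""

    def __init__(self, status: int, retry_after: Optional[float] = None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after


def sync_with_rate_limit_guard(do_sync: Callable[[], None]) -> Optional[float]:
    """Run do_sync; return retry-after seconds on a 429, None on success.

    Any other failure propagates so it is not mistaken for a rate limit.
    """
    try:
        do_sync()
    except FakeHTTPError as exc:
        if exc.status != 429:
            raise
        return exc.retry_after if exc.retry_after is not None else 0.0
    return None
```

The point of the narrower handler is that a 500 response or a programming error surfaces immediately instead of being recorded as a cooldown.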

teknium1 closed this May 7, 2026

alt-glitch added labels May 7, 2026:

  • type/bug: Something isn't working
  • comp/gateway: Gateway runner, session dispatch, delivery
  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • platform/discord: Discord bot adapter
  • P2: Medium, degraded but workaround exists

alt-glitch (Collaborator) commented

Superseded by #20949 (merged), which cherry-picks this PR's changes plus follow-up improvements.

