Bug Report: Gateway Hang on Clean Exit / Restart Race with Stale PID
Observed Behavior
- Gateway Telegram bot stops responding to messages
- systemctl --user restart hermes-gateway times out (60s)
- Process exits cleanly after SIGTERM drain timeout ("Gateway stopped" with exit code 0)
- Systemd (Restart=on-failure) does not restart because exit 0 = success
- Stale ~/.hermes/gateway.pid blocks any future start ("Gateway already...")
- Gateway stays dead until manual kill -9 + service restart
Root Cause
- Restart policy too narrow: Restart=on-failure misses clean exits
- No PID cleanup on stop: Stale PID file causes race condition on restart
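To make the first point concrete, here is a simplified model of how the Restart= policy maps a plain exit code to a restart decision. It is only a sketch: the function name should_restart is made up for illustration, and it ignores signals, timeouts, the watchdog, and SuccessExitStatus=, all of which systemd also takes into account.

```python
# Simplified model of systemd's Restart= decision for a process that
# exited on its own with a plain exit code. Signals, timeouts and
# SuccessExitStatus= are deliberately left out of this sketch.

def should_restart(policy: str, exit_code: int) -> bool:
    clean = exit_code == 0
    if policy == "always":
        return True
    if policy == "on-failure":
        return not clean
    if policy == "on-success":
        return clean
    if policy == "no":
        return False
    raise ValueError(f"policy not covered by this sketch: {policy!r}")


print(should_restart("on-failure", 0))  # False: the gateway stays down
print(should_restart("always", 0))      # True: the gateway comes back
```

With the gateway exiting 0 after the drain timeout, only the second policy brings it back.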
Environment
- hermes-agent commit: (current main)
- OS: Debian 13 (trixie) aarch64
- Runtime: systemd user service
Fix Applied
1. PID cleanup script (~/scripts/hermes-gateway-pid-cleanup.sh)
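#!/usr/bin/env python3
"""Remove the gateway PID file when it is unreadable or names a dead process."""
import json
import os
import sys

PID_FILE = "/home/ramit/.hermes/gateway.pid"


def main():
    if not os.path.exists(PID_FILE):
        sys.exit(0)
    try:
        with open(PID_FILE, "r") as f:
            data = json.load(f)
        pid = data.get("pid")
    except (json.JSONDecodeError, OSError, AttributeError):
        # Unreadable, malformed, or not a JSON object: treat as stale.
        os.remove(PID_FILE)
        sys.exit(0)
    exists = False
    if isinstance(pid, int) and pid > 0:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the process exists
            exists = True
        except ProcessLookupError:
            exists = False
        except PermissionError:
            exists = True  # process exists but belongs to another user
    if not exists:
        os.remove(PID_FILE)


if __name__ == "__main__":
    main()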
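For anyone who wants to try the stale-PID logic without touching the real PID file, the following is a parameterized variant of the same idea. It is a sketch under assumptions: the names pid_is_alive, cleanup_stale_pid, and demo are made up for illustration and are not part of hermes-agent, and it assumes the PID file is a JSON object with a "pid" key, as the script above does.

```python
import json
import os
import subprocess
import sys
import tempfile


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def cleanup_stale_pid(pid_file: str) -> bool:
    """Remove pid_file if it is malformed or names a dead process.

    Returns True if the file was removed.
    """
    if not os.path.exists(pid_file):
        return False
    try:
        with open(pid_file, "r") as f:
            pid = json.load(f).get("pid")
    except (json.JSONDecodeError, OSError, AttributeError):
        os.remove(pid_file)
        return True
    if isinstance(pid, int) and pid > 0 and pid_is_alive(pid):
        return False
    os.remove(pid_file)
    return True


def demo() -> tuple:
    # A child that has already exited and been reaped gives us a dead PID.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    with tempfile.TemporaryDirectory() as tmp:
        stale = os.path.join(tmp, "stale.pid")
        live = os.path.join(tmp, "live.pid")
        with open(stale, "w") as f:
            json.dump({"pid": child.pid}, f)
        with open(live, "w") as f:
            json.dump({"pid": os.getpid()}, f)  # this process is alive
        return cleanup_stale_pid(stale), cleanup_stale_pid(live)
```

Calling demo() should remove the file that points at the exited child and leave the one that points at the running interpreter.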
2. Patched systemd unit (~/.config/systemd/user/hermes-gateway.service)
[Unit]
Description=Hermes Agent Gateway - Messaging Platform Integration
After=network.target
StartLimitIntervalSec=600
StartLimitBurst=5
[Service]
Type=simple
ExecStart=/home/ramit/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
ExecStartPre=/home/ramit/.hermes/hermes-agent/venv/bin/python /home/ramit/scripts/hermes-gateway-pid-cleanup.sh
WorkingDirectory=/home/ramit/.hermes/hermes-agent
Environment="PATH=/home/ramit/.hermes/hermes-agent/venv/bin:/home/ramit/.hermes/hermes-agent/node_modules/.bin:/home/ramit/.nvm/versions/node/v24.14.0/bin:/home/ramit/.local/bin:/home/ramit/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="VIRTUAL_ENV=/home/ramit/.hermes/hermes-agent/venv"
Environment="HERMES_HOME=/home/ramit/.hermes"
Restart=always
RestartSec=30
RestartForceExitStatus=75
ExecStopPost=/home/ramit/.hermes/hermes-agent/venv/bin/python /home/ramit/scripts/hermes-gateway-pid-cleanup.sh
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec=60
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=default.target
Key Changes
| Directive | Before | After | Purpose |
|---|---|---|---|
| Restart | on-failure | always | Restart even after clean exit (exit 0) |
| ExecStartPre | (none) | cleanup script | Remove stale PID before start |
| ExecStopPost | (none) | cleanup script | Remove stale PID after any stop |
Verification
- daemon-reload + restart: service active, Telegram reconnected
- ExecStartPre exits 0/SUCCESS
- No stale PID race observed
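The checks above were done by hand. A scripted version might look like the sketch below, which reads a few properties from systemctl --user show and confirms the unit is active with the new restart policy. The helper names (parse_show_output, query_unit, looks_healthy) are hypothetical, and the unit name hermes-gateway is taken from the restart command earlier in this report.

```python
import subprocess


def parse_show_output(text: str) -> dict:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def query_unit(unit: str = "hermes-gateway") -> dict:
    """Ask the user-level systemd instance for a few unit properties."""
    result = subprocess.run(
        ["systemctl", "--user", "show", unit,
         "-p", "ActiveState", "-p", "Restart", "-p", "NRestarts"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)


def looks_healthy(props: dict) -> bool:
    """True when the unit is running and uses the patched restart policy."""
    return (props.get("ActiveState") == "active"
            and props.get("Restart") == "always")
```

On the affected machine, looks_healthy(query_unit()) would be expected to return True after daemon-reload and a restart. NRestarts is included only so that unexpected restart loops are visible in the returned dictionary.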
Suggested Upstream Action
- Ship scripts/hermes-gateway-pid-cleanup.py in repo
- Update sample systemd unit in docs/install.md with Restart=always + ExecStartPre/ExecStopPost