[Bug]: Gateway hang on clean exit / restart race with stale PID #14176

@ramit-mitra

Bug Report: Gateway Hang on Clean Exit / Restart Race with Stale PID

Observed Behavior

  • Gateway Telegram bot stops responding to messages
  • systemctl --user restart hermes-gateway times out (60s)
  • Process exits cleanly after SIGTERM drain timeout ("Gateway stopped" with exit code 0)
  • Systemd (Restart=on-failure) does not restart because exit 0 = success
  • Stale ~/.hermes/gateway.pid blocks any future start ("Gateway already...")
  • Gateway stays dead until manual kill -9 + service restart

Root Cause

  1. Restart policy too narrow: Restart=on-failure misses clean exits
  2. No PID cleanup on stop: Stale PID file causes race condition on restart
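
To make root cause 1 concrete, here is a toy model of how the Restart= setting maps an exit code to a restart decision. It is a simplified sketch based on the documented behaviour, not systemd's implementation: it looks at exit codes only and ignores signals, timeouts, the watchdog, SuccessExitStatus= and manual stops. The function name `should_restart` is made up for illustration.

```python
def should_restart(policy, exit_code, force_statuses=frozenset()):
    """Simplified model of systemd's Restart= decision, exit codes only."""
    # RestartForceExitStatus= forces a restart regardless of Restart=.
    if exit_code in force_statuses:
        return True
    clean = exit_code == 0
    if policy == "always":
        return True
    if policy == "on-failure":
        return not clean
    if policy == "on-success":
        return clean
    if policy == "no":
        return False
    # on-abnormal, on-abort and on-watchdog depend on signals or timeouts,
    # which this exit-code-only model does not cover.
    raise ValueError(f"policy not modelled: {policy}")


if __name__ == "__main__":
    # The gateway exits 0 after the drain timeout.
    print(should_restart("on-failure", 0))  # old unit
    print(should_restart("always", 0))      # patched unit
```

With the old unit the clean exit is treated as success and nothing restarts the gateway; with Restart=always the same exit triggers a restart after RestartSec.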

Environment

  • hermes-agent commit: (current main)
  • OS: Debian 13 (trixie) aarch64
  • Runtime: systemd user service

Fix Applied

1. PID cleanup script (~/scripts/hermes-gateway-pid-cleanup.sh; a Python script despite the .sh extension)

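#!/usr/bin/env python3
"""Remove the gateway PID file when it does not point at a live process."""
import json
import os
import sys

PID_FILE = "/home/ramit/.hermes/gateway.pid"


def remove_pid_file():
    # Tolerate the file vanishing between the check and the removal.
    try:
        os.remove(PID_FILE)
    except FileNotFoundError:
        pass


def main():
    if not os.path.exists(PID_FILE):
        sys.exit(0)
    try:
        with open(PID_FILE, "r") as f:
            data = json.load(f)
        pid = data.get("pid")
    except (ValueError, AttributeError, OSError):
        # Unreadable or malformed PID file: treat it as stale.
        remove_pid_file()
        sys.exit(0)
    exists = False
    # Only probe positive integer PIDs; 0 or negative values would
    # address a process group instead of a single process.
    if isinstance(pid, int) and pid > 0:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the process exists
            exists = True
        except ProcessLookupError:
            exists = False
        except PermissionError:
            # The process exists but belongs to another user.
            exists = True
    if not exists:
        remove_pid_file()


if __name__ == "__main__":
    main()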
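
The staleness check can be exercised without touching the real PID file. The sketch below assumes the same JSON layout the script reads (an object with a "pid" key) and runs the check against a temporary file for three cases: a live PID, the PID of a child that has already exited, and a corrupt file. The helper names `pid_is_alive`, `is_stale` and `demo` are hypothetical.

```python
import json
import os
import subprocess
import sys
import tempfile


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def is_stale(pid_file):
    """A PID file is stale if it is unreadable or names a dead process."""
    try:
        with open(pid_file) as f:
            pid = json.load(f).get("pid")
    except (OSError, ValueError, AttributeError):
        return True
    return not (isinstance(pid, int) and pid > 0 and pid_is_alive(pid))


def demo():
    # Start a short-lived child and reap it so its PID is no longer in use.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    cases = [
        ("live", json.dumps({"pid": os.getpid()})),
        ("dead", json.dumps({"pid": child.pid})),
        ("corrupt", "not json"),
    ]
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "gateway.pid")
        for label, content in cases:
            with open(path, "w") as f:
                f.write(content)
            results[label] = is_stale(path)
    return results


if __name__ == "__main__":
    print(demo())
```

The "dead" case relies on the kernel not reusing the child's PID immediately, which holds in practice for a check made right after the child exits.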

2. Patched systemd unit (~/.config/systemd/user/hermes-gateway.service)

[Unit]
Description=Hermes Agent Gateway - Messaging Platform Integration
After=network.target
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Type=simple
ExecStart=/home/ramit/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
ExecStartPre=/home/ramit/.hermes/hermes-agent/venv/bin/python /home/ramit/scripts/hermes-gateway-pid-cleanup.sh
WorkingDirectory=/home/ramit/.hermes/hermes-agent
Environment="PATH=/home/ramit/.hermes/hermes-agent/venv/bin:/home/ramit/.hermes/hermes-agent/node_modules/.bin:/home/ramit/.nvm/versions/node/v24.14.0/bin:/home/ramit/.local/bin:/home/ramit/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="VIRTUAL_ENV=/home/ramit/.hermes/hermes-agent/venv"
Environment="HERMES_HOME=/home/ramit/.hermes"
Restart=always
RestartSec=30
RestartForceExitStatus=75
ExecStopPost=/home/ramit/.hermes/hermes-agent/venv/bin/python /home/ramit/scripts/hermes-gateway-pid-cleanup.sh
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec=60
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=default.target

Key Changes

| Directive | Before | After | Purpose |
|---|---|---|---|
| Restart | on-failure | always | Restart even after clean exit (exit 0) |
| ExecStartPre | (not set) | cleanup script | Remove stale PID before start |
| ExecStopPost | (not set) | cleanup script | Remove stale PID after any stop |

Verification

  • daemon-reload + restart: service active, Telegram reconnected
  • ExecStartPre exits 0/SUCCESS
  • No stale PID race observed
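
The checks above can be scripted. A minimal sketch, assuming the unit is named hermes-gateway as in this report: it reads a few properties via `systemctl --user show` and parses the KEY=VALUE output. The function names are hypothetical, and which properties are worth checking is a judgment call.

```python
import subprocess


def parse_show_output(text):
    """Parse the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            props[key] = value
    return props


def gateway_status(unit="hermes-gateway"):
    """Query the user service manager for the unit's state."""
    out = subprocess.run(
        [
            "systemctl", "--user", "show", unit,
            "--property=ActiveState,Restart,ExecMainStatus",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_show_output(out)


if __name__ == "__main__":
    status = gateway_status()
    print(status)
    # After the patch one would expect Restart to report "always".
    if status.get("Restart") != "always":
        print("warning: unit is not using Restart=always")
```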

Suggested Upstream Action

  1. Ship scripts/hermes-gateway-pid-cleanup.py in repo
  2. Update sample systemd unit in docs/install.md with Restart=always + ExecStartPre/ExecStopPost
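
For an upstream version of the script, the hardcoded /home/ramit path would need to go. A possible sketch, assuming the gateway keeps gateway.pid directly under HERMES_HOME (the unit above sets HERMES_HOME=/home/ramit/.hermes, which matches the observed path) and falls back to ~/.hermes when the variable is unset. Both the fallback and the function name `pid_file_path` are assumptions, not confirmed hermes-agent behaviour.

```python
import os
from pathlib import Path


def pid_file_path(environ=None):
    """Resolve the gateway PID file location from HERMES_HOME (assumed layout)."""
    env = os.environ if environ is None else environ
    home = env.get("HERMES_HOME")
    # Assumption: default to ~/.hermes when HERMES_HOME is unset or empty.
    base = Path(home) if home else Path.home() / ".hermes"
    return base / "gateway.pid"


if __name__ == "__main__":
    print(pid_file_path())
```

Because the unit already exports HERMES_HOME, ExecStartPre and ExecStopPost would pick up the same location without further configuration.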

Metadata

Labels: P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), type/bug (Something isn't working)
