Summary
When a long-running gateway session (hermes gateway run) trips the MCP
circuit breaker for a stdio-transport server and the underlying subprocess
has died, the breaker's half-open probe never respawns the subprocess —
it just retries through a closed pipe. Result: every probe re-fails →
breaker re-opens → loop forever. The MCP server is permanently broken
for the lifetime of the gateway process.
Verified on v0.11.0 (2026.4.23), single-host deployment, Python 3.11.
Reproduction
- Configure a stdio MCP server whose subprocess can crash mid-session
(e.g. mcr.microsoft.com/playwright/mcp:v0.0.70 via command: docker,
where Chrome can OOM under bursty asset loads).
- Trigger ≥ _CIRCUIT_BREAKER_THRESHOLD (3) consecutive tool-call
failures that take down the subprocess. In our case: a Squid forward
proxy returned 403 for a CDN, Chrome went into a 387 req/sec retry
storm, MCP tool calls timed out.
- Wait ≥ _CIRCUIT_BREAKER_COOLDOWN_SEC (60s); the breaker enters
half-open and fires a probe.
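The steps above can be modelled with a toy breaker. This is only a sketch of the state machine as described in this report, not the code in tools/mcp_tool.py: ToyBreaker, dead_pipe_tool and simulate are hypothetical names, and the only values taken from the gateway are the threshold (3) and the cooldown (60s). It shows why a probe that reuses a dead pipe can never close the breaker.

```python
from dataclasses import dataclass
from typing import Callable, Optional

THRESHOLD = 3        # mirrors _CIRCUIT_BREAKER_THRESHOLD
COOLDOWN_SEC = 60.0  # mirrors _CIRCUIT_BREAKER_COOLDOWN_SEC


@dataclass
class ToyBreaker:
    """Simplified closed / open / half-open breaker (illustrative only)."""
    error_count: int = 0
    opened_at: Optional[float] = None

    def state(self, now: float) -> str:
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= COOLDOWN_SEC:
            return "half-open"
        return "open"

    def call(self, tool: Callable[[], str], now: float) -> str:
        state = self.state(now)
        if state == "open":
            return "rejected"  # still cooling down, the tool is not called
        try:
            result = tool()
        except Exception:
            self.error_count += 1
            if state == "half-open" or self.error_count >= THRESHOLD:
                self.opened_at = now  # (re)open and restart the cooldown
            return "failed"
        # A success closes the breaker again.
        self.error_count = 0
        self.opened_at = None
        return result


def dead_pipe_tool() -> str:
    """Stands in for a tool call written to a pipe whose reader has exited."""
    raise BrokenPipeError("stdio subprocess has exited")


def simulate(probes: int) -> list[str]:
    """Trip the breaker, then run `probes` cooldown/probe cycles."""
    breaker = ToyBreaker()
    now = 0.0
    log = []
    for _ in range(THRESHOLD):
        log.append(breaker.call(dead_pipe_tool, now))
        now += 1.0
    for _ in range(probes):
        log.append(breaker.call(dead_pipe_tool, now))  # inside the cooldown
        now = breaker.opened_at + COOLDOWN_SEC
        log.append(breaker.call(dead_pipe_tool, now))  # half-open probe
        now += 1.0
    return log
```

Because nothing in the loop replaces the dead subprocess, every half-open probe lands in the except branch and restarts the cooldown, which matches the ~60s repeat interval seen in the log.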
Expected
The probe detects that the subprocess pipe is dead and respawns the
subprocess before calling, or returns a clean error to the model so it
stops calling.
Actual
- Probe call goes to the dead pipe and returns an empty error
- ~/.hermes/logs/errors.log shows: MCP tool playwright/browser_navigate call failed: (empty after the colon, repeating every ~60s)
- docker ps -a --filter ancestor=<image> returns zero entries; the gateway never re-spawns the container
- Gateway-bound chats: agent reports a "browser crashed, recovering ~58s" message forever
- hermes mcp test playwright from CLI succeeds in <1s, which proves the
config is correct, the docker run works, and the issue is gateway
in-process state
Workaround
systemctl --user restart hermes-gateway.service — fully resets the
in-process breaker dicts (_server_error_counts, _server_breaker_opened_at).
Brief gateway downtime (~10-30s).
Suggested fix
In tools/mcp_tool.py around the half-open probe path, before reusing the
existing session, check whether the underlying transport is still alive
(e.g. for stdio, subprocess.poll() returning a non-None value means the
process has exited) and, if it is dead, force a fresh session via the
existing reconnect path (_reconnect_event).
Refs: _CIRCUIT_BREAKER_THRESHOLD, _CIRCUIT_BREAKER_COOLDOWN_SEC,
_bump_server_error, _reset_server_error at tools/mcp_tool.py:1377-1404.
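A minimal sketch of the liveness check suggested above, assuming the stdio transport exposes its child as a subprocess.Popen-like object. The helper names (stdio_transport_is_alive, prepare_half_open_probe) are hypothetical and not taken from tools/mcp_tool.py. The event argument stands in for _reconnect_event; it is assumed here to behave like threading.Event, and its real type and wiring may differ.

```python
import subprocess
import sys
import threading


def stdio_transport_is_alive(proc: subprocess.Popen) -> bool:
    """Return True while the child process is still running.

    Popen.poll() returns None for a running process and the exit code
    once the process has terminated.
    """
    return proc.poll() is None


def prepare_half_open_probe(proc: subprocess.Popen,
                            reconnect_event: threading.Event) -> bool:
    """Hypothetical hook to run before the half-open probe.

    Returns True if the existing session can be probed as-is. If the child
    has exited, it signals the reconnect path and returns False so the
    caller does not send the probe through the dead pipe.
    """
    if stdio_transport_is_alive(proc):
        return True
    reconnect_event.set()  # assumption: setting the event triggers a respawn
    return False
```

With a check like this, the probe would either run against a live subprocess or be deferred until the reconnect path has produced a fresh session, instead of failing with an empty error.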
Notes
- No data loss — once gateway restarts, the persistent profile volume
picks up where it left off (logins survive)
- Worked around locally with a systemd timer that watchdogs the empty-error
pattern in errors.log and restarts the gateway. Happy to share if useful.
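For illustration, the detection half of such a watchdog could look roughly like the sketch below. This is not the watchdog mentioned above: the regular expression assumes the failure line ends right after "call failed:" as shown under Actual, any timestamp or level prefix is ignored, and the function names and the min_hits default are hypothetical.

```python
import re
from typing import Iterable

# Matches a failure line with nothing after the colon, e.g.
# "... MCP tool playwright/browser_navigate call failed:"
EMPTY_FAILURE = re.compile(r"MCP tool (\S+) call failed:\s*$")


def count_empty_failures(lines: Iterable[str]) -> int:
    """Count log lines that report a tool failure with an empty message."""
    return sum(1 for line in lines if EMPTY_FAILURE.search(line))


def should_restart(lines: Iterable[str], min_hits: int = 3) -> bool:
    """Decide whether the scanned lines look like the stuck-breaker loop."""
    return count_empty_failures(lines) >= min_hits
```

A timer job could feed the most recent lines of ~/.hermes/logs/errors.log to should_restart and, when it returns True, run systemctl --user restart hermes-gateway.service.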