MCP circuit breaker open state can permanently wedge gateway when stdio subprocess dies #16788

## Summary

When a long-running gateway session (`hermes gateway run`) trips the MCP
circuit breaker for a stdio-transport server and the underlying subprocess
has died, the breaker's half-open probe never respawns the subprocess. It
retries through the closed pipe, so every probe fails, the breaker re-opens,
and the cycle repeats indefinitely. The MCP server stays broken for the
lifetime of the gateway process.

Verified on **v0.11.0 (2026.4.23)**, single-host deployment, Python 3.11.
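
The loop described above can be modelled with a toy breaker. This is a minimal sketch, not the hermes implementation: `ToyBreaker` and `simulate_wedge` are hypothetical names, and only the threshold (3) and cooldown (60s) values come from this report. It shows that when the underlying call can never succeed, each half-open probe just restarts the cooldown.

```python
import time

THRESHOLD = 3        # mirrors _CIRCUIT_BREAKER_THRESHOLD from this report
COOLDOWN_SEC = 60.0  # mirrors _CIRCUIT_BREAKER_COOLDOWN_SEC from this report


class ToyBreaker:
    """Hypothetical model of a count-based breaker with a cooldown."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.errors = 0
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= COOLDOWN_SEC:
            return "half-open"
        return "open"

    def call(self, fn):
        if self.state() == "open":
            raise RuntimeError("breaker open")
        try:
            result = fn()
        except Exception:
            self.errors += 1
            if self.errors >= THRESHOLD:
                self.opened_at = self.clock()  # (re)open and restart the cooldown
            raise
        # Success closes the breaker and clears the error count.
        self.errors = 0
        self.opened_at = None
        return result


def simulate_wedge(cycles):
    """Return the breaker states seen before and after each half-open probe."""
    now = [0.0]  # fake clock so the simulation does not sleep
    breaker = ToyBreaker(clock=lambda: now[0])

    def dead_pipe_call():
        # Stand-in for a tool call sent to a subprocess that has exited.
        raise BrokenPipeError("stdio subprocess is gone")

    for _ in range(THRESHOLD):  # trip the breaker
        try:
            breaker.call(dead_pipe_call)
        except BrokenPipeError:
            pass

    states = []
    for _ in range(cycles):
        now[0] += COOLDOWN_SEC
        states.append(breaker.state())  # before the probe
        try:
            breaker.call(dead_pipe_call)
        except BrokenPipeError:
            pass
        states.append(breaker.state())  # after the probe
    return states
```

In this model, `simulate_wedge(n)` alternates between `"half-open"` and `"open"` for any `n`, because nothing in the probe path replaces the dead transport.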

## Reproduction

1. Configure a stdio MCP server whose subprocess can crash mid-session
   (e.g. `mcr.microsoft.com/playwright/mcp:v0.0.70` via `command: docker`,
   where Chrome can OOM under bursty asset loads).
2. Trigger ≥ `_CIRCUIT_BREAKER_THRESHOLD` (3) consecutive tool-call
   failures that take down the subprocess. In our case: a Squid forward
   proxy returned 403 for a CDN, Chrome went into a 387 req/sec retry
   storm, MCP tool calls timed out.
3. Wait ≥ `_CIRCUIT_BREAKER_COOLDOWN_SEC` (60s) so that the breaker enters
   half-open and fires a probe.

## Expected

The half-open probe detects that the subprocess pipe is dead and respawns the
subprocess before calling, or it returns a clean error to the model so the
model stops calling.
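
For the "clean error" option, one possible cause of the empty message is an exception whose `str()` is empty (for example `BrokenPipeError()` raised without arguments). That cause is an assumption; this report has not confirmed where the empty string originates. Under that assumption, a hypothetical helper such as `describe_error` below would keep the message non-empty:

```python
def describe_error(exc):
    """Return a non-empty description even when str(exc) is empty.

    Hypothetical helper for illustration; not part of hermes.
    """
    text = str(exc).strip()
    name = type(exc).__name__
    # Fall back to the exception class name when there is no message.
    return f"{name}: {text}" if text else name
```

With this, a failure with no message would be reported as `BrokenPipeError` instead of an empty string after the colon.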

## Actual

- The probe call goes to the dead pipe and returns an empty error.
- `~/.hermes/logs/errors.log` shows: `MCP tool playwright/browser_navigate call failed:` (empty after the colon, repeating every ~60s)
- `docker ps -a --filter ancestor=<image>` returns **zero** entries: the gateway never respawns the container.
- Gateway-bound chats: the agent keeps reporting a "browser crashed, recovering ~58s" message indefinitely.
- `hermes mcp test playwright` from the CLI **succeeds in <1s**, which shows
  that the config is correct and the `docker run` invocation works, so the
  fault lies in the gateway's in-process state.

## Workaround

`systemctl --user restart hermes-gateway.service` — fully resets the
in-process breaker dicts (`_server_error_counts`, `_server_breaker_opened_at`).
Brief gateway downtime (~10-30s).

## Suggested fix

In `tools/mcp_tool.py` around the half-open probe path, before reusing the
existing session, check whether the underlying transport is still alive. For
stdio, `Popen.poll()` returns `None` while the subprocess is running, so a
non-`None` result means it has exited. If the transport is dead, force a fresh
session via the existing reconnect path (`_reconnect_event`).

Refs: `_CIRCUIT_BREAKER_THRESHOLD`, `_CIRCUIT_BREAKER_COOLDOWN_SEC`,
`_bump_server_error`, `_reset_server_error` at `tools/mcp_tool.py:1377-1404`.
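
A minimal sketch of that liveness check, assuming the transport is backed by a `subprocess.Popen` object. How hermes actually holds the child process is not shown in this report, so `stdio_transport_alive`, `probe_with_liveness_check`, `spawn_server`, and the `respawn`/`call` callables are all hypothetical. If the transport uses asyncio subprocesses instead, the analogous check would be `process.returncode is None`.

```python
import subprocess
import sys


def stdio_transport_alive(proc):
    """True while the child is running; Popen.poll() returns None until it exits."""
    return proc.poll() is None


def probe_with_liveness_check(proc, respawn, call):
    """Run a half-open probe, replacing a dead child first.

    `respawn` and `call` are hypothetical callables standing in for the
    gateway's reconnect path and the real tool call.
    """
    if not stdio_transport_alive(proc):
        proc = respawn()
    return proc, call(proc)


def spawn_server():
    # Stand-in for the real MCP server: blocks until its stdin is closed.
    return subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdin.read()"],
        stdin=subprocess.PIPE,
    )


def demo():
    proc = spawn_server()
    proc.kill()   # simulate the subprocess dying mid-session
    proc.wait()
    was_alive = stdio_transport_alive(proc)
    proc, result = probe_with_liveness_check(proc, spawn_server, lambda p: "ok")
    now_alive = stdio_transport_alive(proc)
    proc.stdin.close()  # let the replacement child exit cleanly
    proc.wait()
    return was_alive, now_alive, result
```

The point of the sketch is the ordering: the liveness check runs before the probe call, so a dead child is replaced rather than probed.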

## Notes

- No data loss — once gateway restarts, the persistent profile volume
  picks up where it left off (logins survive)
- Worked around locally with a systemd timer that watchdogs the empty-error
  pattern in `errors.log` and restarts the gateway. Happy to share if useful.
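
The detection half of such a watchdog could look roughly like the sketch below. This is not the actual local script: the regex is inferred from the single log line quoted above, the rest of the line format (timestamps, log level) is unknown, and `min_hits` is an arbitrary illustrative threshold.

```python
import re

# Matches "MCP tool <name> call failed:" with nothing after the colon.
EMPTY_FAILURE = re.compile(r"MCP tool \S+ call failed:\s*$")


def count_empty_failures(lines):
    """Count log lines that end right after 'call failed:'."""
    return sum(1 for line in lines if EMPTY_FAILURE.search(line))


def should_restart(lines, min_hits=3):
    """Hypothetical policy: restart once enough empty failures are seen."""
    return count_empty_failures(lines) >= min_hits
```

A timer unit could feed the tail of `~/.hermes/logs/errors.log` into `should_restart` and, when it returns `True`, run the `systemctl --user restart hermes-gateway.service` command from the workaround.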
