MCP circuit breaker open state can permanently wedge gateway when stdio subprocess dies #16788

@jagga99

Summary

When a long-running gateway session (hermes gateway run) trips the MCP
circuit breaker for a stdio-transport server and the underlying subprocess
has died, the breaker's half-open probe never respawns the subprocess —
it just retries through a closed pipe. Result: every probe re-fails →
breaker re-opens → loop forever. The MCP server is permanently broken
for the lifetime of the gateway process.
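
To make the loop concrete, here is a toy model of the state machine described above. It is not the code in tools/mcp_tool.py: the `Breaker` class and `simulate` function are hypothetical names, and only the threshold (3) and cooldown (60s) values are taken from this issue. It shows that if the session is never replaced, every half-open probe fails and the breaker re-opens.

```python
import time

# Hypothetical stand-ins; the values mirror the constants named in this issue.
THRESHOLD = 3
COOLDOWN_SEC = 60.0


class Breaker:
    """Toy circuit breaker: closed -> open -> half-open probe."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.error_count = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_call(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the cooldown has elapsed.
        return self.clock() - self.opened_at >= COOLDOWN_SEC

    def record(self, ok):
        if ok:
            self.error_count = 0
            self.opened_at = None
            return
        self.error_count += 1
        if self.error_count >= THRESHOLD:
            self.opened_at = self.clock()  # (re)open and restart the cooldown


def simulate(probes, session_alive=False):
    """Run `probes` half-open probes against a session that is never replaced.

    Returns (number of successful probes, whether the breaker is still open).
    """
    now = [0.0]
    breaker = Breaker(clock=lambda: now[0])
    for _ in range(THRESHOLD):
        breaker.record(ok=False)  # the failures that took down the subprocess
    successes = 0
    for _ in range(probes):
        now[0] += COOLDOWN_SEC
        if breaker.allow_call():
            ok = session_alive  # dead pipe: the probe cannot succeed
            breaker.record(ok)
            if ok:
                successes += 1
    return successes, breaker.opened_at is not None
```

With the default `session_alive=False`, `simulate(5)` never records a success and leaves the breaker open, which matches the behaviour reported here.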

Verified on v0.11.0 (2026.4.23), single-host deployment, Python 3.11.

Reproduction

  1. Configure a stdio MCP server whose subprocess can crash mid-session
    (e.g. mcr.microsoft.com/playwright/mcp:v0.0.70 via command: docker,
    where Chrome can OOM under bursty asset loads).
  2. Trigger ≥ _CIRCUIT_BREAKER_THRESHOLD (3) consecutive tool-call
    failures that take down the subprocess. In our case: a Squid forward
    proxy returned 403 for a CDN, Chrome went into a 387 req/sec retry
    storm, MCP tool calls timed out.
  3. Wait ≥ _CIRCUIT_BREAKER_COOLDOWN_SEC (60s) — breaker enters
    half-open, fires probe.

Expected

The probe detects that the subprocess pipe is dead and respawns the
subprocess before calling, OR returns a clean error to the model so it
stops calling.

Actual

  • Probe call goes to the dead pipe → returns empty error
  • ~/.hermes/logs/errors.log shows: MCP tool playwright/browser_navigate call failed: (empty after the colon, repeating every ~60s)
  • docker ps -a --filter ancestor=<image> returns zero entries — gateway never re-spawns the container
  • Gateway-bound chats: agent reports a "browser crashed, recovering ~58s" message forever
  • hermes mcp test playwright from CLI succeeds in <1s — proves the
    config is correct, the docker run works, and the issue is gateway
    in-process state

Workaround

systemctl --user restart hermes-gateway.service — fully resets the
in-process breaker dicts (_server_error_counts, _server_breaker_opened_at).
Brief gateway downtime (~10-30s).

Suggested fix

In tools/mcp_tool.py around the half-open probe path, before reusing the
existing session, check whether the underlying transport is still alive
(e.g. for stdio, the subprocess's poll() returns None while it is running).
If it is not, force a fresh session via the existing reconnect path
(_reconnect_event).

Refs: _CIRCUIT_BREAKER_THRESHOLD, _CIRCUIT_BREAKER_COOLDOWN_SEC,
_bump_server_error, _reset_server_error at tools/mcp_tool.py:1377-1404.
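
A minimal sketch of the liveness check suggested above, assuming the probe path can reach a `subprocess.Popen`-like handle for the stdio server. That is an assumption: the real session may wrap the process differently. `stdio_transport_alive`, `probe`, `call_tool` and `reconnect` are hypothetical names used for illustration, not functions in tools/mcp_tool.py.

```python
import subprocess
import sys


def stdio_transport_alive(proc):
    """Return True while the child is running (poll() is None until it exits)."""
    return proc is not None and proc.poll() is None


def probe(proc, call_tool, reconnect):
    """Half-open probe that refuses to reuse a dead stdio session.

    `call_tool` and `reconnect` are hypothetical callables standing in for the
    real tool call and for whatever the gateway's reconnect path does.
    """
    if not stdio_transport_alive(proc):
        proc = reconnect()  # respawn before probing, not after another failure
    return proc, call_tool(proc)


def demo():
    """Spawn a child that exits immediately and confirm it is seen as dead."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    return stdio_transport_alive(child)
```

`demo()` returns False because `poll()` reports an exit code once the child has terminated. The same check placed ahead of the probe call would route a dead stdio server into the reconnect path instead of writing to a closed pipe.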

Notes

  • No data loss — once gateway restarts, the persistent profile volume
    picks up where it left off (logins survive)
  • Worked around locally with a systemd timer that watchdogs the empty-error
    pattern in errors.log and restarts the gateway. Happy to share if useful.
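
For illustration, a watchdog along these lines could look roughly like the sketch below. This is not the script mentioned above: the regex, the threshold, and the function names are assumptions, and only the log path, the log message shape, and the restart command come from this issue.

```python
import re
import subprocess
from pathlib import Path

LOG_PATH = Path.home() / ".hermes" / "logs" / "errors.log"
# Matches "MCP tool <server>/<tool> call failed:" with nothing after the colon.
EMPTY_FAILURE = re.compile(r"MCP tool \S+ call failed:\s*$")
RESTART_CMD = ["systemctl", "--user", "restart", "hermes-gateway.service"]


def count_empty_failures(lines):
    """Count log lines that show the empty-error pattern."""
    return sum(1 for line in lines if EMPTY_FAILURE.search(line))


def should_restart(lines, threshold=3):
    """Assumed policy: restart once `threshold` empty failures are seen."""
    return count_empty_failures(lines) >= threshold


def main(tail=200):
    """Check the last `tail` log lines and restart the gateway if wedged."""
    if not LOG_PATH.exists():
        return
    lines = LOG_PATH.read_text(errors="replace").splitlines()[-tail:]
    if should_restart(lines):
        subprocess.run(RESTART_CMD, check=False)
```

A systemd timer would call `main()` periodically. As written, the sketch does not remember which lines it has already seen, so a real watchdog would also need to track a file offset or timestamps to avoid restarting repeatedly on old entries.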

Labels

  • P1: High (major feature broken, no workaround)
  • comp/gateway: Gateway runner, session dispatch, delivery
  • tool/mcp: MCP client and OAuth
  • type/bug: Something isn't working
