fix: MCP circuit breaker recovery and HTTP keepalive (#16788, #17003) #17016

Closed
vominh1919 wants to merge 1 commit into NousResearch:main from vominh1919:fix/mcp-circuit-breaker-recovery

Conversation

@vominh1919
Contributor

Problem

Two related MCP reliability issues affect long-running gateway sessions:

1. Circuit breaker permanently blocks recovery (#16788)

When the MCP circuit breaker trips (3 consecutive failures), it blocks the server permanently with no recovery mechanism. If the underlying subprocess dies and later becomes available again, the breaker never allows a probe call through. The gateway must be restarted to recover.

2. HTTP connections go stale during idle periods (#17003)

_wait_for_lifecycle_event() blocks indefinitely without generating any traffic. After extended idle periods (~12h), TCP connections become stale. The next tool call then fails with an empty error message, giving no indication of the cause.

Fix

Circuit breaker half-open recovery (#16788)

  • Added _CIRCUIT_BREAKER_COOLDOWN_SEC = 60 — cooldown period before allowing a probe
  • Added _server_breaker_opened_at — tracks when the breaker tripped
  • After cooldown elapses, the handler allows one probe call through (half-open state)
  • If probe succeeds → error count resets, server is usable again
  • If probe fails → breaker re-opens with fresh cooldown
  • Added _bump_server_error() and _reset_server_error() helpers for consistent state management

HTTP keepalive (#17003)

  • _wait_for_lifecycle_event() now uses asyncio.wait() with a 3-minute timeout
  • On each timeout, sends a lightweight list_tools() keepalive to exercise the connection
  • If keepalive fails → triggers automatic reconnect via _reconnect_event
  • Prevents TCP connections from going stale during long idle periods

Testing

  • Verified syntax with ast.parse() on the modified file
  • Replaced all existing error-count tracking with the helper functions for consistency
  • Both fixes are backward-compatible — no config changes required
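
For reference, a syntax check of the kind mentioned in the first bullet could look like the sketch below. The helper name `syntax_ok` and the placeholder filename are hypothetical. `ast.parse()` only confirms that the source parses, so it does not exercise the breaker or keepalive behaviour.

```python
import ast


def syntax_ok(source: str, filename: str = "<modified file>") -> bool:
    """Return True if the source parses as valid Python."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError:
        return False
    return True
```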

Fixes #16788
Fixes #17003

- Add half-open recovery to circuit breaker (NousResearch#16788): after cooldown
  period (60s), allow a single probe call through instead of permanently
  blocking the server. If probe succeeds, reset error count; if it fails,
  re-open the breaker with a fresh cooldown.

- Add periodic keepalive to _wait_for_lifecycle_event() (NousResearch#17003): send
  list_tools() every 3 minutes to prevent TCP connections from going stale
  during long idle periods. If keepalive fails, trigger automatic reconnect.

Both fixes improve MCP server reliability for long-running gateway sessions
where subprocess crashes or network idle timeouts previously required a full
gateway restart to recover.
@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and tool/mcp (MCP client and OAuth) on Apr 28, 2026
@grubmeshi

@vominh1919 Thanks for your PR. I really need this as my MCP connections get lost easily. I hope you can resolve the merge conflicts soon and this gets released ASAP.

@vominh1919
Contributor Author

> @vominh1919 Thanks for your PR. I really need this as my MCP connections get lost easily. I hope you can resolve the merge conflicts soon and this gets released ASAP.

Waiting for the team to check. @teknium1 @alt-glitch

teknium1 pushed a commit that referenced this pull request May 5, 2026
Sends a lightweight list_tools() probe every 3 minutes during idle
periods to prevent TCP connections from going stale behind LB / NAT
idle timeouts (commonly 300-600s).  When the keepalive fails, the
reconnect event fires so the transport rebuilds the session cleanly.

Salvages the keepalive portion of @vominh1919's PR #17016. The
circuit-breaker half-open recovery from the same PR was independently
landed on main via @benbarclay's commit 8cc3ceb ("fix(mcp): add
half-open state to circuit breaker", Apr 21); only the keepalive is
salvaged here.

Fixes #17003.
chris-han pushed a commit to chris-han/hermes-agent that referenced this pull request May 6, 2026
nickdlkk pushed a commit to nickdlkk/hermes-agent that referenced this pull request May 11, 2026
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
Sends a lightweight list_tools() probe every 3 minutes during idle
periods to prevent TCP connections from going stale behind LB / NAT
idle timeouts (commonly 300-600s).  When the keepalive fails, the
reconnect event fires so the transport rebuilds the session cleanly.

Salvages the keepalive portion of @vominh1919's PR NousResearch#17016. The
circuit-breaker half-open recovery from the same PR was independently
landed on main via @benbarclay's commit 76bc64f ("fix(mcp): add
half-open state to circuit breaker", Apr 21); only the keepalive is
salvaged here.

Fixes NousResearch#17003.

Development

Successfully merging this pull request may close these issues.

MCP circuit breaker open state can permanently wedge gateway when stdio subprocess dies
MCP HTTP connections go stale after extended idle periods

3 participants