fix: MCP circuit breaker recovery and HTTP keepalive (#16788, #17003) #17016
Closed
vominh1919 wants to merge 1 commit into
- Add half-open recovery to circuit breaker (NousResearch#16788): after cooldown period (60s), allow a single probe call through instead of permanently blocking the server. If probe succeeds, reset error count; if it fails, re-open the breaker with a fresh cooldown.
- Add periodic keepalive to _wait_for_lifecycle_event() (NousResearch#17003): send list_tools() every 3 minutes to prevent TCP connections from going stale during long idle periods. If keepalive fails, trigger automatic reconnect.

Both fixes improve MCP server reliability for long-running gateway sessions where subprocess crashes or network idle timeouts previously required a full gateway restart to recover.
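For readers who want to see the half-open state machine in isolation, here is a minimal sketch of the behavior described above. It is not the code from this PR: the class `HalfOpenBreaker`, its method names, and the injectable `clock` are hypothetical and exist only for illustration. The threshold of 3 failures and the 60s cooldown are taken from the description.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

FAILURE_THRESHOLD = 3   # consecutive failures before the breaker opens
COOLDOWN_SEC = 60.0     # time to wait before letting one probe through


@dataclass
class HalfOpenBreaker:
    """Hypothetical per-server breaker; names are illustrative only."""

    clock: Callable[[], float] = time.monotonic
    error_count: int = 0
    opened_at: Optional[float] = None   # None means the breaker is closed
    probe_in_flight: bool = False

    def allow_call(self) -> bool:
        if self.opened_at is None:
            return True                  # closed: calls pass normally
        if self.probe_in_flight:
            return False                 # half-open: only one probe at a time
        if self.clock() - self.opened_at >= COOLDOWN_SEC:
            self.probe_in_flight = True  # cooldown elapsed: admit a single probe
            return True
        return False                     # open: still cooling down

    def record_success(self) -> None:
        # A successful call (including a probe) closes the breaker.
        self.error_count = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self) -> None:
        self.error_count += 1
        # A failed probe, or reaching the threshold, (re)opens the breaker
        # and starts a fresh cooldown from now.
        if self.probe_in_flight or self.error_count >= FAILURE_THRESHOLD:
            self.opened_at = self.clock()
        self.probe_in_flight = False
```

A caller would check `allow_call()` before each tool call and report the outcome with `record_success()` or `record_failure()`. Using a monotonic clock keeps the cooldown immune to wall-clock adjustments.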
This was referenced Apr 28, 2026
@vominh1919 Thanks for your PR. I really need this as my MCP connections get lost easily. I hope you can resolve the merge conflicts soon and this gets released ASAP.
Contributor
Author
Waiting for the team to check. @teknium1 @alt-glitch
teknium1
pushed a commit
that referenced
this pull request
May 5, 2026
Sends a lightweight list_tools() probe every 3 minutes during idle periods to prevent TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive fails, the reconnect event fires so the transport rebuilds the session cleanly. Salvages the keepalive portion of @vominh1919's PR #17016. The circuit-breaker half-open recovery from the same PR was independently landed on main via #benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here. Fixes #17003.
chris-han
pushed a commit
to chris-han/hermes-agent
that referenced
this pull request
May 6, 2026
Sends a lightweight list_tools() probe every 3 minutes during idle periods to prevent TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive fails, the reconnect event fires so the transport rebuilds the session cleanly. Salvages the keepalive portion of @vominh1919's PR NousResearch#17016. The circuit-breaker half-open recovery from the same PR was independently landed on main via #benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here. Fixes NousResearch#17003.
nickdlkk
pushed a commit
to nickdlkk/hermes-agent
that referenced
this pull request
May 11, 2026
Sends a lightweight list_tools() probe every 3 minutes during idle periods to prevent TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive fails, the reconnect event fires so the transport rebuilds the session cleanly. Salvages the keepalive portion of @vominh1919's PR NousResearch#17016. The circuit-breaker half-open recovery from the same PR was independently landed on main via #benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here. Fixes NousResearch#17003.
rmulligan
pushed a commit
to rmulligan/hermes-agent
that referenced
this pull request
May 11, 2026
Sends a lightweight list_tools() probe every 3 minutes during idle periods to prevent TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive fails, the reconnect event fires so the transport rebuilds the session cleanly. Salvages the keepalive portion of @vominh1919's PR NousResearch#17016. The circuit-breaker half-open recovery from the same PR was independently landed on main via #benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here. Fixes NousResearch#17003.
JinyuID
pushed a commit
to JinyuID/hermes-agent
that referenced
this pull request
May 11, 2026
Sends a lightweight list_tools() probe every 3 minutes during idle periods to prevent TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive fails, the reconnect event fires so the transport rebuilds the session cleanly. Salvages the keepalive portion of @vominh1919's PR NousResearch#17016. The circuit-breaker half-open recovery from the same PR was independently landed on main via #benbarclay's commit 76bc64f ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here. Fixes NousResearch#17003.
Problem
Two related MCP reliability issues affect long-running gateway sessions:
1. Circuit breaker permanently blocks recovery (#16788)
When the MCP circuit breaker trips (3 consecutive failures), it blocks the server permanently with no recovery mechanism. If the underlying subprocess dies and later becomes available again, the breaker never allows a probe call through. The gateway must be restarted to recover.
2. HTTP connections go stale during idle periods (#17003)
_wait_for_lifecycle_event() blocks indefinitely without generating any traffic. After extended idle periods (~12h), TCP connections become stale. The next tool call fails silently with an empty error message.
Fix
Circuit breaker half-open recovery (#16788)
- _CIRCUIT_BREAKER_COOLDOWN_SEC = 60: cooldown period before allowing a probe
- _server_breaker_opened_at: tracks when the breaker tripped
- _bump_server_error() and _reset_server_error() helpers for consistent state management
HTTP keepalive (#17003)
- _wait_for_lifecycle_event() now uses asyncio.wait() with a 3-minute timeout
- On timeout, sends a list_tools() keepalive to exercise the connection
- If the keepalive fails, sets _reconnect_event to trigger a reconnect
Testing
- Syntax checked with ast.parse() on the modified file
Fixes #16788
Fixes #17003
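The keepalive loop described under "HTTP keepalive" can be sketched roughly as follows. This is an illustration under stated assumptions, not the patched _wait_for_lifecycle_event(): the function name `wait_with_keepalive`, its parameters, and its string return values are hypothetical, and `send_keepalive` stands in for the session's list_tools() call.

```python
import asyncio
from typing import Awaitable, Callable

KEEPALIVE_INTERVAL_SEC = 180.0  # 3 minutes, as in the PR description


async def wait_with_keepalive(
    shutdown_event: asyncio.Event,
    reconnect_event: asyncio.Event,
    send_keepalive: Callable[[], Awaitable[object]],  # stand-in for list_tools()
    interval: float = KEEPALIVE_INTERVAL_SEC,
) -> str:
    """Block until shutdown or reconnect, probing the connection when idle."""
    while True:
        waiters = {
            asyncio.ensure_future(shutdown_event.wait()): "shutdown",
            asyncio.ensure_future(reconnect_event.wait()): "reconnect",
        }
        try:
            done, _pending = await asyncio.wait(
                list(waiters),
                timeout=interval,
                return_when=asyncio.FIRST_COMPLETED,
            )
        finally:
            # Never leave helper tasks dangling between iterations.
            for task in waiters:
                if not task.done():
                    task.cancel()

        if done:
            reasons = {waiters[task] for task in done}
            return "shutdown" if "shutdown" in reasons else "reconnect"

        # Timeout with no lifecycle event: exercise the connection.
        try:
            await send_keepalive()
        except Exception:
            # A failed probe is treated as a stale connection.
            reconnect_event.set()
            return "reconnect"
```

An interval of 3 minutes stays below the 300-600s idle timeouts mentioned in the salvage commit message, so at least one probe should reach the server before a load balancer or NAT drops the connection.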