fix(mcp): add periodic keepalive to _wait_for_lifecycle_event (salvage #17016) #20209
Merged
Supersedes the keepalive portion of #17016 (salvaged onto current main). The half-open circuit breaker was already merged separately.
Sends a lightweight list_tools() probe every 3 minutes during idle periods to keep TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive probe fails, the reconnect event fires so the transport rebuilds the session cleanly. Salvages the keepalive portion of @vominh1919's PR #17016. The circuit-breaker half-open recovery from the same PR landed independently on main via @benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here. Fixes #17003.
Salvages the keepalive portion of @vominh1919's PR #17016. Fixes #17003.
What it does
During long idle periods, the MCP session's TCP connection goes stale behind NAT / LB idle timeouts (commonly 300–600s). This change sends a lightweight list_tools probe every 3 minutes; if the probe fails, it triggers a clean reconnect instead of leaving the agent stuck on a dead socket.
Changes
tools/mcp_tool.py: _wait_for_lifecycle_event wraps asyncio.wait in a loop with a 180s timeout; on timeout, it probes session.list_tools() with a 30s deadline and fires _reconnect_event on failure.
Dropped vs original
The original PR #17016 also added a half-open state to the circuit breaker, but that feature had already landed on main via @benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21). Only the keepalive is salvaged here.
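For readers who want the shape of the keepalive change in tools/mcp_tool.py without opening the diff, here is a minimal sketch of the loop as the description states it: asyncio.wait with a 180s timeout, then a session.list_tools() probe with a 30s deadline, then the reconnect event on failure. It is not the merged code. The function name, the shutdown_event parameter, the string return values, and the configurable intervals are assumptions made for illustration.

```python
import asyncio


async def wait_for_lifecycle_event(
    session,
    shutdown_event: asyncio.Event,
    reconnect_event: asyncio.Event,
    keepalive_interval: float = 180.0,  # 3 minutes, per the PR description
    probe_timeout: float = 30.0,        # probe deadline, per the PR description
) -> str:
    """Wait for shutdown or reconnect, probing the session while idle.

    Hypothetical signature: returns "shutdown" or "reconnect".
    """
    while True:
        waiters = {
            asyncio.ensure_future(shutdown_event.wait()),
            asyncio.ensure_future(reconnect_event.wait()),
        }
        # Wake up on the first event, or after one idle interval.
        _, pending = await asyncio.wait(
            waiters,
            timeout=keepalive_interval,
            return_when=asyncio.FIRST_COMPLETED,
        )
        # Clean up the waiters that did not fire before looping or returning.
        for task in pending:
            task.cancel()
        if pending:
            await asyncio.gather(*pending, return_exceptions=True)

        if shutdown_event.is_set():
            return "shutdown"
        if reconnect_event.is_set():
            return "reconnect"

        # Idle for a full interval: send a cheap probe to exercise the socket.
        try:
            await asyncio.wait_for(session.list_tools(), timeout=probe_timeout)
        except Exception:
            # Probe failed or timed out: treat the connection as dead and
            # signal a reconnect so the transport can rebuild the session.
            reconnect_event.set()
            return "reconnect"
```

Because 180s sits below the commonly cited 300–600s idle timeouts, a healthy connection sees traffic before a middlebox drops it, and a dead one is detected within roughly one interval plus the probe deadline.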
Validation
tests/tools/test_mcp_tool.py: 182 passed locally.
Closes #17016 via salvage.