Commit f75365e

vominh1919 authored and Ryan committed
fix(mcp): add periodic keepalive to _wait_for_lifecycle_event
Sends a lightweight list_tools() probe every 3 minutes during idle periods to prevent TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive fails, the reconnect event fires so the transport rebuilds the session cleanly.

Salvages the keepalive portion of @vominh1919's PR NousResearch#17016. The circuit-breaker half-open recovery from the same PR landed independently on main via @benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here.

Fixes NousResearch#17003.
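The behaviour described in the commit message can be sketched outside the real transport. This is a minimal illustration, not the code from tools/mcp_tool.py: FlakySession, the function parameters, the returned strings, and the task.cancel() cleanup are all assumptions made for the example, and the interval is shortened so it finishes quickly.

```python
import asyncio


class FlakySession:
    """Hypothetical stand-in for an MCP session; not the real client class."""

    def __init__(self, fail_after: int) -> None:
        self.calls = 0
        self.fail_after = fail_after

    async def list_tools(self) -> list[str]:
        # Succeeds fail_after times, then simulates a stale socket.
        self.calls += 1
        if self.calls > self.fail_after:
            raise ConnectionError("stale socket")
        return []


async def wait_for_lifecycle_event(
    session: FlakySession,
    shutdown: asyncio.Event,
    reconnect: asyncio.Event,
    interval: float,
    probe_timeout: float,
) -> str:
    shutdown_task = asyncio.create_task(shutdown.wait())
    reconnect_task = asyncio.create_task(reconnect.wait())
    try:
        while True:
            done, _ = await asyncio.wait(
                {shutdown_task, reconnect_task},
                timeout=interval,
                return_when=asyncio.FIRST_COMPLETED,
            )
            if done:
                break
            # Idle for a full interval: probe the connection.
            try:
                await asyncio.wait_for(session.list_tools(), timeout=probe_timeout)
            except Exception:
                # A failed probe is treated like an explicit reconnect request.
                reconnect.set()
                break
    finally:
        # Assumed cleanup: cancel whichever waiter is still running.
        for task in (shutdown_task, reconnect_task):
            if not task.done():
                task.cancel()
    # Assumed return values; shutdown wins if both events are set.
    return "shutdown" if shutdown.is_set() else "reconnect"


async def demo() -> tuple[str, int]:
    session = FlakySession(fail_after=2)
    result = await wait_for_lifecycle_event(
        session, asyncio.Event(), asyncio.Event(), interval=0.01, probe_timeout=0.5
    )
    return result, session.calls
```

In this sketch the first two probes succeed and the third raises, so the function sets the reconnect event itself and returns without any external signal.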
1 parent 074a9af commit f75365e

1 file changed

Lines changed: 33 additions & 4 deletions

tools/mcp_tool.py

@@ -1038,14 +1038,43 @@ async def _wait_for_lifecycle_event(self) -> str:
         with a fresh signal.
 
         Shutdown takes precedence if both events are set simultaneously.
+
+        Periodically sends a lightweight keepalive (``list_tools``) to
+        prevent TCP connections from going stale during long idle
+        periods (#17003). If the keepalive fails, triggers a reconnect.
         """
+        # Keepalive interval in seconds. Must be shorter than typical
+        # LB / NAT idle-timeout (commonly 300-600s).
+        _KEEPALIVE_INTERVAL = 180  # 3 minutes
+
         shutdown_task = asyncio.create_task(self._shutdown_event.wait())
         reconnect_task = asyncio.create_task(self._reconnect_event.wait())
         try:
-            await asyncio.wait(
-                {shutdown_task, reconnect_task},
-                return_when=asyncio.FIRST_COMPLETED,
-            )
+            while True:
+                done, _pending = await asyncio.wait(
+                    {shutdown_task, reconnect_task},
+                    timeout=_KEEPALIVE_INTERVAL,
+                    return_when=asyncio.FIRST_COMPLETED,
+                )
+                if done:
+                    break
+
+                # Timeout — no lifecycle event fired. Send a keepalive
+                # to exercise the connection and detect stale sockets.
+                if self.session:
+                    try:
+                        await asyncio.wait_for(
+                            self.session.list_tools(),
+                            timeout=30.0,
+                        )
+                    except Exception as exc:
+                        logger.warning(
+                            "MCP server '%s' keepalive failed, "
+                            "triggering reconnect: %s",
+                            self.name, exc,
+                        )
+                        self._reconnect_event.set()
+                        break
         finally:
             for t in (shutdown_task, reconnect_task):
                 if not t.done():
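The new loop passes the same shutdown_task and reconnect_task to asyncio.wait on every iteration. That relies on a documented property of asyncio.wait: unlike asyncio.wait_for, it neither raises nor cancels the awaitables when the timeout expires, it just returns an empty done set. The snippet below is a small standalone check of that property; the names in it are illustrative and do not come from the commit.

```python
import asyncio


async def demo() -> tuple[int, bool, bool]:
    event = asyncio.Event()
    waiter = asyncio.create_task(event.wait())

    # First wait times out: nothing is done and the task keeps running.
    done, pending = await asyncio.wait({waiter}, timeout=0.01)
    done_count_on_timeout = len(done)
    still_pending = waiter in pending and not waiter.cancelled()

    # The same task object can be waited on again in a later iteration.
    event.set()
    done, _ = await asyncio.wait({waiter}, timeout=1.0)
    return done_count_on_timeout, still_pending, waiter in done
```

Because the tasks survive each timeout, the loop does not need to recreate them after every keepalive, and the existing finally block remains the single place where they are cleaned up.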
