Commit 1dd4b5f

vominh1919 authored and Chris Han committed
fix(mcp): add periodic keepalive to _wait_for_lifecycle_event
Sends a lightweight list_tools() probe every 3 minutes during idle periods to prevent TCP connections from going stale behind LB / NAT idle timeouts (commonly 300-600s). When the keepalive fails, the reconnect event fires so the transport rebuilds the session cleanly.

Salvages the keepalive portion of @vominh1919's PR NousResearch#17016. The circuit-breaker half-open recovery from the same PR was independently landed on main via @benbarclay's commit 8cc3ceb ("fix(mcp): add half-open state to circuit breaker", Apr 21); only the keepalive is salvaged here.

Fixes NousResearch#17003.
1 parent 7964e77 commit 1dd4b5f
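
The keepalive pattern described in the commit message can be sketched in isolation as follows. This is a minimal, hedged illustration, not the project's code: `FakeSession`, `wait_for_lifecycle_event`, the tiny `interval`, the `t.cancel()` cleanup, and the return-value logic are all assumptions made for the example. The real method lives on a class and uses a 180-second interval with a 30-second probe timeout.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an MCP client session."""

    def __init__(self, fail_after: int) -> None:
        self.calls = 0
        self.fail_after = fail_after

    async def list_tools(self) -> list:
        # Succeeds `fail_after` times, then simulates a stale socket.
        self.calls += 1
        if self.calls > self.fail_after:
            raise ConnectionError("stale socket")
        return []


async def wait_for_lifecycle_event(session, shutdown, reconnect, interval):
    """Wait for shutdown/reconnect, probing the session on each idle timeout."""
    shutdown_task = asyncio.create_task(shutdown.wait())
    reconnect_task = asyncio.create_task(reconnect.wait())
    try:
        while True:
            done, _pending = await asyncio.wait(
                {shutdown_task, reconnect_task},
                timeout=interval,
                return_when=asyncio.FIRST_COMPLETED,
            )
            if done:
                break  # a lifecycle event fired
            # Idle timeout: exercise the connection with a cheap request.
            try:
                await asyncio.wait_for(session.list_tools(), timeout=1.0)
            except Exception:
                reconnect.set()  # probe failed, ask for a reconnect
                break
    finally:
        # Assumed cleanup: cancel whichever waiter is still pending.
        for t in (shutdown_task, reconnect_task):
            if not t.done():
                t.cancel()
    # Assumed return convention: shutdown takes precedence.
    return "shutdown" if shutdown.is_set() else "reconnect"


async def demo():
    session = FakeSession(fail_after=2)
    shutdown, reconnect = asyncio.Event(), asyncio.Event()
    result = await wait_for_lifecycle_event(
        session, shutdown, reconnect, interval=0.01
    )
    return result, session.calls


result, calls = asyncio.run(demo())
print(result, calls)
```

In this sketch the first two probes succeed and the third raises, so the loop sets the reconnect event and exits without any external signal.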

1 file changed

Lines changed: 33 additions & 4 deletions

tools/mcp_tool.py

@@ -1066,14 +1066,43 @@ async def _wait_for_lifecycle_event(self) -> str:
         with a fresh signal.
 
         Shutdown takes precedence if both events are set simultaneously.
+
+        Periodically sends a lightweight keepalive (``list_tools``) to
+        prevent TCP connections from going stale during long idle
+        periods (#17003). If the keepalive fails, triggers a reconnect.
         """
+        # Keepalive interval in seconds. Must be shorter than typical
+        # LB / NAT idle-timeout (commonly 300-600s).
+        _KEEPALIVE_INTERVAL = 180  # 3 minutes
+
         shutdown_task = asyncio.create_task(self._shutdown_event.wait())
         reconnect_task = asyncio.create_task(self._reconnect_event.wait())
         try:
-            await asyncio.wait(
-                {shutdown_task, reconnect_task},
-                return_when=asyncio.FIRST_COMPLETED,
-            )
+            while True:
+                done, _pending = await asyncio.wait(
+                    {shutdown_task, reconnect_task},
+                    timeout=_KEEPALIVE_INTERVAL,
+                    return_when=asyncio.FIRST_COMPLETED,
+                )
+                if done:
+                    break
+
+                # Timeout — no lifecycle event fired. Send a keepalive
+                # to exercise the connection and detect stale sockets.
+                if self.session:
+                    try:
+                        await asyncio.wait_for(
+                            self.session.list_tools(),
+                            timeout=30.0,
+                        )
+                    except Exception as exc:
+                        logger.warning(
+                            "MCP server '%s' keepalive failed, "
+                            "triggering reconnect: %s",
+                            self.name, exc,
+                        )
+                        self._reconnect_event.set()
+                        break
         finally:
             for t in (shutdown_task, reconnect_task):
                 if not t.done():
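
The `if done: break` check in the new loop relies on how the two asyncio helpers report a timeout. The short sketch below illustrates that difference with standard-library calls only. The `demo` function and the tiny timeouts are illustrative choices, not part of the commit.

```python
import asyncio


async def demo():
    event = asyncio.Event()  # never set, so the waiter stays pending
    waiter = asyncio.create_task(event.wait())

    # asyncio.wait: an expired timeout is reported as an empty `done` set,
    # and the pending task keeps running so the loop can wait on it again.
    done, pending = await asyncio.wait({waiter}, timeout=0.01)
    still_running = not waiter.done()

    # asyncio.wait_for: an expired timeout cancels the awaited call and
    # raises a timeout error, which a broad `except Exception` would catch.
    timed_out = False
    try:
        await asyncio.wait_for(asyncio.sleep(1), timeout=0.01)
    except asyncio.TimeoutError:
        timed_out = True

    waiter.cancel()
    return len(done), len(pending), still_running, timed_out


outcome = asyncio.run(demo())
print(outcome)
```

This is why the outer wait can be retried with the same two tasks on every iteration, while a probe that hangs past its own timeout lands in the `except` branch and triggers a reconnect.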
