fix(gateway): CLOSE_WAIT fd leak audit - httpx keepalive + whatsapp aiohttp leak + Feishu hygiene (#18451) #18766
Merged
…ent context (#18502)

Snapshot Content-Type and body while the client context is still active so pooled connections fully release on exit. Previously the read happened after `async with httpx.AsyncClient(...)` returned, which works today only because httpx eagerly buffers non-streaming responses; a future refactor to `.stream()` would silently read after close. Part of the #18451 connection-hygiene audit. Salvage of #18502.
…nse leak (#18451)

Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot + Feishu on macOS behind Cloudflare Warp.

1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py). Every long-lived platform adapter now constructs httpx.AsyncClient with max_keepalive_connections=10 and keepalive_expiry=2.0, vs. a bare httpx.Limits() with an unbounded keepalive pool and 5.0s expiry (a client created without `limits` falls back to httpx's documented defaults of 20 keepalive connections and the same 5.0s expiry). On macOS/Warp the default 5s window let idle keepalive sockets sit in CLOSE_WAIT long enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback, plus the transient Feishu helper) to compound to the 256-fd ulimit. Tunable via HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.
2. whatsapp.send_typing aiohttp leak. The call was `await self._http_session.post(...)` with no `async with` and no variable capture, so the ClientResponse went out of scope unclosed, holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in `async with`. This was the only bare-await aiohttp leak in the gateway/tools/plugins tree per audit; all other aiohttp sites use the context-manager pattern correctly.

The reporter also saw Feishu SDK (lark-oapi) connections in CLOSE_WAIT. Those are inside the SDK and out of our direct control, but tightening httpx keepalive across adapters reduces the aggregate pool pressure regardless of which individual adapter leaks.
Follow-up for PR #18502 salvage.
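The env-var tuning described in the commit message above could be sketched roughly as follows. This is not the code in `_http_client_limits.py`, which is not shown here. The env var names and the defaults (10 and 2.0) come from the commit message; the function names `keepalive_limit_kwargs` and `_env_number`, and the rule of falling back to the default on empty, malformed, or non-positive values, are assumptions made for illustration. The sketch returns plain keyword arguments so it runs without httpx installed.

```python
import os

# Defaults stated in the PR. The parsing and fallback rules are assumptions.
DEFAULT_MAX_KEEPALIVE = 10
DEFAULT_KEEPALIVE_EXPIRY = 2.0


def _env_number(name, default, cast):
    """Return cast(env value) if it parses to a positive number, else default."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    try:
        value = cast(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def keepalive_limit_kwargs():
    """Build keyword arguments intended for httpx.Limits(**kwargs)."""
    return {
        "max_keepalive_connections": _env_number(
            "HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE", DEFAULT_MAX_KEEPALIVE, int
        ),
        "keepalive_expiry": _env_number(
            "HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY", DEFAULT_KEEPALIVE_EXPIRY, float
        ),
    }
```

With httpx available, an adapter would then construct its client as `httpx.AsyncClient(limits=httpx.Limits(**keepalive_limit_kwargs()))`. Falling back silently on a bad value keeps a typo in an env var from taking the gateway down, at the cost of hiding the misconfiguration; a real helper might prefer to log it.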
Three fixes for the CLOSE_WAIT fd-exhaustion reported on macOS + Cloudflare Warp, where the gateway hit its 256-fd limit after ~1-2 hours and all platforms went silent. @beibi9966's #18502 found one symptom (Feishu document-download hygiene); this PR salvages that and adds the two load-bearing fixes after auditing every persistent httpx/aiohttp client in the gateway + tools + plugins tree.
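For operators who want to see how close a process is to its descriptor limit, a minimal diagnostic sketch follows. It is not part of this PR. The function name `fd_headroom` is hypothetical, the `resource` module is Unix-only, and listing `/dev/fd` is assumed to work on the host (it does on macOS and Linux).

```python
import os
import resource


def fd_headroom(fd_dir="/dev/fd"):
    """Return (open_fds, soft_limit) for the current process.

    open_fds is None if fd_dir cannot be listed.
    soft_limit is None if the soft limit is unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    soft_limit = None if soft == resource.RLIM_INFINITY else soft
    try:
        # Listing the directory briefly opens one extra descriptor,
        # so the count can be off by one.
        open_fds = len(os.listdir(fd_dir))
    except OSError:
        open_fds = None
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_headroom()
    print(f"open fds: {used}, soft limit: {limit}")
```

Logging these two numbers periodically would show whether descriptors are climbing toward the limit over time, which is the pattern described above.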
Commits
1. Feishu download response snapshot: salvaged from @beibi9966's #18502 with authorship preserved. Reads Content-Type + body inside the `async with httpx.AsyncClient` block; the previous read after the block exited worked only because httpx eagerly buffers non-streaming responses. Structural cleanup, not the CLOSE_WAIT root cause.
2. WhatsApp `send_typing` aiohttp leak: the call was `await self._http_session.post(...)` with no `async with` and no variable capture. The resulting `ClientResponse` went out of scope unreleased, holding its TCP socket in CLOSE_WAIT until GC. A full audit across gateway/tools/plugins confirmed this is the only bare-await aiohttp leak in the tree.
3. Shared httpx.Limits helper for long-lived adapters: every persistent-client platform (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback) now uses `gateway/platforms/_http_client_limits.platform_httpx_limits()`, which returns `httpx.Limits(max_keepalive_connections=10, keepalive_expiry=2.0)`. That is tighter than a bare `httpx.Limits()` (unbounded keepalive pool, 5.0s expiry) and than the limits httpx applies to a client created without `limits` (20 keepalive connections, 5.0s expiry). Under macOS/Warp the 5s window is long enough for seven adapter pools (the six above plus the transient Feishu helper) to compound to the 256-fd limit; 2s drains idle keepalive sockets before they pile up. Tunable via `HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY` and `HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE` env vars for under-load tuning.

What I am NOT claiming

The Feishu SDK (`lark-oapi`) itself may also hold connections in CLOSE_WAIT; the reporter saw some lark-oapi retry failures in the logs. That's inside the SDK and out of our direct control. Tightening httpx keepalive across our adapters reduces aggregate pool pressure regardless, but an operator still seeing fd exhaustion should (a) bump `ulimit -n` and (b) file upstream against lark-oapi if its pool grows unbounded.

Validation

- `scripts/run_tests.sh tests/gateway/` → 4514 passed (2 pre-existing failures unrelated to this change: a whatsapp fd-ready assertion in a different test path and a teams send_typing mock, both confirmed present on `origin/main` without these commits).
- `scripts/run_tests.sh tests/gateway/test_platform_http_client_limits.py tests/gateway/test_feishu.py` → 202 passed.
- `httpx.AsyncClient(limits=platform_httpx_limits())` constructed and closed cleanly. Helper returns tighter values than `httpx.Limits()` defaults. Env-var overrides honoured. Source-inspection tests confirm whatsapp.send_typing uses `async with` (not bare `await`) and feishu._download_remote_document snapshots the response inside the client context.

Audit trail (for reviewers verifying scope)

Every `httpx.AsyncClient(...)` and `aiohttp.ClientSession(...)` instantiation in `gateway/`, `tools/`, and `plugins/` was checked for release hygiene. All short-lived `async with` usages are correct. The only genuine leak was whatsapp.py:905 (fixed here). All six persistent platform clients (qqbot/wecom/dingtalk/signal/bluebubbles/wecom_callback) now use the tighter keepalive pool.

Closes #18451. Salvages #18502.
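
For reviewers unfamiliar with the bare-await pattern behind the whatsapp.py leak, here is a small simulation of the difference. It does not use aiohttp and is not the gateway's code: `FakeSession`, `FakeRequestContext`, and `FakeResponse` are hypothetical stand-ins that mimic how the object returned by `session.post(...)` can be either awaited or entered with `async with`, and the URL is a placeholder.

```python
import asyncio


class FakeResponse:
    """Stand-in for a client response that holds a connection until released."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


class FakeRequestContext:
    """Awaitable and async context manager, like the result of session.post()."""

    def __init__(self, response):
        self._response = response

    def __await__(self):
        async def _send():
            return self._response

        return _send().__await__()

    async def __aenter__(self):
        return self._response

    async def __aexit__(self, exc_type, exc, tb):
        # Leaving the block releases the response and its connection.
        self._response.release()
        return False


class FakeSession:
    def __init__(self):
        self.responses = []

    def post(self, url, json=None):
        response = FakeResponse()
        self.responses.append(response)
        return FakeRequestContext(response)


async def send_typing_leaky(session):
    # Bare await: the response is discarded without ever being released.
    await session.post("http://bridge.invalid/typing", json={"chat": "demo"})


async def send_typing_fixed(session):
    # async with: the response is released when the block exits.
    async with session.post("http://bridge.invalid/typing", json={"chat": "demo"}):
        pass


async def demo():
    session = FakeSession()
    await send_typing_leaky(session)
    await send_typing_fixed(session)
    return [response.released for response in session.responses]


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [False, True]
```

The first response is never released because nothing holds a reference that could call `release()`; in the real client that is the window in which the socket can linger until garbage collection.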