
fix(gateway): CLOSE_WAIT fd leak audit — httpx keepalive + whatsapp aiohttp leak + Feishu hygiene (#18451) #18766

Merged
teknium1 merged 3 commits into main from hermes/hermes-3d89efe9 on May 2, 2026

Conversation

@teknium1 (Contributor) commented May 2, 2026

Three fixes for the CLOSE_WAIT fd-exhaustion reported on macOS + Cloudflare Warp, where the gateway hit its 256-fd limit after ~1-2 hours and all platforms went silent. @beibi9966's #18502 found one symptom (Feishu document-download hygiene); this PR salvages that and adds the two load-bearing fixes after auditing every persistent httpx/aiohttp client in the gateway + tools + plugins tree.

Commits

1. Feishu download response snapshot — salvaged from @beibi9966 #18502 with authorship preserved. Reads Content-Type + body inside the async with httpx.AsyncClient block; works today only because httpx eagerly buffers non-streaming responses. Structural cleanup, not the CLOSE_WAIT root cause.

2. WhatsApp send_typing aiohttp leak — the call was await self._http_session.post(...) with no async with and no variable capture. The resulting ClientResponse went out of scope unreleased, holding its TCP socket in CLOSE_WAIT until GC. Full audit across gateway/tools/plugins confirmed this is the only bare-await aiohttp leak in the tree.

3. Shared httpx.Limits helper for long-lived adapters — every persistent-client platform (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback) now uses gateway/platforms/_http_client_limits.platform_httpx_limits(), which returns httpx.Limits(max_keepalive_connections=10, keepalive_expiry=2.0) — tighter than httpx's defaults (unbounded keepalive pool, 5.0s expiry). Under macOS/Warp the 5s window is long enough for seven concurrent adapter pools to compound to the 256-fd limit; 2s drains idle keepalive sockets before they pile up. Both values are tunable under load via the HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.

What I am NOT claiming

The Feishu SDK (lark-oapi) itself may also hold connections in CLOSE_WAIT — the reporter saw some lark-oapi retry failures in the logs. That's inside the SDK and out of our direct control. Tightening httpx keepalive across our adapters reduces aggregate pool pressure regardless, but an operator still seeing fd exhaustion should (a) bump ulimit -n and (b) file upstream against lark-oapi if its pool grows unbounded.

Validation

  • scripts/run_tests.sh tests/gateway/ → 4514 passed (2 pre-existing failures unrelated to this change — whatsapp fd-ready assertion in a different test path, teams send_typing mock — confirmed present on origin/main without these commits).
  • scripts/run_tests.sh tests/gateway/test_platform_http_client_limits.py tests/gateway/test_feishu.py → 202 passed.
  • E2E: real httpx.AsyncClient(limits=platform_httpx_limits()) constructed and closed cleanly. Helper returns tighter values than httpx.Limits() defaults. Env-var overrides honoured. Source-inspection tests confirm whatsapp.send_typing uses async with (not bare await) and feishu._download_remote_document snapshots response inside client context.

Audit trail (for reviewers verifying scope)

Every httpx.AsyncClient(...) and aiohttp.ClientSession(...) instantiation in gateway/, tools/, and plugins/ was checked for release hygiene. All short-lived async with usages are correct. The only genuine leak was whatsapp.py:905 (fixed here). All six persistent platform clients (qqbot/wecom/dingtalk/signal/bluebubbles/wecom_callback) now use the tighter keepalive pool.
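The audit itself was manual, but the pattern it hunted for is mechanically checkable; a hypothetical AST-based scan for `await <obj>.post(...)` expression statements (the shape that discards the ClientResponse) might look like:

```python
import ast

def find_bare_await_posts(source):
    """Return line numbers of `await x.post(...)` expression statements.

    A bare-await post discards the response object; the context-manager
    form (`async with x.post(...) as resp:`) produces no Await node at
    statement level, so it is not flagged.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Await):
            call = node.value.value
            if (isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Attribute)
                    and call.func.attr == "post"):
                hits.append(node.lineno)
    return hits
```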

Closes #18451. Salvages #18502.

beibi9966 and others added 3 commits May 2, 2026 02:22
…ent context (#18502)

Snapshot Content-Type and body while the client context is still
active so pooled connections fully release on exit. Previously the
read happened after `async with httpx.AsyncClient(...)` returned —
which works today only because httpx eagerly buffers non-streaming
responses; a future refactor to `.stream()` would silently read-
after-close.

Part of the #18451 connection-hygiene audit. Salvage of #18502.
…nse leak (#18451)

Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.

1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
   Every long-lived platform adapter now constructs httpx.AsyncClient
   with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
   default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
   default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
   enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
   BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
   compound to the 256-fd ulimit. Tunable via
   HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
   HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.

2. whatsapp.send_typing aiohttp leak. The call was
   'await self._http_session.post(...)' with no 'async with' and no
   variable capture — the ClientResponse went out of scope unclosed,
   holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
   'async with'. This was the only bare-await aiohttp leak in the
   gateway/tools/plugins tree per audit; all other aiohttp sites use
   the context-manager pattern correctly.

The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
@teknium1 teknium1 merged commit 73bcd83 into main May 2, 2026
9 of 10 checks passed
@teknium1 teknium1 deleted the hermes/hermes-3d89efe9 branch May 2, 2026 09:23
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/gateway Gateway runner, session dispatch, delivery platform/whatsapp WhatsApp Business adapter platform/feishu Feishu / Lark adapter labels May 2, 2026


Development

Successfully merging this pull request may close these issues.

[Bug]: CLOSE_WAIT fd leak causes all platforms to stop responding after ~1-2 hours
