
fix(gateway): CLOSE_WAIT fd leak audit — httpx keepalive + whatsapp aiohttp leak + Feishu hygiene (#18451) #18766

Merged
teknium1 merged 3 commits into main from hermes/hermes-3d89efe9 on May 2, 2026

Conversation

@teknium1 (Contributor) commented May 2, 2026

Three fixes for the CLOSE_WAIT fd-exhaustion reported on macOS + Cloudflare Warp, where the gateway hit its 256-fd limit after ~1-2 hours and all platforms went silent. @beibi9966's #18502 found one symptom (Feishu document-download hygiene); this PR salvages that and adds the two load-bearing fixes after auditing every persistent httpx/aiohttp client in the gateway + tools + plugins tree.

Commits

1. Feishu download response snapshot — salvaged from @beibi9966 #18502 with authorship preserved. Reads Content-Type + body inside the async with httpx.AsyncClient block; works today only because httpx eagerly buffers non-streaming responses. Structural cleanup, not the CLOSE_WAIT root cause.

2. WhatsApp send_typing aiohttp leak — the call was await self._http_session.post(...) with no async with and no variable capture. The resulting ClientResponse went out of scope unreleased, holding its TCP socket in CLOSE_WAIT until GC. Full audit across gateway/tools/plugins confirmed this is the only bare-await aiohttp leak in the tree.

3. Shared httpx.Limits helper for long-lived adapters — every persistent-client platform (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback) now uses gateway/platforms/_http_client_limits.platform_httpx_limits(), which returns httpx.Limits(max_keepalive_connections=10, keepalive_expiry=2.0) — tighter than httpx's defaults (unbounded keepalive pool, 5.0s expiry). Under macOS/Warp the 5s window is long enough for seven concurrent adapter pools to compound to the 256-fd limit; 2s drains idle keepalive sockets before they pile up. Both values are tunable under load via the HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.

What I am NOT claiming

The Feishu SDK (lark-oapi) itself may also hold connections in CLOSE_WAIT — the reporter saw some lark-oapi retry failures in the logs. That's inside the SDK and out of our direct control. Tightening httpx keepalive across our adapters reduces aggregate pool pressure regardless, but an operator still seeing fd exhaustion should (a) bump ulimit -n and (b) file upstream against lark-oapi if its pool grows unbounded.

Validation

  • scripts/run_tests.sh tests/gateway/ → 4514 passed (2 pre-existing failures unrelated to this change — whatsapp fd-ready assertion in a different test path, teams send_typing mock — confirmed present on origin/main without these commits).
  • scripts/run_tests.sh tests/gateway/test_platform_http_client_limits.py tests/gateway/test_feishu.py → 202 passed.
  • E2E: real httpx.AsyncClient(limits=platform_httpx_limits()) constructed and closed cleanly. Helper returns tighter values than httpx.Limits() defaults. Env-var overrides honoured. Source-inspection tests confirm whatsapp.send_typing uses async with (not bare await) and feishu._download_remote_document snapshots response inside client context.

Audit trail (for reviewers verifying scope)

Every httpx.AsyncClient(...) and aiohttp.ClientSession(...) instantiation in gateway/, tools/, and plugins/ was checked for release hygiene. All short-lived async with usages are correct. The only genuine leak was whatsapp.py:905 (fixed here). All six persistent platform clients (qqbot/wecom/dingtalk/signal/bluebubbles/wecom_callback) now use the tighter keepalive pool.
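The audit itself was manual, but the pattern it hunted for is mechanically checkable; a hypothetical AST-based scan for `await <obj>.post(...)` expression statements (the shape that discards the ClientResponse) might look like:

```python
import ast

def find_bare_await_posts(source):
    """Return line numbers of `await x.post(...)` expression statements.

    A bare-await post discards the response object; the context-manager
    form (`async with x.post(...) as resp:`) produces no Await node at
    statement level, so it is not flagged.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Await):
            call = node.value.value
            if (isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Attribute)
                    and call.func.attr == "post"):
                hits.append(node.lineno)
    return hits
```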

Closes #18451. Salvages #18502.

beibi9966 and others added 3 commits May 2, 2026 02:22
…ent context (#18502)

Snapshot Content-Type and body while the client context is still
active so pooled connections fully release on exit. Previously the
read happened after `async with httpx.AsyncClient(...)` returned —
which works today only because httpx eagerly buffers non-streaming
responses; a future refactor to `.stream()` would silently read-
after-close.

Part of the #18451 connection-hygiene audit. Salvage of #18502.
…nse leak (#18451)

Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.

1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
   Every long-lived platform adapter now constructs httpx.AsyncClient
   with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
   default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
   default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
   enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
   BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
   compound to the 256-fd ulimit. Tunable via
   HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
   HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.

2. whatsapp.send_typing aiohttp leak. The call was
   'await self._http_session.post(...)' with no 'async with' and no
   variable capture — the ClientResponse went out of scope unclosed,
   holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
   'async with'. This was the only bare-await aiohttp leak in the
   gateway/tools/plugins tree per audit; all other aiohttp sites use
   the context-manager pattern correctly.

The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
@teknium1 teknium1 merged commit 73bcd83 into main May 2, 2026
9 of 10 checks passed
@teknium1 teknium1 deleted the hermes/hermes-3d89efe9 branch May 2, 2026 09:23
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/gateway Gateway runner, session dispatch, delivery platform/whatsapp WhatsApp Business adapter platform/feishu Feishu / Lark adapter labels May 2, 2026


Development

Successfully merging this pull request may close these issues.

[Bug]: CLOSE_WAIT fd leak causes all platforms to stop responding after ~1-2 hours
