
[Bug]: CLOSE_WAIT fd leak causes all platforms to stop responding after ~1-2 hours #18451

@h382110229

Description
All messaging platforms (QQ Bot, Feishu/Lark) stop responding simultaneously after ~1-2 hours of uptime. The root cause is a file descriptor leak -- TCP sockets accumulate in the CLOSE_WAIT state until the process hits macOS's default soft limit of 256 open files.

Symptoms

  • QQ Bot and Feishu both go silent at the same time
  • Logs show: [Errno 24] Too many open files
  • Feishu SDK retries fail: connect failed, err: [Errno 24] Too many open files
  • lsof -p <PID> | grep CLOSE_WAIT shows 200+ leaked connections
  • kanban notifier tick failed: unable to open database file (secondary effect)

Root Cause

The Feishu SDK (lark-oapi) and/or the HTTP client used for API calls leak TCP connections. Each API call (LLM requests, tool calls, message sending) leaves a socket in CLOSE_WAIT state. After ~200 leaked connections, the process exhausts its file descriptor limit (macOS soft limit: 256), causing all new network requests to fail -- not just Feishu, but also QQ Bot, LLM API calls, MCP servers, etc.
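The mechanics can be demonstrated with a stdlib-only sketch (an illustration of fd retention, not the Hermes code path): a socket whose peer has already closed stays allocated until the local process calls close() -- for TCP, that is exactly the CLOSE_WAIT state.

```python
import socket

# Illustration only: when the remote end closes first and the local end
# never calls close(), the local TCP socket sits in CLOSE_WAIT and its
# file descriptor stays allocated, counting against `ulimit -n`.
local, remote = socket.socketpair()
remote.close()                   # the peer hangs up
eof = local.recv(1024)           # b'' -> peer closed, but our fd is still open
still_open_fd = local.fileno()   # >= 0: the descriptor is still consumed
local.close()                    # releasing it is the step the leak skips
```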

Environment

  • Hermes version: v0.12.0 (2026.4.30)
  • Python: 3.14.4
  • OS: macOS (Apple Silicon)
  • Platforms: QQ Bot + Feishu (both affected)
  • Warp: Enabled (all CLOSE_WAIT connections route through connectivity-check.warp-svc)

Reproduction Steps

  1. Start gateway with QQ Bot + Feishu platforms
  2. Wait 1-2 hours with moderate message traffic
  3. lsof -p <PID> | grep CLOSE_WAIT | wc -l grows steadily
  4. When CLOSE_WAIT > 200, all platforms stop responding
  5. hermes gateway restart fixes it temporarily
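Step 3 above can be scripted; a minimal sketch that counts CLOSE_WAIT sockets by parsing `lsof` output (the helper names are hypothetical):

```python
import subprocess

def count_close_wait(lsof_text: str) -> int:
    """Count CLOSE_WAIT sockets in `lsof -p <PID> -i TCP` output."""
    return sum(1 for line in lsof_text.splitlines() if "CLOSE_WAIT" in line)

def close_wait_for_pid(pid: int) -> int:
    """Run lsof for one PID and count its CLOSE_WAIT sockets."""
    out = subprocess.run(
        ["lsof", "-p", str(pid), "-i", "TCP"],
        capture_output=True, text=True,
    ).stdout
    return count_close_wait(out)
```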

Evidence

$ lsof -p <PID> -i TCP | grep CLOSE_WAIT | wc -l
215

$ lsof -p <PID> | wc -l
353

$ ulimit -n
256

# All CLOSE_WAIT connections route through Warp proxy:
$ lsof -p <PID> -i TCP | grep CLOSE_WAIT | awk '{print $9}' | head -5
connectivity-check.warp-svc:57890->43.159.235.46:https
connectivity-check.warp-svc:57885->43.159.235.46:https
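The 256 soft limit shown by `ulimit -n` can also be inspected (and raised, up to the hard limit) from inside the process via the stdlib `resource` module; a sketch:

```python
import resource

# Mirror of `ulimit -n`: the per-process soft/hard fd limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raising the soft limit buys headroom but does not fix the leak itself;
# uncomment to raise it (4096 is an illustrative value, capped by `hard`):
# resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```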

Workaround (currently in place)

  • Cron job monitoring CLOSE_WAIT count every 10 minutes
  • Auto-restart when CLOSE_WAIT > 30 or FD count > 200
  • Increased macOS fd limit via launchd SoftResourceLimits
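The monitor's decision rule can be sketched as follows (thresholds taken from the workaround above; the function names are hypothetical, and `/dev/fd` is used for fd counting since it exists on both macOS and Linux):

```python
import os

CLOSE_WAIT_LIMIT = 30   # restart when CLOSE_WAIT sockets exceed this
FD_LIMIT = 200          # ...or when total open fds exceed this

def open_fd_count() -> int:
    # /dev/fd lists this process's open descriptors on macOS and Linux.
    return len(os.listdir("/dev/fd"))

def should_restart(close_wait_count: int, fd_count: int) -> bool:
    """Mirror of the cron check: restart on either threshold."""
    return close_wait_count > CLOSE_WAIT_LIMIT or fd_count > FD_LIMIT
```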

Suggested Fix

The underlying issue is in the HTTP connection pool management. Connections should be properly closed/released after each API call, not left in CLOSE_WAIT state. This likely requires:

  1. Ensuring aiohttp sessions are properly closed after each request
  2. Or adding connection pool cleanup/reaping logic
  3. Investigating whether Warp proxy exacerbates the issue (all leaked connections go through Warp)
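Item 2 could look something like the following stdlib-only sketch (a hypothetical reaper, not existing Hermes or lark-oapi code): track each connection's last-use time and close anything idle past a deadline, so half-closed sockets are released instead of lingering in CLOSE_WAIT.

```python
import time

class IdleConnectionReaper:
    """Hypothetical pool-cleanup helper: closes connections idle too long.

    Anything with a .close() method can be registered; closing releases
    the fd so the kernel can finish the TCP teardown instead of leaving
    the socket parked in CLOSE_WAIT.
    """

    def __init__(self, max_idle: float = 60.0, clock=time.monotonic):
        self.max_idle = max_idle
        self.clock = clock          # injectable for testing
        self._last_used: dict = {}  # connection -> last-use timestamp

    def touch(self, conn) -> None:
        """Record that `conn` was just used."""
        self._last_used[conn] = self.clock()

    def reap(self) -> int:
        """Close every connection idle longer than max_idle; return count."""
        now = self.clock()
        stale = [c for c, t in self._last_used.items()
                 if now - t > self.max_idle]
        for conn in stale:
            conn.close()
            del self._last_used[conn]
        return len(stale)
```

For item 1, the usual aiohttp pattern is to create a single ClientSession at startup and `await session.close()` at shutdown, rather than constructing one per request.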

Related

This is the same root cause as the QQ Bot 4009 "Session timed out" disconnects -- fd exhaustion prevents the adapter from maintaining WebSocket heartbeats.

Metadata


    Labels

    P1 (High — major feature broken, no workaround)
    comp/gateway (Gateway runner, session dispatch, delivery)
    platform/feishu (Feishu / Lark adapter)
    type/bug (Something isn't working)
