## Description

All messaging platforms (QQ Bot, Feishu/Lark) stop responding simultaneously after ~1-2 hours of uptime. The root cause is a file descriptor leak -- TCP sockets accumulate in `CLOSE_WAIT` state until the process hits the macOS default soft limit of 256 open files.
## Symptoms

- QQ Bot and Feishu both go silent at the same time
- Logs show `[Errno 24] Too many open files`
- Feishu SDK retries fail with `connect failed, err: [Errno 24] Too many open files`
- `lsof -p <PID> | grep CLOSE_WAIT` shows 200+ leaked connections
- `kanban notifier tick failed: unable to open database file` (secondary effect)
## Root Cause

The Feishu SDK (lark-oapi) and/or the HTTP client used for API calls leak TCP connections. Each API call (LLM requests, tool calls, message sending) leaves a socket in `CLOSE_WAIT` state. After ~200 leaked connections, the process exhausts its file descriptor limit (macOS soft limit: 256), causing all new network requests to fail -- not just Feishu, but also QQ Bot, LLM API calls, MCP servers, and everything else that needs a socket.
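The behavior is consistent with client sessions being created per call and never closed. A hypothetical illustration of that anti-pattern (not confirmed against the actual lark-oapi code paths; the function and URL are made up):

```python
import aiohttp

async def send_message(api_url: str, payload: dict) -> dict:
    # Suspected anti-pattern (hypothetical): a fresh ClientSession per call.
    # The keep-alive socket stays in the session's pool after the request;
    # when the server later closes its side, the socket sits in CLOSE_WAIT
    # indefinitely because session.close() is never called -- one leaked fd
    # per API call.
    session = aiohttp.ClientSession()
    async with session.post(api_url, json=payload) as resp:
        return await resp.json()
```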
## Environment
- Hermes version: v0.12.0 (2026.4.30)
- Python: 3.14.4
- OS: macOS (Apple Silicon)
- Platforms: QQ Bot + Feishu (both affected)
- Warp: Enabled (all CLOSE_WAIT connections route through `connectivity-check.warp-svc`)
## Reproduction Steps

- Start the gateway with QQ Bot + Feishu platforms
- Wait 1-2 hours with moderate message traffic
- `lsof -p <PID> | grep CLOSE_WAIT | wc -l` grows steadily
- When the CLOSE_WAIT count exceeds ~200, all platforms stop responding
- `hermes gateway restart` fixes it temporarily
## Evidence

```
$ lsof -p <PID> -i TCP | grep CLOSE_WAIT | wc -l
215

$ lsof -p <PID> | wc -l
353

$ ulimit -n
256

# All CLOSE_WAIT connections route through the Warp proxy:
$ lsof -p <PID> -i TCP | grep CLOSE_WAIT | awk '{print $9}' | head -5
connectivity-check.warp-svc:57890->43.159.235.46:https
connectivity-check.warp-svc:57885->43.159.235.46:https
```
## Workaround (currently in place)

- A cron job checks the CLOSE_WAIT count every 10 minutes (a sketch of the monitor follows this list)
- It auto-restarts the gateway when CLOSE_WAIT > 30 or the total fd count > 200
- The macOS fd limit was raised via launchd `SoftResourceLimits`
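A minimal sketch of such a watchdog, assuming it is invoked from cron, that the gateway process is discoverable via `pgrep` (the pattern below is an assumption), and that `hermes gateway restart` is the restart command:

```python
#!/usr/bin/env python3
"""Cron-driven fd watchdog (illustrative sketch; thresholds from above)."""
import subprocess

def count_lines(shell_cmd: str) -> int:
    # Run a shell pipeline and count its output lines.
    out = subprocess.run(shell_cmd, shell=True, capture_output=True, text=True).stdout
    return len(out.splitlines())

def main() -> None:
    # Assumption: the gateway process matches this pgrep pattern.
    pids = subprocess.run(
        ["pgrep", "-f", "hermes gateway"], capture_output=True, text=True
    ).stdout.split()
    if not pids:
        return  # gateway not running; nothing to monitor
    pid = pids[0]
    close_wait = count_lines(f"lsof -p {pid} -i TCP | grep CLOSE_WAIT")
    total_fds = count_lines(f"lsof -p {pid}")
    if close_wait > 30 or total_fds > 200:
        subprocess.run(["hermes", "gateway", "restart"], check=True)

if __name__ == "__main__":
    main()
```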
## Suggested Fix

The underlying issue is in HTTP connection pool management: connections should be closed or released back to the pool after each API call, not left in `CLOSE_WAIT` state. This likely requires one or more of:

- Ensuring aiohttp sessions are properly closed after each request -- or, better, reusing one long-lived session (see the sketch after this list)
- Adding connection pool cleanup/reaping logic
- Investigating whether the Warp proxy exacerbates the issue (all leaked connections go through Warp)
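A minimal sketch of the shared-session approach, assuming the adapters use aiohttp directly (the wrapper class and its method names are hypothetical):

```python
import aiohttp

class FeishuClient:
    """Illustrative wrapper: one shared session, closed exactly once on shutdown."""

    def __init__(self) -> None:
        self._session: aiohttp.ClientSession | None = None

    async def _get_session(self) -> aiohttp.ClientSession:
        # Lazily create a single long-lived session so keep-alive sockets
        # are pooled and reused instead of abandoned per call.
        if self._session is None or self._session.closed:
            self._session = aiohttp.ClientSession(
                connector=aiohttp.TCPConnector(
                    limit=20,                   # cap pooled connections well below ulimit
                    enable_cleanup_closed=True, # reap half-closed transports
                )
            )
        return self._session

    async def post(self, url: str, payload: dict) -> dict:
        session = await self._get_session()
        # `async with` releases the connection back to the pool even if
        # reading or parsing the response raises.
        async with session.post(url, json=payload) as resp:
            resp.raise_for_status()
            return await resp.json()

    async def aclose(self) -> None:
        # Hook this into the gateway's shutdown path.
        if self._session is not None and not self._session.closed:
            await self._session.close()
```

If per-request sessions turn out to be unavoidable (e.g. created inside the SDK), wrapping each one in `async with aiohttp.ClientSession() as session:` at least guarantees its sockets are closed when the block exits.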
## Related
This is the same root cause as the QQ Bot 4009 "Session timed out" disconnects -- fd exhaustion prevents the adapter from maintaining WebSocket heartbeats.