
[Bug]: CLOSE_WAIT fd leak causes all platforms to stop responding after ~1-2 hours #18451

@h382110229

Description
All messaging platforms (QQ Bot, Feishu/Lark) stop responding simultaneously after ~1-2 hours of uptime. The root cause is a file descriptor leak -- TCP sockets accumulate in the CLOSE_WAIT state until the process hits macOS's default soft limit of 256 open files.

Symptoms

  • QQ Bot and Feishu both go silent at the same time
  • Logs show: [Errno 24] Too many open files
  • Feishu SDK retries fail: connect failed, err: [Errno 24] Too many open files
  • lsof -p <PID> | grep CLOSE_WAIT shows 200+ leaked connections
  • kanban notifier tick failed: unable to open database file (secondary effect)

Root Cause

The Feishu SDK (lark-oapi) and/or the HTTP client used for API calls leak TCP connections. Each API call (LLM requests, tool calls, message sending) leaves a socket in CLOSE_WAIT state. After ~200 leaked connections, the process exhausts its file descriptor limit (macOS soft limit: 256), causing all new network requests to fail -- not just Feishu, but also QQ Bot, LLM API calls, MCP servers, etc.
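The mechanics can be demonstrated with a stdlib-only sketch (an illustration of fd retention, not the Hermes code path): a socket whose peer has already closed stays allocated until the local process calls close() -- for TCP, that is exactly the CLOSE_WAIT state.

```python
import socket

# Illustration only: when the remote end closes first and the local end
# never calls close(), the local TCP socket sits in CLOSE_WAIT and its
# file descriptor stays allocated, counting against `ulimit -n`.
local, remote = socket.socketpair()
remote.close()                   # the peer hangs up
eof = local.recv(1024)           # b'' -> peer closed, but our fd is still open
still_open_fd = local.fileno()   # >= 0: the descriptor is still consumed
local.close()                    # releasing it is the step the leak skips
```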

Environment

  • Hermes version: v0.12.0 (2026.4.30)
  • Python: 3.14.4
  • OS: macOS (Apple Silicon)
  • Platforms: QQ Bot + Feishu (both affected)
  • Warp: Enabled (all CLOSE_WAIT connections route through connectivity-check.warp-svc)

Reproduction Steps

  1. Start gateway with QQ Bot + Feishu platforms
  2. Wait 1-2 hours with moderate message traffic
  3. lsof -p <PID> | grep CLOSE_WAIT | wc -l grows steadily
  4. When CLOSE_WAIT > 200, all platforms stop responding
  5. hermes gateway restart fixes it temporarily
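Step 3 above can be scripted; a minimal sketch that counts CLOSE_WAIT sockets by parsing `lsof` output (the helper names are hypothetical):

```python
import subprocess

def count_close_wait(lsof_text: str) -> int:
    """Count CLOSE_WAIT sockets in `lsof -p <PID> -i TCP` output."""
    return sum(1 for line in lsof_text.splitlines() if "CLOSE_WAIT" in line)

def close_wait_for_pid(pid: int) -> int:
    """Run lsof for one PID and count its CLOSE_WAIT sockets."""
    out = subprocess.run(
        ["lsof", "-p", str(pid), "-i", "TCP"],
        capture_output=True, text=True,
    ).stdout
    return count_close_wait(out)
```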

Evidence

$ lsof -p <PID> -i TCP | grep CLOSE_WAIT | wc -l
215

$ lsof -p <PID> | wc -l
353

$ ulimit -n
256

# All CLOSE_WAIT connections route through Warp proxy:
$ lsof -p <PID> -i TCP | grep CLOSE_WAIT | awk '{print $9}' | head -5
connectivity-check.warp-svc:57890->43.159.235.46:https
connectivity-check.warp-svc:57885->43.159.235.46:https
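The 256 soft limit shown by `ulimit -n` can also be inspected (and raised, up to the hard limit) from inside the process via the stdlib `resource` module; a sketch:

```python
import resource

# Mirror of `ulimit -n`: the per-process soft/hard fd limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raising the soft limit buys headroom but does not fix the leak itself;
# uncomment to raise it (4096 is an illustrative value, capped by `hard`):
# resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```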

Workaround (currently in place)

  • Cron job monitoring CLOSE_WAIT count every 10 minutes
  • Auto-restart when CLOSE_WAIT > 30 or FD count > 200
  • Increased macOS fd limit via launchd SoftResourceLimits
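The monitor's decision rule can be sketched as follows (thresholds taken from the workaround above; the function names are hypothetical, and `/dev/fd` is used for fd counting since it exists on both macOS and Linux):

```python
import os

CLOSE_WAIT_LIMIT = 30   # restart when CLOSE_WAIT sockets exceed this
FD_LIMIT = 200          # ...or when total open fds exceed this

def open_fd_count() -> int:
    # /dev/fd lists this process's open descriptors on macOS and Linux.
    return len(os.listdir("/dev/fd"))

def should_restart(close_wait_count: int, fd_count: int) -> bool:
    """Mirror of the cron check: restart on either threshold."""
    return close_wait_count > CLOSE_WAIT_LIMIT or fd_count > FD_LIMIT
```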

Suggested Fix

The underlying issue is in the HTTP connection pool management. Connections should be properly closed/released after each API call, not left in CLOSE_WAIT state. This likely requires:

  1. Ensuring aiohttp sessions are properly closed after each request
  2. Or adding connection pool cleanup/reaping logic
  3. Investigating whether Warp proxy exacerbates the issue (all leaked connections go through Warp)
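Item 2 could look something like the following stdlib-only sketch (a hypothetical reaper, not existing Hermes or lark-oapi code): track each connection's last-use time and close anything idle past a deadline, so half-closed sockets are released instead of lingering in CLOSE_WAIT.

```python
import time

class IdleConnectionReaper:
    """Hypothetical pool-cleanup helper: closes connections idle too long.

    Anything with a .close() method can be registered; closing releases
    the fd so the kernel can finish the TCP teardown instead of leaving
    the socket parked in CLOSE_WAIT.
    """

    def __init__(self, max_idle: float = 60.0, clock=time.monotonic):
        self.max_idle = max_idle
        self.clock = clock          # injectable for testing
        self._last_used: dict = {}  # connection -> last-use timestamp

    def touch(self, conn) -> None:
        """Record that `conn` was just used."""
        self._last_used[conn] = self.clock()

    def reap(self) -> int:
        """Close every connection idle longer than max_idle; return count."""
        now = self.clock()
        stale = [c for c, t in self._last_used.items()
                 if now - t > self.max_idle]
        for conn in stale:
            conn.close()
            del self._last_used[conn]
        return len(stale)
```

For item 1, the usual aiohttp pattern is to create a single ClientSession at startup and `await session.close()` at shutdown, rather than constructing one per request.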

Related

This is the same root cause as the QQ Bot 4009 "Session timed out" disconnects -- fd exhaustion prevents the adapter from maintaining WebSocket heartbeats.

Metadata


    Labels

    P1 (High — major feature broken, no workaround)
    comp/gateway (Gateway runner, session dispatch, delivery)
    platform/feishu (Feishu / Lark adapter)
    type/bug (Something isn't working)
