fix: abort hung SDK queries with AbortController idle timeout#1572

Closed
IYENTeam wants to merge 1 commit into qwibitai:main from IYENTeam:fix/query-idle-timeout

Conversation


@IYENTeam IYENTeam commented Apr 1, 2026

Problem

When the SDK query() hangs — due to rate limits, network failures, or any other reason — the container becomes unresponsive. The host's container-level timeout (default 30 min) resets on every output marker, so if a rate limit notice is emitted right before the hang, the timer resets and the user waits another 30 minutes with no response.

The host's existing retry logic (exponential backoff in GroupQueue) only triggers when the container exits with an error. A hung container never exits, so retries never fire.

Alternatives considered

Parse rate limit text from Claude output (#670)

Regex-parses "resets Xam (Timezone)" from the result text. Fragile — breaks if the output format changes. Also only covers rate limits, not network failures, SDK bugs, or any other hang scenario.

Host-side input→output timeout

Track when the host sends a message and kill the container if no output arrives within N minutes. The host can't distinguish a legitimately long tool execution (e.g. a build) from a hang — both look like silence.

Heartbeat from agent-runner

Emit periodic heartbeats to stderr. The SDK's for-await loop blocks while waiting for the next message — which is exactly when the hang occurs — so there is no opportunity to emit a heartbeat. A worker thread could work, but it adds complexity for no benefit over AbortController.

Solution

Use the SDK's built-in abortController option to set a per-query idle timeout. If no SDK messages arrive within 5 minutes (configurable via QUERY_IDLE_TIMEOUT env var), the query is aborted cleanly. The container exits with an error, and the host's existing retry logic handles the rest.

This works because SDK messages stream continuously during normal operation — tool invocations, intermediate output, system events all produce messages. Five minutes of complete silence means the query is stuck, regardless of the cause.

Changes

  • Add QUERY_IDLE_TIMEOUT_MS constant (default 5 min, env-configurable)
  • Create AbortController per query, pass to SDK query() options
  • Reset idle timer on every SDK message
  • On timeout: abort query, emit error output, exit container
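
The bullets above amount to a small wrapper around the message loop. The sketch below is self-contained and illustrative only — a stub async generator stands in for the SDK's message stream, and names like `consumeWithIdleTimeout` are invented here. In the real change the AbortController is passed to the SDK `query()` options, so an abort makes the for-await loop throw; the stub reproduces that by listening to the signal itself.

```typescript
// Reset an idle timer on every message; if the timer fires, abort.
// In agent-runner the AbortController would be handed to the SDK's
// query() options — here the stub stream respects the signal directly.
async function consumeWithIdleTimeout<T>(
  stream: AsyncIterable<T>,
  controller: AbortController,
  idleMs: number,
  onMessage: (msg: T) => void,
): Promise<void> {
  let timer = setTimeout(() => controller.abort(), idleMs);
  try {
    for await (const msg of stream) {
      clearTimeout(timer); // any SDK message proves the query is alive
      timer = setTimeout(() => controller.abort(), idleMs);
      onMessage(msg);
    }
  } finally {
    clearTimeout(timer);
  }
}

// Stub "SDK" stream: two quick messages, then a simulated hang that
// only ends when the abort signal fires (as a wedged network call would).
async function* stubStream(signal: AbortSignal): AsyncGenerator<string> {
  yield "msg-1";
  yield "msg-2";
  await new Promise<void>((_, reject) =>
    signal.addEventListener("abort", () => reject(new Error("aborted")), {
      once: true,
    }),
  );
  yield "never-reached";
}

(async () => {
  const controller = new AbortController();
  const received: string[] = [];
  try {
    await consumeWithIdleTimeout(
      stubStream(controller.signal),
      controller,
      100,
      (m) => received.push(m),
    );
  } catch {
    // prints: idle timeout after 2 messages; aborted=true
    console.log(
      `idle timeout after ${received.length} messages; aborted=${controller.signal.aborted}`,
    );
  }
})();
```

The exit-with-error step is omitted here; in the real container the catch branch would emit an error marker and `process.exit(1)` so the host's retry logic takes over.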

What this covers

  • Rate limit hangs
  • Network failures
  • SDK internal issues
  • Any future cause of query hangs

What this doesn't change

  • No host-side code changes
  • Container-level timeout preserved as safety net
  • Existing retry logic (GroupQueue exponential backoff) reused as-is

Test plan

  • Deploy and trigger a rate limit to confirm the container aborts within 5 min instead of hanging for 30 min
  • Verify normal long-running tasks (multi-step tool use, large file operations) complete without being aborted
  • Confirm host retries the message after container exits

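The second test-plan item — long-running legitimate work must survive the idle window — hinges on the timer resetting on every message. A minimal self-contained simulation of that behavior (illustrative timings only, no real SDK involved):

```typescript
// A "long task" emitting a message every 50 ms for 300 ms total.
// Each message resets a 100 ms idle timer, so even though the task
// runs 3x longer than the idle window, the abort never fires.
async function simulateLongTask(): Promise<boolean> {
  const controller = new AbortController();
  const idleMs = 100;
  let timer = setTimeout(() => controller.abort(), idleMs);
  for (let i = 0; i < 6; i++) {
    await new Promise((resolve) => setTimeout(resolve, 50)); // next message
    clearTimeout(timer); // reset on every message
    timer = setTimeout(() => controller.abort(), idleMs);
  }
  clearTimeout(timer);
  return controller.signal.aborted;
}

simulateLongTask().then((aborted) =>
  // prints: completed without abort
  console.log(aborted ? "aborted (unexpected)" : "completed without abort"),
);
```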
Closes #186

morrowgarrett added a commit to morrowgarrett/nanoclaw that referenced this pull request Apr 1, 2026
#1 AbortController idle timeout (PR qwibitai#1572):
- Aborts hung SDK queries after 5min of no messages
- Configurable via QUERY_IDLE_TIMEOUT env var
- Container exits with error for host retry

qwibitai#2 Session JSONL rotation (PR qwibitai#700):
- Rotates session files exceeding 5MB
- Prevents container timeouts from session bloat
- Auto-creates fresh session on rotation

qwibitai#3 Per-group .mcp.json config (PR qwibitai#1515):
- Groups can define MCP servers in .mcp.json
- Servers auto-discovered and tools auto-allowed
- No code changes needed to add group-specific MCP

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jorgenclaw

Test Report — PR #1572

Environment:

  • Pop!_OS Linux (bare-metal), Surface Pro 7+, Intel Iris Xe
  • NanoClaw v1.2.43 with Signal, Nostr DM, and WhiteNoise channels active
  • Docker 28.x, node:22-slim container base, container rebuilt with this fix

What we tested:

  • Applied to our production install alongside 5 other upstream bug fixes
  • Container image rebuilt to include the agent-runner change
  • Full build: clean, zero TypeScript errors
  • Full test suite: 486/486 tests pass
  • Service restarted and running in production

Result: ✅ Pass

  • The 5-minute AbortController timeout prevents containers from hanging indefinitely when the Claude SDK stops emitting messages
  • Previously, a hung SDK query would leave the container running forever — consuming RAM and blocking the group queue
  • The timer resets on every SDK message, so long-running legitimate conversations are not affected
  • Configurable via QUERY_IDLE_TIMEOUT env var — good for operators who want different thresholds

Notes:

  • 24-line change in agent-runner, clean and self-contained
  • Requires container rebuild to take effect (this is agent-runner code, not host-side)
  • No regressions observed

Tested by @jorgenclaw (Scott Jorgensen + Claude Code)

@IYENTeam IYENTeam closed this by deleting the head repository Apr 9, 2026
jorgenclaw pushed a commit to jorgenclaw/nanoclaw that referenced this pull request Apr 14, 2026
…ity, calendar multi-user

Upstream bug fixes applied (all tested, 486/486 pass):
- qwibitai#1575: always write _close sentinel in notifyIdle
- qwibitai#1567: evict idle task containers when messages arrive
- qwibitai#1576: prevent message loss when agent outputs mid-query
- qwibitai#1566: resilient channel connect with background retry
- qwibitai#1572: abort hung SDK queries with AbortController (5min timeout)
- qwibitai#1519: IPC size guards (1MB), message truncation (50K), orphan task cleanup

Nostr signer security upgrade:
- Per-session scoped tokens with TTL and event-kind restrictions
- Rate limiting: 5/10s burst, 10/min, 100/hr per session
- Alert logging to signer-alerts.log
- Backwards compatible: legacy calls still work with deprecation warning
- 14/14 security test suite passes

Other improvements:
- Relay diversity: clawstr-post now publishes to 5 relays
- Calendar multi-user auth: scott user works alongside jorgenclaw
- Email skill added to group CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>