fix: abort hung SDK queries with AbortController idle timeout #1572
Closed
IYENTeam wants to merge 1 commit into qwibitai:main from
morrowgarrett added a commit to morrowgarrett/nanoclaw that referenced this pull request Apr 1, 2026

#1 AbortController idle timeout (PR qwibitai#1572):
- Aborts hung SDK queries after 5min of no messages
- Configurable via QUERY_IDLE_TIMEOUT env var
- Container exits with error for host retry

qwibitai#2 Session JSONL rotation (PR qwibitai#700):
- Rotates session files exceeding 5MB
- Prevents container timeouts from session bloat
- Auto-creates fresh session on rotation

qwibitai#3 Per-group .mcp.json config (PR qwibitai#1515):
- Groups can define MCP servers in .mcp.json
- Servers auto-discovered and tools auto-allowed
- No code changes needed to add group-specific MCP

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test Report: PR #1572
Environment:
What we tested:
Result: ✅ Pass
Notes:
Tested by @jorgenclaw (Scott Jorgensen + Claude Code)
jorgenclaw pushed a commit to jorgenclaw/nanoclaw that referenced this pull request Apr 14, 2026

…ity, calendar multi-user

Upstream bug fixes applied (all tested, 486/486 pass):
- qwibitai#1575: always write _close sentinel in notifyIdle
- qwibitai#1567: evict idle task containers when messages arrive
- qwibitai#1576: prevent message loss when agent outputs mid-query
- qwibitai#1566: resilient channel connect with background retry
- qwibitai#1572: abort hung SDK queries with AbortController (5min timeout)
- qwibitai#1519: IPC size guards (1MB), message truncation (50K), orphan task cleanup

Nostr signer security upgrade:
- Per-session scoped tokens with TTL and event-kind restrictions
- Rate limiting: 5/10s burst, 10/min, 100/hr per session
- Alert logging to signer-alerts.log
- Backwards compatible: legacy calls still work with deprecation warning
- 14/14 security test suite passes

Other improvements:
- Relay diversity: clawstr-post now publishes to 5 relays
- Calendar multi-user auth: scott user works alongside jorgenclaw
- Email skill added to group CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem
When the SDK `query()` hangs (due to rate limits, network failures, or any other reason), the container becomes unresponsive. The host's container-level timeout (default 30 min) resets on every output marker, so if a rate limit notice is emitted right before the hang, the timer resets and the user waits another 30 minutes with no response.
The host's existing retry logic (exponential backoff in GroupQueue) only triggers when the container exits with an error. A hung container never exits, so retries never fire.
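The timer reset described above comes down to simple arithmetic. The sketch below is a simplified model, not the host's actual code: `hostKillTimeMs` is a hypothetical function that assumes the container is killed a fixed interval after the most recent output marker.

```typescript
// Simplified model (hypothetical, not the host's real implementation):
// the container is killed `timeoutMs` after the latest output marker,
// or `timeoutMs` after start if no marker was ever emitted.
function hostKillTimeMs(
  startMs: number,
  markerTimesMs: number[],
  timeoutMs: number,
): number {
  const lastActivityMs = markerTimesMs.reduce(
    (latest, t) => Math.max(latest, t),
    startMs,
  );
  return lastActivityMs + timeoutMs;
}

const MINUTE_MS = 60_000;

// A rate limit notice is emitted at minute 29, then the query hangs.
const killAtMs = hostKillTimeMs(0, [29 * MINUTE_MS], 30 * MINUTE_MS);
console.log(killAtMs / MINUTE_MS); // 59
```

In this model a single marker just before the hang pushes the kill from minute 30 to minute 59, and because the container has not exited in the meantime, the retry logic has nothing to react to.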
Alternatives considered
Parse rate limit text from Claude output (#670)
Regex-parses `"resets Xam (Timezone)"` from the result text. Fragile: it breaks if the output format changes. It also covers only rate limits, not network failures, SDK bugs, or any other hang scenario.
Host-side input→output timeout
Track when the host sends a message and kill the container if no output arrives within N minutes. The host can't distinguish a legitimately long tool execution (e.g. a build) from a hang — both look like silence.
Heartbeat from agent-runner
Emit periodic heartbeats to stderr. The SDK's `for await` loop suspends until the next message arrives, so code inside the loop has no opportunity to emit a heartbeat during the wait, which is exactly when the hang occurs. A worker thread could work but adds complexity for no benefit over AbortController.
Solution
Use the SDK's built-in `abortController` option to set a per-query idle timeout. If no SDK messages arrive within 5 minutes (configurable via the `QUERY_IDLE_TIMEOUT` env var), the query is aborted cleanly. The container exits with an error, and the host's existing retry logic handles the rest.
This works because SDK messages stream continuously during normal operation: tool invocations, intermediate output, and system events all produce messages. Five minutes of complete silence means the query is stuck, regardless of the cause.
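The idle timeout described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual diff: `parseIdleTimeoutMs`, `withIdleTimeout`, and the `start` callback are hypothetical names, and treating the `QUERY_IDLE_TIMEOUT` value as milliseconds is an assumption. In the agent runner, `start` would presumably be a closure that calls the SDK's `query()` with the controller passed as its `abortController` option.

```typescript
// Hypothetical sketch; names and structure are illustrative, not from the PR diff.
const DEFAULT_IDLE_TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes, the default stated in the PR

// Assumption: the env value is a positive integer number of milliseconds.
// Falls back to the default when the value is unset or invalid.
function parseIdleTimeoutMs(raw: string | undefined): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_IDLE_TIMEOUT_MS;
}

// Re-yields every message from the stream returned by `start` and aborts the
// controller when no message has arrived for `idleMs`.
async function* withIdleTimeout<T>(
  start: (controller: AbortController) => AsyncIterable<T>,
  idleMs: number = DEFAULT_IDLE_TIMEOUT_MS,
): AsyncGenerator<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // (Re)start the idle countdown.
  const arm = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => controller.abort(), idleMs);
  };

  arm();
  try {
    for await (const message of start(controller)) {
      arm(); // any message counts as progress
      yield message;
    }
    // If the stream ended quietly because of the abort, still report a failure.
    if (controller.signal.aborted) {
      throw new Error("query aborted: idle timeout exceeded");
    }
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Whether an abort surfaces as a rejected promise or as the stream simply ending depends on how the SDK reacts to the signal. The sketch treats both as errors, so the caller can exit non-zero and leave the retry to the host.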
Changes
- `QUERY_IDLE_TIMEOUT_MS` constant (default 5 min, env-configurable)
- `AbortController` per query, pass to SDK `query()` options
What this covers
What this doesn't change
Test plan
Closes #186