fix: abort hung SDK queries with AbortController idle timeout#1572

Closed
IYENTeam wants to merge 1 commit into qwibitai:main from IYENTeam:fix/query-idle-timeout

Conversation


@IYENTeam IYENTeam commented Apr 1, 2026

Problem

When the SDK query() hangs — due to rate limits, network failures, or any other reason — the container becomes unresponsive. The host's container-level timeout (default 30 min) resets on every output marker, so if a rate limit notice is emitted right before the hang, the timer resets and the user waits another 30 minutes with no response.

The host's existing retry logic (exponential backoff in GroupQueue) only triggers when the container exits with an error. A hung container never exits, so retries never fire.

Alternatives considered

Parse rate limit text from Claude output (#670)

Regex-parses "resets Xam (Timezone)" from the result text. Fragile — breaks if the output format changes. Also only covers rate limits, not network failures, SDK bugs, or any other hang scenario.

Host-side input→output timeout

Track when the host sends a message and kill the container if no output arrives within N minutes. The host can't distinguish a legitimately long tool execution (e.g. a build) from a hang — both look like silence.

Heartbeat from agent-runner

Emit periodic heartbeats to stderr. The SDK's for-await loop blocks while waiting for the next message — which is exactly when the hang occurs — so there is no opportunity to emit a heartbeat. A worker thread could work, but it adds complexity for no benefit over AbortController.

Solution

Use the SDK's built-in abortController option to set a per-query idle timeout. If no SDK messages arrive within 5 minutes (configurable via QUERY_IDLE_TIMEOUT env var), the query is aborted cleanly. The container exits with an error, and the host's existing retry logic handles the rest.

This works because SDK messages stream continuously during normal operation — tool invocations, intermediate output, system events all produce messages. Five minutes of complete silence means the query is stuck, regardless of the cause.

Changes

  • Add QUERY_IDLE_TIMEOUT_MS constant (default 5 min, env-configurable)
  • Create AbortController per query, pass to SDK query() options
  • Reset idle timer on every SDK message
  • On timeout: abort query, emit error output, exit container
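
The bullets above amount to a small wrapper around the message loop. The sketch below is self-contained and illustrative only — a stub async generator stands in for the SDK's message stream, and names like `consumeWithIdleTimeout` are invented here. In the real change the AbortController is passed to the SDK `query()` options, so an abort makes the for-await loop throw; the stub reproduces that by listening to the signal itself.

```typescript
// Reset an idle timer on every message; if the timer fires, abort.
// In agent-runner the AbortController would be handed to the SDK's
// query() options — here the stub stream respects the signal directly.
async function consumeWithIdleTimeout<T>(
  stream: AsyncIterable<T>,
  controller: AbortController,
  idleMs: number,
  onMessage: (msg: T) => void,
): Promise<void> {
  let timer = setTimeout(() => controller.abort(), idleMs);
  try {
    for await (const msg of stream) {
      clearTimeout(timer); // any SDK message proves the query is alive
      timer = setTimeout(() => controller.abort(), idleMs);
      onMessage(msg);
    }
  } finally {
    clearTimeout(timer);
  }
}

// Stub "SDK" stream: two quick messages, then a simulated hang that
// only ends when the abort signal fires (as a wedged network call would).
async function* stubStream(signal: AbortSignal): AsyncGenerator<string> {
  yield "msg-1";
  yield "msg-2";
  await new Promise<void>((_, reject) =>
    signal.addEventListener("abort", () => reject(new Error("aborted")), {
      once: true,
    }),
  );
  yield "never-reached";
}

(async () => {
  const controller = new AbortController();
  const received: string[] = [];
  try {
    await consumeWithIdleTimeout(
      stubStream(controller.signal),
      controller,
      100,
      (m) => received.push(m),
    );
  } catch {
    // prints: idle timeout after 2 messages; aborted=true
    console.log(
      `idle timeout after ${received.length} messages; aborted=${controller.signal.aborted}`,
    );
  }
})();
```

The exit-with-error step is omitted here; in the real container the catch branch would emit an error marker and `process.exit(1)` so the host's retry logic takes over.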

What this covers

  • Rate limit hangs
  • Network failures
  • SDK internal issues
  • Any future cause of query hangs

What this doesn't change

  • No host-side code changes
  • Container-level timeout preserved as safety net
  • Existing retry logic (GroupQueue exponential backoff) reused as-is

Test plan

  • Deploy and trigger a rate limit to confirm the container aborts within 5 min instead of hanging for 30 min
  • Verify normal long-running tasks (multi-step tool use, large file operations) complete without being aborted
  • Confirm host retries the message after container exits

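The second test-plan item — long-running legitimate work must survive the idle window — hinges on the timer resetting on every message. A minimal self-contained simulation of that behavior (illustrative timings only, no real SDK involved):

```typescript
// A "long task" emitting a message every 50 ms for 300 ms total.
// Each message resets a 100 ms idle timer, so even though the task
// runs 3x longer than the idle window, the abort never fires.
async function simulateLongTask(): Promise<boolean> {
  const controller = new AbortController();
  const idleMs = 100;
  let timer = setTimeout(() => controller.abort(), idleMs);
  for (let i = 0; i < 6; i++) {
    await new Promise((resolve) => setTimeout(resolve, 50)); // next message
    clearTimeout(timer); // reset on every message
    timer = setTimeout(() => controller.abort(), idleMs);
  }
  clearTimeout(timer);
  return controller.signal.aborted;
}

simulateLongTask().then((aborted) =>
  // prints: completed without abort
  console.log(aborted ? "aborted (unexpected)" : "completed without abort"),
);
```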
Closes #186

morrowgarrett added a commit to morrowgarrett/nanoclaw that referenced this pull request Apr 1, 2026
#1 AbortController idle timeout (PR qwibitai#1572):
- Aborts hung SDK queries after 5min of no messages
- Configurable via QUERY_IDLE_TIMEOUT env var
- Container exits with error for host retry

qwibitai#2 Session JSONL rotation (PR qwibitai#700):
- Rotates session files exceeding 5MB
- Prevents container timeouts from session bloat
- Auto-creates fresh session on rotation

qwibitai#3 Per-group .mcp.json config (PR qwibitai#1515):
- Groups can define MCP servers in .mcp.json
- Servers auto-discovered and tools auto-allowed
- No code changes needed to add group-specific MCP

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jorgenclaw

Test Report — PR #1572

Environment:

  • Pop!_OS Linux (bare-metal), Surface Pro 7+, Intel Iris Xe
  • NanoClaw v1.2.43 with Signal, Nostr DM, and WhiteNoise channels active
  • Docker 28.x, node:22-slim container base, container rebuilt with this fix

What we tested:

  • Applied to our production install alongside 5 other upstream bug fixes
  • Container image rebuilt to include the agent-runner change
  • Full build: clean, zero TypeScript errors
  • Full test suite: 486/486 tests pass
  • Service restarted and running in production

Result: ✅ Pass

  • The 5-minute AbortController timeout prevents containers from hanging indefinitely when the Claude SDK stops emitting messages
  • Previously, a hung SDK query would leave the container running forever — consuming RAM and blocking the group queue
  • The timer resets on every SDK message, so long-running legitimate conversations are not affected
  • Configurable via QUERY_IDLE_TIMEOUT env var — good for operators who want different thresholds

Notes:

  • 24-line change in agent-runner, clean and self-contained
  • Requires container rebuild to take effect (this is agent-runner code, not host-side)
  • No regressions observed

Tested by @jorgenclaw (Scott Jorgensen + Claude Code)

@IYENTeam IYENTeam closed this by deleting the head repository Apr 9, 2026
jorgenclaw pushed a commit to jorgenclaw/nanoclaw that referenced this pull request Apr 14, 2026
…ity, calendar multi-user

Upstream bug fixes applied (all tested, 486/486 pass):
- qwibitai#1575: always write _close sentinel in notifyIdle
- qwibitai#1567: evict idle task containers when messages arrive
- qwibitai#1576: prevent message loss when agent outputs mid-query
- qwibitai#1566: resilient channel connect with background retry
- qwibitai#1572: abort hung SDK queries with AbortController (5min timeout)
- qwibitai#1519: IPC size guards (1MB), message truncation (50K), orphan task cleanup

Nostr signer security upgrade:
- Per-session scoped tokens with TTL and event-kind restrictions
- Rate limiting: 5/10s burst, 10/min, 100/hr per session
- Alert logging to signer-alerts.log
- Backwards compatible: legacy calls still work with deprecation warning
- 14/14 security test suite passes

Other improvements:
- Relay diversity: clawstr-post now publishes to 5 relays
- Calendar multi-user auth: scott user works alongside jorgenclaw
- Email skill added to group CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>