fix(container): stderr-aware idle timeout (fix chronic Email intelligence alerts) by topcoder1 · Pull Request #43 · topcoder1/nanoclaw

topcoder1 · 2026-04-28T16:49:42Z

Summary

Triage of the recurring ⚠️ Email intelligence trigger failed. Check logs. alerts. 113 container timeouts in logs/nanoclaw.error.log, firing a few times an hour during agent-heavy periods.

Root cause

The container liveness check only resets on stdout OUTPUT_MARKER chunks (user-facing emissions). The Claude SDK streams tool-call debug logs to stderr while the agent is actively working — /recall calls, file reads, MCP probes, web research. The previous comment at line 1004 explicitly said "Don't reset on stderr — SDK writes debug logs continuously" — that's the bug-as-design.

If an email trigger required 30+ min of tool calls before the agent's first user-facing reply (deep research, complex triage), the 30-min idle timer fired and killed the container. Agent's work was lost; user got a generic "trigger failed" alert.

Fix

Stderr-aware idle detection in src/container-runner.ts:

Track lastStdoutAt + lastStderrAt. Stderr handler now updates lastStderrAt on every chunk.
Replace single setTimeout with setInterval liveness check (cadence min(timeoutMs/10, 60s)).
Kill conditions:
1. Idle: stdout idle > timeoutMs AND stderr idle > 5 min (both quiet → genuinely hung), OR
2. Hard cap: total runtime > max(timeoutMs * 2, 60min) — bounds runaway-but-noisy agents.

Stderr churn within the last 5 min is treated as "alive but doing internal work" — the agent stays running.

Tests

Test	New	What it asserts
Container alive past IDLE_TIMEOUT when stderr active	✅	regression for the bug
HARD_CAP_MS fires even with continuous stderr churn	✅	runaway containment
IDLE_TIMEOUT fires when both streams quiet	✅	real-hang detection still works
(existing) timeout with no output → error	adjusted	advance one extra interval-tick past grace

Full suite: 2447/2447 green, typecheck clean.

The 5-min STDERR_GRACE_MS constant could become an env override in a follow-up if real workloads hit it, but starting conservative.

🤖 Generated with Claude Code

…ts alive Chronic 'Email intelligence trigger failed' alerts (113 timeouts in the error log over the past few days) traced to a stdout-only liveness check. The Claude SDK writes tool-call debug logs to stderr while the agent is doing real work — deep research, multiple /recall calls, file reads, MCP probes. Previous code only reset the 30-min idle timer on stdout OUTPUT_MARKER chunks (user-facing emissions), so an agent doing 30+ min of internal tool calls before its first reply got killed despite being alive throughout. Changes in src/container-runner.ts: - Track lastStdoutAt + lastStderrAt timestamps. Stderr handler now updates lastStderrAt on every chunk (was: ignored entirely). - Replace the single setTimeout with a setInterval-driven liveness check (cadence min(timeoutMs/10, 60_000)). Kills only when: • stdout idle > timeoutMs AND stderr idle > 5min, OR • total runtime > HARD_CAP_MS (max(timeoutMs * 2, 60min)) - Hard cap bounds runaway agents whose stderr never goes quiet — can't keep a noisy-but-stuck container alive forever. - clearTimeout → clearInterval at the close + error sites. 3 new tests in container-runner.test.ts: • Container alive past IDLE_TIMEOUT when stderr is active (regression) • HARD_CAP_MS fires even with continuous stderr churn • IDLE_TIMEOUT fires when both streams quiet (real-hang case) Plus loggerMock hoisted via vi.hoisted so tests can introspect the 'Container timeout, stopping gracefully' error log as the canonical kill signal. Existing 'timeout with no output' test updated to advance one extra interval-tick past IDLE_TIMEOUT + 30s grace (the interval-based check fires up to 60s late vs the old single-shot timer). Full test suite green: 2447/2447.

…st (#44) Two follow-ups to PR #43: 1. Add CONTAINER_STDERR_GRACE_MS env override in src/config.ts. Default 300000 (5min). container-runner reads from config instead of the local hardcoded constant. Lets ops bump the grace window if a workload regularly produces longer stderr-quiet stretches (e.g. an MCP tool that blocks for 6+ min on a slow upstream) without recompiling. 2. New test in container-runner.test.ts: 'recovers when stderr goes briefly silent then resumes (gap shorter than stdout idle)'. The reviewer-flagged edge case — stderr active, then 6min silence (past 5min grace), then stderr resumes for another 30min before stdout emits. Confirms the kill condition correctly requires BOTH stdout idle > timeoutMs AND stderr idle > grace; an isolated stderr gap within the stdout-idle window does NOT kill. Full suite: 2448/2448 green (was 2447, +1 new test).

topcoder1 merged commit 8b6ccd4 into main Apr 28, 2026

topcoder1 deleted the claude/container-timeout-fix branch April 28, 2026 16:51

topcoder1 mentioned this pull request Apr 28, 2026

chore(container): STDERR_GRACE_MS env override + recovery test #44

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(container): stderr-aware idle timeout (fix chronic Email intelligence alerts)#43

fix(container): stderr-aware idle timeout (fix chronic Email intelligence alerts)#43
topcoder1 merged 1 commit into
mainfrom
claude/container-timeout-fix

topcoder1 commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

topcoder1 commented Apr 28, 2026

Summary

Root cause

Fix

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant