fix(container): stderr-aware idle timeout (fix chronic Email intelligence alerts) #43
Merged
…ts alive
Chronic 'Email intelligence trigger failed' alerts (113 timeouts in
the error log over the past few days) traced to a stdout-only liveness
check. The Claude SDK writes tool-call debug logs to stderr while the
agent is doing real work — deep research, multiple /recall calls,
file reads, MCP probes. Previous code only reset the 30-min idle
timer on stdout OUTPUT_MARKER chunks (user-facing emissions), so an
agent doing 30+ min of internal tool calls before its first reply got
killed despite being alive throughout.
Changes in src/container-runner.ts:
- Track lastStdoutAt + lastStderrAt timestamps. Stderr handler now
updates lastStderrAt on every chunk (was: ignored entirely).
- Replace the single setTimeout with a setInterval-driven liveness
check (cadence min(timeoutMs/10, 60_000)). Kills only when:
• stdout idle > timeoutMs AND stderr idle > 5min, OR
• total runtime > HARD_CAP_MS (max(timeoutMs * 2, 60min))
- Hard cap bounds runaway agents whose stderr never goes quiet —
can't keep a noisy-but-stuck container alive forever.
- clearTimeout → clearInterval at the close + error sites.
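The kill rule in the list above can be sketched as a pure function. This is a minimal illustration, not the code in src/container-runner.ts: the names LivenessState, Verdict, evaluate, and hardCapMs are hypothetical, and only the thresholds (5 min stderr grace, max(timeoutMs * 2, 60min) hard cap) come from this commit message. In the runner, a function like this would presumably be called from the setInterval callback.

```typescript
// Hypothetical sketch of the liveness decision. Names are illustrative;
// only the thresholds are taken from the commit message.
const MIN = 60_000;
const STDERR_GRACE_MS = 5 * MIN;

type Verdict = "alive" | "idle-timeout" | "hard-cap";

interface LivenessState {
  startedAt: number;    // when the container was started (ms)
  lastStdoutAt: number; // last stdout OUTPUT_MARKER chunk (ms)
  lastStderrAt: number; // last stderr chunk of any kind (ms)
}

// Hard cap: never shorter than 60 minutes, otherwise twice the idle timeout.
function hardCapMs(timeoutMs: number): number {
  return Math.max(timeoutMs * 2, 60 * MIN);
}

function evaluate(state: LivenessState, now: number, timeoutMs: number): Verdict {
  // Total runtime bound applies even if stderr never goes quiet.
  if (now - state.startedAt > hardCapMs(timeoutMs)) return "hard-cap";

  // Idle kill requires BOTH streams to be quiet.
  const stdoutIdle = now - state.lastStdoutAt > timeoutMs;
  const stderrIdle = now - state.lastStderrAt > STDERR_GRACE_MS;
  return stdoutIdle && stderrIdle ? "idle-timeout" : "alive";
}

// Usage: 35 minutes without stdout, but stderr seen one minute ago.
const working: LivenessState = { startedAt: 0, lastStdoutAt: 0, lastStderrAt: 34 * MIN };
console.log(evaluate(working, 35 * MIN, 30 * MIN));
```

Keeping the decision separate from the timer makes each branch testable with plain timestamps instead of fake timers.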
3 new tests in container-runner.test.ts:
• Container alive past IDLE_TIMEOUT when stderr is active (regression)
• HARD_CAP_MS fires even with continuous stderr churn
• IDLE_TIMEOUT fires when both streams quiet (real-hang case)
Plus loggerMock hoisted via vi.hoisted so tests can introspect the
'Container timeout, stopping gracefully' error log as the canonical
kill signal.
Existing 'timeout with no output' test updated to advance one extra
interval-tick past IDLE_TIMEOUT + 30s grace (the interval-based check
fires up to 60s late vs the old single-shot timer).
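The "up to 60s late" figure follows from the check cadence. The sketch below works through the arithmetic under stated assumptions: ticks land at exact multiples of the interval measured from container start, no output is ever produced, and the comparison is strict (idle > timeoutMs) as written in this commit message. The function names are hypothetical.

```typescript
// Cadence from the commit message: min(timeoutMs / 10, 60_000).
function checkIntervalMs(timeoutMs: number): number {
  return Math.min(timeoutMs / 10, 60_000);
}

// First tick at which idle time strictly exceeds timeoutMs, assuming ticks
// at multiples of the interval and a container that never emits anything.
function firstKillTickMs(timeoutMs: number): number {
  const interval = checkIntervalMs(timeoutMs);
  return (Math.floor(timeoutMs / interval) + 1) * interval;
}

const thirtyMin = 30 * 60_000;
console.log(checkIntervalMs(thirtyMin));             // 60000
console.log(firstKillTickMs(thirtyMin) - thirtyMin); // 60000
```

With a 30-minute timeout the tick at exactly 30 minutes sees idle == timeoutMs, which does not satisfy the strict comparison, so the kill lands one full interval later.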
Full test suite green: 2447/2447.
topcoder1 added a commit that referenced this pull request on Apr 28, 2026
…st (#44)
Two follow-ups to PR #43:
1. Add CONTAINER_STDERR_GRACE_MS env override in src/config.ts. Default 300000 (5min). container-runner reads from config instead of the local hardcoded constant. Lets ops bump the grace window if a workload regularly produces longer stderr-quiet stretches (e.g. an MCP tool that blocks for 6+ min on a slow upstream) without recompiling.
2. New test in container-runner.test.ts: 'recovers when stderr goes briefly silent then resumes (gap shorter than stdout idle)'. The reviewer-flagged edge case: stderr active, then 6min silence (past 5min grace), then stderr resumes for another 30min before stdout emits. Confirms the kill condition correctly requires BOTH stdout idle > timeoutMs AND stderr idle > grace; an isolated stderr gap within the stdout-idle window does NOT kill.
Full suite: 2448/2448 green (was 2447, +1 new test).
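One way an override like CONTAINER_STDERR_GRACE_MS could be read is sketched below. This is an assumption-laden illustration, not the contents of src/config.ts: the function name readStderrGraceMs is hypothetical, and falling back to the default on missing, non-numeric, or non-positive values is a design choice of this sketch that the commit does not describe. Only the variable name and the 300000 default come from the commit message.

```typescript
// Default from the commit message: 300000 ms (5 min).
const DEFAULT_STDERR_GRACE_MS = 300_000;

// Hypothetical helper. Takes the environment as a plain record so it can be
// exercised without touching the real process environment.
function readStderrGraceMs(env: Record<string, string | undefined>): number {
  const raw = env["CONTAINER_STDERR_GRACE_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_STDERR_GRACE_MS;

  const parsed = Number(raw);
  // Assumption of this sketch: reject NaN, Infinity, zero, and negatives.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_STDERR_GRACE_MS;
}

// Usage: an operator raises the window to 7 minutes for a slow MCP tool.
console.log(readStderrGraceMs({ CONTAINER_STDERR_GRACE_MS: "420000" }));
```

In a Node process the argument would typically be process.env.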
Summary
Triage of the recurring "⚠️ Email intelligence trigger failed. Check logs." alerts. 113 container timeouts in logs/nanoclaw.error.log, firing a few times an hour during agent-heavy periods.
Root cause
The container liveness check only resets its idle timer on stdout OUTPUT_MARKER chunks (user-facing emissions). The Claude SDK streams tool-call debug logs to stderr while the agent is actively working: /recall calls, file reads, MCP probes, web research. The previous comment at line 1004 explicitly said "Don't reset on stderr — SDK writes debug logs continuously"; that is the bug-as-design.
If an email trigger required 30+ min of tool calls before the agent's first user-facing reply (deep research, complex triage), the 30-min idle timer fired and killed the container. The agent's work was lost; the user got a generic "trigger failed" alert.
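The failure mode can be replayed on a synthetic timeline. The sketch below is a simplified model, not the runner's code: it checks once per minute, omits the hard cap, and uses hypothetical names (Chunk, simulate). Passing null as the grace window models the old behaviour, where stderr was ignored entirely.

```typescript
const MINUTE = 60_000;

interface Chunk {
  atMs: number;
  stream: "stdout" | "stderr";
}

// Returns the time at which the container would be killed, or null if it
// survives until endMs. Chunks must be sorted by atMs.
function simulate(
  chunks: Chunk[],
  endMs: number,
  timeoutMs: number,
  stderrGraceMs: number | null, // null = old behaviour, stderr ignored
): number | null {
  let lastStdoutAt = 0;
  let lastStderrAt = 0;
  let next = 0;
  for (let now = MINUTE; now <= endMs; now += MINUTE) {
    // Apply every chunk that arrived up to this tick.
    while (next < chunks.length && chunks[next].atMs <= now) {
      const chunk = chunks[next];
      next += 1;
      if (chunk.stream === "stdout") lastStdoutAt = chunk.atMs;
      else lastStderrAt = chunk.atMs;
    }
    const stdoutIdle = now - lastStdoutAt > timeoutMs;
    const stderrIdle = stderrGraceMs === null || now - lastStderrAt > stderrGraceMs;
    if (stdoutIdle && stderrIdle) return now;
  }
  return null;
}

// Agent logs to stderr every minute for 39 minutes, first reply at minute 40.
const timeline: Chunk[] = [];
for (let m = 1; m < 40; m++) timeline.push({ atMs: m * MINUTE, stream: "stderr" });
timeline.push({ atMs: 40 * MINUTE, stream: "stdout" });

console.log(simulate(timeline, 45 * MINUTE, 30 * MINUTE, null));       // killed at minute 31
console.log(simulate(timeline, 45 * MINUTE, 30 * MINUTE, 5 * MINUTE)); // null, survives
```

Under the stdout-only rule the container dies nine minutes before its first reply, even though stderr shows it working the whole time.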
Fix
Stderr-aware idle detection in src/container-runner.ts:
- Track lastStdoutAt + lastStderrAt. Stderr handler now updates lastStderrAt on every chunk.
- Replace the single setTimeout with a setInterval liveness check (cadence min(timeoutMs/10, 60s)).
- Kill only when:
• stdout idle > timeoutMs AND stderr idle > 5 min (both quiet → genuinely hung), OR
• total runtime > HARD_CAP_MS = max(timeoutMs * 2, 60min), which bounds runaway-but-noisy agents.
Stderr churn within the last 5 min is treated as "alive but doing internal work": the agent stays running.
Tests
Full suite: 2447/2447 green, typecheck clean.
The 5-min STDERR_GRACE_MS constant could become an env override in a follow-up if real workloads hit it, but the default starts conservative.
🤖 Generated with Claude Code