Skip to content

fix(container): stderr-aware idle timeout (fix chronic Email intelligence alerts)#43

Merged
topcoder1 merged 1 commit into
mainfrom
claude/container-timeout-fix
Apr 28, 2026
Merged

fix(container): stderr-aware idle timeout (fix chronic Email intelligence alerts)#43
topcoder1 merged 1 commit into
mainfrom
claude/container-timeout-fix

Conversation

@topcoder1

Copy link
Copy Markdown
Owner

Summary

Triage of the recurring ⚠️ Email intelligence trigger failed. Check logs. alerts. 113 container timeouts in logs/nanoclaw.error.log, firing a few times an hour during agent-heavy periods.

Root cause

The container liveness check only resets on stdout OUTPUT_MARKER chunks (user-facing emissions). The Claude SDK streams tool-call debug logs to stderr while the agent is actively working — /recall calls, file reads, MCP probes, web research. The previous comment at line 1004 explicitly said "Don't reset on stderr — SDK writes debug logs continuously" — that's the bug-as-design.

If an email trigger required 30+ min of tool calls before the agent's first user-facing reply (deep research, complex triage), the 30-min idle timer fired and killed the container. Agent's work was lost; user got a generic "trigger failed" alert.

Fix

Stderr-aware idle detection in src/container-runner.ts:

  • Track lastStdoutAt + lastStderrAt. Stderr handler now updates lastStderrAt on every chunk.
  • Replace single setTimeout with setInterval liveness check (cadence min(timeoutMs/10, 60s)).
  • Kill conditions:
    1. Idle: stdout idle > timeoutMs AND stderr idle > 5 min (both quiet → genuinely hung), OR
    2. Hard cap: total runtime > max(timeoutMs * 2, 60min) — bounds runaway-but-noisy agents.

Stderr churn within the last 5 min is treated as "alive but doing internal work" — the agent stays running.

Tests

Test New What it asserts
Container alive past IDLE_TIMEOUT when stderr active regression for the bug
HARD_CAP_MS fires even with continuous stderr churn runaway containment
IDLE_TIMEOUT fires when both streams quiet real-hang detection still works
(existing) timeout with no output → error adjusted advance one extra interval-tick past grace

Full suite: 2447/2447 green, typecheck clean.

The 5-min STDERR_GRACE_MS constant could become an env override in a follow-up if real workloads hit it, but starting conservative.

🤖 Generated with Claude Code

…ts alive

Chronic 'Email intelligence trigger failed' alerts (113 timeouts in
the error log over the past few days) traced to a stdout-only liveness
check. The Claude SDK writes tool-call debug logs to stderr while the
agent is doing real work — deep research, multiple /recall calls,
file reads, MCP probes. Previous code only reset the 30-min idle
timer on stdout OUTPUT_MARKER chunks (user-facing emissions), so an
agent doing 30+ min of internal tool calls before its first reply got
killed despite being alive throughout.

Changes in src/container-runner.ts:

  - Track lastStdoutAt + lastStderrAt timestamps. Stderr handler now
    updates lastStderrAt on every chunk (was: ignored entirely).
  - Replace the single setTimeout with a setInterval-driven liveness
    check (cadence min(timeoutMs/10, 60_000)). Kills only when:
      • stdout idle > timeoutMs AND stderr idle > 5min, OR
      • total runtime > HARD_CAP_MS (max(timeoutMs * 2, 60min))
  - Hard cap bounds runaway agents whose stderr never goes quiet —
    can't keep a noisy-but-stuck container alive forever.
  - clearTimeout → clearInterval at the close + error sites.

3 new tests in container-runner.test.ts:
  • Container alive past IDLE_TIMEOUT when stderr is active (regression)
  • HARD_CAP_MS fires even with continuous stderr churn
  • IDLE_TIMEOUT fires when both streams quiet (real-hang case)
Plus loggerMock hoisted via vi.hoisted so tests can introspect the
'Container timeout, stopping gracefully' error log as the canonical
kill signal.

Existing 'timeout with no output' test updated to advance one extra
interval-tick past IDLE_TIMEOUT + 30s grace (the interval-based check
fires up to 60s late vs the old single-shot timer).

Full test suite green: 2447/2447.
@topcoder1 topcoder1 merged commit 8b6ccd4 into main Apr 28, 2026
@topcoder1 topcoder1 deleted the claude/container-timeout-fix branch April 28, 2026 16:51
topcoder1 added a commit that referenced this pull request Apr 28, 2026
…st (#44)

Two follow-ups to PR #43:

1. Add CONTAINER_STDERR_GRACE_MS env override in src/config.ts.
   Default 300000 (5min). container-runner reads from config instead
   of the local hardcoded constant. Lets ops bump the grace window
   if a workload regularly produces longer stderr-quiet stretches
   (e.g. an MCP tool that blocks for 6+ min on a slow upstream)
   without recompiling.

2. New test in container-runner.test.ts: 'recovers when stderr goes
   briefly silent then resumes (gap shorter than stdout idle)'. The
   reviewer-flagged edge case — stderr active, then 6min silence
   (past 5min grace), then stderr resumes for another 30min before
   stdout emits. Confirms the kill condition correctly requires BOTH
   stdout idle > timeoutMs AND stderr idle > grace; an isolated
   stderr gap within the stdout-idle window does NOT kill.

Full suite: 2448/2448 green (was 2447, +1 new test).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant