fix(poll-loop): slash commands silently broken on warm containers #2181
The follow-up poller filtered /clear out of every tick without acking
the row, and pushed every other slash command through plain
formatMessages() (XML wrapping). On a warm container the outer
while(true) loop never regains control, so:
- /clear sat pending in messages_in forever (no response at all)
- /compact, /cost, /context, /files, /remote-control arrived at the
SDK as XML-wrapped user text and were never dispatched as commands
Both modes are invisible to host monitoring: rows are either left
pending without a processing_ack claim, or marked completed normally;
heartbeat keeps firing inside the SDK event loop.
When the follow-up poller observes any slash command (admin or
passthrough — categorizeMessage decides), end the active query so the
current turn winds down cleanly and the outer loop wakes, re-fetches
the same pending set, and runs them through the canonical path
(/clear handler + formatMessagesWithCommands raw dispatch). Leave the
rows untouched so the outer-loop fetch sees the same set the poller
saw.
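
A minimal sketch of the difference between the two paths, for readers who don't have the runner source open. All names here (`PendingMessage`, `categorizeSketch`, `wrapAsXml`, `formatForSdkSketch`) are hypothetical stand-ins, not the real `categorizeMessage` / `formatMessages` / `formatMessagesWithCommands` signatures, and the XML tag shape is an assumption:

```typescript
// Hypothetical row shape; the real messages_in schema is not shown in this PR.
interface PendingMessage {
  id: number;
  text: string;
}

type Kind = "clear" | "command" | "chat";

// Stand-in for categorizeMessage: "/clear" is handled locally, any other
// "/word" token at the start of the text counts as a slash command.
function categorizeSketch(msg: PendingMessage): Kind {
  const text = msg.text.trim();
  if (text === "/clear") return "clear";
  return /^\/[a-z][a-z-]*(\s|$)/.test(text) ? "command" : "chat";
}

// Stand-in for the XML wrapping applied to ordinary chat messages
// (assumed tag shape, for illustration only).
function wrapAsXml(msg: PendingMessage): string {
  return `<message id="${msg.id}">${msg.text}</message>`;
}

// Stand-in for the canonical outer-loop path: /clear resets the session,
// other slash commands go through raw, chat text is XML-wrapped.
function formatForSdkSketch(msgs: PendingMessage[]): {
  clearSession: boolean;
  prompts: string[];
} {
  let clearSession = false;
  const prompts: string[] = [];
  for (const msg of msgs) {
    const kind = categorizeSketch(msg);
    if (kind === "clear") clearSession = true;
    else if (kind === "command") prompts.push(msg.text.trim());
    else prompts.push(wrapAsXml(msg));
  }
  return { clearSession, prompts };
}
```

The pre-fix poller behaved as if every non-`/clear` row took the `wrapAsXml` branch, which is why the SDK saw the command as ordinary user text.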
Cost: each slash command on a warm container forces close+reopen of
the SDK stream — a few seconds of subprocess startup. The Anthropic
prompt cache is server-side with a 5-min TTL keyed on prefix hash, so
stream lifecycle does not affect cache lifetime; close+reopen within
5 min still gets cache hits.
Also corrects the warm-stream rationale comment on processQuery, which
implied keeping the stream open preserved cache warmth — it doesn't.
Testing evidence — cache stays warm across stream close+reopen:
Turn 1 (warm session):
Usage: in=6 out=245 cache_create=92 cache_read=22996
Full cache hit (22996 tokens).
Turn 2 — /clear arrives:
Pending slash command — ending stream so outer loop can process
Clearing session (resetting continuation)
Usage: in=6 out=95 cache_create=9393 cache_read=13600
System prompt + tool defs (~13600 tokens) still hit cache;
conversation history is gone (continuation reset) so the new turn
writes fresh context.
Turn 3 — /cost arrives:
Pending slash command — ending stream so outer loop can process
Usage: in=0 out=0 cache_create=0 cache_read=0 wall=0.0s api=0.0s
/cost is a CLI built-in: dispatched locally by the SDK, no API
call. Pre-fix this would have arrived as XML-wrapped user text
and never dispatched — confirms the broader fix works.
Turn 4 (next chat after /cost):
Usage: in=6 out=142 cache_create=328 cache_read=22993
Full cache hit again (22993 tokens read, 328 written). Despite the
/cost-induced stream close+reopen, the server-side prompt cache
survived: the new sdkQuery() resumed the same continuation, the
request prefix matched the cached entry.
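
To compare turns without eyeballing the logs, the usage lines above can be reduced to a cache-read share. This is a sketch only: `parseUsage` and `cacheReadShare` are hypothetical helpers, and it assumes the `in`, `cache_create` and `cache_read` fields are disjoint counters that together make up the prompt tokens for the turn.

```typescript
interface Usage {
  input: number;
  output: number;
  cacheCreate: number;
  cacheRead: number;
}

// Parses a log line such as
// "Usage: in=6 out=245 cache_create=92 cache_read=22996".
function parseUsage(line: string): Usage {
  const field = (name: string): number => {
    const m = line.match(new RegExp(`\\b${name}=(\\d+)`));
    if (!m) throw new Error(`missing ${name} in: ${line}`);
    return Number(m[1]);
  };
  return {
    input: field("in"),
    output: field("out"),
    cacheCreate: field("cache_create"),
    cacheRead: field("cache_read"),
  };
}

// Fraction of prompt tokens served from the cache (0 when no API call was made).
function cacheReadShare(u: Usage): number {
  const total = u.input + u.cacheCreate + u.cacheRead;
  return total === 0 ? 0 : u.cacheRead / total;
}
```

Under that assumption, Turn 1 and Turn 4 both come out above 98% cache reads, Turn 2 (after `/clear`) at roughly 59%, and Turn 3 (`/cost`, no API call) at 0.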
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 25 failing tests are pre-existing on Reproduced locally: Same numbers as the CI run on this branch. This PR doesn't change the pass/fail count. Root cause: The 34 passing tests are the ones that test pure functions or use only Fix is straightforward (have
@mnolet Thank you for the contribution!!
Type of Change
Description
What
Fix all slash commands silently failing when sent to a warm container.
Why
The follow-up poller filtered `/clear` out of every tick without acking the row, and pushed every other slash command through plain `formatMessages()` (XML wrapping). On a warm container the outer `while(true)` loop never regains control (the SDK stream is held open across turns to avoid SDK subprocess respawn), so:

- `/clear` sat pending in `messages_in` forever (no response at all)
- `/compact`, `/cost`, `/context`, `/files`, `/remote-control` arrived at the SDK as XML-wrapped user text and were never dispatched as commands

Both modes are invisible to host monitoring: rows are either left pending without a `processing_ack` claim, or marked completed normally; heartbeat keeps firing inside the SDK event loop. The bot looks healthy.
How it works
When the follow-up poller observes any slash command (admin or passthrough; `categorizeMessage` decides), end the active query so the current turn winds down cleanly and the outer loop wakes, re-fetches the same pending set, and runs them through the canonical path (`/clear` handler + `formatMessagesWithCommands` raw dispatch). Leave the rows untouched so the outer-loop fetch sees the same set the poller saw.

A small `endedForCommand` flag suppresses further poller ticks during the brief window between `query.end()` and the for-await on `query.events` exiting, so we don't re-enter the predicate every 500ms while the stream finishes closing.

Cost: each slash command on a warm container forces close+reopen of the SDK stream, a few seconds of subprocess startup. The Anthropic prompt cache is server-side with a 5-min TTL keyed on prefix hash, so stream lifecycle does not affect cache lifetime; close+reopen within 5 min still gets cache hits.
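
The poller side of this can be sketched roughly as follows. This is an illustration under assumptions, not the actual diff: `ActiveQuery`, `PendingRow`, `isSlashCommand`, `makeFollowUpTick`, `fetchPending` and `markCompleted` are hypothetical names, and forwarding plain follow-ups with `push()` is an assumed stand-in for whatever the runner does with non-command rows.

```typescript
// Hypothetical minimal interfaces; the real SDK query object and DB helpers
// are not shown in this PR.
interface ActiveQuery {
  end(): void;
  push(text: string): void;
}

interface PendingRow {
  id: number;
  text: string;
}

function isSlashCommand(text: string): boolean {
  return /^\/[a-z][a-z-]*(\s|$)/.test(text.trim());
}

// Returns the function the poller would invoke on each tick.
function makeFollowUpTick(
  query: ActiveQuery,
  fetchPending: () => PendingRow[],
  markCompleted: (ids: number[]) => void,
): () => void {
  let endedForCommand = false;
  return () => {
    // The stream is already closing: skip until the outer loop takes over.
    if (endedForCommand) return;
    const rows = fetchPending();
    if (rows.length === 0) return;
    if (rows.some((r) => isSlashCommand(r.text))) {
      // Leave every row untouched so the outer loop re-fetches the same set.
      endedForCommand = true;
      query.end();
      return;
    }
    // Assumed behavior for plain follow-ups: forward them into the open
    // stream and mark the rows completed.
    for (const r of rows) query.push(r.text);
    markCompleted(rows.map((r) => r.id));
  };
}
```

In this sketch, once `query.end()` has been called, later ticks return before touching the DB, which is the role the PR gives to `endedForCommand`.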
Also corrects the warm-stream rationale comment on `processQuery` (lines 257-262), which implied keeping the stream open preserved cache warmth; it doesn't.
How it was tested
Verified against a live install. Cache stays warm across stream close+reopen as predicted; the per-turn usage numbers are in the commit message above.
No automated regression test added. `container/agent-runner/src/integration.test.ts` is broken on `main` (16/18 fail) because `getPendingMessages()` calls `openInboundDb()`, which opens a hardcoded `/workspace/inbound.db` path instead of honoring the `initTestSessionDb()` in-memory DB. A regression test for this fix would hit the same wall; worth a separate fix.

🤖 Generated with Claude Code