fix(poll-loop): slash commands silently broken on warm containers #2181

Merged
gavrielc merged 2 commits into qwibitai:main from mnolet:fix/slash-commands-on-warm-containers
May 2, 2026

Conversation

Contributor

mnolet commented May 2, 2026

Type of Change

  • Fix - bug fix or security fix to source code

Description

What

Fix all slash commands silently failing when sent to a warm container.

Why

The follow-up poller filtered /clear out of every tick without acking the row, and pushed every other slash command through plain formatMessages() (XML wrapping). On a warm container the outer while(true) loop never regains control (the SDK stream is held open across turns to avoid SDK subprocess respawn), so:

  • /clear sat pending in messages_in forever (no response at all)
  • /compact, /cost, /context, /files, /remote-control arrived at the SDK as XML-wrapped user text and were never dispatched as commands

Both failure modes are invisible to host monitoring: rows are either left pending without a processing_ack claim or marked completed normally, and the heartbeat keeps firing inside the SDK event loop. The bot looks healthy.
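
To illustrate the second failure mode, here is a minimal sketch of the difference between XML wrapping and raw dispatch. The message shape, the `<message>` tag, and both function names are assumptions made for illustration; they are not the real formatMessages() or formatMessagesWithCommands() implementations.

```typescript
// Hypothetical message shape; the real messages_in row has more fields.
interface InboundMessage {
  sender: string;
  text: string;
}

function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Stand-in for plain XML wrapping: every message becomes user text.
function wrapAsXml(msg: InboundMessage): string {
  return `<message sender="${escapeXml(msg.sender)}">${escapeXml(msg.text)}</message>`;
}

// Stand-in for command-aware formatting: slash commands go through raw,
// so the receiver still sees a leading "/" and can treat it as a command.
function formatWithCommands(msg: InboundMessage): string {
  const trimmed = msg.text.trim();
  return trimmed.startsWith("/") ? trimmed : wrapAsXml(msg);
}
```

Under these assumptions, `wrapAsXml` turns `/cost` into text that starts with `<message`, so nothing downstream can recognise it as a command, while `formatWithCommands` passes `/cost` through unchanged.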

How it works

When the follow-up poller observes any slash command (admin or passthrough — categorizeMessage decides), end the active query so the current turn winds down cleanly and the outer loop wakes, re-fetches the same pending set, and runs them through the canonical path (/clear handler + formatMessagesWithCommands raw dispatch). Leave the rows untouched so the outer-loop fetch sees the same set the poller saw.

A small endedForCommand flag suppresses further poller ticks during the brief window between query.end() and the for-await on query.events exiting, so we don't re-enter the predicate every 500ms while the stream finishes closing.
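
The poller behaviour described above can be sketched roughly as follows. Only the endedForCommand name comes from this PR; the interfaces, the class, the return values, and the `isSlashCommand` stand-in for categorizeMessage are hypothetical and exist only to make the control flow concrete.

```typescript
// Hypothetical shapes used for illustration only.
interface PendingMessage {
  id: number;
  text: string;
}

interface ActiveQuery {
  end(): void;              // asks the SDK stream to wind down
  push(text: string): void; // feeds a follow-up message into the open stream
}

type TickResult = "idle" | "forwarded" | "ended" | "suppressed";

// Stand-in for categorizeMessage: any text starting with "/" is a command.
function isSlashCommand(msg: PendingMessage): boolean {
  return msg.text.trimStart().startsWith("/");
}

class FollowUpPoller {
  private endedForCommand = false;
  private query: ActiveQuery;
  private fetchPending: () => PendingMessage[];

  constructor(query: ActiveQuery, fetchPending: () => PendingMessage[]) {
    this.query = query;
    this.fetchPending = fetchPending;
  }

  // Called once per poll interval while the stream is open.
  tick(): TickResult {
    // After end() was requested, skip all work until the stream closes.
    if (this.endedForCommand) return "suppressed";

    const pending = this.fetchPending();
    if (pending.length === 0) return "idle";

    if (pending.some(isSlashCommand)) {
      // Leave the rows untouched: the outer loop re-fetches and handles them.
      this.endedForCommand = true;
      this.query.end();
      return "ended";
    }

    // Ordinary chat messages keep flowing into the warm stream.
    // (Acking the rows is omitted from this sketch.)
    for (const msg of pending) this.query.push(msg.text);
    return "forwarded";
  }
}
```

In this sketch a batch that mixes chat text with a slash command is not forwarded at all, which mirrors the stated goal that the outer-loop fetch sees the same set the poller saw.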

Cost: each slash command on a warm container forces close+reopen of the SDK stream — a few seconds of subprocess startup. The Anthropic prompt cache is server-side with a 5-min TTL keyed on prefix hash, so stream lifecycle does not affect cache lifetime; close+reopen within 5 min still gets cache hits.

Also corrects the warm-stream rationale comment on processQuery (lines 257-262), which implied that keeping the stream open preserved cache warmth. It does not.

How it was tested

Verified against a live install. Cache stays warm across stream close+reopen as predicted:

Turn 1 (warm session):
  Usage: in=6 out=245 cache_create=92 cache_read=22996
  Full cache hit (22996 tokens).

Turn 2 — /clear arrives:
  [poll-loop] Pending slash command — ending stream so outer loop can process
  [poll-loop] Clearing session (resetting continuation)
  Usage: in=6 out=95 cache_create=9393 cache_read=13600
  System prompt + tool defs (~13600 tokens) still hit cache;
  conversation history is gone (continuation reset) so the new
  turn writes fresh context.

Turn 3 — /cost arrives:
  [poll-loop] Pending slash command — ending stream so outer loop can process
  Usage: in=0 out=0 cache_create=0 cache_read=0 wall=0.0s api=0.0s
  /cost is a CLI built-in: dispatched locally by the SDK, no API call.
  Pre-fix this would have arrived as XML-wrapped user text and never
  dispatched — confirms the broader fix works.

Turn 4 (next chat after /cost):
  Usage: in=6 out=142 cache_create=328 cache_read=22993
  Full cache hit again (22993 tokens read, 328 written). Despite the
  /cost-induced stream close+reopen, the server-side prompt cache
  survived: the new sdkQuery() resumed the same continuation, the
  request prefix matched the cached entry.
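
For readers who want to compare the turns numerically, here is a small sketch that parses the Usage lines above and computes the share of prompt tokens served from cache. It assumes the four fields are disjoint counts (uncached input, cache writes, cache reads), which matches how the Anthropic API reports usage; the function and type names are made up for this example.

```typescript
interface Usage {
  input: number;       // uncached input tokens
  output: number;
  cacheCreate: number; // tokens written to the prompt cache
  cacheRead: number;   // tokens read from the prompt cache
}

// Parses a log line such as
// "Usage: in=6 out=245 cache_create=92 cache_read=22996".
function parseUsage(line: string): Usage | null {
  const m = /in=(\d+) out=(\d+) cache_create=(\d+) cache_read=(\d+)/.exec(line);
  if (!m) return null;
  return {
    input: Number(m[1]),
    output: Number(m[2]),
    cacheCreate: Number(m[3]),
    cacheRead: Number(m[4]),
  };
}

// Fraction of prompt tokens that were cache reads; 0 when no prompt was sent.
function cachedFraction(u: Usage): number {
  const prompt = u.input + u.cacheCreate + u.cacheRead;
  return prompt === 0 ? 0 : u.cacheRead / prompt;
}
```

Applied to the log above, Turn 1 and Turn 4 both come out above 0.98, Turn 2 is roughly 0.59 (only the system prompt and tool definitions hit), and Turn 3 is 0 because no API call was made.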

No automated regression test added. container/agent-runner/src/integration.test.ts is broken on main (16/18 fail) because getPendingMessages() calls openInboundDb() which opens a hardcoded /workspace/inbound.db path instead of honoring the initTestSessionDb() in-memory DB. A regression test for this fix would hit the same wall — worth a separate fix.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the PR: Fix (Bug fix) label on May 2, 2026
Contributor Author

mnolet commented May 2, 2026

The 25 failing tests are pre-existing on upstream/main — not introduced by this PR.

Reproduced locally:

$ git checkout upstream/main && cd container/agent-runner && bun test
 34 pass
 25 fail
 59 expect() calls

Same numbers as the CI run on this branch. This PR doesn't change the pass/fail count.

Root cause: openInboundDb() (src/db/connection.ts:44) opens the hardcoded /workspace/inbound.db path, ignoring the in-memory DB created by initTestSessionDb(). Every test that ends up calling getPendingMessages() — directly or via test fixtures (e.g., poll-loop.test.ts:29 uses it to load formatter test data) — hits a SQLITE_CANTOPEN because the file path doesn't exist in CI.

The 34 passing tests are the ones that test pure functions or use only getInboundDb() (the singleton, which is in-memory after initTestSessionDb).

Fix is straightforward (have openInboundDb() honor the singleton in test mode, or thread the DB instance through), but it's a separate concern from the slash-command bug — happy to open a follow-up PR for the test infra if useful.
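
One possible shape for that follow-up, as a rough sketch only: the opener consults a test override before touching the file path, and the file-opening function is passed in so the example does not depend on a specific SQLite binding. `SketchDb`, `setTestDbOverride`, and `openInboundDbSketch` are hypothetical names; the real openInboundDb() and initTestSessionDb() signatures are not shown in this PR.

```typescript
// Minimal stand-in for a database handle.
interface SketchDb {
  name: string;
}

// Set by test setup (the role initTestSessionDb() would play); null in production.
let testDbOverride: SketchDb | null = null;

function setTestDbOverride(db: SketchDb | null): void {
  testDbOverride = db;
}

// Returns the test DB when one is registered, otherwise opens the on-disk file.
function openInboundDbSketch(openFile: (path: string) => SketchDb): SketchDb {
  if (testDbOverride !== null) return testDbOverride;
  return openFile("/workspace/inbound.db");
}
```

With an override registered, callers such as getPendingMessages() would never reach the hardcoded path, so the missing file in CI would stop mattering.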

gavrielc merged commit 52051d4 into qwibitai:main on May 2, 2026
1 check failed
Collaborator

gavrielc commented May 2, 2026

@mnolet Thank you for the contribution!!
