Skip to content

fix: surface silent task failures instead of acking them as completed#2167

Open
kpscheffel wants to merge 3 commits into
nanocoai:mainfrom
kpscheffel:pr/silent-task-failures
Open

fix: surface silent task failures instead of acking them as completed#2167
kpscheffel wants to merge 3 commits into
nanocoai:mainfrom
kpscheffel:pr/silent-task-failures

Conversation

@kpscheffel
Copy link
Copy Markdown

Type of Change

  • Fix - bug fix or security fix to source code

Description

What. Stops the agent-runner from acking scheduled tasks as completed when the SDK call actually failed silently — and adds a host-side scan that catches anything that bypasses the runner-side fix.

Why. When the Claude SDK terminates with an error subtype (error_max_turns, error_during_execution, …) or the stream closes without a terminal result event, the v2 poll-loop unconditionally called markCompleted and produced no user-visible signal. Combined with the host's syncProcessingAcks collapsing both completed and failed processing_ack rows into messages_in.status='completed', scheduled tasks could fail every day and look healthy — operators had no way to tell. Reproduced in real data: three consecutive morning-briefing runs marked complete with zero messages_out.

How it works.

  1. container/agent-runner/src/providers/types.tsresult ProviderEvent carries the SDK's subtype. Three buckets: undefined = synthetic mid-turn (e.g. compact_boundary, dispatch text, do not ack); error_* = terminal failure; otherwise = terminal success.
  2. container/agent-runner/src/providers/claude.ts — yield site reads (message as { subtype?: string }).subtype and passes it through.
  3. container/agent-runner/src/poll-loop.tsprocessQuery tracks sawTerminalResult and lastErrorMessage. Error subtype → markFailed(initialBatchIds) and write a ⚠️ Task did not complete: <subtype> chat to the originating channel. After the stream ends without a terminal result → same. The outer catch path now calls markFailed instead of always markCompleted. markFailed becomes array-aware to mirror markCompleted. MockProvider updated to yield subtype: 'success'. New stopSignal: AbortSignal on PollLoopConfig so tests can halt the loop deterministically (production callers omit it).
  4. src/db/session-db.tssyncProcessingAcks propagates failed as failed instead of collapsing to completed, with a guard so an already-completed row isn't downgraded by a stale ack.
  5. src/host-sweep.ts — new detectAndNotifySilentTasks step. Scans kind='task' rows acked completed/failed in the last 15 min with zero messages_out in the same window, DMs the session's originating messaging group via getDeliveryAdapter().deliver(...), dedups per-process. Caps the dedup Set at 1000 entries.

How it was tested.

  • 5 new container tests in container/agent-runner/src/poll-loop-failure.test.ts covering: error subtype path, stream-closed-without-terminal path, in-stream error captured in DM, empty success result legitimately ack'd, synthetic mid-turn followed by terminal success.
  • 3 new host tests in src/db/session-db.test.ts covering syncProcessingAcks propagating failed as failed, propagating completed as completed, and refusing to downgrade an already-completed row.
  • Verified on a real install — the day before this fix, three consecutive morning-briefing tasks (task-1777349525294-capq6s, task-1777435932858-rqa22y, task-1777522278899-8mqkfc) each acked completed with 0 messages_out. With this patch deployed, future failures will markFailed and DM ⚠️ Task did not complete: … to the channel; the host-side scan will catch any path the runner-side missed.

kpscheffel and others added 3 commits May 1, 2026 13:53
…asks completed

Scheduled tasks that failed (rate limit, max-turns, network drop, etc.)
were being acked as `completed` in processing_ack with zero messages_out,
leaving operators thinking silent days were healthy.

Three causes, all fixed:

1. The Claude provider yielded `{ type: 'result', text }` for every SDK
   result subtype — `success`, `error_max_turns`, `error_during_execution`
   were treated identically. The poll-loop's idempotent `markCompleted`
   then ran on every terminal event regardless of outcome.

2. After `processQuery` returned (whether the SDK stream ended cleanly or
   because it errored), `runPollLoop` always called `markCompleted`. So
   even an `error` ProviderEvent followed by a silent stream close ended
   up flagged as a clean turn.

3. `markFailed` was per-id while `markCompleted` was array-based, which
   discouraged batched failure handling.

This commit:
  * Adds `subtype` to the `result` ProviderEvent (types.ts).
    - undefined → synthetic mid-turn (e.g. compact_boundary), do NOT ack.
    - 'error_*' → terminal failure.
    - other     → terminal success.
  * Plumbs the SDK subtype through the Claude provider (claude.ts).
  * Tracks `sawTerminalResult` and `lastErrorMessage` per turn in
    processQuery (poll-loop.ts). On error subtype: markFailed +
    write a ⚠️ Task did not complete DM to the originating channel.
    On stream close without terminal result: same.
  * Updates the catch path in runPollLoop to markFailed (was always
    markCompleted) and continue to the next turn.
  * Makes `markFailed` array-aware to mirror `markCompleted`.
  * Updates MockProvider to yield `subtype: 'success'` so existing
    tests represent real terminal results, not synthetic mid-turn ones.
  * Adds an optional `stopSignal: AbortSignal` to PollLoopConfig so
    tests can deterministically halt the loop after assertions.

New tests in poll-loop-failure.test.ts cover:
  - error_during_execution subtype → markFailed + DM
  - stream close without terminal result → markFailed + DM
  - in-stream error event captured in the failure DM
  - empty success result (legitimate "no chat reply needed") still acks
  - synthetic mid-turn result followed by terminal success acks correctly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lapse to 'completed'

The host sync was collapsing both 'completed' and 'failed' processing_ack
statuses into messages_in.status='completed', erasing the runner's
signal that a turn errored out. Combined with the agent-runner's old
unconditional markCompleted, this is what made silent task failures
look healthy in the central DB.

Now propagates each status as itself, with a guard so an already-
completed row isn't downgraded to 'failed' on a stale ack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a sweep step that scans for kind='task' rows acked completed/failed
in the last 15 min with zero messages_out in the same window. On
detection: logs WARN and DMs the session's originating messaging group
via the channel adapter, so an operator sees the failure even if the
runner-side fix missed an SDK event shape.

Per-process Set<string> dedup (capped at 1000, reaped to currently-
relevant ids each sweep) prevents re-DMing every minute.

Pairs with the agent-runner's terminal-result handling — that's the
primary fix; this scan catches anything that bypasses it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PR: Fix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant