fix: surface silent task failures instead of acking them as completed by kpscheffel · Pull Request #2167 · nanocoai/nanoclaw

kpscheffel · 2026-05-01T11:55:05Z

Type of Change

Fix - bug fix or security fix to source code

Description

What. Stops the agent-runner from acking scheduled tasks as completed when the SDK call actually failed silently — and adds a host-side scan that catches anything that bypasses the runner-side fix.

Why. When the Claude SDK terminates with an error subtype (error_max_turns, error_during_execution, …) or the stream closes without a terminal result event, the v2 poll-loop unconditionally called markCompleted and produced no user-visible signal. Combined with the host's syncProcessingAcks collapsing both completed and failed processing_ack rows into messages_in.status='completed', scheduled tasks could fail every day and look healthy — operators had no way to tell. Reproduced in real data: three consecutive morning-briefing runs marked complete with zero messages_out.

How it works.

container/agent-runner/src/providers/types.ts: the result ProviderEvent now carries the SDK's subtype. Three buckets: undefined = synthetic mid-turn result (e.g. compact_boundary, dispatch text), do not ack; error_* = terminal failure; anything else = terminal success.
container/agent-runner/src/providers/claude.ts: the yield site reads (message as { subtype?: string }).subtype and passes it through.
container/agent-runner/src/poll-loop.ts: processQuery tracks sawTerminalResult and lastErrorMessage. On an error subtype it calls markFailed(initialBatchIds) and writes a "⚠️ Task did not complete: <subtype>" chat message to the originating channel; it does the same when the stream ends without a terminal result. The outer catch path now calls markFailed instead of always calling markCompleted. markFailed becomes array-aware to mirror markCompleted. MockProvider is updated to yield subtype: 'success'. A new optional stopSignal: AbortSignal on PollLoopConfig lets tests halt the loop deterministically (production callers omit it).
src/db/session-db.ts: syncProcessingAcks propagates failed as failed instead of collapsing it to completed, with a guard so an already-completed row isn't downgraded by a stale ack.
src/host-sweep.ts: new detectAndNotifySilentTasks step. It scans kind='task' rows acked completed/failed in the last 15 min that have zero messages_out in the same window, DMs the session's originating messaging group via getDeliveryAdapter().deliver(...), and dedups per process, capping the dedup Set at 1000 entries.
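
The three subtype buckets described above can be sketched as a small classifier. This is only an illustration: the helper name classifyResult, the ResultKind type, and the exact shape of ResultEvent are assumptions, not code from this PR.

```typescript
// Hypothetical shape of the `result` ProviderEvent; only `subtype` matters here.
interface ResultEvent {
  type: 'result';
  text?: string;
  subtype?: string;
}

// Illustrative names for the three buckets (not identifiers from the PR).
type ResultKind = 'synthetic' | 'failure' | 'success';

function classifyResult(event: ResultEvent): ResultKind {
  // No subtype: a synthetic mid-turn result, which must not be acked.
  if (event.subtype === undefined) return 'synthetic';
  // Any error_* subtype is a terminal failure.
  if (event.subtype.startsWith('error_')) return 'failure';
  // Every other subtype is treated as terminal success.
  return 'success';
}
```

Keeping the check in one place means a new error_* subtype from the SDK falls into the failure bucket without further changes.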

How it was tested.

5 new container tests in container/agent-runner/src/poll-loop-failure.test.ts covering: error subtype path, stream-closed-without-terminal path, in-stream error captured in DM, empty success result legitimately ack'd, synthetic mid-turn followed by terminal success.
3 new host tests in src/db/session-db.test.ts covering syncProcessingAcks propagating failed as failed, propagating completed as completed, and refusing to downgrade an already-completed row.
Verified on a real install: the day before this fix, three consecutive morning-briefing tasks (task-1777349525294-capq6s, task-1777435932858-rqa22y, task-1777522278899-8mqkfc) were each acked completed with 0 messages_out. With this patch deployed, future failures will be marked failed and DM ⚠️ Task did not complete: … to the channel; the host-side scan will catch any path the runner-side fix missed.

…asks completed

Scheduled tasks that failed (rate limit, max-turns, network drop, etc.) were being acked as `completed` in processing_ack with zero messages_out, leaving operators thinking silent days were healthy.

Three causes, all fixed:

1. The Claude provider yielded `{ type: 'result', text }` for every SDK result subtype: `success`, `error_max_turns`, `error_during_execution` were treated identically. The poll-loop's idempotent `markCompleted` then ran on every terminal event regardless of outcome.
2. After `processQuery` returned (whether the SDK stream ended cleanly or because it errored), `runPollLoop` always called `markCompleted`. So even an `error` ProviderEvent followed by a silent stream close ended up flagged as a clean turn.
3. `markFailed` was per-id while `markCompleted` was array-based, which discouraged batched failure handling.

This commit:

* Adds `subtype` to the `result` ProviderEvent (types.ts).
  - undefined → synthetic mid-turn (e.g. compact_boundary), do NOT ack.
  - 'error_*' → terminal failure.
  - other → terminal success.
* Plumbs the SDK subtype through the Claude provider (claude.ts).
* Tracks `sawTerminalResult` and `lastErrorMessage` per turn in processQuery (poll-loop.ts). On error subtype: markFailed + write a ⚠️ Task did not complete DM to the originating channel. On stream close without terminal result: same.
* Updates the catch path in runPollLoop to markFailed (was always markCompleted) and continue to the next turn.
* Makes `markFailed` array-aware to mirror `markCompleted`.
* Updates MockProvider to yield `subtype: 'success'` so existing tests represent real terminal results, not synthetic mid-turn ones.
* Adds an optional `stopSignal: AbortSignal` to PollLoopConfig so tests can deterministically halt the loop after assertions.

New tests in poll-loop-failure.test.ts cover:

- error_during_execution subtype → markFailed + DM
- stream close without terminal result → markFailed + DM
- in-stream error event captured in the failure DM
- empty success result (legitimate "no chat reply needed") still acks
- synthetic mid-turn result followed by terminal success acks correctly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
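
The per-turn decision this commit describes can be sketched roughly as follows. To stay self-contained, the sketch runs over an in-memory array rather than the real async SDK stream, and returns an outcome instead of calling markCompleted/markFailed directly. The names decideTurnOutcome, TurnOutcome and NOTICE_PREFIX, the event shapes, and the wording of the "stream closed" notice are all assumptions made for illustration.

```typescript
// Hypothetical, simplified event shapes; the real ProviderEvent union may differ.
type ProviderEvent =
  | { type: 'result'; text?: string; subtype?: string }
  | { type: 'error'; message: string };

type TurnOutcome =
  | { status: 'completed' }
  | { status: 'failed'; notice: string };

const NOTICE_PREFIX = '⚠️ Task did not complete';

function decideTurnOutcome(events: ProviderEvent[]): TurnOutcome {
  let lastErrorMessage: string | undefined;

  for (const event of events) {
    if (event.type === 'error') {
      // Remember in-stream errors so the failure notice can include them.
      lastErrorMessage = event.message;
      continue;
    }
    const subtype = event.subtype;
    // Synthetic mid-turn result: neither ack nor fail, keep reading.
    if (subtype === undefined) continue;
    // Any event reaching this point is a terminal result.
    if (subtype.startsWith('error_')) {
      const detail = lastErrorMessage ? ` (${lastErrorMessage})` : '';
      return { status: 'failed', notice: `${NOTICE_PREFIX}: ${subtype}${detail}` };
    }
    // Terminal success, even with empty text ("no chat reply needed").
    return { status: 'completed' };
  }

  // The stream ended without a terminal result: treat the turn as failed.
  const reason = lastErrorMessage ?? 'stream closed without a terminal result';
  return { status: 'failed', notice: `${NOTICE_PREFIX}: ${reason}` };
}
```

In a real poll loop, the caller would map a failed outcome to markFailed on the batch ids plus a message to the originating channel, and a completed outcome to markCompleted.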

…lapse to 'completed'

The host sync was collapsing both 'completed' and 'failed' processing_ack statuses into messages_in.status='completed', erasing the runner's signal that a turn errored out. Combined with the agent-runner's old unconditional markCompleted, this is what made silent task failures look healthy in the central DB.

Now propagates each status as itself, with a guard so an already-completed row isn't downgraded to 'failed' on a stale ack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
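
The propagation rule with its downgrade guard can be expressed as a pure function. This is a sketch under assumptions: the function name nextInboundStatus and the 'pending'/'processing' status values are hypothetical, and the real syncProcessingAcks presumably applies the rule inside a database update rather than in memory.

```typescript
// Ack statuses named in the commit message.
type AckStatus = 'completed' | 'failed';

// Assumed set of messages_in statuses; only 'completed' and 'failed' are confirmed by the PR.
type InboundStatus = 'pending' | 'processing' | 'completed' | 'failed';

function nextInboundStatus(current: InboundStatus, ack: AckStatus): InboundStatus {
  // Guard: a stale 'failed' ack must not downgrade an already-completed row.
  if (current === 'completed' && ack === 'failed') return 'completed';
  // Otherwise each ack status propagates as itself.
  return ack;
}
```

The PR text does not say what happens when a 'completed' ack arrives for a row already marked 'failed'; the sketch simply lets the ack win in that case.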

Adds a sweep step that scans for kind='task' rows acked completed/failed in the last 15 min with zero messages_out in the same window. On detection: logs WARN and DMs the session's originating messaging group via the channel adapter, so an operator sees the failure even if the runner-side fix missed an SDK event shape.

Per-process Set<string> dedup (capped at 1000, reaped to currently-relevant ids each sweep) prevents re-DMing every minute.

Pairs with the agent-runner's terminal-result handling: that is the primary fix, and this scan catches anything that bypasses it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
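
One way to picture the capped, per-process dedup is the sketch below. The class name NotifiedTasks and its sweep method are hypothetical, and evicting the oldest entry when the cap is reached is an assumption; the commit message only states that the Set is capped at 1000 and reaped to currently relevant ids on each sweep.

```typescript
class NotifiedTasks {
  private ids = new Set<string>();

  // The cap defaults to the 1000 entries mentioned in the commit message.
  constructor(private readonly cap: number = 1000) {}

  // Takes the ids flagged by the current sweep and returns those not yet notified.
  sweep(currentIds: string[]): string[] {
    // Reap: keep only already-notified ids that are still inside the scan window.
    this.ids = new Set(currentIds.filter((id) => this.ids.has(id)));

    const fresh: string[] = [];
    for (const id of currentIds) {
      if (this.ids.has(id)) continue; // already notified in an earlier sweep
      if (this.ids.size >= this.cap) {
        // Assumed cap policy: drop the oldest entry (Sets iterate in insertion order).
        const oldest = this.ids.values().next().value;
        if (oldest !== undefined) this.ids.delete(oldest);
      }
      this.ids.add(id);
      fresh.push(id);
    }
    return fresh;
  }
}
```

Because the state lives in process memory, a host restart clears it, so a task still inside the 15-minute window could be notified once more after a restart.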

kpscheffel and others added 3 commits May 1, 2026 13:53

kpscheffel requested review from gabi-simons and gavrielc as code owners May 1, 2026 11:55

github-actions Bot added the PR: Fix Bug fix label May 1, 2026

kpscheffel mentioned this pull request May 1, 2026

fix(container): pin host.docker.internal to OneCLI's bridge IP in rootless Docker #2168

Open

1 task

bkutasi mentioned this pull request May 1, 2026

🦞 OpenClaw Ecosystem Digest 2026-05-02 bkutasi/big_model_radar#2

Open

This was referenced May 3, 2026

fix(host): single-instance lock to prevent duplicate hosts racing on shared state #2224

Open

fix(host): throw on missing channel adapter instead of silently dropping the message #2226

Open

kpscheffel wants to merge 3 commits into nanocoai:main from kpscheffel:pr/silent-task-failures
