# B-01: Interrupted-Run Detection and Observability #2173

## Problem

NanoClaw can lose visibility into messages that have been selected for agent dispatch but never receive a durable bot response because the run was interrupted, crashed, or timed out. Today there is no explicit persisted marker for “this message batch left the normal polling path,” which makes diagnosis unreliable.

B-01 adds detection state and observability only. It must not change polling eligibility or perform recovery.

## Scope

Add a `processing_started_at` column that records when source messages are dispatched to `GroupQueue`.

Use this field to detect stale interrupted processing rows at startup and log them for operators. Do not requeue, replay, unlock, or recover messages in B-01.
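The startup scan described above could be sketched as a pure filter plus a logging wrapper. This is a minimal sketch, not the NanoClaw implementation: the `MessageRow` shape, the helper names `findStaleRows` and `logStaleRows`, the 10-minute threshold value, and the assumption that `processing_started_at` holds an ISO-8601 string are all illustrative. It only reads and logs, and it never modifies rows.

```typescript
// Hypothetical row shape; only processing_started_at is defined by this issue.
interface MessageRow {
  id: string;
  chat_jid: string;
  timestamp: string; // assumed ISO-8601 UTC text
  processing_started_at: string | null;
}

// Placeholder value: the issue names STALE_THRESHOLD_MS but does not specify it.
const STALE_THRESHOLD_MS = 10 * 60 * 1000;

// Returns rows whose marker is set and older than the threshold.
function findStaleRows(
  rows: MessageRow[],
  nowMs: number,
  thresholdMs: number = STALE_THRESHOLD_MS,
): MessageRow[] {
  return rows.filter((row) => {
    if (row.processing_started_at === null) return false;
    const startedMs = Date.parse(row.processing_started_at);
    // Unparseable markers are skipped here; a real scan might log them separately.
    if (isNaN(startedMs)) return false;
    return nowMs - startedMs > thresholdMs;
  });
}

// Logs each stale row and returns how many were found. No recovery is attempted.
function logStaleRows(
  rows: MessageRow[],
  nowMs: number,
  log: (line: string) => void = console.warn,
): number {
  const stale = findStaleRows(rows, nowMs);
  for (const row of stale) {
    log(
      `stale processing marker: id=${row.id} chat=${row.chat_jid} ` +
        `started=${row.processing_started_at}`,
    );
  }
  return stale.length;
}
```

Keeping the filter separate from the logger makes the threshold logic testable without a database or a real clock.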

## Acceptance Criteria

- `messages` table gains `processing_started_at TEXT` via safe additive migration.
- `processing_started_at` is set immediately before dispatching a message batch to `GroupQueue`, in one transaction for the batch.
- `processing_started_at` is cleared when a bot response is stored for the same chat with a timestamp at or after the source message's timestamp.
- When clearing `processing_started_at` after a bot response, clear only rows matching:

  ```sql
  chat_jid = bot_response.chat_jid
  AND processing_started_at IS NOT NULL
  AND timestamp <= bot_response.timestamp
  ```

- Startup stale scan logs rows where `processing_started_at` is older than `STALE_THRESHOLD_MS`.
- B-01 does not change `getNewMessages()`.
- B-01 does not requeue, replay, unlock, or recover messages.
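The "safe additive migration" criterion above could be met with an idempotent column check before `ALTER TABLE`. The sketch below is an assumption-laden illustration: `MigrationDb` is a hypothetical minimal interface introduced so the example does not depend on any particular SQLite driver, and `addProcessingStartedAt` is an invented helper name. In SQLite, `ALTER TABLE ... ADD COLUMN` leaves existing rows with `NULL` in the new column, which matches the "not currently processing" state.

```typescript
// Hypothetical minimal database surface; not the API of any specific driver.
interface MigrationDb {
  // Could be backed by `PRAGMA table_info(messages)` in a real adapter.
  columnNames(table: string): string[];
  exec(sql: string): void;
}

// Adds the column only if it is missing, so running the migration twice is harmless.
// Returns true when the column was added, false when it already existed.
function addProcessingStartedAt(db: MigrationDb): boolean {
  const exists = db
    .columnNames("messages")
    .some((name) => name === "processing_started_at");
  if (exists) return false;
  db.exec("ALTER TABLE messages ADD COLUMN processing_started_at TEXT");
  return true;
}
```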

## Files Likely Touched

- `src/db.ts`
- `src/index.ts`
- `src/db.test.ts`
- relevant index/router tests

## Test Plan

- Migration test confirms `processing_started_at TEXT` is added safely to existing `messages` tables.
- Batch dispatch test confirms all messages in a dispatched batch receive the same `processing_started_at` in one transaction.
- Bot-response test confirms `processing_started_at` is cleared only for rows in the same chat with `processing_started_at IS NOT NULL` and `timestamp <= bot_response.timestamp`.
- Regression test confirms future messages in the same chat are not cleared by an older bot response.
- Startup stale scan test confirms rows older than `STALE_THRESHOLD_MS` are logged.
- Regression test confirms `getNewMessages()` query behavior is unchanged in B-01.
- Regression test confirms no requeue/replay/unlock/recovery path is invoked.
- Run full test suite.
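The batch-marking, bot-response, and future-message cases in this plan can be expressed against a small in-memory model of the marking and clearing rules. This is a sketch for reasoning about expected outcomes, not the `src/db.ts` code: `TrackedMessage`, `BotResponse`, `markBatch`, and `clearAfterBotResponse` are hypothetical names, and timestamps are assumed to be same-format strings so that `<=` matches the SQL comparison in the acceptance criteria.

```typescript
// Hypothetical in-memory stand-ins for rows in the messages table.
interface TrackedMessage {
  id: string;
  chat_jid: string;
  timestamp: string; // assumed to share one sortable text format with bot responses
  processing_started_at: string | null;
}

interface BotResponse {
  chat_jid: string;
  timestamp: string;
}

// Stamps every message in the batch with the same marker value.
function markBatch(
  rows: TrackedMessage[],
  batchIds: string[],
  startedAt: string,
): TrackedMessage[] {
  return rows.map((row) =>
    batchIds.indexOf(row.id) !== -1
      ? { ...row, processing_started_at: startedAt }
      : row,
  );
}

// Mirrors the clearing predicate: same chat, marker set, timestamp <= response.
function clearAfterBotResponse(
  rows: TrackedMessage[],
  bot: BotResponse,
): TrackedMessage[] {
  return rows.map((row) =>
    row.chat_jid === bot.chat_jid &&
    row.processing_started_at !== null &&
    row.timestamp <= bot.timestamp
      ? { ...row, processing_started_at: null }
      : row,
  );
}
```

With this model, a message timestamped after the bot response keeps its marker, which is the behavior the future-message regression test is meant to pin down.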

## Risks

- Incorrect clearing logic could hide still-interrupted messages.
- Logging stale rows without recovery may surface stuck state but leave it unresolved until B-02.
- Batch marking must be transactional to avoid partial state if dispatch setup fails.
- Timestamp comparison must be consistent with stored message timestamps.
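The timestamp-consistency risk can be made concrete. If timestamps are stored as text and compared as strings (an assumption here, suggested by the `TEXT` column and the `<=` predicate), ordering is only chronological when every value uses the same fixed-width UTC format. The sketch below uses a hypothetical helper, `toCanonicalTimestamp`, to normalize values to the form produced by `Date.prototype.toISOString()` and shows one case where raw string comparison gives the wrong answer.

```typescript
// Normalizes any parseable timestamp to fixed-width UTC text
// (for example "2024-01-01T10:00:00.000Z").
function toCanonicalTimestamp(input: string | number | Date): string {
  const date = input instanceof Date ? input : new Date(input);
  if (isNaN(date.getTime())) {
    throw new RangeError(`invalid timestamp: ${String(input)}`);
  }
  return date.toISOString();
}

// 12:00 at +02:00 is 10:00 UTC, so the message is actually earlier than the response.
const messageTs = "2024-01-01T12:00:00+02:00";
const responseTs = "2024-01-01T11:00:00.000Z";

// Raw string comparison sees "T12" > "T11" and wrongly treats the message as later.
const rawSaysEarlier = messageTs <= responseTs;

// After normalization, string order agrees with chronological order.
const canonicalSaysEarlier =
  toCanonicalTimestamp(messageTs) <= toCanonicalTimestamp(responseTs);
```

Here `rawSaysEarlier` is `false` while `canonicalSaysEarlier` is `true`; with mixed formats, the clearing predicate would leave this row marked even though the bot response came after it.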

## Dependencies

- Existing `messages` table migration path.
- Existing `GroupQueue` dispatch flow.
- Existing bot-response storage flow.
- B-02 recovery will build on this persisted detection marker.
