B-01: Interrupted-Run Detection and Observability #2173

@lazer-maker

Problem

NanoClaw can lose visibility into messages that have been selected for agent dispatch but never receive a durable bot response because the run was interrupted, crashed, or timed out. Today there is no explicit persisted marker for “this message batch left the normal polling path,” which makes diagnosis unreliable.

B-01 adds detection state and observability only. It must not change polling eligibility or perform recovery.

Scope

Add a processing_started_at column that records when source messages are dispatched to GroupQueue.

Use this field to detect stale interrupted processing rows at startup and log them for operators. Do not requeue, replay, unlock, or recover messages in B-01.

Acceptance Criteria

  • messages table gains processing_started_at TEXT via safe additive migration.

  • processing_started_at is set immediately before dispatching a message batch to GroupQueue, in one transaction for the batch.

  • processing_started_at is cleared when a bot response is stored for the same chat with a timestamp at or after the source message timestamp.

  • When clearing processing_started_at after a bot response, clear only rows matching:

    chat_jid = bot_response.chat_jid
    AND processing_started_at IS NOT NULL
    AND timestamp <= bot_response.timestamp

  • Startup stale scan logs rows where processing_started_at is set and more than STALE_THRESHOLD_MS in the past.

  • B-01 does not change getNewMessages().

  • B-01 does not requeue, replay, unlock, or recover messages.
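
The criteria above can be sketched roughly as follows. This is a minimal sketch, not the actual src/db.ts code: the MinimalDb interface, the function names, the id column, and the choice to store processing_started_at as a UTC ISO-8601 string are all assumptions made for illustration. Only messages, chat_jid, timestamp, processing_started_at, and STALE_THRESHOLD_MS come from this issue.

```typescript
// Hypothetical minimal database surface used only for this sketch.
interface Statement {
  run(...params: unknown[]): unknown;
  all(...params: unknown[]): unknown[];
}
interface MinimalDb {
  exec(sql: string): void;
  prepare(sql: string): Statement;
  // Returns a function that runs `fn` inside a single transaction.
  transaction<T>(fn: () => T): () => T;
}

// Additive migration: add the column only when it is missing, so reruns are no-ops.
function addProcessingStartedAt(db: MinimalDb): void {
  const columns = db.prepare("PRAGMA table_info(messages)").all() as { name: string }[];
  if (!columns.some((c) => c.name === "processing_started_at")) {
    db.exec("ALTER TABLE messages ADD COLUMN processing_started_at TEXT");
  }
}

// Mark every message in the batch with the same timestamp, in one transaction.
// The (id, chat_jid) lookup is an assumption about the messages schema.
function markBatchStarted(db: MinimalDb, chatJid: string, ids: string[], startedAt: string): void {
  const stmt = db.prepare(
    "UPDATE messages SET processing_started_at = ? WHERE id = ? AND chat_jid = ?"
  );
  db.transaction(() => {
    for (const id of ids) stmt.run(startedAt, id, chatJid);
  })();
}

// Clear markers only for rows at or before the bot response in the same chat.
function clearOnBotResponse(db: MinimalDb, chatJid: string, botTimestamp: string): void {
  db.prepare(
    `UPDATE messages
        SET processing_started_at = NULL
      WHERE chat_jid = ?
        AND processing_started_at IS NOT NULL
        AND timestamp <= ?`
  ).run(chatJid, botTimestamp);
}

// Read-only startup scan: returns stale rows for logging and changes nothing.
function findStaleRows(db: MinimalDb, nowMs: number, staleThresholdMs: number): unknown[] {
  // Assumes processing_started_at was written with Date.prototype.toISOString().
  const cutoff = new Date(nowMs - staleThresholdMs).toISOString();
  return db.prepare(
    `SELECT id, chat_jid, processing_started_at
       FROM messages
      WHERE processing_started_at IS NOT NULL
        AND processing_started_at < ?`
  ).all(cutoff);
}
```

The stale scan is deliberately a SELECT with no side effects, which keeps the sketch in line with the "detection and observability only" constraint: the caller would log the returned rows and do nothing else.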

Files Likely Touched

  • src/db.ts
  • src/index.ts
  • src/db.test.ts
  • relevant index/router tests

Test Plan

  • Migration test confirms processing_started_at TEXT is added safely to existing messages tables.
  • Batch dispatch test confirms all messages in a dispatched batch receive the same processing_started_at in one transaction.
  • Bot-response test confirms processing_started_at is cleared only for rows in the same chat with processing_started_at IS NOT NULL and timestamp <= bot_response.timestamp.
  • Regression test confirms future messages in the same chat are not cleared by an older bot response.
  • Startup stale scan test confirms rows older than STALE_THRESHOLD_MS are logged.
  • Regression test confirms getNewMessages() query behavior is unchanged in B-01.
  • Regression test confirms no requeue/replay/unlock/recovery path is invoked.
  • Run full test suite.
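
One way to pin down the expected outcomes for the bot-response and stale-scan cases above is a pure in-memory model of the two rules, which real database tests can then be checked against. The MessageRow shape, the id field, and both function names below are hypothetical. The sketch also assumes all timestamps are UTC ISO-8601 strings, so that string comparison matches time order.

```typescript
// Hypothetical row shape; only chat_jid, timestamp, and processing_started_at come from the issue.
interface MessageRow {
  id: string;
  chat_jid: string;
  timestamp: string; // assumed UTC ISO-8601
  processing_started_at: string | null;
}

// Model of the clearing rule: same chat, marker set, message at or before the bot response.
function clearForBotResponse(
  rows: MessageRow[],
  bot: { chat_jid: string; timestamp: string }
): MessageRow[] {
  return rows.map((row) =>
    row.chat_jid === bot.chat_jid &&
    row.processing_started_at !== null &&
    row.timestamp <= bot.timestamp
      ? { ...row, processing_started_at: null }
      : row
  );
}

// Model of the startup scan: rows whose marker is more than thresholdMs in the past.
function findStale(rows: MessageRow[], nowMs: number, thresholdMs: number): MessageRow[] {
  return rows.filter(
    (row) =>
      row.processing_started_at !== null &&
      nowMs - Date.parse(row.processing_started_at) > thresholdMs
  );
}
```

With a fixture containing an older message and a newer message in one chat plus a message in a second chat, a bot response timestamped between the two should clear only the older message. The newer message and the other chat's message keep their markers and remain visible to the stale scan.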

Risks

  • Incorrect clearing logic could hide still-interrupted messages.
  • Logging stale rows without recovery may surface stuck state but leave it unresolved until B-02.
  • Batch marking must be transactional to avoid partial state if dispatch setup fails.
  • Timestamp comparison must be consistent with stored message timestamps.
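
To illustrate the timestamp-consistency risk: because the column is TEXT, a comparison such as timestamp <= bot_response.timestamp is a string comparison, which matches chronological order only when both sides use the same fixed-width format. The sketch below assumes ISO-8601 strings, which this issue does not confirm, and the helper names are hypothetical.

```typescript
// Normalize any parseable timestamp to the fixed-width UTC form produced by toISOString().
function toUtcIso(timestamp: string): string {
  const ms = Date.parse(timestamp);
  if (Number.isNaN(ms)) throw new Error(`Unparseable timestamp: ${timestamp}`);
  return new Date(ms).toISOString();
}

// Compare two timestamps by instant rather than by raw text.
function isAtOrBefore(a: string, b: string): boolean {
  return toUtcIso(a) <= toUtcIso(b);
}

// Same calendar date in the text, but the offset form is actually one hour earlier in UTC.
const withOffset = "2024-01-10T01:00:00.000+02:00"; // 2024-01-09T23:00:00.000Z
const utc = "2024-01-10T00:00:00.000Z";

// Raw text comparison gets the order wrong: evaluates to false.
const rawStringOrder = withOffset <= utc;

// Normalized comparison reflects the real order: evaluates to true.
const normalizedOrder = isAtOrBefore(withOffset, utc);
```

If stored message timestamps and bot-response timestamps are both already written in one UTC format, the plain SQL comparison is sufficient. Normalization matters only if formats can differ between writers.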

Dependencies

  • Existing messages table migration path.
  • Existing GroupQueue dispatch flow.
  • Existing bot-response storage flow.
  • B-02 recovery will build on this persisted detection marker.
