fix(engine,web): suppress restart-recovery noise on Projects tab; retry empty hydration on SSE open (#3274) by ilblackdragon · Pull Request #3328 · nearai/ironclaw

ilblackdragon · 2026-05-07T06:25:17Z

Summary

Closes #3274 — three post-upgrade UI symptoms (0.26.0 → 0.27.0) with two distinct root causes.

Projects tab persistent "Thread failed: …" warning (Issue 3, root cause). ThreadManager::recover_project_threads force-fails every non-terminal thread on engine restart with reason \"engine restart before thread completion\". After upgrade, every running/created/waiting thread from the previous process gets state = Failed with updated_at = now, which then floods get_engine_projects_overview's state == Failed && updated_at >= now − 24h filter as kind=failure attention items. Recovered threads are now tagged with the metadata flag engine_restart_recovery: true (exported from the engine crate as ENGINE_RESTART_RECOVERY_METADATA_KEY), and the projects overview excludes them from both failures_24h and the "Needs attention" feed.
Chat / Missions tabs empty until manual refresh (Issues 1 & 2, defensive net). loadHistory, loadThreads, and loadMissions ended in .catch(() => {}) / .catch(function() {}) — the silent-failure anti-pattern called out in .claude/rules/error-handling.md. They now log via console.error and flag window._initialHydrationPending. A new runInitialHydrationRetry() re-runs only the flagged loaders, exactly once per initApp() (guarded by _hydrationRetryDone), wired into the SSE onopen handler. SSE accept implies the backend has stabilised — init_engine is awaited before channels.start_all() (src/agent/agent_loop.rs:744).

The fix preserves real failure surfacing: a thread that explicitly transitions to Failed for any non-recovery reason still appears in the attention feed.

Files

crates/ironclaw_engine/src/{lib.rs,runtime/manager.rs} — metadata constant, recovery tagging, regression-tighten existing test.
src/bridge/router.rs — extracted is_real_thread_failure predicate, applied to failures_24h count + attention loop, plus four new unit tests.
crates/ironclaw_gateway/static/js/{core/{history,init-auth,sse}.js,surfaces/projects.js} — replace silent catches, hydration retry plumbing.

Test plan

cargo fmt
cargo clippy --all --benches --tests --examples --all-features — zero warnings
cargo test --lib — 5623 passed
cargo test -p ironclaw_engine — 520 passed (incl. tightened recover_project_threads_marks_non_terminal_as_failed)
bash scripts/pre-commit-safety.sh — exit 0
Manual smoke after deploy: confirm Projects tab no longer flags restart-recovered threads while a real LLM-error failure still surfaces; confirm Chat/Missions self-recover on first load with a logged error in DevTools when initial fetch fails.

Out of scope (follow-ups if the retry doesn't fully resolve 1 & 2)

Deeper root-cause investigation of why loadThreads / loadMissions would error on the first post-upgrade call (e.g. a tenant whose user_id is not covered by migrate_legacy_user_ids). The new console.error makes this surfaceable.
A first-class engine_ready SSE event instead of relying on the implicit "SSE-open == engine ready" assumption.

🤖 Generated with Claude Code

…ry empty hydration on SSE open (#3274) Fixes three post-upgrade UI symptoms reported in #3274: - Projects tab no longer surfaces threads that were force-failed by `recover_project_threads` on engine restart. Those threads now carry the `engine_restart_recovery` metadata flag, and the projects overview filters them out of both `failures_24h` and the "Needs attention" feed so an upgrade does not cascade into phantom "Thread failed: …" warnings. - Chat / Missions tabs no longer render silently empty when the first hydration request races engine init. `.catch(() => {})` blocks now log via `console.error` and flag the loader; a one-shot retry runs on the first SSE `onopen` (SSE accept implies `init_engine` has finished, since it is awaited before `channels.start_all()`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-05-07T06:25:21Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copilot

Pull request overview

Addresses post-upgrade UI noise and first-load empty states by tagging restart-recovered threads in the engine, filtering those artifacts out of the Projects “needs attention” feed, and adding a one-time hydration retry on SSE connect after initial loader failures.

Changes:

Tag non-terminal threads force-failed during engine restart recovery with engine_restart_recovery metadata and re-export the metadata key from ironclaw_engine.
Filter restart-recovery failures out of get_engine_projects_overview failure counts and attention items; add unit tests for the predicate.
Replace silent promise catches in history/threads/missions loaders with console.error + a one-time SSE-onopen retry mechanism.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`crates/ironclaw_engine/src/runtime/manager.rs`	Tags restart-recovered threads via metadata; exports metadata key constants; tightens recovery test.
`crates/ironclaw_engine/src/lib.rs`	Re-exports recovery/checkpoint metadata key constants for downstream consumers.
`src/bridge/router.rs`	Adds `is_real_thread_failure` predicate and applies it to Projects overview failure surfacing; adds unit tests.
`crates/ironclaw_gateway/static/js/core/init-auth.js`	Adds hydration failure tracking + one-time retry hook invoked after SSE connects.
`crates/ironclaw_gateway/static/js/core/sse.js`	Invokes hydration retry from SSE `onopen`.
`crates/ironclaw_gateway/static/js/core/history.js`	Logs and flags initial hydration failures for chat history/threads instead of swallowing errors.
`crates/ironclaw_gateway/static/js/surfaces/projects.js`	Logs and flags initial hydration failures for missions instead of swallowing errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+/// Metadata key set on a thread that has an in-flight pending-approval
+/// gate. Persisted threads carrying this key skip restart-recovery so the
+/// gate survives a process restart.
+pub const PENDING_APPROVAL_METADATA_KEY: &str = "pending_approval";
+
+/// Metadata key set on a thread that has a serialized runtime checkpoint
+/// (CodeAct VM state, nudge counters, compaction count). Threads carrying
+/// this key are suspended on restart instead of failed.
+pub const RUNTIME_CHECKPOINT_METADATA_KEY: &str = "runtime_checkpoint";


serrrfirat · 2026-05-07T11:32:41Z

+function runInitialHydrationRetry() {
+  var pending = window._initialHydrationPending;
+  if (!pending || window._hydrationRetryDone) return;
+  window._hydrationRetryDone = true;


Medium Severity

The hydration retry can be consumed before any loader has failed.

runInitialHydrationRetry() marks _hydrationRetryDone = true and later clears _initialHydrationPending even when all pending flags are still false. In initApp(), connectSSE() is called before the initial loadThreads() call, so a fast SSE onopen can run this function first, consume the one retry, and clear the tracker. If the first loadThreads / loadHistory / loadMissions request then rejects, the catch block sees _initialHydrationPending === null and cannot flag itself for retry. That leaves the same blank initial UI state this change is trying to recover from.

Consider only marking the retry as done when at least one pending flag is true, and have loader catches trigger the retry immediately when SSE is already open (or track an explicit initialHydrationSseReady flag).

serrrfirat

Approved after paranoid review. One medium follow-up was left inline for the hydration retry race.

Copilot AI review requested due to automatic review settings May 7, 2026 06:25

github-actions Bot added size: L 200-499 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels May 7, 2026

Copilot started reviewing on behalf of ilblackdragon May 7, 2026 06:26 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

serrrfirat reviewed May 7, 2026

View reviewed changes

serrrfirat approved these changes May 7, 2026

View reviewed changes

serrrfirat added this pull request to the merge queue May 7, 2026

Merged via the queue into main with commit 80c20ed May 7, 2026
38 checks passed

serrrfirat deleted the fix/3274-projects-tab-failed-threads branch May 7, 2026 11:57

This was referenced May 7, 2026

chore: release #3373

Closed

chore: release #3376

Merged

ironclaw-ci Bot mentioned this pull request May 8, 2026

chore: release #3388

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine,web): suppress restart-recovery noise on Projects tab; retry empty hydration on SSE open (#3274)#3328

fix(engine,web): suppress restart-recovery noise on Projects tab; retry empty hydration on SSE open (#3274)#3328
serrrfirat merged 1 commit into
mainfrom
fix/3274-projects-tab-failed-threads

ilblackdragon commented May 7, 2026

Uh oh!

gemini-code-assist Bot commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

serrrfirat May 7, 2026

Uh oh!

serrrfirat left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ilblackdragon commented May 7, 2026

Summary

Files

Test plan

Out of scope (follow-ups if the retry doesn't fully resolve 1 & 2)

Uh oh!

gemini-code-assist Bot commented May 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

serrrfirat May 7, 2026

Choose a reason for hiding this comment

Uh oh!

serrrfirat left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants