Skip to content

fix(engine,web): suppress restart-recovery noise on Projects tab; retry empty hydration on SSE open (#3274)#3328

Merged
serrrfirat merged 1 commit into
mainfrom
fix/3274-projects-tab-failed-threads
May 7, 2026
Merged

fix(engine,web): suppress restart-recovery noise on Projects tab; retry empty hydration on SSE open (#3274)#3328
serrrfirat merged 1 commit into
mainfrom
fix/3274-projects-tab-failed-threads

Conversation

@ilblackdragon
Copy link
Copy Markdown
Member

Summary

Closes #3274 — three post-upgrade UI symptoms (0.26.0 → 0.27.0) with two distinct root causes.

  • Projects tab persistent "Thread failed: …" warning (Issue 3, root cause). ThreadManager::recover_project_threads force-fails every non-terminal thread on engine restart with reason \"engine restart before thread completion\". After upgrade, every running/created/waiting thread from the previous process gets state = Failed with updated_at = now, which then floods get_engine_projects_overview's state == Failed && updated_at >= now − 24h filter as kind=failure attention items. Recovered threads are now tagged with the metadata flag engine_restart_recovery: true (exported from the engine crate as ENGINE_RESTART_RECOVERY_METADATA_KEY), and the projects overview excludes them from both failures_24h and the "Needs attention" feed.
  • Chat / Missions tabs empty until manual refresh (Issues 1 & 2, defensive net). loadHistory, loadThreads, and loadMissions ended in .catch(() => {}) / .catch(function() {}) — the silent-failure anti-pattern called out in .claude/rules/error-handling.md. They now log via console.error and flag window._initialHydrationPending. A new runInitialHydrationRetry() re-runs only the flagged loaders, exactly once per initApp() (guarded by _hydrationRetryDone), wired into the SSE onopen handler. SSE accept implies the backend has stabilised — init_engine is awaited before channels.start_all() (src/agent/agent_loop.rs:744).

The fix preserves real failure surfacing: a thread that explicitly transitions to Failed for any non-recovery reason still appears in the attention feed.

Files

  • crates/ironclaw_engine/src/{lib.rs,runtime/manager.rs} — metadata constant, recovery tagging, regression-tighten existing test.
  • src/bridge/router.rs — extracted is_real_thread_failure predicate, applied to failures_24h count + attention loop, plus four new unit tests.
  • crates/ironclaw_gateway/static/js/{core/{history,init-auth,sse}.js,surfaces/projects.js} — replace silent catches, hydration retry plumbing.

Test plan

  • cargo fmt
  • cargo clippy --all --benches --tests --examples --all-features — zero warnings
  • cargo test --lib — 5623 passed
  • cargo test -p ironclaw_engine — 520 passed (incl. tightened recover_project_threads_marks_non_terminal_as_failed)
  • bash scripts/pre-commit-safety.sh — exit 0
  • Manual smoke after deploy: confirm Projects tab no longer flags restart-recovered threads while a real LLM-error failure still surfaces; confirm Chat/Missions self-recover on first load with a logged error in DevTools when initial fetch fails.

Out of scope (follow-ups if the retry doesn't fully resolve 1 & 2)

  • Deeper root-cause investigation of why loadThreads / loadMissions would error on the first post-upgrade call (e.g. a tenant whose user_id is not covered by migrate_legacy_user_ids). The new console.error makes this surfaceable.
  • A first-class engine_ready SSE event instead of relying on the implicit "SSE-open == engine ready" assumption.

🤖 Generated with Claude Code

…ry empty hydration on SSE open (#3274)

Fixes three post-upgrade UI symptoms reported in #3274:

- Projects tab no longer surfaces threads that were force-failed by
  `recover_project_threads` on engine restart. Those threads now carry
  the `engine_restart_recovery` metadata flag, and the projects
  overview filters them out of both `failures_24h` and the
  "Needs attention" feed so an upgrade does not cascade into phantom
  "Thread failed: …" warnings.
- Chat / Missions tabs no longer render silently empty when the first
  hydration request races engine init. `.catch(() => {})` blocks now
  log via `console.error` and flag the loader; a one-shot retry runs
  on the first SSE `onopen` (SSE accept implies `init_engine` has
  finished, since it is awaited before `channels.start_all()`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 7, 2026 06:25
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added size: L 200-499 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels May 7, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses post-upgrade UI noise and first-load empty states by tagging restart-recovered threads in the engine, filtering those artifacts out of the Projects “needs attention” feed, and adding a one-time hydration retry on SSE connect after initial loader failures.

Changes:

  • Tag non-terminal threads force-failed during engine restart recovery with engine_restart_recovery metadata and re-export the metadata key from ironclaw_engine.
  • Filter restart-recovery failures out of get_engine_projects_overview failure counts and attention items; add unit tests for the predicate.
  • Replace silent promise catches in history/threads/missions loaders with console.error + a one-time SSE-onopen retry mechanism.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
crates/ironclaw_engine/src/runtime/manager.rs Tags restart-recovered threads via metadata; exports metadata key constants; tightens recovery test.
crates/ironclaw_engine/src/lib.rs Re-exports recovery/checkpoint metadata key constants for downstream consumers.
src/bridge/router.rs Adds is_real_thread_failure predicate and applies it to Projects overview failure surfacing; adds unit tests.
crates/ironclaw_gateway/static/js/core/init-auth.js Adds hydration failure tracking + one-time retry hook invoked after SSE connects.
crates/ironclaw_gateway/static/js/core/sse.js Invokes hydration retry from SSE onopen.
crates/ironclaw_gateway/static/js/core/history.js Logs and flags initial hydration failures for chat history/threads instead of swallowing errors.
crates/ironclaw_gateway/static/js/surfaces/projects.js Logs and flags initial hydration failures for missions instead of swallowing errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +719 to +727
/// Metadata key set on a thread that has an in-flight pending-approval
/// gate. Persisted threads carrying this key skip restart-recovery so the
/// gate survives a process restart.
pub const PENDING_APPROVAL_METADATA_KEY: &str = "pending_approval";

/// Metadata key set on a thread that has a serialized runtime checkpoint
/// (CodeAct VM state, nudge counters, compaction count). Threads carrying
/// this key are suspended on restart instead of failed.
pub const RUNTIME_CHECKPOINT_METADATA_KEY: &str = "runtime_checkpoint";
function runInitialHydrationRetry() {
var pending = window._initialHydrationPending;
if (!pending || window._hydrationRetryDone) return;
window._hydrationRetryDone = true;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium Severity

The hydration retry can be consumed before any loader has failed.

runInitialHydrationRetry() marks _hydrationRetryDone = true and later clears _initialHydrationPending even when all pending flags are still false. In initApp(), connectSSE() is called before the initial loadThreads() call, so a fast SSE onopen can run this function first, consume the one retry, and clear the tracker. If the first loadThreads / loadHistory / loadMissions request then rejects, the catch block sees _initialHydrationPending === null and cannot flag itself for retry. That leaves the same blank initial UI state this change is trying to recover from.

Consider only marking the retry as done when at least one pending flag is true, and have loader catches trigger the retry immediately when SSE is already open (or track an explicit initialHydrationSseReady flag).

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after paranoid review. One medium follow-up was left inline for the hydration retry race.

@serrrfirat serrrfirat added this pull request to the merge queue May 7, 2026
Merged via the queue into main with commit 80c20ed May 7, 2026
38 checks passed
@serrrfirat serrrfirat deleted the fix/3274-projects-tab-failed-threads branch May 7, 2026 11:57
This was referenced May 7, 2026
@ironclaw-ci ironclaw-ci Bot mentioned this pull request May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules size: L 200-499 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Data Missing After Upgrading IronClaw 0.26.0 → 0.27.0 Until Manual Refresh, and Projects Tab Shows Failed Threads Warning

3 participants