fix(engine,web): suppress restart-recovery noise on Projects tab; retry empty hydration on SSE open (#3274) #3328
…ry empty hydration on SSE open (#3274)

Fixes three post-upgrade UI symptoms reported in #3274:

- Projects tab no longer surfaces threads that were force-failed by `recover_project_threads` on engine restart. Those threads now carry the `engine_restart_recovery` metadata flag, and the projects overview filters them out of both `failures_24h` and the "Needs attention" feed so an upgrade does not cascade into phantom "Thread failed: …" warnings.
- Chat / Missions tabs no longer render silently empty when the first hydration request races engine init. `.catch(() => {})` blocks now log via `console.error` and flag the loader; a one-shot retry runs on the first SSE `onopen` (SSE accept implies `init_engine` has finished, since it is awaited before `channels.start_all()`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Addresses post-upgrade UI noise and first-load empty states by tagging restart-recovered threads in the engine, filtering those artifacts out of the Projects “needs attention” feed, and adding a one-time hydration retry on SSE connect after initial loader failures.
Changes:
- Tag non-terminal threads force-failed during engine restart recovery with `engine_restart_recovery` metadata and re-export the metadata key from `ironclaw_engine`.
- Filter restart-recovery failures out of `get_engine_projects_overview` failure counts and attention items; add unit tests for the predicate.
- Replace silent promise catches in history/threads/missions loaders with `console.error` + a one-time SSE-onopen retry mechanism.
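
The filtering rule in the second item can be illustrated with a short sketch. This is JavaScript for illustration only; the actual predicate is the Rust function `is_real_thread_failure` in `src/bridge/router.rs`, whose body is not shown here. The thread shape (`state`, `updatedAt`, `metadata`) and the helper `countFailures24h` are assumptions made for the example.

```javascript
// Illustrative only: the real predicate is Rust code in src/bridge/router.rs.
// The thread shape used here (state, updatedAt, metadata) is an assumption.
const ENGINE_RESTART_RECOVERY_KEY = "engine_restart_recovery";
const DAY_MS = 24 * 60 * 60 * 1000;

// A failure counts as "real" unless restart recovery produced it.
function isRealThreadFailure(thread) {
  if (thread.state !== "Failed") return false;
  const meta = thread.metadata || {};
  return meta[ENGINE_RESTART_RECOVERY_KEY] !== true;
}

// Hypothetical helper: count failures from the last 24 hours,
// skipping restart-recovery artifacts.
function countFailures24h(threads, nowMs) {
  return threads.filter(
    (t) => isRealThreadFailure(t) && t.updatedAt >= nowMs - DAY_MS
  ).length;
}
```

Under these assumptions, a thread that failed for its own reasons is still counted, while one carrying the recovery flag is not, which matches the intent described above.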
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `crates/ironclaw_engine/src/runtime/manager.rs` | Tags restart-recovered threads via metadata; exports metadata key constants; tightens recovery test. |
| `crates/ironclaw_engine/src/lib.rs` | Re-exports recovery/checkpoint metadata key constants for downstream consumers. |
| `src/bridge/router.rs` | Adds `is_real_thread_failure` predicate and applies it to Projects overview failure surfacing; adds unit tests. |
| `crates/ironclaw_gateway/static/js/core/init-auth.js` | Adds hydration failure tracking + one-time retry hook invoked after SSE connects. |
| `crates/ironclaw_gateway/static/js/core/sse.js` | Invokes hydration retry from SSE `onopen`. |
| `crates/ironclaw_gateway/static/js/core/history.js` | Logs and flags initial hydration failures for chat history/threads instead of swallowing errors. |
| `crates/ironclaw_gateway/static/js/surfaces/projects.js` | Logs and flags initial hydration failures for missions instead of swallowing errors. |
    /// Metadata key set on a thread that has an in-flight pending-approval
    /// gate. Persisted threads carrying this key skip restart-recovery so the
    /// gate survives a process restart.
    pub const PENDING_APPROVAL_METADATA_KEY: &str = "pending_approval";

    /// Metadata key set on a thread that has a serialized runtime checkpoint
    /// (CodeAct VM state, nudge counters, compaction count). Threads carrying
    /// this key are suspended on restart instead of failed.
    pub const RUNTIME_CHECKPOINT_METADATA_KEY: &str = "runtime_checkpoint";

    function runInitialHydrationRetry() {
      var pending = window._initialHydrationPending;
      if (!pending || window._hydrationRetryDone) return;
      window._hydrationRetryDone = true;
Medium Severity
The hydration retry can be consumed before any loader has failed.
`runInitialHydrationRetry()` marks `_hydrationRetryDone = true` and later clears `_initialHydrationPending` even when all pending flags are still false. In `initApp()`, `connectSSE()` is called before the initial `loadThreads()` call, so a fast SSE `onopen` can run this function first, consume the one retry, and clear the tracker. If the first `loadThreads` / `loadHistory` / `loadMissions` request then rejects, the catch block sees `_initialHydrationPending === null` and cannot flag itself for retry. That leaves the same blank initial UI state this change is trying to recover from.
Consider only marking the retry as done when at least one pending flag is true, and have loader catches trigger the retry immediately when SSE is already open (or track an explicit `initialHydrationSseReady` flag).
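
One way the suggestion above could look is sketched here. It is not the code in `init-auth.js`: the factory `createHydrationTracker` and its methods `markFailed` and `onSseOpen` are hypothetical names, and the sketch retries each loader at most once rather than sharing a single global retry, which is a design choice of this example.

```javascript
// Hypothetical tracker; all names here are assumptions, not the PR's code.
// `loaders` maps a loader name to a function that re-runs that loader.
function createHydrationTracker(loaders) {
  const pending = new Set(); // loaders whose first call failed
  const retried = new Set(); // loaders that already used their one retry
  let sseReady = false;      // true once SSE onopen has fired

  function retryPending() {
    // Iterate over a copy: a retried loader may fail again synchronously.
    for (const name of Array.from(pending)) {
      pending.delete(name);
      if (retried.has(name)) continue;
      retried.add(name);
      loaders[name]();
    }
  }

  return {
    // Call from each loader's catch block.
    markFailed(name, err) {
      console.error("initial hydration failed:", name, err);
      if (retried.has(name)) return; // one retry per loader, no loops
      pending.add(name);
      if (sseReady) retryPending();  // SSE already open: retry right away
    },
    // Call from the SSE onopen handler.
    onSseOpen() {
      sseReady = true;
      retryPending(); // no-op when nothing has failed, so no retry is consumed
    },
  };
}
```

With this shape, an early `onopen` leaves the retry budget untouched, and a loader that rejects afterwards still gets its single retry.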
serrrfirat left a comment
Approved after paranoid review. One medium follow-up was left inline for the hydration retry race.
Summary
Closes #3274 — three post-upgrade UI symptoms (0.26.0 → 0.27.0) with two distinct root causes.
- `ThreadManager::recover_project_threads` force-fails every non-terminal thread on engine restart with reason `"engine restart before thread completion"`. After upgrade, every running/created/waiting thread from the previous process gets `state = Failed` with `updated_at = now`, which then floods `get_engine_projects_overview`'s `state == Failed && updated_at >= now − 24h` filter as `kind=failure` attention items. Recovered threads are now tagged with the metadata flag `engine_restart_recovery: true` (exported from the engine crate as `ENGINE_RESTART_RECOVERY_METADATA_KEY`), and the projects overview excludes them from both `failures_24h` and the "Needs attention" feed.
- `loadHistory`, `loadThreads`, and `loadMissions` ended in `.catch(() => {})` / `.catch(function() {})`, the silent-failure anti-pattern called out in `.claude/rules/error-handling.md`. They now log via `console.error` and flag `window._initialHydrationPending`. A new `runInitialHydrationRetry()` re-runs only the flagged loaders, exactly once per `initApp()` (guarded by `_hydrationRetryDone`), wired into the SSE `onopen` handler. SSE accept implies the backend has stabilised: `init_engine` is awaited before `channels.start_all()` (`src/agent/agent_loop.rs:744`).

The fix preserves real failure surfacing: a thread that explicitly transitions to `Failed` for any non-recovery reason still appears in the attention feed.

Files

- `crates/ironclaw_engine/src/{lib.rs,runtime/manager.rs}`: metadata constant, recovery tagging, regression-tighten existing test.
- `src/bridge/router.rs`: extracted `is_real_thread_failure` predicate, applied to `failures_24h` count + attention loop, plus four new unit tests.
- `crates/ironclaw_gateway/static/js/{core/{history,init-auth,sse}.js,surfaces/projects.js}`: replace silent catches, hydration retry plumbing.

Test plan

- `cargo fmt`
- `cargo clippy --all --benches --tests --examples --all-features`: zero warnings
- `cargo test --lib`: 5623 passed
- `cargo test -p ironclaw_engine`: 520 passed (incl. tightened `recover_project_threads_marks_non_terminal_as_failed`)
- `bash scripts/pre-commit-safety.sh`: exit 0

Out of scope (follow-ups if the retry doesn't fully resolve 1 & 2)

- `loadThreads`/`loadMissions` would error on the first post-upgrade call (e.g. a tenant whose `user_id` is not covered by `migrate_legacy_user_ids`). The new `console.error` makes this surfaceable.
- `engine_ready` SSE event instead of relying on the implicit "SSE-open == engine ready" assumption.

🤖 Generated with Claude Code
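
The `engine_ready` follow-up listed under "Out of scope" could be consumed on the client roughly as follows. This is a sketch under assumptions: no such event exists in this PR, and `hydrateWhenEngineReady` is a hypothetical helper. It relies only on the standard `addEventListener` / `removeEventListener` interface that `EventSource` exposes for named events.

```javascript
// Hypothetical: assumes the server emits a named "engine_ready" SSE event,
// which this PR does not add. `source` is an EventSource-like object.
function hydrateWhenEngineReady(source, hydrate) {
  let done = false;
  function onReady() {
    if (done) return; // guard against duplicate events
    done = true;
    source.removeEventListener("engine_ready", onReady);
    hydrate();
  }
  source.addEventListener("engine_ready", onReady);
}
```

A caller would pass its existing `EventSource` instance and a function that runs the loaders. The listener only sees events that arrive after it is attached, so this approach assumes the server sends `engine_ready` on every new connection once the engine is up.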