fix(bridge): mission_* tools accept name; resolves #2583#3155
Conversation
The bug bash report on #2583 ("Routine creation fails with '5 consecutive code errors'") attributed the failure to the budget cap rework in #2843. A live regression test (added here) reproduces the original prompt verbatim and shows the actual root cause is much smaller: every mission_* handler in the bridge required an `id` UUID, but the LLM and the routine_* alias path naturally pass `name`. Each call collapsed to "invalid mission id: invalid length 0", retried 4–5×, and tripped the orchestrator's `consecutive_action_errors >= max_consecutive_errors` guard. Fix: - New `resolve_mission_id` helper in `src/bridge/effect_adapter.rs` resolves a MissionId from `name` (preferred) → `id` UUID (legacy fallback) → positional arg, looking up names through `MissionManager::list_missions(project_id, user_id)`. UUIDs stay internal. - `mission_get`, `mission_fire`, `mission_pause`, `mission_resume`, `mission_complete`, `mission_update` migrated to use the helper. `mission_update` renames now go through `new_name`; the legacy `name`-as-rename-target behaviour is preserved only when a UUID `id` is also passed. - Schemas in `src/bridge/engine_actions.rs` advertise `name` as primary with `id` documented as a legacy alternative; neither is `required`. Live regression test (`tests/e2e_live_routine.rs`): - Drives the issue's exact prompt through engine v2. - Asserts no "invalid mission id" / "consecutive code errors" surfaces. - Verifies routine_fire delivers a `**[name]**` notification on both the initial fire and a second-turn re-fire. - Recorded trace committed for deterministic replay. Two related observations surfaced by the live test, parked as separate follow-ups (warnings only — not test-failing): - Administrative tools (mission_create / routine_create / mission_fire / routine_fire) do not surface `ApprovalNeeded` gates with `auto_approve_tools=false` even though they are in the `AUTONOMOUS_TOOL_DENYLIST`. - The mission-fire child thread sometimes returns work narration in FINAL() instead of the requested data point — a `build_meta_prompt` guidance issue independent of #2583.
There was a problem hiding this comment.
Code Review
This pull request implements a name-based resolution system for missions and routines, enabling identification via human-readable names in addition to UUIDs. A centralized resolve_mission_id helper is introduced and integrated into various mission actions, while action definitions are updated to prioritize name as the primary identifier. The PR also adds comprehensive regression tests and a live end-to-end test for routine workflows. Feedback indicates that the mission_update logic fails to preserve legacy behavior for renames when a UUID is provided, and notes a discrepancy between the implementation and documentation regarding the resolution priority of identifiers.
| // Renames use `new_name`. The legacy `name` field is | ||
| // now the lookup key (consumed by `resolve_mission_id`), | ||
| // not the rename target — see the schema in | ||
| // `engine_actions.rs` and the migration note in the | ||
| // helper docs above. | ||
| if let Some(new_name) = params.get("new_name").and_then(|v| v.as_str()) { | ||
| updates.name = Some(new_name.to_string()); | ||
| } |
There was a problem hiding this comment.
The implementation of mission_update does not preserve the legacy behavior where name acts as the rename target when a UUID id is also provided, despite the PR description claiming this preservation. The current code only checks new_name for renames, which breaks backward compatibility for existing callers or recorded traces using the (id=UUID, name=NewName) pattern.
// Renames use new_name. The legacy name field is
// now the lookup key (consumed by resolve_mission_id),
// not the rename target — unless a UUID id is also
// passed, preserving legacy behavior.
if let Some(new_name) = params.get("new_name").and_then(|v| v.as_str()) {
updates.name = Some(new_name.to_string());
} else if let Some(id_val) = params.get("id").and_then(|v| v.as_str()) {
if uuid::Uuid::parse_str(id_val).is_ok() {
if let Some(name) = params.get("name").and_then(|v| v.as_str()) {
updates.name = Some(name.to_string());
}
}
}References
- When intercepting and forwarding a tool call to a backend implementation, ensure all relevant parameters from the original call are passed through, including legacy fields required for backward compatibility.
| /// Resolution order (first match wins): | ||
| /// 1. `params.id` — if present and a valid UUID, use directly. | ||
| /// 2. `params.name` — look up `(project_id, user_id, name)` via | ||
| /// `MissionManager::list_missions`. The same identifier used in | ||
| /// `mission_create` is what we resolve back here. | ||
| /// 3. `params._args[0]` — positional fallback used by some Tier-0 | ||
| /// tool-call shapes; tried as UUID first, then as name. |
There was a problem hiding this comment.
The docstring's resolution order does not match the actual implementation. Specifically, params._args[0] as a valid UUID (checked at line 2200) has higher priority than params.name as a name (checked during the mission list iteration at line 2218), but the docstring lists it as a fallback (#3).
| /// Resolution order (first match wins): | |
| /// 1. `params.id` — if present and a valid UUID, use directly. | |
| /// 2. `params.name` — look up `(project_id, user_id, name)` via | |
| /// `MissionManager::list_missions`. The same identifier used in | |
| /// `mission_create` is what we resolve back here. | |
| /// 3. `params._args[0]` — positional fallback used by some Tier-0 | |
| /// tool-call shapes; tried as UUID first, then as name. | |
| /// Resolution order (first match wins): | |
| /// 1. params.id — if present and a valid UUID, use directly. | |
| /// 2. params._args[0] — if present and a valid UUID, use directly. | |
| /// 3. params.name — look up (project_id, user_id, name) via | |
| /// MissionManager::list_missions. | |
| /// 4. params.id — look up by name (legacy fallback). | |
| /// 5. params._args[0] — look up by name (positional fallback). |
References
- When resolving identifiers from stored tool data or parameters, implement and document a clear fallback chain to check multiple possible field names or positions.
There was a problem hiding this comment.
Fixed in b74c42e. The docstring on resolve_mission_id now matches the new four-step resolution order including the conflict guard. _args[0] no longer short-circuits ahead of explicit name (it's now a name candidate only — see serrrfirat's related comment).
There was a problem hiding this comment.
Pull request overview
This PR fixes routine/mission tool calls in the bridge by allowing mission_* handlers (and their routine_* aliases) to resolve missions by name instead of requiring a UUID id, preventing retry loops that previously tripped the orchestrator’s “consecutive code errors” guard (issue #2583).
Changes:
- Add
resolve_mission_idand migratemission_get/fire/pause/resume/complete/updateto acceptname(preferred) with legacyidfallback. - Update mission tool schemas to advertise
nameas primary and introducenew_namefor renames inmission_update. - Add a new live/replayable end-to-end regression test plus fixture, and extend the test rig with approval helpers.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/bridge/effect_adapter.rs |
Adds resolve_mission_id and switches mission handlers to name-first resolution. |
src/bridge/engine_actions.rs |
Updates mission tool schemas and guidance to prefer name and use new_name for rename. |
tests/support/test_rig.rs |
Adds channel_handle() and send_exec_approval() for background helpers in tests. |
tests/e2e_live_routine.rs |
Adds a live/replayable E2E regression test for #2583. |
tests/fixtures/llm_traces/live/routine_btc_create_test_and_refire.json |
Adds recorded trace fixture for deterministic replay. |
Comments suppressed due to low confidence (1)
src/bridge/effect_adapter.rs:1027
mission_completeresolves an explicit UUID viaresolve_mission_idand then callsMissionManager::complete_mission(id)without any ownership/shared-access check. Sincecomplete_missionitself does not enforce tenant isolation, a caller who knows/guesses another user's mission UUID could complete it. Add an authorization check here (e.g., load the mission and verifymission.is_owned_by(context.user_id)ormission.owner_id().is_shared()before completing), or change the engine API to requireuser_idlikefire_mission/pause_mission/resume_mission/update_missiondo.
"mission_complete" => {
let id =
resolve_mission_id(mgr.as_ref(), context.project_id, &context.user_id, params)
.await;
match id {
Ok(id) => match mgr.complete_mission(id).await {
Ok(()) => Ok(serde_json::json!({"status": "completed"})),
Err(e) => Err(e),
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| &["id"], | ||
| &[], | ||
| &[ | ||
| "Identify the mission to update with `name` (preferred) or `id` — exactly one is required.", |
There was a problem hiding this comment.
Fixed in 720a0db. Discovery summary now matches the actual handler tolerance: "Identify the mission with name (preferred) or id (UUID). If both are provided they must identify the same mission, or use the legacy {id, name} rename shape (where name is the new name) — otherwise the resolver errors with 'identify different missions'."
| self.handle.abort(); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Fixed in b74c42e. ApprovalAutoResponder now implements Drop that aborts the handle. shutdown() is kept for explicitness; abort() is idempotent so the double-abort path is safe.
| // We resolved the id via name lookup, so this | ||
| // should be unreachable in practice (the lookup | ||
| // already proved the mission exists). Keep the | ||
| // arm so the match is exhaustive. |
There was a problem hiding this comment.
Fixed in b74c42e. The comment now correctly notes that Ok(None) is reachable when the caller passed an explicit UUID for a deleted/unknown mission — names that didn't match are caught earlier by resolve_mission_id with a more useful error.
|
|
||
| Err(EngineError::Effect { | ||
| reason: format!( | ||
| "mission not found by name: tried {name_candidates:?} for user '{user_id}'. \ |
There was a problem hiding this comment.
Fixed in b74c42e. user_id is no longer included in the mission not found error. Same fix on mission_get's "belongs to another user" arm — the foreign mission's UUID is now scrubbed from the LLM-visible reason and preserved only via tracing::debug!.
| let mut updates = ironclaw_engine::MissionUpdate::default(); | ||
| if let Some(name) = params.get("name").and_then(|v| v.as_str()) { | ||
| updates.name = Some(name.to_string()); | ||
| // Renames use `new_name`. The legacy `name` field is | ||
| // now the lookup key (consumed by `resolve_mission_id`), | ||
| // not the rename target — see the schema in | ||
| // `engine_actions.rs` and the migration note in the | ||
| // helper docs above. | ||
| if let Some(new_name) = params.get("new_name").and_then(|v| v.as_str()) { | ||
| updates.name = Some(new_name.to_string()); | ||
| } |
There was a problem hiding this comment.
Fixed in 0bd586d (initial preservation) + b74c42e (mission_update handler now strips name from the resolver-view params when the legacy shape is detected, so the new conflict guard doesn't reject it). Two regression tests pin it: mission_update_preserves_legacy_id_plus_name_rename and mission_update_renames_via_new_name_and_uses_name_as_lookup.
|
|
||
| // De-duplicate while preserving order. | ||
| let mut seen: std::collections::HashSet<&str> = std::collections::HashSet::new(); | ||
| name_candidates.retain(|s| seen.insert(*s)); |
There was a problem hiding this comment.
Performance regression on hot path (High): every name-keyed mission_* call now loads the user's full mission list. A cron routine firing every 5 min on an account with hundreds of historical missions (including completed/archived) pays an O(N) DB read per fire — the previous UUID-only path was O(1).
Suggest adding MissionManager::find_by_name(project_id, user_id, name) so the bridge does a targeted lookup. Even a small per-(project_id, user_id) name→id cache inside MissionManager (invalidated on create/rename/complete) would pay for itself fast — the same mission name is hit on every fire/get/pause cycle.
There was a problem hiding this comment.
Addressed in b74c42e. Added MissionManager::find_by_name(project_id, user_id, name) -> Result<Option<Mission>, EngineError> typed API; bridge resolver switched off list_missions onto it. Today find_by_name still scans the user's missions in-memory; the doc-comment names the future store-level index drop-in (Store::find_mission_by_name) explicitly. Adding it later is non-breaking — every caller goes through this method now, so the optimization plugs in without churn at the call sites.
| // Renames use `new_name`. The legacy `name` field is | ||
| // now the lookup key (consumed by `resolve_mission_id`), | ||
| // not the rename target — see the schema in | ||
| // `engine_actions.rs` and the migration note in the |
There was a problem hiding this comment.
Foot-gun (High): if a caller passes mission_update({id: <real_uuid_A>, name: "mission-B", new_name: "renamed"}), resolve_mission_id resolves id first (UUID fast-path), so mission A is renamed even though name says "mission-B". The schema text in engine_actions.rs says name is the lookup key — the helper logic contradicts that whenever id is also present.
Defense-in-depth: when both id (UUID) and name are provided and resolve to different missions, error out with "both id and name provided but they identify different missions" instead of silently preferring one. The doc comment on resolve_mission_id should also call out this precedence trap explicitly so the next caller doesn't trip it.
There was a problem hiding this comment.
Fixed in b74c42e. Added the conflict guard you suggested. New regression test mission_resolver_errors_when_id_and_name_disagree pins it. Error reads: "both id and name provided but they identify different missions: id=X vs name=[Y] → Z. Provide only one — or fix the name to match the id." The resolver docstring also explicitly calls out the conflict-guard contract.
| /// cause of the bug bash failure (not the budget-rework hypothesis in | ||
| /// the linked #2843). | ||
| #[tokio::test] | ||
| async fn mission_fire_resolves_by_name_when_id_absent() { |
There was a problem hiding this comment.
Test coverage gap (High): all six mission_* handlers (get, fire, pause, resume, complete, update) now route through resolve_mission_id, but only mission_fire got new by-name regression tests. The most foot-gun-prone handler — mission_update (see related comment on the new_name rename) — has no by-name test at all, and the existing mission_update_string_guardrails_rejected_via_execute_action still passes id directly so doesn't exercise the new path.
Per CLAUDE.md "Test Through the Caller, Not Just the Helper", each migrated handler needs a caller-level test. Suggest three more symmetric to this one: mission_get_resolves_by_name, mission_complete_resolves_by_name, and mission_update_renames_via_new_name_when_name_is_lookup_key (the last one is the regression guard for the id+name foot-gun).
There was a problem hiding this comment.
Fixed in b74c42e. Added by-name caller-level tests for every migrated handler: mission_get_resolves_by_name, mission_complete_resolves_by_name, mission_pause_and_resume_resolve_by_name, plus the conflict-guard regression mission_resolver_errors_when_id_and_name_disagree and the legacy-rename regression mission_update_preserves_legacy_id_plus_name_rename.
| .and_then(|a| a.get(0)) | ||
| .and_then(|v| v.as_str()) | ||
| .filter(|s| !s.is_empty()) | ||
| { |
There was a problem hiding this comment.
Edge case (Medium): if a caller passes {name: "actual-mission", _args: ["<uuid-of-other>"]}, this _args[0] UUID parse short-circuits with return Ok(MissionId(uuid)) — the explicit name field is silently ignored. This contradicts the docstring's stated principle that name is preferred and _args[0] is a "positional fallback used by some Tier-0 tool-call shapes".
Two ways out:
- (a) Preserve preference: move the
_args[0]UUID parse to aftername_candidateslookup falls through, so explicitnamewins. - (b) Treat positional as always-a-name: drop the
_args[0]UUID-parse branch and only push toname_candidates. UUID-shaped strings would still match if any mission happens to be named with a UUID — vanishingly rare and an explicit caller decision.
Either is fine; the current ordering is the worst of both.
There was a problem hiding this comment.
Fixed in b74c42e — went with option (b). _args[0] is now treated as a name candidate only, never honoured as a UUID. UUID-shaped strings would only match if a mission happens to be named with one (vanishingly rare and the explicit caller decision you noted).
| } | ||
| } | ||
|
|
||
| Err(EngineError::Effect { |
There was a problem hiding this comment.
Information disclosure (Medium): echoing user_id into the error reason leaks server-side identity into the LLM context, conversation history, and any tool-error telemetry. The previous error ("mission not found: {id_str}") only echoed user input.
EngineError::Effect::reason flows back to the LLM and is persisted — depending on the channel, user_id is a UUID (low value), but for some flows it's an external identity (chat id, session-derived token). Suggest dropping the for user '{user_id}' segment entirely. The LLM can't act on it anyway (it can't list a different user's missions).
Same goes for the mission {} belongs to another user reason a few lines above (mission_get arm at the new resolve site) — there's no LLM-actionable info in the foreign mission's UUID, and emitting it confirms the existence of a mission the caller has no claim to.
There was a problem hiding this comment.
Fixed in b74c42e. Both error reasons are scrubbed: user_id is gone from the "mission not found" error, and mission_get's "belongs to another user" reason no longer echoes the foreign mission's UUID (preserved only via tracing::debug!).
| /// tool-call shapes; tried as UUID first, then as name. | ||
| /// | ||
| /// A non-UUID `id` value is treated as a name (some callers conflate the | ||
| /// two), which keeps backward compat for `mission_complete`'s historical |
There was a problem hiding this comment.
Architecture (Medium): name-or-id resolution is conceptually a MissionManager concern, not a bridge-adapter concern. Keeping this 60-line policy here means any future caller — CLI mission commands, the WS handler, a WASM channel that wants a mission lookup, integration tests outside the bridge — duplicates the resolution rules.
Suggest moving to a typed API on MissionManager:
enum MissionRef<'a> {
Id(MissionId),
Name(&'a str),
}
impl MissionManager {
pub async fn resolve(&self, project_id: ProjectId, user_id: &str, r: MissionRef<'_>)
-> Result<MissionId, EngineError> { ... }
}The bridge then becomes a one-liner that picks the right variant from params. The two ambiguity rules (id+name simultaneous; _args[0]) collapse into one well-scoped place that has a unit test suite of its own.
There was a problem hiding this comment.
Partially addressed in b74c42e. The typed engine API (MissionManager::find_by_name) is in place — that's the part future callers (CLI, WS, WASM channels, integration tests) will actually need. The 60-line param-parsing in the bridge stayed put because it deals in JSON and Tier-0 calling conventions (_args[0] etc.) that aren't a MissionManager concern. Happy to lift the param-shape policy onto a MissionRef enum + MissionManager::resolve in a follow-up if you'd prefer the typed-API form — should I file that?
| "HIT" | ||
| ], | ||
| [ | ||
| "cf-ray", |
There was a problem hiding this comment.
Trace fixture noise (Medium): the recorded CoinGecko response carries every header from the real Cloudflare/Coingecko response — cf-ray, x-request-id, etag, age, date, cf-cache-status, strict-transport-security, content-security-policy-report-only, etc. They aren't secrets, but they:
- bloat the diff (~941 lines for one e2e test, most of it duplicated headers across three identical price requests),
- pin the fixture to one Cloudflare datacenter (
-FUK) and a 2026-05-01 timestamp, - risk replay surprises if any HTTP cache layer in the stack decides to honor
etag/age.
Suggest a fixture-redaction helper in tests/support/ (e.g. redact_response_headers(allow=["content-type", "content-length"])) applied at fixture-write time. Net result is a smaller, more reproducible fixture and one less drift source for replay-mode tests.
There was a problem hiding this comment.
Deferred — agreed on the diagnosis. The fixture is committed to enable replay-mode CI runs, and a redaction helper that's careful to keep replay deterministic is its own small piece of work. Will file a follow-up issue today if you want; otherwise happy to tackle in a tiny PR right after this one merges.
…alf 1) Issue #3133 reported a Gmail-sending mission firing every 3 minutes with the FINAL() body "Failed to send email. Status: None Error: None / Next focus: Debug Gmail authentication." The new live test under #2583's infrastructure reproduced the actual mechanism: a mission's child thread hits a `tool_activate(gmail)` OAuth gate, the engine emits `ThreadOutcome::GatePaused`, and `process_mission_outcome_and_notify` — which only matched `Completed` / `Failed` / `MaxIterations` — silently swallowed it via a bare `_ => {}` arm. The cron scheduler kept ticking, the user got no auth-tray prompt, and the LLM eventually narrated the two-None pattern when its `http` fallback failed. Engine side (`crates/ironclaw_engine/src/runtime/mission.rs`): - New `MissionGateInfo { gate_name, action_name, parameters, request_id, resume_kind }` struct, exported from the engine crate. - `MissionNotification` gains an `Option<MissionGateInfo> gate` field. - New `ThreadOutcome::GatePaused` arm in `process_mission_outcome_and_notify` that: * Transitions the mission to `MissionStatus::Paused`. The cron tick path filters by `Active` (lines 1058 / 1131 / 1194 / 1781), so pausing is enough to stop the ghost re-firing pattern. * Records `PAUSED: gate '...' on action '...' awaiting <kind>` in `approach_history`. * Emits a `MissionNotification` with `gate: Some(...)` and a user-facing nudge tailored per `ResumeKind` (Authentication / Approval / External). - Regression test `gate_paused_pauses_mission_and_emits_gate_notification` pinning the contract. Bridge side (`src/bridge/router.rs::handle_mission_notification`): - New optional args (`auth_manager`, `tools`, `extension_manager`) plumbed through the mission-notification dispatcher loop. - When `notif.gate` is `Some`, additionally emit a structured `StatusUpdate` per `ResumeKind`: * `Authentication` → `StatusUpdate::AuthRequired { extension_name, instructions, auth_url, setup_url, request_id }`. Extension identity is resolved via the canonical `resolve_extension_for_action` helper (the same one foreground auth gates use), so the gateway UI's auth tray fires on the same path. * `Approval` → `StatusUpdate::ApprovalNeeded { request_id, tool_name, description, parameters, allow_always }`. * `External` → no user-actionable surface; falls through to the text response only. - The friendly text response is still broadcast on `notify_channels` alongside the structured status update, so the user sees both an auth-tray entry and a chat message. This is half 1 of the fix discussed in #3133. Half 2 (auto-resuming a paused mission once auth completes) is a separate follow-up: the mission stays Paused after the user resolves the gate today; they have to explicitly call `mission_resume`. Test infrastructure: - New `tests/e2e_live_mission_gmail.rs` driving the issue's flow with Gmail OAuth seeded from the source DB. Asserts no "Status: None Error: None" pattern and no "consecutive code errors" surface. - Shared mission/routine test helpers extracted to `tests/support/live_mission_helpers.rs` (ApprovalAutoResponder, looks_like_routine_notification, tool_is, wait_for_response_matching) so #2583's routine test and #3133's mission test reuse the same bug-pattern detectors without copy-paste drift.
|
Half 2 follow-up filed as #3166: 'Mission auto-resume after auth/approval gate resolution'. That issue covers persisting |
Addresses the high/medium-severity findings from the PR #3155 review (serrrfirat, Gemini, Copilot). No behaviour change for the happy path the original PR established (name-keyed mission_* dispatch keeps working); the conflict and information-disclosure paths get safer. Engine (`crates/ironclaw_engine/src/runtime/mission.rs`): - New `MissionManager::find_by_name(project_id, user_id, name)` typed API. Today it's a thin wrapper over `list_missions_with_shared` plus an in-memory filter, but every name-keyed call goes through it so a future store-level index drops in non-breakingly. The doc-comment names the optimisation (Store::find_mission_by_name) and the perf concern from the review explicitly. - Two new unit tests: round-trip via create+find, and the per-user scope (mission for bob is invisible when alice queries). Bridge (`src/bridge/effect_adapter.rs::resolve_mission_id`): - Switched off `list_missions` and onto `mgr.find_by_name`, so the hot path stops doing an O(N) scan over the user's full mission set per fire. - Added the id+name conflict guard. If both are supplied AND the name resolves to a different mission than the id identifies, the resolver now errors loudly: "both id and name provided but they identify different missions: id=X vs name=[Y] → Z. Provide only one — or fix the name to match the id." Previously the UUID would silently win, which is the foot-gun serrrfirat flagged: a typoed `name` would rename the wrong mission. New regression test `mission_resolver_errors_when_id_and_name_disagree` pins it. - Inverted the `_args[0]` precedence: a positional arg that *parses as a UUID* used to short-circuit ahead of `params.name`, which contradicted the docstring's "explicit name preferred". Now `_args[0]` is treated as a name candidate only; positional UUIDs aren't honoured at all (rare in practice — the Tier-0 callers that use `_args[0]` overwhelmingly pass names). - Dropped `user_id` from the "mission not found" error string. The reason field flows to the LLM context and conversation history; echoing the server-side identity served no purpose. Same fix on the `mission_get` "belongs to another user" arm — the foreign mission's UUID is now scrubbed from the LLM-visible reason and preserved only via tracing::debug!. - Updated the stale "unreachable Ok(None)" comment in mission_get per Copilot — the arm IS reachable when the caller passed an explicit UUID for a deleted/unknown mission. Bridge (`mission_update` handler): - The legacy `{id, name=NEWNAME}` rename shape (preserved by the earlier review-fix commit on this branch) was tripping the new conflict guard because the resolver saw both `id` and `name` as identifiers. The handler now strips `name` from the resolver-view params *only* when the legacy shape is detected (id is a valid UUID, `new_name` absent), so the resolver never sees a would-be-rename `name` as a lookup target. Both the legacy and canonical (`new_name`) paths keep their existing tests. By-name regression tests (`src/bridge/effect_adapter.rs`): - Closes the test gap flagged by serrrfirat — `mission_fire` was the only handler with a by-name caller-level test before. Adds: - `mission_get_resolves_by_name` - `mission_complete_resolves_by_name` - `mission_pause_and_resume_resolve_by_name` - `mission_resolver_errors_when_id_and_name_disagree` (the new conflict guard above) - All migrated handlers now have a caller-level test that exercises the name path through `execute_action`. Test infra (`tests/support/live_mission_helpers.rs`): - `ApprovalAutoResponder` now implements `Drop` (per Copilot). A panicking test would otherwise leak the background polling task across test boundaries. `shutdown()` is kept for explicitness; abort() is idempotent.
|
Pushed b74c42e addressing the review feedback. Summary by reviewer: serrrfirat:
Gemini:
Copilot:
Deferred:
Test results on b74c42e: 448 bridge tests + 89 engine mission tests, all green. fmt + clippy -D warnings clean. |
There was a problem hiding this comment.
Pull request overview
This PR updates the engine-v2 bridge so mission_* (and routine_* aliases) can target missions by name (preferred) with UUID id as a legacy fallback, addressing the retry loop behind issue #2583. It also adds mission gate metadata/notification surfacing and introduces live/replay E2E regression coverage.
Changes:
- Add name→id resolution (
resolve_mission_id) and migratemission_get/fire/pause/resume/complete/updatehandlers to acceptname(with legacyidfallback). - Update mission tool schemas to advertise
nameas primary and introducenew_nameformission_updaterenames. - Add live/replay E2E tests + shared helpers and fixtures; extend mission notifications with gate metadata and bridge-side status emission.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/bridge/effect_adapter.rs |
Adds resolve_mission_id and migrates mission handlers to name-first resolution; adds unit tests around the new behavior. |
src/bridge/engine_actions.rs |
Updates tool schemas/descriptions to prefer name, keeps id as optional legacy, adds new_name for renames. |
src/bridge/router.rs |
Extends mission-notification handling to surface gate state via StatusUpdate::{AuthRequired,ApprovalNeeded}. |
crates/ironclaw_engine/src/runtime/mission.rs |
Adds MissionGateInfo and emits it on ThreadOutcome::GatePaused; pauses missions on gate outcomes. |
crates/ironclaw_engine/src/lib.rs |
Re-exports MissionGateInfo. |
src/channels/web/tests/no_silent_drop.rs |
Adapts test notifications to include the new gate field and updated handler signature. |
tests/support/test_rig.rs |
Adds channel_handle() and a helper to send structured exec approvals. |
tests/support/mod.rs |
Registers shared live-mission helper module behind libsql. |
tests/support/live_mission_helpers.rs |
Adds shared live-test helpers (notification heuristic, approval auto-responder, waiters). |
tests/e2e_live_routine.rs |
Adds live/replay E2E regression test for #2583 prompt flow and notification assertions. |
tests/e2e_live_mission_gmail.rs |
Adds live/replay E2E test scenario around Gmail mission behavior and gate-related regression markers. |
tests/fixtures/llm_traces/live/routine_btc_create_test_and_refire.json |
Adds recorded trace fixture for deterministic replay of the live routine test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if let Some(gate) = ¬if.gate | ||
| && let Some(tools) = tools | ||
| { | ||
| for channel_name in ¬if.notify_channels { | ||
| let metadata = serde_json::json!({"user_id": notif.user_id}); | ||
| let status = match &gate.resume_kind { | ||
| ironclaw_engine::ResumeKind::Authentication { | ||
| credential_name, | ||
| instructions, | ||
| auth_url, | ||
| } => { | ||
| let extension_name = resolve_extension_for_action( | ||
| auth_manager, | ||
| extension_manager, | ||
| tools, | ||
| &gate.action_name, | ||
| &gate.parameters, | ||
| credential_name.as_str(), | ||
| ¬if.user_id, | ||
| ) | ||
| .await; | ||
| StatusUpdate::AuthRequired { | ||
| extension_name, | ||
| instructions: Some(instructions.clone()), | ||
| auth_url: auth_url.clone(), | ||
| setup_url: None, | ||
| request_id: Some(gate.request_id.clone()), | ||
| } | ||
| } | ||
| ironclaw_engine::ResumeKind::Approval { allow_always } => { | ||
| StatusUpdate::ApprovalNeeded { | ||
| request_id: gate.request_id.clone(), | ||
| tool_name: gate.action_name.clone(), | ||
| description: format!( | ||
| "Mission '{}' is waiting for approval to run '{}'.", | ||
| notif.mission_name, gate.action_name | ||
| ), | ||
| parameters: gate.parameters.clone(), | ||
| allow_always: *allow_always, | ||
| } | ||
| } |
There was a problem hiding this comment.
Fixed in 720a0db. MissionGateInfo.request_id renamed to call_id (matches what it actually carries — the engine LLM tool-call id). New MissionGateInfo.gate_request_id: Uuid is freshly generated at gate-emit time, and the bridge now forwards it (not call_id) into StatusUpdate::AuthRequired.request_id / ApprovalNeeded.request_id. The gateway resolve POST (features/chat/mod.rs:183/274/303 + platform/ws.rs:236) now sees a UUID-shaped string and won't 400. Mapping the gateway-issued UUID back to the engine call_id for actual auto-resume is half-2 (#3166).
| /// Engine-side `call_id` used to resume. | ||
| pub request_id: String, |
Addresses Copilot's review on PR #3155 commit b74c42e: > StatusUpdate::ApprovalNeeded.request_id is later parsed as a UUID by > the web approval endpoints / WS handler (they reject non-UUID > request_id). Here the mission gate wiring forwards gate.request_id, > but that value originates from ThreadOutcome::GatePaused.call_id > (LLM tool-call id like call_...), so approvals from mission/routine > notifications will be impossible to resolve via the gateway UI. Confirmed by reading the gateway resolve handler at `src/channels/web/features/chat/mod.rs:183`, `:274`, `:303` and the WS handler at `src/channels/web/platform/ws.rs:236` — all four call `Uuid::parse_str(&req.request_id)` and reject non-UUIDs with 400 Bad Request. Mission auth-tray entries built off the LLM `call_id` would silently 400 every resolve POST, making the tray decorative. Engine (`crates/ironclaw_engine/src/runtime/mission.rs`): - Renamed `MissionGateInfo.request_id` → `MissionGateInfo.call_id` (its actual identity — engine LLM tool-call id) so the field name matches what it carries. - Added `MissionGateInfo.gate_request_id: Uuid` (a fresh UUID generated at gate-emit time). Doc-comment explicitly contrasts the two fields and points at #3166 for the half-2 wiring that maps back from `gate_request_id` to `call_id` for engine resume. - Updated `gate_paused_pauses_mission_and_emits_gate_notification` to assert both fields are populated and that `gate_request_id` is a valid (non-nil) UUID. Bridge (`src/bridge/router.rs::handle_mission_notification`): - AuthRequired and ApprovalNeeded translations now forward `gate.gate_request_id.to_string()` into `StatusUpdate.request_id`, not `gate.call_id`. Inline comment names the gateway/WS handler files that parse it as UUID so the next maintainer doesn't re-introduce the bug. Schema (`src/bridge/engine_actions.rs::mission_update`): - Discovery summary previously said "exactly one is required" for name/id, but the handler tolerates both (with the conflict guard added on b74c42e + the legacy `{id, name}` rename shape). Wording now matches actual behaviour and explicitly calls out the conflict-guard outcome and the legacy rename shape so a tool-using model knows what to expect.
|
Pushed 720a0db closing the last review-feedback items. Full status: Just landed (720a0db):
Already landed earlier on this branch:
Deferred (each individually replied on the inline thread):
Test results on 720a0db: 19/19 mission tests in |
…) (#3366) * fix(missions): auto-resume paused missions after gate resolution (#3166) Half-2 of #3133. Half-1 (PR #3155) made the engine emit a typed `GatePaused` outcome that pauses a mission and surfaces an `AuthRequired` status update on the user's auth tray. Until now the mission stayed Paused even after the user completed OAuth — they had to manually call `mission_resume`. This patch wires the credential- write and gate-resolve paths so paused missions auto-resume the moment the gate they were waiting on is resolved. Engine (`crates/ironclaw_engine/`) - `Mission` gains a persistent `paused_gate: Option<MissionGateInfo>`. `MissionGateInfo` moved from `runtime/mission.rs` to `types/mission.rs` and made `Serialize`/`Deserialize` so it round-trips through the store. Existing rows decode with `paused_gate = None` via `#[serde(default)]`. - `MissionManager::resume_paused_for_credential(credential_name, user_id)` walks paused missions visible to the user, matches by `Authentication.credential_name`, transitions Paused → Active, clears `paused_gate`, and kicks an immediate fire for non-Manual cadences. Returns the resumed ids. - `MissionManager::resume_paused_for_request_id(uuid, GateResolutionOutcome, user_id)` resumes by `gate_request_id` for approval/external gates. Approved → resume + fire; Denied/Cancelled → mark Failed. - `resume_mission` clears `paused_gate` so a manual resume doesn't leave stale gate state behind. - 4 new unit tests pin the matrix: `oauth_completion_resumes_paused_mission`, `unrelated_credential_write_does_not_resume_paused_mission`, `gate_resolution_approved_resumes_matching_paused_mission`, `gate_resolution_denied_marks_paused_mission_failed`. Bridge (`src/bridge/`) - New pub helpers `resume_paused_missions_for_credential` and `resume_paused_missions_for_gate_request` in `router.rs`. Both best-effort: helper failures log and swallow so an OAuth flow cannot 5xx because of stale mission state. Wiring - Gateway OAuth callback (`channels/web/features/oauth/mod.rs`) and WASM-tool OAuth completion (`extensions/manager.rs`) call `resume_paused_missions_for_credential` immediately after `oauth::store_oauth_tokens` succeeds, before resolving the foreground gate. - `/api/chat/gate/resolve` (`channels/web/features/chat/mod.rs`) fans the disposition out to `resume_paused_missions_for_gate_request` after the foreground gate is resolved. CredentialProvided is excluded — the credential write itself triggers the OAuth-side hook. Tests - The Rust live test `tests/e2e_live_mission_gmail.rs` is replaced by `tests/e2e/scenarios/test_mission_gmail_3133.py`. The Rust variant required a real Gmail OAuth token in the developer's `~/.ironclaw/` DB; the Python tier uses the deterministic mock LLM + the existing `/oauth/callback` flow, so it runs in CI. - The Python file pins three checks: the OAuth callback drives the resume hook without erroring when no paused mission matches; the browser sees the extension flip to authenticated; no thread in the gateway carries the `Status: None`+`Error: None` dual-marker fingerprint or the `consecutive code errors` surface from #2583. - Out of scope: chat-driven mission lifecycle (LLM emits `mission_create` + `mission_fire` and the child trips `tool_activate(gmail)`). The unit tests cover that state transition in isolation; adding canned mock-LLM responses for the chat path is a follow-up. - `secrets_store::create/update` (manual API-token entry) hook is also out of scope for this PR — only OAuth completion paths are wired today. Easy to extend later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): live-LLM record/replay infra for #3133 mission auto-resume Replaces the stub mock-LLM Playwright test with real live-test infrastructure that mirrors the Rust `LiveTestHarnessBuilder` record/replay pattern (`tests/support/live_harness.rs`) for the Python tier: - `tests/e2e/live_llm_proxy.py` — HTTP proxy that sits between ironclaw and a real LLM. Two modes: * RECORD (`IRONCLAW_LIVE_TEST=1`): forwards `/v1/chat/completions` to upstream (`IRONCLAW_LIVE_LLM_BASE_URL` / `..._API_KEY` / `..._MODEL`) and writes (request_hash, response) pairs to a JSON fixture. Canonicalisation strips tool-call ids and other non-deterministic fields so re-recordings match. * REPLAY: serves recorded responses by canonical-request hash. Streaming requests are re-emitted as a single SSE chunk + `[DONE]` on replay, which is sufficient for ironclaw's chunk accumulator. - `tests/e2e/live_harness.py` — `start_live_proxy(test_name)`, `is_live_mode()`, and `proxy_state()` helpers. Skips (not fails) in replay mode when the per-test fixture is missing so a fresh checkout doesn't bog down on un-recorded traces. - `tests/e2e/conftest.py` adds `mission_gmail_live_server` / `mission_gmail_live_page` fixtures. The fixture wires: * the live proxy as `LLM_BASE_URL` * `IRONCLAW_TEST_HTTP_REWRITE_MAP` routing `gmail.googleapis.com` at `mock_llm.py`'s new `/gmail/v1/users/me/{drafts,messages,...}` mock endpoints (so the gmail WASM tool's HTTP calls land deterministically) * `ENGINE_V2=true` and `AGENT_AUTO_APPROVE_TOOLS=true` so the chat-driven mission_create + fire flow runs without an administrative-approval prompt while the *authentication* gate that triggers the #3133 path remains active. - `tests/e2e/mock_llm.py` adds Gmail HTTP mocks (drafts/send/list/get) plus `/__mock/gmail/{state,reset}` so tests can assert that the gmail tool actually fired against this mock (rather than no-oping or hitting the real Gmail API). - `tests/e2e/scenarios/test_mission_gmail_3133.py` is rewritten to drive the full chat flow: chat asks for the mission, the live LLM emits `mission_create` + `mission_fire`, the child thread tries to use gmail, the auth gate fires and the mission transitions to Paused, the test completes OAuth via `/oauth/callback`, the half-2 wire (`bridge::resume_paused_missions_for_credential`) resumes the mission, and the re-fired child thread completes the draft against the gmail mock. Known limitation documented in the test docstring: when the live LLM picks Tier 1 (CodeAct / Python via Monty) for the child thread, `tool_activate` is NOT exposed as a CodeAct callable — only Tier 0 structured tool calls expose it. Recent Sonnet defaults to Tier 1, so the auth-gate path that half-1 added never fires in those traces and the child thread surfaces a tool-error narration instead. The test skips with a clear diagnostic in record mode if that happens and the trace recording becomes a useful artifact for investigating. Half-2 is verified end-to-end for the Tier 0 path by the unit tests in `crates/ironclaw_engine/src/runtime/mission.rs`. A Tier-1-resilient fix needs the engine to auto-promote "credential missing" tool errors to a `GatePaused` outcome so the Tier 1 path also writes `paused_gate` — tracked as a follow-up on the same issue thread. The fixture file is intentionally not committed in this PR — the tooling works (recording produced 12 entries against NearAI's Sonnet 4.5 in local testing) but a clean fixture requires the Tier 1 gap to be addressed first. The committed `.gitkeep` keeps the fixtures directory in tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine,bridge): inline-await for Authentication gates (Tier 0 + Tier 1) PR #3157 wired Tier 0 + Tier 1 inline gate-await for Approval gates but explicitly excluded Authentication (and External) — the trait docstring on `GateController::pause` said "Restricted to Approval" and both executor paths bailed with `RuntimeError("execution paused by gate ...")` for non-Approval. That left the #3133 path partial: a mission's CodeAct child thread that hit a Gmail OAuth gate unwound to `ThreadOutcome::GatePaused`, but then the LLM in the re-entered thread saw a stale "tool returned error" instead of a clean retry against the now-present credential, and the mission either re-fired on cron (the original ghost-fire) or narrated the failure into FINAL(). This patch lets Authentication gates flow through the same inline- await path as Approval and wires the OAuth callback to wake parked VMs by credential name. Engine (`crates/ironclaw_engine/`) - `executor/scripting.rs`: `resolve_tool_future` and the `drive_inline_gate` retry loop now accept `ResumeKind::Authentication` alongside Approval. External keeps the legacy unwind path because its resolution payload (callback body) can't be handed back to a suspended call. - `executor/structured.rs`: `execute_with_inline_gate_retry` mirrors the same expansion for Tier 0 batches. - `gate/mod.rs`: drop the "Restricted to Approval" doc on `GateController::pause`. Document the new contract — the host controller is expected to deliver `GateResolution::Approved` once the credential lands in the secrets store, and the engine retries the action inline against the now-present secret. Bridge (`src/bridge/`) - `gate_controller.rs`: drop the non-Approval cancel in `BridgeGateController::pause`. Authentication is now handled the same way Approval is: register the pending gate, park on a oneshot, await resolution. Add a credential-name → request_ids index in `GateResolutions` and a new `deliver_for_credential(credential_name)` method that delivers `Approved` to every parked Authentication waiter for that credential. - `router.rs`: new pub helper `resolve_inline_gates_for_credential(credential_name)` that walks `EngineState::gate_resolutions` and calls `deliver_for_credential`. Added a `gate_resolutions` field on `EngineState` (and threaded through every test fixture) so the OAuth-callback path can reach the resolutions registry without going through the controller's internals. Wiring - `channels/web/features/oauth/mod.rs` and `extensions/manager.rs`: the OAuth completion path now fires BOTH `resolve_inline_gates_for_credential` (wakes parked Tier 0 / Tier 1 VMs for live foreground or mission child-thread executions) AND `resume_paused_missions_for_credential` (the half-2 mission-arm hook for missions whose child thread already unwound before OAuth completed). Best-effort dispatch — failures are logged inside the helpers and never block the OAuth landing page. Tests - All 27 existing `engine_v2_gate_integration` tests continue to pass unchanged. The Approval contract is unchanged and the Authentication contract was previously untestable end-to-end at the inline-await level (the old code surfaced as a `RuntimeError`). - `runtime::mission::tests` (104 tests) all pass — the engine state machine is unchanged for the half-2 mission arm. Open follow-up: a new gate_integration test that pins the Tier 1 + Authentication + credential-write resume cycle. The existing test harness already exercises the Tier 0 + Approval cycle; adding the analogous Authentication test is straightforward but out of scope for this commit (the LLM-driven trace recording in the Playwright e2e test serves as the integration-tier coverage once the chat path can drive it deterministically). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): preserve legacy unwind for Authentication when controller cancels Follow-up to the previous commit. The previous change made all Authentication gates flow through `GateController::pause()`, but that broke 4 existing tests (`gate_paused_authentication_carries_credential_name`, `auth_resolution_retries_same_pending_action_without_second_pause`, `approval_chains_directly_into_auth_for_install_flow`, `install_auth_resume_followed_by_aliased_tool_call_completes_without_hanging`) that exercise the legacy `ThreadOutcome::GatePaused` re-entry path. Those tests use `CancellingGateController` (or a fixture variant that cancels) which previously surfaced as the legacy unwind because non-Approval never reached the controller; now that Authentication does reach the controller, Cancelled was being treated as denial. Fix: when the controller returns `Cancelled` for an Authentication gate, fall through to the legacy unwind path (re-raise the original `Err(GatePaused)` in Tier 0 / re-raise `RuntimeError("execution paused by gate ...")` in Tier 1). Semantically: Cancelled-on-Auth means "the controller can't resolve OAuth inline" — that's the fallback condition for missions and non-inline-aware controllers, where the engine should hand the gate up to the orchestrator so the mission can transition to Paused and the half-2 mission-arm auto-resume mechanism takes over after OAuth. `Denied` and explicit user-driven `Cancelled` from a real production controller (BridgeGateController via `/api/chat/gate/resolve`) still fail the action. Approved retries inline as before. Test updates - `AutoApprovingGateController::pause` now only auto-approves Approval gates and explicitly returns Cancelled for Authentication so the 4 legacy-path tests continue to exercise the unwind. - `approval_chains_directly_into_auth_for_install_flow` now expects the controller to observe BOTH the Approval pause AND the Authentication pause (which falls through to legacy unwind), since Authentication now goes through the controller before unwinding. - New `authentication_gate_resolves_inline_via_controller` test pins the new inline-await contract for Authentication: a controller that returns Approved (mimicking the production BridgeGateController after a credential-write hook fires) makes the action retry inline and the thread complete in a single join, with both gate and retry events recorded. All 28 `engine_v2_gate_integration` tests pass. 524 `ironclaw_engine` lib tests pass (unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(engine): make tool_activate redundant for installed-but-unauthed tools User's question: "do we need tool_activate at all?" Answer: not for the case where a tool is installed but its credential is missing. The engine should detect missing credentials at execute time and raise an Authentication gate (which the inline-await machinery from the previous commits then resolves after OAuth). The model can call the tool directly — no preceding `tool_activate(name=...)` step. Three cooperating changes ship that contract: 1. `Tool::required_credentials()` — new trait method, default empty. `WasmToolWrapper` overrides it by walking `capabilities.http.credentials` and returning every non-optional `secret_name`. Other tool kinds (built-in, MCP, etc.) inherit the empty default and are unaffected. 2. `AuthManager::check_action_auth` — extends the existing HTTP-tool credential preflight to non-HTTP tools. After the existing `is_http` early return, looks up the tool by name in the registry and consults `tool.required_credentials()`. For each declared credential it tries `resolve_secret_for_runtime`; missing ones produce `AuthCheckResult::MissingCredentials`. The bridge's `effect_adapter::execute_action` already converts that to `Err(EngineError::GatePaused { resume_kind: Authentication, ... })`, which the inline-await loops in `executor/structured.rs` and `executor/scripting.rs` (extended in the previous commits) park on `gate_controller.pause()`. After OAuth completes, `bridge::resolve_inline_gates_for_credential` delivers `Approved` and the action retries against the now-present secret. 3. `tool_surface::is_direct_ready` — `NeedsAuth` extension tools now stay on the callable surface alongside `Ready`. Pre-#3133 they were filtered out of `available_actions` and the model had to call `tool_activate` to move them to Ready. Post-fix they appear on the schema and in the CodeAct namespace; auth resolves at execute time. `NeedsSetup`, `Inactive`, `Latent`, and `AvailableNotInstalled` still fall through to capabilities-only because their resolution requires user-driven onboarding work that a credential-write hook can't supply. Test updates - `bridge::tool_surface::tests::assigns_surface_matrix_rows` — the "needs-auth extension direct action" case flips from `capabilities_only()` to `actions_only()`. - `bridge::action_projector::tests::needs_auth_provider_tools_omitted_from_available_actions` renamed to `..._stay_in_available_actions` and expectation flipped. The `NeedsSetup` companion test stays unchanged — that path still needs explicit setup via `tool_activate` / extension UI. - `bridge::effect_adapter::tests::available_actions_omit_installed_needs_auth_provider_action` renamed to `..._keep_installed_needs_auth_provider_action` and expectation flipped. - 460 bridge tests pass, 28 gate integration tests pass, 524 engine tests pass, 19 auth tests pass. Live test prompt updated - The Playwright test no longer instructs the LLM to call `tool_activate` first. The new prompt says "call the gmail tool directly — the runtime handles authentication". This matches the new contract and gives the LLM the simplest path to the gmail call. Known limitation surfaced by the live recording - The live trace against NearAI's Sonnet 4.5 still skips with the test's "mission never reached Paused" diagnostic. Inspection shows gmail is NOT registered as a callable tool with the engine's tool registry until the WASM extension is activated — install + unauthed leaves the WASM module loaded but its tool definitions un-registered. So even with the surface change in place, the CodeAct namespace doesn't see `gmail` and the LLM can't invoke it. Closing this last gap requires registering WASM tools at install time (rather than at activation time) so the engine's preflight has a tool to look up and gate. Tracked as a follow-up on the same #3166 thread; the unit-test coverage of the auto-resume mechanism is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(extensions): auto-register WASM tools with the engine registry on install User asked: "can we remove tool_activate as a tool completely? and instead register tools on install in ToolRegistry. Let's make the live test work end to end." This commit takes the first step: WASM tools are now registered with the engine's `ToolRegistry` immediately at install time, not gated behind a separate `tool_activate(name=...)` call. Combined with the previous commits' surface change (`is_direct_ready(NeedsAuth)`) and per-tool credential preflight (`AuthManager::check_action_auth` reading `Tool::required_credentials()`), the model can now call an installed-but-unauthed tool directly — the engine raises an `Authentication` gate at execute time and the inline-await machinery resumes the action after OAuth completes. `install_wasm_tool_from_url_with_caps` now invokes `activate_wasm_tool` immediately after the download lands. Any activation failure is logged and downgraded to a warning rather than unwinding the install — so a transient runtime error doesn't break install. The user can still re-run `/api/extensions/{name}/activate` in that fallback case. Verified by direct diagnostic: a chat against a fresh ironclaw with a freshly-installed (but unauthed) gmail extension now shows `gmail` registered in `/api/extensions/tools` and classified as `Inline` in the action projector (NeedsAuth status, available_actions = true). 132 extension manager tests pass. 460 bridge tests pass. 524 engine tests pass. 28 gate integration tests pass. Remaining work to make `tool_activate` fully retire and the live test pass end-to-end (not blocking this commit): 1. Remove `tool_activate` as a builtin tool, drop it from the AUTONOMOUS_TOOL_DENYLIST, drop the special-cased install-approval logic, and update the engine v2 system prompt's "Activatable Integrations" guidance. Mostly mechanical. 2. Make Tier 1 (CodeAct) mission child threads surface `ThreadOutcome::GatePaused` when an Authentication gate fires and the controller cancels (no PerExecutionContext registered for missions today). Currently the legacy fallback raises a Python `RuntimeError("execution paused by gate ...")` inside the script, which appears in stdout but isn't propagated to `CodeExecutionResult::need_approval`. The Tier 0 path already surfaces correctly via the gate_paused result_json. Closing this gap likely needs either: (a) a side-channel from `drive_inline_gate` into `execute_code`'s `need_approval` field, or (b) registering a mission-thread `PerExecutionContext` so the controller parks indefinitely on Authentication and the mission completes the action inline instead of unwinding. (b) is cleaner — missions then behave like the foreground chat path. The live test stays in the tree as the recording artifact. The diagnostic confirms the architectural change is correct (gmail is classified Inline + registered) — the remaining gap is downstream of that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): surface Tier 1 Cancelled+Authentication gates as need_approval Adds the missing wire from `drive_inline_gate`'s Cancelled+Authentication fallback to `CodeExecutionResult::need_approval`, so mission child threads running in Tier 1 (CodeAct) actually transition to Paused instead of silently dropping the gate. Before this commit: - Tier 1 mission child thread calls a tool that needs auth - Engine raises GatePaused{Authentication} - Inline-await calls `gate_controller.pause()` — no PerExecutionContext registered for missions, so the controller cancels - Cancelled+Authentication fallback in `drive_inline_gate` raises `RuntimeError("execution paused by gate ...")` - Script exits with the error in stdout - `CodeExecutionResult::need_approval` was always None - Orchestrator sees the script error but no `pending_gate` field - Mission stays Active, cron keeps re-firing — the original #3133 ghost-fire shape After this commit: - `drive_inline_gate` writes the original `ThreadOutcome::GatePaused` to a tokio `task_local!` (`PENDING_GATE_STASH`) before raising the RuntimeError - `execute_code_with_skills_inner` runs inside `PENDING_GATE_STASH.scope(RefCell::new(None), ...)` so the stash is scoped per execution - The runtime-error exit path reads the stash and threads it into `CodeExecutionResult::need_approval` - The orchestrator's `__run_code__` wrapper already converts `need_approval` into a `pending_gate` field on the script result (line ~1061 of orchestrator.rs) - The Python orchestrator reads `pending_gate`, transitions the thread to Waiting, and returns `outcome=gate_paused` - Rust converts that to `ThreadOutcome::GatePaused` - `process_mission_outcome_and_notify` sees GatePaused → mission Paused → `paused_gate` set → user sees notification → user does OAuth → half-2 mission arm auto-resumes the mission This is the missing piece to make Tier 1 missions safe under #3133. For Tier 0, the legacy unwind already produces a `gate_paused` result_json directly. For foreground threads with a wired BridgeGateController, inline-await parks indefinitely on Authentication (the new contract from earlier commits) and the Cancelled fallback never fires. 524 engine tests pass. 28 gate integration tests pass. 460 bridge tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(bridge): tone down preflight-auth + gate-cancel logs to debug Cleanup pass after the broader #3133 wiring audit. Both lines were temporarily promoted to info! during debugging; for production they should sit at debug! to avoid log noise on the happy path. No behavior change. * fix: complete Tier 1 mission auto-resume + WASM tool test rewrites for #3133 Five small fixes that together let the mission auto-resume (#3133 / #3166) flow run end-to-end through Tier 1 (CodeAct/Monty) child threads: - engine: extract `take_pending_gate_stash()` and apply it at all three Tier 1 script-error exit paths in `execute_code_with_skills_inner`. The previous fix only covered one exit; the others silently swallowed Cancelled+Authentication gates, leaving the mission Active when Monty raised the gate from a non-progress error path. - engine: deterministic sort in the trait-default `list_missions_with_shared` plus the in-memory `store_adapter::list_missions` / `list_all_missions`. Eliminates HashMap-iteration-order leakage into `mission_list` tool results so the LLM and replay tests see a stable shape. - bridge gate_controller: replace stale "no per-execution context → cancel" comment with the real reason — mission/background threads intentionally fall through to the legacy `ThreadOutcome::GatePaused` unwind so `process_mission_outcome_and_notify` (#3133 half-1) flips the mission to Paused, and `resume_paused_for_credential` (half-2) resumes after OAuth. - WASM tool wrapper: in test/debug builds, after credential injection apply `IRONCLAW_TEST_HTTP_REWRITE_MAP` to outbound URLs so live-test fixtures can route `gmail.googleapis.com` at the mock_llm endpoint the same way WASM channels already do. Also implement `Tool::required_credentials()` from the WASM capability manifest so the engine can preflight credential checks without invoking the tool. - WASM tool http_security: when `IRONCLAW_TEST_HTTP_REWRITE_MAP` is set in a debug build and the rewritten host resolves to loopback, allow it through the SSRF guard. Production paths (no env var, or release builds) keep rejecting private/internal IPs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): land live fixture for #3133 + harden replay canonicalization Records the full LLM trace for the mission auto-resume regression and makes replay deterministic across re-recordings on different machines: - live_llm_proxy: drop skill bodies from the canonical hash. Skills are local-machine state (the user's `~/.ironclaw/skills` set varies between record and replay) so hashing the body forces a fresh recording for every checkout. The skill *set* still influences the LLM's actual response — replay just doesn't re-validate it. Also collapse `[SKILL] skill:NAME ...` and `### [SKILL] skill:NAME ...` forms via a single regex so user-message and system-prompt skill blocks both normalize. - live_llm_proxy: log every chat_completions request and any upstream 4xx/5xx body to stderr. Diagnosable failures during a re-record without re-running the harness from scratch. - conftest + live_harness: when `IRONCLAW_E2E_STDERR_LOG` / `IRONCLAW_LIVE_PROXY_STDERR_LOG` are set, redirect ironclaw and proxy stderr to a file so the LLM-call sequence and any rust-side retry warnings are durable across teardown. Defaults are unchanged (PIPE) so CI behavior matches today. - fixtures/live/test_mission_gmail_draft_3133.json: fresh recording produced by `IRONCLAW_LIVE_TEST=1 IRONCLAW_LIVE_LLM_MODEL=Qwen/Qwen3.5-122B-A10B pytest scenarios/test_mission_gmail_3133.py`. Replay passes deterministically (25.87s on this box) and exercises the full Paused → OAuth → Active → gmail-draft loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: remove tool_activate; rely on auth-preflight for installed-but-unauthed tools `tool_activate` was the engine-v2 enablement entry point for installed provider tools that needed OAuth before the model could call them. Post-#3133 the same flow is covered by the auth preflight in `AuthManager::check_action_auth`: any tool with declared `required_credentials()` raises an `Authentication` gate at execute time when the credential is missing, the inline-await machinery parks the VM, and the OAuth callback resumes the action. The separate `tool_activate` step is now redundant. What's removed: - `ToolActivateTool` struct + Tool impl (`src/tools/builtin/extension_tools.rs`) - Registration sites: `ToolRegistry::register_extension_tools`, `effect_adapter` test rigs, action_projector test surface - The `tool_activate_requires_install_approval` adapter method, `matching_extension_requires_install_approval` helper, and three approval-gate tests that only fired on the `tool_activate` lookup name - The `tool_activate` entry from the autonomous tool denylist (`src/tools/autonomy.rs`, `crates/ironclaw_engine/src/gate/tool_tier.rs`), `seeded_default_permission`, the action-resolver alias map (`src/auth/extension.rs`), the schema validator, and the autonomous whitelist in `src/tools/registry.rs` - The "Activatable Integrations" prompt template's `tool_activate(name=...)` instruction; replaced with text telling the model these integrations need user setup through the IronClaw UI - `tests/e2e/scenarios/test_v2_tool_activate_surface.py` and `tests/e2e_auth_gate_traces.rs` plus its 5 LLM trace fixtures — these tested the old tool-output-detection contract that the auth preflight obsoletes; equivalent coverage lives in `bridge::effect_adapter::tests::preflight_gate_blocks_missing_credential`, the engine v2 mission unit tests, and the live mission gmail e2e What changed in the prompt: - `is_activatable_integration` now excludes `NeedsAuth` — those tools are direct-callable through the regular action inventory. `NeedsSetup`, `Inactive`, `Latent`, and `AvailableNotInstalled` still surface separately because they need user-driven setup the model can't do itself. Live fixture re-hashed: stripped `tool_activate` from each entry's `request_canonical.tools` and recomputed `request_hash` so replay matches the new (1-shorter) tool list. Replay test still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: cargo fmt across the branch Run rustfmt on the engine and bridge files touched by the earlier mission auto-resume + tool_activate-removal commits. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(missions, gates): address PR #3366 review — atomic resume + per-user wakeup scoping Five correctness fixes from the PR review on #3366: 1. **Per-user credential wakeup (Copilot)** `GateResolutions.by_credential` now keys by `(user_id, credential_name)`. Previously a credential write under one account could deliver `Approved` to parked Authentication gates from a different account that happened to share the same credential name. `register_credential`, `deliver_for_credential`, and the public `bridge::resolve_inline_gates_for_credential` helper now all carry `user_id`. Call sites: OAuth callback handler, WASM OAuth completion in extensions manager. 2. **Atomic `resume_paused_for_request_id` (serrrfirat HIGH)** The handler now reloads the mission and re-checks `paused_gate.gate_request_id` in the same load+save round-trip, instead of relying on the snapshot it scanned for. Prevents a race where a mission re-paused on a different gate between the snapshot and the resume could be transitioned to `Active` (silently clearing the new gate) because `resume_mission` accepts both `Paused` and `Failed`. Returns `None` when the live gate no longer matches. 3. **Shared mission auto-resume runs under requesting user (serrrfirat HIGH)** For shared missions (`mission.user_id == "__shared__"`), `resume_paused_for_credential` and `resume_paused_for_request_id` now use `user_id` (the user whose credential / resolution unblocked the gate) for `fire_mission`, while keeping `mission.user_id` for the `resume_mission` ownership check. Previously the post-resume fire ran under `__shared__`, which has no per-user secret / project scope. 4. **Mission-only gate cancellation (serrrfirat MEDIUM)** `chat_gate_resolve_handler` no longer requires `thread_id` on the `Cancelled` arm. Mission-only gates have no foreground `thread_id` in the resolution payload; the mission auto-resume path now carries the `Cancelled` outcome to the mission state machine even without a foreground submission. Foreground gates that do supply `thread_id` still dispatch the structured cancellation as before. 5. **CredentialProvided now triggers mission auto-resume (serrrfirat MEDIUM)** `chat_gate_resolve_handler` now fires `mission_outcome = Approved` for `CredentialProvided` submissions, not just `Approved`. Previously, missions paused on a non-OAuth credential (e.g. user submits a Telegram bot token through the gate UI) would only auto-resume on a subsequent OAuth callback. The new mission helpers' atomic gate-id check makes the double-fire (this hook + the OAuth-callback path) idempotent. Plus six doc fixes: stale references to `bridge::resume_paused_missions_for_credential` in inline-await docstrings (`gate/mod.rs`, `executor/structured.rs`, `executor/scripting.rs`) now correctly point at `bridge::resolve_inline_gates_for_credential`. The router doc no longer claims a `secrets_store::create/update` hook that doesn't exist; updated to name the actual call sites and document the manual-credential follow-up. Test fixtures in `bridge::router::tests` (4 EngineState constructors) now share a single `Arc<GateResolutions>` between `gate_controller` and `gate_resolutions` so test paths exercising `resolve_inline_gates_for_credential` actually reach the parked waiters that `BridgeGateController::pause` registered. All 5623 lib tests pass; replay e2e still green at ~28s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(e2e): correct return type on live_llm_proxy SSE helpers Copilot review: `_emit_streamed_response` was annotated `web.StreamResponse` but the underlying `_send_sse` helper buffers the payload and returns a `web.Response`. Aligned the chain (`_emit_streamed_response`, `_send_sse_payload`, `_send_sse_lines`, `_send_sse`) on the actual return type and removed the unreachable `_start`/`started` block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: bump to retrigger workflows on PR #3366 * fix(missions, wasm, e2e): address remaining PR #3366 review feedback - chat_gate_resolve_handler: validate gate request_id once up front so every arm (including Approved/Denied) returns a uniform 400 on malformed UUIDs and the mission auto-resume hook isn't silently skipped. - extensions::manager::install_wasm_tool_from_url_with_caps: reflect activation outcome in InstallResult.message — no longer claims "installed and ready" when activate_wasm_tool fails. - tools::wasm::http_security: drop the is_unspecified() branch from the IRONCLAW_TEST_HTTP_REWRITE_MAP escape hatch; the rewrite helper only emits loopback IPs, so allowing 0.0.0.0/:: just weakened the SSRF guard in debug/test builds. - tests/e2e/live_harness.py: remove unused subprocess import and unused _find_free_port helper. - tests/e2e/live_llm_proxy.py: docstring on _normalize_skills_block said "sorted name list" while the implementation drops the entire block to a `[SKILLS]` placeholder — corrected to match behavior. [skip-regression-check] — review-feedback fixes, no behavioral regression to add a test for; existing chat-gate resolve coverage exercises the validated path, and the SSRF guard fix tightens an unreachable branch. * fix(gates, wasm, e2e): address final PR #3366 review and unblock CI - chat/gate-resolve cancel: when the client omits `thread_id` for a foreground inline-await gate, look up the owning thread from `PendingGateStore` instead of skipping dispatch — otherwise the parked VM is stranded even though the API reports success. Adds `PendingGateStore::peek_by_request_id` (user-scoped, read-only) and `bridge::get_pending_gate_by_request_id`. - tools/wasm rewrite validator: replace the hardcoded `http://...`-prefix match with URL parsing that mirrors the channel-side `is_loopback_test_rewrite_base`, accepting any http/https loopback target. Also strip IPv6 brackets in both helpers so `[::1]` actually validates. - tests/e2e/pyproject.toml: scope setuptools auto-discovery to `scenarios*` so the new `fixtures/` (live-LLM record/replay JSON) no longer trips the "multiple top-level packages" error and `pip install -e .` works in CI. Regression coverage: store-level peek_by_request_id ownership + expiry tests; loopback-rewrite parity + https-loopback acceptance tests in tools/wasm/wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
5 consecutive code errors. The actual root cause was much smaller than the linked Cost-based budgeting: replace iteration/time caps with USD budgets cascading user → project → mission → thread #2843 epic suggested: everymission_*handler in the bridge required anidUUID, but the LLM and theroutine_*alias path naturally passname. Each call collapsed toinvalid mission id: invalid length 0, retried 4–5×, and tripped the orchestrator's consecutive-error guard.name(preferred) withidUUID retained as a legacy fallback; UUIDs stay internal.What changed
src/bridge/effect_adapter.rsresolve_mission_idhelper. Handlersmission_get,mission_fire,mission_pause,mission_resume,mission_complete,mission_updatemigrated.mission_updaterenames vianew_name; legacyname-as-rename-target preserved only when a UUIDidis also passed. Four new unit tests.src/bridge/engine_actions.rsnameas primary,idas legacy alternative. Neither field isrequired.mission_updatedocumentsnew_namefor renames.tests/support/test_rig.rschannel_handle()accessor andsend_exec_approvalfor the live test's approval responder.tests/e2e_live_routine.rsinvalid mission idorconsecutive code errors; verifies**[name]**notifications on both fire and refire.tests/fixtures/llm_traces/live/routine_btc_create_test_and_refire.jsonHow the fix was found
Driving the issue's verbatim prompt through engine v2 with a real LLM (the new test) reproduced the failure on first run. The trace showed:
routine_firetranslates tomission_fireviaroutine_to_mission_alias, which passesparams.clone()unchanged. Themission_firehandler then tried to parse the routine name as a UUID and failed. Five errors in a row → orchestrator killed the thread → "5 consecutive code errors" symptom.Verification
cargo test --features libsql --lib bridge::effect_adapter— all 113 effect-adapter tests pass, including 4 new ones for the helper / handler / alias paths.cargo test --features libsql --test e2e_live_routine -- --skip routine_btc— heuristic + structural unit tests pass.IRONCLAW_LIVE_TEST=1 cargo test --features libsql --test e2e_live_routine -- --ignored routine_btc_create_test_and_refire --nocapture— passes in 49s. Tools observed:[\"routine_create(*/5 * * * *)\", \"routine_fire(bitcoin-price-check)\", \"routine_history(bitcoin-price-check)\", \"job_status(...)\"]. Noinvalid mission id. Noconsecutive code errors. Notifications delivered on both the initial fire and the second-turn refire.IRONCLAW_LIVE_TEST) loads the committed fixture and runs deterministically.Out of scope (parked for separate follow-ups)
The live test surfaced two further observations that are independent of this fix and not test-failing:
auto_approve_tools=false,mission_create / routine_create / mission_fire / routine_fire(all classifiedAdministrativepercrates/ironclaw_engine/src/gate/tool_tier.rs) ran without firing a singleStatusUpdate::ApprovalNeeded. Logged as a warning in the test.build_meta_prompt(crates/ironclaw_engine/src/runtime/mission.rs:2132–2139) instructs the child to FINAL() with "1. What you accomplished / 2. Next focus / 3. Goal achieved" — meta-prose. For "fetch X and surface it" routines the user actually wants the data. Logged as a warning in the test.Both warrant their own issues; flagging here so they're not lost.
Test plan