
refactor: extract AppEvent to crates/ironclaw_common #1615

Merged
ilblackdragon merged 4 commits into staging from refactor/extract-app-event-to-ironclaw-common
Mar 25, 2026

Conversation

@ilblackdragon (Member)

Summary

  • New crate crates/ironclaw_common with AppEvent (renamed from SseEvent) and truncate_preview — shared types/utils that were leaking from the web gateway into 12+ modules
  • Rename SseEvent → AppEvent across 23 files — the enum was the app-wide event protocol, not a web transport concern
  • Rename from_sse_event() → from_app_event() on WsServerMessage
  • Wire format unchanged — serde renames are on variants, not the enum name
  • web/types.rs and web/util.rs re-export from ironclaw_common for backward compat within the gateway
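A rough sketch of what the shared enum with its consolidated event_type() helper might look like after the extraction — variant names here are invented for illustration; the real AppEvent has many more variants plus #[serde(rename)] attributes:

```rust
// Hypothetical sketch: variant names are illustrative, not the real AppEvent.
#[derive(Debug, Clone)]
pub enum AppEvent {
    ChatMessage { text: String },
    JobCompleted { job_id: u64 },
}

impl AppEvent {
    /// Single source of truth for the wire-level "type" tag,
    /// mirroring the #[serde(rename = "...")] value on each variant.
    pub fn event_type(&self) -> &'static str {
        match self {
            AppEvent::ChatMessage { .. } => "chat_message",
            AppEvent::JobCompleted { .. } => "job_completed",
        }
    }
}

fn main() {
    let ev = AppEvent::ChatMessage { text: "hi".into() };
    assert_eq!(ev.event_type(), "chat_message");
}
```

Because the serde renames sit on the variants rather than the enum name, renaming SseEvent to AppEvent leaves these strings — and therefore the wire format — untouched.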

Motivation

SseEvent was defined in src/channels/web/types.rs but imported by agent, orchestrator, worker, tools, extensions, and CLI modules. This coupling would block extracting the web gateway into its own crate. Moving the event type to a neutral shared crate decouples the event protocol from the transport.

This aligns with the event bus direction on refactor/architectural-hardening, where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope with category-based sink routing.

Test plan

  • cargo check — clean compile
  • cargo clippy --all --benches --tests --examples --all-features — zero warnings
  • cargo fmt -- --check — clean
  • cargo test -p ironclaw_common — 11 tests pass (truncate_preview)
  • cargo test -p ironclaw --lib -- channels::web — 127 gateway tests pass
  • cargo test -p ironclaw --lib -- job_monitor — 9 tests pass
  • Full cargo test suite — 3576+ tests pass, 0 failures

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 24, 2026 06:22
Contributor

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@github-actions bot added labels: size: XL (500+ changed lines), scope: agent (agent loop, router, scheduler), scope: channel, scope: channel/cli, scope: channel/web, scope: channel/wasm, scope: tool, scope: tool/builtin, scope: tool/wasm, scope: tool/mcp, scope: db, scope: db/postgres, scope: db/libsql, scope: llm, scope: workspace, scope: orchestrator, scope: worker, scope: config, scope: extensions, scope: setup, scope: ci, scope: docs, scope: dependencies, risk: high, contributor: core; removed label: size: XL (Mar 24, 2026)
@ilblackdragon ilblackdragon changed the base branch from main to staging March 24, 2026 06:23
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on refactoring and enhancing the IronClaw application by introducing a common crate for shared types, improving LLM provider support, enhancing security, and improving the robustness of the system. The changes aim to decouple components, improve security, and provide a more flexible and reliable platform.

Highlights

  • Code Sharing: Introduces a new ironclaw_common crate for sharing code between different parts of the application, reducing code duplication and improving maintainability.
  • Event Handling: Renames SseEvent to AppEvent to better reflect its purpose as an application-wide event protocol, decoupling it from web-specific transport concerns.
  • LLM Provider Support: Adds support for GitHub Copilot and Google Gemini LLM providers, enhancing the system's flexibility and integration capabilities.
  • Security Enhancements: Improves security by preventing prompt injection vulnerabilities through the escaping of tool output and external content, and by validating base URLs.
  • Robustness: Improves the reliability of the system by adding a mechanism to queue messages during agent processing, ensuring no messages are lost during high load.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/e2e.yml
    • .github/workflows/regression-test-check.yml

@ilblackdragon ilblackdragon force-pushed the refactor/extract-app-event-to-ironclaw-common branch from a280bc5 to 3e8afb1 on March 24, 2026 19:30
@ilblackdragon
Member Author

Addressed Copilot review in 3e8afb1: renamed leftover sse_event vars to app_event in orchestrator/api.rs and ws.rs, renamed test functions from test_ws_server_from_sse_* to test_ws_server_from_app_event_*, updated stale SSE comments.

Collaborator

@zmanian zmanian left a comment


Review: refactor: extract AppEvent to crates/ironclaw_common

+500/-477 across 24 files

Issues (ranked by severity)

1. No Deserialize derive on AppEvent (Medium)
The new AppEvent only derives Serialize + Debug + Clone. The old SseEvent also only had Serialize, but now that AppEvent lives in a shared crate intended for reuse by other workspace members, downstream consumers (e.g. test harnesses, CLI tools, external integrations) will likely need to deserialize incoming events. Adding Deserialize now avoids a semver-ish breaking change to the shared crate later.

2. event_type() manually duplicates serde rename values (Medium)
The event_type() match arm strings must stay in sync with the #[serde(rename = "...")] attributes. If a variant is added and the developer forgets to update event_type(), the compiler won't catch it (the match is exhaustive on variants, but the string could be wrong). Consider deriving the event type string from serde metadata or adding a test that round-trips serialization and asserts event_type() matches the "type" field in the JSON output.

3. Stale variable/comment references to "SSE" remain (Low)
Several comments and at least one variable still reference "SSE" — e.g. src/channels/web/server.rs:1203 still says "Broadcast SSE event", and src/worker/job.rs doc comment was only partially updated. Copilot and Gemini already flagged specific instances. Not a functional issue, but undermines the refactor's goal of decoupling from SSE terminology.

4. truncate_preview moved but public API surface unchanged (Low)
The old crate::channels::web::util::truncate_preview is now a re-export of ironclaw_common::truncate_preview. This is clean, but the re-export means two import paths work — both the old path (via pub use) and the new ironclaw_common::truncate_preview. This is fine for backward compat, but consider deprecating the old path to guide callers toward the canonical import.

5. ironclaw_common crate uses edition = "2024" and rust-version = "1.92" (Nit)
Just confirming this is intentional and aligns with the workspace's MSRV. If the workspace targets an older MSRV, this could cause issues for contributors on older toolchains.
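The drift test suggested in issue 2 can be sketched as follows. To keep the snippet self-contained, a hand-written to_json() stands in for serde, and the enum is a two-variant stand-in; with the real serde-derived AppEvent, the serialized "type" field and event_type() are maintained independently and genuinely can diverge:

```rust
// Illustrative stand-in; the real AppEvent uses #[serde(rename)] + serde_json.
#[derive(Debug, Clone, Copy)]
enum AppEvent {
    ChatMessage,
    JobCompleted,
}

impl AppEvent {
    const ALL: [AppEvent; 2] = [AppEvent::ChatMessage, AppEvent::JobCompleted];

    fn event_type(&self) -> &'static str {
        match self {
            AppEvent::ChatMessage => "chat_message",
            AppEvent::JobCompleted => "job_completed",
        }
    }

    /// Stand-in for serde serialization: emits {"type":"..."}.
    fn to_json(&self) -> String {
        format!("{{\"type\":\"{}\"}}", self.event_type())
    }
}

/// The drift test: serialize every variant, parse the JSON "type" field,
/// and assert it matches event_type().
fn event_type_matches_serde_type_field() {
    for ev in AppEvent::ALL {
        let json = ev.to_json();
        let tag = json
            .split("\"type\":\"")
            .nth(1)
            .and_then(|rest| rest.split('"').next())
            .expect("type field present");
        assert_eq!(tag, ev.event_type());
    }
}

fn main() {
    event_type_matches_serde_type_field();
}
```

Iterating an ALL-variants array (or serde round-tripping every variant, as the PR ultimately does) is what catches a newly added variant whose event_type() arm was forgotten.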

What's good

  • Clean mechanical rename with zero semantic changes — the serde wire format is identical (same #[serde(rename)] values), so this is fully backward-compatible on the wire.
  • The event_type() helper consolidates three duplicate match blocks (SSE, WS, types) into one, reducing ~70 lines of duplication.
  • Tests were updated consistently across all 24 files.
  • truncate_preview extraction with comprehensive unit tests in the new crate is solid.
  • publish = false on the new crate prevents accidental crates.io publishing.

Verdict

Approve with suggestions — This is a clean, low-risk refactor. The wire format is unchanged, so there are no breaking changes for clients. The suggestions above (especially adding Deserialize and a round-trip test for event_type()) would strengthen the shared crate for future consumers but are not blockers.

…ments

Address zmanian review:
- Add Deserialize derive to AppEvent so downstream consumers can
  deserialize incoming events
- Add event_type_matches_serde_type_field test that round-trips every
  variant through serde and asserts event_type() matches the serialized
  "type" field — catches drift between serde renames and the manual match
- Add round_trip_deserialize test for basic Serialize/Deserialize parity
- Update remaining "SSE" references in comments across server.rs,
  manager.rs, ws_gateway_integration.rs, and worker/job.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 25, 2026 05:43
@ilblackdragon
Member Author

Addressed zmanian's review in e0cad3b:

  1. Deserialize on AppEvent — Added Deserialize derive so downstream consumers can deserialize events. ✅
  2. Round-trip test for event_type() — event_type_matches_serde_type_field serializes every variant, parses the JSON "type" field, and asserts it matches event_type(). Catches drift between serde renames and the manual match. Also added round_trip_deserialize test. ✅
  3. Stale SSE comments — Updated remaining "SSE" references in server.rs, manager.rs, ws_gateway_integration.rs, and worker/job.rs. ✅
  4. Re-export deprecation — Rust doesn't support #[deprecated] on pub use re-exports. Both callers (session.rs, thread_ops.rs) already import from ironclaw_common directly, so the old path is only used within the web gateway itself. No action needed.
  5. Edition/MSRV — Confirmed aligned with workspace (edition = "2024", rust-version = "1.92"). ✅
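The backward-compat re-export pattern discussed in point 4 can be sketched in miniature — here two in-file modules stand in for the ironclaw_common crate and web/util.rs, and the function body is a placeholder, not the real implementation:

```rust
// Sketch only: `ironclaw_common` and `web_util` are modules standing in for
// the real crate and the gateway's util module; the body is a placeholder.
mod ironclaw_common {
    pub fn truncate_preview(s: &str, max_chars: usize) -> String {
        s.chars().take(max_chars).collect()
    }
}

mod web_util {
    // The old gateway-internal import path keeps compiling via the re-export.
    pub use crate::ironclaw_common::truncate_preview;
}

fn main() {
    // Both paths resolve to the same function.
    assert_eq!(ironclaw_common::truncate_preview("hello", 3), "hel");
    assert_eq!(web_util::truncate_preview("hello", 3), "hel");
}
```

The trade-off noted in the review stands: both paths work, so nothing nudges callers toward the canonical one unless the old path is eventually removed.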

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.



Comment on lines +12 to +20:

```rust
// Walk backwards from max_bytes to find a valid char boundary
let mut end = max_bytes;
while end > 0 && !s.is_char_boundary(end) {
    end -= 1;
}
let mut result = format!("{}...", &s[..end]);

// Re-close <tool_output> if truncation cut through the closing tag.
if s.starts_with("<tool_output") && !result.ends_with("</tool_output>") {
```

Copilot AI Mar 25, 2026


truncate_preview appends a closing </tool_output> tag whenever the input starts with <tool_output and the truncated result doesn’t end with the closing tag. Since the truncated result always ends with ..., this will always append on truncation, and it can still produce malformed XML if the truncation point lands inside the closing tag (leaving a partial </tool_... fragment) or if the string isn’t actually wrapped (closing tag appears earlier / extra trailing content). Consider tightening the condition to only run when the original is actually wrapped (e.g., s.starts_with(..) && s.trim_end().ends_with("</tool_output>")) and, when truncating, ensure end never falls within the final closing tag (clamp end to the start of the closing tag before adding ... and re-appending the full closing tag).

Suggested change (replacing the snippet above):

```rust
// Detect strings that are actually wrapped in a <tool_output>...</tool_output> pair.
let is_wrapped_tool_output = s.starts_with("<tool_output")
    && s.trim_end().ends_with("</tool_output>");
let closing_tag = "</tool_output>";
let closing_start = if is_wrapped_tool_output {
    s.rfind(closing_tag)
} else {
    None
};
// Walk backwards from an initial end position to find a valid char boundary.
// For wrapped <tool_output> strings, avoid truncating inside the closing tag
// by clamping `end` to the start of the final `</tool_output>`.
let mut end = max_bytes;
if let Some(close_start) = closing_start {
    if end > close_start {
        end = close_start;
    }
}
while end > 0 && !s.is_char_boundary(end) {
    end -= 1;
}
let mut result = format!("{}...", &s[..end]);

// Re-close <tool_output> if we truncated a string that was originally wrapped.
if is_wrapped_tool_output {
```

@ilblackdragon
Member Author

Re: Copilot comment on truncate_preview tool_output edge case —

This is pre-existing behavior moved verbatim from src/channels/web/util.rs. The function was not modified in this PR, only relocated. The edge case (string starting with <tool_output but not actually wrapped) doesn't occur in practice — all callers pass actual tool output that's always properly wrapped.

Tightening the logic is a reasonable improvement but belongs in a separate PR to keep this refactor focused on the extraction.
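The char-boundary walk at the heart of the relocated function is worth seeing on multibyte input, since that is the safety property it guarantees. A minimal reimplementation for illustration (the real truncate_preview also appends "..." and handles the <tool_output> re-closing discussed above):

```rust
/// Minimal sketch of the boundary walk only; not the full truncate_preview.
fn truncate_at_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk backwards from max_bytes until we land on a UTF-8 char boundary,
    // so the slice below can never split a multibyte character.
    let mut end = max_bytes;
    while end > 0 && !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // 'é' is 2 bytes in UTF-8, so byte offset 1 falls inside it.
    let s = "événement";
    assert_eq!(truncate_at_boundary(s, 1), ""); // walks back past the split 'é'
    assert_eq!(truncate_at_boundary(s, 2), "é");
    assert_eq!(truncate_at_boundary(s, 3), "év");
}
```

Slicing at a non-boundary offset would panic at runtime, which is why the walk runs before the slice rather than after.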

@ilblackdragon ilblackdragon merged commit 706c3a1 into staging Mar 25, 2026
18 checks passed
@ilblackdragon ilblackdragon deleted the refactor/extract-app-event-to-ironclaw-common branch March 25, 2026 06:02
bkutasi pushed a commit to bkutasi/ironclaw that referenced this pull request Mar 28, 2026
* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+
modules across agent, orchestrator, worker, tools, and extensions — it
had become the application-wide event protocol, not a web transport
concern.

Create crates/ironclaw_common as a shared workspace crate and move the
enum there as AppEvent.  Also move the truncate_preview utility which
was similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening
where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: add AppEvent::event_type() helper, deduplicate match blocks

Address Gemini review: extract the variant→string match into a single
method on AppEvent, replacing the duplicated 22-arm matches in sse.rs
and types.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename leftover sse vars/tests to match AppEvent rename

Address Copilot review: rename sse_event vars to app_event in
orchestrator/api.rs and ws.rs, rename test functions from
test_ws_server_from_sse_* to test_ws_server_from_app_event_*, and
update stale SSE comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: add Deserialize to AppEvent, round-trip test, fix stale comments

Address zmanian review:
- Add Deserialize derive to AppEvent so downstream consumers can
  deserialize incoming events
- Add event_type_matches_serde_type_field test that round-trips every
  variant through serde and asserts event_type() matches the serialized
  "type" field — catches drift between serde renames and the manual match
- Add round_trip_deserialize test for basic Serialize/Deserialize parity
- Update remaining "SSE" references in comments across server.rs,
  manager.rs, ws_gateway_integration.rs, and worker/job.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ilblackdragon added a commit that referenced this pull request Apr 3, 2026
…architecture) (#1557)

* v2 architecture phase 1

* feat(engine): Phase 2 — execution loop, capability system, thread runtime

Add the core execution engine to ironclaw_engine crate:

- CapabilityRegistry: register/get/list capabilities and actions
- LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire)
- PolicyEngine: deterministic effect-level allow/deny/approve
- ThreadTree: parent-child relationship tracking
- ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc
- ThreadManager: spawn threads as tokio tasks, stop, inject messages, join
- ExecutionLoop: core loop replacing run_agentic_loop() with signals,
  context building, LLM calls, action execution, and event recording
- Structured executor (Tier 0): lease lookup → policy check → effect execution
- Tool intent nudge detection
- MemoryStore + RetrievalEngine stubs for Phase 4
- Full 8-phase architecture plan in docs/plans/
- CLAUDE.md spec for the engine crate

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 3 — Monty Python executor with RLM pattern

Add CodeAct execution (Tier 1) using the Monty embedded Python
interpreter, following the Recursive Language Model (RLM) pattern
from arXiv:2512.24601.

Key additions:
- executor/scripting.rs: Monty integration with FunctionCall-based
  tool dispatch, catch_unwind panic safety, resource limits (30s,
  64MB, 1M allocs)
- LlmResponse::Code variant + ExecutionTier::Scripting
- Context-as-variables (RLM 3.4): thread messages, goal, step_number,
  previous_results injected as Python variables — LLM context stays
  lean while code accesses data selectively
- llm_query(prompt, context) (RLM 3.5): recursive subagent calls
  from within Python code — results stored as variables, not injected
  into parent's attention window (symbolic composition)
- Compact output metadata between code steps instead of full stdout
- MontyObject ↔ serde_json::Value bidirectional conversion
- Updated architecture plan with RLM design principles

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): RLM best-practices enhancements from cross-reference analysis

Cross-referenced our implementation against the official RLM (alexzhang13/rlm),
fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation.
Key enhancements:

- FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching
  all three reference implementations. Code can signal completion at any
  point, not just via return value.
- llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn,
  matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch.
- Output truncation increased to 8000 chars (from 120), matching Prime
  Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT].
- Step 0 orientation preamble: auto-injects context metadata (message count,
  total chars, goal, last user message preview) before first code step,
  matching fast-rlm's auto-print pattern.
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors,
  OS errors, and async errors now flow back as stdout content instead of
  terminating the step, enabling LLM self-correction on next iteration.
  Only VM panics (catch_unwind) terminate as EngineError.

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with RLM cross-reference learnings

Comprehensive update after cross-referencing against official RLM
(alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect
(verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM.

Changes:
- Mark Phases 1-3 as DONE with commit refs and test counts
- Add "Key Influences" section documenting all reference implementations
- Phase 3: full table of implemented RLM features with sources
- Phase 3: "Remaining gaps" table with which phase addresses each
- Phase 4: expanded with compaction (85% context), rlm_query() (full
  recursive sub-agent), dual model routing, budget controls (USD,
  timeout, tokens, consecutive errors), lazy loading, pass-by-reference
- Add "RLM Execution Model" cross-cutting section
- Add "Implementation Progress" tracking table
- Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 4 — budget controls, compaction, reflection pipeline

Budget enforcement in ExecutionLoop:
- max_tokens_total: cumulative token limit, checked before each iteration
- max_duration: wall-clock timeout for entire thread
- max_consecutive_errors: consecutive error steps threshold (resets on
  success, matching official RLM behavior)
- All produce ThreadOutcome::Failed with descriptive messages

Context compaction (from RLM paper, 85% threshold):
- estimate_tokens(): char-based estimation (chars/4, matching RLM)
- should_compact(): triggers when tokens >= threshold_pct * context_limit
- compact_messages(): asks LLM to summarize progress, replaces history
  with [system, summary, continuation_note], preserves intermediate results
- Configurable via ThreadConfig: model_context_limit, compaction_threshold

Dual model routing:
- LlmCallConfig gains depth field (0=root, 1+=sub-call)
- Implementations can route to cheaper models for sub-calls
- ExecutionLoop passes thread depth to every LLM call

Reflection pipeline (reflection/pipeline.rs):
- reflect(thread, llm): analyzes completed thread via LLM
- Produces Summary doc (always), Lesson doc (if errors), Issue doc (if failed)
- Builds transcript from thread messages + error events
- Returns ReflectionResult with docs + token usage

ThreadConfig extended with: max_tokens_total, max_consecutive_errors,
model_context_limit, enable_compaction, compaction_threshold, depth, max_depth.

78 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 5 — conversation surface separated from execution

Conversation is now a UI layer, not an execution boundary. Multiple
threads can run concurrently within one conversation; threads can
outlive their originating conversation.

New types (types/conversation.rs):
- ConversationSurface: channel + user + entries + active_threads
- ConversationEntry: sender (User/Agent/System) + content + origin_thread_id
- ConversationId, EntryId (UUID newtypes)
- EntrySender enum (User, Agent{thread_id}, System)

ConversationManager (runtime/conversation.rs):
- get_or_create_conversation(channel, user) — indexed by (channel, user)
- handle_user_message() — injects into active foreground thread or spawns new
- record_thread_outcome() — adds agent/system entries, untracks completed threads
- get_conversation(), list_conversations()

This enables the key architectural insight: a user can ask "what's the
weather?" while a deployment thread is still running. Both produce entries
in the same conversation.

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM

Restructure phases 6-8 to clarify execution model:

- Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker
  Python runtimes for LLM-generated code.
- WASM sandbox is for third-party tool isolation (existing infra, Phase 8)
- Docker containers are for thread-level isolation of high-risk work (Phase 8)
- Two-phase commit moves to Phase 6 (integration) at the adapter boundary

Phase renumbering:
- Old Phase 6 (Tier 2-3) → removed as separate phase
- Old Phase 7 (integration) → Phase 6
- Old Phase 8 (cleanup) → Phase 7
- New Phase 8: WASM tools + Docker thread isolation (infra integration)

Updated progress table: Phases 1-5 marked DONE with test counts and commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 6 — bridge adapters for main crate integration

Strategy C parallel deployment: when ENGINE_V2=true env var is set,
user messages route through the engine instead of the existing agentic
loop. All existing behavior is unchanged when the flag is off.

Bridge module (src/bridge/):
- LlmBridgeAdapter: wraps LlmProvider as engine LlmBackend, converts
  ThreadMessage↔ChatMessage, ActionDef↔ToolDefinition, depth-based
  model routing (primary vs cheap_llm)
- EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as EffectExecutor,
  routes tool calls through existing execute_tool_with_safety pipeline
- InMemoryStore: HashMap-backed Store impl (no DB tables needed yet)
- EngineRouter: is_engine_v2_enabled() + handle_with_engine() that
  builds engine from Agent deps and processes messages end-to-end

Integration touchpoint (4 lines in agent_loop.rs):
  After hook processing, before session resolution, check ENGINE_V2
  flag and route UserInput through the engine path.

Accessor visibility widened: llm(), cheap_llm(), safety(), tools()
changed from pub(super) to pub(crate) for bridge access.

85 engine tests + main crate clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add user message and system prompt to thread before execution

The ExecutionLoop was sending empty messages to the LLM because the
thread was spawned with the user's input as the goal but no messages.

Fixes:
- ThreadManager.spawn_thread() now adds the goal as an initial user
  message before starting the execution loop
- ExecutionLoop.run() injects a default system prompt if none exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): match existing LLM request format to prevent 400 errors

The LLM bridge was missing several defaults that the existing
Reasoning.respond_with_tools() sets:

- tool_choice: "auto" when tools are present (required by some providers)
- max_tokens: 4096 (default)
- temperature: 0.7 (default)
- When no tools (force_text): use plain complete() instead of
  complete_with_tools() with empty tools array — matches existing
  no-tools fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): persist conversation context across messages

The engine was creating a fresh ThreadManager and InMemoryStore per
message, losing all context between turns. A follow-up question like
"what are the latest 10 issues?" had no memory of the prior "how many
issues" response.

Fixes:
- EngineState (ThreadManager, ConversationManager, InMemoryStore) now
  persists across messages via OnceLock, initialized on first use
- ConversationManager builds message history from prior conversation
  entries (user messages + agent responses) and passes it to new threads
- ThreadManager.spawn_thread_with_history() accepts initial_messages
  that are prepended before the current user message
- System notifications (thread started/completed) are filtered out of
  the history (not useful as LLM context)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): enable CodeAct/RLM mode with code block detection

The engine now operates in CodeAct/RLM mode:

System prompt (executor/prompt.rs):
- Instructs LLM to write Python in ```repl fenced blocks
- Documents available tools as callable Python functions
- Documents llm_query(), llm_query_batched(), FINAL()
- Documents context variables (context, goal, step_number, previous_results)
- Strategy guidance: examine context, break into steps, use tools, call FINAL()

Code block detection (bridge/llm_adapter.rs):
- extract_code_block() scans LLM text responses for ```repl or ```python blocks
- When detected, returns LlmResponse::Code instead of LlmResponse::Text
- The ExecutionLoop routes Code responses through Monty for execution

No structured tool definitions sent to LLM:
- Tools are described in the system prompt as Python functions
- The LLM call sends empty actions array, forcing text-mode responses
- This ensures the LLM writes code blocks (CodeAct) instead of
  structured tool calls (which would bypass the REPL)

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add 8 CodeAct/RLM E2E tests with mock LLM

Comprehensive test coverage for the Monty Python execution path:

- codeact_simple_final: Python code calls FINAL('answer') → thread completes
- codeact_tool_call_then_final: code calls test_tool() → FunctionCall
  suspends VM → MockEffects returns result → code resumes → FINAL()
- codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15')
  with no tool calls — pure Python in Monty
- codeact_multi_step: first step prints output (no FINAL), second step
  sees output metadata and calls FINAL — tests iterative REPL flow
- codeact_error_recovery: first step has NameError → error flows to LLM
  as stdout → second step recovers with FINAL — tests error transparency
- codeact_context_variables_available: code accesses `goal` and `context`
  variables injected by the RLM context builder
- codeact_multiple_tool_calls_in_loop: for loop calls test_tool() 3 times
  → 3 FunctionCall suspensions → all results collected → FINAL
- codeact_llm_query_recursive: code calls llm_query('prompt') → VM
  suspends → MockLlm provides sub-agent response → result returned as
  Python string variable

93 tests passing (85 prior + 8 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): detect code blocks in plain completion path + multi-block support

Two bugs fixed:

1. The no-tools completion path (used by CodeAct since we send empty
   actions) returned LlmResponse::Text without checking for code blocks.
   Code blocks were rendered as markdown text instead of being executed.

2. extract_code_block now:
   - Handles bare ``` fences (skips non-Python languages)
   - Collects ALL code blocks in the response and concatenates them
     (models often split code across multiple blocks with explanation)
   - Tries markers in order: ```repl, ```python, ```py, then bare ```
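The logic above can be sketched in plain Rust. This is a simplified illustration of the marker-priority and multi-block concatenation behavior, not the real ironclaw_engine code; the function name and fence handling for pathological sequences are assumptions.

```rust
// Hypothetical sketch: try fence markers in priority order, collect ALL
// matching blocks, concatenate with blank lines. Bare ``` fences with a
// language tag (e.g. ```json) are skipped.
fn extract_code_blocks(text: &str) -> Option<String> {
    for marker in ["```repl", "```python", "```py", "```"] {
        let mut blocks = Vec::new();
        let mut rest = text;
        while let Some(start) = rest.find(marker) {
            let after = &rest[start + marker.len()..];
            // A bare fence must not actually be a longer language tag.
            if marker == "```" {
                let lang: String =
                    after.chars().take_while(|c| !c.is_whitespace()).collect();
                if !lang.is_empty() {
                    rest = &after[lang.len()..];
                    continue;
                }
            }
            // Require a closing fence; unclosed blocks are ignored.
            let Some(end) = after.find("```") else { break };
            let body = after[..end].trim();
            if !body.is_empty() {
                blocks.push(body.to_string());
            }
            rest = &after[end + 3..];
        }
        if !blocks.is_empty() {
            // Models often split code across blocks with explanation between;
            // concatenate so the whole program executes as one step.
            return Some(blocks.join("\n\n"));
        }
    }
    None
}
```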

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(bridge): add 11 regression tests for code block extraction

Covers the exact failure modes discovered during live testing:

- extract_repl_block: standard ```repl fenced block
- extract_python_block: ```python marker
- extract_py_block: ```py shorthand
- extract_bare_backtick_block: bare ``` with Python content
- skip_non_python_language: ```json should NOT be extracted
- no_code_blocks_returns_none: plain text, no fences
- multiple_code_blocks_concatenated: two ```repl blocks with
  explanation between them → concatenated with \n\n
- mixed_thinking_and_code: model outputs explanation + two
  ```python blocks (the Hyperliquid case) → both extracted
- repl_preferred_over_bare: ```repl takes priority over bare ```
- empty_code_block_skipped: empty fenced block returns None
- unclosed_block_returns_none: no closing ``` returns None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): detect FINAL() in text responses + regression tests

Models sometimes write FINAL() outside code blocks — as plain text
after an explanation. The Hyperliquid case: model outputs a long
analysis then FINAL("""...""") at the end, not inside ```repl fences.

Fixes:
- extract_final_from_text(): regex-based FINAL detection in text
  responses, matching the official RLM's find_final_answer() fallback
- Handles: double-quoted, single-quoted, triple-quoted, unquoted,
  nested parens
- Checked in LlmResponse::Text handler BEFORE tool intent nudge
  (FINAL takes priority)
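A minimal sketch of the text-mode FINAL() fallback. The real fix is regex-based; this std-only version with the same name is an illustration of the quote and nested-paren handling, not the actual implementation.

```rust
// Hypothetical sketch: find FINAL( in plain text, then try triple-quoted,
// double-quoted, single-quoted, and finally unquoted (paren-balanced) args.
fn extract_final_from_text(text: &str) -> Option<String> {
    let start = text.find("FINAL(")?;
    let inner = &text[start + "FINAL(".len()..];
    // Triple quotes first so embedded quotes survive.
    for quote in ["\"\"\"", "'''", "\"", "'"] {
        if let Some(body) = inner.strip_prefix(quote) {
            let end = body.find(quote)?;
            return Some(body[..end].to_string());
        }
    }
    // Unquoted argument: scan to the matching closing paren.
    let mut depth = 1;
    for (i, c) in inner.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    return Some(inner[..i].trim().to_string());
                }
            }
            _ => {}
        }
    }
    None
}
```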

9 new tests:
- codeact_final_in_text_response: FINAL("answer") in plain text
- codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text
- final_double_quoted, final_single_quoted, final_triple_quoted,
  final_unquoted, final_with_nested_parens, final_after_long_text,
  no_final_returns_none

102 tests passing (93 + 9 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add crate extraction & cleanup roadmap

Documents architectural recommendations from the engine v2 design
process for future reference:

- Root directory consolidation (channels-src + tools-src → extensions/)
- Crate extraction tiers: zero-coupling (estimation, observability,
  tunnel), trivial-coupling (document_extraction, pairing, hooks),
  medium-coupling (secrets, MCP, db, workspace, llm, skills),
  heavy-coupling (web gateway, agent, extensions)
- src/ module reorganization into logical groups (core, persistence,
  infra, media, support)
- main.rs/app.rs slimming targets (100/500 lines after migration)
- WASM module candidates (document_extraction) and non-candidates
  (REPL, web gateway → separate crates instead)
- Priority ordering for extraction work
- Tracks completed items (ironclaw_safety, ironclaw_engine,
  transcription move)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): live progress status updates via event broadcast

Engine v2 now shows live progress in the CLI (and any channel):
- "Thinking..." when a step starts
- Tool name + success/error when actions execute
- "Processing results..." when a step completes

Implementation:
- ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256)
- ExecutionLoop.emit_event() writes to thread.events AND broadcasts
- ThreadManager.subscribe_events() returns a receiver
- Router uses tokio::select! to listen for events while waiting for
  thread completion, forwarding them as StatusUpdate to the channel

This replaces the polling approach with zero-latency event streaming.
Agent.channels visibility widened to pub(crate) for bridge access.
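The subscribe/emit shape can be sketched with std primitives. The real code uses tokio::sync::broadcast with capacity 256; this Mutex-plus-mpsc analogue only illustrates the fan-out pattern, and the names are invented.

```rust
use std::sync::{mpsc, Mutex};

// Simplified stand-in for ThreadEvent.
#[derive(Clone, Debug, PartialEq)]
struct ThreadEvent(String);

// Minimal std-only analogue of the broadcast wiring.
struct EventHub {
    subscribers: Mutex<Vec<mpsc::Sender<ThreadEvent>>>,
}

impl EventHub {
    fn new() -> Self {
        Self { subscribers: Mutex::new(Vec::new()) }
    }

    // Analogue of ThreadManager.subscribe_events(): hand out a receiver.
    fn subscribe(&self) -> mpsc::Receiver<ThreadEvent> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.lock().unwrap().push(tx);
        rx
    }

    // Analogue of ExecutionLoop.emit_event(): fan out to every subscriber,
    // dropping senders whose receivers have gone away.
    fn emit(&self, event: ThreadEvent) {
        self.subscribers
            .lock()
            .unwrap()
            .retain(|tx| tx.send(event.clone()).is_ok());
    }
}
```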

102 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): include tool results in code step output for LLM context

The LLM was ignoring tool results and answering from training data
because the compact output metadata didn't include what tools returned.
Tool results lived only as ActionResult messages (role: Tool), which
some providers flatten and some models ignore.

Now the code step output includes:
- stdout from Python print() statements
- [tool_name result] with the actual output (truncated to 4K per tool)
- [tool_name error] for failed tools
- [return] for the code's return value
- Total output truncated to 8K chars to prevent context bloat
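The assembly can be sketched as follows. Function name and exact formatting are assumptions; the 4K per-tool cap and 8K total (kept from the tail, char-boundary-safe) match the commit, and the [return] line is omitted for brevity.

```rust
// Hypothetical sketch of the compact step-output builder.
fn build_step_output(
    stdout: &str,
    tool_results: &[(String, Result<String, String>)],
) -> String {
    let mut out = String::new();
    if !stdout.is_empty() {
        out.push_str(stdout);
        out.push('\n');
    }
    for (name, result) in tool_results {
        match result {
            // Per-tool output capped at 4K chars so one tool can't dominate.
            Ok(v) => out.push_str(&format!(
                "[{name} result] {}\n",
                v.chars().take(4000).collect::<String>()
            )),
            Err(e) => out.push_str(&format!("[{name} error] {e}\n")),
        }
    }
    // Keep the last 8K chars to prevent context bloat.
    let total = out.chars().count();
    out.chars().skip(total.saturating_sub(8000)).collect()
}
```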

This ensures the model sees web_search results, API responses, etc.
in the next iteration and can reason about them instead of hallucinating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add debug/trace logging for CodeAct execution

Three verbosity levels for debugging the engine:

RUST_LOG=ironclaw_engine=debug:
- LLM call: message count, iteration, force_text
- LLM response: type (text/code/action_calls), token usage
- Code execution: code length, action count, had_error, final_answer
- Text response: length, FINAL() detection

RUST_LOG=ironclaw_engine=trace:
- Full message list sent to LLM (role, length, first 200 chars each)
- Full code block being executed
- stdout preview (first 500 chars)
- Per-tool results (name, success, first 300 chars of output)
- Text response preview (first 500 chars)

Usage:
  ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run
  ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): execution trace recording + retrospective analysis

Enable with ENGINE_V2_TRACE=1 to get full execution traces and
automatic issue detection after each thread completes.

Trace recording (executor/trace.rs):
- build_trace(): captures full thread state — messages (with full
  content), events, step count, token usage, detected issues
- write_trace(): writes JSON to engine_trace_{timestamp}.json
- log_trace_summary(): logs summary + issues at info/warn level

Retrospective analyzer detects 8 issue categories:
- thread_failure: thread ended in Failed state
- no_response: no assistant message generated
- tool_error: specific tool failures with error details
- code_error: Python errors (NameError, SyntaxError, etc.) in output
- missing_tool_output: tool results exist but not in system messages
- excessive_steps: >10 steps (may be stuck in a loop)
- no_tools_used: single-step answer without tools (hallucination risk)
- mixed_mode: text responses without code blocks (prompt not followed)
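A reduced sketch of the analyzer shape, covering four of the eight checks. The ThreadStats struct is invented for illustration; the real analyzer walks the full thread state.

```rust
// Hypothetical reduction of the retrospective analyzer.
struct ThreadStats {
    failed: bool,
    assistant_messages: usize,
    steps: usize,
    tool_calls: usize,
}

fn detect_issues(t: &ThreadStats) -> Vec<&'static str> {
    let mut issues = Vec::new();
    if t.failed {
        issues.push("thread_failure");
    }
    if t.assistant_messages == 0 {
        issues.push("no_response");
    }
    if t.steps > 10 {
        issues.push("excessive_steps"); // may be stuck in a loop
    }
    if t.steps == 1 && t.tool_calls == 0 {
        issues.push("no_tools_used"); // hallucination risk
    }
    issues
}
```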

Thread state now saved to store after execution completes (for trace
access after join_thread).

Usage:
  ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run
  # After each message: trace JSON + issue log in terminal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): wire reflection pipeline + trace analysis into thread lifecycle

After every thread completes, ThreadManager now automatically runs:

1. Retrospective trace analysis (non-LLM, always):
   - Detects 8 issue categories (tool errors, code errors, missing
     outputs, excessive steps, hallucination risk, etc.)
   - Logs issues at warn level when found

2. Trace file recording (when ENGINE_V2_TRACE=1):
   - Writes full JSON trace to engine_trace_{timestamp}.json

3. LLM reflection (when enable_reflection=true):
   - Calls reflection pipeline to produce Summary, Lesson, Issue docs
   - Saves docs to store for future context retrieval
   - Enabled by default in the bridge router

All three run inside the spawned tokio task after exec.run() completes,
before saving the final thread state. No external wiring needed.

Removed duplicate trace recording from the router — it's now handled
by ThreadManager automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): convert tool name hyphens to underscores for Python compatibility

Root cause from trace analysis: the LLM writes `web_search()` (valid
Python identifier) but the tool registry has `web-search` (with hyphen).
The EffectBridgeAdapter couldn't find the tool → "Tool not found" error
→ model fabricated fake data instead.

Fixes:
- available_actions(): converts tool names from hyphens to underscores
  (web-search → web_search) so the system prompt lists valid Python names
- execute_action(): tries the original name first, then falls back to
  hyphenated form (web_search → web-search) for tool registry lookup
- Same conversion in router's capability registry builder
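The two-direction mapping is small enough to sketch directly; names are illustrative, not the actual EffectBridgeAdapter API.

```rust
use std::collections::HashMap;

// System prompt side: registry names become valid Python identifiers.
fn python_name(tool: &str) -> String {
    tool.replace('-', "_")
}

// Execution side: try the name as written, then fall back to the
// hyphenated form for registry lookup (web_search -> web-search).
fn lookup<'a>(registry: &'a HashMap<String, &'a str>, requested: &str) -> Option<&'a str> {
    registry
        .get(requested)
        .or_else(|| registry.get(&requested.replace('_', "-")))
        .copied()
}
```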

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): parse JSON tool output to prevent double-serialization

From trace analysis: web_search returned a JSON string, which was
wrapped as serde_json::json!(string) creating a Value::String containing
JSON. When Monty got this as MontyObject::String, the Python code
couldn't index it with result['title'] → TypeError.

Fix: try parsing the tool output string as JSON first. If valid, use the
parsed Value (becomes a Python dict/list). If not valid JSON, keep as
string. This means web_search results are directly indexable in Python:
  results = web_search(query="...")
  print(results["results"][0]["title"])  # works now

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): persist variables across code steps via `state` dict

Monty creates a fresh runtime per code step, so variables are lost
between steps. This caused the model to re-paste tool results from
system messages, wasting tokens.

Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that
accumulates across steps:
- Tool results stored by tool name: state["web_search"] = {results...}
- Return values stored: state["last_return"], state["step_0_return"]
- Injected as a `state` Python variable in each new MontyRun

Now the model can do:
  Step 1: results = web_search(query="...")  # tool result saved in state
  Step 2: data = state["web_search"]         # access previous result
          summary = llm_query("summarize", str(data))
          FINAL(summary)

System prompt updated to document the `state` variable.
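The accumulation can be modeled like this. The real persisted_state holds serde_json::Value; plain strings stand in for tool payloads here, and the type name is an assumption.

```rust
use std::collections::HashMap;

// Simplified model of the cross-step `state` dict.
#[derive(Default)]
struct PersistedState {
    entries: HashMap<String, String>,
}

impl PersistedState {
    // After a step: tool results stored by tool name.
    fn record_tool(&mut self, tool: &str, result: &str) {
        self.entries.insert(tool.to_string(), result.to_string());
    }

    // Return values stored under both step-specific and last_return keys.
    fn record_return(&mut self, step: usize, value: &str) {
        self.entries.insert(format!("step_{step}_return"), value.to_string());
        self.entries.insert("last_return".to_string(), value.to_string());
    }

    // Keys visible to the next MontyRun's injected `state` variable.
    fn keys(&self) -> Vec<&str> {
        self.entries.keys().map(String::as_str).collect()
    }
}
```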

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add state hint on code errors + retrieval engine integration

When code fails with NameError/UnboundLocalError (model trying to
access variables from a previous step), the error output now includes:

  [HINT] Variables don't persist between code blocks. Use the `state`
  dict to access data from previous steps. Available keys: ["web_search",
  "last_return"]

This teaches the model to use `state["web_search"]` instead of `result`
after a NameError, reducing wasted steps from 3-4 to 1.
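The hint logic is essentially a string check plus an append; this sketch uses an invented function name and the hint wording from above.

```rust
// Append the state hint when the failure looks like the model reaching
// for a variable defined in a previous code block.
fn augment_error(error: &str, state_keys: &[&str]) -> String {
    if error.contains("NameError") || error.contains("UnboundLocalError") {
        format!(
            "{error}\n[HINT] Variables don't persist between code blocks. \
             Use the `state` dict to access data from previous steps. \
             Available keys: {state_keys:?}"
        )
    } else {
        error.to_string()
    }
}
```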

Also integrates RetrievalEngine into context building and ThreadManager:
- build_step_context() now accepts optional RetrievalEngine to inject
  relevant memory docs (Lessons, Specs, Playbooks) into LLM context
- RetrievalEngine uses keyword matching with doc-type priority scoring
- Memory docs from reflection (Phase 4) now feed back into future threads
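The keyword-plus-priority scoring might look like this; the weights and doc-type names are invented for illustration, not the RetrievalEngine's actual values.

```rust
// Keyword-overlap count scaled by a doc-type priority weight.
fn score_doc(doc_type: &str, doc_text: &str, query: &str) -> usize {
    let type_weight = match doc_type {
        "lesson" => 3,
        "spec" | "playbook" => 2,
        _ => 1,
    };
    let haystack = doc_text.to_lowercase();
    let overlap = query
        .split_whitespace()
        .filter(|w| haystack.contains(&w.to_lowercase()))
        .count();
    overlap * type_weight
}
```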

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove trace files and add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace web_fetch example with web_search in CodeAct prompt

The system prompt example used web_fetch(url="...") which doesn't exist
as a tool. The model learned from the example and tried web_fetch,
getting "Tool not found". Changed to web_search(query="...") which is
an actual registered tool.

Found via trace analysis — reflection pipeline correctly identified
this as a "Tool Name Correction" spec doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(engine): extract prompt templates to markdown files

Prompt templates moved from inline Rust strings to plain markdown files
at crates/ironclaw_engine/prompts/ for easy inspection and iteration:

- prompts/codeact_preamble.md — main instructions, special functions,
  context variables, rules
- prompts/codeact_postamble.md — strategy section

Loaded at compile time via include_str!(), so no runtime file I/O.
Edit the .md files and rebuild to iterate on prompts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace byte-index slicing with char-safe truncation

Panic: 'byte index 80 is not a char boundary; it is inside ''' when
tool output contained multi-byte UTF-8 characters (smart quotes from
web search results).

Fixed 4 unsafe byte-index slices:
- thread.rs:281: message preview &content[..80] → chars().take(80)
- loop_engine.rs:556: tool output &str[..4000] → chars().take(4000)
- loop_engine.rs:579: output tail &str[len-8000..] → chars().skip()
- scripting.rs:82: stdout tail &str[len-N..] → chars().skip()

All now use .chars().take() or .chars().skip() which respect character
boundaries. Follows CLAUDE.md rule: "Never use byte-index slicing on
user-supplied or external strings."
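The safe pattern in isolation (helper names are illustrative): byte slicing like &s[..80] panics mid-codepoint, while char iteration always lands on boundaries.

```rust
// Char-boundary-safe head truncation: replaces &s[..n].
fn head(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

// Char-boundary-safe tail truncation: replaces &s[s.len() - n..].
fn tail(s: &str, n: usize) -> String {
    let len = s.chars().count();
    s.chars().skip(len.saturating_sub(n)).collect()
}
```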

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): fix false positive missing_tool_output warning in trace analyzer

The check was looking for "[" + "result]" in System-role messages only,
but tool output metadata is added with patterns like "[shell result]"
and may appear in messages with any role. Changed to scan all messages
for " result]" or " error]" patterns regardless of role.
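The corrected check, sketched with messages as (role, content) pairs for brevity:

```rust
// Scan every message for tool-output markers, regardless of role.
fn has_tool_output(messages: &[(&str, &str)]) -> bool {
    messages
        .iter()
        .any(|(_role, content)| content.contains(" result]") || content.contains(" error]"))
}
```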

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with Phase 6 status and approval flow design

Phase 6 updated to reflect what was actually built:
- Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done
- Integration touchpoint (4 lines in handle_message) — done
- Live progress via broadcast events — done
- Conversation persistence across messages — done
- Trace recording + retrospective analysis — done
- 8 bugs found and fixed via trace analysis — documented

Phase 6 remaining work documented:
- Approval flow: detailed 5-step design (send to channel, pause thread,
  route response, resume execution, always handling) with v1 reference
- Database persistence (InMemoryStore → real DB tables)
- Acceptance testing (TestRig + TraceLlm fixtures)
- Two-phase commit for high-stakes effects

Progress table updated: Phase 6 marked as DONE (partial), 134 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add self-improving engine design plan

Designs a system where the engine debugs and improves itself, based on
the pattern observed in the last session: 5 consecutive bug fixes all
followed trace → read → identify → edit → test, using tools the engine
already has access to.

Three levels of self-improvement:
- Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply.
- Level 2 (Config): adjust defaults/mappings. Branch + test + PR.
- Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR.

Architecture: Self-improvement Mission spawns a Reflection thread that
reads traces, reads source, proposes fixes, validates via cargo test,
and either auto-applies (Level 1) or creates a PR (Level 2-3).

Includes: fix pattern database (seeded from our 8 debugging session
fixes), feedback loop diagram, safety model, implementation phases
(A through D), and what exists vs what's new.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 security model and audit

Comprehensive security analysis of engine v2 covering:

Threat model: 4 attacker profiles (malicious input, prompt injection
via tools, poisoned memory, supply chain).

Current state audit: 9 controls working (Monty sandbox, safety layer,
policy engine, leases, provenance, events) and 9 gaps identified.

Critical finding: ALL tools granted by default — CodeAct code can call
shell, write_file, apply_patch without approval. Proposed fix: 3-tier
tool classification (auto/approve-once/always-approve).

CodeAct-specific threats: tool call amplification, prompt injection via
search results, data exfiltration via tool chains, Monty escape.

Self-improvement security: poisoned trace attacks, memory poisoning via
reflection. Mitigations: edit validation, frequency caps, audit trail,
auto-rollback, reflection output scanning.

6-layer security architecture proposed: input validation, capability
gating, output sanitization, execution sandboxing, self-improvement
controls, observability.

Prioritized implementation plan with severity/effort ratings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(security): cross-reference v1 controls — use, don't reinvent

Updated security plan with detailed audit of ALL existing v1 security
controls and how they map to engine v2 bridge gaps:

Key finding: v1 already has solutions for every security gap identified.
The bridge just needs to wire them in:

- Tool::requires_approval() exists but bridge doesn't call it
- safety.wrap_for_llm() exists but tool results enter context unwrapped
- RateLimiter exists but bridge doesn't check rate limits
- BeforeToolCall hooks exist but bridge doesn't run them
- redact_params() exists but bridge doesn't redact sensitive params
- Shell risk classification (Low/Medium/High) is inherited but ignored

Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter,
not new security infrastructure. The bridge is the security boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy

- Add Mission type and MissionManager for recurring thread scheduling
- Add ReliabilityTracker for per-capability success/failure/latency tracking
- Add reflection executor that spawns CodeAct threads for post-completion reflection
- Extend PolicyEngine with provenance-aware taint checking (LLM-generated data
  requires approval for financial/external-write effects)
- Extend Store trait with mission CRUD methods
- Add conversation surface tracking, compaction token fix, context memory injection
- Wire new modules through lib.rs re-exports and bridge adapters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire v1 security controls into engine v2 adapter

Zero engine crate changes. All security controls enforced at the bridge
boundary in EffectBridgeAdapter:

1. Tool approval (v1: Tool::requires_approval):
   - Checks each tool's approval requirement with actual params
   - Always → returns EngineError::LeaseDenied (blocks execution)
   - UnlessAutoApproved → checks auto_approved set, blocks if not approved
   - Never → proceeds
   - Per-session auto_approved HashSet (for future "always" handling)

2. Hook interception (v1: BeforeToolCall):
   - Runs HookEvent::ToolCall before every execution
   - HookOutcome::Reject → blocks with reason
   - HookError::Rejected → blocks with reason
   - Hook errors → fail-open (logged, execution continues)

3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm):
   - Leak detection: API keys in tool output are redacted
   - Policy enforcement: content policy rules applied
   - Length truncation: output capped at 100KB
   - XML boundary protection: prevents injection via tool output

4. Sensitive param redaction (v1: redact_params):
   - Tool's sensitive_params() consulted before hooks see parameters
   - Redacted params sent to hooks, original params used for execution

5. available_actions() now sets requires_approval based on each tool's
   default approval requirement, so the engine's PolicyEngine can
   gate tools it hasn't seen before.

6. Actual execution timing measured via Instant::now() (replaces
   placeholder Duration::from_millis(1)).

Accessor visibility: hooks() widened to pub(crate).
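The approval gate (item 1) reduces to a small match; the enum variant names mirror v1's ApprovalRequirement, but the function shape and error strings are illustrative, not the bridge's actual signatures.

```rust
use std::collections::HashSet;

enum ApprovalRequirement {
    Always,
    UnlessAutoApproved,
    Never,
}

// Sketch of the bridge-side gate; an Err here corresponds to the
// EngineError::LeaseDenied that blocks execution.
fn check_approval(
    tool: &str,
    requirement: ApprovalRequirement,
    auto_approved: &HashSet<String>,
) -> Result<(), String> {
    match requirement {
        // Always -> blocks unconditionally.
        ApprovalRequirement::Always => Err(format!("tool '{tool}' requires approval")),
        // UnlessAutoApproved -> blocks only if not in the per-session set.
        ApprovalRequirement::UnlessAutoApproved if !auto_approved.contains(tool) => {
            Err(format!("tool '{tool}' requires approval"))
        }
        // Never, or already auto-approved -> proceed.
        _ => Ok(()),
    }
}
```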

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): implement tool approval flow for engine v2

Adds a complete approval flow that mirrors v1 behavior, using the
existing v1 security controls (Tool::requires_approval, auto-approve
sets, StatusUpdate::ApprovalNeeded).

## How it works

### Step 1: Tool blocked at execution
When the LLM's code calls a tool (e.g., `shell("ls")`):
1. EffectBridgeAdapter.execute_action() looks up the Tool object
2. Calls tool.requires_approval(&params) — returns ApprovalRequirement
3. If Always → EngineError::LeaseDenied (always blocks)
4. If UnlessAutoApproved → checks auto_approved HashSet → if not in set,
   returns EngineError::LeaseDenied
5. If Never → proceeds to execution

### Step 2: Engine returns NeedApproval
The LeaseDenied error propagates through:
- CodeAct path: becomes Python RuntimeError, code halts, thread returns
  NeedApproval with action_name + parameters
- Structured path: same via ActionResult.is_error

### Step 3: Router stores pending approval
- PendingApproval { action_name, original_content } stored on EngineState
- StatusUpdate::ApprovalNeeded sent to channel (shows approval card in
  CLI/web with tool name, parameters, yes/always/no buttons)
- Returns text: "Tool 'shell' requires approval. Reply yes/always/no."

### Step 4: User responds
handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2:
- 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes
  original message (tool now passes the approval check on second run)
- 'always' → same + logs for session persistence
- 'no' → returns "Denied: tool was not executed."

### Key design choice
Instead of pausing/resuming mid-execution (which needs engine changes
to freeze/restore the Monty VM state), we auto-approve the tool and
re-run the full message. The EffectBridgeAdapter's auto_approved set
persists across runs, so the second execution passes immediately.

This trades one extra LLM call for zero engine modifications.

## Files changed
- src/bridge/router.rs: PendingApproval struct, handle_approval(),
  NeedApproval → StatusUpdate::ApprovalNeeded conversion
- src/bridge/mod.rs: export handle_approval
- src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2
- src/bridge/effect_adapter.rs: fmt fixes

151 tests passing, clippy + fmt clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): demote trace/reflection logging from info to debug

INFO-level log output from background tasks (trace analysis, reflection)
corrupts the REPL terminal UI. The trace summary, issue warnings, and
reflection doc previews were printing mid-approval-card, breaking the
interactive display.

Fix: all info! logging in trace.rs demoted to debug!; warn! calls remain
for detected issues. Trace analysis and reflection results now only show
when RUST_LOG=ironclaw_engine=debug is set.

Also added logging discipline rule to global CLAUDE.md:
- info! → user-facing status the REPL intentionally renders
- debug! → internal diagnostics (traces, reflection, engine internals)
- Background tasks must NEVER use info! — it breaks the TUI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): demote all router info! logging to debug!

"engine v2: initializing" and "engine v2: handling message" were
printing at INFO level, corrupting the REPL UI. All router logging
now uses debug! — only visible with RUST_LOG=ironclaw=debug.

Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(safety): demote leak detector warn-action logs from warn! to debug!

The leak detector's Warn-action matches (high_entropy_hex pattern on
web search results containing commit SHAs, CSS colors, URL hashes)
were logging at warn! level, corrupting the REPL UI with lines like:
  WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5

These are informational false positives — real leaks use LeakAction::Redact
which silently modifies the content. Warn-action matches only log for
debugging purposes and should not appear in production output.

Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): strengthen CodeAct prompt to prevent shallow text answers

The model was answering "Suggested 45 improvements" as a brief text
summary from training data without actually searching or listing them.
The trace showed: no code block, no tool calls, no FINAL().

Prompt changes:
- Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with
  plain text only." (was: "Always write code... plain text for brief
  explanations")
- Rule 2 (NEW): "NEVER answer from memory or training data alone.
  Always use tools to get real, current information before answering."
- Rule 3: FINAL answer "should be detailed and complete — not just a
  summary like 'found 45 items'"
- Rule 8 (NEW): "Include the actual content in your FINAL() answer,
  not just a count or summary. Users want to see the details."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): persist reflection docs to workspace for cross-session learning

Replaces InMemoryStore with HybridStore:
- Ephemeral data (threads, steps, events, leases) stays in-memory
- MemoryDocs (lessons, specs, playbooks from reflection) persist to
  the workspace at engine/docs/{type}/{id}.json

On engine init, load_docs_from_workspace() reads existing docs back
into the in-memory cache. This means:
- Lessons learned in session 1 are available in session 2
- The RetrievalEngine injects relevant past lessons into new threads
- The engine genuinely improves over time as reflection accumulates

Workspace paths:
  engine/docs/lessons/{uuid}.json
  engine/docs/specs/{uuid}.json
  engine/docs/playbooks/{uuid}.json
  engine/docs/summaries/{uuid}.json
  engine/docs/issues/{uuid}.json

No new database tables. Uses existing workspace write/read/list.
workspace() accessor widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): adapt to execute_tool_with_safety params-by-value change

Staging merge changed execute_tool_with_safety to take params by value
instead of by reference (perf optimization from PR #926). Updated
bridge adapter to clone params before passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): add web gateway integration plan to Phase 6

Documents three gaps between engine v2 and the web gateway:
1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent)
2. No conversation persistence (engine uses HybridStore, gateway reads v1 DB)
3. No cross-channel visibility (REPL ↔ web messages invisible to each other)

Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1
conversation tables after thread completion. Prerequisite: AppEvent
extraction PR (in progress separately).

Also updated DB persistence status: HybridStore with workspace-backed
MemoryDocs is now implemented (partial persistence).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): document routine/job gap and SIGKILL crash scenario

Routines are entirely v1 — not hooked up to engine v2. When a user
asks "create a routine" as natural language, engine v2 tries to call
routine_create via CodeAct, but the tool needs RoutineEngine + Database
refs that the bridge's minimal JobContext doesn't provide. This caused
a SIGKILL crash during testing.

Options documented: block routine tools in v2 (short term), pass refs
through context (medium), replace with Mission system (long term).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+
modules across agent, orchestrator, worker, tools, and extensions — it
had become the application-wide event protocol, not a web transport
concern.

Create crates/ironclaw_common as a shared workspace crate and move the
enum there as AppEvent. Also move the truncate_preview utility, which
was similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening
where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB

Three changes to make engine v2 visible in the web gateway:

1. SSE event streaming (AppEvent broadcast):
   - ThreadEvent → AppEvent conversion via thread_event_to_app_event()
   - Events broadcast to SseManager during the poll loop
   - Covers: Thinking, ToolCompleted (success/error), Status, Response
   - Web gateway receives real-time progress without any gateway changes

2. Conversation persistence to v1 database:
   - After thread completes, writes user message + agent response to
     v1 ConversationStore via add_conversation_message()
   - Uses get_or_create_assistant_conversation() for per-user per-channel
   - Web gateway reads from DB as usual — chat history appears

3. Final response broadcast:
   - AppEvent::Response with full text + thread_id sent via SSE
   - Web gateway renders the response in the chat UI

New EngineState fields: sse (Option<Arc<SseManager>>),
db (Option<Arc<dyn Database>>). Both populated from Agent.deps.

Agent.deps visibility widened to pub(crate).

Depends on: ironclaw_common crate with AppEvent type (PR #1615).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits

Three security/stability improvements in EffectBridgeAdapter:

1. V1-only tool blocking:
   - routine_create, create_job, build_software (and hyphenated variants)
     return helpful error: "use the slash command instead"
   - Filtered out of available_actions() so system prompt doesn't list them
   - Prevents crash from tools needing RoutineEngine/Scheduler refs

2. Per-step tool call limit:
   - Max 50 tool calls per code block (AtomicU32 counter)
   - Prevents amplification: `for i in range(10000): shell(...)`
   - Returns "call limit reached, break into multiple steps"

3. Rate limiting:
   - Per-user per-tool sliding window via RateLimiter
   - Checks tool.rate_limit_config() before every execution
   - Returns "rate limited, try again in Ns"
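The per-step call limit (item 2) is a one-liner around an atomic counter; the limit of 50 matches the commit, the function name is an assumption.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const MAX_CALLS_PER_STEP: u32 = 50;

// Each tool call increments the per-step counter; the 51st call in one
// code block is rejected, stopping `for i in range(10000): shell(...)`.
fn try_acquire_call(counter: &AtomicU32) -> Result<(), &'static str> {
    if counter.fetch_add(1, Ordering::SeqCst) >= MAX_CALLS_PER_STEP {
        Err("call limit reached, break into multiple steps")
    } else {
        Ok(())
    }
}
```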

Architecture plan updated:
- Gateway integration: DONE
- Routines: BLOCKED (gracefully, with slash command fallback)
- Rate limiting: DONE
- Call limit: DONE
- Phase 6 status: DONE (remaining: acceptance tests, two-phase commit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Mission system design — goal-oriented autonomous threads

Missions replace routines with evolving, knowledge-accumulating
autonomous agents. Unlike routines (fixed prompt, stateless), Missions:

- Generate prompts from accumulated Project knowledge (lessons,
  playbooks, issues from prior threads)
- Adapt approach when something fails repeatedly
- Track progress toward a goal with success criteria
- Self-manage: pause when stuck, complete when goal achieved

Architecture: MissionManager with cron ticker spawns threads via
ThreadManager. Meta-prompt built from mission goal + Project MemoryDocs
via RetrievalEngine. Reflection feeds back automatically.

6-step implementation plan: cron trigger, meta-prompt builder, bridge
wiring, CodeAct tools, progress tracking, persistence.

Includes two worked examples: daily tech news briefing (ongoing) and
test coverage improvement (goal-driven, self-completing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): extend Mission types with webhook/event triggers + evolving strategy

Mission types updated to support external activation sources:

MissionCadence expanded:
- Cron { expression, timezone } — timezone-aware scheduling
- OnEvent { event_pattern } — channel message pattern matching
- OnSystemEvent { source, event_type } — structured events from tools
- Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.)
- Manual — explicit triggering only

The engine defines trigger TYPES. The bridge implements infrastructure
(cron ticker, webhook endpoints, event matchers). GitHub issues, PRs,
email, Slack events all use the generic Webhook cadence — no
special-casing in the engine. Webhook payload injected as
state["trigger_payload"] in the thread's Python context.

Mission struct extended:
- current_focus: what the next thread should work on (evolving)
- approach_history: what we've tried (for adaptation)
- max_threads_per_day / threads_today: daily budget
- last_trigger_payload: webhook/event data for thread context

Plan updated with trigger type table and webhook integration design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): implement MissionManager execution with meta-prompts

The MissionManager now builds evolving meta-prompts and processes
thread outcomes for continuous learning:

fire_mission() upgraded:
- Loads Project MemoryDocs via RetrievalEngine for context
- Builds meta-prompt from: goal, current_focus, approach_history,
  project knowledge docs, trigger payload, thread count
- Spawns thread with meta-prompt as user message
- Background task waits for completion and processes outcome
- Daily thread budget enforcement (max_threads_per_day)

Meta-prompt structure:
  # Mission: {name}
  Goal: {goal}
  ## Current Focus (evolves between threads)
  ## Previous Approaches (what we've tried)
  ## Knowledge from Prior Threads (lessons, playbooks, issues)
  ## Trigger Payload (webhook/event data if applicable)
  ## Instructions (accomplish step, report next focus, check goal)
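A minimal Python sketch of assembling that structure (the real builder lives in Rust inside MissionManager; the function signature and field names here are assumptions):

```python
def build_meta_prompt(name, goal, focus, approaches, knowledge, payload, thread_count):
    # Assemble the sections in the order shown above; empty sections are omitted.
    parts = [f"# Mission: {name}", f"Goal: {goal}"]
    parts += ["## Current Focus", focus or "(none yet; pick a starting point)"]
    if approaches:
        parts += ["## Previous Approaches"] + [f"- {a}" for a in approaches]
    if knowledge:
        parts += ["## Knowledge from Prior Threads"] + [f"- {k}" for k in knowledge]
    if payload:
        parts += ["## Trigger Payload", str(payload)]
    parts += [
        "## Instructions",
        f"This is thread #{thread_count + 1}. Accomplish the current focus, "
        "report the next focus, and state whether the goal is achieved.",
    ]
    return "\n".join(parts)
```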

Outcome processing:
- Extracts "next focus:" from FINAL() response → updates current_focus
- Detects "goal achieved: yes" → completes mission
- Records accomplishment in approach_history
- Failed threads recorded as "FAILED: {error}"
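The marker extraction amounts to two case-insensitive scans of the FINAL() text; a hedged sketch (the Rust parser may differ in details):

```python
import re

def process_outcome(final_text):
    # Pull "next focus: ..." if present; detect "goal achieved: yes".
    next_focus = None
    m = re.search(r"next focus:\s*(.+)", final_text, re.IGNORECASE)
    if m:
        next_focus = m.group(1).strip()
    achieved = re.search(r"goal achieved:\s*yes", final_text, re.IGNORECASE) is not None
    return next_focus, achieved
```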

Cron ticker:
- start_cron_ticker() spawns tokio task, ticks every 60s
- Checks active Cron missions, fires those past next_fire_at
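The per-tick check reduces to a filter over mission state; sketched here with dicts for illustration (the actual Mission struct fields are Rust types):

```python
def due_missions(missions, now):
    # Active Cron missions whose next_fire_at has passed.
    return [m for m in missions
            if m["status"] == "active"
            and m["cadence"] == "cron"
            and m["next_fire_at"] <= now]
```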

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire MissionManager into engine v2 for CodeAct access

Missions are now callable from CodeAct Python code:

```python
# Create a daily briefing mission
result = mission_create(
    name="Tech News",
    goal="Daily AI/crypto/software news briefing",
    cadence="0 9 * * *"
)

# List all missions
missions = mission_list()

# Manually fire a mission
mission_fire(id="...")

# Pause/resume
mission_pause(id="...")
mission_resume(id="...")
```

Implementation:
- MissionManager created on engine init, cron ticker started
- EffectBridgeAdapter intercepts mission_* function calls before tool
  lookup and routes to MissionManager
- parse_cadence() handles: "manual", cron expressions, "event:pattern",
  "webhook:path"
- Mission functions documented in CodeAct system prompt
- MissionManager set on adapter via set_mission_manager() after init
  (avoids circular dependency)
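The parse_cadence() dispatch described above can be sketched as a prefix check (return values here are illustrative tuples, not the actual MissionCadence enum):

```python
def parse_cadence(spec):
    # "manual" and the "event:"/"webhook:" prefixes are special-cased;
    # anything else is treated as a cron expression.
    if spec == "manual":
        return ("Manual",)
    if spec.startswith("event:"):
        return ("OnEvent", spec[len("event:"):])
    if spec.startswith("webhook:"):
        return ("Webhook", spec[len("webhook:"):])
    return ("Cron", spec)
```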

System prompt updated with mission_create, mission_list, mission_fire,
mission_pause, mission_resume documentation.

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): map routine_* calls to mission operations in v2

When the model calls routine_create, routine_list, routine_fire,
routine_pause, routine_resume, or routine_delete, the bridge now
routes them to the MissionManager instead of blocking with an error.

Mapping:
  routine_create → mission_create (with cadence parsing)
  routine_list   → mission_list
  routine_fire   → mission_fire
  routine_pause  → mission_pause
  routine_resume → mission_resume
  routine_update → mission_pause/resume (based on params)
  routine_delete → mission_complete (marks as done)

Routine tools removed from v1-only blocklist and restored in
available_actions(). The model can use either "routine" or "mission"
vocabulary — both work.

Still blocked: create_job, cancel_job, build_software (need v1
Scheduler/ContainerJobManager refs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add E2E mission flow tests — 6 new tests

Comprehensive mission lifecycle tests:

- fire_mission_builds_meta_prompt_with_goal: verifies thread spawned
  with project context and recorded in history
- outcome_processing_extracts_next_focus: "Next focus: X" in FINAL()
  response → mission.current_focus updated
- outcome_processing_detects_goal_achieved: "Goal achieved: yes" →
  mission status transitions to Completed
- mission_evolves_via_direct_outcome_processing: 3-step evolution:
  step 1 sets focus to "db module", step 2 evolves to "tools module",
  step 3 detects goal achieved → mission completes. Tests the full
  learning loop without background task timing dependencies.
- fire_with_trigger_payload: webhook payload stored on mission and
  threads_today counter incremented
- daily_budget_enforced: max_threads_per_day=1 → first fire succeeds,
  second returns None

157 tests passing (151 prior + 6 new mission E2E).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): self-improving engine via Mission system

Wire the self-improvement loop as a Mission with OnSystemEvent cadence,
inspired by karpathy/autoresearch's program.md approach. The mission
fires when threads complete with issues, receives trace data as trigger
payload, and uses tools directly to diagnose and fix problems.

Key changes:

Engine self-improvement (Phase A+B from design doc):
- Add fire_on_system_event() to MissionManager for OnSystemEvent cadence
- Add start_event_listener() that subscribes to thread events and fires
  matching missions when non-Mission threads complete with trace issues
- Add ensure_self_improvement_mission() with autoresearch-style goal
  prompt (concrete loop steps, not vague instructions)
- Add process_self_improvement_output() for structured JSON fallback
- Seed fix pattern database with 8 known patterns from debugging
- Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now
  async + Store-aware, appends learned rules from prompt_overlay docs)
- Pass Store to ExecutionLoop for overlay loading

Bridge review fixes (P1/P2):
- Scope engine v2 SSE events to requesting user (broadcast_for_user)
- Per-user pending approvals via HashMap instead of global Option
- Reset tool-call limit counter before each thread execution
- Only persist auto-approval when user chose "always", not one-off "yes"
- Remove dead store/mission_manager fields from EngineState

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add checkpoint-based engine thread recovery

* feat(engine): add Python orchestrator module and host functions

Add the orchestrator infrastructure for replacing the Rust execution
loop with versioned Python code. This commit adds the module and host
functions without switching over — the existing Rust loop is unchanged.

New files:
- orchestrator/default.py: v0 Python orchestrator (run_loop + helpers)
- executor/orchestrator.rs: host function dispatch, orchestrator
  loading from Store with version selection, OrchestratorResult parsing

Host functions exposed to orchestrator Python via Monty suspension:
  __llm_complete__, __execute_code_step__ (nested Monty VM),
  __execute_action__, __check_signals__, __emit_event__,
  __add_message__, __save_checkpoint__, __transition_to__,
  __retrieve_docs__, __check_budget__, __get_actions__

Also makes json_to_monty, monty_to_json, monty_to_string pub(crate)
in scripting.rs for cross-module use.

Design doc: docs/plans/2026-03-25-python-orchestrator.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): switch ExecutionLoop::run() to Python orchestrator

Replace the 900-line Rust execution loop with a ~80-line bootstrap
that loads and runs the versioned Python orchestrator via Monty VM.

The orchestrator Python code (orchestrator/default.py) is the v0
compiled-in version. Runtime versions can override it via MemoryDoc
storage (orchestrator:main with tag orchestrator_code).

Key fixes during switchover:
- Use ExtFunctionResult::NotFound for unknown functions so Monty
  falls through to Python-defined functions (extract_final, etc.)
- Move helper function definitions above run_loop for Monty scoping
- Use FINAL result value (not VM return value) in Complete handler
- Rename 'final' variable to 'final_answer' to avoid Python keyword

Status: 171/177 tests pass. 6 remaining failures are step_count and
token tracking bookkeeping — the orchestrator manages these internally
but doesn't yet update the thread's counters via host functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): all 177 tests pass with Python orchestrator

- Increment step_count and track tokens in __emit_event__("step_completed")
  so thread bookkeeping matches the old Rust loop behavior
- Remove double-counting of tokens in bootstrap (orchestrator handles it)
- Match nudge text to existing TOOL_INTENT_NUDGE constant
- Fix FINAL result propagation (use stored final_result, not VM return)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): orchestrator versioning, auto-rollback, and tests

Add version lifecycle for the Python orchestrator:
- Failure tracking via MemoryDoc (orchestrator:failures)
- Auto-rollback: after 3 consecutive failures, skip the latest version
  and fall back to previous (or compiled-in v0)
- Success resets the failure counter
- OrchestratorRollback event for observability
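The selection-with-rollback rule sketched in Python, assuming versions are comparable and a failure count per version (the real state lives in the orchestrator:failures MemoryDoc; names are illustrative):

```python
ROLLBACK_THRESHOLD = 3

def select_version(versions, failures):
    # Highest version wins unless it has hit the consecutive-failure
    # threshold, in which case fall back to the previous version, or
    # None, meaning the compiled-in v0 default.
    if not versions:
        return None
    ordered = sorted(versions)
    latest = ordered[-1]
    if failures.get(latest, 0) >= ROLLBACK_THRESHOLD:
        return ordered[-2] if len(ordered) > 1 else None
    return latest
```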

Update self-improvement Mission goal with Level 1.5 instructions for
orchestrator patches — the agent can now modify the execution loop
itself via memory_write with versioned orchestrator docs.

12 new tests: version selection (highest wins), rollback after failures,
rollback to default, failure counting/resetting, outcome parsing for
all 5 ThreadOutcome variants.

189 tests pass, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 architecture, self-improvement, and dev history

Three new docs for contributors:

- engine-v2-architecture.md: Two-layer architecture (Rust kernel +
  Python orchestrator), five primitives, execution model with nested
  Monty VMs, bridge layer, memory/reflection, missions, capabilities

- self-improvement.md: Three improvement levels (prompt/orchestrator/
  config/code), autoresearch-inspired Mission loop, versioned
  orchestrator with auto-rollback, fix pattern database, safety model

- development-history.md: Summary of 6 Claude Code sessions that
  built the system, key design decisions and debugging moments,
  architecture evolution from 900-line Rust loop to Python orchestrator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): complete v2 side-by-side integration with gateway API

Wire engine v2 into the full submission pipeline and expose threads,
projects, and missions through the web gateway REST API.

Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear
submissions to engine v2 when ENGINE_V2=true. Previously only UserInput
and ApprovalResponse were handled; all other control commands fell
through to disconnected v1 sessions.

Bridge query layer — add 11 read-only query functions and 6 DTO types
so gateway handlers can inspect engine state (threads, steps, events,
projects, missions) without direct access to the EngineState singleton.

Gateway endpoints — new /api/engine/* routes:
  GET  /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events
  GET  /projects, /projects/{id}
  GET  /missions, /missions/{id}
  POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume

SSE events — add ThreadStateChanged, ChildThreadSpawned, and
MissionThreadSpawned AppEvent variants. Expand the bridge event mapper
to forward StateChanged and ChildSpawned engine events to the browser.

Engine crate — add ConversationManager::clear_conversation() for /new
and /clear commands.

Code quality — replace 10 .expect() calls with proper error returns,
remove dead AgentConfig.engine_v2 field, log silent init errors, fix
duplicate doc comment, improve fallthrough documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): empty call_id on ActionResult and trace analyzer false positives

Fix structured executor not stamping call_id onto ActionResult — the
EffectExecutor trait doesn't receive call_id, so the structured executor
must copy it from the original ActionCall after execution. Empty call_id
caused OpenAI-compatible providers to reject the next LLM request with
"Invalid 'input[2].call_id': empty string".

Fix trace analyzer false positives:
- code_error check now only scans User-role code output messages
  (prefixed with [stdout]/[stderr]/[code ]/Traceback), not System
  prompt which contains example error text
- missing_tool_output check now recognizes ActionResult messages as
  valid tool output (Tier 0 structured path)
- Add NotImplementedError to detected code error patterns

New trace checks:
- empty_call_id: detect ActionResult messages with missing/empty
  call_id before they reach the LLM API (severity: Error)
- llm_error: extract LLM provider errors from Failed state reason
- orchestrator_error: extract orchestrator errors from Failed state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): add Missions tab to gateway UI

Add a full Missions page to the web gateway with list view, detail view,
and action buttons (Fire, Pause, Resume).

Backend: add /api/engine/missions/summary endpoint returning counts by
status (active/paused/completed/failed).

Frontend:
- New "Missions" tab between Jobs and Routines
- Summary cards showing mission counts by status
- Table with name, goal, cadence type, thread count, status, actions
- Detail view with goal, cadence, current focus, success criteria,
  approach history, spawned thread list, and action buttons
- Fire/Pause/Resume actions with toast notifications
- i18n support (English + Chinese)
- CSS following the existing routines/jobs patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): eagerly initialize engine v2 at startup

The gateway API endpoints (/api/engine/missions, etc.) call bridge
query functions that return empty results when the engine state hasn't
been initialized yet. Previously, initialization only happened lazily
on the first chat message via handle_with_engine().

Now when ENGINE_V2=true, the engine is initialized in Agent::run()
before channels start, so the self-improvement mission and other
engine state is available to gateway API endpoints immediately.

Also rename get_or_init_engine → init_engine and make it public so
it can be called from agent_loop.rs at startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): improve mission detail with markdown goal and thread table

- Goal rendered as a full-width markdown block instead of a plain-text
  meta item (uses existing renderMarkdown/marked)
- Current focus and success criteria also rendered as markdown
- Spawned threads shown as a clickable table with goal, type, state,
  steps, tokens, and created date instead of a UUID list
- Clicking a thread row opens an inline thread detail view showing
  metadata grid and full message history with markdown rendering
- Back button returns to the mission detail view
- Backend: mission detail now returns full thread summaries (goal,
  state, step_count, tokens) instead of just thread IDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): close SSE connections on page unload to prevent connection starvation

The browser limits concurrent HTTP/1.1 connections per origin to 6.
Without cleanup, SSE connections from prior page loads linger after
refresh/navigation, eating into the pool. After 2-3 refreshes, all 6
slots are consumed by stale SSE streams and new API fetch calls queue
indefinitely — the UI shows "connected" (SSE works) but data never
loads.

Add a beforeunload handler that closes both eventSource (chat events)
and logEventSource (log stream) so the browser can reuse connections
immediately on page reload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): support multiple gateway tabs by reducing SSE connections

Each browser tab opened 2 SSE connections (chat events + log events).
With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the
pool and couldn't load any data.

Three changes:

1. Lazy log SSE — only connect when the logs tab is active, disconnect
   when switching away. Most users rarely view logs, so this saves a
   connection slot per tab.

2. Visibility API — close SSE when the browser tab goes to background
   (user switches to another tab), reconnect when it becomes visible.
   Background tabs don't need real-time events.

3. Combined with the existing beforeunload cleanup, this means:
   - Active foreground tab: 1 connection (chat SSE only, +1 if logs tab)
   - Background tabs: 0 connections
   - Closed/refreshed tabs: 0 connections (beforeunload cleanup)

This allows many gateway tabs to coexist within the 6-connection limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): route messages to correct conversation by thread scope

Messages sent from a new conversation in the gateway always appeared in
the default assistant conversation because handle_with_engine ignored
the thread_id from the frontend.

Two fixes:

1. Engine conversation scoping — when the message carries a thread_id
   (from the frontend's conversation picker), use it as part of the
   engine conversation key: "gateway:<thread_id>" instead of just
   "gateway". This creates a distinct engine conversation per v1
   thread, so messages don't cross-contaminate.

2. V1 dual-write targeting — write user messages and assistant
   responses to the v1 conversation matching the thread_id (via
   ensure_conversation), not the hardcoded assistant conversation.
   Falls back to the assistant conversation when no thread_id is
   present (e.g., default chat).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): richer activity indicators for engine v2 execution

The gateway UI showed only generic "Thinking..." during engine v2
execution with no visibility into CodeAct code execution, tool calls,
or reflection. Now the event mapping produces detailed status updates:

Step lifecycle:
- "Calling LLM..." when a step starts (was "…
serrrfirat pushed a commit that referenced this pull request Apr 5, 2026
…architecture) (#1557)

* v2 architecture phase 1

* feat(engine): Phase 2 — execution loop, capability system, thread runtime

Add the core execution engine to ironclaw_engine crate:

- CapabilityRegistry: register/get/list capabilities and actions
- LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire)
- PolicyEngine: deterministic effect-level allow/deny/approve
- ThreadTree: parent-child relationship tracking
- ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc
- ThreadManager: spawn threads as tokio tasks, stop, inject messages, join
- ExecutionLoop: core loop replacing run_agentic_loop() with signals,
  context building, LLM calls, action execution, and event recording
- Structured executor (Tier 0): lease lookup → policy check → effect execution
- Tool intent nudge detection
- MemoryStore + RetrievalEngine stubs for Phase 4
- Full 8-phase architecture plan in docs/plans/
- CLAUDE.md spec for the engine crate

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 3 — Monty Python executor with RLM pattern

Add CodeAct execution (Tier 1) using the Monty embedded Python
interpreter, following the Recursive Language Model (RLM) pattern
from arXiv:2512.24601.

Key additions:
- executor/scripting.rs: Monty integration with FunctionCall-based
  tool dispatch, catch_unwind panic safety, resource limits (30s,
  64MB, 1M allocs)
- LlmResponse::Code variant + ExecutionTier::Scripting
- Context-as-variables (RLM 3.4): thread messages, goal, step_number,
  previous_results injected as Python variables — LLM context stays
  lean while code accesses data selectively
- llm_query(prompt, context) (RLM 3.5): recursive subagent calls
  from within Python code — results stored as variables, not injected
  into parent's attention window (symbolic composition)
- Compact output metadata between code steps instead of full stdout
- MontyObject ↔ serde_json::Value bidirectional conversion
- Updated architecture plan with RLM design principles

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): RLM best-practices enhancements from cross-reference analysis

Cross-referenced our implementation against the official RLM (alexzhang13/rlm),
fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation.
Key enhancements:

- FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching
  all three reference implementations. Code can signal completion at any
  point, not just via return value.
- llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn,
  matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch.
- Output truncation increased to 8000 chars (from 120), matching Prime
  Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT].
- Step 0 orientation preamble: auto-injects context metadata (message count,
  total chars, goal, last user message preview) before first code step,
  matching fast-rlm's auto-print pattern.
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors,
  OS errors, and async errors now flow back as stdout content instead of
  terminating the step, enabling LLM self-correction on next iteration.
  Only VM panics (catch_unwind) terminate as EngineError.
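The output-truncation rule above keeps the tail of the output and labels which case applied; a minimal sketch (the exact label text and limit handling in the Rust code may differ):

```python
MAX_OUTPUT_CHARS = 8000

def truncate_output(text, limit=MAX_OUTPUT_CHARS):
    # Keep the most recent output (the tail) when over the limit.
    if len(text) <= limit:
        return f"[FULL OUTPUT]\n{text}"
    return f"[TRUNCATED: last {limit} chars]\n{text[-limit:]}"
```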

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with RLM cross-reference learnings

Comprehensive update after cross-referencing against official RLM
(alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect
(verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM.

Changes:
- Mark Phases 1-3 as DONE with commit refs and test counts
- Add "Key Influences" section documenting all reference implementations
- Phase 3: full table of implemented RLM features with sources
- Phase 3: "Remaining gaps" table with which phase addresses each
- Phase 4: expanded with compaction (85% context), rlm_query() (full
  recursive sub-agent), dual model routing, budget controls (USD,
  timeout, tokens, consecutive errors), lazy loading, pass-by-reference
- Add "RLM Execution Model" cross-cutting section
- Add "Implementation Progress" tracking table
- Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 4 — budget controls, compaction, reflection pipeline

Budget enforcement in ExecutionLoop:
- max_tokens_total: cumulative token limit, checked before each iteration
- max_duration: wall-clock timeout for entire thread
- max_consecutive_errors: consecutive error steps threshold (resets on
  success, matching official RLM behavior)
- All produce ThreadOutcome::Failed with descriptive messages

Context compaction (from RLM paper, 85% threshold):
- estimate_tokens(): char-based estimation (chars/4, matching RLM)
- should_compact(): triggers when tokens >= threshold_pct * context_limit
- compact_messages(): asks LLM to summarize progress, replaces history
  with [system, summary, continuation_note], preserves intermediate results
- Configurable via ThreadConfig: model_context_limit, compaction_threshold
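The estimation and trigger check are small enough to state directly; sketched in Python under the chars/4 heuristic named above:

```python
def estimate_tokens(messages):
    # Char-based estimation: total chars // 4, matching the RLM heuristic.
    return sum(len(m) for m in messages) // 4

def should_compact(messages, context_limit, threshold_pct=0.85):
    # Trigger compaction once estimated tokens reach the threshold.
    return estimate_tokens(messages) >= threshold_pct * context_limit
```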

Dual model routing:
- LlmCallConfig gains depth field (0=root, 1+=sub-call)
- Implementations can route to cheaper models for sub-calls
- ExecutionLoop passes thread depth to every LLM call

Reflection pipeline (reflection/pipeline.rs):
- reflect(thread, llm): analyzes completed thread via LLM
- Produces Summary doc (always), Lesson doc (if errors), Issue doc (if failed)
- Builds transcript from thread messages + error events
- Returns ReflectionResult with docs + token usage

ThreadConfig extended with: max_tokens_total, max_consecutive_errors,
model_context_limit, enable_compaction, compaction_threshold, depth, max_depth.

78 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 5 — conversation surface separated from execution

Conversation is now a UI layer, not an execution boundary. Multiple
threads can run concurrently within one conversation; threads can
outlive their originating conversation.

New types (types/conversation.rs):
- ConversationSurface: channel + user + entries + active_threads
- ConversationEntry: sender (User/Agent/System) + content + origin_thread_id
- ConversationId, EntryId (UUID newtypes)
- EntrySender enum (User, Agent{thread_id}, System)

ConversationManager (runtime/conversation.rs):
- get_or_create_conversation(channel, user) — indexed by (channel, user)
- handle_user_message() — injects into active foreground thread or spawns new
- record_thread_outcome() — adds agent/system entries, untracks completed threads
- get_conversation(), list_conversations()

This enables the key architectural insight: a user can ask "what's the
weather?" while a deployment thread is still running. Both produce entries
in the same conversation.

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM

Restructure phases 6-8 to clarify execution model:

- Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker
  Python runtimes for LLM-generated code.
- WASM sandbox is for third-party tool isolation (existing infra, Phase 8)
- Docker containers are for thread-level isolation of high-risk work (Phase 8)
- Two-phase commit moves to Phase 6 (integration) at the adapter boundary

Phase renumbering:
- Old Phase 6 (Tier 2-3) → removed as separate phase
- Old Phase 7 (integration) → Phase 6
- Old Phase 8 (cleanup) → Phase 7
- New Phase 8: WASM tools + Docker thread isolation (infra integration)

Updated progress table: Phases 1-5 marked DONE with test counts and commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 6 — bridge adapters for main crate integration

Strategy C parallel deployment: when ENGINE_V2=true env var is set,
user messages route through the engine instead of the existing agentic
loop. All existing behavior is unchanged when the flag is off.

Bridge module (src/bridge/):
- LlmBridgeAdapter: wraps LlmProvider as engine LlmBackend, converts
  ThreadMessage↔ChatMessage, ActionDef↔ToolDefinition, depth-based
  model routing (primary vs cheap_llm)
- EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as EffectExecutor,
  routes tool calls through existing execute_tool_with_safety pipeline
- InMemoryStore: HashMap-backed Store impl (no DB tables needed yet)
- EngineRouter: is_engine_v2_enabled() + handle_with_engine() that
  builds engine from Agent deps and processes messages end-to-end

Integration touchpoint (4 lines in agent_loop.rs):
  After hook processing, before session resolution, check ENGINE_V2
  flag and route UserInput through the engine path.

Accessor visibility widened: llm(), cheap_llm(), safety(), tools()
changed from pub(super) to pub(crate) for bridge access.

85 engine tests + main crate clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add user message and system prompt to thread before execution

The ExecutionLoop was sending empty messages to the LLM because the
thread was spawned with the user's input as the goal but no messages.

Fixes:
- ThreadManager.spawn_thread() now adds the goal as an initial user
  message before starting the execution loop
- ExecutionLoop.run() injects a default system prompt if none exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): match existing LLM request format to prevent 400 errors

The LLM bridge was missing several defaults that the existing
Reasoning.respond_with_tools() sets:

- tool_choice: "auto" when tools are present (required by some providers)
- max_tokens: 4096 (default)
- temperature: 0.7 (default)
- When no tools (force_text): use plain complete() instead of
  complete_with_tools() with an empty tools array — matches the existing
  no-tools fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): persist conversation context across messages

The engine was creating a fresh ThreadManager and InMemoryStore per
message, losing all context between turns. A follow-up question like
"what are the latest 10 issues?" had no memory of the prior "how many
issues" response.

Fixes:
- EngineState (ThreadManager, ConversationManager, InMemoryStore) now
  persists across messages via OnceLock, initialized on first use
- ConversationManager builds message history from prior conversation
  entries (user messages + agent responses) and passes it to new threads
- ThreadManager.spawn_thread_with_history() accepts initial_messages
  that are prepended before the current user message
- System notifications (thread started/completed) are filtered out of
  the history (not useful as LLM context)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): enable CodeAct/RLM mode with code block detection

The engine now operates in CodeAct/RLM mode:

System prompt (executor/prompt.rs):
- Instructs LLM to write Python in ```repl fenced blocks
- Documents available tools as callable Python functions
- Documents llm_query(), llm_query_batched(), FINAL()
- Documents context variables (context, goal, step_number, previous_results)
- Strategy guidance: examine context, break into steps, use tools, call FINAL()

Code block detection (bridge/llm_adapter.rs):
- extract_code_block() scans LLM text responses for ```repl or ```python blocks
- When detected, returns LlmResponse::Code instead of LlmResponse::Text
- The ExecutionLoop routes Code responses through Monty for execution

No structured tool definitions sent to LLM:
- Tools are described in the system prompt as Python functions
- The LLM call sends an empty actions array, forcing text-mode responses
- This ensures the LLM writes code blocks (CodeAct) instead of
  structured tool calls (which would bypass the REPL)

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add 8 CodeAct/RLM E2E tests with mock LLM

Comprehensive test coverage for the Monty Python execution path:

- codeact_simple_final: Python code calls FINAL('answer') → thread completes
- codeact_tool_call_then_final: code calls test_tool() → FunctionCall
  suspends VM → MockEffects returns result → code resumes → FINAL()
- codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15')
  with no tool calls — pure Python in Monty
- codeact_multi_step: first step prints output (no FINAL), second step
  sees output metadata and calls FINAL — tests iterative REPL flow
- codeact_error_recovery: first step has NameError → error flows to LLM
  as stdout → second step recovers with FINAL — tests error transparency
- codeact_context_variables_available: code accesses `goal` and `context`
  variables injected by the RLM context builder
- codeact_multiple_tool_calls_in_loop: for loop calls test_tool() 3 times
  → 3 FunctionCall suspensions → all results collected → FINAL
- codeact_llm_query_recursive: code calls llm_query('prompt') → VM
  suspends → MockLlm provides sub-agent response → result returned as
  Python string variable

93 tests passing (85 prior + 8 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): detect code blocks in plain completion path + multi-block support

Two bugs fixed:

1. The no-tools completion path (used by CodeAct since we send empty
   actions) returned LlmResponse::Text without checking for code blocks.
   Code blocks were rendered as markdown text instead of being executed.

2. extract_code_block now:
   - Handles bare ``` fences (skips non-Python languages)
   - Collects ALL code blocks in the response and concatenates them
     (models often split code across multiple blocks with explanation)
   - Tries markers in order: ```repl, ```python, ```py, then bare ```
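
The marker priority and multi-block concatenation can be sketched as follows (std-only; the helper name `blocks_for_marker` and the exact fence handling are assumptions, not the crate's actual code — bare-``` handling with language filtering is omitted):

```rust
/// Collect every fenced block opened by `marker`, joined with "\n\n".
/// An unclosed fence invalidates the whole response (returns None).
fn blocks_for_marker(text: &str, marker: &str) -> Option<String> {
    let mut found = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(marker) {
        let after_open = &rest[start + marker.len()..];
        // Code begins after the newline that ends the fence line.
        let nl = after_open.find('\n')?;
        let body_and_rest = &after_open[nl + 1..];
        // No closing fence: treat the response as having no code.
        let end = body_and_rest.find("```")?;
        let body = body_and_rest[..end].trim();
        // Empty fenced blocks are skipped.
        if !body.is_empty() {
            found.push(body.to_string());
        }
        rest = &body_and_rest[end + 3..];
    }
    if found.is_empty() { None } else { Some(found.join("\n\n")) }
}

fn extract_code_block(text: &str) -> Option<String> {
    // Marker priority: ```repl first, then ```python, then ```py.
    ["```repl", "```python", "```py"]
        .iter()
        .find_map(|marker| blocks_for_marker(text, marker))
}
```

Concatenating all matching blocks (rather than taking the first) is what handles models that interleave explanation between code fences.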

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(bridge): add 11 regression tests for code block extraction

Covers the exact failure modes discovered during live testing:

- extract_repl_block: standard ```repl fenced block
- extract_python_block: ```python marker
- extract_py_block: ```py shorthand
- extract_bare_backtick_block: bare ``` with Python content
- skip_non_python_language: ```json should NOT be extracted
- no_code_blocks_returns_none: plain text, no fences
- multiple_code_blocks_concatenated: two ```repl blocks with
  explanation between them → concatenated with \n\n
- mixed_thinking_and_code: model outputs explanation + two
  ```python blocks (the Hyperliquid case) → both extracted
- repl_preferred_over_bare: ```repl takes priority over bare ```
- empty_code_block_skipped: empty fenced block returns None
- unclosed_block_returns_none: no closing ``` returns None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): detect FINAL() in text responses + regression tests

Models sometimes write FINAL() outside code blocks — as plain text
after an explanation. The Hyperliquid case: model outputs a long
analysis then FINAL("""...""") at the end, not inside ```repl fences.

Fixes:
- extract_final_from_text(): regex-based FINAL detection in text
  responses, matching the official RLM's find_final_answer() fallback
- Handles: double-quoted, single-quoted, triple-quoted, unquoted,
  nested parens
- Checked in LlmResponse::Text handler BEFORE tool intent nudge
  (FINAL takes priority)
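
A sketch of the same fallback using a paren scanner rather than a regex (the commit describes the real implementation as regex-based; the function name, last-occurrence choice, and quote handling here are illustrative):

```rust
fn extract_final_from_text(text: &str) -> Option<String> {
    // Take the last FINAL( — models typically emit it after their analysis.
    let start = text.rfind("FINAL(")?;
    let args_start = start + "FINAL(".len();
    let mut depth = 0usize;
    for (i, ch) in text[args_start..].char_indices() {
        match ch {
            '(' => depth += 1,
            // Matching close paren of the FINAL( call itself.
            ')' if depth == 0 => {
                let raw = text[args_start..args_start + i].trim();
                // Strip triple, double, or single quotes, in that order;
                // unquoted arguments pass through unchanged.
                let ans = raw
                    .strip_prefix("\"\"\"").and_then(|s| s.strip_suffix("\"\"\""))
                    .or_else(|| raw.strip_prefix('"').and_then(|s| s.strip_suffix('"')))
                    .or_else(|| raw.strip_prefix('\'').and_then(|s| s.strip_suffix('\'')))
                    .unwrap_or(raw);
                return Some(ans.to_string());
            }
            ')' => depth -= 1,
            _ => {}
        }
    }
    None // unbalanced parens: no usable FINAL
}
```

Tracking nesting depth is what lets FINAL(compute(2)) survive, which a naive "up to the first )" pattern would truncate.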

9 new tests:
- codeact_final_in_text_response: FINAL("answer") in plain text
- codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text
- final_double_quoted, final_single_quoted, final_triple_quoted,
  final_unquoted, final_with_nested_parens, final_after_long_text,
  no_final_returns_none

102 tests passing (93 + 9 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add crate extraction & cleanup roadmap

Documents architectural recommendations from the engine v2 design
process for future reference:

- Root directory consolidation (channels-src + tools-src → extensions/)
- Crate extraction tiers: zero-coupling (estimation, observability,
  tunnel), trivial-coupling (document_extraction, pairing, hooks),
  medium-coupling (secrets, MCP, db, workspace, llm, skills),
  heavy-coupling (web gateway, agent, extensions)
- src/ module reorganization into logical groups (core, persistence,
  infra, media, support)
- main.rs/app.rs slimming targets (100/500 lines after migration)
- WASM module candidates (document_extraction) and non-candidates
  (REPL, web gateway → separate crates instead)
- Priority ordering for extraction work
- Tracks completed items (ironclaw_safety, ironclaw_engine,
  transcription move)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): live progress status updates via event broadcast

Engine v2 now shows live progress in the CLI (and any channel):
- "Thinking..." when a step starts
- Tool name + success/error when actions execute
- "Processing results..." when a step completes

Implementation:
- ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256)
- ExecutionLoop.emit_event() writes to thread.events AND broadcasts
- ThreadManager.subscribe_events() returns a receiver
- Router uses tokio::select! to listen for events while waiting for
  thread completion, forwarding them as StatusUpdate to the channel

This replaces the polling approach with zero-latency event streaming.
Agent.channels visibility widened to pub(crate) for bridge access.

102 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): include tool results in code step output for LLM context

The LLM was ignoring tool results and answering from training data
because the compact output metadata didn't include what tools returned.
Tool results lived only as ActionResult messages (role: Tool) which
some providers flatten or the model ignores.

Now the code step output includes:
- stdout from Python print() statements
- [tool_name result] with the actual output (truncated to 4K per tool)
- [tool_name error] for failed tools
- [return] for the code's return value
- Total output truncated to 8K chars to prevent context bloat

This ensures the model sees web_search results, API responses, etc.
in the next iteration and can reason about them instead of hallucinating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add debug/trace logging for CodeAct execution

Three verbosity levels for debugging the engine:

RUST_LOG=ironclaw_engine=debug:
- LLM call: message count, iteration, force_text
- LLM response: type (text/code/action_calls), token usage
- Code execution: code length, action count, had_error, final_answer
- Text response: length, FINAL() detection

RUST_LOG=ironclaw_engine=trace:
- Full message list sent to LLM (role, length, first 200 chars each)
- Full code block being executed
- stdout preview (first 500 chars)
- Per-tool results (name, success, first 300 chars of output)
- Text response preview (first 500 chars)

Usage:
  ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run
  ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): execution trace recording + retrospective analysis

Enable with ENGINE_V2_TRACE=1 to get full execution traces and
automatic issue detection after each thread completes.

Trace recording (executor/trace.rs):
- build_trace(): captures full thread state — messages (with full
  content), events, step count, token usage, detected issues
- write_trace(): writes JSON to engine_trace_{timestamp}.json
- log_trace_summary(): logs summary + issues at info/warn level

Retrospective analyzer detects 8 issue categories:
- thread_failure: thread ended in Failed state
- no_response: no assistant message generated
- tool_error: specific tool failures with error details
- code_error: Python errors (NameError, SyntaxError, etc.) in output
- missing_tool_output: tool results exist but not in system messages
- excessive_steps: >10 steps (may be stuck in a loop)
- no_tools_used: single-step answer without tools (hallucination risk)
- mixed_mode: text responses without code blocks (prompt not followed)

Thread state now saved to store after execution completes (for trace
access after join_thread).

Usage:
  ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run
  # After each message: trace JSON + issue log in terminal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): wire reflection pipeline + trace analysis into thread lifecycle

After every thread completes, ThreadManager now automatically runs:

1. Retrospective trace analysis (non-LLM, always):
   - Detects 8 issue categories (tool errors, code errors, missing
     outputs, excessive steps, hallucination risk, etc.)
   - Logs issues at warn level when found

2. Trace file recording (when ENGINE_V2_TRACE=1):
   - Writes full JSON trace to engine_trace_{timestamp}.json

3. LLM reflection (when enable_reflection=true):
   - Calls reflection pipeline to produce Summary, Lesson, Issue docs
   - Saves docs to store for future context retrieval
   - Enabled by default in the bridge router

All three run inside the spawned tokio task after exec.run() completes,
before saving the final thread state. No external wiring needed.

Removed duplicate trace recording from the router — it's now handled
by ThreadManager automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): convert tool name hyphens to underscores for Python compatibility

Root cause from trace analysis: the LLM writes `web_search()` (valid
Python identifier) but the tool registry has `web-search` (with hyphen).
The EffectBridgeAdapter couldn't find the tool → "Tool not found" error
→ model fabricated fake data instead.

Fixes:
- available_actions(): converts tool names from hyphens to underscores
  (web-search → web_search) so the system prompt lists valid Python names
- execute_action(): tries the original name first, then falls back to
  hyphenated form (web_search → web-search) for tool registry lookup
- Same conversion in router's capability registry builder
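
The two-way mapping can be sketched as below (function names are illustrative, and the registry is modeled as a plain slice of names rather than the real tool registry):

```rust
/// Tool names shown to the LLM must be valid Python identifiers.
fn python_name(tool_name: &str) -> String {
    tool_name.replace('-', "_")
}

/// Registry lookup: try the name as written, then the hyphenated form.
fn resolve_tool<'a>(registry: &[&'a str], requested: &str) -> Option<&'a str> {
    registry
        .iter()
        .copied()
        .find(|t| *t == requested)
        .or_else(|| {
            let hyphenated = requested.replace('_', "-");
            registry.iter().copied().find(|t| *t == hyphenated)
        })
}
```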

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): parse JSON tool output to prevent double-serialization

From trace analysis: web_search returned a JSON string, which was
wrapped as serde_json::json!(string) creating a Value::String containing
JSON. When Monty got this as MontyObject::String, the Python code
couldn't index it with result['title'] → TypeError.

Fix: try parsing the tool output string as JSON first. If valid, use the
parsed Value (becomes a Python dict/list). If not valid JSON, keep as
string. This means web_search results are directly indexable in Python:
  results = web_search(query="...")
  print(results["results"][0]["title"])  # works now

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): persist variables across code steps via `state` dict

Monty creates a fresh runtime per code step, so variables are lost
between steps. This caused the model to re-paste tool results from
system messages, wasting tokens.

Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that
accumulates across steps:
- Tool results stored by tool name: state["web_search"] = {results...}
- Return values stored: state["last_return"], state["step_0_return"]
- Injected as a `state` Python variable in each new MontyRun

Now the model can do:
  Step 1: results = web_search(query="...")  # tool result saved in state
  Step 2: data = state["web_search"]         # access previous result
          summary = llm_query("summarize", str(data))
          FINAL(summary)

System prompt updated to document the `state` variable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add state hint on code errors + retrieval engine integration

When code fails with NameError/UnboundLocalError (model trying to
access variables from a previous step), the error output now includes:

  [HINT] Variables don't persist between code blocks. Use the `state`
  dict to access data from previous steps. Available keys: ["web_search",
  "last_return"]

This teaches the model to use `state["web_search"]` instead of `result`
after a NameError, reducing wasted steps from 3-4 to 1.

Also integrates RetrievalEngine into context building and ThreadManager:
- build_step_context() now accepts optional RetrievalEngine to inject
  relevant memory docs (Lessons, Specs, Playbooks) into LLM context
- RetrievalEngine uses keyword matching with doc-type priority scoring
- Memory docs from reflection (Phase 4) now feed back into future threads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove trace files and add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace web_fetch example with web_search in CodeAct prompt

The system prompt example used web_fetch(url="...") which doesn't exist
as a tool. The model learned from the example and tried web_fetch,
getting "Tool not found". Changed to web_search(query="...") which is
an actual registered tool.

Found via trace analysis — reflection pipeline correctly identified
this as a "Tool Name Correction" spec doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(engine): extract prompt templates to markdown files

Prompt templates moved from inline Rust strings to plain markdown files
at crates/ironclaw_engine/prompts/ for easy inspection and iteration:

- prompts/codeact_preamble.md — main instructions, special functions,
  context variables, rules
- prompts/codeact_postamble.md — strategy section

Loaded at compile time via include_str!(), so no runtime file I/O.
Edit the .md files and rebuild to iterate on prompts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace byte-index slicing with char-safe truncation

Panic: 'byte index 80 is not a char boundary; it is inside ''' when
tool output contained multi-byte UTF-8 characters (smart quotes from
web search results).

Fixed 4 unsafe byte-index slices:
- thread.rs:281: message preview &content[..80] → chars().take(80)
- loop_engine.rs:556: tool output &str[..4000] → chars().take(4000)
- loop_engine.rs:579: output tail &str[len-8000..] → chars().skip()
- scripting.rs:82: stdout tail &str[len-N..] → chars().skip()

All now use .chars().take() or .chars().skip() which respect character
boundaries. Follows CLAUDE.md rule: "Never use byte-index slicing on
user-supplied or external strings."
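
The safe pattern, as a minimal sketch (function names are illustrative):

```rust
/// First `n` characters (not bytes) — safe on multi-byte UTF-8,
/// where `&s[..n]` can panic mid-character.
fn preview(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// Last `n` characters — replaces `&s[s.len() - n..]` style tails.
fn tail(s: &str, n: usize) -> String {
    let total = s.chars().count();
    s.chars().skip(total.saturating_sub(n)).collect()
}
```

Note the trade-off: char-based truncation is O(n) over the string rather than O(1) slicing, which is fine for preview-sized bounds like these.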

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): fix false positive missing_tool_output warning in trace analyzer

The check was looking for "[" + "result]" in System-role messages only,
but tool output metadata is added with patterns like "[shell result]"
and may appear in messages with any role. Changed to scan all messages
for " result]" or " error]" patterns regardless of role.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with Phase 6 status and approval flow design

Phase 6 updated to reflect what was actually built:
- Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done
- Integration touchpoint (4 lines in handle_message) — done
- Live progress via broadcast events — done
- Conversation persistence across messages — done
- Trace recording + retrospective analysis — done
- 8 bugs found and fixed via trace analysis — documented

Phase 6 remaining work documented:
- Approval flow: detailed 5-step design (send to channel, pause thread,
  route response, resume execution, always handling) with v1 reference
- Database persistence (InMemoryStore → real DB tables)
- Acceptance testing (TestRig + TraceLlm fixtures)
- Two-phase commit for high-stakes effects

Progress table updated: Phase 6 marked as DONE (partial), 134 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add self-improving engine design plan

Designs a system where the engine debugs and improves itself, based on
the pattern observed in the last session: 5 consecutive bug fixes all
followed trace → read → identify → edit → test, using tools the engine
already has access to.

Three levels of self-improvement:
- Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply.
- Level 2 (Config): adjust defaults/mappings. Branch + test + PR.
- Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR.

Architecture: Self-improvement Mission spawns a Reflection thread that
reads traces, reads source, proposes fixes, validates via cargo test,
and either auto-applies (Level 1) or creates a PR (Level 2-3).

Includes: fix pattern database (seeded from our 8 debugging session
fixes), feedback loop diagram, safety model, implementation phases
(A through D), and what exists vs what's new.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 security model and audit

Comprehensive security analysis of engine v2 covering:

Threat model: 4 attacker profiles (malicious input, prompt injection
via tools, poisoned memory, supply chain).

Current state audit: 9 controls working (Monty sandbox, safety layer,
policy engine, leases, provenance, events) and 9 gaps identified.

Critical finding: ALL tools granted by default — CodeAct code can call
shell, write_file, apply_patch without approval. Proposed fix: 3-tier
tool classification (auto/approve-once/always-approve).

CodeAct-specific threats: tool call amplification, prompt injection via
search results, data exfiltration via tool chains, Monty escape.

Self-improvement security: poisoned trace attacks, memory poisoning via
reflection. Mitigations: edit validation, frequency caps, audit trail,
auto-rollback, reflection output scanning.

6-layer security architecture proposed: input validation, capability
gating, output sanitization, execution sandboxing, self-improvement
controls, observability.

Prioritized implementation plan with severity/effort ratings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(security): cross-reference v1 controls — use, don't reinvent

Updated security plan with detailed audit of ALL existing v1 security
controls and how they map to engine v2 bridge gaps:

Key finding: v1 already has solutions for every security gap identified.
The bridge just needs to wire them in:

- Tool::requires_approval() exists but bridge doesn't call it
- safety.wrap_for_llm() exists but tool results enter context unwrapped
- RateLimiter exists but bridge doesn't check rate limits
- BeforeToolCall hooks exist but bridge doesn't run them
- redact_params() exists but bridge doesn't redact sensitive params
- Shell risk classification (Low/Medium/High) is inherited but ignored

Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter,
not new security infrastructure. The bridge is the security boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy

- Add Mission type and MissionManager for recurring thread scheduling
- Add ReliabilityTracker for per-capability success/failure/latency tracking
- Add reflection executor that spawns CodeAct threads for post-completion reflection
- Extend PolicyEngine with provenance-aware taint checking (LLM-generated data
  requires approval for financial/external-write effects)
- Extend Store trait with mission CRUD methods
- Add conversation surface tracking, compaction token fix, context memory injection
- Wire new modules through lib.rs re-exports and bridge adapters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire v1 security controls into engine v2 adapter

Zero engine crate changes. All security controls enforced at the bridge
boundary in EffectBridgeAdapter:

1. Tool approval (v1: Tool::requires_approval):
   - Checks each tool's approval requirement with actual params
   - Always → returns EngineError::LeaseDenied (blocks execution)
   - UnlessAutoApproved → checks auto_approved set, blocks if not approved
   - Never → proceeds
   - Per-session auto_approved HashSet (for future "always" handling)

2. Hook interception (v1: BeforeToolCall):
   - Runs HookEvent::ToolCall before every execution
   - HookOutcome::Reject → blocks with reason
   - HookError::Rejected → blocks with reason
   - Hook errors → fail-open (logged, execution continues)

3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm):
   - Leak detection: API keys in tool output are redacted
   - Policy enforcement: content policy rules applied
   - Length truncation: output capped at 100KB
   - XML boundary protection: prevents injection via tool output

4. Sensitive param redaction (v1: redact_params):
   - Tool's sensitive_params() consulted before hooks see parameters
   - Redacted params sent to hooks, original params used for execution

5. available_actions() now sets requires_approval based on each tool's
   default approval requirement, so the engine's PolicyEngine can
   gate tools it hasn't seen before.

6. Actual execution timing measured via Instant::now() (replaces
   placeholder Duration::from_millis(1)).
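
The approval gate in item 1 can be sketched with plain types (the enum mirrors v1's ApprovalRequirement as described above, but the function name is illustrative and the auto-approved set is modeled as a slice):

```rust
#[derive(Debug)]
enum ApprovalRequirement {
    Always,
    UnlessAutoApproved,
    Never,
}

/// The bridge maps an Err here to EngineError::LeaseDenied,
/// which blocks execution.
fn check_approval(
    req: ApprovalRequirement,
    tool: &str,
    auto_approved: &[&str],
) -> Result<(), String> {
    match req {
        ApprovalRequirement::Always => {
            Err(format!("'{tool}' always requires approval"))
        }
        ApprovalRequirement::UnlessAutoApproved if !auto_approved.contains(&tool) => {
            Err(format!("'{tool}' requires approval"))
        }
        // Never, or UnlessAutoApproved with the tool already approved.
        _ => Ok(()),
    }
}
```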

Accessor visibility: hooks() widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): implement tool approval flow for engine v2

Adds a complete approval flow that mirrors v1 behavior, using the
existing v1 security controls (Tool::requires_approval, auto-approve
sets, StatusUpdate::ApprovalNeeded).

## How it works

### Step 1: Tool blocked at execution
When the LLM's code calls a tool (e.g., `shell("ls")`):
1. EffectBridgeAdapter.execute_action() looks up the Tool object
2. Calls tool.requires_approval(&params) — returns ApprovalRequirement
3. If Always → EngineError::LeaseDenied (always blocks)
4. If UnlessAutoApproved → checks auto_approved HashSet → if not in set,
   returns EngineError::LeaseDenied
5. If Never → proceeds to execution

### Step 2: Engine returns NeedApproval
The LeaseDenied error propagates through:
- CodeAct path: becomes Python RuntimeError, code halts, thread returns
  NeedApproval with action_name + parameters
- Structured path: same via ActionResult.is_error

### Step 3: Router stores pending approval
- PendingApproval { action_name, original_content } stored on EngineState
- StatusUpdate::ApprovalNeeded sent to channel (shows approval card in
  CLI/web with tool name, parameters, yes/always/no buttons)
- Returns text: "Tool 'shell' requires approval. Reply yes/always/no."

### Step 4: User responds
handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2:
- 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes
  original message (tool now passes the approval check on second run)
- 'always' → same + logs for session persistence
- 'no' → returns "Denied: tool was not executed."

### Key design choice
Instead of pausing/resuming mid-execution (which needs engine changes
to freeze/restore the Monty VM state), we auto-approve the tool and
re-run the full message. The EffectBridgeAdapter's auto_approved set
persists across runs, so the second execution passes immediately.

This trades one extra LLM call for zero engine modifications.

## Files changed
- src/bridge/router.rs: PendingApproval struct, handle_approval(),
  NeedApproval → StatusUpdate::ApprovalNeeded conversion
- src/bridge/mod.rs: export handle_approval
- src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2
- src/bridge/effect_adapter.rs: fmt fixes

151 tests passing, clippy + fmt clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): demote trace/reflection logging from info to debug

INFO-level log output from background tasks (trace analysis, reflection)
corrupts the REPL terminal UI. The trace summary, issue warnings, and
reflection doc previews were printing mid-approval-card, breaking the
interactive display.

Fix: all logging in trace.rs changed from info!/warn! to debug!/warn!.
Trace analysis and reflection results now only show when
RUST_LOG=ironclaw_engine=debug is set.

Also added logging discipline rule to global CLAUDE.md:
- info! → user-facing status the REPL intentionally renders
- debug! → internal diagnostics (traces, reflection, engine internals)
- Background tasks must NEVER use info! — it breaks the TUI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): demote all router info! logging to debug!

"engine v2: initializing" and "engine v2: handling message" were
printing at INFO level, corrupting the REPL UI. All router logging
now uses debug! — only visible with RUST_LOG=ironclaw=debug.

Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(safety): demote leak detector warn-action logs from warn! to debug!

The leak detector's Warn-action matches (high_entropy_hex pattern on
web search results containing commit SHAs, CSS colors, URL hashes)
were logging at warn! level, corrupting the REPL UI with lines like:
  WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5

These are informational false positives — real leaks use LeakAction::Redact
which silently modifies the content. Warn-action matches only log for
debugging purposes and should not appear in production output.

Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): strengthen CodeAct prompt to prevent shallow text answers

The model was answering "Suggested 45 improvements" as a brief text
summary from training data without actually searching or listing them.
The trace showed: no code block, no tool calls, no FINAL().

Prompt changes:
- Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with
  plain text only." (was: "Always write code... plain text for brief
  explanations")
- Rule 2 (NEW): "NEVER answer from memory or training data alone.
  Always use tools to get real, current information before answering."
- Rule 3: FINAL answer "should be detailed and complete — not just a
  summary like 'found 45 items'"
- Rule 8 (NEW): "Include the actual content in your FINAL() answer,
  not just a count or summary. Users want to see the details."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): persist reflection docs to workspace for cross-session learning

Replaces InMemoryStore with HybridStore:
- Ephemeral data (threads, steps, events, leases) stays in-memory
- MemoryDocs (lessons, specs, playbooks from reflection) persist to
  the workspace at engine/docs/{type}/{id}.json

On engine init, load_docs_from_workspace() reads existing docs back
into the in-memory cache. This means:
- Lessons learned in session 1 are available in session 2
- The RetrievalEngine injects relevant past lessons into new threads
- The engine genuinely improves over time as reflection accumulates

Workspace paths:
  engine/docs/lessons/{uuid}.json
  engine/docs/specs/{uuid}.json
  engine/docs/playbooks/{uuid}.json
  engine/docs/summaries/{uuid}.json
  engine/docs/issues/{uuid}.json

No new database tables. Uses existing workspace write/read/list.
workspace() accessor widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): adapt to execute_tool_with_safety params-by-value change

Staging merge changed execute_tool_with_safety to take params by value
instead of by reference (perf optimization from PR #926). Updated
bridge adapter to clone params before passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): add web gateway integration plan to Phase 6

Documents three gaps between engine v2 and the web gateway:
1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent)
2. No conversation persistence (engine uses HybridStore, gateway reads v1 DB)
3. No cross-channel visibility (REPL ↔ web messages invisible to each other)

Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1
conversation tables after thread completion. Prerequisite: AppEvent
extraction PR (in progress separately).

Also updated DB persistence status: HybridStore with workspace-backed
MemoryDocs is now implemented (partial persistence).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): document routine/job gap and SIGKILL crash scenario

Routines are entirely v1 — not hooked up to engine v2. When a user
asks "create a routine" as natural language, engine v2 tries to call
routine_create via CodeAct, but the tool needs RoutineEngine + Database
refs that the bridge's minimal JobContext doesn't provide. This caused
a SIGKILL crash during testing.

Options documented: block routine tools in v2 (short term), pass refs
through context (medium), replace with Mission system (long term).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+
modules across agent, orchestrator, worker, tools, and extensions — it
had become the application-wide event protocol, not a web transport
concern.

Create crates/ironclaw_common as a shared workspace crate and move the
enum there as AppEvent. Also move the truncate_preview utility, which
was similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening
where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB

Three changes to make engine v2 visible in the web gateway:

1. SSE event streaming (AppEvent broadcast):
   - ThreadEvent → AppEvent conversion via thread_event_to_app_event()
   - Events broadcast to SseManager during the poll loop
   - Covers: Thinking, ToolCompleted (success/error), Status, Response
   - Web gateway receives real-time progress without any gateway changes

2. Conversation persistence to v1 database:
   - After thread completes, writes user message + agent response to
     v1 ConversationStore via add_conversation_message()
   - Uses get_or_create_assistant_conversation() for per-user per-channel
   - Web gateway reads from DB as usual — chat history appears

3. Final response broadcast:
   - AppEvent::Response with full text + thread_id sent via SSE
   - Web gateway renders the response in the chat UI

New EngineState fields: sse (Option<Arc<SseManager>>),
db (Option<Arc<dyn Database>>). Both populated from Agent.deps.

Agent.deps visibility widened to pub(crate).

Depends on: ironclaw_common crate with AppEvent type (PR #1615).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits

Three security/stability improvements in EffectBridgeAdapter:

1. V1-only tool blocking:
   - routine_create, create_job, build_software (and hyphenated variants)
     return helpful error: "use the slash command instead"
   - Filtered out of available_actions() so system prompt doesn't list them
   - Prevents crash from tools needing RoutineEngine/Scheduler refs

2. Per-step tool call limit:
   - Max 50 tool calls per code block (AtomicU32 counter)
   - Prevents amplification: `for i in range(10000): shell(...)`
   - Returns "call limit reached, break into multiple steps"

3. Rate limiting:
   - Per-user per-tool sliding window via RateLimiter
   - Checks tool.rate_limit_config() before every execution
   - Returns "rate limited, try again in Ns"
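The call-limit and rate-limit logic above can be sketched in Python (the real implementation is Rust with an AtomicU32 and the existing RateLimiter; names and the limit of 50 come from this commit, everything else is illustrative):

```python
import time
from collections import deque

CALL_LIMIT = 50  # max tool calls per code block, per the commit


class StepCallCounter:
    """Per-step tool call counter; reset before each code block runs."""

    def __init__(self, limit=CALL_LIMIT):
        self.limit = limit
        self.count = 0

    def try_call(self):
        # Deny once the limit is reached ("break into multiple steps").
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLimiter:
    """Per-user per-tool sliding window: at most max_calls per window_secs."""

    def __init__(self, max_calls, window_secs):
        self.max_calls = max_calls
        self.window_secs = window_secs
        self.calls = deque()  # timestamps of allowed calls

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_secs:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False  # "rate limited, try again in Ns"
        self.calls.append(now)
        return True
```

Note the counter is amplification protection (one code block looping over a tool), while the sliding window is sustained-abuse protection across steps.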

Architecture plan updated:
- Gateway integration: DONE
- Routines: BLOCKED (gracefully, with slash command fallback)
- Rate limiting: DONE
- Call limit: DONE
- Phase 6 status: DONE (remaining: acceptance tests, two-phase commit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Mission system design — goal-oriented autonomous threads

Missions replace routines with evolving, knowledge-accumulating
autonomous agents. Unlike routines (fixed prompt, stateless), Missions:

- Generate prompts from accumulated Project knowledge (lessons,
  playbooks, issues from prior threads)
- Adapt approach when something fails repeatedly
- Track progress toward a goal with success criteria
- Self-manage: pause when stuck, complete when goal achieved

Architecture: MissionManager with cron ticker spawns threads via
ThreadManager. Meta-prompt built from mission goal + Project MemoryDocs
via RetrievalEngine. Reflection feeds back automatically.

6-step implementation plan: cron trigger, meta-prompt builder, bridge
wiring, CodeAct tools, progress tracking, persistence.

Includes two worked examples: daily tech news briefing (ongoing) and
test coverage improvement (goal-driven, self-completing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): extend Mission types with webhook/event triggers + evolving strategy

Mission types updated to support external activation sources:

MissionCadence expanded:
- Cron { expression, timezone } — timezone-aware scheduling
- OnEvent { event_pattern } — channel message pattern matching
- OnSystemEvent { source, event_type } — structured events from tools
- Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.)
- Manual — explicit triggering only

The engine defines trigger TYPES. The bridge implements infrastructure
(cron ticker, webhook endpoints, event matchers). GitHub issues, PRs,
email, Slack events all use the generic Webhook cadence — no
special-casing in the engine. Webhook payload injected as
state["trigger_payload"] in the thread's Python context.

Mission struct extended:
- current_focus: what the next thread should work on (evolving)
- approach_history: what we've tried (for adaptation)
- max_threads_per_day / threads_today: daily budget
- last_trigger_payload: webhook/event data for thread context

Plan updated with trigger type table and webhook integration design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): implement MissionManager execution with meta-prompts

The MissionManager now builds evolving meta-prompts and processes
thread outcomes for continuous learning:

fire_mission() upgraded:
- Loads Project MemoryDocs via RetrievalEngine for context
- Builds meta-prompt from: goal, current_focus, approach_history,
  project knowledge docs, trigger payload, thread count
- Spawns thread with meta-prompt as user message
- Background task waits for completion and processes outcome
- Daily thread budget enforcement (max_threads_per_day)

Meta-prompt structure:
  # Mission: {name}
  Goal: {goal}
  ## Current Focus (evolves between threads)
  ## Previous Approaches (what we've tried)
  ## Knowledge from Prior Threads (lessons, playbooks, issues)
  ## Trigger Payload (webhook/event data if applicable)
  ## Instructions (accomplish step, report next focus, check goal)
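A minimal Python sketch of the meta-prompt assembly (section headings from the structure above; parameter names and wording are illustrative, the real builder is Rust):

```python
def build_meta_prompt(name, goal, current_focus, approaches, knowledge_docs,
                      trigger_payload=None, thread_count=0):
    """Assemble the evolving mission meta-prompt from mission state."""
    parts = [f"# Mission: {name}", f"Goal: {goal}"]
    parts.append("## Current Focus\n"
                 + (current_focus or "Not set; start from the goal."))
    if approaches:
        parts.append("## Previous Approaches\n"
                     + "\n".join(f"- {a}" for a in approaches))
    if knowledge_docs:
        parts.append("## Knowledge from Prior Threads\n"
                     + "\n".join(f"- {d}" for d in knowledge_docs))
    if trigger_payload:
        # Only present for webhook/event-triggered fires.
        parts.append(f"## Trigger Payload\n{trigger_payload}")
    parts.append("## Instructions\n"
                 "Accomplish the current focus, report the next focus, "
                 f"and check whether the goal is achieved. (thread #{thread_count + 1})")
    return "\n\n".join(parts)
```

Because the focus, approach history, and knowledge docs change between threads, every fire produces a different prompt: this is what makes the mission evolve rather than replay a fixed routine.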

Outcome processing:
- Extracts "next focus:" from FINAL() response → updates current_focus
- Detects "goal achieved: yes" → completes mission
- Records accomplishment in approach_history
- Failed threads recorded as "FAILED: {error}"
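The outcome-marker extraction can be sketched as (markers from this commit; the parsing details are illustrative, the real code is Rust):

```python
def process_outcome(final_text):
    """Extract 'Next focus:' and 'Goal achieved: yes' from a FINAL() response.

    Returns (next_focus or None, goal_achieved bool).
    """
    next_focus = None
    goal_achieved = False
    for line in final_text.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if lower.startswith("next focus:"):
            next_focus = stripped[len("next focus:"):].strip()
        if lower.startswith("goal achieved:") and "yes" in lower:
            goal_achieved = True
    return next_focus, goal_achieved
```

The caller then updates mission.current_focus, completes the mission when the goal is achieved, and appends an approach_history entry either way.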

Cron ticker:
- start_cron_ticker() spawns tokio task, ticks every 60s
- Checks active Cron missions, fires those past next_fire_at

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire MissionManager into engine v2 for CodeAct access

Missions are now callable from CodeAct Python code:

```python
# Create a daily briefing mission
result = mission_create(
    name="Tech News",
    goal="Daily AI/crypto/software news briefing",
    cadence="0 9 * * *"
)

# List all missions
missions = mission_list()

# Manually fire a mission
mission_fire(id="...")

# Pause/resume
mission_pause(id="...")
mission_resume(id="...")
```

Implementation:
- MissionManager created on engine init, cron ticker started
- EffectBridgeAdapter intercepts mission_* function calls before tool
  lookup and routes to MissionManager
- parse_cadence() handles: "manual", cron expressions, "event:pattern",
  "webhook:path"
- Mission functions documented in CodeAct system prompt
- MissionManager set on adapter via set_mission_manager() after init
  (avoids circular dependency)
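The cadence parsing described above can be sketched as (the four spellings come from this commit; return shape is illustrative, the real parse_cadence is Rust returning MissionCadence):

```python
def parse_cadence(spec):
    """Map a cadence string to a (kind, arg) pair."""
    spec = spec.strip()
    if spec == "manual":
        return ("manual", None)
    if spec.startswith("event:"):
        return ("on_event", spec[len("event:"):])
    if spec.startswith("webhook:"):
        return ("webhook", spec[len("webhook:"):])
    # Anything else is treated as a cron expression, e.g. "0 9 * * *".
    return ("cron", spec)
```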

System prompt updated with mission_create, mission_list, mission_fire,
mission_pause, mission_resume documentation.

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): map routine_* calls to mission operations in v2

When the model calls routine_create, routine_list, routine_fire,
routine_pause, routine_resume, or routine_delete, the bridge now
routes them to the MissionManager instead of blocking with an error.

Mapping:
  routine_create → mission_create (with cadence parsing)
  routine_list   → mission_list
  routine_fire   → mission_fire
  routine_pause  → mission_pause
  routine_resume → mission_resume
  routine_update → mission_pause/resume (based on params)
  routine_delete → mission_complete (marks as done)

Routine tools removed from v1-only blocklist and restored in
available_actions(). The model can use either "routine" or "mission"
vocabulary — both work.

Still blocked: create_job, cancel_job, build_software (need v1
Scheduler/ContainerJobManager refs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add E2E mission flow tests — 6 new tests
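(no content)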

Comprehensive mission lifecycle tests:

- fire_mission_builds_meta_prompt_with_goal: verifies thread spawned
  with project context and recorded in history
- outcome_processing_extracts_next_focus: "Next focus: X" in FINAL()
  response → mission.current_focus updated
- outcome_processing_detects_goal_achieved: "Goal achieved: yes" →
  mission status transitions to Completed
- mission_evolves_via_direct_outcome_processing: 3-step evolution:
  step 1 sets focus to "db module", step 2 evolves to "tools module",
  step 3 detects goal achieved → mission completes. Tests the full
  learning loop without background task timing dependencies.
- fire_with_trigger_payload: webhook payload stored on mission and
  threads_today counter incremented
- daily_budget_enforced: max_threads_per_day=1 → first fire succeeds,
  second returns None

157 tests passing (151 prior + 6 new mission E2E).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): self-improving engine via Mission system

Wire the self-improvement loop as a Mission with OnSystemEvent cadence,
inspired by karpathy/autoresearch's program.md approach. The mission
fires when threads complete with issues, receives trace data as trigger
payload, and uses tools directly to diagnose and fix problems.

Key changes:

Engine self-improvement (Phase A+B from design doc):
- Add fire_on_system_event() to MissionManager for OnSystemEvent cadence
- Add start_event_listener() that subscribes to thread events and fires
  matching missions when non-Mission threads complete with trace issues
- Add ensure_self_improvement_mission() with autoresearch-style goal
  prompt (concrete loop steps, not vague instructions)
- Add process_self_improvement_output() for structured JSON fallback
- Seed fix pattern database with 8 known patterns from debugging
- Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now
  async + Store-aware, appends learned rules from prompt_overlay docs)
- Pass Store to ExecutionLoop for overlay loading

Bridge review fixes (P1/P2):
- Scope engine v2 SSE events to requesting user (broadcast_for_user)
- Per-user pending approvals via HashMap instead of global Option
- Reset tool-call limit counter before each thread execution
- Only persist auto-approval when user chose "always", not one-off "yes"
- Remove dead store/mission_manager fields from EngineState

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add checkpoint-based engine thread recovery

* feat(engine): add Python orchestrator module and host functions

Add the orchestrator infrastructure for replacing the Rust execution
loop with versioned Python code. This commit adds the module and host
functions without switching over — the existing Rust loop is unchanged.

New files:
- orchestrator/default.py: v0 Python orchestrator (run_loop + helpers)
- executor/orchestrator.rs: host function dispatch, orchestrator
  loading from Store with version selection, OrchestratorResult parsing

Host functions exposed to orchestrator Python via Monty suspension:
  __llm_complete__, __execute_code_step__ (nested Monty VM),
  __execute_action__, __check_signals__, __emit_event__,
  __add_message__, __save_checkpoint__, __transition_to__,
  __retrieve_docs__, __check_budget__, __get_actions__

Also makes json_to_monty, monty_to_json, monty_to_string pub(crate)
in scripting.rs for cross-module use.

Design doc: docs/plans/2026-03-25-python-orchestrator.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): switch ExecutionLoop::run() to Python orchestrator

Replace the 900-line Rust execution loop with a ~80-line bootstrap
that loads and runs the versioned Python orchestrator via Monty VM.

The orchestrator Python code (orchestrator/default.py) is the v0
compiled-in version. Runtime versions can override it via MemoryDoc
storage (orchestrator:main with tag orchestrator_code).

Key fixes during switchover:
- Use ExtFunctionResult::NotFound for unknown functions so Monty
  falls through to Python-defined functions (extract_final, etc.)
- Move helper function definitions above run_loop for Monty scoping
- Use FINAL result value (not VM return value) in Complete handler
- Rename 'final' variable to 'final_answer' to avoid Python keyword

Status: 171/177 tests pass. 6 remaining failures are step_count and
token tracking bookkeeping — the orchestrator manages these internally
but doesn't yet update the thread's counters via host functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): all 177 tests pass with Python orchestrator

- Increment step_count and track tokens in __emit_event__("step_completed")
  so thread bookkeeping matches the old Rust loop behavior
- Remove double-counting of tokens in bootstrap (orchestrator handles it)
- Match nudge text to existing TOOL_INTENT_NUDGE constant
- Fix FINAL result propagation (use stored final_result, not VM return)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): orchestrator versioning, auto-rollback, and tests

Add version lifecycle for the Python orchestrator:
- Failure tracking via MemoryDoc (orchestrator:failures)
- Auto-rollback: after 3 consecutive failures, skip the latest version
  and fall back to previous (or compiled-in v0)
- Success resets the failure counter
- OrchestratorRollback event for observability
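The version-selection-with-rollback rule can be sketched as (the highest-wins and 3-consecutive-failures behavior comes from this commit; the function shape is illustrative, the real logic is Rust over MemoryDoc-stored versions):

```python
DEFAULT_VERSION = 0  # the compiled-in v0 orchestrator


def select_version(available, consecutive_failures, max_failures=3):
    """Pick the orchestrator version to run.

    Highest version wins, unless the latest has failed max_failures times
    in a row; then skip it and fall back to the previous version, or to
    the compiled-in default when none remain.
    """
    if not available:
        return DEFAULT_VERSION
    ordered = sorted(available)
    if consecutive_failures >= max_failures:
        ordered = ordered[:-1]  # auto-rollback: skip the latest version
    return ordered[-1] if ordered else DEFAULT_VERSION
```

A successful run resets consecutive_failures to zero, so a flaky-but-recovering version is not rolled back.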

Update self-improvement Mission goal with Level 1.5 instructions for
orchestrator patches — the agent can now modify the execution loop
itself via memory_write with versioned orchestrator docs.

12 new tests: version selection (highest wins), rollback after failures,
rollback to default, failure counting/resetting, outcome parsing for
all 5 ThreadOutcome variants.

189 tests pass, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 architecture, self-improvement, and dev history

Three new docs for contributors:

- engine-v2-architecture.md: Two-layer architecture (Rust kernel +
  Python orchestrator), five primitives, execution model with nested
  Monty VMs, bridge layer, memory/reflection, missions, capabilities

- self-improvement.md: Three improvement levels (prompt/orchestrator/
  config/code), autoresearch-inspired Mission loop, versioned
  orchestrator with auto-rollback, fix pattern database, safety model

- development-history.md: Summary of 6 Claude Code sessions that
  built the system, key design decisions and debugging moments,
  architecture evolution from 900-line Rust loop to Python orchestrator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): complete v2 side-by-side integration with gateway API

Wire engine v2 into the full submission pipeline and expose threads,
projects, and missions through the web gateway REST API.

Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear
submissions to engine v2 when ENGINE_V2=true. Previously only UserInput
and ApprovalResponse were handled; all other control commands fell
through to disconnected v1 sessions.

Bridge query layer — add 11 read-only query functions and 6 DTO types
so gateway handlers can inspect engine state (threads, steps, events,
projects, missions) without direct access to the EngineState singleton.

Gateway endpoints — new /api/engine/* routes:
  GET  /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events
  GET  /projects, /projects/{id}
  GET  /missions, /missions/{id}
  POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume

SSE events — add ThreadStateChanged, ChildThreadSpawned, and
MissionThreadSpawned AppEvent variants. Expand the bridge event mapper
to forward StateChanged and ChildSpawned engine events to the browser.

Engine crate — add ConversationManager::clear_conversation() for /new
and /clear commands.

Code quality — replace 10 .expect() calls with proper error returns,
remove dead AgentConfig.engine_v2 field, log silent init errors, fix
duplicate doc comment, improve fallthrough documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): empty call_id on ActionResult and trace analyzer false positives

Fix structured executor not stamping call_id onto ActionResult — the
EffectExecutor trait doesn't receive call_id, so the structured executor
must copy it from the original ActionCall after execution. Empty call_id
caused OpenAI-compatible providers to reject the next LLM request with
"Invalid 'input[2].call_id': empty string".

Fix trace analyzer false positives:
- code_error check now only scans User-role code output messages
  (prefixed with [stdout]/[stderr]/[code ]/Traceback), not System
  prompt which contains example error text
- missing_tool_output check now recognizes ActionResult messages as
  valid tool output (Tier 0 structured path)
- Add NotImplementedError to detected code error patterns
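The narrowed code_error scan can be sketched as (prefixes and error patterns come from this commit; the message shape is illustrative, the real analyzer is Rust):

```python
# Only these prefixes mark a message as actual code output.
CODE_OUTPUT_PREFIXES = ("[stdout]", "[stderr]", "[code ]", "Traceback")
ERROR_PATTERNS = ("Traceback", "NameError", "NotImplementedError")


def has_code_error(messages):
    """Scan only User-role code output messages for error patterns.

    Skips the System prompt, which may legitimately contain example
    error text, eliminating the false positives.
    """
    for role, text in messages:
        if role != "user":
            continue
        if not text.startswith(CODE_OUTPUT_PREFIXES):
            continue
        if any(pat in text for pat in ERROR_PATTERNS):
            return True
    return False
```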

New trace checks:
- empty_call_id: detect ActionResult messages with missing/empty
  call_id before they reach the LLM API (severity: Error)
- llm_error: extract LLM provider errors from Failed state reason
- orchestrator_error: extract orchestrator errors from Failed state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): add Missions tab to gateway UI

Add a full Missions page to the web gateway with list view, detail view,
and action buttons (Fire, Pause, Resume).

Backend: add /api/engine/missions/summary endpoint returning counts by
status (active/paused/completed/failed).

Frontend:
- New "Missions" tab between Jobs and Routines
- Summary cards showing mission counts by status
- Table with name, goal, cadence type, thread count, status, actions
- Detail view with goal, cadence, current focus, success criteria,
  approach history, spawned thread list, and action buttons
- Fire/Pause/Resume actions with toast notifications
- i18n support (English + Chinese)
- CSS following the existing routines/jobs patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): eagerly initialize engine v2 at startup

The gateway API endpoints (/api/engine/missions, etc.) call bridge
query functions that return empty results when the engine state hasn't
been initialized yet. Previously, initialization only happened lazily
on the first chat message via handle_with_engine().

Now when ENGINE_V2=true, the engine is initialized in Agent::run()
before channels start, so the self-improvement mission and other
engine state is available to gateway API endpoints immediately.

Also rename get_or_init_engine → init_engine and make it public so
it can be called from agent_loop.rs at startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): improve mission detail with markdown goal and thread table

- Goal rendered as full-width markdown block instead of plain-text
  meta item (uses existing renderMarkdown/marked)
- Current focus and success criteria also rendered as markdown
- Spawned threads shown as a clickable table with goal, type, state,
  steps, tokens, and created date instead of a UUID list
- Clicking a thread row opens an inline thread detail view showing
  metadata grid and full message history with markdown rendering
- Back button returns to the mission detail view
- Backend: mission detail now returns full thread summaries (goal,
  state, step_count, tokens) instead of just thread IDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): close SSE connections on page unload to prevent connection starvation

The browser limits concurrent HTTP/1.1 connections per origin to 6.
Without cleanup, SSE connections from prior page loads linger after
refresh/navigation, eating into the pool. After 2-3 refreshes, all 6
slots are consumed by stale SSE streams and new API fetch calls queue
indefinitely — the UI shows "connected" (SSE works) but data never
loads.

Add a beforeunload handler that closes both eventSource (chat events)
and logEventSource (log stream) so the browser can reuse connections
immediately on page reload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): support multiple gateway tabs by reducing SSE connections

Each browser tab opened 2 SSE connections (chat events + log events).
With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the
pool and couldn't load any data.

Three changes:

1. Lazy log SSE — only connect when the logs tab is active, disconnect
   when switching away. Most users rarely view logs, so this saves a
   connection slot per tab.

2. Visibility API — close SSE when the browser tab goes to background
   (user switches to another tab), reconnect when it becomes visible.
   Background tabs don't need real-time events.

3. Combined with the existing beforeunload cleanup, this means:
   - Active foreground tab: 1 connection (chat SSE only, +1 if logs tab)
   - Background tabs: 0 connections
   - Closed/refreshed tabs: 0 connections (beforeunload cleanup)

This allows many gateway tabs to coexist within the 6-connection limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): route messages to correct conversation by thread scope

Messages sent from a new conversation in the gateway always appeared in
the default assistant conversation because handle_with_engine ignored
the thread_id from the frontend.

Two fixes:

1. Engine conversation scoping — when the message carries a thread_id
   (from the frontend's conversation picker), use it as part of the
   engine conversation key: "gateway:<thread_id>" instead of just
   "gateway". This creates a distinct engine conversation per v1
   thread, so messages don't cross-contaminate.

2. V1 dual-write targeting — write user messages and assistant
   responses to the v1 conversation matching the thread_id (via
   ensure_conversation), not the hardcoded assistant conversation.
   Falls back to the assistant conversation when no thread_id is
   present (e.g., default chat).
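The conversation-key scoping reduces to a one-liner (sketch; key format from this commit, function name illustrative):

```python
def engine_conversation_key(thread_id=None):
    """Scope the engine conversation per v1 thread when one is supplied.

    Falls back to the shared "gateway" key for the default chat.
    """
    return f"gateway:{thread_id}" if thread_id else "gateway"
```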

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): richer activity indicators for engine v2 execution

The gateway UI showed only generic "Thinking..." during engine v2
execution with no visibility into CodeAct code execution, tool calls,
or reflection. Now the event mapping produces detailed status updates:

Step lifecycle:
- "Calling LLM..." when a step starts (was "…
drchirag1991 pushed a commit to drchirag1991/ironclaw that referenced this pull request Apr 8, 2026
* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+
modules across agent, orchestrator, worker, tools, and extensions — it
had become the application-wide event protocol, not a web transport
concern.

Create crates/ironclaw_common as a shared workspace crate and move the
enum there as AppEvent.  Also move the truncate_preview utility which
was similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)
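For orientation, the shared helper behaves roughly like this Python sketch (the actual signature, default length, and ellipsis handling of the Rust truncate_preview are assumptions here):

```python
def truncate_preview(text, max_chars=120):
    """Shorten text for log/UI previews, appending an ellipsis marker."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."
```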

Aligned with the event bus direction on refactor/architectural-hardening
where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: add AppEvent::event_type() helper, deduplicate match blocks

Address Gemini review: extract the variant→string match into a single
method on AppEvent, replacing the duplicated 22-arm matches in sse.rs
and types.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename leftover sse vars/tests to match AppEvent rename

Address Copilot review: rename sse_event vars to app_event in
orchestrator/api.rs and ws.rs, rename test functions from
test_ws_server_from_sse_* to test_ws_server_from_app_event_*, and
update stale SSE comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: add Deserialize to AppEvent, round-trip test, fix stale comments

Address zmanian review:
- Add Deserialize derive to AppEvent so downstream consumers can
  deserialize incoming events
- Add event_type_matches_serde_type_field test that round-trips every
  variant through serde and asserts event_type() matches the serialized
  "type" field — catches drift between serde renames and the manual match
- Add round_trip_deserialize test for basic Serialize/Deserialize parity
- Update remaining "SSE" references in comments across server.rs,
  manager.rs, ws_gateway_integration.rs, and worker/job.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
drchirag1991 pushed a commit to drchirag1991/ironclaw that referenced this pull request Apr 8, 2026
…architecture) (nearai#1557)

* v2 architecture phase 1

* feat(engine): Phase 2 — execution loop, capability system, thread runtime

Add the core execution engine to ironclaw_engine crate:

- CapabilityRegistry: register/get/list capabilities and actions
- LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire)
- PolicyEngine: deterministic effect-level allow/deny/approve
- ThreadTree: parent-child relationship tracking
- ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc
- ThreadManager: spawn threads as tokio tasks, stop, inject messages, join
- ExecutionLoop: core loop replacing run_agentic_loop() with signals,
  context building, LLM calls, action execution, and event recording
- Structured executor (Tier 0): lease lookup → policy check → effect execution
- Tool intent nudge detection
- MemoryStore + RetrievalEngine stubs for Phase 4
- Full 8-phase architecture plan in docs/plans/
- CLAUDE.md spec for the engine crate

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 3 — Monty Python executor with RLM pattern

Add CodeAct execution (Tier 1) using the Monty embedded Python
interpreter, following the Recursive Language Model (RLM) pattern
from arXiv:2512.24601.

Key additions:
- executor/scripting.rs: Monty integration with FunctionCall-based
  tool dispatch, catch_unwind panic safety, resource limits (30s,
  64MB, 1M allocs)
- LlmResponse::Code variant + ExecutionTier::Scripting
- Context-as-variables (RLM 3.4): thread messages, goal, step_number,
  previous_results injected as Python variables — LLM context stays
  lean while code accesses data selectively
- llm_query(prompt, context) (RLM 3.5): recursive subagent calls
  from within Python code — results stored as variables, not injected
  into parent's attention window (symbolic composition)
- Compact output metadata between code steps instead of full stdout
- MontyObject ↔ serde_json::Value bidirectional conversion
- Updated architecture plan with RLM design principles

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): RLM best-practices enhancements from cross-reference analysis

Cross-referenced our implementation against the official RLM (alexzhang13/rlm),
fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation.
Key enhancements:

- FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching
  all three reference implementations. Code can signal completion at any
  point, not just via return value.
- llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn,
  matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch.
- Output truncation increased to 8000 chars (from 120), matching Prime
  Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT].
- Step 0 orientation preamble: auto-injects context metadata (message count,
  total chars, goal, last user message preview) before first code step,
  matching fast-rlm's auto-print pattern.
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors,
  OS errors, and async errors now flow back as stdout content instead of
  terminating the step, enabling LLM self-correction on next iteration.
  Only VM panics (catch_unwind) terminate as EngineError.

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with RLM cross-reference learnings

Comprehensive update after cross-referencing against official RLM
(alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect
(verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM.

Changes:
- Mark Phases 1-3 as DONE with commit refs and test counts
- Add "Key Influences" section documenting all reference implementations
- Phase 3: full table of implemented RLM features with sources
- Phase 3: "Remaining gaps" table with which phase addresses each
- Phase 4: expanded with compaction (85% context), rlm_query() (full
  recursive sub-agent), dual model routing, budget controls (USD,
  timeout, tokens, consecutive errors), lazy loading, pass-by-reference
- Add "RLM Execution Model" cross-cutting section
- Add "Implementation Progress" tracking table
- Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 4 — budget controls, compaction, reflection pipeline

Budget enforcement in ExecutionLoop:
- max_tokens_total: cumulative token limit, checked before each iteration
- max_duration: wall-clock timeout for entire thread
- max_consecutive_errors: consecutive error steps threshold (resets on
  success, matching official RLM behavior)
- All produce ThreadOutcome::Failed with descriptive messages
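The pre-iteration budget gate can be sketched as (the three limits come from this commit; the function shape and messages are illustrative, the real checks live in the Rust ExecutionLoop):

```python
def check_budget(tokens_used, max_tokens, elapsed_secs, max_secs,
                 consecutive_errors, max_consecutive_errors):
    """Return a failure reason if any budget is exhausted, else None.

    Called before each loop iteration; the caller resets the
    consecutive-error counter on any successful step.
    """
    if tokens_used >= max_tokens:
        return f"token budget exhausted ({tokens_used}/{max_tokens})"
    if elapsed_secs >= max_secs:
        return f"wall-clock timeout ({elapsed_secs:.0f}s/{max_secs:.0f}s)"
    if consecutive_errors >= max_consecutive_errors:
        return f"{consecutive_errors} consecutive error steps"
    return None
```

A non-None result maps to ThreadOutcome::Failed with the reason as the descriptive message.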

Context compaction (from RLM paper, 85% threshold):
- estimate_tokens(): char-based estimation (chars/4, matching RLM)
- should_compact(): triggers when tokens >= threshold_pct * context_limit
- compact_messages(): asks LLM to summarize progress, replaces history
  with [system, summary, continuation_note], preserves intermediate results
- Configurable via ThreadConfig: model_context_limit, compaction_threshold

Dual model routing:
- LlmCallConfig gains depth field (0=root, 1+=sub-call)
- Implementations can route to cheaper models for sub-calls
- ExecutionLoop passes thread depth to every LLM call

Reflection pipeline (reflection/pipeline.rs):
- reflect(thread, llm): analyzes completed thread via LLM
- Produces Summary doc (always), Lesson doc (if errors), Issue doc (if failed)
- Builds transcript from thread messages + error events
- Returns ReflectionResult with docs + token usage

ThreadConfig extended with: max_tokens_total, max_consecutive_errors,
model_context_limit, enable_compaction, compaction_threshold, depth, max_depth.

78 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 5 — conversation surface separated from execution

Conversation is now a UI layer, not an execution boundary. Multiple
threads can run concurrently within one conversation; threads can
outlive their originating conversation.

New types (types/conversation.rs):
- ConversationSurface: channel + user + entries + active_threads
- ConversationEntry: sender (User/Agent/System) + content + origin_thread_id
- ConversationId, EntryId (UUID newtypes)
- EntrySender enum (User, Agent{thread_id}, System)

ConversationManager (runtime/conversation.rs):
- get_or_create_conversation(channel, user) — indexed by (channel, user)
- handle_user_message() — injects into active foreground thread or spawns new
- record_thread_outcome() — adds agent/system entries, untracks completed threads
- get_conversation(), list_conversations()

This enables the key architectural insight: a user can ask "what's the
weather?" while a deployment thread is still running. Both produce entries
in the same conversation.

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM

Restructure phases 6-8 to clarify execution model:

- Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker
  Python runtimes for LLM-generated code.
- WASM sandbox is for third-party tool isolation (existing infra, Phase 8)
- Docker containers are for thread-level isolation of high-risk work (Phase 8)
- Two-phase commit moves to Phase 6 (integration) at the adapter boundary

Phase renumbering:
- Old Phase 6 (Tier 2-3) → removed as separate phase
- Old Phase 7 (integration) → Phase 6
- Old Phase 8 (cleanup) → Phase 7
- New Phase 8: WASM tools + Docker thread isolation (infra integration)

Updated progress table: Phases 1-5 marked DONE with test counts and commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 6 — bridge adapters for main crate integration

Strategy C parallel deployment: when ENGINE_V2=true env var is set,
user messages route through the engine instead of the existing agentic
loop. All existing behavior is unchanged when the flag is off.

Bridge module (src/bridge/):
- LlmBridgeAdapter: wraps LlmProvider as engine LlmBackend, converts
  ThreadMessage↔ChatMessage, ActionDef↔ToolDefinition, depth-based
  model routing (primary vs cheap_llm)
- EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as EffectExecutor,
  routes tool calls through existing execute_tool_with_safety pipeline
- InMemoryStore: HashMap-backed Store impl (no DB tables needed yet)
- EngineRouter: is_engine_v2_enabled() + handle_with_engine() that
  builds engine from Agent deps and processes messages end-to-end

Integration touchpoint (4 lines in agent_loop.rs):
  After hook processing, before session resolution, check ENGINE_V2
  flag and route UserInput through the engine path.
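
A minimal sketch of the opt-in gate (hypothetical helper mirroring the described is_engine_v2_enabled; the real check lives in EngineRouter):

```python
import os

def is_engine_v2_enabled() -> bool:
    # Sketch: the engine path is opt-in via an env var, so all
    # existing behavior is unchanged when the flag is off.
    return os.environ.get("ENGINE_V2", "").lower() in ("1", "true", "yes")
```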

Accessor visibility widened: llm(), cheap_llm(), safety(), tools()
changed from pub(super) to pub(crate) for bridge access.

85 engine tests + main crate clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add user message and system prompt to thread before execution

The ExecutionLoop was sending empty messages to the LLM because the
thread was spawned with the user's input as the goal but no messages.

Fixes:
- ThreadManager.spawn_thread() now adds the goal as an initial user
  message before starting the execution loop
- ExecutionLoop.run() injects a default system prompt if none exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): match existing LLM request format to prevent 400 errors

The LLM bridge was missing several defaults that the existing
Reasoning.respond_with_tools() sets:

- tool_choice: "auto" when tools are present (required by some providers)
- max_tokens: 4096 (default)
- temperature: 0.7 (default)
- When no tools (force_text): use plain complete() instead of
  complete_with_tools() with empty tools array — matches existing
  no-tools fallback path
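
A hedged sketch of the defaulting logic (field names are illustrative, not the crate's actual request struct):

```python
def build_llm_request(messages, tools):
    # Sketch of the defaults the bridge now applies to match
    # Reasoning.respond_with_tools(); names are illustrative.
    req = {
        "messages": messages,
        "max_tokens": 4096,    # default
        "temperature": 0.7,    # default
    }
    if tools:
        req["tools"] = tools
        req["tool_choice"] = "auto"  # required by some providers
    # With no tools, omit tools/tool_choice entirely: the bridge uses
    # the plain completion path instead of complete_with_tools().
    return req
```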

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): persist conversation context across messages

The engine was creating a fresh ThreadManager and InMemoryStore per
message, losing all context between turns. A follow-up question like
"what are the latest 10 issues?" had no memory of the prior "how many
issues" response.

Fixes:
- EngineState (ThreadManager, ConversationManager, InMemoryStore) now
  persists across messages via OnceLock, initialized on first use
- ConversationManager builds message history from prior conversation
  entries (user messages + agent responses) and passes it to new threads
- ThreadManager.spawn_thread_with_history() accepts initial_messages
  that are prepended before the current user message
- System notifications (thread started/completed) are filtered out of
  the history (not useful as LLM context)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): enable CodeAct/RLM mode with code block detection

The engine now operates in CodeAct/RLM mode:

System prompt (executor/prompt.rs):
- Instructs LLM to write Python in ```repl fenced blocks
- Documents available tools as callable Python functions
- Documents llm_query(), llm_query_batched(), FINAL()
- Documents context variables (context, goal, step_number, previous_results)
- Strategy guidance: examine context, break into steps, use tools, call FINAL()

Code block detection (bridge/llm_adapter.rs):
- extract_code_block() scans LLM text responses for ```repl or ```python blocks
- When detected, returns LlmResponse::Code instead of LlmResponse::Text
- The ExecutionLoop routes Code responses through Monty for execution

No structured tool definitions sent to LLM:
- Tools are described in the system prompt as Python functions
- The LLM call sends an empty actions array, forcing text-mode responses
- This ensures the LLM writes code blocks (CodeAct) instead of
  structured tool calls (which would bypass the REPL)

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add 8 CodeAct/RLM E2E tests with mock LLM

Comprehensive test coverage for the Monty Python execution path:

- codeact_simple_final: Python code calls FINAL('answer') → thread completes
- codeact_tool_call_then_final: code calls test_tool() → FunctionCall
  suspends VM → MockEffects returns result → code resumes → FINAL()
- codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15')
  with no tool calls — pure Python in Monty
- codeact_multi_step: first step prints output (no FINAL), second step
  sees output metadata and calls FINAL — tests iterative REPL flow
- codeact_error_recovery: first step has NameError → error flows to LLM
  as stdout → second step recovers with FINAL — tests error transparency
- codeact_context_variables_available: code accesses `goal` and `context`
  variables injected by the RLM context builder
- codeact_multiple_tool_calls_in_loop: for loop calls test_tool() 3 times
  → 3 FunctionCall suspensions → all results collected → FINAL
- codeact_llm_query_recursive: code calls llm_query('prompt') → VM
  suspends → MockLlm provides sub-agent response → result returned as
  Python string variable

93 tests passing (85 prior + 8 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): detect code blocks in plain completion path + multi-block support

Two bugs fixed:

1. The no-tools completion path (used by CodeAct since we send empty
   actions) returned LlmResponse::Text without checking for code blocks.
   Code blocks were rendered as markdown text instead of being executed.

2. extract_code_block now:
   - Handles bare ``` fences (skips non-Python languages)
   - Collects ALL code blocks in the response and concatenates them
     (models often split code across multiple blocks with explanation)
   - Tries markers in order: ```repl, ```python, ```py, then bare ```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(bridge): add 11 regression tests for code block extraction

Covers the exact failure modes discovered during live testing:

- extract_repl_block: standard ```repl fenced block
- extract_python_block: ```python marker
- extract_py_block: ```py shorthand
- extract_bare_backtick_block: bare ``` with Python content
- skip_non_python_language: ```json should NOT be extracted
- no_code_blocks_returns_none: plain text, no fences
- multiple_code_blocks_concatenated: two ```repl blocks with
  explanation between them → concatenated with \n\n
- mixed_thinking_and_code: model outputs explanation + two
  ```python blocks (the Hyperliquid case) → both extracted
- repl_preferred_over_bare: ```repl takes priority over bare ```
- empty_code_block_skipped: empty fenced block returns None
- unclosed_block_returns_none: no closing ``` returns None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): detect FINAL() in text responses + regression tests

Models sometimes write FINAL() outside code blocks — as plain text
after an explanation. The Hyperliquid case: model outputs a long
analysis then FINAL("""...""") at the end, not inside ```repl fences.

Fixes:
- extract_final_from_text(): regex-based FINAL detection in text
  responses, matching the official RLM's find_final_answer() fallback
- Handles: double-quoted, single-quoted, triple-quoted, unquoted,
  nested parens
- Checked in LlmResponse::Text handler BEFORE tool intent nudge
  (FINAL takes priority)

9 new tests:
- codeact_final_in_text_response: FINAL("answer") in plain text
- codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text
- final_double_quoted, final_single_quoted, final_triple_quoted,
  final_unquoted, final_with_nested_parens, final_after_long_text,
  no_final_returns_none

102 tests passing (93 + 9 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add crate extraction & cleanup roadmap

Documents architectural recommendations from the engine v2 design
process for future reference:

- Root directory consolidation (channels-src + tools-src → extensions/)
- Crate extraction tiers: zero-coupling (estimation, observability,
  tunnel), trivial-coupling (document_extraction, pairing, hooks),
  medium-coupling (secrets, MCP, db, workspace, llm, skills),
  heavy-coupling (web gateway, agent, extensions)
- src/ module reorganization into logical groups (core, persistence,
  infra, media, support)
- main.rs/app.rs slimming targets (100/500 lines after migration)
- WASM module candidates (document_extraction) and non-candidates
  (REPL, web gateway → separate crates instead)
- Priority ordering for extraction work
- Tracks completed items (ironclaw_safety, ironclaw_engine,
  transcription move)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): live progress status updates via event broadcast

Engine v2 now shows live progress in the CLI (and any channel):
- "Thinking..." when a step starts
- Tool name + success/error when actions execute
- "Processing results..." when a step completes

Implementation:
- ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256)
- ExecutionLoop.emit_event() writes to thread.events AND broadcasts
- ThreadManager.subscribe_events() returns a receiver
- Router uses tokio::select! to listen for events while waiting for
  thread completion, forwarding them as StatusUpdate to the channel

This replaces the polling approach with zero-latency event streaming.
Agent.channels visibility widened to pub(crate) for bridge access.

102 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): include tool results in code step output for LLM context

The LLM was ignoring tool results and answering from training data
because the compact output metadata didn't include what tools returned.
Tool results lived only as ActionResult messages (role: Tool) which
some providers flatten or the model ignores.

Now the code step output includes:
- stdout from Python print() statements
- [tool_name result] with the actual output (truncated to 4K per tool)
- [tool_name error] for failed tools
- [return] for the code's return value
- Total output truncated to 8K chars to prevent context bloat

This ensures the model sees web_search results, API responses, etc.
in the next iteration and can reason about them instead of hallucinating.
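
A rough sketch of the output assembly (truncation limits from above; helper name and shapes are hypothetical):

```python
def build_step_output(stdout, tool_results, return_value,
                      per_tool_limit=4000, total_limit=8000):
    # Sketch: combine everything the model needs to see next iteration.
    parts = []
    if stdout:
        parts.append(stdout)
    for name, (ok, output) in tool_results.items():
        tag = "result" if ok else "error"
        parts.append(f"[{name} {tag}] {output[:per_tool_limit]}")
    if return_value is not None:
        parts.append(f"[return] {return_value}")
    combined = "\n".join(parts)
    return combined[:total_limit]  # cap to prevent context bloat
```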

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add debug/trace logging for CodeAct execution

Three verbosity levels for debugging the engine:

RUST_LOG=ironclaw_engine=debug:
- LLM call: message count, iteration, force_text
- LLM response: type (text/code/action_calls), token usage
- Code execution: code length, action count, had_error, final_answer
- Text response: length, FINAL() detection

RUST_LOG=ironclaw_engine=trace:
- Full message list sent to LLM (role, length, first 200 chars each)
- Full code block being executed
- stdout preview (first 500 chars)
- Per-tool results (name, success, first 300 chars of output)
- Text response preview (first 500 chars)

Usage:
  ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run
  ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): execution trace recording + retrospective analysis

Enable with ENGINE_V2_TRACE=1 to get full execution traces and
automatic issue detection after each thread completes.

Trace recording (executor/trace.rs):
- build_trace(): captures full thread state — messages (with full
  content), events, step count, token usage, detected issues
- write_trace(): writes JSON to engine_trace_{timestamp}.json
- log_trace_summary(): logs summary + issues at info/warn level

Retrospective analyzer detects 8 issue categories:
- thread_failure: thread ended in Failed state
- no_response: no assistant message generated
- tool_error: specific tool failures with error details
- code_error: Python errors (NameError, SyntaxError, etc.) in output
- missing_tool_output: tool results exist but not in system messages
- excessive_steps: >10 steps (may be stuck in loop)
- no_tools_used: single-step answer without tools (hallucination risk)
- mixed_mode: text responses without code blocks (prompt not followed)
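
A sketch of a few of these checks over a trace dict (field names are illustrative, not the crate's actual trace schema):

```python
def analyze_trace(trace):
    """Sketch of a subset of the retrospective analyzer's checks."""
    issues = []
    if trace.get("state") == "Failed":
        issues.append("thread_failure")
    if not any(m["role"] == "Assistant" for m in trace["messages"]):
        issues.append("no_response")
    if trace.get("steps", 0) > 10:
        issues.append("excessive_steps")  # may be stuck in a loop
    if trace.get("steps") == 1 and not trace.get("tools_used"):
        issues.append("no_tools_used")    # hallucination risk
    return issues
```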

Thread state now saved to store after execution completes (for trace
access after join_thread).

Usage:
  ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run
  # After each message: trace JSON + issue log in terminal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): wire reflection pipeline + trace analysis into thread lifecycle

After every thread completes, ThreadManager now automatically runs:

1. Retrospective trace analysis (non-LLM, always):
   - Detects 8 issue categories (tool errors, code errors, missing
     outputs, excessive steps, hallucination risk, etc.)
   - Logs issues at warn level when found

2. Trace file recording (when ENGINE_V2_TRACE=1):
   - Writes full JSON trace to engine_trace_{timestamp}.json

3. LLM reflection (when enable_reflection=true):
   - Calls reflection pipeline to produce Summary, Lesson, Issue docs
   - Saves docs to store for future context retrieval
   - Enabled by default in the bridge router

All three run inside the spawned tokio task after exec.run() completes,
before saving the final thread state. No external wiring needed.

Removed duplicate trace recording from the router — it's now handled
by ThreadManager automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): convert tool name hyphens to underscores for Python compatibility

Root cause from trace analysis: the LLM writes `web_search()` (valid
Python identifier) but the tool registry has `web-search` (with hyphen).
The EffectBridgeAdapter couldn't find the tool → "Tool not found" error
→ the model fabricated data instead.

Fixes:
- available_actions(): converts tool names from hyphens to underscores
  (web-search → web_search) so the system prompt lists valid Python names
- execute_action(): tries the original name first, then falls back to
  hyphenated form (web_search → web-search) for tool registry lookup
- Same conversion in router's capability registry builder
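
The two-way mapping can be sketched as (helper names hypothetical):

```python
def python_name(tool_name: str) -> str:
    # web-search → web_search: a valid Python identifier for the prompt.
    return tool_name.replace("-", "_")

def lookup_tool(registry: dict, called_name: str):
    """Try the name as written, then fall back to the hyphenated form."""
    if called_name in registry:
        return registry[called_name]
    return registry.get(called_name.replace("_", "-"))
```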

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): parse JSON tool output to prevent double-serialization

From trace analysis: web_search returned a JSON string, which was
wrapped as serde_json::json!(string) creating a Value::String containing
JSON. When Monty got this as MontyObject::String, the Python code
couldn't index it with result['title'] → TypeError.

Fix: try parsing the tool output string as JSON first. If valid, use the
parsed Value (becomes a Python dict/list). If not valid JSON, keep as
string. This means web_search results are directly indexable in Python:
  results = web_search(query="...")
  print(results["results"][0]["title"])  # works now
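
The fix is essentially this parse-or-passthrough step (a Python sketch of the Rust serde_json logic):

```python
import json

def wrap_tool_output(raw: str):
    """Sketch: parse JSON tool output so sandboxed code can index it."""
    try:
        return json.loads(raw)  # becomes a dict/list inside Monty
    except (json.JSONDecodeError, TypeError):
        return raw              # not JSON: keep as a plain string
```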

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): persist variables across code steps via `state` dict

Monty creates a fresh runtime per code step, so variables are lost
between steps. This caused the model to re-paste tool results from
system messages, wasting tokens.

Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that
accumulates across steps:
- Tool results stored by tool name: state["web_search"] = {results...}
- Return values stored: state["last_return"], state["step_0_return"]
- Injected as a `state` Python variable in each new MontyRun

Now the model can do:
  Step 1: results = web_search(query="...")  # tool result saved in state
  Step 2: data = state["web_search"]         # access previous result
          summary = llm_query("summarize", str(data))
          FINAL(summary)

System prompt updated to document the `state` variable.
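
The accumulation rules above can be sketched as (helper name hypothetical):

```python
def update_persisted_state(state, step_index, tool_results, return_value):
    # Sketch: tool results keyed by tool name, return values under
    # both a rolling key and a per-step key.
    for tool_name, result in tool_results.items():
        state[tool_name] = result              # e.g. state["web_search"]
    if return_value is not None:
        state["last_return"] = return_value
        state[f"step_{step_index}_return"] = return_value
    return state
```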

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add state hint on code errors + retrieval engine integration

When code fails with NameError/UnboundLocalError (model trying to
access variables from a previous step), the error output now includes:

  [HINT] Variables don't persist between code blocks. Use the `state`
  dict to access data from previous steps. Available keys: ["web_search",
  "last_return"]

This teaches the model to use `state["web_search"]` instead of `result`
after a NameError, reducing wasted steps from 3-4 to 1.
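
A sketch of the hint injection (helper name hypothetical; the hint text follows the format above):

```python
def maybe_state_hint(error_text: str, state: dict) -> str:
    """Append the hint when the model reaches for a lost variable."""
    if "NameError" in error_text or "UnboundLocalError" in error_text:
        keys = sorted(state.keys())
        return (error_text +
                "\n[HINT] Variables don't persist between code blocks. "
                "Use the `state` dict to access data from previous steps. "
                f"Available keys: {keys}")
    return error_text
```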

Also integrates RetrievalEngine into context building and ThreadManager:
- build_step_context() now accepts optional RetrievalEngine to inject
  relevant memory docs (Lessons, Specs, Playbooks) into LLM context
- RetrievalEngine uses keyword matching with doc-type priority scoring
- Memory docs from reflection (Phase 4) now feed back into future threads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove trace files and add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace web_fetch example with web_search in CodeAct prompt

The system prompt example used web_fetch(url="...") which doesn't exist
as a tool. The model learned from the example and tried web_fetch,
getting "Tool not found". Changed to web_search(query="...") which is
an actual registered tool.

Found via trace analysis — reflection pipeline correctly identified
this as a "Tool Name Correction" spec doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(engine): extract prompt templates to markdown files

Prompt templates moved from inline Rust strings to plain markdown files
at crates/ironclaw_engine/prompts/ for easy inspection and iteration:

- prompts/codeact_preamble.md — main instructions, special functions,
  context variables, rules
- prompts/codeact_postamble.md — strategy section

Loaded at compile time via include_str!(), so no runtime file I/O.
Edit the .md files and rebuild to iterate on prompts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace byte-index slicing with char-safe truncation

Panic: 'byte index 80 is not a char boundary; it is inside ''' when
tool output contained multi-byte UTF-8 characters (smart quotes from
web search results).

Fixed 4 unsafe byte-index slices:
- thread.rs:281: message preview &content[..80] → chars().take(80)
- loop_engine.rs:556: tool output &str[..4000] → chars().take(4000)
- loop_engine.rs:579: output tail &str[len-8000..] → chars().skip()
- scripting.rs:82: stdout tail &str[len-N..] → chars().skip()

All now use .chars().take() or .chars().skip() which respect character
boundaries. Follows CLAUDE.md rule: "Never use byte-index slicing on
user-supplied or external strings."
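
The failure mode is easy to reproduce from Python, where byte slicing and character slicing are distinct operations:

```python
# Byte-index slicing can land inside a multi-byte UTF-8 character
# (a smart quote is 3 bytes), which is exactly the Rust panic above.
text = "a" * 79 + "\u2019suffix"  # U+2019 RIGHT SINGLE QUOTATION MARK

raw = text.encode("utf-8")
try:
    raw[:80].decode("utf-8")      # cuts the 3-byte quote in half
    broken = False
except UnicodeDecodeError:
    broken = True

# Character-based truncation (the .chars().take(n) equivalent) is safe.
preview = text[:80]
```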

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): fix false positive missing_tool_output warning in trace analyzer

The check was looking for "[" + "result]" in System-role messages only,
but tool output metadata is added with patterns like "[shell result]"
and may appear in messages with any role. Changed to scan all messages
for " result]" or " error]" patterns regardless of role.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with Phase 6 status and approval flow design

Phase 6 updated to reflect what was actually built:
- Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done
- Integration touchpoint (4 lines in handle_message) — done
- Live progress via broadcast events — done
- Conversation persistence across messages — done
- Trace recording + retrospective analysis — done
- 8 bugs found and fixed via trace analysis — documented

Phase 6 remaining work documented:
- Approval flow: detailed 5-step design (send to channel, pause thread,
  route response, resume execution, always handling) with v1 reference
- Database persistence (InMemoryStore → real DB tables)
- Acceptance testing (TestRig + TraceLlm fixtures)
- Two-phase commit for high-stakes effects

Progress table updated: Phase 6 marked as DONE (partial), 134 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add self-improving engine design plan

Designs a system where the engine debugs and improves itself, based on
the pattern observed in the last session: 5 consecutive bug fixes all
followed trace → read → identify → edit → test, using tools the engine
already has access to.

Three levels of self-improvement:
- Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply.
- Level 2 (Config): adjust defaults/mappings. Branch + test + PR.
- Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR.

Architecture: Self-improvement Mission spawns a Reflection thread that
reads traces, reads source, proposes fixes, validates via cargo test,
and either auto-applies (Level 1) or creates a PR (Level 2-3).

Includes: fix pattern database (seeded from our 8 debugging session
fixes), feedback loop diagram, safety model, implementation phases
(A through D), and what exists vs what's new.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 security model and audit

Comprehensive security analysis of engine v2 covering:

Threat model: 4 attacker profiles (malicious input, prompt injection
via tools, poisoned memory, supply chain).

Current state audit: 9 controls working (Monty sandbox, safety layer,
policy engine, leases, provenance, events) and 9 gaps identified.

Critical finding: ALL tools granted by default — CodeAct code can call
shell, write_file, apply_patch without approval. Proposed fix: 3-tier
tool classification (auto/approve-once/always-approve).

CodeAct-specific threats: tool call amplification, prompt injection via
search results, data exfiltration via tool chains, Monty escape.

Self-improvement security: poisoned trace attacks, memory poisoning via
reflection. Mitigations: edit validation, frequency caps, audit trail,
auto-rollback, reflection output scanning.

6-layer security architecture proposed: input validation, capability
gating, output sanitization, execution sandboxing, self-improvement
controls, observability.

Prioritized implementation plan with severity/effort ratings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(security): cross-reference v1 controls — use, don't reinvent

Updated security plan with detailed audit of ALL existing v1 security
controls and how they map to engine v2 bridge gaps:

Key finding: v1 already has solutions for every security gap identified.
The bridge just needs to wire them in:

- Tool::requires_approval() exists but bridge doesn't call it
- safety.wrap_for_llm() exists but tool results enter context unwrapped
- RateLimiter exists but bridge doesn't check rate limits
- BeforeToolCall hooks exist but bridge doesn't run them
- redact_params() exists but bridge doesn't redact sensitive params
- Shell risk classification (Low/Medium/High) is inherited but ignored

Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter,
not new security infrastructure. The bridge is the security boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy

- Add Mission type and MissionManager for recurring thread scheduling
- Add ReliabilityTracker for per-capability success/failure/latency tracking
- Add reflection executor that spawns CodeAct threads for post-completion reflection
- Extend PolicyEngine with provenance-aware taint checking (LLM-generated data
  requires approval for financial/external-write effects)
- Extend Store trait with mission CRUD methods
- Add conversation surface tracking, compaction token fix, context memory injection
- Wire new modules through lib.rs re-exports and bridge adapters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire v1 security controls into engine v2 adapter

Zero engine crate changes. All security controls enforced at the bridge
boundary in EffectBridgeAdapter:

1. Tool approval (v1: Tool::requires_approval):
   - Checks each tool's approval requirement with actual params
   - Always → returns EngineError::LeaseDenied (blocks execution)
   - UnlessAutoApproved → checks auto_approved set, blocks if not approved
   - Never → proceeds
   - Per-session auto_approved HashSet (for future "always" handling)

2. Hook interception (v1: BeforeToolCall):
   - Runs HookEvent::ToolCall before every execution
   - HookOutcome::Reject → blocks with reason
   - HookError::Rejected → blocks with reason
   - Hook errors → fail-open (logged, execution continues)

3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm):
   - Leak detection: API keys in tool output are redacted
   - Policy enforcement: content policy rules applied
   - Length truncation: output capped at 100KB
   - XML boundary protection: prevents injection via tool output

4. Sensitive param redaction (v1: redact_params):
   - Tool's sensitive_params() consulted before hooks see parameters
   - Redacted params sent to hooks, original params used for execution

5. available_actions() now sets requires_approval based on each tool's
   default approval requirement, so the engine's PolicyEngine can
   gate tools it hasn't seen before.

6. Actual execution timing measured via Instant::now() (replaces
   placeholder Duration::from_millis(1)).

Accessor visibility: hooks() widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): implement tool approval flow for engine v2

Adds a complete approval flow that mirrors v1 behavior, using the
existing v1 security controls (Tool::requires_approval, auto-approve
sets, StatusUpdate::ApprovalNeeded).

## How it works

### Step 1: Tool blocked at execution
When the LLM's code calls a tool (e.g., `shell("ls")`):
1. EffectBridgeAdapter.execute_action() looks up the Tool object
2. Calls tool.requires_approval(&params) — returns ApprovalRequirement
3. If Always → EngineError::LeaseDenied (always blocks)
4. If UnlessAutoApproved → checks auto_approved HashSet → if not in set,
   returns EngineError::LeaseDenied
5. If Never → proceeds to execution
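
The steps above can be sketched as (names mirror ApprovalRequirement and EngineError::LeaseDenied; this is an illustration, not the bridge's actual code):

```python
class LeaseDenied(Exception):
    """Sketch of EngineError::LeaseDenied at the bridge boundary."""

def check_approval(requirement, tool_name, auto_approved):
    # requirement mirrors Always / UnlessAutoApproved / Never
    if requirement == "always":
        raise LeaseDenied(tool_name)
    if requirement == "unless_auto_approved" and tool_name not in auto_approved:
        raise LeaseDenied(tool_name)
    # "never" or a previously approved tool proceeds to execution

def try_call(requirement, tool_name, auto_approved):
    try:
        check_approval(requirement, tool_name, auto_approved)
        return "executed"
    except LeaseDenied:
        return "needs_approval"
```

After the user replies "yes", the tool name is added to the auto-approved set and the re-run passes the same check immediately.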

### Step 2: Engine returns NeedApproval
The LeaseDenied error propagates through:
- CodeAct path: becomes Python RuntimeError, code halts, thread returns
  NeedApproval with action_name + parameters
- Structured path: same via ActionResult.is_error

### Step 3: Router stores pending approval
- PendingApproval { action_name, original_content } stored on EngineState
- StatusUpdate::ApprovalNeeded sent to channel (shows approval card in
  CLI/web with tool name, parameters, yes/always/no buttons)
- Returns text: "Tool 'shell' requires approval. Reply yes/always/no."

### Step 4: User responds
handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2:
- 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes
  original message (tool now passes the approval check on second run)
- 'always' → same + logs for session persistence
- 'no' → returns "Denied: tool was not executed."

### Key design choice
Instead of pausing/resuming mid-execution (which needs engine changes
to freeze/restore the Monty VM state), we auto-approve the tool and
re-run the full message. The EffectBridgeAdapter's auto_approved set
persists across runs, so the second execution passes immediately.

This trades one extra LLM call for zero engine modifications.

## Files changed
- src/bridge/router.rs: PendingApproval struct, handle_approval(),
  NeedApproval → StatusUpdate::ApprovalNeeded conversion
- src/bridge/mod.rs: export handle_approval
- src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2
- src/bridge/effect_adapter.rs: fmt fixes

151 tests passing, clippy + fmt clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): demote trace/reflection logging from info to debug

INFO-level log output from background tasks (trace analysis, reflection)
corrupts the REPL terminal UI. The trace summary, issue warnings, and
reflection doc previews were printing mid-approval-card, breaking the
interactive display.

Fix: all info!/warn! logging in trace.rs demoted to debug!. Trace
analysis and reflection results now only show when
RUST_LOG=ironclaw_engine=debug is set.

Also added logging discipline rule to global CLAUDE.md:
- info! → user-facing status the REPL intentionally renders
- debug! → internal diagnostics (traces, reflection, engine internals)
- Background tasks must NEVER use info! — it breaks the TUI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): demote all router info! logging to debug!

"engine v2: initializing" and "engine v2: handling message" were
printing at INFO level, corrupting the REPL UI. All router logging
now uses debug! — only visible with RUST_LOG=ironclaw=debug.

Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(safety): demote leak detector warn-action logs from warn! to debug!

The leak detector's Warn-action matches (high_entropy_hex pattern on
web search results containing commit SHAs, CSS colors, URL hashes)
were logging at warn! level, corrupting the REPL UI with lines like:
  WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5

These are informational false positives — real leaks use LeakAction::Redact
which silently modifies the content. Warn-action matches only log for
debugging purposes and should not appear in production output.

Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): strengthen CodeAct prompt to prevent shallow text answers

The model was answering "Suggested 45 improvements" as a brief text
summary from training data without actually searching or listing them.
The trace showed: no code block, no tool calls, no FINAL().

Prompt changes:
- Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with
  plain text only." (was: "Always write code... plain text for brief
  explanations")
- Rule 2 (NEW): "NEVER answer from memory or training data alone.
  Always use tools to get real, current information before answering."
- Rule 3: FINAL answer "should be detailed and complete — not just a
  summary like 'found 45 items'"
- Rule 8 (NEW): "Include the actual content in your FINAL() answer,
  not just a count or summary. Users want to see the details."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): persist reflection docs to workspace for cross-session learning

Replaces InMemoryStore with HybridStore:
- Ephemeral data (threads, steps, events, leases) stays in-memory
- MemoryDocs (lessons, specs, playbooks from reflection) persist to
  the workspace at engine/docs/{type}/{id}.json

On engine init, load_docs_from_workspace() reads existing docs back
into the in-memory cache. This means:
- Lessons learned in session 1 are available in session 2
- The RetrievalEngine injects relevant past lessons into new threads
- The engine genuinely improves over time as reflection accumulates

Workspace paths:
  engine/docs/lessons/{uuid}.json
  engine/docs/specs/{uuid}.json
  engine/docs/playbooks/{uuid}.json
  engine/docs/summaries/{uuid}.json
  engine/docs/issues/{uuid}.json

No new database tables. Uses existing workspace write/read/list.
workspace() accessor widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): adapt to execute_tool_with_safety params-by-value change

Staging merge changed execute_tool_with_safety to take params by value
instead of by reference (perf optimization from PR #926). Updated
bridge adapter to clone params before passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): add web gateway integration plan to Phase 6

Documents three gaps between engine v2 and the web gateway:
1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent)
2. No conversation persistence (engine uses HybridStore, gateway reads v1 DB)
3. No cross-channel visibility (REPL ↔ web messages invisible to each other)

Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1
conversation tables after thread completion. Prerequisite: AppEvent
extraction PR (in progress separately).

Also updated DB persistence status: HybridStore with workspace-backed
MemoryDocs is now implemented (partial persistence).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): document routine/job gap and SIGKILL crash scenario

Routines are entirely v1 — not hooked up to engine v2. When a user
asks "create a routine" as natural language, engine v2 tries to call
routine_create via CodeAct, but the tool needs RoutineEngine + Database
refs that the bridge's minimal JobContext doesn't provide. This caused
a SIGKILL crash during testing.

Options documented: block routine tools in v2 (short term), pass refs
through context (medium), replace with Mission system (long term).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+
modules across agent, orchestrator, worker, tools, and extensions — it
had become the application-wide event protocol, not a web transport
concern.

Create crates/ironclaw_common as a shared workspace crate and move the
enum there as AppEvent. Also move the truncate_preview utility, which
had similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening
where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.
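
The wire-format claim can be illustrated with a hand-rolled, std-only serializer (variant and field names here are assumptions): only the per-variant tag appears on the wire, so renaming the enum from SseEvent to AppEvent cannot change the serialized output.

```rust
#[derive(Debug)]
enum AppEvent { // formerly SseEvent — the Rust name never reaches the wire
    Thinking { text: String },
    Response { text: String },
}

// Hand-rolled stand-in for the serde derive: the "type" tag comes from
// the variant, not from the enum's name.
fn to_wire(ev: &AppEvent) -> String {
    match ev {
        AppEvent::Thinking { text } => format!(r#"{{"type":"thinking","text":"{text}"}}"#),
        AppEvent::Response { text } => format!(r#"{{"type":"response","text":"{text}"}}"#),
    }
}

fn main() {
    let evs = [
        AppEvent::Thinking { text: "checking".into() },
        AppEvent::Response { text: "done".into() },
    ];
    for ev in &evs {
        println!("{}", to_wire(ev));
    }
}
```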

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB

Three changes to make engine v2 visible in the web gateway:

1. SSE event streaming (AppEvent broadcast):
   - ThreadEvent → AppEvent conversion via thread_event_to_app_event()
   - Events broadcast to SseManager during the poll loop
   - Covers: Thinking, ToolCompleted (success/error), Status, Response
   - Web gateway receives real-time progress without any gateway changes

2. Conversation persistence to v1 database:
   - After thread completes, writes user message + agent response to
     v1 ConversationStore via add_conversation_message()
   - Uses get_or_create_assistant_conversation() for per-user per-channel
   - Web gateway reads from DB as usual — chat history appears

3. Final response broadcast:
   - AppEvent::Response with full text + thread_id sent via SSE
   - Web gateway renders the response in the chat UI

New EngineState fields: sse (Option<Arc<SseManager>>),
db (Option<Arc<dyn Database>>). Both populated from Agent.deps.

Agent.deps visibility widened to pub(crate).
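
The ThreadEvent → AppEvent conversion in item 1 can be sketched as a plain match; the enums below are illustrative stand-ins (variant names assumed from this commit message, not the real types):

```rust
enum ThreadEvent {
    Thinking(String),
    ToolCompleted { name: String, ok: bool },
    Completed { response: String },
}

#[derive(Debug)]
enum AppEvent {
    Thinking { text: String },
    Status { text: String },
    Response { text: String },
}

// Sketch of thread_event_to_app_event(): each engine event maps to the
// gateway-facing event the SseManager already knows how to broadcast.
fn thread_event_to_app_event(ev: ThreadEvent) -> AppEvent {
    match ev {
        ThreadEvent::Thinking(t) => AppEvent::Thinking { text: t },
        ThreadEvent::ToolCompleted { name, ok } => AppEvent::Status {
            text: format!("{name}: {}", if ok { "ok" } else { "error" }),
        },
        ThreadEvent::Completed { response } => AppEvent::Response { text: response },
    }
}

fn main() {
    let events = vec![
        ThreadEvent::Thinking("planning".into()),
        ThreadEvent::ToolCompleted { name: "shell".into(), ok: true },
        ThreadEvent::Completed { response: "all done".into() },
    ];
    for ev in events {
        println!("{:?}", thread_event_to_app_event(ev));
    }
}
```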

Depends on: ironclaw_common crate with AppEvent type (PR #1615).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits

Three security/stability improvements in EffectBridgeAdapter:

1. V1-only tool blocking:
   - routine_create, create_job, build_software (and hyphenated variants)
     return helpful error: "use the slash command instead"
   - Filtered out of available_actions() so system prompt doesn't list them
   - Prevents crash from tools needing RoutineEngine/Scheduler refs

2. Per-step tool call limit:
   - Max 50 tool calls per code block (AtomicU32 counter)
   - Prevents amplification: `for i in range(10000): shell(...)`
   - Returns "call limit reached, break into multiple steps"

3. Rate limiting:
   - Per-user per-tool sliding window via RateLimiter
   - Checks tool.rate_limit_config() before every execution
   - Returns "rate limited, try again in Ns"

Architecture plan updated:
- Gateway integration: DONE
- Routines: BLOCKED (gracefully, with slash command fallback)
- Rate limiting: DONE
- Call limit: DONE
- Phase 6 status: DONE (remaining: acceptance tests, two-phase commit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Mission system design — goal-oriented autonomous threads

Missions replace routines with evolving, knowledge-accumulating
autonomous agents. Unlike routines (fixed prompt, stateless), Missions:

- Generate prompts from accumulated Project knowledge (lessons,
  playbooks, issues from prior threads)
- Adapt approach when something fails repeatedly
- Track progress toward a goal with success criteria
- Self-manage: pause when stuck, complete when goal achieved

Architecture: MissionManager with cron ticker spawns threads via
ThreadManager. Meta-prompt built from mission goal + Project MemoryDocs
via RetrievalEngine. Reflection feeds back automatically.

6-step implementation plan: cron trigger, meta-prompt builder, bridge
wiring, CodeAct tools, progress tracking, persistence.

Includes two worked examples: daily tech news briefing (ongoing) and
test coverage improvement (goal-driven, self-completing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): extend Mission types with webhook/event triggers + evolving strategy

Mission types updated to support external activation sources:

MissionCadence expanded:
- Cron { expression, timezone } — timezone-aware scheduling
- OnEvent { event_pattern } — channel message pattern matching
- OnSystemEvent { source, event_type } — structured events from tools
- Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.)
- Manual — explicit triggering only

The engine defines trigger TYPES. The bridge implements infrastructure
(cron ticker, webhook endpoints, event matchers). GitHub issues, PRs,
email, Slack events all use the generic Webhook cadence — no
special-casing in the engine. Webhook payload injected as
state["trigger_payload"] in the thread's Python context.

Mission struct extended:
- current_focus: what the next thread should work on (evolving)
- approach_history: what we've tried (for adaptation)
- max_threads_per_day / threads_today: daily budget
- last_trigger_payload: webhook/event data for thread context

Plan updated with trigger type table and webhook integration design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): implement MissionManager execution with meta-prompts

The MissionManager now builds evolving meta-prompts and processes
thread outcomes for continuous learning:

fire_mission() upgraded:
- Loads Project MemoryDocs via RetrievalEngine for context
- Builds meta-prompt from: goal, current_focus, approach_history,
  project knowledge docs, trigger payload, thread count
- Spawns thread with meta-prompt as user message
- Background task waits for completion and processes outcome
- Daily thread budget enforcement (max_threads_per_day)

Meta-prompt structure:
  # Mission: {name}
  Goal: {goal}
  ## Current Focus (evolves between threads)
  ## Previous Approaches (what we've tried)
  ## Knowledge from Prior Threads (lessons, playbooks, issues)
  ## Trigger Payload (webhook/event data if applicable)
  ## Instructions (accomplish step, report next focus, check goal)

Outcome processing:
- Extracts "next focus:" from FINAL() response → updates current_focus
- Detects "goal achieved: yes" → completes mission
- Records accomplishment in approach_history
- Failed threads recorded as "FAILED: {error}"
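
The marker extraction above can be sketched with plain string scanning; a hedged approximation, not MissionManager's actual parser:

```rust
// Find a "next focus:" line (case-insensitive) in the FINAL() response
// and return everything after the marker.
fn extract_next_focus(final_text: &str) -> Option<String> {
    final_text.lines().find_map(|line| {
        let l = line.trim();
        if l.to_lowercase().starts_with("next focus:") {
            Some(l["next focus:".len()..].trim().to_string())
        } else {
            None
        }
    })
}

// Detect the completion signal described above.
fn goal_achieved(final_text: &str) -> bool {
    final_text.to_lowercase().contains("goal achieved: yes")
}

fn main() {
    let resp = "Did the briefing.\nNext focus: cover crypto funding rounds\nGoal achieved: no";
    println!("{:?}", extract_next_focus(resp));
    println!("{}", goal_achieved(resp));
}
```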

Cron ticker:
- start_cron_ticker() spawns tokio task, ticks every 60s
- Checks active Cron missions, fires those past next_fire_at

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire MissionManager into engine v2 for CodeAct access

Missions are now callable from CodeAct Python code:

```python
# Create a daily briefing mission
result = mission_create(
    name="Tech News",
    goal="Daily AI/crypto/software news briefing",
    cadence="0 9 * * *"
)

# List all missions
missions = mission_list()

# Manually fire a mission
mission_fire(id="...")

# Pause/resume
mission_pause(id="...")
mission_resume(id="...")
```

Implementation:
- MissionManager created on engine init, cron ticker started
- EffectBridgeAdapter intercepts mission_* function calls before tool
  lookup and routes to MissionManager
- parse_cadence() handles: "manual", cron expressions, "event:pattern",
  "webhook:path"
- Mission functions documented in CodeAct system prompt
- MissionManager set on adapter via set_mission_manager() after init
  (avoids circular dependency)
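
The parse_cadence() behavior described above can be sketched as follows (the Cadence enum here is a simplified stand-in for MissionCadence; the string formats come from the commit message):

```rust
#[derive(Debug, PartialEq)]
enum Cadence {
    Manual,
    Cron(String),
    OnEvent(String),
    Webhook(String),
}

// "manual" → Manual, "event:pattern" → OnEvent, "webhook:path" → Webhook,
// anything else is treated as a cron expression.
fn parse_cadence(s: &str) -> Cadence {
    if s == "manual" {
        Cadence::Manual
    } else if let Some(p) = s.strip_prefix("event:") {
        Cadence::OnEvent(p.to_string())
    } else if let Some(p) = s.strip_prefix("webhook:") {
        Cadence::Webhook(p.to_string())
    } else {
        Cadence::Cron(s.to_string())
    }
}

fn main() {
    println!("{:?}", parse_cadence("0 9 * * *"));
    println!("{:?}", parse_cadence("webhook:github/issues"));
}
```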

System prompt updated with mission_create, mission_list, mission_fire,
mission_pause, mission_resume documentation.

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): map routine_* calls to mission operations in v2

When the model calls routine_create, routine_list, routine_fire,
routine_pause, routine_resume, or routine_delete, the bridge now
routes them to the MissionManager instead of blocking with an error.

Mapping:
  routine_create → mission_create (with cadence parsing)
  routine_list   → mission_list
  routine_fire   → mission_fire
  routine_pause  → mission_pause
  routine_resume → mission_resume
  routine_update → mission_pause/resume (based on params)
  routine_delete → mission_complete (marks as done)
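
The table above reduces to a static lookup for the simple cases (routine_update and routine_delete need parameter inspection in the real adapter, so they are omitted from this sketch):

```rust
// Route a v1 routine_* tool name to its mission operation, or None for
// tools that are not routine aliases.
fn route(tool: &str) -> Option<&'static str> {
    match tool {
        "routine_create" => Some("mission_create"),
        "routine_list" => Some("mission_list"),
        "routine_fire" => Some("mission_fire"),
        "routine_pause" => Some("mission_pause"),
        "routine_resume" => Some("mission_resume"),
        _ => None,
    }
}

fn main() {
    println!("{:?}", route("routine_fire"));
}
```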

Routine tools removed from v1-only blocklist and restored in
available_actions(). The model can use either "routine" or "mission"
vocabulary — both work.

Still blocked: create_job, cancel_job, build_software (need v1
Scheduler/ContainerJobManager refs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add E2E mission flow tests — 6 new tests

Comprehensive mission lifecycle tests:

- fire_mission_builds_meta_prompt_with_goal: verifies thread spawned
  with project context and recorded in history
- outcome_processing_extracts_next_focus: "Next focus: X" in FINAL()
  response → mission.current_focus updated
- outcome_processing_detects_goal_achieved: "Goal achieved: yes" →
  mission status transitions to Completed
- mission_evolves_via_direct_outcome_processing: 3-step evolution:
  step 1 sets focus to "db module", step 2 evolves to "tools module",
  step 3 detects goal achieved → mission completes. Tests the full
  learning loop without background task timing dependencies.
- fire_with_trigger_payload: webhook payload stored on mission and
  threads_today counter incremented
- daily_budget_enforced: max_threads_per_day=1 → first fire succeeds,
  second returns None

157 tests passing (151 prior + 6 new mission E2E).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): self-improving engine via Mission system

Wire the self-improvement loop as a Mission with OnSystemEvent cadence,
inspired by karpathy/autoresearch's program.md approach. The mission
fires when threads complete with issues, receives trace data as trigger
payload, and uses tools directly to diagnose and fix problems.

Key changes:

Engine self-improvement (Phase A+B from design doc):
- Add fire_on_system_event() to MissionManager for OnSystemEvent cadence
- Add start_event_listener() that subscribes to thread events and fires
  matching missions when non-Mission threads complete with trace issues
- Add ensure_self_improvement_mission() with autoresearch-style goal
  prompt (concrete loop steps, not vague instructions)
- Add process_self_improvement_output() for structured JSON fallback
- Seed fix pattern database with 8 known patterns from debugging
- Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now
  async + Store-aware, appends learned rules from prompt_overlay docs)
- Pass Store to ExecutionLoop for overlay loading

Bridge review fixes (P1/P2):
- Scope engine v2 SSE events to requesting user (broadcast_for_user)
- Per-user pending approvals via HashMap instead of global Option
- Reset tool-call limit counter before each thread execution
- Only persist auto-approval when user chose "always", not one-off "yes"
- Remove dead store/mission_manager fields from EngineState

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add checkpoint-based engine thread recovery

* feat(engine): add Python orchestrator module and host functions

Add the orchestrator infrastructure for replacing the Rust execution
loop with versioned Python code. This commit adds the module and host
functions without switching over — the existing Rust loop is unchanged.

New files:
- orchestrator/default.py: v0 Python orchestrator (run_loop + helpers)
- executor/orchestrator.rs: host function dispatch, orchestrator
  loading from Store with version selection, OrchestratorResult parsing

Host functions exposed to orchestrator Python via Monty suspension:
  __llm_complete__, __execute_code_step__ (nested Monty VM),
  __execute_action__, __check_signals__, __emit_event__,
  __add_message__, __save_checkpoint__, __transition_to__,
  __retrieve_docs__, __check_budget__, __get_actions__

Also makes json_to_monty, monty_to_json, monty_to_string pub(crate)
in scripting.rs for cross-module use.

Design doc: docs/plans/2026-03-25-python-orchestrator.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): switch ExecutionLoop::run() to Python orchestrator

Replace the 900-line Rust execution loop with a ~80-line bootstrap
that loads and runs the versioned Python orchestrator via Monty VM.

The orchestrator Python code (orchestrator/default.py) is the v0
compiled-in version. Runtime versions can override it via MemoryDoc
storage (orchestrator:main with tag orchestrator_code).

Key fixes during switchover:
- Use ExtFunctionResult::NotFound for unknown functions so Monty
  falls through to Python-defined functions (extract_final, etc.)
- Move helper function definitions above run_loop for Monty scoping
- Use FINAL result value (not VM return value) in Complete handler
- Rename 'final' variable to 'final_answer' to avoid Python keyword

Status: 171/177 tests pass. 6 remaining failures are step_count and
token tracking bookkeeping — the orchestrator manages these internally
but doesn't yet update the thread's counters via host functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): all 177 tests pass with Python orchestrator

- Increment step_count and track tokens in __emit_event__("step_completed")
  so thread bookkeeping matches the old Rust loop behavior
- Remove double-counting of tokens in bootstrap (orchestrator handles it)
- Match nudge text to existing TOOL_INTENT_NUDGE constant
- Fix FINAL result propagation (use stored final_result, not VM return)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): orchestrator versioning, auto-rollback, and tests

Add version lifecycle for the Python orchestrator:
- Failure tracking via MemoryDoc (orchestrator:failures)
- Auto-rollback: after 3 consecutive failures, skip the latest version
  and fall back to previous (or compiled-in v0)
- Success resets the failure counter
- OrchestratorRollback event for observability
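
The selection-with-rollback rule can be sketched as: sort candidate versions highest-first, then take the first whose consecutive-failure count is under the threshold (types and signature are illustrative; the real implementation reads counts from the orchestrator:failures MemoryDoc):

```rust
const MAX_FAILURES: u32 = 3;

// Pick the highest version that has not accumulated MAX_FAILURES
// consecutive failures; None means "fall back to compiled-in v0".
fn select_version(mut versions: Vec<u32>, failures: &dyn Fn(u32) -> u32) -> Option<u32> {
    versions.sort_unstable_by(|a, b| b.cmp(a)); // highest first
    versions.into_iter().find(|v| failures(*v) < MAX_FAILURES)
}

fn main() {
    // v3 has 3 consecutive failures, so selection rolls back to v2.
    let failures = |v: u32| if v == 3 { 3 } else { 0 };
    println!("{:?}", select_version(vec![1, 2, 3], &failures));
}
```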

Update self-improvement Mission goal with Level 1.5 instructions for
orchestrator patches — the agent can now modify the execution loop
itself via memory_write with versioned orchestrator docs.

12 new tests: version selection (highest wins), rollback after failures,
rollback to default, failure counting/resetting, outcome parsing for
all 5 ThreadOutcome variants.

189 tests pass, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 architecture, self-improvement, and dev history

Three new docs for contributors:

- engine-v2-architecture.md: Two-layer architecture (Rust kernel +
  Python orchestrator), five primitives, execution model with nested
  Monty VMs, bridge layer, memory/reflection, missions, capabilities

- self-improvement.md: Three improvement levels (prompt/orchestrator/
  config/code), autoresearch-inspired Mission loop, versioned
  orchestrator with auto-rollback, fix pattern database, safety model

- development-history.md: Summary of 6 Claude Code sessions that
  built the system, key design decisions and debugging moments,
  architecture evolution from 900-line Rust loop to Python orchestrator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): complete v2 side-by-side integration with gateway API

Wire engine v2 into the full submission pipeline and expose threads,
projects, and missions through the web gateway REST API.

Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear
submissions to engine v2 when ENGINE_V2=true. Previously only UserInput
and ApprovalResponse were handled; all other control commands fell
through to disconnected v1 sessions.

Bridge query layer — add 11 read-only query functions and 6 DTO types
so gateway handlers can inspect engine state (threads, steps, events,
projects, missions) without direct access to the EngineState singleton.

Gateway endpoints — new /api/engine/* routes:
  GET  /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events
  GET  /projects, /projects/{id}
  GET  /missions, /missions/{id}
  POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume

SSE events — add ThreadStateChanged, ChildThreadSpawned, and
MissionThreadSpawned AppEvent variants. Expand the bridge event mapper
to forward StateChanged and ChildSpawned engine events to the browser.

Engine crate — add ConversationManager::clear_conversation() for /new
and /clear commands.

Code quality — replace 10 .expect() calls with proper error returns,
remove dead AgentConfig.engine_v2 field, log silent init errors, fix
duplicate doc comment, improve fallthrough documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): empty call_id on ActionResult and trace analyzer false positives

Fix structured executor not stamping call_id onto ActionResult — the
EffectExecutor trait doesn't receive call_id, so the structured executor
must copy it from the original ActionCall after execution. Empty call_id
caused OpenAI-compatible providers to reject the next LLM request with
"Invalid 'input[2].call_id': empty string".

Fix trace analyzer false positives:
- code_error check now only scans User-role code output messages
  (prefixed with [stdout]/[stderr]/[code ]/Traceback), not System
  prompt which contains example error text
- missing_tool_output check now recognizes ActionResult messages as
  valid tool output (Tier 0 structured path)
- Add NotImplementedError to detected code error patterns

New trace checks:
- empty_call_id: detect ActionResult messages with missing/empty
  call_id before they reach the LLM API (severity: Error)
- llm_error: extract LLM provider errors from Failed state reason
- orchestrator_error: extract orchestrator errors from Failed state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): add Missions tab to gateway UI

Add a full Missions page to the web gateway with list view, detail view,
and action buttons (Fire, Pause, Resume).

Backend: add /api/engine/missions/summary endpoint returning counts by
status (active/paused/completed/failed).

Frontend:
- New "Missions" tab between Jobs and Routines
- Summary cards showing mission counts by status
- Table with name, goal, cadence type, thread count, status, actions
- Detail view with goal, cadence, current focus, success criteria,
  approach history, spawned thread list, and action buttons
- Fire/Pause/Resume actions with toast notifications
- i18n support (English + Chinese)
- CSS following the existing routines/jobs patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): eagerly initialize engine v2 at startup

The gateway API endpoints (/api/engine/missions, etc.) call bridge
query functions that return empty results when the engine state hasn't
been initialized yet. Previously, initialization only happened lazily
on the first chat message via handle_with_engine().

Now when ENGINE_V2=true, the engine is initialized in Agent::run()
before channels start, so the self-improvement mission and other
engine state is available to gateway API endpoints immediately.

Also rename get_or_init_engine → init_engine and make it public so
it can be called from agent_loop.rs at startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): improve mission detail with markdown goal and thread table

- Goal rendered as full-width markdown block instead of plain-text
  meta item (uses existing renderMarkdown/marked)
- Current focus and success criteria also rendered as markdown
- Spawned threads shown as a clickable table with goal, type, state,
  steps, tokens, and created date instead of a UUID list
- Clicking a thread row opens an inline thread detail view showing
  metadata grid and full message history with markdown rendering
- Back button returns to the mission detail view
- Backend: mission detail now returns full thread summaries (goal,
  state, step_count, tokens) instead of just thread IDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): close SSE connections on page unload to prevent connection starvation

The browser limits concurrent HTTP/1.1 connections per origin to 6.
Without cleanup, SSE connections from prior page loads linger after
refresh/navigation, eating into the pool. After 2-3 refreshes, all 6
slots are consumed by stale SSE streams and new API fetch calls queue
indefinitely — the UI shows "connected" (SSE works) but data never
loads.

Add a beforeunload handler that closes both eventSource (chat events)
and logEventSource (log stream) so the browser can reuse connections
immediately on page reload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): support multiple gateway tabs by reducing SSE connections

Each browser tab opened 2 SSE connections (chat events + log events).
With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the
pool and couldn't load any data.

Three changes:

1. Lazy log SSE — only connect when the logs tab is active, disconnect
   when switching away. Most users rarely view logs, so this saves a
   connection slot per tab.

2. Visibility API — close SSE when the browser tab goes to background
   (user switches to another tab), reconnect when it becomes visible.
   Background tabs don't need real-time events.

3. Combined with the existing beforeunload cleanup, this means:
   - Active foreground tab: 1 connection (chat SSE only, +1 if logs tab)
   - Background tabs: 0 connections
   - Closed/refreshed tabs: 0 connections (beforeunload cleanup)

This allows many gateway tabs to coexist within the 6-connection limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): route messages to correct conversation by thread scope

Messages sent from a new conversation in the gateway always appeared in
the default assistant conversation because handle_with_engine ignored
the thread_id from the frontend.

Two fixes:

1. Engine conversation scoping — when the message carries a thread_id
   (from the frontend's conversation picker), use it as part of the
   engine conversation key: "gateway:<thread_id>" instead of just
   "gateway". This creates a distinct engine conversation per v1
   thread, so messages don't cross-contaminate.

2. V1 dual-write targeting — write user messages and assistant
   responses to the v1 conversation matching the thread_id (via
   ensure_conversation), not the hardcoded assistant conversation.
   Falls back to the assistant conversation when no thread_id is
   present (e.g., default chat).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): richer activity indicators for engine v2 execution

The gateway UI showed only generic "Thinking..." during engine v2
execution with no visibility into CodeAct code execution, tool calls,
or reflection. Now the event mapping produces detailed status updates:

Step lifecycle:
- "Calling LLM..." when a step starts (was "…
ilblackdragon added a commit that referenced this pull request Apr 10, 2026
…stant (#1736)

* v2 architecture phase 1

* feat(engine): Phase 2 — execution loop, capability system, thread runtime

Add the core execution engine to ironclaw_engine crate:

- CapabilityRegistry: register/get/list capabilities and actions
- LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire)
- PolicyEngine: deterministic effect-level allow/deny/approve
- ThreadTree: parent-child relationship tracking
- ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc
- ThreadManager: spawn threads as tokio tasks, stop, inject messages, join
- ExecutionLoop: core loop replacing run_agentic_loop() with signals,
  context building, LLM calls, action execution, and event recording
- Structured executor (Tier 0): lease lookup → policy check → effect execution
- Tool intent nudge detection
- MemoryStore + RetrievalEngine stubs for Phase 4
- Full 8-phase architecture plan in docs/plans/
- CLAUDE.md spec for the engine crate

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 3 — Monty Python executor with RLM pattern

Add CodeAct execution (Tier 1) using the Monty embedded Python
interpreter, following the Recursive Language Model (RLM) pattern
from arXiv:2512.24601.

Key additions:
- executor/scripting.rs: Monty integration with FunctionCall-based
  tool dispatch, catch_unwind panic safety, resource limits (30s,
  64MB, 1M allocs)
- LlmResponse::Code variant + ExecutionTier::Scripting
- Context-as-variables (RLM 3.4): thread messages, goal, step_number,
  previous_results injected as Python variables — LLM context stays
  lean while code accesses data selectively
- llm_query(prompt, context) (RLM 3.5): recursive subagent calls
  from within Python code — results stored as variables, not injected
  into parent's attention window (symbolic composition)
- Compact output metadata between code steps instead of full stdout
- MontyObject ↔ serde_json::Value bidirectional conversion
- Updated architecture plan with RLM design principles

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): RLM best-practices enhancements from cross-reference analysis

Cross-referenced our implementation against the official RLM (alexzhang13/rlm),
fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation.
Key enhancements:

- FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching
  all three reference implementations. Code can signal completion at any
  point, not just via return value.
- llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn,
  matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch.
- Output truncation increased to 8000 chars (from 120), matching Prime
  Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT].
- Step 0 orientation preamble: auto-injects context metadata (message count,
  total chars, goal, last user message preview) before first code step,
  matching fast-rlm's auto-print pattern.
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors,
  OS errors, and async errors now flow back as stdout content instead of
  terminating the step, enabling LLM self-correction on next iteration.
  Only VM panics (catch_unwind) terminate as EngineError.
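
The tail-truncation policy above can be sketched as a tiny std-only helper; the 8000-char limit and marker strings come from this commit message, the function name is an assumption:

```rust
const MAX_OUTPUT: usize = 8000;

// Keep the last MAX_OUTPUT chars of code output, labeling whether the
// model is seeing the full text or only the tail. ASCII-safe slicing is
// assumed here; real output may need char-boundary handling.
fn truncate_output(s: &str) -> String {
    if s.len() <= MAX_OUTPUT {
        format!("[FULL OUTPUT]\n{s}")
    } else {
        let tail = &s[s.len() - MAX_OUTPUT..];
        format!("[TRUNCATED: last {MAX_OUTPUT} chars]\n{tail}")
    }
}

fn main() {
    println!("{}", truncate_output("hello").lines().next().unwrap());
    println!("{}", truncate_output(&"x".repeat(9000)).lines().next().unwrap());
}
```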

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with RLM cross-reference learnings

Comprehensive update after cross-referencing against official RLM
(alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect
(verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM.

Changes:
- Mark Phases 1-3 as DONE with commit refs and test counts
- Add "Key Influences" section documenting all reference implementations
- Phase 3: full table of implemented RLM features with sources
- Phase 3: "Remaining gaps" table with which phase addresses each
- Phase 4: expanded with compaction (85% context), rlm_query() (full
  recursive sub-agent), dual model routing, budget controls (USD,
  timeout, tokens, consecutive errors), lazy loading, pass-by-reference
- Add "RLM Execution Model" cross-cutting section
- Add "Implementation Progress" tracking table
- Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 4 — budget controls, compaction, reflection pipeline

Budget enforcement in ExecutionLoop:
- max_tokens_total: cumulative token limit, checked before each iteration
- max_duration: wall-clock timeout for entire thread
- max_consecutive_errors: consecutive error steps threshold (resets on
  success, matching official RLM behavior)
- All produce ThreadOutcome::Failed with descriptive messages
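
The per-iteration budget gate can be sketched as below; the struct and field names mirror the ThreadConfig fields listed in this commit but the signature is an assumption:

```rust
struct Budget {
    max_tokens_total: u64,
    max_consecutive_errors: u32,
}

// Checked before each loop iteration; an Err here becomes
// ThreadOutcome::Failed with a descriptive message.
fn check_budget(tokens_used: u64, consecutive_errors: u32, b: &Budget) -> Result<(), String> {
    if tokens_used >= b.max_tokens_total {
        return Err(format!("token budget exhausted ({tokens_used}/{})", b.max_tokens_total));
    }
    if consecutive_errors >= b.max_consecutive_errors {
        return Err(format!("{consecutive_errors} consecutive errors"));
    }
    Ok(())
}

fn main() {
    let b = Budget { max_tokens_total: 10_000, max_consecutive_errors: 5 };
    println!("{}", check_budget(12_000, 0, &b).is_err());
}
```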

Context compaction (from RLM paper, 85% threshold):
- estimate_tokens(): char-based estimation (chars/4, matching RLM)
- should_compact(): triggers when tokens >= threshold_pct * context_limit
- compact_messages(): asks LLM to summarize progress, replaces history
  with [system, summary, continuation_note], preserves intermediate results
- Configurable via ThreadConfig: model_context_limit, compaction_threshold
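
The estimation and trigger described above amount to two one-liners (a sketch of estimate_tokens()/should_compact(); the chars/4 heuristic and 85% threshold come from this commit):

```rust
// Char-based token estimate: chars / 4, matching the RLM heuristic.
fn estimate_tokens(text: &str) -> usize {
    text.len() / 4
}

// Compaction fires once estimated usage crosses the configured fraction
// of the model's context limit (default threshold_pct = 0.85).
fn should_compact(current_tokens: usize, context_limit: usize, threshold_pct: f64) -> bool {
    current_tokens as f64 >= threshold_pct * context_limit as f64
}

fn main() {
    // 400k chars ≈ 100k tokens: below the 108.8k trigger for a 128k limit.
    println!("{}", should_compact(estimate_tokens(&"x".repeat(400_000)), 128_000, 0.85));
    // 480k chars ≈ 120k tokens: over the trigger, compaction runs.
    println!("{}", should_compact(estimate_tokens(&"x".repeat(480_000)), 128_000, 0.85));
}
```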

Dual model routing:
- LlmCallConfig gains depth field (0=root, 1+=sub-call)
- Implementations can route to cheaper models for sub-calls
- ExecutionLoop passes thread depth to every LLM call

Reflection pipeline (reflection/pipeline.rs):
- reflect(thread, llm): analyzes completed thread via LLM
- Produces Summary doc (always), Lesson doc (if errors), Issue doc (if failed)
- Builds transcript from thread messages + error events
- Returns ReflectionResult with docs + token usage

ThreadConfig extended with: max_tokens_total, max_consecutive_errors,
model_context_limit, enable_compaction, compaction_threshold, depth, max_depth.

78 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 5 — conversation surface separated from execution

Conversation is now a UI layer, not an execution boundary. Multiple
threads can run concurrently within one conversation; threads can
outlive their originating conversation.

New types (types/conversation.rs):
- ConversationSurface: channel + user + entries + active_threads
- ConversationEntry: sender (User/Agent/System) + content + origin_thread_id
- ConversationId, EntryId (UUID newtypes)
- EntrySender enum (User, Agent{thread_id}, System)

ConversationManager (runtime/conversation.rs):
- get_or_create_conversation(channel, user) — indexed by (channel, user)
- handle_user_message() — injects into active foreground thread or spawns new
- record_thread_outcome() — adds agent/system entries, untracks completed threads
- get_conversation(), list_conversations()

This enables the key architectural insight: a user can ask "what's the
weather?" while a deployment thread is still running. Both produce entries
in the same conversation.

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM

Restructure phases 6-8 to clarify execution model:

- Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker
  Python runtimes for LLM-generated code.
- WASM sandbox is for third-party tool isolation (existing infra, Phase 8)
- Docker containers are for thread-level isolation of high-risk work (Phase 8)
- Two-phase commit moves to Phase 6 (integration) at the adapter boundary

Phase renumbering:
- Old Phase 6 (Tier 2-3) → removed as separate phase
- Old Phase 7 (integration) → Phase 6
- Old Phase 8 (cleanup) → Phase 7
- New Phase 8: WASM tools + Docker thread isolation (infra integration)

Updated progress table: Phases 1-5 marked DONE with test counts and commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 6 — bridge adapters for main crate integration

Strategy C parallel deployment: when ENGINE_V2=true env var is set,
user messages route through the engine instead of the existing agentic
loop. All existing behavior is unchanged when the flag is off.

Bridge module (src/bridge/):
- LlmBridgeAdapter: wraps LlmProvider as engine LlmBackend, converts
  ThreadMessage↔ChatMessage, ActionDef↔ToolDefinition, depth-based
  model routing (primary vs cheap_llm)
- EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as EffectExecutor,
  routes tool calls through existing execute_tool_with_safety pipeline
- InMemoryStore: HashMap-backed Store impl (no DB tables needed yet)
- EngineRouter: is_engine_v2_enabled() + handle_with_engine() that
  builds engine from Agent deps and processes messages end-to-end

Integration touchpoint (4 lines in agent_loop.rs):
  After hook processing, before session resolution, check ENGINE_V2
  flag and route UserInput through the engine path.
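A minimal sketch of the flag check (the helper name matches the commit; the set of accepted values is an assumption here):

```rust
// Sketch only: the real helper may accept a narrower set of values.
fn engine_v2_enabled_from(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1") | Some("true") | Some("yes")
    )
}

// Reads the ENGINE_V2 environment variable checked at the touchpoint.
fn is_engine_v2_enabled() -> bool {
    engine_v2_enabled_from(std::env::var("ENGINE_V2").ok().as_deref())
}
```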

Accessor visibility widened: llm(), cheap_llm(), safety(), tools()
changed from pub(super) to pub(crate) for bridge access.

85 engine tests + main crate clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add user message and system prompt to thread before execution

The ExecutionLoop was sending empty messages to the LLM because the
thread was spawned with the user's input as the goal but no messages.

Fixes:
- ThreadManager.spawn_thread() now adds the goal as an initial user
  message before starting the execution loop
- ExecutionLoop.run() injects a default system prompt if none exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): match existing LLM request format to prevent 400 errors

The LLM bridge was missing several defaults that the existing
Reasoning.respond_with_tools() sets:

- tool_choice: "auto" when tools are present (required by some providers)
- max_tokens: 4096 (default)
- temperature: 0.7 (default)
- When no tools (force_text): use plain complete() instead of
  complete_with_tools() with empty tools array — matches existing
  no-tools fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): persist conversation context across messages

The engine was creating a fresh ThreadManager and InMemoryStore per
message, losing all context between turns. A follow-up question like
"what are the latest 10 issues?" had no memory of the prior "how many
issues" response.

Fixes:
- EngineState (ThreadManager, ConversationManager, InMemoryStore) now
  persists across messages via OnceLock, initialized on first use
- ConversationManager builds message history from prior conversation
  entries (user messages + agent responses) and passes it to new threads
- ThreadManager.spawn_thread_with_history() accepts initial_messages
  that are prepended before the current user message
- System notifications (thread started/completed) are filtered out of
  the history (not useful as LLM context)
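The OnceLock pattern can be sketched as follows (the fields here are an illustrative stand-in for the real ThreadManager/ConversationManager/InMemoryStore):

```rust
use std::sync::{Mutex, OnceLock};

// Illustrative stand-in: the real EngineState holds ThreadManager,
// ConversationManager, and the in-memory store.
struct EngineState {
    messages_handled: Mutex<u32>,
}

static ENGINE_STATE: OnceLock<EngineState> = OnceLock::new();

// First call initializes; every later call returns the same instance,
// so conversation context survives across messages.
fn engine_state() -> &'static EngineState {
    ENGINE_STATE.get_or_init(|| EngineState {
        messages_handled: Mutex::new(0),
    })
}
```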

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): enable CodeAct/RLM mode with code block detection

The engine now operates in CodeAct/RLM mode:

System prompt (executor/prompt.rs):
- Instructs LLM to write Python in ```repl fenced blocks
- Documents available tools as callable Python functions
- Documents llm_query(), llm_query_batched(), FINAL()
- Documents context variables (context, goal, step_number, previous_results)
- Strategy guidance: examine context, break into steps, use tools, call FINAL()

Code block detection (bridge/llm_adapter.rs):
- extract_code_block() scans LLM text responses for ```repl or ```python blocks
- When detected, returns LlmResponse::Code instead of LlmResponse::Text
- The ExecutionLoop routes Code responses through Monty for execution

No structured tool definitions sent to LLM:
- Tools are described in the system prompt as Python functions
- The LLM call sends an empty actions array, forcing text-mode responses
- This ensures the LLM writes code blocks (CodeAct) instead of
  structured tool calls (which would bypass the REPL)

85 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add 8 CodeAct/RLM E2E tests with mock LLM

Comprehensive test coverage for the Monty Python execution path:

- codeact_simple_final: Python code calls FINAL('answer') → thread completes
- codeact_tool_call_then_final: code calls test_tool() → FunctionCall
  suspends VM → MockEffects returns result → code resumes → FINAL()
- codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15')
  with no tool calls — pure Python in Monty
- codeact_multi_step: first step prints output (no FINAL), second step
  sees output metadata and calls FINAL — tests iterative REPL flow
- codeact_error_recovery: first step has NameError → error flows to LLM
  as stdout → second step recovers with FINAL — tests error transparency
- codeact_context_variables_available: code accesses `goal` and `context`
  variables injected by the RLM context builder
- codeact_multiple_tool_calls_in_loop: for loop calls test_tool() 3 times
  → 3 FunctionCall suspensions → all results collected → FINAL
- codeact_llm_query_recursive: code calls llm_query('prompt') → VM
  suspends → MockLlm provides sub-agent response → result returned as
  Python string variable

93 tests passing (85 prior + 8 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): detect code blocks in plain completion path + multi-block support

Two bugs fixed:

1. The no-tools completion path (used by CodeAct since we send empty
   actions) returned LlmResponse::Text without checking for code blocks.
   Code blocks were rendered as markdown text instead of being executed.

2. extract_code_block now:
   - Handles bare ``` fences (skips non-Python languages)
   - Collects ALL code blocks in the response and concatenates them
     (models often split code across multiple blocks with explanation)
   - Tries markers in order: ```repl, ```python, ```py, then bare ```
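A simplified, line-oriented sketch of the behavior above (the real implementation tries the markers in priority order; this flattens that into a single pass):

```rust
// Sketch: collect all Python-ish fenced blocks, skip other languages,
// drop unclosed blocks, concatenate the rest with blank lines.
fn extract_code_block(text: &str) -> Option<String> {
    let python_tags = ["repl", "python", "py", ""]; // bare ``` counts as Python
    let mut blocks: Vec<String> = Vec::new();
    let mut in_fence = false;
    let mut keep = false;
    let mut current = String::new();
    for line in text.lines() {
        if let Some(tag) = line.trim().strip_prefix("```") {
            if !in_fence {
                in_fence = true;
                keep = python_tags.contains(&tag.trim()); // skips ```json etc.
                current.clear();
            } else {
                if keep && !current.trim().is_empty() {
                    blocks.push(current.trim_end().to_string());
                }
                in_fence = false;
            }
        } else if in_fence {
            current.push_str(line);
            current.push('\n');
        }
    }
    // An unclosed block never reaches `blocks`, matching the fix.
    if blocks.is_empty() { None } else { Some(blocks.join("\n\n")) }
}
```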

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(bridge): add 11 regression tests for code block extraction

Covers the exact failure modes discovered during live testing:

- extract_repl_block: standard ```repl fenced block
- extract_python_block: ```python marker
- extract_py_block: ```py shorthand
- extract_bare_backtick_block: bare ``` with Python content
- skip_non_python_language: ```json should NOT be extracted
- no_code_blocks_returns_none: plain text, no fences
- multiple_code_blocks_concatenated: two ```repl blocks with
  explanation between them → concatenated with \n\n
- mixed_thinking_and_code: model outputs explanation + two
  ```python blocks (the Hyperliquid case) → both extracted
- repl_preferred_over_bare: ```repl takes priority over bare ```
- empty_code_block_skipped: empty fenced block returns None
- unclosed_block_returns_none: no closing ``` returns None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): detect FINAL() in text responses + regression tests

Models sometimes write FINAL() outside code blocks — as plain text
after an explanation. The Hyperliquid case: model outputs a long
analysis then FINAL("""...""") at the end, not inside ```repl fences.

Fixes:
- extract_final_from_text(): regex-based FINAL detection in text
  responses, matching the official RLM's find_final_answer() fallback
- Handles: double-quoted, single-quoted, triple-quoted, unquoted,
  nested parens
- Checked in LlmResponse::Text handler BEFORE tool intent nudge
  (FINAL takes priority)
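A dependency-free sketch of the fallback (the real extract_final_from_text is regex-based; this version balances parens by hand and assumes quotes inside the answer are well-formed):

```rust
// Sketch only: escape sequences inside quoted strings are not modeled.
fn extract_final_from_text(text: &str) -> Option<String> {
    let start = text.rfind("FINAL(")?;
    let rest = &text[start + "FINAL(".len()..];
    // Balance parentheses to find the matching close (handles nesting).
    let mut depth = 1usize;
    let mut end = None;
    for (i, c) in rest.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    end = Some(i);
                    break;
                }
            }
            _ => {}
        }
    }
    let inner = rest[..end?].trim();
    // Try triple quotes first, then double/single; unquoted passes through.
    let stripped = ["\"\"\"", "'''", "\"", "'"]
        .iter()
        .find_map(|q| inner.strip_prefix(*q)?.strip_suffix(*q))
        .unwrap_or(inner);
    let answer = stripped.trim();
    if answer.is_empty() { None } else { Some(answer.to_string()) }
}
```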

9 new tests:
- codeact_final_in_text_response: FINAL("answer") in plain text
- codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text
- final_double_quoted, final_single_quoted, final_triple_quoted,
  final_unquoted, final_with_nested_parens, final_after_long_text,
  no_final_returns_none

102 tests passing (93 + 9 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add crate extraction & cleanup roadmap

Documents architectural recommendations from the engine v2 design
process for future reference:

- Root directory consolidation (channels-src + tools-src → extensions/)
- Crate extraction tiers: zero-coupling (estimation, observability,
  tunnel), trivial-coupling (document_extraction, pairing, hooks),
  medium-coupling (secrets, MCP, db, workspace, llm, skills),
  heavy-coupling (web gateway, agent, extensions)
- src/ module reorganization into logical groups (core, persistence,
  infra, media, support)
- main.rs/app.rs slimming targets (100/500 lines after migration)
- WASM module candidates (document_extraction) and non-candidates
  (REPL, web gateway → separate crates instead)
- Priority ordering for extraction work
- Tracks completed items (ironclaw_safety, ironclaw_engine,
  transcription move)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): live progress status updates via event broadcast

Engine v2 now shows live progress in the CLI (and any channel):
- "Thinking..." when a step starts
- Tool name + success/error when actions execute
- "Processing results..." when a step completes

Implementation:
- ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256)
- ExecutionLoop.emit_event() writes to thread.events AND broadcasts
- ThreadManager.subscribe_events() returns a receiver
- Router uses tokio::select! to listen for events while waiting for
  thread completion, forwarding them as StatusUpdate to the channel

This replaces the polling approach with zero-latency event streaming.
Agent.channels visibility widened to pub(crate) for bridge access.

102 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): include tool results in code step output for LLM context

The LLM was ignoring tool results and answering from training data
because the compact output metadata didn't include what tools returned.
Tool results lived only as ActionResult messages (role: Tool), which
some providers flatten and which the model often ignores.

Now the code step output includes:
- stdout from Python print() statements
- [tool_name result] with the actual output (truncated to 4K per tool)
- [tool_name error] for failed tools
- [return] for the code's return value
- Total output truncated to 8K chars to prevent context bloat
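The output assembly can be sketched as follows (the labels match the commit; the signature is an assumption):

```rust
// Illustrative assembly of the compact step output.
const PER_TOOL_CAP: usize = 4_000;
const TOTAL_CAP: usize = 8_000;

fn cap(s: &str, n: usize) -> String {
    s.chars().take(n).collect() // char-safe truncation
}

fn format_step_output(
    stdout: &str,
    tools: &[(&str, Result<&str, &str>)],
    return_value: Option<&str>,
) -> String {
    let mut out = String::new();
    if !stdout.is_empty() {
        out.push_str(stdout);
        out.push('\n');
    }
    for (name, res) in tools {
        match res {
            Ok(v) => out.push_str(&format!("[{name} result] {}\n", cap(v, PER_TOOL_CAP))),
            Err(e) => out.push_str(&format!("[{name} error] {}\n", cap(e, PER_TOOL_CAP))),
        }
    }
    if let Some(r) = return_value {
        out.push_str(&format!("[return] {r}\n"));
    }
    cap(&out, TOTAL_CAP)
}
```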

This ensures the model sees web_search results, API responses, etc.
in the next iteration and can reason about them instead of hallucinating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add debug/trace logging for CodeAct execution

Three verbosity levels for debugging the engine:

RUST_LOG=ironclaw_engine=debug:
- LLM call: message count, iteration, force_text
- LLM response: type (text/code/action_calls), token usage
- Code execution: code length, action count, had_error, final_answer
- Text response: length, FINAL() detection

RUST_LOG=ironclaw_engine=trace:
- Full message list sent to LLM (role, length, first 200 chars each)
- Full code block being executed
- stdout preview (first 500 chars)
- Per-tool results (name, success, first 300 chars of output)
- Text response preview (first 500 chars)

Usage:
  ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run
  ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): execution trace recording + retrospective analysis

Enable with ENGINE_V2_TRACE=1 to get full execution traces and
automatic issue detection after each thread completes.

Trace recording (executor/trace.rs):
- build_trace(): captures full thread state — messages (with full
  content), events, step count, token usage, detected issues
- write_trace(): writes JSON to engine_trace_{timestamp}.json
- log_trace_summary(): logs summary + issues at info/warn level

Retrospective analyzer detects 8 issue categories:
- thread_failure: thread ended in Failed state
- no_response: no assistant message generated
- tool_error: specific tool failures with error details
- code_error: Python errors (NameError, SyntaxError, etc.) in output
- missing_tool_output: tool results exist but not in system messages
- excessive_steps: >10 steps (may be stuck in loop)
- no_tools_used: single-step answer without tools (hallucination risk)
- mixed_mode: text responses without code blocks (prompt not followed)

Thread state now saved to store after execution completes (for trace
access after join_thread).

Usage:
  ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run
  # After each message: trace JSON + issue log in terminal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): wire reflection pipeline + trace analysis into thread lifecycle

After every thread completes, ThreadManager now automatically runs:

1. Retrospective trace analysis (non-LLM, always):
   - Detects 8 issue categories (tool errors, code errors, missing
     outputs, excessive steps, hallucination risk, etc.)
   - Logs issues at warn level when found

2. Trace file recording (when ENGINE_V2_TRACE=1):
   - Writes full JSON trace to engine_trace_{timestamp}.json

3. LLM reflection (when enable_reflection=true):
   - Calls reflection pipeline to produce Summary, Lesson, Issue docs
   - Saves docs to store for future context retrieval
   - Enabled by default in the bridge router

All three run inside the spawned tokio task after exec.run() completes,
before saving the final thread state. No external wiring needed.

Removed duplicate trace recording from the router — it's now handled
by ThreadManager automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): convert tool name hyphens to underscores for Python compatibility

Root cause from trace analysis: the LLM writes `web_search()` (valid
Python identifier) but the tool registry has `web-search` (with hyphen).
The EffectBridgeAdapter couldn't find the tool → "Tool not found" error
→ model fabricated fake data instead.

Fixes:
- available_actions(): converts tool names from hyphens to underscores
  (web-search → web_search) so the system prompt lists valid Python names
- execute_action(): tries the original name first, then falls back to
  hyphenated form (web_search → web-search) for tool registry lookup
- Same conversion in router's capability registry builder
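Both conversions can be sketched as follows (helper names are illustrative):

```rust
// web-search -> web_search: a valid Python identifier for the prompt.
fn python_name(tool: &str) -> String {
    tool.replace('-', "_")
}

// Registry lookup: try the name as given, then the hyphenated form.
fn lookup_tool<'a>(registry: &[&'a str], requested: &str) -> Option<&'a str> {
    registry
        .iter()
        .copied()
        .find(|t| *t == requested)
        .or_else(|| {
            let hyphenated = requested.replace('_', "-");
            registry.iter().copied().find(|t| *t == hyphenated)
        })
}
```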

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): parse JSON tool output to prevent double-serialization

From trace analysis: web_search returned a JSON string, which was
wrapped as serde_json::json!(string) creating a Value::String containing
JSON. When Monty got this as MontyObject::String, the Python code
couldn't index it with result['title'] → TypeError.

Fix: try parsing the tool output string as JSON first. If valid, use the
parsed Value (becomes a Python dict/list). If not valid JSON, keep as
string. This means web_search results are directly indexable in Python:
  results = web_search(query="...")
  print(results["results"][0]["title"])  # works now

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): persist variables across code steps via `state` dict

Monty creates a fresh runtime per code step, so variables are lost
between steps. This caused the model to re-paste tool results from
system messages, wasting tokens.

Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that
accumulates across steps:
- Tool results stored by tool name: state["web_search"] = {results...}
- Return values stored: state["last_return"], state["step_0_return"]
- Injected as a `state` Python variable in each new MontyRun

Now the model can do:
  Step 1: results = web_search(query="...")  # tool result saved in state
  Step 2: data = state["web_search"]         # access previous result
          summary = llm_query("summarize", str(data))
          FINAL(summary)

System prompt updated to document the `state` variable.
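The accumulator can be sketched with plain strings standing in for the JSON values (method names are illustrative):

```rust
use std::collections::HashMap;

// Illustrative stand-in for `persisted_state`; the real one stores JSON
// values and is injected into each new MontyRun as the `state` variable.
#[derive(Default)]
struct PersistedState {
    entries: HashMap<String, String>,
}

impl PersistedState {
    fn record_tool(&mut self, tool: &str, result: &str) {
        self.entries.insert(tool.to_string(), result.to_string());
    }
    fn record_return(&mut self, step: usize, value: &str) {
        self.entries.insert(format!("step_{step}_return"), value.to_string());
        self.entries.insert("last_return".to_string(), value.to_string());
    }
    fn get(&self, key: &str) -> Option<&str> {
        self.entries.get(key).map(String::as_str)
    }
}
```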

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add state hint on code errors + retrieval engine integration

When code fails with NameError/UnboundLocalError (model trying to
access variables from a previous step), the error output now includes:

  [HINT] Variables don't persist between code blocks. Use the `state`
  dict to access data from previous steps. Available keys: ["web_search",
  "last_return"]

This teaches the model to use `state["web_search"]` instead of `result`
after a NameError, reducing wasted steps from 3-4 to 1.
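The hint logic can be sketched as follows (helper names are hypothetical; the hint text matches the commit):

```rust
// Detect the two Python errors that indicate a lost-variable access.
fn is_lost_variable_error(err: &str) -> bool {
    err.contains("NameError") || err.contains("UnboundLocalError")
}

fn state_hint(available_keys: &[&str]) -> String {
    format!(
        "[HINT] Variables don't persist between code blocks. Use the `state` \
         dict to access data from previous steps. Available keys: {available_keys:?}"
    )
}
```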

Also integrates RetrievalEngine into context building and ThreadManager:
- build_step_context() now accepts optional RetrievalEngine to inject
  relevant memory docs (Lessons, Specs, Playbooks) into LLM context
- RetrievalEngine uses keyword matching with doc-type priority scoring
- Memory docs from reflection (Phase 4) now feed back into future threads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove trace files and add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace web_fetch example with web_search in CodeAct prompt

The system prompt example used web_fetch(url="...") which doesn't exist
as a tool. The model learned from the example and tried web_fetch,
getting "Tool not found". Changed to web_search(query="...") which is
an actual registered tool.

Found via trace analysis — reflection pipeline correctly identified
this as a "Tool Name Correction" spec doc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(engine): extract prompt templates to markdown files

Prompt templates moved from inline Rust strings to plain markdown files
at crates/ironclaw_engine/prompts/ for easy inspection and iteration:

- prompts/codeact_preamble.md — main instructions, special functions,
  context variables, rules
- prompts/codeact_postamble.md — strategy section

Loaded at compile time via include_str!(), so no runtime file I/O.
Edit the .md files and rebuild to iterate on prompts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace byte-index slicing with char-safe truncation

Panic: 'byte index 80 is not a char boundary' when tool output
contained multi-byte UTF-8 characters (smart quotes from web search
results).

Fixed 4 unsafe byte-index slices:
- thread.rs:281: message preview &content[..80] → chars().take(80)
- loop_engine.rs:556: tool output &str[..4000] → chars().take(4000)
- loop_engine.rs:579: output tail &str[len-8000..] → chars().skip()
- scripting.rs:82: stdout tail &str[len-N..] → chars().skip()

All now use .chars().take() or .chars().skip() which respect character
boundaries. Follows CLAUDE.md rule: "Never use byte-index slicing on
user-supplied or external strings."
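The safe replacements, as free functions:

```rust
// Char-boundary-safe replacements for `&s[..n]` and `&s[len - n..]`.
fn truncate_head(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}

fn truncate_tail(s: &str, max_chars: usize) -> String {
    let total = s.chars().count();
    s.chars().skip(total.saturating_sub(max_chars)).collect()
}
```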

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): fix false positive missing_tool_output warning in trace analyzer

The check was looking for "[" + "result]" in System-role messages only,
but tool output metadata is added with patterns like "[shell result]"
and may appear in messages with any role. Changed to scan all messages
for " result]" or " error]" patterns regardless of role.
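The revised check, sketched with (role, content) pairs standing in for the real message type:

```rust
// Scan every message, regardless of role, for tool-output markers.
fn has_tool_output(messages: &[(&str, &str)]) -> bool {
    messages
        .iter()
        .any(|(_role, content)| content.contains(" result]") || content.contains(" error]"))
}
```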

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with Phase 6 status and approval flow design

Phase 6 updated to reflect what was actually built:
- Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done
- Integration touchpoint (4 lines in handle_message) — done
- Live progress via broadcast events — done
- Conversation persistence across messages — done
- Trace recording + retrospective analysis — done
- 8 bugs found and fixed via trace analysis — documented

Phase 6 remaining work documented:
- Approval flow: detailed 5-step design (send to channel, pause thread,
  route response, resume execution, always handling) with v1 reference
- Database persistence (InMemoryStore → real DB tables)
- Acceptance testing (TestRig + TraceLlm fixtures)
- Two-phase commit for high-stakes effects

Progress table updated: Phase 6 marked as DONE (partial), 134 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add self-improving engine design plan

Designs a system where the engine debugs and improves itself, based on
the pattern observed in the last session: 5 consecutive bug fixes all
followed trace → read → identify → edit → test, using tools the engine
already has access to.

Three levels of self-improvement:
- Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply.
- Level 2 (Config): adjust defaults/mappings. Branch + test + PR.
- Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR.

Architecture: Self-improvement Mission spawns a Reflection thread that
reads traces, reads source, proposes fixes, validates via cargo test,
and either auto-applies (Level 1) or creates a PR (Level 2-3).

Includes: fix pattern database (seeded from our 8 debugging session
fixes), feedback loop diagram, safety model, implementation phases
(A through D), and what exists vs what's new.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 security model and audit

Comprehensive security analysis of engine v2 covering:

Threat model: 4 attacker profiles (malicious input, prompt injection
via tools, poisoned memory, supply chain).

Current state audit: 9 controls working (Monty sandbox, safety layer,
policy engine, leases, provenance, events) and 9 gaps identified.

Critical finding: ALL tools granted by default — CodeAct code can call
shell, write_file, apply_patch without approval. Proposed fix: 3-tier
tool classification (auto/approve-once/always-approve).

CodeAct-specific threats: tool call amplification, prompt injection via
search results, data exfiltration via tool chains, Monty escape.

Self-improvement security: poisoned trace attacks, memory poisoning via
reflection. Mitigations: edit validation, frequency caps, audit trail,
auto-rollback, reflection output scanning.

6-layer security architecture proposed: input validation, capability
gating, output sanitization, execution sandboxing, self-improvement
controls, observability.

Prioritized implementation plan with severity/effort ratings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(security): cross-reference v1 controls — use, don't reinvent

Updated security plan with detailed audit of ALL existing v1 security
controls and how they map to engine v2 bridge gaps:

Key finding: v1 already has solutions for every security gap identified.
The bridge just needs to wire them in:

- Tool::requires_approval() exists but bridge doesn't call it
- safety.wrap_for_llm() exists but tool results enter context unwrapped
- RateLimiter exists but bridge doesn't check rate limits
- BeforeToolCall hooks exist but bridge doesn't run them
- redact_params() exists but bridge doesn't redact sensitive params
- Shell risk classification (Low/Medium/High) is inherited but ignored

Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter,
not new security infrastructure. The bridge is the security boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy

- Add Mission type and MissionManager for recurring thread scheduling
- Add ReliabilityTracker for per-capability success/failure/latency tracking
- Add reflection executor that spawns CodeAct threads for post-completion reflection
- Extend PolicyEngine with provenance-aware taint checking (LLM-generated data
  requires approval for financial/external-write effects)
- Extend Store trait with mission CRUD methods
- Add conversation surface tracking, compaction token fix, context memory injection
- Wire new modules through lib.rs re-exports and bridge adapters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire v1 security controls into engine v2 adapter

Zero engine crate changes. All security controls enforced at the bridge
boundary in EffectBridgeAdapter:

1. Tool approval (v1: Tool::requires_approval):
   - Checks each tool's approval requirement with actual params
   - Always → returns EngineError::LeaseDenied (blocks execution)
   - UnlessAutoApproved → checks auto_approved set, blocks if not approved
   - Never → proceeds
   - Per-session auto_approved HashSet (for future "always" handling)

2. Hook interception (v1: BeforeToolCall):
   - Runs HookEvent::ToolCall before every execution
   - HookOutcome::Reject → blocks with reason
   - HookError::Rejected → blocks with reason
   - Hook errors → fail-open (logged, execution continues)

3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm):
   - Leak detection: API keys in tool output are redacted
   - Policy enforcement: content policy rules applied
   - Length truncation: output capped at 100KB
   - XML boundary protection: prevents injection via tool output

4. Sensitive param redaction (v1: redact_params):
   - Tool's sensitive_params() consulted before hooks see parameters
   - Redacted params sent to hooks, original params used for execution

5. available_actions() now sets requires_approval based on each tool's
   default approval requirement, so the engine's PolicyEngine can
   gate tools it hasn't seen before.

6. Actual execution timing measured via Instant::now() (replaces
   placeholder Duration::from_millis(1)).

Accessor visibility: hooks() widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): implement tool approval flow for engine v2

Adds a complete approval flow that mirrors v1 behavior, using the
existing v1 security controls (Tool::requires_approval, auto-approve
sets, StatusUpdate::ApprovalNeeded).

## How it works

### Step 1: Tool blocked at execution
When the LLM's code calls a tool (e.g., `shell("ls")`):
1. EffectBridgeAdapter.execute_action() looks up the Tool object
2. Calls tool.requires_approval(&params) — returns ApprovalRequirement
3. If Always → EngineError::LeaseDenied (always blocks)
4. If UnlessAutoApproved → checks auto_approved HashSet → if not in set,
   returns EngineError::LeaseDenied
5. If Never → proceeds to execution

### Step 2: Engine returns NeedApproval
The LeaseDenied error propagates through:
- CodeAct path: becomes Python RuntimeError, code halts, thread returns
  NeedApproval with action_name + parameters
- Structured path: same via ActionResult.is_error

### Step 3: Router stores pending approval
- PendingApproval { action_name, original_content } stored on EngineState
- StatusUpdate::ApprovalNeeded sent to channel (shows approval card in
  CLI/web with tool name, parameters, yes/always/no buttons)
- Returns text: "Tool 'shell' requires approval. Reply yes/always/no."

### Step 4: User responds
handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2:
- 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes
  original message (tool now passes the approval check on second run)
- 'always' → same + logs for session persistence
- 'no' → returns "Denied: tool was not executed."

### Key design choice
Instead of pausing/resuming mid-execution (which needs engine changes
to freeze/restore the Monty VM state), we auto-approve the tool and
re-run the full message. The EffectBridgeAdapter's auto_approved set
persists across runs, so the second execution passes immediately.

This trades one extra LLM call for zero engine modifications.

## Files changed
- src/bridge/router.rs: PendingApproval struct, handle_approval(),
  NeedApproval → StatusUpdate::ApprovalNeeded conversion
- src/bridge/mod.rs: export handle_approval
- src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2
- src/bridge/effect_adapter.rs: fmt fixes

151 tests passing, clippy + fmt clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): demote trace/reflection logging from info to debug

INFO-level log output from background tasks (trace analysis, reflection)
corrupts the REPL terminal UI. The trace summary, issue warnings, and
reflection doc previews were printing mid-approval-card, breaking the
interactive display.

Fix: all info!-level logging in trace.rs demoted to debug! (warn! kept
for genuine issues).
Trace analysis and reflection results now only show when
RUST_LOG=ironclaw_engine=debug is set.

Also added logging discipline rule to global CLAUDE.md:
- info! → user-facing status the REPL intentionally renders
- debug! → internal diagnostics (traces, reflection, engine internals)
- Background tasks must NEVER use info! — it breaks the TUI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): demote all router info! logging to debug!

"engine v2: initializing" and "engine v2: handling message" were
printing at INFO level, corrupting the REPL UI. All router logging
now uses debug! — only visible with RUST_LOG=ironclaw=debug.

Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(safety): demote leak detector warn-action logs from warn! to debug!

The leak detector's Warn-action matches (high_entropy_hex pattern on
web search results containing commit SHAs, CSS colors, URL hashes)
were being logged at warn! level, corrupting the REPL UI with lines like:
  WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5

These are informational false positives — real leaks use LeakAction::Redact
which silently modifies the content. Warn-action matches only log for
debugging purposes and should not appear in production output.

Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): strengthen CodeAct prompt to prevent shallow text answers

The model was answering "Suggested 45 improvements" as a brief text
summary from training data without actually searching or listing them.
The trace showed: no code block, no tool calls, no FINAL().

Prompt changes:
- Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with
  plain text only." (was: "Always write code... plain text for brief
  explanations")
- Rule 2 (NEW): "NEVER answer from memory or training data alone.
  Always use tools to get real, current information before answering."
- Rule 3: FINAL answer "should be detailed and complete — not just a
  summary like 'found 45 items'"
- Rule 8 (NEW): "Include the actual content in your FINAL() answer,
  not just a count or summary. Users want to see the details."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): persist reflection docs to workspace for cross-session learning

Replaces InMemoryStore with HybridStore:
- Ephemeral data (threads, steps, events, leases) stays in-memory
- MemoryDocs (lessons, specs, playbooks from reflection) persist to
  the workspace at engine/docs/{type}/{id}.json

On engine init, load_docs_from_workspace() reads existing docs back
into the in-memory cache. This means:
- Lessons learned in session 1 are available in session 2
- The RetrievalEngine injects relevant past lessons into new threads
- The engine genuinely improves over time as reflection accumulates

Workspace paths:
  engine/docs/lessons/{uuid}.json
  engine/docs/specs/{uuid}.json
  engine/docs/playbooks/{uuid}.json
  engine/docs/summaries/{uuid}.json
  engine/docs/issues/{uuid}.json

No new database tables. Uses existing workspace write/read/list.
workspace() accessor widened to pub(crate).
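
The path layout and load-back behavior described above can be sketched in Python (the real HybridStore is Rust; function and field names here are illustrative):

```python
import json
from pathlib import Path

# The five doc types listed in the commit message.
DOC_TYPES = ("lessons", "specs", "playbooks", "summaries", "issues")

def docs_dir(root: Path, doc_type: str) -> Path:
    if doc_type not in DOC_TYPES:
        raise ValueError(f"unknown doc type: {doc_type}")
    return root / "engine" / "docs" / doc_type

def doc_path(root: Path, doc_type: str, doc_id: str) -> Path:
    return docs_dir(root, doc_type) / f"{doc_id}.json"

def persist_doc(root: Path, doc_type: str, doc_id: str, doc: dict) -> None:
    path = doc_path(root, doc_type, doc_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc))

def load_docs_from_workspace(root: Path, doc_type: str) -> dict:
    """Read persisted docs back into an in-memory map on engine init."""
    d = docs_dir(root, doc_type)
    if not d.exists():
        return {}
    return {p.stem: json.loads(p.read_text()) for p in sorted(d.glob("*.json"))}
```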

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): adapt to execute_tool_with_safety params-by-value change

Staging merge changed execute_tool_with_safety to take params by value
instead of by reference (perf optimization from PR #926). Updated
bridge adapter to clone params before passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): add web gateway integration plan to Phase 6

Documents three gaps between engine v2 and the web gateway:
1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent)
2. No conversation persistence (engine uses HybridStore, gateway reads v1 DB)
3. No cross-channel visibility (REPL ↔ web messages invisible to each other)

Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1
conversation tables after thread completion. Prerequisite: AppEvent
extraction PR (in progress separately).

Also updated DB persistence status: HybridStore with workspace-backed
MemoryDocs is now implemented (partial persistence).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): document routine/job gap and SIGKILL crash scenario

Routines are entirely v1 — not hooked up to engine v2. When a user
asks "create a routine" as natural language, engine v2 tries to call
routine_create via CodeAct, but the tool needs RoutineEngine + Database
refs that the bridge's minimal JobContext doesn't provide. This caused
a SIGKILL crash during testing.

Options documented: block routine tools in v2 (short term), pass refs
through context (medium), replace with Mission system (long term).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+
modules across agent, orchestrator, worker, tools, and extensions — it
had become the application-wide event protocol, not a web transport
concern.

Create crates/ironclaw_common as a shared workspace crate and move the
enum there as AppEvent. Also move the truncate_preview utility, which
had similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening
where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.
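
The idea behind truncate_preview can be sketched as (the real signature and defaults in ironclaw_common are not shown in this PR; a minimal Python sketch of the behavior, under assumed parameters):

```python
def truncate_preview(text: str, max_len: int = 200) -> str:
    """Truncate text for log/UI previews, marking the cut with an ellipsis."""
    if len(text) <= max_len:
        return text
    return text[: max_len - 3] + "..."
```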

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB

Three changes to make engine v2 visible in the web gateway:

1. SSE event streaming (AppEvent broadcast):
   - ThreadEvent → AppEvent conversion via thread_event_to_app_event()
   - Events broadcast to SseManager during the poll loop
   - Covers: Thinking, ToolCompleted (success/error), Status, Response
   - Web gateway receives real-time progress without any gateway changes

2. Conversation persistence to v1 database:
   - After thread completes, writes user message + agent response to
     v1 ConversationStore via add_conversation_message()
   - Uses get_or_create_assistant_conversation() for per-user per-channel
   - Web gateway reads from DB as usual — chat history appears

3. Final response broadcast:
   - AppEvent::Response with full text + thread_id sent via SSE
   - Web gateway renders the response in the chat UI
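
The ThreadEvent → AppEvent conversion above can be sketched as a mapping function (event names and dict shapes are illustrative; the real conversion is Rust enum-to-enum):

```python
def thread_event_to_app_event(event: dict):
    """Map an engine ThreadEvent-shaped dict to an AppEvent-shaped dict;
    return None for events the gateway does not surface."""
    kind = event.get("kind")
    if kind == "thinking":
        return {"type": "Thinking", "text": event["text"]}
    if kind == "tool_completed":
        return {
            "type": "ToolCompleted",
            "tool": event["tool"],
            "ok": event.get("error") is None,
        }
    if kind == "status":
        return {"type": "Status", "text": event["text"]}
    if kind == "response":
        return {"type": "Response", "text": event["text"],
                "thread_id": event["thread_id"]}
    return None  # unmapped events are dropped, not forwarded
```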

New EngineState fields: sse (Option<Arc<SseManager>>),
db (Option<Arc<dyn Database>>). Both populated from Agent.deps.

Agent.deps visibility widened to pub(crate).

Depends on: ironclaw_common crate with AppEvent type (PR #1615).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits

Three security/stability improvements in EffectBridgeAdapter:

1. V1-only tool blocking:
   - routine_create, create_job, build_software (and hyphenated variants)
     return helpful error: "use the slash command instead"
   - Filtered out of available_actions() so system prompt doesn't list them
   - Prevents crash from tools needing RoutineEngine/Scheduler refs

2. Per-step tool call limit:
   - Max 50 tool calls per code block (AtomicU32 counter)
   - Prevents amplification: `for i in range(10000): shell(...)`
   - Returns "call limit reached, break into multiple steps"

3. Rate limiting:
   - Per-user per-tool sliding window via RateLimiter
   - Checks tool.rate_limit_config() before every execution
   - Returns "rate limited, try again in Ns"

Architecture plan updated:
- Gateway integration: DONE
- Routines: BLOCKED (gracefully, with slash command fallback)
- Rate limiting: DONE
- Call limit: DONE
- Phase 6 status: DONE (remaining: acceptance tests, two-phase commit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Mission system design — goal-oriented autonomous threads

Missions replace routines with evolving, knowledge-accumulating
autonomous agents. Unlike routines (fixed prompt, stateless), Missions:

- Generate prompts from accumulated Project knowledge (lessons,
  playbooks, issues from prior threads)
- Adapt approach when something fails repeatedly
- Track progress toward a goal with success criteria
- Self-manage: pause when stuck, complete when goal achieved

Architecture: MissionManager with cron ticker spawns threads via
ThreadManager. Meta-prompt built from mission goal + Project MemoryDocs
via RetrievalEngine. Reflection feeds back automatically.

6-step implementation plan: cron trigger, meta-prompt builder, bridge
wiring, CodeAct tools, progress tracking, persistence.

Includes two worked examples: daily tech news briefing (ongoing) and
test coverage improvement (goal-driven, self-completing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): extend Mission types with webhook/event triggers + evolving strategy

Mission types updated to support external activation sources:

MissionCadence expanded:
- Cron { expression, timezone } — timezone-aware scheduling
- OnEvent { event_pattern } — channel message pattern matching
- OnSystemEvent { source, event_type } — structured events from tools
- Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.)
- Manual — explicit triggering only

The engine defines trigger TYPES. The bridge implements infrastructure
(cron ticker, webhook endpoints, event matchers). GitHub issues, PRs,
email, Slack events all use the generic Webhook cadence — no
special-casing in the engine. Webhook payload injected as
state["trigger_payload"] in the thread's Python context.

Mission struct extended:
- current_focus: what the next thread should work on (evolving)
- approach_history: what we've tried (for adaptation)
- max_threads_per_day / threads_today: daily budget
- last_trigger_payload: webhook/event data for thread context

Plan updated with trigger type table and webhook integration design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): implement MissionManager execution with meta-prompts

The MissionManager now builds evolving meta-prompts and processes
thread outcomes for continuous learning:

fire_mission() upgraded:
- Loads Project MemoryDocs via RetrievalEngine for context
- Builds meta-prompt from: goal, current_focus, approach_history,
  project knowledge docs, trigger payload, thread count
- Spawns thread with meta-prompt as user message
- Background task waits for completion and processes outcome
- Daily thread budget enforcement (max_threads_per_day)

Meta-prompt structure:
  # Mission: {name}
  Goal: {goal}
  ## Current Focus (evolves between threads)
  ## Previous Approaches (what we've tried)
  ## Knowledge from Prior Threads (lessons, playbooks, issues)
  ## Trigger Payload (webhook/event data if applicable)
  ## Instructions (accomplish step, report next focus, check goal)
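
The meta-prompt assembly above can be sketched as (field names and fallback text are illustrative; the real builder lives in MissionManager):

```python
def build_meta_prompt(mission: dict, docs: list, payload=None) -> str:
    """Assemble the meta-prompt sections shown in the structure above."""
    parts = [
        f"# Mission: {mission['name']}",
        f"Goal: {mission['goal']}",
        "## Current Focus\n" + mission.get("current_focus", "(none yet)"),
        "## Previous Approaches\n"
        + "\n".join(mission.get("approach_history", []) or ["(none)"]),
        "## Knowledge from Prior Threads\n" + "\n".join(docs or ["(none)"]),
    ]
    if payload:
        parts.append("## Trigger Payload\n" + payload)
    parts.append(
        "## Instructions\nAccomplish the current focus, report the "
        "next focus, and state whether the goal is achieved."
    )
    return "\n\n".join(parts)
```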

Outcome processing:
- Extracts "next focus:" from FINAL() response → updates current_focus
- Detects "goal achieved: yes" → completes mission
- Records accomplishment in approach_history
- Failed threads recorded as "FAILED: {error}"
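
The marker extraction above can be sketched as (markers follow the commit message; mission field names are illustrative):

```python
import re

def process_outcome(final_text: str, mission: dict) -> dict:
    """Update a mission dict from a thread's FINAL() text."""
    m = re.search(r"next focus:\s*(.+)", final_text, re.IGNORECASE)
    if m:
        mission["current_focus"] = m.group(1).strip()
    if re.search(r"goal achieved:\s*yes", final_text, re.IGNORECASE):
        mission["status"] = "completed"
    # Record what this thread did, truncated for the history list.
    mission.setdefault("approach_history", []).append(final_text[:80])
    return mission
```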

Cron ticker:
- start_cron_ticker() spawns tokio task, ticks every 60s
- Checks active Cron missions, fires those past next_fire_at

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire MissionManager into engine v2 for CodeAct access

Missions are now callable from CodeAct Python code:

```python
# Create a daily briefing mission
result = mission_create(
    name="Tech News",
    goal="Daily AI/crypto/software news briefing",
    cadence="0 9 * * *"
)

# List all missions
missions = mission_list()

# Manually fire a mission
mission_fire(id="...")

# Pause/resume
mission_pause(id="...")
mission_resume(id="...")
```

Implementation:
- MissionManager created on engine init, cron ticker started
- EffectBridgeAdapter intercepts mission_* function calls before tool
  lookup and routes to MissionManager
- parse_cadence() handles: "manual", cron expressions, "event:pattern",
  "webhook:path"
- Mission functions documented in CodeAct system prompt
- MissionManager set on adapter via set_mission_manager() after init
  (avoids circular dependency)
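
The parse_cadence shapes can be sketched as (return shapes mirror the MissionCadence variants from the earlier commit; the dict encoding is illustrative):

```python
def parse_cadence(spec: str) -> dict:
    """Parse the cadence shorthand: "manual", "event:<pattern>",
    "webhook:<path>", else treat the string as a cron expression."""
    if spec == "manual":
        return {"kind": "Manual"}
    if spec.startswith("event:"):
        return {"kind": "OnEvent", "event_pattern": spec[len("event:"):]}
    if spec.startswith("webhook:"):
        return {"kind": "Webhook", "path": spec[len("webhook:"):]}
    return {"kind": "Cron", "expression": spec}
```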

System prompt updated with mission_create, mission_list, mission_fire,
mission_pause, mission_resume documentation.

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): map routine_* calls to mission operations in v2

When the model calls routine_create, routine_list, routine_fire,
routine_pause, routine_resume, or routine_delete, the bridge now
routes them to the MissionManager instead of blocking with an error.

Mapping:
  routine_create → mission_create (with cadence parsing)
  routine_list   → mission_list
  routine_fire   → mission_fire
  routine_pause  → mission_pause
  routine_resume → mission_resume
  routine_update → mission_pause/resume (based on params)
  routine_delete → mission_complete (marks as done)
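
The static part of this mapping can be sketched as a lookup table (routine_update is omitted here because it dispatches on params; None means "fall through to normal tool lookup"):

```python
ROUTINE_TO_MISSION = {
    "routine_create": "mission_create",
    "routine_list": "mission_list",
    "routine_fire": "mission_fire",
    "routine_pause": "mission_pause",
    "routine_resume": "mission_resume",
    "routine_delete": "mission_complete",
}

def route_call(name: str):
    """Translate a v1 routine tool name to its mission equivalent,
    or None if the call should go through normal tool lookup."""
    return ROUTINE_TO_MISSION.get(name)
```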

Routine tools removed from v1-only blocklist and restored in
available_actions(). The model can use either "routine" or "mission"
vocabulary — both work.

Still blocked: create_job, cancel_job, build_software (need v1
Scheduler/ContainerJobManager refs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add E2E mission flow tests — 6 new tests

Comprehensive mission lifecycle tests:

- fire_mission_builds_meta_prompt_with_goal: verifies thread spawned
  with project context and recorded in history
- outcome_processing_extracts_next_focus: "Next focus: X" in FINAL()
  response → mission.current_focus updated
- outcome_processing_detects_goal_achieved: "Goal achieved: yes" →
  mission status transitions to Completed
- mission_evolves_via_direct_outcome_processing: 3-step evolution:
  step 1 sets focus to "db module", step 2 evolves to "tools module",
  step 3 detects goal achieved → mission completes. Tests the full
  learning loop without background task timing dependencies.
- fire_with_trigger_payload: webhook payload stored on mission and
  threads_today counter incremented
- daily_budget_enforced: max_threads_per_day=1 → first fire succeeds,
  second returns None

157 tests passing (151 prior + 6 new mission E2E).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): self-improving engine via Mission system

Wire the self-improvement loop as a Mission with OnSystemEvent cadence,
inspired by karpathy/autoresearch's program.md approach. The mission
fires when threads complete with issues, receives trace data as trigger
payload, and uses tools directly to diagnose and fix problems.

Key changes:

Engine self-improvement (Phase A+B from design doc):
- Add fire_on_system_event() to MissionManager for OnSystemEvent cadence
- Add start_event_listener() that subscribes to thread events and fires
  matching missions when non-Mission threads complete with trace issues
- Add ensure_self_improvement_mission() with autoresearch-style goal
  prompt (concrete loop steps, not vague instructions)
- Add process_self_improvement_output() for structured JSON fallback
- Seed fix pattern database with 8 known patterns from debugging
- Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now
  async + Store-aware, appends learned rules from prompt_overlay docs)
- Pass Store to ExecutionLoop for overlay loading

Bridge review fixes (P1/P2):
- Scope engine v2 SSE events to requesting user (broadcast_for_user)
- Per-user pending approvals via HashMap instead of global Option
- Reset tool-call limit counter before each thread execution
- Only persist auto-approval when user chose "always", not one-off "yes"
- Remove dead store/mission_manager fields from EngineState

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add checkpoint-based engine thread recovery

* feat(engine): add Python orchestrator module and host functions

Add the orchestrator infrastructure for replacing the Rust execution
loop with versioned Python code. This commit adds the module and host
functions without switching over — the existing Rust loop is unchanged.

New files:
- orchestrator/default.py: v0 Python orchestrator (run_loop + helpers)
- executor/orchestrator.rs: host function dispatch, orchestrator
  loading from Store with version selection, OrchestratorResult parsing

Host functions exposed to orchestrator Python via Monty suspension:
  __llm_complete__, __execute_code_step__ (nested Monty VM),
  __execute_action__, __check_signals__, __emit_event__,
  __add_message__, __save_checkpoint__, __transition_to__,
  __retrieve_docs__, __check_budget__, __get_actions__

Also makes json_to_monty, monty_to_json, monty_to_string pub(crate)
in scripting.rs for cross-module use.

Design doc: docs/plans/2026-03-25-python-orchestrator.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): switch ExecutionLoop::run() to Python orchestrator

Replace the 900-line Rust execution loop with a ~80-line bootstrap
that loads and runs the versioned Python orchestrator via Monty VM.

The orchestrator Python code (orchestrator/default.py) is the v0
compiled-in version. Runtime versions can override it via MemoryDoc
storage (orchestrator:main with tag orchestrator_code).

Key fixes during switchover:
- Use ExtFunctionResult::NotFound for unknown functions so Monty
  falls through to Python-defined functions (extract_final, etc.)
- Move helper function definitions above run_loop for Monty scoping
- Use FINAL result value (not VM return value) in Complete handler
- Rename 'final' variable to 'final_answer' to avoid Python keyword

Status: 171/177 tests pass. 6 remaining failures are step_count and
token tracking bookkeeping — the orchestrator manages these internally
but doesn't yet update the thread's counters via host functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): all 177 tests pass with Python orchestrator

- Increment step_count and track tokens in __emit_event__("step_completed")
  so thread bookkeeping matches the old Rust loop behavior
- Remove double-counting of tokens in bootstrap (orchestrator handles it)
- Match nudge text to existing TOOL_INTENT_NUDGE constant
- Fix FINAL result propagation (use stored final_result, not VM return)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): orchestrator versioning, auto-rollback, and tests

Add version lifecycle for the Python orchestrator:
- Failure tracking via MemoryDoc (orchestrator:failures)
- Auto-rollback: after 3 consecutive failures, skip the latest version
  and fall back to previous (or compiled-in v0)
- Success resets the failure counter
- OrchestratorRollback event for observability
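
The selection policy ("highest version wins, unless it has hit the failure threshold") can be sketched as (a sketch of the policy only, not the Rust code; the failure store is really a MemoryDoc):

```python
def select_orchestrator(versions: dict, failures: dict, max_failures: int = 3):
    """Pick the highest version whose consecutive-failure count is below
    the rollback threshold. Version 0 is the compiled-in default and is
    always eligible as a last resort."""
    for v in sorted(versions, reverse=True):
        if v == 0 or failures.get(v, 0) < max_failures:
            return v, versions[v]
    return 0, versions[0]
```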

Update self-improvement Mission goal with Level 1.5 instructions for
orchestrator patches — the agent can now modify the execution loop
itself via memory_write with versioned orchestrator docs.

12 new tests: version selection (highest wins), rollback after failures,
rollback to default, failure counting/resetting, outcome parsing for
all 5 ThreadOutcome variants.

189 tests pass, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 architecture, self-improvement, and dev history

Three new docs for contributors:

- engine-v2-architecture.md: Two-layer architecture (Rust kernel +
  Python orchestrator), five primitives, execution model with nested
  Monty VMs, bridge layer, memory/reflection, missions, capabilities

- self-improvement.md: Three improvement levels (prompt/orchestrator/
  config/code), autoresearch-inspired Mission loop, versioned
  orchestrator with auto-rollback, fix pattern database, safety model

- development-history.md: Summary of 6 Claude Code sessions that
  built the system, key design decisions and debugging moments,
  architecture evolution from 900-line Rust loop to Python orchestrator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): complete v2 side-by-side integration with gateway API

Wire engine v2 into the full submission pipeline and expose threads,
projects, and missions through the web gateway REST API.

Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear
submissions to engine v2 when ENGINE_V2=true. Previously only UserInput
and ApprovalResponse were handled; all other control commands fell
through to disconnected v1 sessions.

Bridge query layer — add 11 read-only query functions and 6 DTO types
so gateway handlers can inspect engine state (threads, steps, events,
projects, missions) without direct access to the EngineState singleton.

Gateway endpoints — new /api/engine/* routes:
  GET  /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events
  GET  /projects, /projects/{id}
  GET  /missions, /missions/{id}
  POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume

SSE events — add ThreadStateChanged, ChildThreadSpawned, and
MissionThreadSpawned AppEvent variants. Expand the bridge event mapper
to forward StateChanged and ChildSpawned engine events to the browser.

Engine crate — add ConversationManager::clear_conversation() for /new
and /clear commands.

Code quality — replace 10 .expect() calls with proper error returns,
remove dead AgentConfig.engine_v2 field, log silent init errors, fix
duplicate doc comment, improve fallthrough documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): empty call_id on ActionResult and trace analyzer false positives

Fix structured executor not stamping call_id onto ActionResult — the
EffectExecutor trait doesn't receive call_id, so the structured executor
must copy it from the original ActionCall after execution. Empty call_id
caused OpenAI-compatible providers to reject the next LLM request with
"Invalid 'input[2].call_id': empty string".

Fix trace analyzer false positives:
- code_error check now only scans User-role code output messages
  (prefixed with [stdout]/[stderr]/[code ]/Traceback), not System
  prompt which contains example error text
- missing_tool_output check now recognizes ActionResult messages as
  valid tool output (Tier 0 structured path)
- Add NotImplementedError to detected code error patterns

New trace checks:
- empty_call_id: detect ActionResult messages with missing/empty
  call_id before they reach the LLM API (severity: Error)
- llm_error: extract LLM provider errors from Failed state reason
- orchestrator_error: extract orchestrator errors from Failed state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): add Missions tab to gateway UI

Add a full Missions page to the web gateway with list view, detail view,
and action buttons (Fire, Pause, Resume).

Backend: add /api/engine/missions/summary endpoint returning counts by
status (active/paused/completed/failed).

Frontend:
- New "Missions" tab between Jobs and Routines
- Summary cards showing mission counts by status
- Table with name, goal, cadence type, thread count, status, actions
- Detail view with goal, cadence, current focus, success criteria,
  approach history, spawned thread list, and action buttons
- Fire/Pause/Resume actions with toast notifications
- i18n support (English + Chinese)
- CSS following the existing routines/jobs patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): eagerly initialize engine v2 at startup

The gateway API endpoints (/api/engine/missions, etc.) call bridge
query functions that return empty results when the engine state hasn't
been initialized yet. Previously, initialization only happened lazily
on the first chat message via handle_with_engine().

Now when ENGINE_V2=true, the engine is initialized in Agent::run()
before channels start, so the self-improvement mission and other
engine state is available to gateway API endpoints immediately.

Also rename get_or_init_engine → init_engine and make it public so
it can be called from agent_loop.rs at startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): improve mission detail with markdown goal and thread table

- Goal rendered as full-width markdown block instead of plain-text
  meta item (uses existing renderMarkdown/marked)
- Current focus and success criteria also rendered as markdown
- Spawned threads shown as a clickable table with goal, type, state,
  steps, tokens, and created date instead of a UUID list
- Clicking a thread row opens an inline thread detail view showing
  metadata grid and full message history with markdown rendering
- Back button returns to the mission detail view
- Backend: mission detail now returns full thread summaries (goal,
  state, step_count, tokens) instead of just thread IDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): close SSE connections on page unload to prevent connection starvation

The browser limits concurrent HTTP/1.1 connections per origin to 6.
Without cleanup, SSE connections from prior page loads linger after
refresh/navigation, eating into the pool. After 2-3 refreshes, all 6
slots are consumed by stale SSE streams and new API fetch calls queue
indefinitely — the UI shows "connected" (SSE works) but data never
loads.

Add a beforeunload handler that closes both eventSource (chat events)
and logEventSource (log stream) so the browser can reuse connections
immediately on page reload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): support multiple gateway tabs by reducing SSE connections

Each browser tab opened 2 SSE connections (chat events + log events).
With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the
pool and couldn't load any data.

Three changes:

1. Lazy log SSE — only connect when the logs tab is active, disconnect
   when switching away. Most users rarely view logs, so this saves a
   connection slot per tab.

2. Visibility API — close SSE when the browser tab goes to background
   (user switches to another tab), reconnect when it becomes visible.
   Background tabs don't need real-time events.

3. Combined with the existing beforeunload cleanup, this means:
   - Active foreground tab: 1 connection (chat SSE only, +1 if logs tab)
   - Background tabs: 0 connections
   - Closed/refreshed tabs: 0 connections (beforeunload cleanup)

This allows many gateway tabs to coexist within the 6-connection limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): route messages to correct conversation by thread scope

Messages sent from a new conversation in the gateway always appeared in
the default assistant conversation because handle_with_engine ignored
the thread_id from the frontend.

Two fixes:

1. Engine conversation scoping — when the message carries a thread_id
   (from the frontend's conversation picker), use it as part of the
   engine conversation key: "gateway:<thread_id>" instead of just
   "gateway". This creates a distinct engine conversation per v1
   thread, so messages don't cross-contaminate.

2. V1 dual-write targeting — write user messages and assistant
   responses to the v1 conversation matching the thread_id (via
   ensure_conversation), not the hardcoded assistant conversation.
   Falls back to the assistant conversation when no thread_id is
   present (e.g., default chat).
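
The conversation-key scoping in fix 1 can be sketched as (a minimal sketch; "gateway" is the channel key from the commit message):

```python
def engine_conversation_key(channel: str, thread_id=None) -> str:
    """Scope the engine conversation per v1 thread when the frontend
    supplies a thread_id; fall back to the bare channel key otherwise."""
    return f"{channel}:{thread_id}" if thread_id else channel
```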

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): richer activity indicators for engine v2 execution

The gateway UI showed only generic "Thinking..." during engine v2
execution with no visibility into CodeAct code execution, tool calls,
or reflection. Now the event mapping produces detailed status updates:

Step lifecycle:
- "Calling LLM..." when a step starts (was "Thinki…