refactor: extract AppEvent to crates/ironclaw_common #1615
ilblackdragon merged 4 commits into staging from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on refactoring and enhancing the IronClaw application by introducing a common crate for shared types, improving LLM provider support, enhancing security, and improving the robustness of the system. The changes aim to decouple components, improve security, and provide a more flexible and reliable platform.

Highlights
a280bc5 to 3e8afb1 (Compare)
Addressed Copilot review in 3e8afb1: renamed leftover
zmanian left a comment
Review: refactor: extract AppEvent to crates/ironclaw_common
+500/-477 across 24 files
Issues (ranked by severity)
1. No Deserialize derive on AppEvent (Medium)
The new AppEvent only derives Serialize + Debug + Clone. The old SseEvent also only had Serialize, but now that AppEvent lives in a shared crate intended for reuse by other workspace members, downstream consumers (e.g. test harnesses, CLI tools, external integrations) will likely need to deserialize incoming events. Adding Deserialize now avoids a semver-ish breaking change to the shared crate later.
2. event_type() manually duplicates serde rename values (Medium)
The event_type() match arm strings must stay in sync with the #[serde(rename = "...")] attributes. If a variant is added and the developer forgets to update event_type(), the compiler won't catch it (the match is exhaustive on variants, but the string could be wrong). Consider deriving the event type string from serde metadata or adding a test that round-trips serialization and asserts event_type() matches the "type" field in the JSON output.
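One way to remove the drift risk entirely, sketched below with hypothetical names (not the crate's code): generate the variants and event_type() from a single table, so the wire name and the match arm can never disagree. The real enum carries fields and serde renames, which this unit-variant sketch omits.

```rust
// Hypothetical sketch: one macro invocation is the single source of truth
// for both the variant and its wire-name string.
macro_rules! app_events {
    ($($variant:ident => $wire_name:literal),* $(,)?) => {
        #[derive(Debug, Clone)]
        pub enum AppEvent { $($variant),* }

        impl AppEvent {
            pub fn event_type(&self) -> &'static str {
                match self { $(AppEvent::$variant => $wire_name),* }
            }
        }
    };
}

app_events! {
    AgentMessage => "agent_message",
    ToolResult => "tool_result",
}

fn main() {
    assert_eq!(AppEvent::AgentMessage.event_type(), "agent_message");
    assert_eq!(AppEvent::ToolResult.event_type(), "tool_result");
}
```

The round-trip test suggested above is the lighter-weight alternative if the macro feels too heavy for the crate.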
3. Stale variable/comment references to "SSE" remain (Low)
Several comments and at least one variable still reference "SSE" — e.g. src/channels/web/server.rs:1203 still says "Broadcast SSE event", and src/worker/job.rs doc comment was only partially updated. Copilot and Gemini already flagged specific instances. Not a functional issue, but undermines the refactor's goal of decoupling from SSE terminology.
4. truncate_preview moved but public API surface unchanged (Low)
The old crate::channels::web::util::truncate_preview is now a re-export of ironclaw_common::truncate_preview. This is clean, but the re-export means two import paths work — both the old path (via pub use) and the new ironclaw_common::truncate_preview. This is fine for backward compat, but consider deprecating the old path to guide callers toward the canonical import.
5. ironclaw_common crate uses edition = "2024" and rust-version = "1.92" (Nit)
Just confirming this is intentional and aligns with the workspace's MSRV. If the workspace targets an older MSRV, this could cause issues for contributors on older toolchains.
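One way to keep the new crate permanently aligned with the workspace MSRV is Cargo's workspace package inheritance — a sketch, assuming the root manifest pins these values once:

```toml
# Root Cargo.toml (sketch): pin edition and MSRV once for the whole workspace.
[workspace.package]
edition = "2024"
rust-version = "1.92"

# crates/ironclaw_common/Cargo.toml (sketch): inherit instead of restating.
[package]
name = "ironclaw_common"
edition.workspace = true
rust-version.workspace = true
publish = false
```

With inheritance, a future MSRV bump is a one-line change at the root rather than a per-crate audit.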
What's good
- Clean mechanical rename with zero semantic changes — the serde wire format is identical (same #[serde(rename)] values), so this is fully backward-compatible on the wire.
- The event_type() helper consolidates three duplicate match blocks (SSE, WS, types) into one, reducing ~70 lines of duplication.
- Tests were updated consistently across all 24 files.
- truncate_preview extraction with comprehensive unit tests in the new crate is solid.
- publish = false on the new crate prevents accidental crates.io publishing.
Verdict
Approve with suggestions — This is a clean, low-risk refactor. The wire format is unchanged, so there are no breaking changes for clients. The suggestions above (especially adding Deserialize and a round-trip test for event_type()) would strengthen the shared crate for future consumers but are not blockers.
…ments

Address zmanian review:
- Add Deserialize derive to AppEvent so downstream consumers can deserialize incoming events
- Add event_type_matches_serde_type_field test that round-trips every variant through serde and asserts event_type() matches the serialized "type" field — catches drift between serde renames and the manual match
- Add round_trip_deserialize test for basic Serialize/Deserialize parity
- Update remaining "SSE" references in comments across server.rs, manager.rs, ws_gateway_integration.rs, and worker/job.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressed zmanian's review in e0cad3b:
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.
```rust
// Walk backwards from max_bytes to find a valid char boundary
let mut end = max_bytes;
while end > 0 && !s.is_char_boundary(end) {
    end -= 1;
}
let mut result = format!("{}...", &s[..end]);

// Re-close <tool_output> if truncation cut through the closing tag.
if s.starts_with("<tool_output") && !result.ends_with("</tool_output>") {
```
truncate_preview appends a closing </tool_output> tag whenever the input starts with <tool_output and the truncated result doesn’t end with the closing tag. Since the truncated result always ends with ..., this will always append on truncation, and it can still produce malformed XML if the truncation point lands inside the closing tag (leaving a partial </tool_... fragment) or if the string isn’t actually wrapped (closing tag appears earlier / extra trailing content). Consider tightening the condition to only run when the original is actually wrapped (e.g., s.starts_with(..) && s.trim_end().ends_with("</tool_output>")) and, when truncating, ensure end never falls within the final closing tag (clamp end to the start of the closing tag before adding ... and re-appending the full closing tag).
Suggested change:

```rust
// Detect strings that are actually wrapped in a <tool_output>...</tool_output> pair.
let is_wrapped_tool_output = s.starts_with("<tool_output")
    && s.trim_end().ends_with("</tool_output>");
let closing_tag = "</tool_output>";
let closing_start = if is_wrapped_tool_output {
    s.rfind(closing_tag)
} else {
    None
};
// Walk backwards from an initial end position to find a valid char boundary.
// For wrapped <tool_output> strings, avoid truncating inside the closing tag
// by clamping `end` to the start of the final `</tool_output>`.
let mut end = max_bytes;
if let Some(close_start) = closing_start {
    if end > close_start {
        end = close_start;
    }
}
while end > 0 && !s.is_char_boundary(end) {
    end -= 1;
}
let mut result = format!("{}...", &s[..end]);
// Re-close <tool_output> if we truncated a string that was originally wrapped.
if is_wrapped_tool_output {
```
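Assembled into a standalone function, the suggested behavior looks like this sketch (a reimplementation for illustration, not the crate's actual truncate_preview):

```rust
// Standalone sketch: only treat the input as wrapped when it both starts AND
// ends with the tag pair, clamp the cut point so it never lands inside the
// final closing tag, then re-append the full closing tag.
fn truncate_preview_sketch(s: &str, max_bytes: usize) -> String {
    if s.len() <= max_bytes {
        return s.to_string();
    }
    const CLOSING_TAG: &str = "</tool_output>";
    let is_wrapped = s.starts_with("<tool_output") && s.trim_end().ends_with(CLOSING_TAG);
    let mut end = max_bytes;
    if is_wrapped {
        if let Some(close_start) = s.rfind(CLOSING_TAG) {
            end = end.min(close_start);
        }
    }
    // Walk backwards to a valid char boundary so multi-byte chars aren't split.
    while end > 0 && !s.is_char_boundary(end) {
        end -= 1;
    }
    let mut result = format!("{}...", &s[..end]);
    if is_wrapped {
        result.push_str(CLOSING_TAG);
    }
    result
}

fn main() {
    assert_eq!(truncate_preview_sketch("short", 100), "short");
    let t = truncate_preview_sketch("<tool_output>abcdefghij</tool_output>", 30);
    assert!(t.ends_with("</tool_output>") && t.contains("abcdefghij"));
    // Unwrapped input is never given a spurious closing tag.
    assert_eq!(truncate_preview_sketch("hello world, this is long", 5), "hello...");
}
```

This covers both failure modes Copilot names: a partial `</tool_...` fragment, and a closing tag appended to a string that was never wrapped.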
Re: Copilot comment. This is pre-existing behavior moved verbatim. Tightening the logic is a reasonable improvement but belongs in a separate PR to keep this refactor focused on the extraction.
* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+ modules across agent, orchestrator, worker, tools, and extensions — it had become the application-wide event protocol, not a web transport concern. Create crates/ironclaw_common as a shared workspace crate and move the enum there as AppEvent. Also move the truncate_preview utility, which was similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: add AppEvent::event_type() helper, deduplicate match blocks

Address Gemini review: extract the variant→string match into a single method on AppEvent, replacing the duplicated 22-arm matches in sse.rs and types.rs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename leftover sse vars/tests to match AppEvent rename

Address Copilot review: rename sse_event vars to app_event in orchestrator/api.rs and ws.rs, rename test functions from test_ws_server_from_sse_* to test_ws_server_from_app_event_*, and update stale SSE comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: add Deserialize to AppEvent, round-trip test, fix stale comments

Address zmanian review:
- Add Deserialize derive to AppEvent so downstream consumers can deserialize incoming events
- Add event_type_matches_serde_type_field test that round-trips every variant through serde and asserts event_type() matches the serialized "type" field — catches drift between serde renames and the manual match
- Add round_trip_deserialize test for basic Serialize/Deserialize parity
- Update remaining "SSE" references in comments across server.rs, manager.rs, ws_gateway_integration.rs, and worker/job.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…architecture) (#1557)

* v2 architecture phase 1

* feat(engine): Phase 2 — execution loop, capability system, thread runtime

Add the core execution engine to the ironclaw_engine crate:
- CapabilityRegistry: register/get/list capabilities and actions
- LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire)
- PolicyEngine: deterministic effect-level allow/deny/approve
- ThreadTree: parent-child relationship tracking
- ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc
- ThreadManager: spawn threads as tokio tasks, stop, inject messages, join
- ExecutionLoop: core loop replacing run_agentic_loop() with signals, context building, LLM calls, action execution, and event recording
- Structured executor (Tier 0): lease lookup → policy check → effect execution
- Tool intent nudge detection
- MemoryStore + RetrievalEngine stubs for Phase 4
- Full 8-phase architecture plan in docs/plans/
- CLAUDE.md spec for the engine crate

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 3 — Monty Python executor with RLM pattern

Add CodeAct execution (Tier 1) using the Monty embedded Python interpreter, following the Recursive Language Model (RLM) pattern from arXiv:2512.24601.

Key additions:
- executor/scripting.rs: Monty integration with FunctionCall-based tool dispatch, catch_unwind panic safety, resource limits (30s, 64MB, 1M allocs)
- LlmResponse::Code variant + ExecutionTier::Scripting
- Context-as-variables (RLM 3.4): thread messages, goal, step_number, previous_results injected as Python variables — LLM context stays lean while code accesses data selectively
- llm_query(prompt, context) (RLM 3.5): recursive subagent calls from within Python code — results stored as variables, not injected into the parent's attention window (symbolic composition)
- Compact output metadata between code steps instead of full stdout
- MontyObject ↔ serde_json::Value bidirectional conversion
- Updated architecture plan with RLM design principles

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): RLM best-practices enhancements from cross-reference analysis

Cross-referenced our implementation against the official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation.

Key enhancements:
- FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching all three reference implementations. Code can signal completion at any point, not just via return value.
- llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn, matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch.
- Output truncation increased to 8000 chars (from 120), matching Prime Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT].
- Step 0 orientation preamble: auto-injects context metadata (message count, total chars, goal, last user message preview) before the first code step, matching fast-rlm's auto-print pattern.
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors, OS errors, and async errors now flow back as stdout content instead of terminating the step, enabling LLM self-correction on the next iteration. Only VM panics (catch_unwind) terminate as EngineError.

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with RLM cross-reference learnings

Comprehensive update after cross-referencing against official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect (verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM.

Changes:
- Mark Phases 1-3 as DONE with commit refs and test counts
- Add "Key Influences" section documenting all reference implementations
- Phase 3: full table of implemented RLM features with sources
- Phase 3: "Remaining gaps" table with which phase addresses each
- Phase 4: expanded with compaction (85% context), rlm_query() (full recursive sub-agent), dual model routing, budget controls (USD, timeout, tokens, consecutive errors), lazy loading, pass-by-reference
- Add "RLM Execution Model" cross-cutting section
- Add "Implementation Progress" tracking table
- Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 4 — budget controls, compaction, reflection pipeline

Budget enforcement in ExecutionLoop:
- max_tokens_total: cumulative token limit, checked before each iteration
- max_duration: wall-clock timeout for the entire thread
- max_consecutive_errors: consecutive error steps threshold (resets on success, matching official RLM behavior)
- All produce ThreadOutcome::Failed with descriptive messages

Context compaction (from the RLM paper, 85% threshold):
- estimate_tokens(): char-based estimation (chars/4, matching RLM)
- should_compact(): triggers when tokens >= threshold_pct * context_limit
- compact_messages(): asks the LLM to summarize progress, replaces history with [system, summary, continuation_note], preserves intermediate results
- Configurable via ThreadConfig: model_context_limit, compaction_threshold

Dual model routing:
- LlmCallConfig gains a depth field (0=root, 1+=sub-call)
- Implementations can route to cheaper models for sub-calls
- ExecutionLoop passes thread depth to every LLM call

Reflection pipeline (reflection/pipeline.rs):
- reflect(thread, llm): analyzes the completed thread via LLM
- Produces a Summary doc (always), Lesson doc (if errors), Issue doc (if failed)
- Builds a transcript from thread messages + error events
- Returns ReflectionResult with docs + token usage

ThreadConfig extended with: max_tokens_total, max_consecutive_errors, model_context_limit, enable_compaction, compaction_threshold, depth, max_depth.

78 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 5 — conversation surface separated from execution

Conversation is now a UI layer, not an execution boundary. Multiple threads can run concurrently within one conversation; threads can outlive their originating conversation.

New types (types/conversation.rs):
- ConversationSurface: channel + user + entries + active_threads
- ConversationEntry: sender (User/Agent/System) + content + origin_thread_id
- ConversationId, EntryId (UUID newtypes)
- EntrySender enum (User, Agent{thread_id}, System)

ConversationManager (runtime/conversation.rs):
- get_or_create_conversation(channel, user) — indexed by (channel, user)
- handle_user_message() — injects into the active foreground thread or spawns a new one
- record_thread_outcome() — adds agent/system entries, untracks completed threads
- get_conversation(), list_conversations()

This enables the key architectural insight: a user can ask "what's the weather?" while a deployment thread is still running. Both produce entries in the same conversation.

85 tests passing, zero clippy warnings.
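The Phase 4 compaction trigger above reduces to two small functions. A sketch using the constants named in the commit (chars/4 estimate, 85% threshold); the function shapes are assumptions, not the engine's actual signatures:

```rust
// Char-based token estimate, matching the RLM paper's chars/4 heuristic.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

// Compaction fires once estimated usage crosses threshold_pct of the limit.
fn should_compact(used_tokens: usize, context_limit: usize, threshold_pct: f64) -> bool {
    used_tokens as f64 >= threshold_pct * context_limit as f64
}

fn main() {
    // 85% of a 1000-token context: compaction triggers at 850 used tokens.
    assert!(!should_compact(849, 1000, 0.85));
    assert!(should_compact(850, 1000, 0.85));
    assert_eq!(estimate_tokens("twelve chars"), 3);
}
```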
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM

Restructure phases 6-8 to clarify the execution model:
- Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker Python runtimes for LLM-generated code.
- WASM sandbox is for third-party tool isolation (existing infra, Phase 8)
- Docker containers are for thread-level isolation of high-risk work (Phase 8)
- Two-phase commit moves to Phase 6 (integration) at the adapter boundary

Phase renumbering:
- Old Phase 6 (Tier 2-3) → removed as a separate phase
- Old Phase 7 (integration) → Phase 6
- Old Phase 8 (cleanup) → Phase 7
- New Phase 8: WASM tools + Docker thread isolation (infra integration)

Updated progress table: Phases 1-5 marked DONE with test counts and commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 6 — bridge adapters for main crate integration

Strategy C parallel deployment: when the ENGINE_V2=true env var is set, user messages route through the engine instead of the existing agentic loop. All existing behavior is unchanged when the flag is off.

Bridge module (src/bridge/):
- LlmBridgeAdapter: wraps LlmProvider as engine LlmBackend, converts ThreadMessage↔ChatMessage, ActionDef↔ToolDefinition, depth-based model routing (primary vs cheap_llm)
- EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as EffectExecutor, routes tool calls through the existing execute_tool_with_safety pipeline
- InMemoryStore: HashMap-backed Store impl (no DB tables needed yet)
- EngineRouter: is_engine_v2_enabled() + handle_with_engine() that builds the engine from Agent deps and processes messages end-to-end

Integration touchpoint (4 lines in agent_loop.rs): after hook processing, before session resolution, check the ENGINE_V2 flag and route UserInput through the engine path.

Accessor visibility widened: llm(), cheap_llm(), safety(), tools() changed from pub(super) to pub(crate) for bridge access.

85 engine tests + main crate clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add user message and system prompt to thread before execution

The ExecutionLoop was sending empty messages to the LLM because the thread was spawned with the user's input as the goal but no messages.

Fixes:
- ThreadManager.spawn_thread() now adds the goal as an initial user message before starting the execution loop
- ExecutionLoop.run() injects a default system prompt if none exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): match existing LLM request format to prevent 400 errors

The LLM bridge was missing several defaults that the existing Reasoning.respond_with_tools() sets:
- tool_choice: "auto" when tools are present (required by some providers)
- max_tokens: 4096 (default)
- temperature: 0.7 (default)
- When no tools (force_text): use plain complete() instead of complete_with_tools() with an empty tools array — matches the existing no-tools fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): persist conversation context across messages

The engine was creating a fresh ThreadManager and InMemoryStore per message, losing all context between turns. A follow-up question like "what are the latest 10 issues?" had no memory of the prior "how many issues" response.
Fixes:
- EngineState (ThreadManager, ConversationManager, InMemoryStore) now persists across messages via OnceLock, initialized on first use
- ConversationManager builds message history from prior conversation entries (user messages + agent responses) and passes it to new threads
- ThreadManager.spawn_thread_with_history() accepts initial_messages that are prepended before the current user message
- System notifications (thread started/completed) are filtered out of the history (not useful as LLM context)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): enable CodeAct/RLM mode with code block detection

The engine now operates in CodeAct/RLM mode:

System prompt (executor/prompt.rs):
- Instructs the LLM to write Python in ```repl fenced blocks
- Documents available tools as callable Python functions
- Documents llm_query(), llm_query_batched(), FINAL()
- Documents context variables (context, goal, step_number, previous_results)
- Strategy guidance: examine context, break into steps, use tools, call FINAL()

Code block detection (bridge/llm_adapter.rs):
- extract_code_block() scans LLM text responses for ```repl or ```python blocks
- When detected, returns LlmResponse::Code instead of LlmResponse::Text
- The ExecutionLoop routes Code responses through Monty for execution

No structured tool definitions sent to the LLM:
- Tools are described in the system prompt as Python functions
- The LLM call sends an empty actions array, forcing text-mode responses
- This ensures the LLM writes code blocks (CodeAct) instead of structured tool calls (which would bypass the REPL)

85 tests passing, zero clippy warnings.
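The extraction order can be sketched standalone. This is a simplified illustration, not the real extract_code_block in bridge/llm_adapter.rs; it also folds in the multi-block concatenation behavior added in a later fix:

```rust
// Collect every fenced block for one marker. Requiring the fence line to be
// exactly the marker means ```json fences are naturally skipped by the bare
// ``` pass, and unclosed or empty blocks yield nothing.
fn blocks_for_marker(text: &str, marker: &str) -> Vec<String> {
    let open = format!("{marker}\n");
    let mut blocks = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(&open) {
        let body_and_rest = &rest[start + open.len()..];
        match body_and_rest.find("```") {
            Some(end) => {
                let body = body_and_rest[..end].trim();
                if !body.is_empty() {
                    blocks.push(body.to_string());
                }
                rest = &body_and_rest[end + 3..];
            }
            None => break, // unclosed fence: ignore
        }
    }
    blocks
}

// Try markers in priority order; concatenate all blocks for the winner.
fn extract_code_block(text: &str) -> Option<String> {
    for marker in ["```repl", "```python", "```py", "```"] {
        let blocks = blocks_for_marker(text, marker);
        if !blocks.is_empty() {
            return Some(blocks.join("\n\n"));
        }
    }
    None
}

fn main() {
    assert_eq!(extract_code_block("no fences here"), None);
    assert_eq!(extract_code_block("```repl\nx = 1\n```"), Some("x = 1".to_string()));
    assert_eq!(extract_code_block("```json\n{}\n```"), None);
}
```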
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add 8 CodeAct/RLM E2E tests with mock LLM

Comprehensive test coverage for the Monty Python execution path:
- codeact_simple_final: Python code calls FINAL('answer') → thread completes
- codeact_tool_call_then_final: code calls test_tool() → FunctionCall suspends VM → MockEffects returns result → code resumes → FINAL()
- codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15') with no tool calls — pure Python in Monty
- codeact_multi_step: first step prints output (no FINAL), second step sees output metadata and calls FINAL — tests iterative REPL flow
- codeact_error_recovery: first step has NameError → error flows to LLM as stdout → second step recovers with FINAL — tests error transparency
- codeact_context_variables_available: code accesses `goal` and `context` variables injected by the RLM context builder
- codeact_multiple_tool_calls_in_loop: for loop calls test_tool() 3 times → 3 FunctionCall suspensions → all results collected → FINAL
- codeact_llm_query_recursive: code calls llm_query('prompt') → VM suspends → MockLlm provides sub-agent response → result returned as Python string variable

93 tests passing (85 prior + 8 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): detect code blocks in plain completion path + multi-block support

Two bugs fixed:
1. The no-tools completion path (used by CodeAct since we send empty actions) returned LlmResponse::Text without checking for code blocks. Code blocks were rendered as markdown text instead of being executed.
2. extract_code_block now:
   - Handles bare ``` fences (skips non-Python languages)
   - Collects ALL code blocks in the response and concatenates them (models often split code across multiple blocks with explanation)
   - Tries markers in order: ```repl, ```python, ```py, then bare ```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(bridge): add 11 regression tests for code block extraction

Covers the exact failure modes discovered during live testing:
- extract_repl_block: standard ```repl fenced block
- extract_python_block: ```python marker
- extract_py_block: ```py shorthand
- extract_bare_backtick_block: bare ``` with Python content
- skip_non_python_language: ```json should NOT be extracted
- no_code_blocks_returns_none: plain text, no fences
- multiple_code_blocks_concatenated: two ```repl blocks with explanation between them → concatenated with \n\n
- mixed_thinking_and_code: model outputs explanation + two ```python blocks (the Hyperliquid case) → both extracted
- repl_preferred_over_bare: ```repl takes priority over bare ```
- empty_code_block_skipped: empty fenced block returns None
- unclosed_block_returns_none: no closing ``` returns None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): detect FINAL() in text responses + regression tests

Models sometimes write FINAL() outside code blocks — as plain text after an explanation. The Hyperliquid case: the model outputs a long analysis then FINAL("""...""") at the end, not inside ```repl fences.
Fixes:
- extract_final_from_text(): regex-based FINAL detection in text responses, matching the official RLM's find_final_answer() fallback
- Handles: double-quoted, single-quoted, triple-quoted, unquoted, nested parens
- Checked in the LlmResponse::Text handler BEFORE the tool intent nudge (FINAL takes priority)

9 new tests:
- codeact_final_in_text_response: FINAL("answer") in plain text
- codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text
- final_double_quoted, final_single_quoted, final_triple_quoted, final_unquoted, final_with_nested_parens, final_after_long_text, no_final_returns_none

102 tests passing (93 + 9 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add crate extraction & cleanup roadmap

Documents architectural recommendations from the engine v2 design process for future reference:
- Root directory consolidation (channels-src + tools-src → extensions/)
- Crate extraction tiers: zero-coupling (estimation, observability, tunnel), trivial-coupling (document_extraction, pairing, hooks), medium-coupling (secrets, MCP, db, workspace, llm, skills), heavy-coupling (web gateway, agent, extensions)
- src/ module reorganization into logical groups (core, persistence, infra, media, support)
- main.rs/app.rs slimming targets (100/500 lines after migration)
- WASM module candidates (document_extraction) and non-candidates (REPL, web gateway → separate crates instead)
- Priority ordering for extraction work
- Tracks completed items (ironclaw_safety, ironclaw_engine, transcription move)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): live progress status updates via event broadcast

Engine v2 now shows live progress in the CLI (and any channel):
- "Thinking..." when a step starts
- Tool name + success/error when actions execute
- "Processing results..." when a step completes

Implementation:
- ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256)
- ExecutionLoop.emit_event() writes to thread.events AND broadcasts
- ThreadManager.subscribe_events() returns a receiver
- Router uses tokio::select! to listen for events while waiting for thread completion, forwarding them as StatusUpdate to the channel

This replaces the polling approach with zero-latency event streaming. Agent.channels visibility widened to pub(crate) for bridge access.

102 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): include tool results in code step output for LLM context

The LLM was ignoring tool results and answering from training data because the compact output metadata didn't include what tools returned. Tool results lived only as ActionResult messages (role: Tool), which some providers flatten or the model ignores.

Now the code step output includes:
- stdout from Python print() statements
- [tool_name result] with the actual output (truncated to 4K per tool)
- [tool_name error] for failed tools
- [return] for the code's return value
- Total output truncated to 8K chars to prevent context bloat

This ensures the model sees web_search results, API responses, etc. in the next iteration and can reason about them instead of hallucinating.
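The FINAL-in-text detection described above can be illustrated with a standalone sketch. The commit describes a regex-based implementation; this version balances parentheses instead and strips surrounding quotes (including triple quotes), so it is an approximation, not the engine's code:

```rust
// Find the last FINAL( in the text, take everything up to its matching
// closing paren, and strip any quoting around the answer.
fn extract_final_from_text(text: &str) -> Option<String> {
    let start = text.rfind("FINAL(")?;
    let rest = &text[start + "FINAL(".len()..];
    let mut depth = 1usize;
    for (i, c) in rest.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    let inner = rest[..i].trim().trim_matches(|q| q == '"' || q == '\'');
                    return Some(inner.to_string());
                }
            }
            _ => {}
        }
    }
    None // unbalanced parens: no complete FINAL(...) call
}

fn main() {
    assert_eq!(
        extract_final_from_text("long analysis... FINAL(\"done\")"),
        Some("done".to_string())
    );
    assert_eq!(
        extract_final_from_text("FINAL(\"\"\"multi line\"\"\")"),
        Some("multi line".to_string())
    );
    assert_eq!(extract_final_from_text("FINAL(f(x))"), Some("f(x)".to_string()));
    assert_eq!(extract_final_from_text("no final here"), None);
}
```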
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add debug/trace logging for CodeAct execution

Three verbosity levels for debugging the engine:

RUST_LOG=ironclaw_engine=debug:
- LLM call: message count, iteration, force_text
- LLM response: type (text/code/action_calls), token usage
- Code execution: code length, action count, had_error, final_answer
- Text response: length, FINAL() detection

RUST_LOG=ironclaw_engine=trace:
- Full message list sent to the LLM (role, length, first 200 chars each)
- Full code block being executed
- stdout preview (first 500 chars)
- Per-tool results (name, success, first 300 chars of output)
- Text response preview (first 500 chars)

Usage:
ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run
ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): execution trace recording + retrospective analysis

Enable with ENGINE_V2_TRACE=1 to get full execution traces and automatic issue detection after each thread completes.

Trace recording (executor/trace.rs):
- build_trace(): captures full thread state — messages (with full content), events, step count, token usage, detected issues
- write_trace(): writes JSON to engine_trace_{timestamp}.json
- log_trace_summary(): logs summary + issues at info/warn level

Retrospective analyzer detects 8 issue categories:
- thread_failure: thread ended in Failed state
- no_response: no assistant message generated
- tool_error: specific tool failures with error details
- code_error: Python errors (NameError, SyntaxError, etc.) in output
- missing_tool_output: tool results exist but not in system messages
- excessive_steps: >10 steps (may be stuck in a loop)
- no_tools_used: single-step answer without tools (hallucination risk)
- mixed_mode: text responses without code blocks (prompt not followed)

Thread state is now saved to the store after execution completes (for trace access after join_thread).

Usage:
ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run
# After each message: trace JSON + issue log in the terminal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): wire reflection pipeline + trace analysis into thread lifecycle

After every thread completes, ThreadManager now automatically runs:

1. Retrospective trace analysis (non-LLM, always):
   - Detects 8 issue categories (tool errors, code errors, missing outputs, excessive steps, hallucination risk, etc.)
   - Logs issues at warn level when found
2. Trace file recording (when ENGINE_V2_TRACE=1):
   - Writes full JSON trace to engine_trace_{timestamp}.json
3. LLM reflection (when enable_reflection=true):
   - Calls the reflection pipeline to produce Summary, Lesson, Issue docs
   - Saves docs to the store for future context retrieval
   - Enabled by default in the bridge router

All three run inside the spawned tokio task after exec.run() completes, before saving the final thread state. No external wiring needed.

Removed duplicate trace recording from the router — it's now handled by ThreadManager automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): convert tool name hyphens to underscores for Python compatibility

Root cause from trace analysis: the LLM writes web_search() (a valid Python identifier) but the tool registry has web-search (with a hyphen). The EffectBridgeAdapter couldn't find the tool → "Tool not found" error → the model fabricated fake data instead.
Fixes: - available_actions(): converts tool names from hyphens to underscores (web-search → web_search) so the system prompt lists valid Python names - execute_action(): tries the original name first, then falls back to hyphenated form (web_search → web-search) for tool registry lookup - Same conversion in router's capability registry builder Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): parse JSON tool output to prevent double-serialization From trace analysis: web_search returned a JSON string, which was wrapped as serde_json::json!(string) creating a Value::String containing JSON. When Monty got this as MontyObject::String, the Python code couldn't index it with result['title'] → TypeError. Fix: try parsing the tool output string as JSON first. If valid, use the parsed Value (becomes a Python dict/list). If not valid JSON, keep as string. This means web_search results are directly indexable in Python: results = web_search(query="...") print(results["results"][0]["title"]) # works now Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): persist variables across code steps via `state` dict Monty creates a fresh runtime per code step, so variables are lost between steps. This caused the model to re-paste tool results from system messages, wasting tokens. Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that accumulates across steps: - Tool results stored by tool name: state["web_search"] = {results...} - Return values stored: state["last_return"], state["step_0_return"] - Injected as a `state` Python variable in each new MontyRun Now the model can do: Step 1: results = web_search(query="...") # tool result saved in state Step 2: data = state["web_search"] # access previous result summary = llm_query("summarize", str(data)) FINAL(summary) System prompt updated to document the `state` variable. 
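The cross-step `state` mechanism described above can be sketched in std-only Rust. This is illustrative: the struct and method names are invented for the sketch, and the real engine accumulates JSON values (not plain strings) before injecting them as a Python `state` variable.

```rust
use std::collections::HashMap;

/// Minimal sketch of cross-step state: tool results and return values
/// accumulate in a map that is re-injected into each fresh Monty runtime.
/// (Strings stand in for the JSON values the real engine stores.)
#[derive(Default)]
struct PersistedState {
    entries: HashMap<String, String>,
}

impl PersistedState {
    /// Store a tool result under the tool's name: state["web_search"] = {...}
    fn record_tool(&mut self, tool: &str, output: &str) {
        self.entries.insert(tool.to_string(), output.to_string());
    }

    /// Store a step's return value under both a per-step key and last_return.
    fn record_return(&mut self, step: usize, value: &str) {
        self.entries
            .insert(format!("step_{step}_return"), value.to_string());
        self.entries
            .insert("last_return".to_string(), value.to_string());
    }
}

fn main() {
    let mut state = PersistedState::default();
    state.record_tool("web_search", "{\"results\": []}");
    state.record_return(0, "done");

    // Step 2 can now read what step 1 produced.
    assert!(state.entries.contains_key("web_search"));
    assert_eq!(state.entries["last_return"], "done");
}
```

The key design point is that persistence lives outside the per-step VM, so a fresh runtime per code block costs nothing in lost data.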
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): add state hint on code errors + retrieval engine integration When code fails with NameError/UnboundLocalError (model trying to access variables from a previous step), the error output now includes: [HINT] Variables don't persist between code blocks. Use the `state` dict to access data from previous steps. Available keys: ["web_search", "last_return"] This teaches the model to use `state["web_search"]` instead of `result` after a NameError, reducing wasted steps from 3-4 to 1. Also integrates RetrievalEngine into context building and ThreadManager: - build_step_context() now accepts optional RetrievalEngine to inject relevant memory docs (Lessons, Specs, Playbooks) into LLM context - RetrievalEngine uses keyword matching with doc-type priority scoring - Memory docs from reflection (Phase 4) now feed back into future threads Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: remove trace files and add to .gitignore Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): replace web_fetch example with web_search in CodeAct prompt The system prompt example used web_fetch(url="...") which doesn't exist as a tool. The model learned from the example and tried web_fetch, getting "Tool not found". Changed to web_search(query="...") which is an actual registered tool. Found via trace analysis — reflection pipeline correctly identified this as a "Tool Name Correction" spec doc. 
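Several fixes above come down to normalizing tool names between the registry (hyphens) and Python identifiers (underscores). A minimal sketch, where the `HashMap` registry and the `resolve` helper are stand-ins for the real tool registry lookup:

```rust
use std::collections::HashMap;

/// Convert a registry tool name into a valid Python identifier for the
/// CodeAct system prompt (web-search -> web_search).
fn python_name(tool: &str) -> String {
    tool.replace('-', "_")
}

/// Look up a tool: try the name as given first, then fall back to the
/// hyphenated registry form (web_search -> web-search).
fn resolve<'a>(registry: &'a HashMap<String, &'a str>, name: &str) -> Option<&'a str> {
    registry
        .get(name)
        .or_else(|| registry.get(&name.replace('_', "-")))
        .copied()
}

fn main() {
    let mut registry = HashMap::new();
    registry.insert("web-search".to_string(), "search tool");

    // The prompt advertises the Python-safe name...
    assert_eq!(python_name("web-search"), "web_search");
    // ...and the LLM's call to web_search() still finds web-search.
    assert_eq!(resolve(&registry, "web_search"), Some("search tool"));
    assert_eq!(resolve(&registry, "missing_tool"), None);
}
```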
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(engine): extract prompt templates to markdown files Prompt templates moved from inline Rust strings to plain markdown files at crates/ironclaw_engine/prompts/ for easy inspection and iteration: - prompts/codeact_preamble.md — main instructions, special functions, context variables, rules - prompts/codeact_postamble.md — strategy section Loaded at compile time via include_str!(), so no runtime file I/O. Edit the .md files and rebuild to iterate on prompts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): replace byte-index slicing with char-safe truncation Panic: 'byte index 80 is not a char boundary; it is inside ''' when tool output contained multi-byte UTF-8 characters (smart quotes from web search results). Fixed 4 unsafe byte-index slices: - thread.rs:281: message preview &content[..80] → chars().take(80) - loop_engine.rs:556: tool output &str[..4000] → chars().take(4000) - loop_engine.rs:579: output tail &str[len-8000..] → chars().skip() - scripting.rs:82: stdout tail &str[len-N..] → chars().skip() All now use .chars().take() or .chars().skip() which respect character boundaries. Follows CLAUDE.md rule: "Never use byte-index slicing on user-supplied or external strings." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): fix false positive missing_tool_output warning in trace analyzer The check was looking for "[" + "result]" in System-role messages only, but tool output metadata is added with patterns like "[shell result]" and may appear in messages with any role. Changed to scan all messages for " result]" or " error]" patterns regardless of role. 
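The char-safe truncation pattern from the byte-index slicing fix above can be shown in isolation (helper names are illustrative, not the actual function names in the codebase):

```rust
/// Keep at most `n` characters from the front. Unlike &s[..n], this never
/// panics when a multi-byte UTF-8 char (e.g. a smart quote) straddles the cut.
fn take_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// Keep the last `n` characters (for output tails).
fn tail_chars(s: &str, n: usize) -> String {
    let len = s.chars().count();
    s.chars().skip(len.saturating_sub(n)).collect()
}

fn main() {
    let smart = "a\u{2019}b\u{2019}c"; // each smart quote is 3 bytes
    assert_eq!(take_chars(smart, 2), "a\u{2019}");
    assert_eq!(tail_chars(smart, 2), "\u{2019}c");
    // By contrast, &smart[..2] would panic: byte 2 falls inside the
    // first smart quote, exactly the crash described in the commit.
}
```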
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): update architecture plan with Phase 6 status and approval flow design Phase 6 updated to reflect what was actually built: - Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done - Integration touchpoint (4 lines in handle_message) — done - Live progress via broadcast events — done - Conversation persistence across messages — done - Trace recording + retrospective analysis — done - 8 bugs found and fixed via trace analysis — documented Phase 6 remaining work documented: - Approval flow: detailed 5-step design (send to channel, pause thread, route response, resume execution, always handling) with v1 reference - Database persistence (InMemoryStore → real DB tables) - Acceptance testing (TestRig + TraceLlm fixtures) - Two-phase commit for high-stakes effects Progress table updated: Phase 6 marked as DONE (partial), 134 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add self-improving engine design plan Designs a system where the engine debugs and improves itself, based on the pattern observed in the last session: 5 consecutive bug fixes all followed trace → read → identify → edit → test, using tools the engine already has access to. Three levels of self-improvement: - Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply. - Level 2 (Config): adjust defaults/mappings. Branch + test + PR. - Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR. Architecture: Self-improvement Mission spawns a Reflection thread that reads traces, reads source, proposes fixes, validates via cargo test, and either auto-applies (Level 1) or creates a PR (Level 2-3). Includes: fix pattern database (seeded from our 8 debugging session fixes), feedback loop diagram, safety model, implementation phases (A through D), and what exists vs what's new. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add engine v2 security model and audit Comprehensive security analysis of engine v2 covering: Threat model: 4 attacker profiles (malicious input, prompt injection via tools, poisoned memory, supply chain). Current state audit: 9 controls working (Monty sandbox, safety layer, policy engine, leases, provenance, events) and 9 gaps identified. Critical finding: ALL tools granted by default — CodeAct code can call shell, write_file, apply_patch without approval. Proposed fix: 3-tier tool classification (auto/approve-once/always-approve). CodeAct-specific threats: tool call amplification, prompt injection via search results, data exfiltration via tool chains, Monty escape. Self-improvement security: poisoned trace attacks, memory poisoning via reflection. Mitigations: edit validation, frequency caps, audit trail, auto-rollback, reflection output scanning. 6-layer security architecture proposed: input validation, capability gating, output sanitization, execution sandboxing, self-improvement controls, observability. Prioritized implementation plan with severity/effort ratings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(security): cross-reference v1 controls — use, don't reinvent Updated security plan with detailed audit of ALL existing v1 security controls and how they map to engine v2 bridge gaps: Key finding: v1 already has solutions for every security gap identified. 
The bridge just needs to wire them in: - Tool::requires_approval() exists but bridge doesn't call it - safety.wrap_for_llm() exists but tool results enter context unwrapped - RateLimiter exists but bridge doesn't check rate limits - BeforeToolCall hooks exist but bridge doesn't run them - redact_params() exists but bridge doesn't redact sensitive params - Shell risk classification (Low/Medium/High) is inherited but ignored Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter, not new security infrastructure. The bridge is the security boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy - Add Mission type and MissionManager for recurring thread scheduling - Add ReliabilityTracker for per-capability success/failure/latency tracking - Add reflection executor that spawns CodeAct threads for post-completion reflection - Extend PolicyEngine with provenance-aware taint checking (LLM-generated data requires approval for financial/external-write effects) - Extend Store trait with mission CRUD methods - Add conversation surface tracking, compaction token fix, context memory injection - Wire new modules through lib.rs re-exports and bridge adapters Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): wire v1 security controls into engine v2 adapter Zero engine crate changes. All security controls enforced at the bridge boundary in EffectBridgeAdapter: 1. Tool approval (v1: Tool::requires_approval): - Checks each tool's approval requirement with actual params - Always → returns EngineError::LeaseDenied (blocks execution) - UnlessAutoApproved → checks auto_approved set, blocks if not approved - Never → proceeds - Per-session auto_approved HashSet (for future "always" handling) 2. 
Hook interception (v1: BeforeToolCall):
- Runs HookEvent::ToolCall before every execution
- HookOutcome::Reject → blocks with reason
- HookError::Rejected → blocks with reason
- Hook errors → fail-open (logged, execution continues)

3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm):
- Leak detection: API keys in tool output are redacted
- Policy enforcement: content policy rules applied
- Length truncation: output capped at 100KB
- XML boundary protection: prevents injection via tool output

4. Sensitive param redaction (v1: redact_params):
- Tool's sensitive_params() consulted before hooks see parameters
- Redacted params sent to hooks, original params used for execution

5. available_actions() now sets requires_approval based on each tool's default approval requirement, so the engine's PolicyEngine can gate tools it hasn't seen before.

6. Actual execution timing measured via Instant::now() (replaces the placeholder Duration::from_millis(1)).

Accessor visibility: hooks() widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): implement tool approval flow for engine v2

Adds a complete approval flow that mirrors v1 behavior, using the existing v1 security controls (Tool::requires_approval, auto-approve sets, StatusUpdate::ApprovalNeeded).

## How it works

### Step 1: Tool blocked at execution

When the LLM's code calls a tool (e.g., `shell("ls")`):
1. EffectBridgeAdapter.execute_action() looks up the Tool object
2. Calls tool.requires_approval(&params) — returns ApprovalRequirement
3. If Always → EngineError::LeaseDenied (always blocks)
4. If UnlessAutoApproved → checks the auto_approved HashSet → if not in the set, returns EngineError::LeaseDenied
5. If Never → proceeds to execution

### Step 2: Engine returns NeedApproval

The LeaseDenied error propagates through:
- CodeAct path: becomes a Python RuntimeError, code halts, thread returns NeedApproval with action_name + parameters
- Structured path: same via ActionResult.is_error

### Step 3: Router stores pending approval

- PendingApproval { action_name, original_content } stored on EngineState
- StatusUpdate::ApprovalNeeded sent to the channel (shows an approval card in CLI/web with tool name, parameters, yes/always/no buttons)
- Returns text: "Tool 'shell' requires approval. Reply yes/always/no."

### Step 4: User responds

handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2:
- 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes the original message (the tool now passes the approval check on the second run)
- 'always' → same + logs for session persistence
- 'no' → returns "Denied: tool was not executed."

### Key design choice

Instead of pausing/resuming mid-execution (which needs engine changes to freeze/restore the Monty VM state), we auto-approve the tool and re-run the full message. The EffectBridgeAdapter's auto_approved set persists across runs, so the second execution passes immediately. This trades one extra LLM call for zero engine modifications.

## Files changed

- src/bridge/router.rs: PendingApproval struct, handle_approval(), NeedApproval → StatusUpdate::ApprovalNeeded conversion
- src/bridge/mod.rs: export handle_approval
- src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2
- src/bridge/effect_adapter.rs: fmt fixes

151 tests passing, clippy + fmt clean.
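Step 1 of the approval flow can be sketched as a pure decision function. The enum shapes below are simplified stand-ins for the types the commits name (ApprovalRequirement, the per-session auto_approved set); the real check lives inside EffectBridgeAdapter.execute_action():

```rust
use std::collections::HashSet;

/// Simplified stand-in for the v1 ApprovalRequirement variants.
enum ApprovalRequirement {
    Always,
    UnlessAutoApproved,
    Never,
}

#[derive(Debug, PartialEq)]
enum Gate {
    Blocked, // surfaces as EngineError::LeaseDenied in the bridge
    Proceed,
}

/// Decide whether a tool call may execute: Always blocks unconditionally,
/// UnlessAutoApproved consults the per-session set, Never always proceeds.
fn gate(req: ApprovalRequirement, tool: &str, auto_approved: &HashSet<String>) -> Gate {
    match req {
        ApprovalRequirement::Always => Gate::Blocked,
        ApprovalRequirement::UnlessAutoApproved if auto_approved.contains(tool) => Gate::Proceed,
        ApprovalRequirement::UnlessAutoApproved => Gate::Blocked,
        ApprovalRequirement::Never => Gate::Proceed,
    }
}

fn main() {
    let mut approved = HashSet::new();
    // First run: shell is blocked, the user sees the approval card.
    assert_eq!(
        gate(ApprovalRequirement::UnlessAutoApproved, "shell", &approved),
        Gate::Blocked
    );
    // User replied "yes": the tool is auto-approved and the message re-runs.
    approved.insert("shell".to_string());
    assert_eq!(
        gate(ApprovalRequirement::UnlessAutoApproved, "shell", &approved),
        Gate::Proceed
    );
    // Always-approval tools stay blocked regardless of the set.
    assert_eq!(gate(ApprovalRequirement::Always, "shell", &approved), Gate::Blocked);
}
```

The "block, approve, re-run" shape is what lets the flow work with zero engine changes: the second run hits the same gate and passes.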
Fix: all logging in trace.rs changed from info!/warn! to debug!/warn!. Trace analysis and reflection results now only show when RUST_LOG=ironclaw_engine=debug is set. Also added logging discipline rule to global CLAUDE.md: - info! → user-facing status the REPL intentionally renders - debug! → internal diagnostics (traces, reflection, engine internals) - Background tasks must NEVER use info! — it breaks the TUI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): demote all router info! logging to debug! "engine v2: initializing" and "engine v2: handling message" were printing at INFO level, corrupting the REPL UI. All router logging now uses debug! — only visible with RUST_LOG=ironclaw=debug. Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(safety): demote leak detector warn-action logs from warn! to debug! The leak detector's Warn-action matches (high_entropy_hex pattern on web search results containing commit SHAs, CSS colors, URL hashes) were logging at warn! level, corrupting the REPL UI with lines like: WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5 These are informational false positives — real leaks use LeakAction::Redact which silently modifies the content. Warn-action matches only log for debugging purposes and should not appear in production output. Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): strengthen CodeAct prompt to prevent shallow text answers The model was answering "Suggested 45 improvements" as a brief text summary from training data without actually searching or listing them. The trace showed: no code block, no tool calls, no FINAL(). Prompt changes: - Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with plain text only." (was: "Always write code... 
plain text for brief explanations") - Rule 2 (NEW): "NEVER answer from memory or training data alone. Always use tools to get real, current information before answering." - Rule 3: FINAL answer "should be detailed and complete — not just a summary like 'found 45 items'" - Rule 8 (NEW): "Include the actual content in your FINAL() answer, not just a count or summary. Users want to see the details." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): persist reflection docs to workspace for cross-session learning Replaces InMemoryStore with HybridStore: - Ephemeral data (threads, steps, events, leases) stays in-memory - MemoryDocs (lessons, specs, playbooks from reflection) persist to the workspace at engine/docs/{type}/{id}.json On engine init, load_docs_from_workspace() reads existing docs back into the in-memory cache. This means: - Lessons learned in session 1 are available in session 2 - The RetrievalEngine injects relevant past lessons into new threads - The engine genuinely improves over time as reflection accumulates Workspace paths: engine/docs/lessons/{uuid}.json engine/docs/specs/{uuid}.json engine/docs/playbooks/{uuid}.json engine/docs/summaries/{uuid}.json engine/docs/issues/{uuid}.json No new database tables. Uses existing workspace write/read/list. workspace() accessor widened to pub(crate). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): adapt to execute_tool_with_safety params-by-value change Staging merge changed execute_tool_with_safety to take params by value instead of by reference (perf optimization from PR #926). Updated bridge adapter to clone params before passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): add web gateway integration plan to Phase 6 Documents three gaps between engine v2 and the web gateway: 1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent) 2. 
No conversation persistence (engine uses HybridStore, gateway reads v1 DB) 3. No cross-channel visibility (REPL ↔ web messages invisible to each other) Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1 conversation tables after thread completion. Prerequisite: AppEvent extraction PR (in progress separately). Also updated DB persistence status: HybridStore with workspace-backed MemoryDocs is now implemented (partial persistence). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): document routine/job gap and SIGKILL crash scenario Routines are entirely v1 — not hooked up to engine v2. When a user asks "create a routine" as natural language, engine v2 tries to call routine_create via CodeAct, but the tool needs RoutineEngine + Database refs that the bridge's minimal JobContext doesn't provide. This caused a SIGKILL crash during testing. Options documented: block routine tools in v2 (short term), pass refs through context (medium), replace with Mission system (long term). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: extract AppEvent to crates/ironclaw_common SseEvent was defined in src/channels/web/types.rs but imported by 12+ modules across agent, orchestrator, worker, tools, and extensions — it had become the application-wide event protocol, not a web transport concern. Create crates/ironclaw_common as a shared workspace crate and move the enum there as AppEvent. Also move the truncate_preview utility which was similarly leaked from the web gateway into agent modules. 
- New crate: crates/ironclaw_common (AppEvent, truncate_preview) - Rename SseEvent → AppEvent, from_sse_event → from_app_event - web/types.rs re-exports AppEvent for internal gateway use - web/util.rs re-exports truncate_preview - Wire format unchanged (serde renames are on variants, not the enum) Aligned with the event bus direction on refactor/architectural-hardening where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB Three changes to make engine v2 visible in the web gateway: 1. SSE event streaming (AppEvent broadcast): - ThreadEvent → AppEvent conversion via thread_event_to_app_event() - Events broadcast to SseManager during the poll loop - Covers: Thinking, ToolCompleted (success/error), Status, Response - Web gateway receives real-time progress without any gateway changes 2. Conversation persistence to v1 database: - After thread completes, writes user message + agent response to v1 ConversationStore via add_conversation_message() - Uses get_or_create_assistant_conversation() for per-user per-channel - Web gateway reads from DB as usual — chat history appears 3. Final response broadcast: - AppEvent::Response with full text + thread_id sent via SSE - Web gateway renders the response in the chat UI New EngineState fields: sse (Option<Arc<SseManager>>), db (Option<Arc<dyn Database>>). Both populated from Agent.deps. Agent.deps visibility widened to pub(crate). Depends on: ironclaw_common crate with AppEvent type (PR #1615). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits Three security/stability improvements in EffectBridgeAdapter: 1. 
V1-only tool blocking:
- routine_create, create_job, build_software (and hyphenated variants) return a helpful error: "use the slash command instead"
- Filtered out of available_actions() so the system prompt doesn't list them
- Prevents a crash from tools needing RoutineEngine/Scheduler refs

2. Per-step tool call limit:
- Max 50 tool calls per code block (AtomicU32 counter)
- Prevents amplification: `for i in range(10000): shell(...)`
- Returns "call limit reached, break into multiple steps"

3. Rate limiting:
- Per-user per-tool sliding window via RateLimiter
- Checks tool.rate_limit_config() before every execution
- Returns "rate limited, try again in Ns"

Architecture plan updated:
- Gateway integration: DONE
- Routines: BLOCKED (gracefully, with slash command fallback)
- Rate limiting: DONE
- Call limit: DONE
- Phase 6 status: DONE (remaining: acceptance tests, two-phase commit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Mission system design — goal-oriented autonomous threads

Missions replace routines with evolving, knowledge-accumulating autonomous agents. Unlike routines (fixed prompt, stateless), Missions:
- Generate prompts from accumulated Project knowledge (lessons, playbooks, issues from prior threads)
- Adapt their approach when something fails repeatedly
- Track progress toward a goal with success criteria
- Self-manage: pause when stuck, complete when the goal is achieved

Architecture: MissionManager with a cron ticker spawns threads via ThreadManager. The meta-prompt is built from the mission goal + Project MemoryDocs via RetrievalEngine. Reflection feeds back automatically.

6-step implementation plan: cron trigger, meta-prompt builder, bridge wiring, CodeAct tools, progress tracking, persistence. Includes two worked examples: daily tech news briefing (ongoing) and test coverage improvement (goal-driven, self-completing).
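The per-step call limit described above is a one-struct mechanism. A sketch, with the constant and method names invented for illustration:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const MAX_CALLS_PER_STEP: u32 = 50;

/// Per-step tool-call counter. fetch_add returns the previous value, so
/// the 51st call observes 50 and is rejected. reset() runs before each
/// thread execution so the budget applies per code block, not per session.
struct CallLimiter {
    calls: AtomicU32,
}

impl CallLimiter {
    fn new() -> Self {
        Self { calls: AtomicU32::new(0) }
    }

    /// Returns true if this call is within budget.
    fn try_call(&self) -> bool {
        self.calls.fetch_add(1, Ordering::SeqCst) < MAX_CALLS_PER_STEP
    }

    fn reset(&self) {
        self.calls.store(0, Ordering::SeqCst);
    }
}

fn main() {
    let limiter = CallLimiter::new();
    // An amplification loop of 60 attempted calls is cut off at the cap.
    let allowed = (0..60).filter(|_| limiter.try_call()).count();
    assert_eq!(allowed, 50);
    // A fresh thread starts with a fresh budget.
    limiter.reset();
    assert!(limiter.try_call());
}
```

Using an atomic rather than a mutex-guarded counter keeps the check lock-free on the hot path of every tool invocation.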
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): extend Mission types with webhook/event triggers + evolving strategy Mission types updated to support external activation sources: MissionCadence expanded: - Cron { expression, timezone } — timezone-aware scheduling - OnEvent { event_pattern } — channel message pattern matching - OnSystemEvent { source, event_type } — structured events from tools - Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.) - Manual — explicit triggering only The engine defines trigger TYPES. The bridge implements infrastructure (cron ticker, webhook endpoints, event matchers). GitHub issues, PRs, email, Slack events all use the generic Webhook cadence — no special-casing in the engine. Webhook payload injected as state["trigger_payload"] in the thread's Python context. Mission struct extended: - current_focus: what the next thread should work on (evolving) - approach_history: what we've tried (for adaptation) - max_threads_per_day / threads_today: daily budget - last_trigger_payload: webhook/event data for thread context Plan updated with trigger type table and webhook integration design. 
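The trigger types above, together with the cadence-string parsing the bridge performs ("manual", cron expressions, "event:pattern", "webhook:path"), can be sketched as follows. The OnSystemEvent variant and the secret/timezone fields are omitted for brevity, so this is a reduced shape of the real MissionCadence:

```rust
/// Reduced sketch of the MissionCadence trigger types.
#[derive(Debug, PartialEq)]
enum MissionCadence {
    Cron { expression: String },
    OnEvent { event_pattern: String },
    Webhook { path: String },
    Manual,
}

/// Sketch of the bridge's cadence parsing: recognized prefixes first,
/// anything else treated as a cron expression.
fn parse_cadence(s: &str) -> MissionCadence {
    if s == "manual" {
        MissionCadence::Manual
    } else if let Some(pattern) = s.strip_prefix("event:") {
        MissionCadence::OnEvent { event_pattern: pattern.to_string() }
    } else if let Some(path) = s.strip_prefix("webhook:") {
        MissionCadence::Webhook { path: path.to_string() }
    } else {
        MissionCadence::Cron { expression: s.to_string() }
    }
}

fn main() {
    assert_eq!(parse_cadence("manual"), MissionCadence::Manual);
    assert_eq!(
        parse_cadence("webhook:github"),
        MissionCadence::Webhook { path: "github".to_string() }
    );
    assert_eq!(
        parse_cadence("0 9 * * *"),
        MissionCadence::Cron { expression: "0 9 * * *".to_string() }
    );
}
```

Keeping the trigger vocabulary generic (a webhook path rather than a GitHub-specific variant) is what lets GitHub, email, and Slack all reuse the same cadence without engine special-casing.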
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): implement MissionManager execution with meta-prompts

The MissionManager now builds evolving meta-prompts and processes thread outcomes for continuous learning.

fire_mission() upgraded:
- Loads Project MemoryDocs via RetrievalEngine for context
- Builds the meta-prompt from: goal, current_focus, approach_history, project knowledge docs, trigger payload, thread count
- Spawns a thread with the meta-prompt as the user message
- A background task waits for completion and processes the outcome
- Daily thread budget enforcement (max_threads_per_day)

Meta-prompt structure:

```
# Mission: {name}
Goal: {goal}
## Current Focus (evolves between threads)
## Previous Approaches (what we've tried)
## Knowledge from Prior Threads (lessons, playbooks, issues)
## Trigger Payload (webhook/event data if applicable)
## Instructions (accomplish step, report next focus, check goal)
```

Outcome processing:
- Extracts "next focus:" from the FINAL() response → updates current_focus
- Detects "goal achieved: yes" → completes the mission
- Records the accomplishment in approach_history
- Failed threads recorded as "FAILED: {error}"

Cron ticker:
- start_cron_ticker() spawns a tokio task, ticks every 60s
- Checks active Cron missions, fires those past next_fire_at

151 tests passing.
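The meta-prompt assembly is plain string building. A hedged sketch under the structure described above; the function signature is invented, and the real builder also injects retrieved knowledge docs and trigger payloads:

```rust
/// Illustrative meta-prompt builder: mission identity, evolving focus,
/// approach history, and closing instructions, in the order the
/// Mission design describes.
fn build_meta_prompt(name: &str, goal: &str, focus: &str, history: &[String]) -> String {
    let mut prompt = format!("# Mission: {name}\nGoal: {goal}\n\n## Current Focus\n{focus}\n");
    prompt.push_str("\n## Previous Approaches\n");
    for attempt in history {
        prompt.push_str(&format!("- {attempt}\n"));
    }
    prompt.push_str(
        "\n## Instructions\nAccomplish the next step, report next focus, check the goal.\n",
    );
    prompt
}

fn main() {
    let prompt = build_meta_prompt(
        "Tech News",
        "Daily AI/crypto/software news briefing",
        "summarize today's headlines",
        &["FAILED: rate limited".to_string()],
    );
    assert!(prompt.starts_with("# Mission: Tech News"));
    assert!(prompt.contains("## Current Focus"));
    // Failed attempts carry forward so the next thread adapts.
    assert!(prompt.contains("- FAILED: rate limited"));
}
```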
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire MissionManager into engine v2 for CodeAct access

Missions are now callable from CodeAct Python code:

```python
# Create a daily briefing mission
result = mission_create(
    name="Tech News",
    goal="Daily AI/crypto/software news briefing",
    cadence="0 9 * * *"
)

# List all missions
missions = mission_list()

# Manually fire a mission
mission_fire(id="...")

# Pause/resume
mission_pause(id="...")
mission_resume(id="...")
```

Implementation:
- MissionManager created on engine init, cron ticker started
- EffectBridgeAdapter intercepts mission_* function calls before tool lookup and routes them to the MissionManager
- parse_cadence() handles: "manual", cron expressions, "event:pattern", "webhook:path"
- Mission functions documented in the CodeAct system prompt
- MissionManager set on the adapter via set_mission_manager() after init (avoids a circular dependency)

System prompt updated with mission_create, mission_list, mission_fire, mission_pause, and mission_resume documentation. 151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): map routine_* calls to mission operations in v2

When the model calls routine_create, routine_list, routine_fire, routine_pause, routine_resume, or routine_delete, the bridge now routes them to the MissionManager instead of blocking with an error.

Mapping:
- routine_create → mission_create (with cadence parsing)
- routine_list → mission_list
- routine_fire → mission_fire
- routine_pause → mission_pause
- routine_resume → mission_resume
- routine_update → mission_pause/resume (based on params)
- routine_delete → mission_complete (marks as done)

Routine tools removed from the v1-only blocklist and restored in available_actions(). The model can use either "routine" or "mission" vocabulary — both work.

Still blocked: create_job, cancel_job, build_software (need v1 Scheduler/ContainerJobManager refs).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(engine): add E2E mission flow tests — 7 new tests Comprehensive mission lifecycle tests: - fire_mission_builds_meta_prompt_with_goal: verifies thread spawned with project context and recorded in history - outcome_processing_extracts_next_focus: "Next focus: X" in FINAL() response → mission.current_focus updated - outcome_processing_detects_goal_achieved: "Goal achieved: yes" → mission status transitions to Completed - mission_evolves_via_direct_outcome_processing: 3-step evolution: step 1 sets focus to "db module", step 2 evolves to "tools module", step 3 detects goal achieved → mission completes. Tests the full learning loop without background task timing dependencies. - fire_with_trigger_payload: webhook payload stored on mission and threads_today counter incremented - daily_budget_enforced: max_threads_per_day=1 → first fire succeeds, second returns None 157 tests passing (151 prior + 6 new mission E2E). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): self-improving engine via Mission system Wire the self-improvement loop as a Mission with OnSystemEvent cadence, inspired by karpathy/autoresearch's program.md approach. The mission fires when threads complete with issues, receives trace data as trigger payload, and uses tools directly to diagnose and fix problems. 
Key changes: Engine self-improvement (Phase A+B from design doc): - Add fire_on_system_event() to MissionManager for OnSystemEvent cadence - Add start_event_listener() that subscribes to thread events and fires matching missions when non-Mission threads complete with trace issues - Add ensure_self_improvement_mission() with autoresearch-style goal prompt (concrete loop steps, not vague instructions) - Add process_self_improvement_output() for structured JSON fallback - Seed fix pattern database with 8 known patterns from debugging - Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now async + Store-aware, appends learned rules from prompt_overlay docs) - Pass Store to ExecutionLoop for overlay loading Bridge review fixes (P1/P2): - Scope engine v2 SSE events to requesting user (broadcast_for_user) - Per-user pending approvals via HashMap instead of global Option - Reset tool-call limit counter before each thread execution - Only persist auto-approval when user chose "always", not one-off "yes" - Remove dead store/mission_manager fields from EngineState Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add checkpoint-based engine thread recovery * feat(engine): add Python orchestrator module and host functions Add the orchestrator infrastructure for replacing the Rust execution loop with versioned Python code. This commit adds the module and host functions without switching over — the existing Rust loop is unchanged. 
New files: - orchestrator/default.py: v0 Python orchestrator (run_loop + helpers) - executor/orchestrator.rs: host function dispatch, orchestrator loading from Store with version selection, OrchestratorResult parsing Host functions exposed to orchestrator Python via Monty suspension: __llm_complete__, __execute_code_step__ (nested Monty VM), __execute_action__, __check_signals__, __emit_event__, __add_message__, __save_checkpoint__, __transition_to__, __retrieve_docs__, __check_budget__, __get_actions__ Also makes json_to_monty, monty_to_json, monty_to_string pub(crate) in scripting.rs for cross-module use. Design doc: docs/plans/2026-03-25-python-orchestrator.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): switch ExecutionLoop::run() to Python orchestrator Replace the 900-line Rust execution loop with a ~80-line bootstrap that loads and runs the versioned Python orchestrator via Monty VM. The orchestrator Python code (orchestrator/default.py) is the v0 compiled-in version. Runtime versions can override it via MemoryDoc storage (orchestrator:main with tag orchestrator_code). Key fixes during switchover: - Use ExtFunctionResult::NotFound for unknown functions so Monty falls through to Python-defined functions (extract_final, etc.) - Move helper function definitions above run_loop for Monty scoping - Use FINAL result value (not VM return value) in Complete handler - Rename 'final' variable to 'final_answer' to avoid Python keyword Status: 171/177 tests pass. 6 remaining failures are step_count and token tracking bookkeeping — the orchestrator manages these internally but doesn't yet update the thread's counters via host functions. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): all 177 tests pass with Python orchestrator

- Increment step_count and track tokens in __emit_event__("step_completed") so thread bookkeeping matches the old Rust loop behavior
- Remove double-counting of tokens in bootstrap (orchestrator handles it)
- Match nudge text to existing TOOL_INTENT_NUDGE constant
- Fix FINAL result propagation (use stored final_result, not VM return)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): orchestrator versioning, auto-rollback, and tests

Add version lifecycle for the Python orchestrator:
- Failure tracking via MemoryDoc (orchestrator:failures)
- Auto-rollback: after 3 consecutive failures, skip the latest version and fall back to the previous one (or compiled-in v0)
- Success resets the failure counter
- OrchestratorRollback event for observability

Update self-improvement Mission goal with Level 1.5 instructions for orchestrator patches — the agent can now modify the execution loop itself via memory_write with versioned orchestrator docs.

12 new tests: version selection (highest wins), rollback after failures, rollback to default, failure counting/resetting, outcome parsing for all 5 ThreadOutcome variants.

189 tests pass, zero clippy warnings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add engine v2 architecture, self-improvement, and dev history Three new docs for contributors: - engine-v2-architecture.md: Two-layer architecture (Rust kernel + Python orchestrator), five primitives, execution model with nested Monty VMs, bridge layer, memory/reflection, missions, capabilities - self-improvement.md: Three improvement levels (prompt/orchestrator/ config/code), autoresearch-inspired Mission loop, versioned orchestrator with auto-rollback, fix pattern database, safety model - development-history.md: Summary of 6 Claude Code sessions that built the system, key design decisions and debugging moments, architecture evolution from 900-line Rust loop to Python orchestrator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): complete v2 side-by-side integration with gateway API Wire engine v2 into the full submission pipeline and expose threads, projects, and missions through the web gateway REST API. Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear submissions to engine v2 when ENGINE_V2=true. Previously only UserInput and ApprovalResponse were handled; all other control commands fell through to disconnected v1 sessions. Bridge query layer — add 11 read-only query functions and 6 DTO types so gateway handlers can inspect engine state (threads, steps, events, projects, missions) without direct access to the EngineState singleton. Gateway endpoints — new /api/engine/* routes: GET /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events GET /projects, /projects/{id} GET /missions, /missions/{id} POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume SSE events — add ThreadStateChanged, ChildThreadSpawned, and MissionThreadSpawned AppEvent variants. Expand the bridge event mapper to forward StateChanged and ChildSpawned engine events to the browser. 
Engine crate — add ConversationManager::clear_conversation() for /new and /clear commands. Code quality — replace 10 .expect() calls with proper error returns, remove dead AgentConfig.engine_v2 field, log silent init errors, fix duplicate doc comment, improve fallthrough documentation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): empty call_id on ActionResult and trace analyzer false positives Fix structured executor not stamping call_id onto ActionResult — the EffectExecutor trait doesn't receive call_id, so the structured executor must copy it from the original ActionCall after execution. Empty call_id caused OpenAI-compatible providers to reject the next LLM request with "Invalid 'input[2].call_id': empty string". Fix trace analyzer false positives: - code_error check now only scans User-role code output messages (prefixed with [stdout]/[stderr]/[code ]/Traceback), not System prompt which contains example error text - missing_tool_output check now recognizes ActionResult messages as valid tool output (Tier 0 structured path) - Add NotImplementedError to detected code error patterns New trace checks: - empty_call_id: detect ActionResult messages with missing/empty call_id before they reach the LLM API (severity: Error) - llm_error: extract LLM provider errors from Failed state reason - orchestrator_error: extract orchestrator errors from Failed state Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(web): add Missions tab to gateway UI Add a full Missions page to the web gateway with list view, detail view, and action buttons (Fire, Pause, Resume). Backend: add /api/engine/missions/summary endpoint returning counts by status (active/paused/completed/failed). 
Frontend: - New "Missions" tab between Jobs and Routines - Summary cards showing mission counts by status - Table with name, goal, cadence type, thread count, status, actions - Detail view with goal, cadence, current focus, success criteria, approach history, spawned thread list, and action buttons - Fire/Pause/Resume actions with toast notifications - i18n support (English + Chinese) - CSS following the existing routines/jobs patterns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): eagerly initialize engine v2 at startup The gateway API endpoints (/api/engine/missions, etc.) call bridge query functions that return empty results when the engine state hasn't been initialized yet. Previously, initialization only happened lazily on the first chat message via handle_with_engine(). Now when ENGINE_V2=true, the engine is initialized in Agent::run() before channels start, so the self-improvement mission and other engine state is available to gateway API endpoints immediately. Also rename get_or_init_engine → init_engine and make it public so it can be called from agent_loop.rs at startup. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(web): improve mission detail with markdown goal and thread table - Goal rendered as full-width markdown block instead of plain-text meta item (uses existing renderMarkdown/marked) - Current focus and success criteria also rendered as markdown - Spawned threads shown as a clickable table with goal, type, state, steps, tokens, and created date instead of a UUID list - Clicking a thread row opens an inline thread detail view showing metadata grid and full message history with markdown rendering - Back button returns to the mission detail view - Backend: mission detail now returns full thread summaries (goal, state, step_count, tokens) instead of just thread IDs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(web): close SSE connections on page unload to prevent connection starvation The browser limits concurrent HTTP/1.1 connections per origin to 6. Without cleanup, SSE connections from prior page loads linger after refresh/navigation, eating into the pool. After 2-3 refreshes, all 6 slots are consumed by stale SSE streams and new API fetch calls queue indefinitely — the UI shows "connected" (SSE works) but data never loads. Add a beforeunload handler that closes both eventSource (chat events) and logEventSource (log stream) so the browser can reuse connections immediately on page reload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(web): support multiple gateway tabs by reducing SSE connections Each browser tab opened 2 SSE connections (chat events + log events). With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the pool and couldn't load any data. Three changes: 1. Lazy log SSE — only connect when the logs tab is active, disconnect when switching away. Most users rarely view logs, so this saves a connection slot per tab. 2. 
Visibility API — close SSE when the browser tab goes to background (user switches to another tab), reconnect when it becomes visible. Background tabs don't need real-time events. 3. Combined with the existing beforeunload cleanup, this means: - Active foreground tab: 1 connection (chat SSE only, +1 if logs tab) - Background tabs: 0 connections - Closed/refreshed tabs: 0 connections (beforeunload cleanup) This allows many gateway tabs to coexist within the 6-connection limit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): route messages to correct conversation by thread scope Messages sent from a new conversation in the gateway always appeared in the default assistant conversation because handle_with_engine ignored the thread_id from the frontend. Two fixes: 1. Engine conversation scoping — when the message carries a thread_id (from the frontend's conversation picker), use it as part of the engine conversation key: "gateway:<thread_id>" instead of just "gateway". This creates a distinct engine conversation per v1 thread, so messages don't cross-contaminate. 2. V1 dual-write targeting — write user messages and assistant responses to the v1 conversation matching the thread_id (via ensure_conversation), not the hardcoded assistant conversation. Falls back to the assistant conversation when no thread_id is present (e.g., default chat). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(web): richer activity indicators for engine v2 execution The gateway UI showed only generic "Thinking..." during engine v2 execution with no visibility into CodeAct code execution, tool calls, or reflection. Now the event mapping produces detailed status updates: Step lifecycle: - "Calling LLM..." when a step starts (was "…
…architecture) (#1557) * v2 architecture phase 1 * feat(engine): Phase 2 — execution loop, capability system, thread runtime Add the core execution engine to ironclaw_engine crate: - CapabilityRegistry: register/get/list capabilities and actions - LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire) - PolicyEngine: deterministic effect-level allow/deny/approve - ThreadTree: parent-child relationship tracking - ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc - ThreadManager: spawn threads as tokio tasks, stop, inject messages, join - ExecutionLoop: core loop replacing run_agentic_loop() with signals, context building, LLM calls, action execution, and event recording - Structured executor (Tier 0): lease lookup → policy check → effect execution - Tool intent nudge detection - MemoryStore + RetrievalEngine stubs for Phase 4 - Full 8-phase architecture plan in docs/plans/ - CLAUDE.md spec for the engine crate 74 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): Phase 3 — Monty Python executor with RLM pattern Add CodeAct execution (Tier 1) using the Monty embedded Python interpreter, following the Recursive Language Model (RLM) pattern from arXiv:2512.24601. 
Key additions: - executor/scripting.rs: Monty integration with FunctionCall-based tool dispatch, catch_unwind panic safety, resource limits (30s, 64MB, 1M allocs) - LlmResponse::Code variant + ExecutionTier::Scripting - Context-as-variables (RLM 3.4): thread messages, goal, step_number, previous_results injected as Python variables — LLM context stays lean while code accesses data selectively - llm_query(prompt, context) (RLM 3.5): recursive subagent calls from within Python code — results stored as variables, not injected into parent's attention window (symbolic composition) - Compact output metadata between code steps instead of full stdout - MontyObject ↔ serde_json::Value bidirectional conversion - Updated architecture plan with RLM design principles 74 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): RLM best-practices enhancements from cross-reference analysis Cross-referenced our implementation against the official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation. Key enhancements: - FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching all three reference implementations. Code can signal completion at any point, not just via return value. - llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn, matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch. - Output truncation increased to 8000 chars (from 120), matching Prime Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT]. - Step 0 orientation preamble: auto-injects context metadata (message count, total chars, goal, last user message preview) before first code step, matching fast-rlm's auto-print pattern. 
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors, OS errors, and async errors now flow back as stdout content instead of terminating the step, enabling LLM self-correction on next iteration. Only VM panics (catch_unwind) terminate as EngineError. 74 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): update architecture plan with RLM cross-reference learnings Comprehensive update after cross-referencing against official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect (verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM. Changes: - Mark Phases 1-3 as DONE with commit refs and test counts - Add "Key Influences" section documenting all reference implementations - Phase 3: full table of implemented RLM features with sources - Phase 3: "Remaining gaps" table with which phase addresses each - Phase 4: expanded with compaction (85% context), rlm_query() (full recursive sub-agent), dual model routing, budget controls (USD, timeout, tokens, consecutive errors), lazy loading, pass-by-reference - Add "RLM Execution Model" cross-cutting section - Add "Implementation Progress" tracking table - Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): Phase 4 — budget controls, compaction, reflection pipeline Budget enforcement in ExecutionLoop: - max_tokens_total: cumulative token limit, checked before each iteration - max_duration: wall-clock timeout for entire thread - max_consecutive_errors: consecutive error steps threshold (resets on success, matching official RLM behavior) - All produce ThreadOutcome::Failed with descriptive messages Context compaction (from RLM paper, 85% threshold): - estimate_tokens(): char-based estimation (chars/4, matching RLM) - should_compact(): triggers when tokens >= threshold_pct * context_limit - compact_messages(): asks LLM to 
summarize progress, replaces history with [system, summary, continuation_note], preserves intermediate results
- Configurable via ThreadConfig: model_context_limit, compaction_threshold

Dual model routing:
- LlmCallConfig gains depth field (0=root, 1+=sub-call)
- Implementations can route to cheaper models for sub-calls
- ExecutionLoop passes thread depth to every LLM call

Reflection pipeline (reflection/pipeline.rs):
- reflect(thread, llm): analyzes completed thread via LLM
- Produces Summary doc (always), Lesson doc (if errors), Issue doc (if failed)
- Builds transcript from thread messages + error events
- Returns ReflectionResult with docs + token usage

ThreadConfig extended with: max_tokens_total, max_consecutive_errors, model_context_limit, enable_compaction, compaction_threshold, depth, max_depth.

78 tests passing, zero clippy warnings.

* feat(engine): Phase 5 — conversation surface separated from execution

Conversation is now a UI layer, not an execution boundary. Multiple threads can run concurrently within one conversation; threads can outlive their originating conversation.

New types (types/conversation.rs):
- ConversationSurface: channel + user + entries + active_threads
- ConversationEntry: sender (User/Agent/System) + content + origin_thread_id
- ConversationId, EntryId (UUID newtypes)
- EntrySender enum (User, Agent{thread_id}, System)

ConversationManager (runtime/conversation.rs):
- get_or_create_conversation(channel, user) — indexed by (channel, user)
- handle_user_message() — injects into active foreground thread or spawns new
- record_thread_outcome() — adds agent/system entries, untracks completed threads
- get_conversation(), list_conversations()

This enables the key architectural insight: a user can ask "what's the weather?" while a deployment thread is still running. Both produce entries in the same conversation.

85 tests passing, zero clippy warnings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM Restructure phases 6-8 to clarify execution model: - Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker Python runtimes for LLM-generated code. - WASM sandbox is for third-party tool isolation (existing infra, Phase 8) - Docker containers are for thread-level isolation of high-risk work (Phase 8) - Two-phase commit moves to Phase 6 (integration) at the adapter boundary Phase renumbering: - Old Phase 6 (Tier 2-3) → removed as separate phase - Old Phase 7 (integration) → Phase 6 - Old Phase 8 (cleanup) → Phase 7 - New Phase 8: WASM tools + Docker thread isolation (infra integration) Updated progress table: Phases 1-5 marked DONE with test counts and commits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): Phase 6 — bridge adapters for main crate integration Strategy C parallel deployment: when ENGINE_V2=true env var is set, user messages route through the engine instead of the existing agentic loop. All existing behavior is unchanged when the flag is off. Bridge module (src/bridge/): - LlmBridgeAdapter: wraps LlmProvider as engine LlmBackend, converts ThreadMessage↔ChatMessage, ActionDef↔ToolDefinition, depth-based model routing (primary vs cheap_llm) - EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as EffectExecutor, routes tool calls through existing execute_tool_with_safety pipeline - InMemoryStore: HashMap-backed Store impl (no DB tables needed yet) - EngineRouter: is_engine_v2_enabled() + handle_with_engine() that builds engine from Agent deps and processes messages end-to-end Integration touchpoint (4 lines in agent_loop.rs): After hook processing, before session resolution, check ENGINE_V2 flag and route UserInput through the engine path. Accessor visibility widened: llm(), cheap_llm(), safety(), tools() changed from pub(super) to pub(crate) for bridge access. 
85 engine tests + main crate clippy clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): add user message and system prompt to thread before execution The ExecutionLoop was sending empty messages to the LLM because the thread was spawned with the user's input as the goal but no messages. Fixes: - ThreadManager.spawn_thread() now adds the goal as an initial user message before starting the execution loop - ExecutionLoop.run() injects a default system prompt if none exists Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): match existing LLM request format to prevent 400 errors The LLM bridge was missing several defaults that the existing Reasoning.respond_with_tools() sets: - tool_choice: "auto" when tools are present (required by some providers) - max_tokens: 4096 (default) - temperature: 0.7 (default) - When no tools (force_text): use plain complete() instead of complete_with_tools() with empty tools array — matches existing no-tools fallback path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): persist conversation context across messages The engine was creating a fresh ThreadManager and InMemoryStore per message, losing all context between turns. A follow-up question like "what are the latest 10 issues?" had no memory of the prior "how many issues" response. 
Fixes: - EngineState (ThreadManager, ConversationManager, InMemoryStore) now persists across messages via OnceLock, initialized on first use - ConversationManager builds message history from prior conversation entries (user messages + agent responses) and passes it to new threads - ThreadManager.spawn_thread_with_history() accepts initial_messages that are prepended before the current user message - System notifications (thread started/completed) are filtered out of the history (not useful as LLM context) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): enable CodeAct/RLM mode with code block detection The engine now operates in CodeAct/RLM mode: System prompt (executor/prompt.rs): - Instructs LLM to write Python in ```repl fenced blocks - Documents available tools as callable Python functions - Documents llm_query(), llm_query_batched(), FINAL() - Documents context variables (context, goal, step_number, previous_results) - Strategy guidance: examine context, break into steps, use tools, call FINAL() Code block detection (bridge/llm_adapter.rs): - extract_code_block() scans LLM text responses for ```repl or ```python blocks - When detected, returns LlmResponse::Code instead of LlmResponse::Text - The ExecutionLoop routes Code responses through Monty for execution No structured tool definitions sent to LLM: - Tools are described in the system prompt as Python functions - The LLM call sends empty actions array, forcing text-mode responses - This ensures the LLM writes code blocks (CodeAct) instead of structured tool calls (which would bypass the REPL) 85 tests passing, zero clippy warnings. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(engine): add 8 CodeAct/RLM E2E tests with mock LLM Comprehensive test coverage for the Monty Python execution path: - codeact_simple_final: Python code calls FINAL('answer') → thread completes - codeact_tool_call_then_final: code calls test_tool() → FunctionCall suspends VM → MockEffects returns result → code resumes → FINAL() - codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15') with no tool calls — pure Python in Monty - codeact_multi_step: first step prints output (no FINAL), second step sees output metadata and calls FINAL — tests iterative REPL flow - codeact_error_recovery: first step has NameError → error flows to LLM as stdout → second step recovers with FINAL — tests error transparency - codeact_context_variables_available: code accesses `goal` and `context` variables injected by the RLM context builder - codeact_multiple_tool_calls_in_loop: for loop calls test_tool() 3 times → 3 FunctionCall suspensions → all results collected → FINAL - codeact_llm_query_recursive: code calls llm_query('prompt') → VM suspends → MockLlm provides sub-agent response → result returned as Python string variable 93 tests passing (85 prior + 8 new), zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): detect code blocks in plain completion path + multi-block support Two bugs fixed: 1. The no-tools completion path (used by CodeAct since we send empty actions) returned LlmResponse::Text without checking for code blocks. Code blocks were rendered as markdown text instead of being executed. 2. 
extract_code_block now: - Handles bare ``` fences (skips non-Python languages) - Collects ALL code blocks in the response and concatenates them (models often split code across multiple blocks with explanation) - Tries markers in order: ```repl, ```python, ```py, then bare ``` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(bridge): add 11 regression tests for code block extraction Covers the exact failure modes discovered during live testing: - extract_repl_block: standard ```repl fenced block - extract_python_block: ```python marker - extract_py_block: ```py shorthand - extract_bare_backtick_block: bare ``` with Python content - skip_non_python_language: ```json should NOT be extracted - no_code_blocks_returns_none: plain text, no fences - multiple_code_blocks_concatenated: two ```repl blocks with explanation between them → concatenated with \n\n - mixed_thinking_and_code: model outputs explanation + two ```python blocks (the Hyperliquid case) → both extracted - repl_preferred_over_bare: ```repl takes priority over bare ``` - empty_code_block_skipped: empty fenced block returns None - unclosed_block_returns_none: no closing ``` returns None Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): detect FINAL() in text responses + regression tests Models sometimes write FINAL() outside code blocks — as plain text after an explanation. The Hyperliquid case: model outputs a long analysis then FINAL("""...""") at the end, not inside ```repl fences. 
Fixes:
- extract_final_from_text(): regex-based FINAL detection in text responses, matching the official RLM's find_final_answer() fallback
- Handles: double-quoted, single-quoted, triple-quoted, unquoted, nested parens
- Checked in the LlmResponse::Text handler BEFORE the tool intent nudge (FINAL takes priority)

9 new tests:
- codeact_final_in_text_response: FINAL("answer") in plain text
- codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text
- final_double_quoted, final_single_quoted, final_triple_quoted, final_unquoted, final_with_nested_parens, final_after_long_text, no_final_returns_none

102 tests passing (93 + 9 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add crate extraction & cleanup roadmap

Documents architectural recommendations from the engine v2 design process for future reference:
- Root directory consolidation (channels-src + tools-src → extensions/)
- Crate extraction tiers: zero-coupling (estimation, observability, tunnel), trivial-coupling (document_extraction, pairing, hooks), medium-coupling (secrets, MCP, db, workspace, llm, skills), heavy-coupling (web gateway, agent, extensions)
- src/ module reorganization into logical groups (core, persistence, infra, media, support)
- main.rs/app.rs slimming targets (100/500 lines after migration)
- WASM module candidates (document_extraction) and non-candidates (REPL, web gateway → separate crates instead)
- Priority ordering for extraction work
- Tracks completed items (ironclaw_safety, ironclaw_engine, transcription move)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): live progress status updates via event broadcast

Engine v2 now shows live progress in the CLI (and any channel):
- "Thinking..." when a step starts
- Tool name + success/error when actions execute
- "Processing results..."
when a step completes Implementation: - ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256) - ExecutionLoop.emit_event() writes to thread.events AND broadcasts - ThreadManager.subscribe_events() returns a receiver - Router uses tokio::select! to listen for events while waiting for thread completion, forwarding them as StatusUpdate to the channel This replaces the polling approach with zero-latency event streaming. Agent.channels visibility widened to pub(crate) for bridge access. 102 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): include tool results in code step output for LLM context The LLM was ignoring tool results and answering from training data because the compact output metadata didn't include what tools returned. Tool results lived only as ActionResult messages (role: Tool) which some providers flatten or the model ignores. Now the code step output includes: - stdout from Python print() statements - [tool_name result] with the actual output (truncated to 4K per tool) - [tool_name error] for failed tools - [return] for the code's return value - Total output truncated to 8K chars to prevent context bloat This ensures the model sees web_search results, API responses, etc. in the next iteration and can reason about them instead of hallucinating. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): add debug/trace logging for CodeAct execution Three verbosity levels for debugging the engine: RUST_LOG=ironclaw_engine=debug: - LLM call: message count, iteration, force_text - LLM response: type (text/code/action_calls), token usage - Code execution: code length, action count, had_error, final_answer - Text response: length, FINAL() detection RUST_LOG=ironclaw_engine=trace: - Full message list sent to LLM (role, length, first 200 chars each) - Full code block being executed - stdout preview (first 500 chars) - Per-tool results (name, success, first 300 chars of output) - Text response preview (first 500 chars) Usage: ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): execution trace recording + retrospective analysis Enable with ENGINE_V2_TRACE=1 to get full execution traces and automatic issue detection after each thread completes. Trace recording (executor/trace.rs): - build_trace(): captures full thread state — messages (with full content), events, step count, token usage, detected issues - write_trace(): writes JSON to engine_trace_{timestamp}.json - log_trace_summary(): logs summary + issues at info/warn level Retrospective analyzer detects 8 issue categories: - thread_failure: thread ended in Failed state - no_response: no assistant message generated - tool_error: specific tool failures with error details - code_error: Python errors (NameError, SyntaxError, etc.) in output - missing_tool_output: tool results exist but not in system messages - excessive_steps: >10 steps (may be stuck in loop) - no_tools_used: single-step answer without tools (hallucination risk) - mixed_mode: text responses without code blocks (prompt not followed) Thread state now saved to store after execution completes (for trace access after join_thread). 
Usage: ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run # After each message: trace JSON + issue log in terminal Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): wire reflection pipeline + trace analysis into thread lifecycle After every thread completes, ThreadManager now automatically runs: 1. Retrospective trace analysis (non-LLM, always): - Detects 8 issue categories (tool errors, code errors, missing outputs, excessive steps, hallucination risk, etc.) - Logs issues at warn level when found 2. Trace file recording (when ENGINE_V2_TRACE=1): - Writes full JSON trace to engine_trace_{timestamp}.json 3. LLM reflection (when enable_reflection=true): - Calls reflection pipeline to produce Summary, Lesson, Issue docs - Saves docs to store for future context retrieval - Enabled by default in the bridge router All three run inside the spawned tokio task after exec.run() completes, before saving the final thread state. No external wiring needed. Removed duplicate trace recording from the router — it's now handled by ThreadManager automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): convert tool name hyphens to underscores for Python compatibility Root cause from trace analysis: the LLM writes `web_search()` (valid Python identifier) but the tool registry has `web-search` (with hyphen). The EffectBridgeAdapter couldn't find the tool → "Tool not found" error → model fabricated fake data instead. 
Fixes:
- available_actions(): converts tool names from hyphens to underscores (web-search → web_search) so the system prompt lists valid Python names
- execute_action(): tries the original name first, then falls back to the hyphenated form (web_search → web-search) for tool registry lookup
- Same conversion in the router's capability registry builder

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): parse JSON tool output to prevent double-serialization

From trace analysis: web_search returned a JSON string, which was wrapped as serde_json::json!(string), creating a Value::String containing JSON. When Monty got this as MontyObject::String, the Python code couldn't index it with result['title'] → TypeError.

Fix: try parsing the tool output string as JSON first. If valid, use the parsed Value (it becomes a Python dict/list). If not valid JSON, keep it as a string. This means web_search results are directly indexable in Python:

    results = web_search(query="...")
    print(results["results"][0]["title"])  # works now

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): persist variables across code steps via `state` dict

Monty creates a fresh runtime per code step, so variables are lost between steps. This caused the model to re-paste tool results from system messages, wasting tokens.

Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that accumulates across steps:
- Tool results stored by tool name: state["web_search"] = {results...}
- Return values stored: state["last_return"], state["step_0_return"]
- Injected as a `state` Python variable in each new MontyRun

Now the model can do:

    Step 1: results = web_search(query="...")  # tool result saved in state
    Step 2: data = state["web_search"]         # access previous result
            summary = llm_query("summarize", str(data))
            FINAL(summary)

System prompt updated to document the `state` variable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add state hint on code errors + retrieval engine integration

When code fails with NameError/UnboundLocalError (the model trying to access variables from a previous step), the error output now includes:

    [HINT] Variables don't persist between code blocks. Use the `state` dict
    to access data from previous steps. Available keys: ["web_search", "last_return"]

This teaches the model to use `state["web_search"]` instead of `result` after a NameError, reducing wasted steps from 3-4 to 1.

Also integrates RetrievalEngine into context building and ThreadManager:
- build_step_context() now accepts an optional RetrievalEngine to inject relevant memory docs (Lessons, Specs, Playbooks) into LLM context
- RetrievalEngine uses keyword matching with doc-type priority scoring
- Memory docs from reflection (Phase 4) now feed back into future threads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove trace files and add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace web_fetch example with web_search in CodeAct prompt

The system prompt example used web_fetch(url="...") which doesn't exist as a tool. The model learned from the example and tried web_fetch, getting "Tool not found". Changed to web_search(query="...") which is an actual registered tool.

Found via trace analysis — the reflection pipeline correctly identified this as a "Tool Name Correction" spec doc.
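The error-hint augmentation can be sketched as below. Function name and exact message wording are illustrative, not the engine's code; the pattern match on NameError/UnboundLocalError and the list of available `state` keys follow the commit description.

```python
def augment_code_error(error_text, state_keys):
    """Sketch: when a step fails because it referenced a variable from a
    previous step, append a hint pointing the model at the `state` dict."""
    if "NameError" in error_text or "UnboundLocalError" in error_text:
        hint = (
            "[HINT] Variables don't persist between code blocks. "
            "Use the `state` dict to access data from previous steps. "
            f"Available keys: {state_keys}"
        )
        return f"{error_text}\n{hint}"
    return error_text  # other errors pass through unchanged

out = augment_code_error(
    "NameError: name 'result' is not defined",
    ["web_search", "last_return"],
)
```

Only name-resolution failures get the hint, so unrelated errors (ValueError, etc.) are not cluttered with it.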
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(engine): extract prompt templates to markdown files

Prompt templates moved from inline Rust strings to plain markdown files at crates/ironclaw_engine/prompts/ for easy inspection and iteration:
- prompts/codeact_preamble.md — main instructions, special functions, context variables, rules
- prompts/codeact_postamble.md — strategy section

Loaded at compile time via include_str!(), so no runtime file I/O. Edit the .md files and rebuild to iterate on prompts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace byte-index slicing with char-safe truncation

Panic: 'byte index 80 is not a char boundary; it is inside ''' when tool output contained multi-byte UTF-8 characters (smart quotes from web search results).

Fixed 4 unsafe byte-index slices:
- thread.rs:281: message preview &content[..80] → chars().take(80)
- loop_engine.rs:556: tool output &str[..4000] → chars().take(4000)
- loop_engine.rs:579: output tail &str[len-8000..] → chars().skip()
- scripting.rs:82: stdout tail &str[len-N..] → chars().skip()

All now use .chars().take() or .chars().skip(), which respect character boundaries. Follows the CLAUDE.md rule: "Never use byte-index slicing on user-supplied or external strings."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): fix false positive missing_tool_output warning in trace analyzer

The check was looking for "[" + "result]" in System-role messages only, but tool output metadata is added with patterns like "[shell result]" and may appear in messages with any role. Changed to scan all messages for " result]" or " error]" patterns regardless of role.
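The char-boundary pitfall behind that panic can be demonstrated in a few lines. A sketch, not the engine's Rust code: slicing the UTF-8 byte encoding at a fixed index can land inside a multi-byte character (the Rust panic), while character-based truncation — the `.chars().take(n)` equivalent — is always safe.

```python
def safe_preview(text, max_chars):
    # Character-based truncation, mirroring .chars().take(n) in Rust.
    return "".join(list(text)[:max_chars])

smart = "a" * 79 + "\u2019suffix"  # U+2019 (right single quote) is 3 bytes in UTF-8

# Byte-index slicing at 80 cuts inside the 3-byte character...
try:
    smart.encode("utf-8")[:80].decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False

# ...while character-based truncation never splits a code point.
preview = safe_preview(smart, 80)
```

The same reasoning applies to tail truncation: skip whole characters, never a fixed byte offset from the end.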
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with Phase 6 status and approval flow design

Phase 6 updated to reflect what was actually built:
- Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done
- Integration touchpoint (4 lines in handle_message) — done
- Live progress via broadcast events — done
- Conversation persistence across messages — done
- Trace recording + retrospective analysis — done
- 8 bugs found and fixed via trace analysis — documented

Phase 6 remaining work documented:
- Approval flow: detailed 5-step design (send to channel, pause thread, route response, resume execution, always handling) with v1 reference
- Database persistence (InMemoryStore → real DB tables)
- Acceptance testing (TestRig + TraceLlm fixtures)
- Two-phase commit for high-stakes effects

Progress table updated: Phase 6 marked as DONE (partial), 134 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add self-improving engine design plan

Designs a system where the engine debugs and improves itself, based on the pattern observed in the last session: 5 consecutive bug fixes all followed trace → read → identify → edit → test, using tools the engine already has access to.

Three levels of self-improvement:
- Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply.
- Level 2 (Config): adjust defaults/mappings. Branch + test + PR.
- Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR.

Architecture: a Self-improvement Mission spawns a Reflection thread that reads traces, reads source, proposes fixes, validates via cargo test, and either auto-applies (Level 1) or creates a PR (Levels 2-3).

Includes: fix pattern database (seeded from our 8 debugging-session fixes), feedback loop diagram, safety model, implementation phases (A through D), and what exists vs what's new.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 security model and audit

Comprehensive security analysis of engine v2 covering:

Threat model: 4 attacker profiles (malicious input, prompt injection via tools, poisoned memory, supply chain).

Current state audit: 9 controls working (Monty sandbox, safety layer, policy engine, leases, provenance, events) and 9 gaps identified. Critical finding: ALL tools granted by default — CodeAct code can call shell, write_file, apply_patch without approval. Proposed fix: 3-tier tool classification (auto/approve-once/always-approve).

CodeAct-specific threats: tool call amplification, prompt injection via search results, data exfiltration via tool chains, Monty escape.

Self-improvement security: poisoned trace attacks, memory poisoning via reflection. Mitigations: edit validation, frequency caps, audit trail, auto-rollback, reflection output scanning.

6-layer security architecture proposed: input validation, capability gating, output sanitization, execution sandboxing, self-improvement controls, observability. Prioritized implementation plan with severity/effort ratings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(security): cross-reference v1 controls — use, don't reinvent

Updated security plan with a detailed audit of ALL existing v1 security controls and how they map to engine v2 bridge gaps.

Key finding: v1 already has solutions for every security gap identified.
The bridge just needs to wire them in:
- Tool::requires_approval() exists but the bridge doesn't call it
- safety.wrap_for_llm() exists but tool results enter context unwrapped
- RateLimiter exists but the bridge doesn't check rate limits
- BeforeToolCall hooks exist but the bridge doesn't run them
- redact_params() exists but the bridge doesn't redact sensitive params
- Shell risk classification (Low/Medium/High) is inherited but ignored

Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter, not new security infrastructure. The bridge is the security boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy

- Add Mission type and MissionManager for recurring thread scheduling
- Add ReliabilityTracker for per-capability success/failure/latency tracking
- Add reflection executor that spawns CodeAct threads for post-completion reflection
- Extend PolicyEngine with provenance-aware taint checking (LLM-generated data requires approval for financial/external-write effects)
- Extend Store trait with mission CRUD methods
- Add conversation surface tracking, compaction token fix, context memory injection
- Wire new modules through lib.rs re-exports and bridge adapters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire v1 security controls into engine v2 adapter

Zero engine crate changes. All security controls enforced at the bridge boundary in EffectBridgeAdapter:

1. Tool approval (v1: Tool::requires_approval):
   - Checks each tool's approval requirement with the actual params
   - Always → returns EngineError::LeaseDenied (blocks execution)
   - UnlessAutoApproved → checks auto_approved set, blocks if not approved
   - Never → proceeds
   - Per-session auto_approved HashSet (for future "always" handling)
2. Hook interception (v1: BeforeToolCall):
   - Runs HookEvent::ToolCall before every execution
   - HookOutcome::Reject → blocks with reason
   - HookError::Rejected → blocks with reason
   - Hook errors → fail-open (logged, execution continues)
3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm):
   - Leak detection: API keys in tool output are redacted
   - Policy enforcement: content policy rules applied
   - Length truncation: output capped at 100KB
   - XML boundary protection: prevents injection via tool output
4. Sensitive param redaction (v1: redact_params):
   - Tool's sensitive_params() consulted before hooks see parameters
   - Redacted params sent to hooks, original params used for execution
5. available_actions() now sets requires_approval based on each tool's default approval requirement, so the engine's PolicyEngine can gate tools it hasn't seen before.
6. Actual execution timing measured via Instant::now() (replaces the placeholder Duration::from_millis(1)).

Accessor visibility: hooks() widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): implement tool approval flow for engine v2

Adds a complete approval flow that mirrors v1 behavior, using the existing v1 security controls (Tool::requires_approval, auto-approve sets, StatusUpdate::ApprovalNeeded).

## How it works

### Step 1: Tool blocked at execution

When the LLM's code calls a tool (e.g., `shell("ls")`):
1. EffectBridgeAdapter.execute_action() looks up the Tool object
2. Calls tool.requires_approval(&params) — returns ApprovalRequirement
3. If Always → EngineError::LeaseDenied (always blocks)
4. If UnlessAutoApproved → checks auto_approved HashSet → if not in set, returns EngineError::LeaseDenied
5. If Never → proceeds to execution

### Step 2: Engine returns NeedApproval

The LeaseDenied error propagates through:
- CodeAct path: becomes a Python RuntimeError, code halts, thread returns NeedApproval with action_name + parameters
- Structured path: same via ActionResult.is_error

### Step 3: Router stores pending approval

- PendingApproval { action_name, original_content } stored on EngineState
- StatusUpdate::ApprovalNeeded sent to the channel (shows an approval card in CLI/web with tool name, parameters, yes/always/no buttons)
- Returns text: "Tool 'shell' requires approval. Reply yes/always/no."

### Step 4: User responds

handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2:
- 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes the original message (the tool now passes the approval check on the second run)
- 'always' → same + logs for session persistence
- 'no' → returns "Denied: tool was not executed."

### Key design choice

Instead of pausing/resuming mid-execution (which needs engine changes to freeze/restore the Monty VM state), we auto-approve the tool and re-run the full message. The EffectBridgeAdapter's auto_approved set persists across runs, so the second execution passes immediately. This trades one extra LLM call for zero engine modifications.

## Files changed

- src/bridge/router.rs: PendingApproval struct, handle_approval(), NeedApproval → StatusUpdate::ApprovalNeeded conversion
- src/bridge/mod.rs: export handle_approval
- src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2
- src/bridge/effect_adapter.rs: fmt fixes

151 tests passing, clippy + fmt clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): demote trace/reflection logging from info to debug

INFO-level log output from background tasks (trace analysis, reflection) corrupts the REPL terminal UI. The trace summary, issue warnings, and reflection doc previews were printing mid-approval-card, breaking the interactive display.
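The Step 1 approval gate can be sketched as follows. This is a simplified model with names mirroring ApprovalRequirement and the per-session auto_approved set; the real check is Rust inside EffectBridgeAdapter, and a `False` result surfaces as EngineError::LeaseDenied.

```python
from enum import Enum, auto

class ApprovalRequirement(Enum):
    ALWAYS = auto()                 # always blocks at this gate
    UNLESS_AUTO_APPROVED = auto()   # blocks unless previously approved
    NEVER = auto()                  # no approval needed

def check_approval(requirement, tool_name, auto_approved):
    """Return True if execution may proceed, False if it must be blocked."""
    if requirement is ApprovalRequirement.ALWAYS:
        return False
    if requirement is ApprovalRequirement.UNLESS_AUTO_APPROVED:
        return tool_name in auto_approved
    return True

approved = set()
first = check_approval(ApprovalRequirement.UNLESS_AUTO_APPROVED, "shell", approved)
approved.add("shell")  # user replied yes/always → auto_approve_tool("shell")
second = check_approval(ApprovalRequirement.UNLESS_AUTO_APPROVED, "shell", approved)
```

Because the set persists across runs, re-running the original message after a "yes" passes the gate immediately, which is what makes the re-run design work without pausing the VM.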
Fix: all logging in trace.rs changed from info!/warn! to debug!/warn!. Trace analysis and reflection results now only show when RUST_LOG=ironclaw_engine=debug is set.

Also added a logging discipline rule to the global CLAUDE.md:
- info! → user-facing status the REPL intentionally renders
- debug! → internal diagnostics (traces, reflection, engine internals)
- Background tasks must NEVER use info! — it breaks the TUI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): demote all router info! logging to debug!

"engine v2: initializing" and "engine v2: handling message" were printing at INFO level, corrupting the REPL UI. All router logging now uses debug! — only visible with RUST_LOG=ironclaw=debug. Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(safety): demote leak detector warn-action logs from warn! to debug!

The leak detector's Warn-action matches (high_entropy_hex pattern on web search results containing commit SHAs, CSS colors, URL hashes) were logging at warn! level, corrupting the REPL UI with lines like:

    WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5

These are informational false positives — real leaks use LeakAction::Redact, which silently modifies the content. Warn-action matches only log for debugging purposes and should not appear in production output. Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): strengthen CodeAct prompt to prevent shallow text answers

The model was answering "Suggested 45 improvements" as a brief text summary from training data without actually searching or listing them. The trace showed: no code block, no tool calls, no FINAL().

Prompt changes:
- Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with plain text only." (was: "Always write code... plain text for brief explanations")
- Rule 2 (NEW): "NEVER answer from memory or training data alone. Always use tools to get real, current information before answering."
- Rule 3: the FINAL answer "should be detailed and complete — not just a summary like 'found 45 items'"
- Rule 8 (NEW): "Include the actual content in your FINAL() answer, not just a count or summary. Users want to see the details."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): persist reflection docs to workspace for cross-session learning

Replaces InMemoryStore with HybridStore:
- Ephemeral data (threads, steps, events, leases) stays in-memory
- MemoryDocs (lessons, specs, playbooks from reflection) persist to the workspace at engine/docs/{type}/{id}.json

On engine init, load_docs_from_workspace() reads existing docs back into the in-memory cache. This means:
- Lessons learned in session 1 are available in session 2
- The RetrievalEngine injects relevant past lessons into new threads
- The engine genuinely improves over time as reflection accumulates

Workspace paths:
    engine/docs/lessons/{uuid}.json
    engine/docs/specs/{uuid}.json
    engine/docs/playbooks/{uuid}.json
    engine/docs/summaries/{uuid}.json
    engine/docs/issues/{uuid}.json

No new database tables. Uses existing workspace write/read/list. workspace() accessor widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): adapt to execute_tool_with_safety params-by-value change

Staging merge changed execute_tool_with_safety to take params by value instead of by reference (perf optimization from PR #926). Updated the bridge adapter to clone params before passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): add web gateway integration plan to Phase 6

Documents three gaps between engine v2 and the web gateway:
1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent)
2. No conversation persistence (engine uses HybridStore, gateway reads the v1 DB)
3. No cross-channel visibility (REPL ↔ web messages invisible to each other)

Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1 conversation tables after thread completion. Prerequisite: AppEvent extraction PR (in progress separately).

Also updated DB persistence status: HybridStore with workspace-backed MemoryDocs is now implemented (partial persistence).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): document routine/job gap and SIGKILL crash scenario

Routines are entirely v1 — not hooked up to engine v2. When a user asks "create a routine" as natural language, engine v2 tries to call routine_create via CodeAct, but the tool needs RoutineEngine + Database refs that the bridge's minimal JobContext doesn't provide. This caused a SIGKILL crash during testing.

Options documented: block routine tools in v2 (short term), pass refs through context (medium), replace with the Mission system (long term).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+ modules across agent, orchestrator, worker, tools, and extensions — it had become the application-wide event protocol, not a web transport concern.

Create crates/ironclaw_common as a shared workspace crate and move the enum there as AppEvent. Also move the truncate_preview utility, which had similarly leaked from the web gateway into agent modules.
- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening, where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB

Three changes to make engine v2 visible in the web gateway:

1. SSE event streaming (AppEvent broadcast):
   - ThreadEvent → AppEvent conversion via thread_event_to_app_event()
   - Events broadcast to SseManager during the poll loop
   - Covers: Thinking, ToolCompleted (success/error), Status, Response
   - Web gateway receives real-time progress without any gateway changes
2. Conversation persistence to the v1 database:
   - After a thread completes, writes the user message + agent response to the v1 ConversationStore via add_conversation_message()
   - Uses get_or_create_assistant_conversation() for per-user per-channel
   - Web gateway reads from the DB as usual — chat history appears
3. Final response broadcast:
   - AppEvent::Response with full text + thread_id sent via SSE
   - Web gateway renders the response in the chat UI

New EngineState fields: sse (Option<Arc<SseManager>>), db (Option<Arc<dyn Database>>). Both populated from Agent.deps. Agent.deps visibility widened to pub(crate).

Depends on: ironclaw_common crate with the AppEvent type (PR #1615).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits

Three security/stability improvements in EffectBridgeAdapter:

1. V1-only tool blocking:
   - routine_create, create_job, build_software (and hyphenated variants) return a helpful error: "use the slash command instead"
   - Filtered out of available_actions() so the system prompt doesn't list them
   - Prevents crashes from tools needing RoutineEngine/Scheduler refs
2. Per-step tool call limit:
   - Max 50 tool calls per code block (AtomicU32 counter)
   - Prevents amplification: `for i in range(10000): shell(...)`
   - Returns "call limit reached, break into multiple steps"
3. Rate limiting:
   - Per-user per-tool sliding window via RateLimiter
   - Checks tool.rate_limit_config() before every execution
   - Returns "rate limited, try again in Ns"

Architecture plan updated:
- Gateway integration: DONE
- Routines: BLOCKED (gracefully, with slash command fallback)
- Rate limiting: DONE
- Call limit: DONE
- Phase 6 status: DONE (remaining: acceptance tests, two-phase commit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Mission system design — goal-oriented autonomous threads

Missions replace routines with evolving, knowledge-accumulating autonomous agents. Unlike routines (fixed prompt, stateless), Missions:
- Generate prompts from accumulated Project knowledge (lessons, playbooks, issues from prior threads)
- Adapt their approach when something fails repeatedly
- Track progress toward a goal with success criteria
- Self-manage: pause when stuck, complete when the goal is achieved

Architecture: MissionManager with a cron ticker spawns threads via ThreadManager. Meta-prompt built from mission goal + Project MemoryDocs via RetrievalEngine. Reflection feeds back automatically.

6-step implementation plan: cron trigger, meta-prompt builder, bridge wiring, CodeAct tools, progress tracking, persistence.

Includes two worked examples: daily tech news briefing (ongoing) and test coverage improvement (goal-driven, self-completing).
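The per-user per-tool sliding window in item 3 can be sketched as below. This is an illustration of the mechanism, not the v1 RateLimiter itself; class and method names are hypothetical.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Sketch: allow at most max_calls per (user, tool) within window_secs."""
    def __init__(self, max_calls, window_secs):
        self.max_calls = max_calls
        self.window_secs = window_secs
        self.calls = defaultdict(deque)  # (user, tool) -> timestamps of recent calls

    def try_acquire(self, user, tool, now):
        window = self.calls[(user, tool)]
        while window and now - window[0] >= self.window_secs:
            window.popleft()  # drop calls that have aged out of the window
        if len(window) >= self.max_calls:
            return False  # caller reports "rate limited, try again in Ns"
        window.append(now)
        return True

limiter = SlidingWindowLimiter(max_calls=2, window_secs=60.0)
```

A fixed-size window like this recovers automatically: once the oldest call ages out, the next acquisition succeeds, which matches the "try again in Ns" error text.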
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): extend Mission types with webhook/event triggers + evolving strategy

Mission types updated to support external activation sources.

MissionCadence expanded:
- Cron { expression, timezone } — timezone-aware scheduling
- OnEvent { event_pattern } — channel message pattern matching
- OnSystemEvent { source, event_type } — structured events from tools
- Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.)
- Manual — explicit triggering only

The engine defines trigger TYPES. The bridge implements the infrastructure (cron ticker, webhook endpoints, event matchers). GitHub issues, PRs, email, Slack events all use the generic Webhook cadence — no special-casing in the engine. The webhook payload is injected as state["trigger_payload"] in the thread's Python context.

Mission struct extended:
- current_focus: what the next thread should work on (evolving)
- approach_history: what we've tried (for adaptation)
- max_threads_per_day / threads_today: daily budget
- last_trigger_payload: webhook/event data for thread context

Plan updated with a trigger type table and webhook integration design.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): implement MissionManager execution with meta-prompts

The MissionManager now builds evolving meta-prompts and processes thread outcomes for continuous learning.

fire_mission() upgraded:
- Loads Project MemoryDocs via RetrievalEngine for context
- Builds the meta-prompt from: goal, current_focus, approach_history, project knowledge docs, trigger payload, thread count
- Spawns a thread with the meta-prompt as the user message
- Background task waits for completion and processes the outcome
- Daily thread budget enforcement (max_threads_per_day)

Meta-prompt structure:
    # Mission: {name}
    Goal: {goal}
    ## Current Focus (evolves between threads)
    ## Previous Approaches (what we've tried)
    ## Knowledge from Prior Threads (lessons, playbooks, issues)
    ## Trigger Payload (webhook/event data if applicable)
    ## Instructions (accomplish step, report next focus, check goal)

Outcome processing:
- Extracts "next focus:" from the FINAL() response → updates current_focus
- Detects "goal achieved: yes" → completes the mission
- Records the accomplishment in approach_history
- Failed threads recorded as "FAILED: {error}"

Cron ticker:
- start_cron_ticker() spawns a tokio task, ticks every 60s
- Checks active Cron missions, fires those past next_fire_at

151 tests passing.
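The outcome-processing markers can be sketched as a small parser. The dict fields loosely mirror the Mission struct; the function name and exact parsing details are illustrative, not the engine's code.

```python
def process_outcome(final_text, mission):
    """Sketch: scan a FINAL() response for the "next focus:" and
    "goal achieved: yes" markers described in the commit."""
    for line in final_text.splitlines():
        if line.lower().startswith("next focus:"):
            mission["current_focus"] = line.split(":", 1)[1].strip()
    if "goal achieved: yes" in final_text.lower():
        mission["status"] = "completed"
    mission["approach_history"].append(final_text[:200])  # record what was tried
    return mission

mission = {"current_focus": "", "status": "active", "approach_history": []}
process_outcome("Wrote 3 tests.\nNext focus: db module\nGoal achieved: no", mission)
```

Marker-based parsing like this keeps the contract with the LLM simple: the instructions section of the meta-prompt only has to ask for two fixed phrases.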
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire MissionManager into engine v2 for CodeAct access

Missions are now callable from CodeAct Python code:

```python
# Create a daily briefing mission
result = mission_create(
    name="Tech News",
    goal="Daily AI/crypto/software news briefing",
    cadence="0 9 * * *"
)

# List all missions
missions = mission_list()

# Manually fire a mission
mission_fire(id="...")

# Pause/resume
mission_pause(id="...")
mission_resume(id="...")
```

Implementation:
- MissionManager created on engine init, cron ticker started
- EffectBridgeAdapter intercepts mission_* function calls before tool lookup and routes them to the MissionManager
- parse_cadence() handles: "manual", cron expressions, "event:pattern", "webhook:path"
- Mission functions documented in the CodeAct system prompt
- MissionManager set on the adapter via set_mission_manager() after init (avoids a circular dependency)

System prompt updated with mission_create, mission_list, mission_fire, mission_pause, mission_resume documentation. 151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): map routine_* calls to mission operations in v2

When the model calls routine_create, routine_list, routine_fire, routine_pause, routine_resume, or routine_delete, the bridge now routes them to the MissionManager instead of blocking with an error.

Mapping:
    routine_create → mission_create (with cadence parsing)
    routine_list   → mission_list
    routine_fire   → mission_fire
    routine_pause  → mission_pause
    routine_resume → mission_resume
    routine_update → mission_pause/resume (based on params)
    routine_delete → mission_complete (marks as done)

Routine tools removed from the v1-only blocklist and restored in available_actions(). The model can use either "routine" or "mission" vocabulary — both work.

Still blocked: create_job, cancel_job, build_software (need v1 Scheduler/ContainerJobManager refs).
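The parse_cadence() dispatch described above can be sketched as follows. Variant names mirror MissionCadence; the tuple encoding and the exact prefixes are illustrative (the commit only names "manual", cron expressions, "event:pattern", and "webhook:path").

```python
def parse_cadence(spec):
    """Sketch: map a user-supplied cadence string onto a cadence variant."""
    spec = spec.strip()
    if spec.lower() == "manual":
        return ("Manual",)
    if spec.startswith("event:"):
        return ("OnEvent", spec[len("event:"):])
    if spec.startswith("webhook:"):
        return ("Webhook", spec[len("webhook:"):])
    # Anything else is treated as a cron expression, e.g. "0 9 * * *".
    return ("Cron", spec)

daily = parse_cadence("0 9 * * *")
```

Treating "anything else" as a cron expression keeps the common case (`cadence="0 9 * * *"` in mission_create) free of any prefix.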
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add E2E mission flow tests — 7 new tests

Comprehensive mission lifecycle tests:
- fire_mission_builds_meta_prompt_with_goal: verifies the thread is spawned with project context and recorded in history
- outcome_processing_extracts_next_focus: "Next focus: X" in the FINAL() response → mission.current_focus updated
- outcome_processing_detects_goal_achieved: "Goal achieved: yes" → mission status transitions to Completed
- mission_evolves_via_direct_outcome_processing: 3-step evolution: step 1 sets focus to "db module", step 2 evolves to "tools module", step 3 detects goal achieved → mission completes. Tests the full learning loop without background-task timing dependencies.
- fire_with_trigger_payload: webhook payload stored on the mission and threads_today counter incremented
- daily_budget_enforced: max_threads_per_day=1 → first fire succeeds, second returns None

157 tests passing (151 prior + 6 new mission E2E).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): self-improving engine via Mission system

Wire the self-improvement loop as a Mission with OnSystemEvent cadence, inspired by karpathy/autoresearch's program.md approach. The mission fires when threads complete with issues, receives trace data as its trigger payload, and uses tools directly to diagnose and fix problems.
Key changes:

Engine self-improvement (Phases A+B from the design doc):
- Add fire_on_system_event() to MissionManager for the OnSystemEvent cadence
- Add start_event_listener() that subscribes to thread events and fires matching missions when non-Mission threads complete with trace issues
- Add ensure_self_improvement_mission() with an autoresearch-style goal prompt (concrete loop steps, not vague instructions)
- Add process_self_improvement_output() for structured JSON fallback
- Seed the fix pattern database with 8 known patterns from debugging
- Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now async + Store-aware, appends learned rules from prompt_overlay docs)
- Pass Store to ExecutionLoop for overlay loading

Bridge review fixes (P1/P2):
- Scope engine v2 SSE events to the requesting user (broadcast_for_user)
- Per-user pending approvals via HashMap instead of a global Option
- Reset the tool-call limit counter before each thread execution
- Only persist auto-approval when the user chose "always", not a one-off "yes"
- Remove dead store/mission_manager fields from EngineState

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add checkpoint-based engine thread recovery

* feat(engine): add Python orchestrator module and host functions

Add the orchestrator infrastructure for replacing the Rust execution loop with versioned Python code. This commit adds the module and host functions without switching over — the existing Rust loop is unchanged.
New files:
- orchestrator/default.py: v0 Python orchestrator (run_loop + helpers)
- executor/orchestrator.rs: host function dispatch, orchestrator loading from Store with version selection, OrchestratorResult parsing

Host functions exposed to orchestrator Python via Monty suspension: __llm_complete__, __execute_code_step__ (nested Monty VM), __execute_action__, __check_signals__, __emit_event__, __add_message__, __save_checkpoint__, __transition_to__, __retrieve_docs__, __check_budget__, __get_actions__

Also makes json_to_monty, monty_to_json, monty_to_string pub(crate) in scripting.rs for cross-module use.

Design doc: docs/plans/2026-03-25-python-orchestrator.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): switch ExecutionLoop::run() to Python orchestrator

Replace the 900-line Rust execution loop with a ~80-line bootstrap that loads and runs the versioned Python orchestrator via the Monty VM. The orchestrator Python code (orchestrator/default.py) is the v0 compiled-in version. Runtime versions can override it via MemoryDoc storage (orchestrator:main with tag orchestrator_code).

Key fixes during switchover:
- Use ExtFunctionResult::NotFound for unknown functions so Monty falls through to Python-defined functions (extract_final, etc.)
- Move helper function definitions above run_loop for Monty scoping
- Use the FINAL result value (not the VM return value) in the Complete handler
- Rename the 'final' variable to 'final_answer' to avoid the Python keyword

Status: 171/177 tests pass. The 6 remaining failures are step_count and token-tracking bookkeeping — the orchestrator manages these internally but doesn't yet update the thread's counters via host functions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): all 177 tests pass with Python orchestrator

- Increment step_count and track tokens in __emit_event__("step_completed") so thread bookkeeping matches the old Rust loop behavior
- Remove double-counting of tokens in the bootstrap (the orchestrator handles it)
- Match nudge text to the existing TOOL_INTENT_NUDGE constant
- Fix FINAL result propagation (use the stored final_result, not the VM return)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): orchestrator versioning, auto-rollback, and tests

Add a version lifecycle for the Python orchestrator:
- Failure tracking via MemoryDoc (orchestrator:failures)
- Auto-rollback: after 3 consecutive failures, skip the latest version and fall back to the previous one (or the compiled-in v0)
- Success resets the failure counter
- OrchestratorRollback event for observability

Update the self-improvement Mission goal with Level 1.5 instructions for orchestrator patches — the agent can now modify the execution loop itself via memory_write with versioned orchestrator docs.

12 new tests: version selection (highest wins), rollback after failures, rollback to default, failure counting/resetting, outcome parsing for all 5 ThreadOutcome variants.

189 tests pass, zero clippy warnings.
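The "highest version wins, unless it keeps failing" selection policy can be sketched as below. Data shapes are illustrative, not the MemoryDoc schema; only the rules (highest wins, 3 consecutive failures trigger a skip, fall back to the compiled-in v0) come from the commit.

```python
def select_orchestrator(versions, failures, threshold=3):
    """Sketch: pick the highest version whose consecutive-failure count is
    below the threshold; fall back to the compiled-in v0 if all are skipped."""
    for version in sorted(versions, reverse=True):
        if failures.get(version, 0) < threshold:
            return version
    return "default"  # compiled-in v0 orchestrator

# v2 has failed 3 times in a row, so selection rolls back to v1.
chosen = select_orchestrator(versions=[1, 2], failures={2: 3})
```

Because success resets the failure counter, a version that recovers (e.g. after a fix lands) becomes eligible again without any manual intervention.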
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 architecture, self-improvement, and dev history

Three new docs for contributors:
- engine-v2-architecture.md: two-layer architecture (Rust kernel + Python orchestrator), five primitives, execution model with nested Monty VMs, bridge layer, memory/reflection, missions, capabilities
- self-improvement.md: improvement levels (prompt/orchestrator/config/code), autoresearch-inspired Mission loop, versioned orchestrator with auto-rollback, fix pattern database, safety model
- development-history.md: summary of the 6 Claude Code sessions that built the system, key design decisions and debugging moments, architecture evolution from the 900-line Rust loop to the Python orchestrator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): complete v2 side-by-side integration with gateway API

Wire engine v2 into the full submission pipeline and expose threads, projects, and missions through the web gateway REST API.

Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear submissions to engine v2 when ENGINE_V2=true. Previously only UserInput and ApprovalResponse were handled; all other control commands fell through to disconnected v1 sessions.

Bridge query layer — add 11 read-only query functions and 6 DTO types so gateway handlers can inspect engine state (threads, steps, events, projects, missions) without direct access to the EngineState singleton.

Gateway endpoints — new /api/engine/* routes:
    GET /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events
    GET /projects, /projects/{id}
    GET /missions, /missions/{id}
    POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume

SSE events — add ThreadStateChanged, ChildThreadSpawned, and MissionThreadSpawned AppEvent variants. Expand the bridge event mapper to forward StateChanged and ChildSpawned engine events to the browser.
Engine crate — add ConversationManager::clear_conversation() for /new and /clear commands. Code quality — replace 10 .expect() calls with proper error returns, remove dead AgentConfig.engine_v2 field, log silent init errors, fix duplicate doc comment, improve fallthrough documentation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): empty call_id on ActionResult and trace analyzer false positives Fix structured executor not stamping call_id onto ActionResult — the EffectExecutor trait doesn't receive call_id, so the structured executor must copy it from the original ActionCall after execution. Empty call_id caused OpenAI-compatible providers to reject the next LLM request with "Invalid 'input[2].call_id': empty string". Fix trace analyzer false positives: - code_error check now only scans User-role code output messages (prefixed with [stdout]/[stderr]/[code ]/Traceback), not the System prompt, which contains example error text - missing_tool_output check now recognizes ActionResult messages as valid tool output (Tier 0 structured path) - Add NotImplementedError to detected code error patterns New trace checks: - empty_call_id: detect ActionResult messages with missing/empty call_id before they reach the LLM API (severity: Error) - llm_error: extract LLM provider errors from Failed state reason - orchestrator_error: extract orchestrator errors from Failed state Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(web): add Missions tab to gateway UI Add a full Missions page to the web gateway with list view, detail view, and action buttons (Fire, Pause, Resume). Backend: add /api/engine/missions/summary endpoint returning counts by status (active/paused/completed/failed). 
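The call_id fix above amounts to stamping the id from the originating call onto the result after execution. A minimal sketch with simplified stand-in types (the real `ActionCall`/`ActionResult` carry more fields):

```rust
/// Simplified stand-ins; the real engine types carry more fields.
#[derive(Debug, Clone)]
struct ActionCall {
    call_id: String,
    tool: String,
}

#[derive(Debug, Clone, Default)]
struct ActionResult {
    call_id: String, // empty: the EffectExecutor never saw the id
    output: String,
}

/// The EffectExecutor trait doesn't receive call_id, so the structured
/// executor copies it from the originating ActionCall after execution.
/// Without this, OpenAI-compatible providers reject the next request
/// ("Invalid 'input[2].call_id': empty string").
fn stamp_call_id(call: &ActionCall, mut result: ActionResult) -> ActionResult {
    if result.call_id.is_empty() {
        result.call_id = call.call_id.clone();
    }
    result
}

fn main() {
    let call = ActionCall { call_id: "call_42".into(), tool: "web_search".into() };
    let raw = ActionResult { call_id: String::new(), output: "ok".into() };
    let stamped = stamp_call_id(&call, raw);
    assert_eq!(stamped.call_id, "call_42");
    println!("{} result: {}", call.tool, stamped.output);
}
```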
Frontend: - New "Missions" tab between Jobs and Routines - Summary cards showing mission counts by status - Table with name, goal, cadence type, thread count, status, actions - Detail view with goal, cadence, current focus, success criteria, approach history, spawned thread list, and action buttons - Fire/Pause/Resume actions with toast notifications - i18n support (English + Chinese) - CSS following the existing routines/jobs patterns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): eagerly initialize engine v2 at startup The gateway API endpoints (/api/engine/missions, etc.) call bridge query functions that return empty results when the engine state hasn't been initialized yet. Previously, initialization only happened lazily on the first chat message via handle_with_engine(). Now when ENGINE_V2=true, the engine is initialized in Agent::run() before channels start, so the self-improvement mission and other engine state is available to gateway API endpoints immediately. Also rename get_or_init_engine → init_engine and make it public so it can be called from agent_loop.rs at startup. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(web): improve mission detail with markdown goal and thread table - Goal rendered as full-width markdown block instead of plain-text meta item (uses existing renderMarkdown/marked) - Current focus and success criteria also rendered as markdown - Spawned threads shown as a clickable table with goal, type, state, steps, tokens, and created date instead of a UUID list - Clicking a thread row opens an inline thread detail view showing metadata grid and full message history with markdown rendering - Back button returns to the mission detail view - Backend: mission detail now returns full thread summaries (goal, state, step_count, tokens) instead of just thread IDs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(web): close SSE connections on page unload to prevent connection starvation The browser limits concurrent HTTP/1.1 connections per origin to 6. Without cleanup, SSE connections from prior page loads linger after refresh/navigation, eating into the pool. After 2-3 refreshes, all 6 slots are consumed by stale SSE streams and new API fetch calls queue indefinitely — the UI shows "connected" (SSE works) but data never loads. Add a beforeunload handler that closes both eventSource (chat events) and logEventSource (log stream) so the browser can reuse connections immediately on page reload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(web): support multiple gateway tabs by reducing SSE connections Each browser tab opened 2 SSE connections (chat events + log events). With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the pool and couldn't load any data. Three changes: 1. Lazy log SSE — only connect when the logs tab is active, disconnect when switching away. Most users rarely view logs, so this saves a connection slot per tab. 2. 
Visibility API — close SSE when the browser tab goes to background (user switches to another tab), reconnect when it becomes visible. Background tabs don't need real-time events. 3. Combined with the existing beforeunload cleanup, this means: - Active foreground tab: 1 connection (chat SSE only, +1 if logs tab) - Background tabs: 0 connections - Closed/refreshed tabs: 0 connections (beforeunload cleanup) This allows many gateway tabs to coexist within the 6-connection limit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): route messages to correct conversation by thread scope Messages sent from a new conversation in the gateway always appeared in the default assistant conversation because handle_with_engine ignored the thread_id from the frontend. Two fixes: 1. Engine conversation scoping — when the message carries a thread_id (from the frontend's conversation picker), use it as part of the engine conversation key: "gateway:<thread_id>" instead of just "gateway". This creates a distinct engine conversation per v1 thread, so messages don't cross-contaminate. 2. V1 dual-write targeting — write user messages and assistant responses to the v1 conversation matching the thread_id (via ensure_conversation), not the hardcoded assistant conversation. Falls back to the assistant conversation when no thread_id is present (e.g., default chat). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(web): richer activity indicators for engine v2 execution The gateway UI showed only generic "Thinking..." during engine v2 execution with no visibility into CodeAct code execution, tool calls, or reflection. Now the event mapping produces detailed status updates: Step lifecycle: - "Calling LLM..." when a step starts (was "…
* refactor: extract AppEvent to crates/ironclaw_common SseEvent was defined in src/channels/web/types.rs but imported by 12+ modules across agent, orchestrator, worker, tools, and extensions — it had become the application-wide event protocol, not a web transport concern. Create crates/ironclaw_common as a shared workspace crate and move the enum there as AppEvent. Also move the truncate_preview utility which was similarly leaked from the web gateway into agent modules. - New crate: crates/ironclaw_common (AppEvent, truncate_preview) - Rename SseEvent → AppEvent, from_sse_event → from_app_event - web/types.rs re-exports AppEvent for internal gateway use - web/util.rs re-exports truncate_preview - Wire format unchanged (serde renames are on variants, not the enum) Aligned with the event bus direction on refactor/architectural-hardening where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: add AppEvent::event_type() helper, deduplicate match blocks Address Gemini review: extract the variant→string match into a single method on AppEvent, replacing the duplicated 22-arm matches in sse.rs and types.rs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: rename leftover sse vars/tests to match AppEvent rename Address Copilot review: rename sse_event vars to app_event in orchestrator/api.rs and ws.rs, rename test functions from test_ws_server_from_sse_* to test_ws_server_from_app_event_*, and update stale SSE comments. 
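The deduplicated `event_type()` helper looks roughly like this — a reduced three-variant stand-in, since the real `AppEvent` has ~22 variants with serde renames on each variant (the wire format) rather than on the enum:

```rust
/// Reduced stand-in for AppEvent; the real enum lives in
/// crates/ironclaw_common and has ~22 variants.
#[derive(Debug, Clone)]
enum AppEvent {
    StatusUpdate { text: String },
    ToolCall { name: String },
    ThreadStateChanged { thread_id: String },
}

impl AppEvent {
    /// Single source of truth for the variant → "type" string mapping,
    /// replacing the duplicated 22-arm matches in sse.rs and types.rs.
    fn event_type(&self) -> &'static str {
        match self {
            AppEvent::StatusUpdate { .. } => "status_update",
            AppEvent::ToolCall { .. } => "tool_call",
            AppEvent::ThreadStateChanged { .. } => "thread_state_changed",
        }
    }
}

fn main() {
    let events = [
        AppEvent::StatusUpdate { text: "Thinking...".into() },
        AppEvent::ToolCall { name: "web_search".into() },
        AppEvent::ThreadStateChanged { thread_id: "t-1".into() },
    ];
    let types: Vec<&str> = events.iter().map(|e| e.event_type()).collect();
    assert_eq!(types, ["status_update", "tool_call", "thread_state_changed"]);
    println!("{types:?}");
}
```

Keeping the mapping in one method is what makes the later round-trip test possible: any drift between a serde rename and this match shows up as a single failing assertion.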
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: add Deserialize to AppEvent, round-trip test, fix stale comments Address zmanian review: - Add Deserialize derive to AppEvent so downstream consumers can deserialize incoming events - Add event_type_matches_serde_type_field test that round-trips every variant through serde and asserts event_type() matches the serialized "type" field — catches drift between serde renames and the manual match - Add round_trip_deserialize test for basic Serialize/Deserialize parity - Update remaining "SSE" references in comments across server.rs, manager.rs, ws_gateway_integration.rs, and worker/job.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…architecture) (nearai#1557) * v2 architecture phase 1 * feat(engine): Phase 2 — execution loop, capability system, thread runtime Add the core execution engine to ironclaw_engine crate: - CapabilityRegistry: register/get/list capabilities and actions - LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire) - PolicyEngine: deterministic effect-level allow/deny/approve - ThreadTree: parent-child relationship tracking - ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc - ThreadManager: spawn threads as tokio tasks, stop, inject messages, join - ExecutionLoop: core loop replacing run_agentic_loop() with signals, context building, LLM calls, action execution, and event recording - Structured executor (Tier 0): lease lookup → policy check → effect execution - Tool intent nudge detection - MemoryStore + RetrievalEngine stubs for Phase 4 - Full 8-phase architecture plan in docs/plans/ - CLAUDE.md spec for the engine crate 74 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): Phase 3 — Monty Python executor with RLM pattern Add CodeAct execution (Tier 1) using the Monty embedded Python interpreter, following the Recursive Language Model (RLM) pattern from arXiv:2512.24601. 
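The PolicyEngine's deterministic effect-level decision can be sketched as a pure function. `Effect` and `Decision` here are illustrative stand-ins for the engine's own types, and the mapping below is an example policy, not the shipped one:

```rust
/// Illustrative policy outcome; the real engine has its own types.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    RequireApproval,
    Deny,
}

/// Illustrative effect classification for an action.
#[derive(Debug)]
enum Effect {
    ReadOnly,
    Network,
    FileWrite,
    Shell,
}

/// Deterministic: no LLM in the loop, same input always yields the
/// same answer, so policy outcomes are testable and auditable.
fn decide(effect: &Effect) -> Decision {
    match effect {
        Effect::ReadOnly => Decision::Allow,
        Effect::Network => Decision::Allow,
        Effect::FileWrite => Decision::RequireApproval,
        Effect::Shell => Decision::Deny,
    }
}

fn main() {
    assert_eq!(decide(&Effect::ReadOnly), Decision::Allow);
    assert_eq!(decide(&Effect::Network), Decision::Allow);
    assert_eq!(decide(&Effect::FileWrite), Decision::RequireApproval);
    assert_eq!(decide(&Effect::Shell), Decision::Deny);
    println!("policy sketch ok");
}
```

In the actual flow this check sits between lease lookup and effect execution in the Tier 0 structured executor.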
Key additions: - executor/scripting.rs: Monty integration with FunctionCall-based tool dispatch, catch_unwind panic safety, resource limits (30s, 64MB, 1M allocs) - LlmResponse::Code variant + ExecutionTier::Scripting - Context-as-variables (RLM 3.4): thread messages, goal, step_number, previous_results injected as Python variables — LLM context stays lean while code accesses data selectively - llm_query(prompt, context) (RLM 3.5): recursive subagent calls from within Python code — results stored as variables, not injected into parent's attention window (symbolic composition) - Compact output metadata between code steps instead of full stdout - MontyObject ↔ serde_json::Value bidirectional conversion - Updated architecture plan with RLM design principles 74 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): RLM best-practices enhancements from cross-reference analysis Cross-referenced our implementation against the official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation. Key enhancements: - FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching all three reference implementations. Code can signal completion at any point, not just via return value. - llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn, matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch. - Output truncation increased to 8000 chars (from 120), matching Prime Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT]. - Step 0 orientation preamble: auto-injects context metadata (message count, total chars, goal, last user message preview) before first code step, matching fast-rlm's auto-print pattern. 
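The step-0 orientation preamble is plain string assembly over context metadata (message count, total chars, goal, last-user-message preview). A sketch with assumed names — `orientation_preamble` is not the real function:

```rust
/// Assemble the step-0 orientation preamble injected before the first
/// code step, per the auto-print pattern borrowed from fast-rlm.
/// Function and parameter names are assumed for illustration.
fn orientation_preamble(
    message_count: usize,
    total_chars: usize,
    goal: &str,
    last_user: &str,
) -> String {
    // Char-safe preview: never slice external text by byte index.
    let preview: String = last_user.chars().take(80).collect();
    format!(
        "Context: {message_count} messages, {total_chars} chars total.\n\
         Goal: {goal}\n\
         Last user message (preview): {preview}"
    )
}

fn main() {
    let p = orientation_preamble(
        12,
        3400,
        "triage open issues",
        "what are the latest 10 issues?",
    );
    assert!(p.contains("12 messages"));
    assert!(p.contains("Goal: triage open issues"));
    assert!(p.ends_with("what are the latest 10 issues?"));
    println!("{p}");
}
```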
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors, OS errors, and async errors now flow back as stdout content instead of terminating the step, enabling LLM self-correction on next iteration. Only VM panics (catch_unwind) terminate as EngineError. 74 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): update architecture plan with RLM cross-reference learnings Comprehensive update after cross-referencing against official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect (verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM. Changes: - Mark Phases 1-3 as DONE with commit refs and test counts - Add "Key Influences" section documenting all reference implementations - Phase 3: full table of implemented RLM features with sources - Phase 3: "Remaining gaps" table with which phase addresses each - Phase 4: expanded with compaction (85% context), rlm_query() (full recursive sub-agent), dual model routing, budget controls (USD, timeout, tokens, consecutive errors), lazy loading, pass-by-reference - Add "RLM Execution Model" cross-cutting section - Add "Implementation Progress" tracking table - Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): Phase 4 — budget controls, compaction, reflection pipeline Budget enforcement in ExecutionLoop: - max_tokens_total: cumulative token limit, checked before each iteration - max_duration: wall-clock timeout for entire thread - max_consecutive_errors: consecutive error steps threshold (resets on success, matching official RLM behavior) - All produce ThreadOutcome::Failed with descriptive messages Context compaction (from RLM paper, 85% threshold): - estimate_tokens(): char-based estimation (chars/4, matching RLM) - should_compact(): triggers when tokens >= threshold_pct * context_limit - compact_messages(): asks LLM to 
summarize progress, replaces history with [system, summary, continuation_note], preserves intermediate results - Configurable via ThreadConfig: model_context_limit, compaction_threshold Dual model routing: - LlmCallConfig gains depth field (0=root, 1+=sub-call) - Implementations can route to cheaper models for sub-calls - ExecutionLoop passes thread depth to every LLM call Reflection pipeline (reflection/pipeline.rs): - reflect(thread, llm): analyzes completed thread via LLM - Produces Summary doc (always), Lesson doc (if errors), Issue doc (if failed) - Builds transcript from thread messages + error events - Returns ReflectionResult with docs + token usage ThreadConfig extended with: max_tokens_total, max_consecutive_errors, model_context_limit, enable_compaction, compaction_threshold, depth, max_depth. 78 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): Phase 5 — conversation surface separated from execution Conversation is now a UI layer, not an execution boundary. Multiple threads can run concurrently within one conversation; threads can outlive their originating conversation. New types (types/conversation.rs): - ConversationSurface: channel + user + entries + active_threads - ConversationEntry: sender (User/Agent/System) + content + origin_thread_id - ConversationId, EntryId (UUID newtypes) - EntrySender enum (User, Agent{thread_id}, System) ConversationManager (runtime/conversation.rs): - get_or_create_conversation(channel, user) — indexed by (channel, user) - handle_user_message() — injects into active foreground thread or spawns new - record_thread_outcome() — adds agent/system entries, untracks completed threads - get_conversation(), list_conversations() This enables the key architectural insight: a user can ask "what's the weather?" while a deployment thread is still running. Both produce entries in the same conversation. 85 tests passing, zero clippy warnings. 
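The compaction trigger described for Phase 4 follows directly from two tiny functions: the chars/4 estimator and the 85% threshold check. A sketch under assumed names:

```rust
/// Char-based token estimation: chars / 4, matching the RLM heuristic.
/// Crude, but monotonic — which is all the trigger needs.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Compaction fires once estimated usage crosses `threshold_pct` of the
/// model context limit (0.85 per the description above). Both values
/// come from ThreadConfig (model_context_limit, compaction_threshold).
fn should_compact(used_tokens: usize, context_limit: usize, threshold_pct: f64) -> bool {
    used_tokens as f64 >= threshold_pct * context_limit as f64
}

fn main() {
    assert_eq!(estimate_tokens("abcdefgh"), 2);
    // 0.85 * 128_000 = 108_800 tokens.
    assert!(!should_compact(100_000, 128_000, 0.85));
    assert!(should_compact(110_000, 128_000, 0.85));
    println!("compaction sketch ok");
}
```

When the trigger fires, the loop asks the LLM for a progress summary and replaces the history with [system, summary, continuation_note].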
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM Restructure phases 6-8 to clarify execution model: - Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker Python runtimes for LLM-generated code. - WASM sandbox is for third-party tool isolation (existing infra, Phase 8) - Docker containers are for thread-level isolation of high-risk work (Phase 8) - Two-phase commit moves to Phase 6 (integration) at the adapter boundary Phase renumbering: - Old Phase 6 (Tier 2-3) → removed as separate phase - Old Phase 7 (integration) → Phase 6 - Old Phase 8 (cleanup) → Phase 7 - New Phase 8: WASM tools + Docker thread isolation (infra integration) Updated progress table: Phases 1-5 marked DONE with test counts and commits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): Phase 6 — bridge adapters for main crate integration Strategy C parallel deployment: when ENGINE_V2=true env var is set, user messages route through the engine instead of the existing agentic loop. All existing behavior is unchanged when the flag is off. Bridge module (src/bridge/): - LlmBridgeAdapter: wraps LlmProvider as engine LlmBackend, converts ThreadMessage↔ChatMessage, ActionDef↔ToolDefinition, depth-based model routing (primary vs cheap_llm) - EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as EffectExecutor, routes tool calls through existing execute_tool_with_safety pipeline - InMemoryStore: HashMap-backed Store impl (no DB tables needed yet) - EngineRouter: is_engine_v2_enabled() + handle_with_engine() that builds engine from Agent deps and processes messages end-to-end Integration touchpoint (4 lines in agent_loop.rs): After hook processing, before session resolution, check ENGINE_V2 flag and route UserInput through the engine path. Accessor visibility widened: llm(), cheap_llm(), safety(), tools() changed from pub(super) to pub(crate) for bridge access. 
85 engine tests + main crate clippy clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): add user message and system prompt to thread before execution The ExecutionLoop was sending empty messages to the LLM because the thread was spawned with the user's input as the goal but no messages. Fixes: - ThreadManager.spawn_thread() now adds the goal as an initial user message before starting the execution loop - ExecutionLoop.run() injects a default system prompt if none exists Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): match existing LLM request format to prevent 400 errors The LLM bridge was missing several defaults that the existing Reasoning.respond_with_tools() sets: - tool_choice: "auto" when tools are present (required by some providers) - max_tokens: 4096 (default) - temperature: 0.7 (default) - When no tools (force_text): use plain complete() instead of complete_with_tools() with empty tools array — matches existing no-tools fallback path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): persist conversation context across messages The engine was creating a fresh ThreadManager and InMemoryStore per message, losing all context between turns. A follow-up question like "what are the latest 10 issues?" had no memory of the prior "how many issues" response. 
Fixes: - EngineState (ThreadManager, ConversationManager, InMemoryStore) now persists across messages via OnceLock, initialized on first use - ConversationManager builds message history from prior conversation entries (user messages + agent responses) and passes it to new threads - ThreadManager.spawn_thread_with_history() accepts initial_messages that are prepended before the current user message - System notifications (thread started/completed) are filtered out of the history (not useful as LLM context) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): enable CodeAct/RLM mode with code block detection The engine now operates in CodeAct/RLM mode: System prompt (executor/prompt.rs): - Instructs LLM to write Python in ```repl fenced blocks - Documents available tools as callable Python functions - Documents llm_query(), llm_query_batched(), FINAL() - Documents context variables (context, goal, step_number, previous_results) - Strategy guidance: examine context, break into steps, use tools, call FINAL() Code block detection (bridge/llm_adapter.rs): - extract_code_block() scans LLM text responses for ```repl or ```python blocks - When detected, returns LlmResponse::Code instead of LlmResponse::Text - The ExecutionLoop routes Code responses through Monty for execution No structured tool definitions sent to LLM: - Tools are described in the system prompt as Python functions - The LLM call sends empty actions array, forcing text-mode responses - This ensures the LLM writes code blocks (CodeAct) instead of structured tool calls (which would bypass the REPL) 85 tests passing, zero clippy warnings. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(engine): add 8 CodeAct/RLM E2E tests with mock LLM Comprehensive test coverage for the Monty Python execution path: - codeact_simple_final: Python code calls FINAL('answer') → thread completes - codeact_tool_call_then_final: code calls test_tool() → FunctionCall suspends VM → MockEffects returns result → code resumes → FINAL() - codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15') with no tool calls — pure Python in Monty - codeact_multi_step: first step prints output (no FINAL), second step sees output metadata and calls FINAL — tests iterative REPL flow - codeact_error_recovery: first step has NameError → error flows to LLM as stdout → second step recovers with FINAL — tests error transparency - codeact_context_variables_available: code accesses `goal` and `context` variables injected by the RLM context builder - codeact_multiple_tool_calls_in_loop: for loop calls test_tool() 3 times → 3 FunctionCall suspensions → all results collected → FINAL - codeact_llm_query_recursive: code calls llm_query('prompt') → VM suspends → MockLlm provides sub-agent response → result returned as Python string variable 93 tests passing (85 prior + 8 new), zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): detect code blocks in plain completion path + multi-block support Two bugs fixed: 1. The no-tools completion path (used by CodeAct since we send empty actions) returned LlmResponse::Text without checking for code blocks. Code blocks were rendered as markdown text instead of being executed. 2. 
extract_code_block now: - Handles bare ``` fences (skips non-Python languages) - Collects ALL code blocks in the response and concatenates them (models often split code across multiple blocks with explanation) - Tries markers in order: ```repl, ```python, ```py, then bare ``` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(bridge): add 11 regression tests for code block extraction Covers the exact failure modes discovered during live testing: - extract_repl_block: standard ```repl fenced block - extract_python_block: ```python marker - extract_py_block: ```py shorthand - extract_bare_backtick_block: bare ``` with Python content - skip_non_python_language: ```json should NOT be extracted - no_code_blocks_returns_none: plain text, no fences - multiple_code_blocks_concatenated: two ```repl blocks with explanation between them → concatenated with \n\n - mixed_thinking_and_code: model outputs explanation + two ```python blocks (the Hyperliquid case) → both extracted - repl_preferred_over_bare: ```repl takes priority over bare ``` - empty_code_block_skipped: empty fenced block returns None - unclosed_block_returns_none: no closing ``` returns None Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): detect FINAL() in text responses + regression tests Models sometimes write FINAL() outside code blocks — as plain text after an explanation. The Hyperliquid case: model outputs a long analysis then FINAL("""...""") at the end, not inside ```repl fences. 
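The extraction rules above (marker priority ```repl > ```python > ```py > bare ```, skip non-Python language tags, concatenate multiple blocks, reject empty or unclosed fences) can be sketched with a simple fence splitter. This is an approximation of `extract_code_block`, not the shipped parser:

```rust
/// Approximate sketch of multi-block code extraction. Splitting on
/// "```" makes segments alternate prose / fence-body; an unclosed
/// trailing fence leaves an odd body that is ignored.
fn extract_code_blocks(text: &str) -> Option<String> {
    let segments: Vec<&str> = text.split("```").collect();
    let closed_pairs = (segments.len() - 1) / 2;
    // Buckets in priority order: repl, python, py, bare fence.
    let mut buckets: [Vec<String>; 4] = Default::default();
    for i in 0..closed_pairs {
        let body = segments[1 + 2 * i];
        let mut lines = body.lines();
        // First line of a fence body is the info string (language tag).
        let tag = lines.next().unwrap_or("").trim();
        let code = lines.collect::<Vec<_>>().join("\n");
        if code.trim().is_empty() {
            continue; // empty fenced block — nothing to run
        }
        let idx = match tag {
            "repl" => 0,
            "python" => 1,
            "py" => 2,
            "" => 3,      // bare ``` fence: assume Python
            _ => continue, // ```json etc. — not executable Python
        };
        buckets[idx].push(code);
    }
    // First non-empty bucket wins; blocks are concatenated because
    // models often split code across fences with prose in between.
    buckets.iter().find(|b| !b.is_empty()).map(|b| b.join("\n\n"))
}

fn main() {
    let reply = "thoughts\n```python\nx = 1\n```\nmore\n```python\ny = 2\n```\n";
    assert_eq!(extract_code_blocks(reply).as_deref(), Some("x = 1\n\ny = 2"));
    assert_eq!(extract_code_blocks("```\nz = 3\n```").as_deref(), Some("z = 3"));
    assert!(extract_code_blocks("```json\n{}\n```").is_none());
    assert!(extract_code_blocks("no fences here").is_none());
    println!("extraction sketch ok");
}
```

The same split-based approach also handles the "mixed thinking and code" case from the regression tests: prose between two ```python blocks simply lands in the ignored prose segments.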
Fixes: - extract_final_from_text(): regex-based FINAL detection in text responses, matching the official RLM's find_final_answer() fallback - Handles: double-quoted, single-quoted, triple-quoted, unquoted, nested parens - Checked in LlmResponse::Text handler BEFORE tool intent nudge (FINAL takes priority) 9 new tests: - codeact_final_in_text_response: FINAL("answer") in plain text - codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text - final_double_quoted, final_single_quoted, final_triple_quoted, final_unquoted, final_with_nested_parens, final_after_long_text, no_final_returns_none 102 tests passing (93 + 9 new), zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add crate extraction & cleanup roadmap Documents architectural recommendations from the engine v2 design process for future reference: - Root directory consolidation (channels-src + tools-src → extensions/) - Crate extraction tiers: zero-coupling (estimation, observability, tunnel), trivial-coupling (document_extraction, pairing, hooks), medium-coupling (secrets, MCP, db, workspace, llm, skills), heavy-coupling (web gateway, agent, extensions) - src/ module reorganization into logical groups (core, persistence, infra, media, support) - main.rs/app.rs slimming targets (100/500 lines after migration) - WASM module candidates (document_extraction) and non-candidates (REPL, web gateway → separate crates instead) - Priority ordering for extraction work - Tracks completed items (ironclaw_safety, ironclaw_engine, transcription move) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): live progress status updates via event broadcast Engine v2 now shows live progress in the CLI (and any channel): - "Thinking..." when a step starts - Tool name + success/error when actions execute - "Processing results..." 
when a step completes Implementation: - ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256) - ExecutionLoop.emit_event() writes to thread.events AND broadcasts - ThreadManager.subscribe_events() returns a receiver - Router uses tokio::select! to listen for events while waiting for thread completion, forwarding them as StatusUpdate to the channel This replaces the polling approach with zero-latency event streaming. Agent.channels visibility widened to pub(crate) for bridge access. 102 tests passing, zero clippy warnings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): include tool results in code step output for LLM context The LLM was ignoring tool results and answering from training data because the compact output metadata didn't include what tools returned. Tool results lived only as ActionResult messages (role: Tool) which some providers flatten or the model ignores. Now the code step output includes: - stdout from Python print() statements - [tool_name result] with the actual output (truncated to 4K per tool) - [tool_name error] for failed tools - [return] for the code's return value - Total output truncated to 8K chars to prevent context bloat This ensures the model sees web_search results, API responses, etc. in the next iteration and can reason about them instead of hallucinating. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): add debug/trace logging for CodeAct execution Three verbosity levels for debugging the engine: RUST_LOG=ironclaw_engine=debug: - LLM call: message count, iteration, force_text - LLM response: type (text/code/action_calls), token usage - Code execution: code length, action count, had_error, final_answer - Text response: length, FINAL() detection RUST_LOG=ironclaw_engine=trace: - Full message list sent to LLM (role, length, first 200 chars each) - Full code block being executed - stdout preview (first 500 chars) - Per-tool results (name, success, first 300 chars of output) - Text response preview (first 500 chars) Usage: ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): execution trace recording + retrospective analysis Enable with ENGINE_V2_TRACE=1 to get full execution traces and automatic issue detection after each thread completes. Trace recording (executor/trace.rs): - build_trace(): captures full thread state — messages (with full content), events, step count, token usage, detected issues - write_trace(): writes JSON to engine_trace_{timestamp}.json - log_trace_summary(): logs summary + issues at info/warn level Retrospective analyzer detects 8 issue categories: - thread_failure: thread ended in Failed state - no_response: no assistant message generated - tool_error: specific tool failures with error details - code_error: Python errors (NameError, SyntaxError, etc.) in output - missing_tool_output: tool results exist but not in system messages - excessive_steps: >10 steps (may be stuck in loop) - no_tools_used: single-step answer without tools (hallucination risk) - mixed_mode: text responses without code blocks (prompt not followed) Thread state now saved to store after execution completes (for trace access after join_thread). 
Usage: ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run # After each message: trace JSON + issue log in terminal Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): wire reflection pipeline + trace analysis into thread lifecycle After every thread completes, ThreadManager now automatically runs: 1. Retrospective trace analysis (non-LLM, always): - Detects 8 issue categories (tool errors, code errors, missing outputs, excessive steps, hallucination risk, etc.) - Logs issues at warn level when found 2. Trace file recording (when ENGINE_V2_TRACE=1): - Writes full JSON trace to engine_trace_{timestamp}.json 3. LLM reflection (when enable_reflection=true): - Calls reflection pipeline to produce Summary, Lesson, Issue docs - Saves docs to store for future context retrieval - Enabled by default in the bridge router All three run inside the spawned tokio task after exec.run() completes, before saving the final thread state. No external wiring needed. Removed duplicate trace recording from the router — it's now handled by ThreadManager automatically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): convert tool name hyphens to underscores for Python compatibility Root cause from trace analysis: the LLM writes `web_search()` (valid Python identifier) but the tool registry has `web-search` (with hyphen). The EffectBridgeAdapter couldn't find the tool → "Tool not found" error → model fabricated fake data instead. 
Fixes: - available_actions(): converts tool names from hyphens to underscores (web-search → web_search) so the system prompt lists valid Python names - execute_action(): tries the original name first, then falls back to hyphenated form (web_search → web-search) for tool registry lookup - Same conversion in router's capability registry builder Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): parse JSON tool output to prevent double-serialization From trace analysis: web_search returned a JSON string, which was wrapped as serde_json::json!(string) creating a Value::String containing JSON. When Monty got this as MontyObject::String, the Python code couldn't index it with result['title'] → TypeError. Fix: try parsing the tool output string as JSON first. If valid, use the parsed Value (becomes a Python dict/list). If not valid JSON, keep as string. This means web_search results are directly indexable in Python: results = web_search(query="...") print(results["results"][0]["title"]) # works now Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): persist variables across code steps via `state` dict Monty creates a fresh runtime per code step, so variables are lost between steps. This caused the model to re-paste tool results from system messages, wasting tokens. Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that accumulates across steps: - Tool results stored by tool name: state["web_search"] = {results...} - Return values stored: state["last_return"], state["step_0_return"] - Injected as a `state` Python variable in each new MontyRun Now the model can do: Step 1: results = web_search(query="...") # tool result saved in state Step 2: data = state["web_search"] # access previous result summary = llm_query("summarize", str(data)) FINAL(summary) System prompt updated to document the `state` variable. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): add state hint on code errors + retrieval engine integration When code fails with NameError/UnboundLocalError (model trying to access variables from a previous step), the error output now includes: [HINT] Variables don't persist between code blocks. Use the `state` dict to access data from previous steps. Available keys: ["web_search", "last_return"] This teaches the model to use `state["web_search"]` instead of `result` after a NameError, reducing wasted steps from 3-4 to 1. Also integrates RetrievalEngine into context building and ThreadManager: - build_step_context() now accepts optional RetrievalEngine to inject relevant memory docs (Lessons, Specs, Playbooks) into LLM context - RetrievalEngine uses keyword matching with doc-type priority scoring - Memory docs from reflection (Phase 4) now feed back into future threads Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: remove trace files and add to .gitignore Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): replace web_fetch example with web_search in CodeAct prompt The system prompt example used web_fetch(url="...") which doesn't exist as a tool. The model learned from the example and tried web_fetch, getting "Tool not found". Changed to web_search(query="...") which is an actual registered tool. Found via trace analysis — reflection pipeline correctly identified this as a "Tool Name Correction" spec doc. 
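The state hint from the first commit above could be sketched like this. The hint text follows the commit message; the function name and the `state_keys` parameter are assumptions for illustration:

```rust
// Append a state hint to code-step error output when the failure looks like
// the model reaching for a variable from a previous step.
pub fn with_state_hint(error_output: &str, state_keys: &[&str]) -> String {
    let is_scope_error = error_output.contains("NameError")
        || error_output.contains("UnboundLocalError");
    if is_scope_error && !state_keys.is_empty() {
        format!(
            "{error_output}\n[HINT] Variables don't persist between code blocks. \
             Use the `state` dict to access data from previous steps. \
             Available keys: {state_keys:?}"
        )
    } else {
        error_output.to_string()
    }
}

fn main() {
    let out = with_state_hint(
        "NameError: name 'result' is not defined",
        &["web_search", "last_return"],
    );
    assert!(out.contains("[HINT]"));
}
```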
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(engine): extract prompt templates to markdown files Prompt templates moved from inline Rust strings to plain markdown files at crates/ironclaw_engine/prompts/ for easy inspection and iteration: - prompts/codeact_preamble.md — main instructions, special functions, context variables, rules - prompts/codeact_postamble.md — strategy section Loaded at compile time via include_str!(), so no runtime file I/O. Edit the .md files and rebuild to iterate on prompts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): replace byte-index slicing with char-safe truncation Panic: 'byte index 80 is not a char boundary; it is inside ''' when tool output contained multi-byte UTF-8 characters (smart quotes from web search results). Fixed 4 unsafe byte-index slices: - thread.rs:281: message preview &content[..80] → chars().take(80) - loop_engine.rs:556: tool output &str[..4000] → chars().take(4000) - loop_engine.rs:579: output tail &str[len-8000..] → chars().skip() - scripting.rs:82: stdout tail &str[len-N..] → chars().skip() All now use .chars().take() or .chars().skip() which respect character boundaries. Follows CLAUDE.md rule: "Never use byte-index slicing on user-supplied or external strings." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): fix false positive missing_tool_output warning in trace analyzer The check was looking for "[" + "result]" in System-role messages only, but tool output metadata is added with patterns like "[shell result]" and may appear in messages with any role. Changed to scan all messages for " result]" or " error]" patterns regardless of role. 
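The char-safe truncation fix can be sketched with two std-only helpers (names are illustrative; the real code inlines the iterator chains at each call site):

```rust
// Byte slicing like &s[..80] panics when index 80 falls inside a multi-byte
// UTF-8 character (e.g. a smart quote). Iterating over chars never does.
pub fn head_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

// Keep the last n characters, the char-safe equivalent of &s[s.len() - n..].
pub fn tail_chars(s: &str, n: usize) -> String {
    let total = s.chars().count();
    s.chars().skip(total.saturating_sub(n)).collect()
}

fn main() {
    // "é" is two bytes in UTF-8, so &"héllo"[..2] would panic; this doesn't.
    assert_eq!(head_chars("héllo", 2), "hé");
    assert_eq!(tail_chars("héllo", 3), "llo");
}
```

Note the trade-off: `chars().count()` is O(n), which is fine for bounded log previews but worth knowing for very large outputs.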
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): update architecture plan with Phase 6 status and approval flow design Phase 6 updated to reflect what was actually built: - Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done - Integration touchpoint (4 lines in handle_message) — done - Live progress via broadcast events — done - Conversation persistence across messages — done - Trace recording + retrospective analysis — done - 8 bugs found and fixed via trace analysis — documented Phase 6 remaining work documented: - Approval flow: detailed 5-step design (send to channel, pause thread, route response, resume execution, always handling) with v1 reference - Database persistence (InMemoryStore → real DB tables) - Acceptance testing (TestRig + TraceLlm fixtures) - Two-phase commit for high-stakes effects Progress table updated: Phase 6 marked as DONE (partial), 134 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add self-improving engine design plan Designs a system where the engine debugs and improves itself, based on the pattern observed in the last session: 5 consecutive bug fixes all followed trace → read → identify → edit → test, using tools the engine already has access to. Three levels of self-improvement: - Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply. - Level 2 (Config): adjust defaults/mappings. Branch + test + PR. - Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR. Architecture: Self-improvement Mission spawns a Reflection thread that reads traces, reads source, proposes fixes, validates via cargo test, and either auto-applies (Level 1) or creates a PR (Level 2-3). Includes: fix pattern database (seeded from our 8 debugging session fixes), feedback loop diagram, safety model, implementation phases (A through D), and what exists vs what's new. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add engine v2 security model and audit Comprehensive security analysis of engine v2 covering: Threat model: 4 attacker profiles (malicious input, prompt injection via tools, poisoned memory, supply chain). Current state audit: 9 controls working (Monty sandbox, safety layer, policy engine, leases, provenance, events) and 9 gaps identified. Critical finding: ALL tools granted by default — CodeAct code can call shell, write_file, apply_patch without approval. Proposed fix: 3-tier tool classification (auto/approve-once/always-approve). CodeAct-specific threats: tool call amplification, prompt injection via search results, data exfiltration via tool chains, Monty escape. Self-improvement security: poisoned trace attacks, memory poisoning via reflection. Mitigations: edit validation, frequency caps, audit trail, auto-rollback, reflection output scanning. 6-layer security architecture proposed: input validation, capability gating, output sanitization, execution sandboxing, self-improvement controls, observability. Prioritized implementation plan with severity/effort ratings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(security): cross-reference v1 controls — use, don't reinvent Updated security plan with detailed audit of ALL existing v1 security controls and how they map to engine v2 bridge gaps: Key finding: v1 already has solutions for every security gap identified. 
The bridge just needs to wire them in: - Tool::requires_approval() exists but bridge doesn't call it - safety.wrap_for_llm() exists but tool results enter context unwrapped - RateLimiter exists but bridge doesn't check rate limits - BeforeToolCall hooks exist but bridge doesn't run them - redact_params() exists but bridge doesn't redact sensitive params - Shell risk classification (Low/Medium/High) is inherited but ignored Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter, not new security infrastructure. The bridge is the security boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy - Add Mission type and MissionManager for recurring thread scheduling - Add ReliabilityTracker for per-capability success/failure/latency tracking - Add reflection executor that spawns CodeAct threads for post-completion reflection - Extend PolicyEngine with provenance-aware taint checking (LLM-generated data requires approval for financial/external-write effects) - Extend Store trait with mission CRUD methods - Add conversation surface tracking, compaction token fix, context memory injection - Wire new modules through lib.rs re-exports and bridge adapters Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): wire v1 security controls into engine v2 adapter Zero engine crate changes. All security controls enforced at the bridge boundary in EffectBridgeAdapter: 1. Tool approval (v1: Tool::requires_approval): - Checks each tool's approval requirement with actual params - Always → returns EngineError::LeaseDenied (blocks execution) - UnlessAutoApproved → checks auto_approved set, blocks if not approved - Never → proceeds - Per-session auto_approved HashSet (for future "always" handling) 2. 
Hook interception (v1: BeforeToolCall):
   - Runs HookEvent::ToolCall before every execution
   - HookOutcome::Reject → blocks with reason
   - HookError::Rejected → blocks with reason
   - Hook errors → fail-open (logged, execution continues)

3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm):
   - Leak detection: API keys in tool output are redacted
   - Policy enforcement: content policy rules applied
   - Length truncation: output capped at 100KB
   - XML boundary protection: prevents injection via tool output

4. Sensitive param redaction (v1: redact_params):
   - Tool's sensitive_params() consulted before hooks see parameters
   - Redacted params sent to hooks, original params used for execution

5. available_actions() now sets requires_approval based on each tool's default approval requirement, so the engine's PolicyEngine can gate tools it hasn't seen before.

6. Actual execution timing measured via Instant::now() (replaces placeholder Duration::from_millis(1)).

Accessor visibility: hooks() widened to pub(crate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): implement tool approval flow for engine v2

Adds a complete approval flow that mirrors v1 behavior, using the existing v1 security controls (Tool::requires_approval, auto-approve sets, StatusUpdate::ApprovalNeeded).

## How it works

### Step 1: Tool blocked at execution

When the LLM's code calls a tool (e.g., `shell("ls")`):
1. EffectBridgeAdapter.execute_action() looks up the Tool object
2. Calls tool.requires_approval(&params) — returns ApprovalRequirement
3. If Always → EngineError::LeaseDenied (always blocks)
4. If UnlessAutoApproved → checks auto_approved HashSet → if not in set, returns EngineError::LeaseDenied
5.
If Never → proceeds to execution

### Step 2: Engine returns NeedApproval

The LeaseDenied error propagates through:
- CodeAct path: becomes Python RuntimeError, code halts, thread returns NeedApproval with action_name + parameters
- Structured path: same via ActionResult.is_error

### Step 3: Router stores pending approval

- PendingApproval { action_name, original_content } stored on EngineState
- StatusUpdate::ApprovalNeeded sent to channel (shows approval card in CLI/web with tool name, parameters, yes/always/no buttons)
- Returns text: "Tool 'shell' requires approval. Reply yes/always/no."

### Step 4: User responds

handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2:
- 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes original message (tool now passes the approval check on second run)
- 'always' → same + logs for session persistence
- 'no' → returns "Denied: tool was not executed."

### Key design choice

Instead of pausing/resuming mid-execution (which needs engine changes to freeze/restore the Monty VM state), we auto-approve the tool and re-run the full message. The EffectBridgeAdapter's auto_approved set persists across runs, so the second execution passes immediately. This trades one extra LLM call for zero engine modifications.

## Files changed

- src/bridge/router.rs: PendingApproval struct, handle_approval(), NeedApproval → StatusUpdate::ApprovalNeeded conversion
- src/bridge/mod.rs: export handle_approval
- src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2
- src/bridge/effect_adapter.rs: fmt fixes

151 tests passing, clippy + fmt clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): demote trace/reflection logging from info to debug

INFO-level log output from background tasks (trace analysis, reflection) corrupts the REPL terminal UI. The trace summary, issue warnings, and reflection doc previews were printing mid-approval-card, breaking the interactive display.
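The three-valued approval gate from the approval-flow commit above can be sketched as a plain function over v1's ApprovalRequirement. Variant and message wording follow the commit message; everything else is illustrative:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum ApprovalRequirement {
    Never,
    UnlessAutoApproved,
    Always,
}

// Decide whether a tool call may proceed, given the per-session set of
// auto-approved tool names. Err corresponds to EngineError::LeaseDenied.
pub fn check_approval(
    requirement: &ApprovalRequirement,
    tool: &str,
    auto_approved: &HashSet<String>,
) -> Result<(), String> {
    match requirement {
        ApprovalRequirement::Never => Ok(()),
        ApprovalRequirement::Always => Err(format!("Tool '{tool}' requires approval")),
        ApprovalRequirement::UnlessAutoApproved => {
            if auto_approved.contains(tool) {
                Ok(())
            } else {
                Err(format!("Tool '{tool}' requires approval. Reply yes/always/no."))
            }
        }
    }
}

fn main() {
    let mut approved = HashSet::new();
    assert!(check_approval(&ApprovalRequirement::UnlessAutoApproved, "shell", &approved).is_err());
    approved.insert("shell".to_string()); // user replied "yes"/"always"
    assert!(check_approval(&ApprovalRequirement::UnlessAutoApproved, "shell", &approved).is_ok());
}
```

This shape also explains the re-run design: since the set persists across runs, the second execution of the same message passes the gate immediately.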
Fix: all logging in trace.rs changed from info!/warn! to debug!/warn!. Trace analysis and reflection results now only show when RUST_LOG=ironclaw_engine=debug is set. Also added logging discipline rule to global CLAUDE.md: - info! → user-facing status the REPL intentionally renders - debug! → internal diagnostics (traces, reflection, engine internals) - Background tasks must NEVER use info! — it breaks the TUI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): demote all router info! logging to debug! "engine v2: initializing" and "engine v2: handling message" were printing at INFO level, corrupting the REPL UI. All router logging now uses debug! — only visible with RUST_LOG=ironclaw=debug. Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(safety): demote leak detector warn-action logs from warn! to debug! The leak detector's Warn-action matches (high_entropy_hex pattern on web search results containing commit SHAs, CSS colors, URL hashes) were logging at warn! level, corrupting the REPL UI with lines like: WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5 These are informational false positives — real leaks use LeakAction::Redact which silently modifies the content. Warn-action matches only log for debugging purposes and should not appear in production output. Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): strengthen CodeAct prompt to prevent shallow text answers The model was answering "Suggested 45 improvements" as a brief text summary from training data without actually searching or listing them. The trace showed: no code block, no tool calls, no FINAL(). Prompt changes: - Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with plain text only." (was: "Always write code... 
plain text for brief explanations") - Rule 2 (NEW): "NEVER answer from memory or training data alone. Always use tools to get real, current information before answering." - Rule 3: FINAL answer "should be detailed and complete — not just a summary like 'found 45 items'" - Rule 8 (NEW): "Include the actual content in your FINAL() answer, not just a count or summary. Users want to see the details." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): persist reflection docs to workspace for cross-session learning Replaces InMemoryStore with HybridStore: - Ephemeral data (threads, steps, events, leases) stays in-memory - MemoryDocs (lessons, specs, playbooks from reflection) persist to the workspace at engine/docs/{type}/{id}.json On engine init, load_docs_from_workspace() reads existing docs back into the in-memory cache. This means: - Lessons learned in session 1 are available in session 2 - The RetrievalEngine injects relevant past lessons into new threads - The engine genuinely improves over time as reflection accumulates Workspace paths: engine/docs/lessons/{uuid}.json engine/docs/specs/{uuid}.json engine/docs/playbooks/{uuid}.json engine/docs/summaries/{uuid}.json engine/docs/issues/{uuid}.json No new database tables. Uses existing workspace write/read/list. workspace() accessor widened to pub(crate). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): adapt to execute_tool_with_safety params-by-value change Staging merge changed execute_tool_with_safety to take params by value instead of by reference (perf optimization from PR #926). Updated bridge adapter to clone params before passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): add web gateway integration plan to Phase 6 Documents three gaps between engine v2 and the web gateway: 1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent) 2. 
No conversation persistence (engine uses HybridStore, gateway reads v1 DB) 3. No cross-channel visibility (REPL ↔ web messages invisible to each other) Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1 conversation tables after thread completion. Prerequisite: AppEvent extraction PR (in progress separately). Also updated DB persistence status: HybridStore with workspace-backed MemoryDocs is now implemented (partial persistence). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): document routine/job gap and SIGKILL crash scenario Routines are entirely v1 — not hooked up to engine v2. When a user asks "create a routine" as natural language, engine v2 tries to call routine_create via CodeAct, but the tool needs RoutineEngine + Database refs that the bridge's minimal JobContext doesn't provide. This caused a SIGKILL crash during testing. Options documented: block routine tools in v2 (short term), pass refs through context (medium), replace with Mission system (long term). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: extract AppEvent to crates/ironclaw_common SseEvent was defined in src/channels/web/types.rs but imported by 12+ modules across agent, orchestrator, worker, tools, and extensions — it had become the application-wide event protocol, not a web transport concern. Create crates/ironclaw_common as a shared workspace crate and move the enum there as AppEvent. Also move the truncate_preview utility which was similarly leaked from the web gateway into agent modules. 
- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB

Three changes to make engine v2 visible in the web gateway:

1. SSE event streaming (AppEvent broadcast):
   - ThreadEvent → AppEvent conversion via thread_event_to_app_event()
   - Events broadcast to SseManager during the poll loop
   - Covers: Thinking, ToolCompleted (success/error), Status, Response
   - Web gateway receives real-time progress without any gateway changes

2. Conversation persistence to v1 database:
   - After thread completes, writes user message + agent response to v1 ConversationStore via add_conversation_message()
   - Uses get_or_create_assistant_conversation() for per-user per-channel
   - Web gateway reads from DB as usual — chat history appears

3. Final response broadcast:
   - AppEvent::Response with full text + thread_id sent via SSE
   - Web gateway renders the response in the chat UI

New EngineState fields: sse (Option<Arc<SseManager>>), db (Option<Arc<dyn Database>>). Both populated from Agent.deps. Agent.deps visibility widened to pub(crate).

Depends on: ironclaw_common crate with AppEvent type (PR #1615).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits

Three security/stability improvements in EffectBridgeAdapter:

1.
V1-only tool blocking: - routine_create, create_job, build_software (and hyphenated variants) return helpful error: "use the slash command instead" - Filtered out of available_actions() so system prompt doesn't list them - Prevents crash from tools needing RoutineEngine/Scheduler refs 2. Per-step tool call limit: - Max 50 tool calls per code block (AtomicU32 counter) - Prevents amplification: `for i in range(10000): shell(...)` - Returns "call limit reached, break into multiple steps" 3. Rate limiting: - Per-user per-tool sliding window via RateLimiter - Checks tool.rate_limit_config() before every execution - Returns "rate limited, try again in Ns" Architecture plan updated: - Gateway integration: DONE - Routines: BLOCKED (gracefully, with slash command fallback) - Rate limiting: DONE - Call limit: DONE - Phase 6 status: DONE (remaining: acceptance tests, two-phase commit) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Mission system design — goal-oriented autonomous threads Missions replace routines with evolving, knowledge-accumulating autonomous agents. Unlike routines (fixed prompt, stateless), Missions: - Generate prompts from accumulated Project knowledge (lessons, playbooks, issues from prior threads) - Adapt approach when something fails repeatedly - Track progress toward a goal with success criteria - Self-manage: pause when stuck, complete when goal achieved Architecture: MissionManager with cron ticker spawns threads via ThreadManager. Meta-prompt built from mission goal + Project MemoryDocs via RetrievalEngine. Reflection feeds back automatically. 6-step implementation plan: cron trigger, meta-prompt builder, bridge wiring, CodeAct tools, progress tracking, persistence. Includes two worked examples: daily tech news briefing (ongoing) and test coverage improvement (goal-driven, self-completing). 
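The per-step call limit from item 2 above can be sketched with a std-only AtomicU32 counter. The 50-call limit and error text come from the commit message; the struct and method names are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const MAX_CALLS_PER_STEP: u32 = 50;

// Bounds tool-call amplification like `for i in range(10000): shell(...)`.
pub struct CallLimiter {
    calls: AtomicU32,
}

impl CallLimiter {
    pub fn new() -> Self {
        Self { calls: AtomicU32::new(0) }
    }

    // Reset before each thread execution (the counter-reset fix noted in
    // the bridge review items).
    pub fn reset(&self) {
        self.calls.store(0, Ordering::Relaxed);
    }

    pub fn try_call(&self) -> Result<(), &'static str> {
        // fetch_add returns the previous value, so calls 0..49 succeed and
        // the 51st attempt fails.
        if self.calls.fetch_add(1, Ordering::Relaxed) >= MAX_CALLS_PER_STEP {
            Err("call limit reached, break into multiple steps")
        } else {
            Ok(())
        }
    }
}

fn main() {
    let limiter = CallLimiter::new();
    for _ in 0..50 {
        assert!(limiter.try_call().is_ok());
    }
    assert!(limiter.try_call().is_err());
    limiter.reset();
    assert!(limiter.try_call().is_ok());
}
```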
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): extend Mission types with webhook/event triggers + evolving strategy Mission types updated to support external activation sources: MissionCadence expanded: - Cron { expression, timezone } — timezone-aware scheduling - OnEvent { event_pattern } — channel message pattern matching - OnSystemEvent { source, event_type } — structured events from tools - Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.) - Manual — explicit triggering only The engine defines trigger TYPES. The bridge implements infrastructure (cron ticker, webhook endpoints, event matchers). GitHub issues, PRs, email, Slack events all use the generic Webhook cadence — no special-casing in the engine. Webhook payload injected as state["trigger_payload"] in the thread's Python context. Mission struct extended: - current_focus: what the next thread should work on (evolving) - approach_history: what we've tried (for adaptation) - max_threads_per_day / threads_today: daily budget - last_trigger_payload: webhook/event data for thread context Plan updated with trigger type table and webhook integration design. 
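The cadence variants listed above can be written down as a plain enum sketch. Variant and field names follow the commit message; derives and the helper method are assumptions (the real type presumably carries serde derives):

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum MissionCadence {
    Cron { expression: String, timezone: String },
    OnEvent { event_pattern: String },
    OnSystemEvent { source: String, event_type: String },
    Webhook { path: String, secret: String },
    Manual,
}

impl MissionCadence {
    // Only cron missions are driven by the 60s ticker; every other variant
    // waits for an external activation (event, webhook, or manual fire).
    pub fn is_scheduled(&self) -> bool {
        matches!(self, MissionCadence::Cron { .. })
    }
}

fn main() {
    let daily = MissionCadence::Cron {
        expression: "0 9 * * *".to_string(),
        timezone: "UTC".to_string(),
    };
    assert!(daily.is_scheduled());
    assert!(!MissionCadence::Manual.is_scheduled());
}
```

Keeping the engine to trigger *types* while the bridge owns the ticker/webhook infrastructure is what lets GitHub, email, and Slack all reuse the generic Webhook variant.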
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): implement MissionManager execution with meta-prompts

The MissionManager now builds evolving meta-prompts and processes thread outcomes for continuous learning:

fire_mission() upgraded:
- Loads Project MemoryDocs via RetrievalEngine for context
- Builds meta-prompt from: goal, current_focus, approach_history, project knowledge docs, trigger payload, thread count
- Spawns thread with meta-prompt as user message
- Background task waits for completion and processes outcome
- Daily thread budget enforcement (max_threads_per_day)

Meta-prompt structure:

    # Mission: {name}
    Goal: {goal}
    ## Current Focus (evolves between threads)
    ## Previous Approaches (what we've tried)
    ## Knowledge from Prior Threads (lessons, playbooks, issues)
    ## Trigger Payload (webhook/event data if applicable)
    ## Instructions (accomplish step, report next focus, check goal)

Outcome processing:
- Extracts "next focus:" from FINAL() response → updates current_focus
- Detects "goal achieved: yes" → completes mission
- Records accomplishment in approach_history
- Failed threads recorded as "FAILED: {error}"

Cron ticker:
- start_cron_ticker() spawns tokio task, ticks every 60s
- Checks active Cron missions, fires those past next_fire_at

151 tests passing.
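The outcome-processing rules above boil down to string extraction over the FINAL() response. A minimal sketch, assuming case-insensitive matching on ASCII markers (the real parser may differ):

```rust
// Pull the evolving focus out of a line like "Next focus: db module".
pub fn extract_next_focus(response: &str) -> Option<String> {
    response.lines().find_map(|line| {
        let lower = line.to_lowercase();
        lower
            .find("next focus:")
            .map(|i| line[i + "next focus:".len()..].trim().to_string())
    })
}

// Detect the completion marker anywhere in the response.
pub fn goal_achieved(response: &str) -> bool {
    response.to_lowercase().contains("goal achieved: yes")
}

fn main() {
    let response = "Fixed two tests.\nNext focus: tools module\nGoal achieved: no";
    assert_eq!(extract_next_focus(response), Some("tools module".to_string()));
    assert!(!goal_achieved(response));
    assert!(goal_achieved("All criteria met. Goal achieved: yes"));
}
```

This is the learning loop's hinge: one thread's FINAL() text becomes the next thread's current_focus.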
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): wire MissionManager into engine v2 for CodeAct access

Missions are now callable from CodeAct Python code:

```python
# Create a daily briefing mission
result = mission_create(
    name="Tech News",
    goal="Daily AI/crypto/software news briefing",
    cadence="0 9 * * *"
)

# List all missions
missions = mission_list()

# Manually fire a mission
mission_fire(id="...")

# Pause/resume
mission_pause(id="...")
mission_resume(id="...")
```

Implementation:
- MissionManager created on engine init, cron ticker started
- EffectBridgeAdapter intercepts mission_* function calls before tool lookup and routes to MissionManager
- parse_cadence() handles: "manual", cron expressions, "event:pattern", "webhook:path"
- Mission functions documented in CodeAct system prompt
- MissionManager set on adapter via set_mission_manager() after init (avoids circular dependency)

System prompt updated with mission_create, mission_list, mission_fire, mission_pause, mission_resume documentation.

151 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): map routine_* calls to mission operations in v2

When the model calls routine_create, routine_list, routine_fire, routine_pause, routine_resume, or routine_delete, the bridge now routes them to the MissionManager instead of blocking with an error.

Mapping:
- routine_create → mission_create (with cadence parsing)
- routine_list → mission_list
- routine_fire → mission_fire
- routine_pause → mission_pause
- routine_resume → mission_resume
- routine_update → mission_pause/resume (based on params)
- routine_delete → mission_complete (marks as done)

Routine tools removed from v1-only blocklist and restored in available_actions(). The model can use either "routine" or "mission" vocabulary — both work.

Still blocked: create_job, cancel_job, build_software (need v1 Scheduler/ContainerJobManager refs).
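The four input shapes that parse_cadence() handles can be sketched as straightforward prefix matching. The simplified enum here is a stand-in for MissionCadence (which carries more fields); treating everything unmatched as a cron expression is the assumption:

```rust
#[derive(Debug, PartialEq)]
pub enum Cadence {
    Manual,
    Cron(String),
    Event(String),
    Webhook(String),
}

pub fn parse_cadence(input: &str) -> Cadence {
    let s = input.trim();
    if s.eq_ignore_ascii_case("manual") {
        Cadence::Manual
    } else if let Some(pattern) = s.strip_prefix("event:") {
        Cadence::Event(pattern.to_string())
    } else if let Some(path) = s.strip_prefix("webhook:") {
        Cadence::Webhook(path.to_string())
    } else {
        // Anything else is treated as a cron expression, e.g. "0 9 * * *".
        Cadence::Cron(s.to_string())
    }
}

fn main() {
    assert_eq!(parse_cadence("manual"), Cadence::Manual);
    assert_eq!(parse_cadence("0 9 * * *"), Cadence::Cron("0 9 * * *".to_string()));
    assert_eq!(
        parse_cadence("webhook:github-pr"),
        Cadence::Webhook("github-pr".to_string())
    );
}
```

One string parameter covering all trigger types is what lets routine_create map onto mission_create without any schema changes.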
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(engine): add E2E mission flow tests — 7 new tests Comprehensive mission lifecycle tests: - fire_mission_builds_meta_prompt_with_goal: verifies thread spawned with project context and recorded in history - outcome_processing_extracts_next_focus: "Next focus: X" in FINAL() response → mission.current_focus updated - outcome_processing_detects_goal_achieved: "Goal achieved: yes" → mission status transitions to Completed - mission_evolves_via_direct_outcome_processing: 3-step evolution: step 1 sets focus to "db module", step 2 evolves to "tools module", step 3 detects goal achieved → mission completes. Tests the full learning loop without background task timing dependencies. - fire_with_trigger_payload: webhook payload stored on mission and threads_today counter incremented - daily_budget_enforced: max_threads_per_day=1 → first fire succeeds, second returns None 157 tests passing (151 prior + 6 new mission E2E). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): self-improving engine via Mission system Wire the self-improvement loop as a Mission with OnSystemEvent cadence, inspired by karpathy/autoresearch's program.md approach. The mission fires when threads complete with issues, receives trace data as trigger payload, and uses tools directly to diagnose and fix problems. 
Key changes: Engine self-improvement (Phase A+B from design doc): - Add fire_on_system_event() to MissionManager for OnSystemEvent cadence - Add start_event_listener() that subscribes to thread events and fires matching missions when non-Mission threads complete with trace issues - Add ensure_self_improvement_mission() with autoresearch-style goal prompt (concrete loop steps, not vague instructions) - Add process_self_improvement_output() for structured JSON fallback - Seed fix pattern database with 8 known patterns from debugging - Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now async + Store-aware, appends learned rules from prompt_overlay docs) - Pass Store to ExecutionLoop for overlay loading Bridge review fixes (P1/P2): - Scope engine v2 SSE events to requesting user (broadcast_for_user) - Per-user pending approvals via HashMap instead of global Option - Reset tool-call limit counter before each thread execution - Only persist auto-approval when user chose "always", not one-off "yes" - Remove dead store/mission_manager fields from EngineState Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add checkpoint-based engine thread recovery * feat(engine): add Python orchestrator module and host functions Add the orchestrator infrastructure for replacing the Rust execution loop with versioned Python code. This commit adds the module and host functions without switching over — the existing Rust loop is unchanged. 
New files: - orchestrator/default.py: v0 Python orchestrator (run_loop + helpers) - executor/orchestrator.rs: host function dispatch, orchestrator loading from Store with version selection, OrchestratorResult parsing Host functions exposed to orchestrator Python via Monty suspension: __llm_complete__, __execute_code_step__ (nested Monty VM), __execute_action__, __check_signals__, __emit_event__, __add_message__, __save_checkpoint__, __transition_to__, __retrieve_docs__, __check_budget__, __get_actions__ Also makes json_to_monty, monty_to_json, monty_to_string pub(crate) in scripting.rs for cross-module use. Design doc: docs/plans/2026-03-25-python-orchestrator.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): switch ExecutionLoop::run() to Python orchestrator Replace the 900-line Rust execution loop with a ~80-line bootstrap that loads and runs the versioned Python orchestrator via Monty VM. The orchestrator Python code (orchestrator/default.py) is the v0 compiled-in version. Runtime versions can override it via MemoryDoc storage (orchestrator:main with tag orchestrator_code). Key fixes during switchover: - Use ExtFunctionResult::NotFound for unknown functions so Monty falls through to Python-defined functions (extract_final, etc.) - Move helper function definitions above run_loop for Monty scoping - Use FINAL result value (not VM return value) in Complete handler - Rename 'final' variable to 'final_answer' to avoid Python keyword Status: 171/177 tests pass. 6 remaining failures are step_count and token tracking bookkeeping — the orchestrator manages these internally but doesn't yet update the thread's counters via host functions. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): all 177 tests pass with Python orchestrator - Increment step_count and track tokens in __emit_event__("step_completed") so thread bookkeeping matches the old Rust loop behavior - Remove double-counting of tokens in bootstrap (orchestrator handles it) - Match nudge text to existing TOOL_INTENT_NUDGE constant - Fix FINAL result propagation (use stored final_result, not VM return) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): orchestrator versioning, auto-rollback, and tests Add version lifecycle for the Python orchestrator: - Failure tracking via MemoryDoc (orchestrator:failures) - Auto-rollback: after 3 consecutive failures, skip the latest version and fall back to previous (or compiled-in v0) - Success resets the failure counter - OrchestratorRollback event for observability Update self-improvement Mission goal with Level 1.5 instructions for orchestrator patches — the agent can now modify the execution loop itself via memory_write with versioned orchestrator docs. 12 new tests: version selection (highest wins), rollback after failures, rollback to default, failure counting/resetting, outcome parsing for all 5 ThreadOutcome variants. 189 tests pass, zero clippy warnings. 
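The version-selection-with-rollback rule above (highest version wins; after 3 consecutive failures, skip the latest and fall back to the previous, or the compiled-in v0) can be sketched as a pure function. Representing versions as bare u32s is a simplification of the MemoryDoc storage:

```rust
const MAX_CONSECUTIVE_FAILURES: u32 = 3;

// Pick which orchestrator version to run. None means: fall back to the
// compiled-in v0 (orchestrator/default.py).
pub fn select_version(mut versions: Vec<u32>, failures_on_latest: u32) -> Option<u32> {
    versions.sort_unstable();
    if failures_on_latest >= MAX_CONSECUTIVE_FAILURES {
        // Auto-rollback: skip the failing latest version.
        versions.pop();
    }
    versions.last().copied()
}

fn main() {
    // Healthy: highest version wins.
    assert_eq!(select_version(vec![1, 3, 2], 0), Some(3));
    // Three failures on v3: roll back to v2.
    assert_eq!(select_version(vec![1, 3, 2], 3), Some(2));
    // Only one stored version and it's failing: back to compiled-in v0.
    assert_eq!(select_version(vec![1], 3), None);
}
```

Paired with "success resets the failure counter", this gives the self-improvement loop a bounded blast radius: a bad orchestrator patch costs at most three threads.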
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add engine v2 architecture, self-improvement, and dev history

Three new docs for contributors:
- engine-v2-architecture.md: Two-layer architecture (Rust kernel + Python orchestrator), five primitives, execution model with nested Monty VMs, bridge layer, memory/reflection, missions, capabilities
- self-improvement.md: Three improvement levels (prompt/orchestrator/config/code), autoresearch-inspired Mission loop, versioned orchestrator with auto-rollback, fix pattern database, safety model
- development-history.md: Summary of 6 Claude Code sessions that built the system, key design decisions and debugging moments, architecture evolution from 900-line Rust loop to Python orchestrator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): complete v2 side-by-side integration with gateway API

Wire engine v2 into the full submission pipeline and expose threads, projects, and missions through the web gateway REST API.

Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear submissions to engine v2 when ENGINE_V2=true. Previously only UserInput and ApprovalResponse were handled; all other control commands fell through to disconnected v1 sessions.

Bridge query layer — add 11 read-only query functions and 6 DTO types so gateway handlers can inspect engine state (threads, steps, events, projects, missions) without direct access to the EngineState singleton.

Gateway endpoints — new /api/engine/* routes:
    GET  /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events
    GET  /projects, /projects/{id}
    GET  /missions, /missions/{id}
    POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume

SSE events — add ThreadStateChanged, ChildThreadSpawned, and MissionThreadSpawned AppEvent variants. Expand the bridge event mapper to forward StateChanged and ChildSpawned engine events to the browser.
Engine crate — add ConversationManager::clear_conversation() for the /new and /clear commands.

Code quality — replace 10 .expect() calls with proper error returns, remove the dead AgentConfig.engine_v2 field, log silent init errors, fix a duplicate doc comment, improve the fallthrough documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): empty call_id on ActionResult and trace analyzer false positives

Fix the structured executor not stamping call_id onto ActionResult — the EffectExecutor trait doesn't receive call_id, so the structured executor must copy it from the original ActionCall after execution. An empty call_id caused OpenAI-compatible providers to reject the next LLM request with "Invalid 'input[2].call_id': empty string".

Fix trace analyzer false positives:

- The code_error check now only scans User-role code output messages (prefixed with [stdout]/[stderr]/[code ]/Traceback), not the System prompt, which contains example error text
- The missing_tool_output check now recognizes ActionResult messages as valid tool output (Tier 0 structured path)
- Add NotImplementedError to the detected code error patterns

New trace checks:

- empty_call_id: detect ActionResult messages with a missing/empty call_id before they reach the LLM API (severity: Error)
- llm_error: extract LLM provider errors from the Failed state reason
- orchestrator_error: extract orchestrator errors from the Failed state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): add Missions tab to gateway UI

Add a full Missions page to the web gateway with list view, detail view, and action buttons (Fire, Pause, Resume).

Backend: add an /api/engine/missions/summary endpoint returning counts by status (active/paused/completed/failed).
Frontend:

- New "Missions" tab between Jobs and Routines
- Summary cards showing mission counts by status
- Table with name, goal, cadence type, thread count, status, actions
- Detail view with goal, cadence, current focus, success criteria, approach history, spawned thread list, and action buttons
- Fire/Pause/Resume actions with toast notifications
- i18n support (English + Chinese)
- CSS following the existing routines/jobs patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): eagerly initialize engine v2 at startup

The gateway API endpoints (/api/engine/missions, etc.) call bridge query functions that return empty results when the engine state hasn't been initialized yet. Previously, initialization only happened lazily on the first chat message via handle_with_engine().

Now when ENGINE_V2=true, the engine is initialized in Agent::run() before channels start, so the self-improvement mission and other engine state is available to gateway API endpoints immediately.

Also rename get_or_init_engine → init_engine and make it public so it can be called from agent_loop.rs at startup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): improve mission detail with markdown goal and thread table

- Goal rendered as a full-width markdown block instead of a plain-text meta item (uses the existing renderMarkdown/marked)
- Current focus and success criteria also rendered as markdown
- Spawned threads shown as a clickable table with goal, type, state, steps, tokens, and created date instead of a UUID list
- Clicking a thread row opens an inline thread detail view showing a metadata grid and the full message history with markdown rendering
- Back button returns to the mission detail view
- Backend: mission detail now returns full thread summaries (goal, state, step_count, tokens) instead of just thread IDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): close SSE connections on page unload to prevent connection starvation

The browser limits concurrent HTTP/1.1 connections per origin to 6. Without cleanup, SSE connections from prior page loads linger after refresh/navigation, eating into the pool. After 2-3 refreshes, all 6 slots are consumed by stale SSE streams and new API fetch calls queue indefinitely — the UI shows "connected" (SSE works) but data never loads.

Add a beforeunload handler that closes both eventSource (chat events) and logEventSource (log stream) so the browser can reuse connections immediately on page reload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): support multiple gateway tabs by reducing SSE connections

Each browser tab opened 2 SSE connections (chat events + log events). With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the pool and couldn't load any data.

Three changes:

1. Lazy log SSE — only connect when the logs tab is active, disconnect when switching away. Most users rarely view logs, so this saves a connection slot per tab.
2. Visibility API — close SSE when the browser tab goes to background (the user switches to another tab), reconnect when it becomes visible. Background tabs don't need real-time events.
3. Combined with the existing beforeunload cleanup, this means:
   - Active foreground tab: 1 connection (chat SSE only, +1 if the logs tab is open)
   - Background tabs: 0 connections
   - Closed/refreshed tabs: 0 connections (beforeunload cleanup)

This allows many gateway tabs to coexist within the 6-connection limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): route messages to correct conversation by thread scope

Messages sent from a new conversation in the gateway always appeared in the default assistant conversation because handle_with_engine ignored the thread_id from the frontend.

Two fixes:

1. Engine conversation scoping — when the message carries a thread_id (from the frontend's conversation picker), use it as part of the engine conversation key: "gateway:<thread_id>" instead of just "gateway". This creates a distinct engine conversation per v1 thread, so messages don't cross-contaminate.
2. V1 dual-write targeting — write user messages and assistant responses to the v1 conversation matching the thread_id (via ensure_conversation), not the hardcoded assistant conversation. Falls back to the assistant conversation when no thread_id is present (e.g., default chat).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): richer activity indicators for engine v2 execution

The gateway UI showed only a generic "Thinking..." during engine v2 execution, with no visibility into CodeAct code execution, tool calls, or reflection. Now the event mapping produces detailed status updates.

Step lifecycle:

- "Calling LLM..." when a step starts (was "…
…stant (#1736)

* v2 architecture phase 1

* feat(engine): Phase 2 — execution loop, capability system, thread runtime

Add the core execution engine to the ironclaw_engine crate:

- CapabilityRegistry: register/get/list capabilities and actions
- LeaseManager: async lease lifecycle (grant, check, consume, revoke, expire)
- PolicyEngine: deterministic effect-level allow/deny/approve
- ThreadTree: parent-child relationship tracking
- ThreadSignal/ThreadOutcome: inter-thread messaging via mpsc
- ThreadManager: spawn threads as tokio tasks, stop, inject messages, join
- ExecutionLoop: core loop replacing run_agentic_loop() with signals, context building, LLM calls, action execution, and event recording
- Structured executor (Tier 0): lease lookup → policy check → effect execution
- Tool intent nudge detection
- MemoryStore + RetrievalEngine stubs for Phase 4
- Full 8-phase architecture plan in docs/plans/
- CLAUDE.md spec for the engine crate

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 3 — Monty Python executor with RLM pattern

Add CodeAct execution (Tier 1) using the Monty embedded Python interpreter, following the Recursive Language Model (RLM) pattern from arXiv:2512.24601.
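The deterministic allow/deny/approve idea behind the PolicyEngine can be illustrated with a minimal sketch; the `EffectLevel` variants and the specific mapping below are assumptions for illustration, not the engine's real policy rules:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum PolicyDecision {
    Allow,
    Deny,
    RequireApproval,
}

#[derive(Debug, Clone, Copy)]
enum EffectLevel {
    ReadOnly,  // e.g. listing files, querying state
    Mutating,  // e.g. writing files, network POSTs
    Dangerous, // e.g. shell execution, irreversible side effects
}

/// Deterministic mapping: the same effect level always yields the same
/// decision, so policy outcomes are reproducible and auditable.
fn check(effect: EffectLevel) -> PolicyDecision {
    match effect {
        EffectLevel::ReadOnly => PolicyDecision::Allow,
        EffectLevel::Mutating => PolicyDecision::RequireApproval,
        EffectLevel::Dangerous => PolicyDecision::Deny,
    }
}
```

The point of keeping this a pure function of the effect level (no LLM in the loop) is that the same action can never be approved one run and denied the next.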
Key additions:

- executor/scripting.rs: Monty integration with FunctionCall-based tool dispatch, catch_unwind panic safety, resource limits (30s, 64MB, 1M allocs)
- LlmResponse::Code variant + ExecutionTier::Scripting
- Context-as-variables (RLM 3.4): thread messages, goal, step_number, previous_results injected as Python variables — the LLM context stays lean while code accesses data selectively
- llm_query(prompt, context) (RLM 3.5): recursive subagent calls from within Python code — results stored as variables, not injected into the parent's attention window (symbolic composition)
- Compact output metadata between code steps instead of full stdout
- MontyObject ↔ serde_json::Value bidirectional conversion
- Updated architecture plan with RLM design principles

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): RLM best-practices enhancements from cross-reference analysis

Cross-referenced our implementation against the official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), and Prime Intellect's verifiers implementation.

Key enhancements:

- FINAL(answer) / FINAL_VAR(name): explicit termination pattern matching all three reference implementations. Code can signal completion at any point, not just via the return value.
- llm_query_batched(prompts): parallel recursive sub-calls via tokio::spawn, matching fast-rlm's asyncio.gather pattern and Prime Intellect's llm_batch.
- Output truncation increased to 8000 chars (from 120), matching Prime Intellect's 8192 default. Shows [TRUNCATED: last N chars] or [FULL OUTPUT].
- Step 0 orientation preamble: auto-injects context metadata (message count, total chars, goal, last user message preview) before the first code step, matching fast-rlm's auto-print pattern.
- Error-to-LLM flow: Python parse errors, runtime errors, NameErrors, OS errors, and async errors now flow back as stdout content instead of terminating the step, enabling LLM self-correction on the next iteration. Only VM panics (catch_unwind) terminate as EngineError.

74 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): update architecture plan with RLM cross-reference learnings

Comprehensive update after cross-referencing against the official RLM (alexzhang13/rlm), fast-rlm (avbiswas/fast-rlm), Prime Intellect (verifiers/RLMEnv), rlm-rs (zircote/rlm-rs), and Google ADK RLM.

Changes:

- Mark Phases 1-3 as DONE with commit refs and test counts
- Add a "Key Influences" section documenting all reference implementations
- Phase 3: full table of implemented RLM features with sources
- Phase 3: "Remaining gaps" table noting which phase addresses each
- Phase 4: expanded with compaction (85% context), rlm_query() (full recursive sub-agent), dual model routing, budget controls (USD, timeout, tokens, consecutive errors), lazy loading, pass-by-reference
- Add an "RLM Execution Model" cross-cutting section
- Add an "Implementation Progress" tracking table
- Remove stale "TO IMPLEMENT" markers (all Phase 3 work is done)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 4 — budget controls, compaction, reflection pipeline

Budget enforcement in ExecutionLoop:

- max_tokens_total: cumulative token limit, checked before each iteration
- max_duration: wall-clock timeout for the entire thread
- max_consecutive_errors: consecutive error-step threshold (resets on success, matching official RLM behavior)
- All produce ThreadOutcome::Failed with descriptive messages

Context compaction (from the RLM paper, 85% threshold):

- estimate_tokens(): char-based estimation (chars/4, matching RLM)
- should_compact(): triggers when tokens >= threshold_pct * context_limit
- compact_messages(): asks the LLM to summarize progress, replaces the history with [system, summary, continuation_note], preserves intermediate results
- Configurable via ThreadConfig: model_context_limit, compaction_threshold

Dual model routing:

- LlmCallConfig gains a depth field (0=root, 1+=sub-call)
- Implementations can route to cheaper models for sub-calls
- ExecutionLoop passes the thread depth to every LLM call

Reflection pipeline (reflection/pipeline.rs):

- reflect(thread, llm): analyzes a completed thread via LLM
- Produces a Summary doc (always), a Lesson doc (if errors), an Issue doc (if failed)
- Builds a transcript from thread messages + error events
- Returns ReflectionResult with docs + token usage

ThreadConfig extended with: max_tokens_total, max_consecutive_errors, model_context_limit, enable_compaction, compaction_threshold, depth, max_depth.

78 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 5 — conversation surface separated from execution

Conversation is now a UI layer, not an execution boundary. Multiple threads can run concurrently within one conversation; threads can outlive their originating conversation.

New types (types/conversation.rs):

- ConversationSurface: channel + user + entries + active_threads
- ConversationEntry: sender (User/Agent/System) + content + origin_thread_id
- ConversationId, EntryId (UUID newtypes)
- EntrySender enum (User, Agent{thread_id}, System)

ConversationManager (runtime/conversation.rs):

- get_or_create_conversation(channel, user) — indexed by (channel, user)
- handle_user_message() — injects into the active foreground thread or spawns a new one
- record_thread_outcome() — adds agent/system entries, untracks completed threads
- get_conversation(), list_conversations()

This enables the key architectural insight: a user can ask "what's the weather?" while a deployment thread is still running. Both produce entries in the same conversation.

85 tests passing, zero clippy warnings.
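The compaction trigger described above is simple enough to sketch; the function names follow the commit message, while the context limit and threshold values used below are assumed for illustration:

```rust
/// Char-based token estimate, matching the RLM heuristic of
/// roughly 4 characters per token.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Compaction fires once the running token count crosses
/// threshold_pct of the model's context limit (e.g. 0.85).
fn should_compact(total_tokens: usize, context_limit: usize, threshold_pct: f64) -> bool {
    total_tokens as f64 >= threshold_pct * context_limit as f64
}
```

For a hypothetical 128K-context model with the 85% threshold, compaction would trigger at roughly 108,800 estimated tokens.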
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): simplify execution tiers — Monty-only for CodeAct/RLM

Restructure phases 6-8 to clarify the execution model:

- Monty is the sole Python executor for CodeAct/RLM. No WASM or Docker Python runtimes for LLM-generated code.
- The WASM sandbox is for third-party tool isolation (existing infra, Phase 8)
- Docker containers are for thread-level isolation of high-risk work (Phase 8)
- Two-phase commit moves to Phase 6 (integration) at the adapter boundary

Phase renumbering:

- Old Phase 6 (Tier 2-3) → removed as a separate phase
- Old Phase 7 (integration) → Phase 6
- Old Phase 8 (cleanup) → Phase 7
- New Phase 8: WASM tools + Docker thread isolation (infra integration)

Updated progress table: Phases 1-5 marked DONE with test counts and commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): Phase 6 — bridge adapters for main crate integration

Strategy C parallel deployment: when the ENGINE_V2=true env var is set, user messages route through the engine instead of the existing agentic loop. All existing behavior is unchanged when the flag is off.

Bridge module (src/bridge/):

- LlmBridgeAdapter: wraps LlmProvider as the engine LlmBackend, converts ThreadMessage↔ChatMessage and ActionDef↔ToolDefinition, depth-based model routing (primary vs cheap_llm)
- EffectBridgeAdapter: wraps ToolRegistry+SafetyLayer as an EffectExecutor, routes tool calls through the existing execute_tool_with_safety pipeline
- InMemoryStore: HashMap-backed Store impl (no DB tables needed yet)
- EngineRouter: is_engine_v2_enabled() + handle_with_engine(), which builds the engine from Agent deps and processes messages end-to-end

Integration touchpoint (4 lines in agent_loop.rs): after hook processing, before session resolution, check the ENGINE_V2 flag and route UserInput through the engine path.

Accessor visibility widened: llm(), cheap_llm(), safety(), tools() changed from pub(super) to pub(crate) for bridge access.
85 engine tests + main crate clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add user message and system prompt to thread before execution

The ExecutionLoop was sending empty messages to the LLM because the thread was spawned with the user's input as the goal but no messages.

Fixes:

- ThreadManager.spawn_thread() now adds the goal as an initial user message before starting the execution loop
- ExecutionLoop.run() injects a default system prompt if none exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): match existing LLM request format to prevent 400 errors

The LLM bridge was missing several defaults that the existing Reasoning.respond_with_tools() sets:

- tool_choice: "auto" when tools are present (required by some providers)
- max_tokens: 4096 (default)
- temperature: 0.7 (default)
- When no tools (force_text): use plain complete() instead of complete_with_tools() with an empty tools array — matches the existing no-tools fallback path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): persist conversation context across messages

The engine was creating a fresh ThreadManager and InMemoryStore per message, losing all context between turns. A follow-up question like "what are the latest 10 issues?" had no memory of the prior "how many issues" response.
Fixes:

- EngineState (ThreadManager, ConversationManager, InMemoryStore) now persists across messages via OnceLock, initialized on first use
- ConversationManager builds message history from prior conversation entries (user messages + agent responses) and passes it to new threads
- ThreadManager.spawn_thread_with_history() accepts initial_messages that are prepended before the current user message
- System notifications (thread started/completed) are filtered out of the history (not useful as LLM context)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): enable CodeAct/RLM mode with code block detection

The engine now operates in CodeAct/RLM mode.

System prompt (executor/prompt.rs):

- Instructs the LLM to write Python in ```repl fenced blocks
- Documents available tools as callable Python functions
- Documents llm_query(), llm_query_batched(), FINAL()
- Documents context variables (context, goal, step_number, previous_results)
- Strategy guidance: examine context, break into steps, use tools, call FINAL()

Code block detection (bridge/llm_adapter.rs):

- extract_code_block() scans LLM text responses for ```repl or ```python blocks
- When detected, returns LlmResponse::Code instead of LlmResponse::Text
- The ExecutionLoop routes Code responses through Monty for execution

No structured tool definitions sent to the LLM:

- Tools are described in the system prompt as Python functions
- The LLM call sends an empty actions array, forcing text-mode responses
- This ensures the LLM writes code blocks (CodeAct) instead of structured tool calls (which would bypass the REPL)

85 tests passing, zero clippy warnings.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add 8 CodeAct/RLM E2E tests with mock LLM

Comprehensive test coverage for the Monty Python execution path:

- codeact_simple_final: Python code calls FINAL('answer') → thread completes
- codeact_tool_call_then_final: code calls test_tool() → FunctionCall suspends the VM → MockEffects returns a result → code resumes → FINAL()
- codeact_pure_python_computation: sum([1,2,3,4,5]) → FINAL('Sum is 15') with no tool calls — pure Python in Monty
- codeact_multi_step: the first step prints output (no FINAL), the second step sees the output metadata and calls FINAL — tests the iterative REPL flow
- codeact_error_recovery: the first step has a NameError → the error flows to the LLM as stdout → the second step recovers with FINAL — tests error transparency
- codeact_context_variables_available: code accesses the `goal` and `context` variables injected by the RLM context builder
- codeact_multiple_tool_calls_in_loop: a for loop calls test_tool() 3 times → 3 FunctionCall suspensions → all results collected → FINAL
- codeact_llm_query_recursive: code calls llm_query('prompt') → the VM suspends → MockLlm provides the sub-agent response → the result is returned as a Python string variable

93 tests passing (85 prior + 8 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): detect code blocks in plain completion path + multi-block support

Two bugs fixed:

1. The no-tools completion path (used by CodeAct since we send empty actions) returned LlmResponse::Text without checking for code blocks. Code blocks were rendered as markdown text instead of being executed.
2. extract_code_block now:
   - Handles bare ``` fences (skips non-Python languages)
   - Collects ALL code blocks in the response and concatenates them (models often split code across multiple blocks with explanation)
   - Tries markers in order: ```repl, ```python, ```py, then bare ```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(bridge): add 11 regression tests for code block extraction

Covers the exact failure modes discovered during live testing:

- extract_repl_block: standard ```repl fenced block
- extract_python_block: ```python marker
- extract_py_block: ```py shorthand
- extract_bare_backtick_block: bare ``` with Python content
- skip_non_python_language: ```json should NOT be extracted
- no_code_blocks_returns_none: plain text, no fences
- multiple_code_blocks_concatenated: two ```repl blocks with explanation between them → concatenated with \n\n
- mixed_thinking_and_code: model outputs explanation + two ```python blocks (the Hyperliquid case) → both extracted
- repl_preferred_over_bare: ```repl takes priority over bare ```
- empty_code_block_skipped: an empty fenced block returns None
- unclosed_block_returns_none: no closing ``` returns None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): detect FINAL() in text responses + regression tests

Models sometimes write FINAL() outside code blocks — as plain text after an explanation. The Hyperliquid case: the model outputs a long analysis, then FINAL("""...""") at the end, not inside ```repl fences.
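A simplified, pure-std sketch of the extraction rules above (marker priority, concatenation of all matching blocks, skipping tagged non-Python fences, ignoring empty and unclosed blocks); the real bridge implementation may differ in details:

````rust
/// Collect every block fenced with `marker`, in order of appearance.
fn extract_blocks(text: &str, marker: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(marker) {
        let after = &rest[start + marker.len()..];
        if marker == "```" {
            // Bare fence: skip whole blocks tagged with a language (```json etc.).
            let tag = after.lines().next().unwrap_or("").trim();
            if !tag.is_empty() {
                match after.find("\n```") {
                    Some(close) => { rest = &after[close + 4..]; continue; }
                    None => break,
                }
            }
        } else if !after.starts_with('\n') {
            // Prevent "```py" from matching inside "```python".
            rest = after;
            continue;
        }
        match after.find("```") {
            Some(end) => {
                let body = after[..end].trim();
                if !body.is_empty() {
                    blocks.push(body.to_string()); // empty blocks are skipped
                }
                rest = &after[end + 3..];
            }
            None => break, // unclosed fence: stop scanning
        }
    }
    blocks
}

/// Try fence markers in priority order; concatenate all blocks for the
/// first marker that yields any, matching the behavior described above.
fn extract_code_block(text: &str) -> Option<String> {
    for marker in ["```repl", "```python", "```py", "```"] {
        let blocks = extract_blocks(text, marker);
        if !blocks.is_empty() {
            return Some(blocks.join("\n\n"));
        }
    }
    None
}
````

For example, two ```repl blocks separated by prose come back as a single script joined with a blank line, while a lone ```json block yields `None`.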
Fixes:

- extract_final_from_text(): regex-based FINAL detection in text responses, matching the official RLM's find_final_answer() fallback
- Handles: double-quoted, single-quoted, triple-quoted, unquoted, nested parens
- Checked in the LlmResponse::Text handler BEFORE the tool intent nudge (FINAL takes priority)

9 new tests:

- codeact_final_in_text_response: FINAL("answer") in plain text
- codeact_final_triple_quoted_in_text: FINAL("""multi\nline""") in text
- final_double_quoted, final_single_quoted, final_triple_quoted, final_unquoted, final_with_nested_parens, final_after_long_text, no_final_returns_none

102 tests passing (93 + 9 new), zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add crate extraction & cleanup roadmap

Documents architectural recommendations from the engine v2 design process for future reference:

- Root directory consolidation (channels-src + tools-src → extensions/)
- Crate extraction tiers: zero-coupling (estimation, observability, tunnel), trivial-coupling (document_extraction, pairing, hooks), medium-coupling (secrets, MCP, db, workspace, llm, skills), heavy-coupling (web gateway, agent, extensions)
- src/ module reorganization into logical groups (core, persistence, infra, media, support)
- main.rs/app.rs slimming targets (100/500 lines after migration)
- WASM module candidates (document_extraction) and non-candidates (REPL, web gateway → separate crates instead)
- Priority ordering for extraction work
- Tracks completed items (ironclaw_safety, ironclaw_engine, transcription move)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): live progress status updates via event broadcast

Engine v2 now shows live progress in the CLI (and any channel):

- "Thinking..." when a step starts
- Tool name + success/error when actions execute
- "Processing results..."
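A cut-down sketch of the text fallback: the real `extract_final_from_text()` is regex-based and also handles the unquoted and nested-paren forms, while this version covers only the quoted cases:

```rust
/// Find the last FINAL(...) in a plain-text response and return its
/// quoted payload. Triple quotes are tried before single characters so
/// FINAL("""...""") is not misread as an empty double-quoted string.
fn extract_final_from_text(text: &str) -> Option<String> {
    let start = text.rfind("FINAL(")? + "FINAL(".len();
    let rest = &text[start..];
    for quote in ["\"\"\"", "'''", "\"", "'"] {
        if let Some(body) = rest.strip_prefix(quote) {
            let end = body.find(quote)?;
            return Some(body[..end].to_string());
        }
    }
    None
}
```

The ordering of the quote markers is the load-bearing detail: checking `"` before `"""` would return an empty answer for every triple-quoted FINAL.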
  when a step completes

Implementation:

- ThreadManager holds a broadcast::Sender<ThreadEvent> (capacity 256)
- ExecutionLoop.emit_event() writes to thread.events AND broadcasts
- ThreadManager.subscribe_events() returns a receiver
- The Router uses tokio::select! to listen for events while waiting for thread completion, forwarding them as StatusUpdate to the channel

This replaces the polling approach with zero-latency event streaming.

Agent.channels visibility widened to pub(crate) for bridge access.

102 tests passing, zero clippy warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): include tool results in code step output for LLM context

The LLM was ignoring tool results and answering from training data because the compact output metadata didn't include what tools returned. Tool results lived only as ActionResult messages (role: Tool), which some providers flatten or the model ignores.

Now the code step output includes:

- stdout from Python print() statements
- [tool_name result] with the actual output (truncated to 4K per tool)
- [tool_name error] for failed tools
- [return] for the code's return value
- Total output truncated to 8K chars to prevent context bloat

This ensures the model sees web_search results, API responses, etc. in the next iteration and can reason about them instead of hallucinating.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): add debug/trace logging for CodeAct execution

Three verbosity levels for debugging the engine.

RUST_LOG=ironclaw_engine=debug:

- LLM call: message count, iteration, force_text
- LLM response: type (text/code/action_calls), token usage
- Code execution: code length, action count, had_error, final_answer
- Text response: length, FINAL() detection

RUST_LOG=ironclaw_engine=trace:

- Full message list sent to the LLM (role, length, first 200 chars each)
- Full code block being executed
- stdout preview (first 500 chars)
- Per-tool results (name, success, first 300 chars of output)
- Text response preview (first 500 chars)

Usage:

    ENGINE_V2=true RUST_LOG=ironclaw_engine=debug cargo run
    ENGINE_V2=true RUST_LOG=ironclaw_engine=trace cargo run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): execution trace recording + retrospective analysis

Enable with ENGINE_V2_TRACE=1 to get full execution traces and automatic issue detection after each thread completes.

Trace recording (executor/trace.rs):

- build_trace(): captures the full thread state — messages (with full content), events, step count, token usage, detected issues
- write_trace(): writes JSON to engine_trace_{timestamp}.json
- log_trace_summary(): logs a summary + issues at info/warn level

The retrospective analyzer detects 8 issue categories:

- thread_failure: the thread ended in the Failed state
- no_response: no assistant message generated
- tool_error: specific tool failures with error details
- code_error: Python errors (NameError, SyntaxError, etc.) in output
- missing_tool_output: tool results exist but not in system messages
- excessive_steps: >10 steps (may be stuck in a loop)
- no_tools_used: single-step answer without tools (hallucination risk)
- mixed_mode: text responses without code blocks (prompt not followed)

Thread state is now saved to the store after execution completes (for trace access after join_thread).
Usage:

    ENGINE_V2=true ENGINE_V2_TRACE=1 cargo run
    # After each message: trace JSON + issue log in the terminal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): wire reflection pipeline + trace analysis into thread lifecycle

After every thread completes, ThreadManager now automatically runs:

1. Retrospective trace analysis (non-LLM, always):
   - Detects 8 issue categories (tool errors, code errors, missing outputs, excessive steps, hallucination risk, etc.)
   - Logs issues at warn level when found
2. Trace file recording (when ENGINE_V2_TRACE=1):
   - Writes a full JSON trace to engine_trace_{timestamp}.json
3. LLM reflection (when enable_reflection=true):
   - Calls the reflection pipeline to produce Summary, Lesson, and Issue docs
   - Saves docs to the store for future context retrieval
   - Enabled by default in the bridge router

All three run inside the spawned tokio task after exec.run() completes, before saving the final thread state. No external wiring needed.

Removed the duplicate trace recording from the router — it's now handled by ThreadManager automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): convert tool name hyphens to underscores for Python compatibility

Root cause from trace analysis: the LLM writes `web_search()` (a valid Python identifier) but the tool registry has `web-search` (with a hyphen). The EffectBridgeAdapter couldn't find the tool → "Tool not found" error → the model fabricated fake data instead.
Fixes:

- available_actions(): converts tool names from hyphens to underscores (web-search → web_search) so the system prompt lists valid Python names
- execute_action(): tries the original name first, then falls back to the hyphenated form (web_search → web-search) for the tool registry lookup
- The same conversion in the router's capability registry builder

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(bridge): parse JSON tool output to prevent double-serialization

From trace analysis: web_search returned a JSON string, which was wrapped as serde_json::json!(string), creating a Value::String containing JSON. When Monty got this as MontyObject::String, the Python code couldn't index it with result['title'] → TypeError.

Fix: try parsing the tool output string as JSON first. If valid, use the parsed Value (it becomes a Python dict/list). If not valid JSON, keep it as a string.

This means web_search results are directly indexable in Python:

    results = web_search(query="...")
    print(results["results"][0]["title"])  # works now

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): persist variables across code steps via `state` dict

Monty creates a fresh runtime per code step, so variables are lost between steps. This caused the model to re-paste tool results from system messages, wasting tokens.

Fix: maintain a `persisted_state` JSON dict in the ExecutionLoop that accumulates across steps:

- Tool results stored by tool name: state["web_search"] = {results...}
- Return values stored: state["last_return"], state["step_0_return"]
- Injected as a `state` Python variable in each new MontyRun

Now the model can do:

    # Step 1:
    results = web_search(query="...")  # tool result saved in state
    # Step 2:
    data = state["web_search"]         # access the previous result
    summary = llm_query("summarize", str(data))
    FINAL(summary)

The system prompt is updated to document the `state` variable.
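The name-mapping fix above can be sketched as two small helpers; the `HashMap` registry is a stand-in for the real ToolRegistry:

```rust
use std::collections::HashMap;

/// Names advertised in the system prompt must be valid Python identifiers,
/// so hyphens become underscores (web-search → web_search).
fn python_name(tool_name: &str) -> String {
    tool_name.replace('-', "_")
}

/// Lookup tries the name as given, then falls back to the hyphenated
/// form, so both web_search and web-search resolve to the same tool.
fn resolve<'a>(registry: &'a HashMap<String, String>, name: &str) -> Option<&'a String> {
    registry
        .get(name)
        .or_else(|| registry.get(&name.replace('_', "-")))
}
```

Doing the conversion at both ends (advertising and lookup) means neither the prompt nor the registry has to change its canonical naming.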
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): add state hint on code errors + retrieval engine integration

When code fails with a NameError/UnboundLocalError (the model trying to access variables from a previous step), the error output now includes:

    [HINT] Variables don't persist between code blocks. Use the `state` dict to access data from previous steps. Available keys: ["web_search", "last_return"]

This teaches the model to use `state["web_search"]` instead of `result` after a NameError, reducing wasted steps from 3-4 to 1.

Also integrates RetrievalEngine into context building and ThreadManager:

- build_step_context() now accepts an optional RetrievalEngine to inject relevant memory docs (Lessons, Specs, Playbooks) into the LLM context
- RetrievalEngine uses keyword matching with doc-type priority scoring
- Memory docs from reflection (Phase 4) now feed back into future threads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove trace files and add to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): replace web_fetch example with web_search in CodeAct prompt

The system prompt example used web_fetch(url="..."), which doesn't exist as a tool. The model learned from the example and tried web_fetch, getting "Tool not found". Changed to web_search(query="..."), which is an actual registered tool.

Found via trace analysis — the reflection pipeline correctly identified this as a "Tool Name Correction" spec doc.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(engine): extract prompt templates to markdown files Prompt templates moved from inline Rust strings to plain markdown files at crates/ironclaw_engine/prompts/ for easy inspection and iteration: - prompts/codeact_preamble.md — main instructions, special functions, context variables, rules - prompts/codeact_postamble.md — strategy section Loaded at compile time via include_str!(), so no runtime file I/O. Edit the .md files and rebuild to iterate on prompts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): replace byte-index slicing with char-safe truncation Panic: 'byte index 80 is not a char boundary; it is inside ''' when tool output contained multi-byte UTF-8 characters (smart quotes from web search results). Fixed 4 unsafe byte-index slices: - thread.rs:281: message preview &content[..80] → chars().take(80) - loop_engine.rs:556: tool output &str[..4000] → chars().take(4000) - loop_engine.rs:579: output tail &str[len-8000..] → chars().skip() - scripting.rs:82: stdout tail &str[len-N..] → chars().skip() All now use .chars().take() or .chars().skip() which respect character boundaries. Follows CLAUDE.md rule: "Never use byte-index slicing on user-supplied or external strings." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): fix false positive missing_tool_output warning in trace analyzer The check was looking for "[" + "result]" in System-role messages only, but tool output metadata is added with patterns like "[shell result]" and may appear in messages with any role. Changed to scan all messages for " result]" or " error]" patterns regardless of role. 
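A Python analogue of the char-boundary fix above, assuming the same kind of input (multi-byte smart quotes): slicing the UTF-8 byte representation at a fixed index can land inside a multi-byte character, while character-based slicing cannot. Helper names are illustrative:

```python
def truncate_chars(s: str, n: int) -> str:
    """Character-safe truncation: analogue of Rust's .chars().take(n)."""
    return s[:n]

def tail_chars(s: str, n: int) -> str:
    """Character-safe tail: analogue of .chars().skip(len - n)."""
    return s if len(s) <= n else s[len(s) - n:]

smart = "a" * 79 + "\u2019suffix"  # U+2019 (right smart quote) is 3 UTF-8 bytes

# Byte-index slicing lands inside the multi-byte quote and fails to decode —
# the analogue of Rust's "not a char boundary" panic.
raw = smart.encode("utf-8")
try:
    raw[:80].decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed

# Character-based truncation is always valid.
assert truncate_chars(smart, 80) == "a" * 79 + "\u2019"
```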
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): update architecture plan with Phase 6 status and approval flow design Phase 6 updated to reflect what was actually built: - Bridge adapters (LLM, Effect, InMemoryStore, Router) — all done - Integration touchpoint (4 lines in handle_message) — done - Live progress via broadcast events — done - Conversation persistence across messages — done - Trace recording + retrospective analysis — done - 8 bugs found and fixed via trace analysis — documented Phase 6 remaining work documented: - Approval flow: detailed 5-step design (send to channel, pause thread, route response, resume execution, always handling) with v1 reference - Database persistence (InMemoryStore → real DB tables) - Acceptance testing (TestRig + TraceLlm fixtures) - Two-phase commit for high-stakes effects Progress table updated: Phase 6 marked as DONE (partial), 134 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add self-improving engine design plan Designs a system where the engine debugs and improves itself, based on the pattern observed in the last session: 5 consecutive bug fixes all followed trace → read → identify → edit → test, using tools the engine already has access to. Three levels of self-improvement: - Level 1 (Prompt): edit prompts/*.md to prevent LLM mistakes. Auto-apply. - Level 2 (Config): adjust defaults/mappings. Branch + test + PR. - Level 3 (Code): Rust patches for engine bugs. Branch + test + clippy + PR. Architecture: Self-improvement Mission spawns a Reflection thread that reads traces, reads source, proposes fixes, validates via cargo test, and either auto-applies (Level 1) or creates a PR (Level 2-3). Includes: fix pattern database (seeded from our 8 debugging session fixes), feedback loop diagram, safety model, implementation phases (A through D), and what exists vs what's new. 
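Looping back to the missing_tool_output fix two commits up: the role-agnostic marker scan can be sketched as a plain substring check — `has_tool_output` is a hypothetical name, not the trace analyzer's actual API:

```python
def has_tool_output(messages):
    """Scan ALL messages for tool-output markers like "[shell result]" or
    "[web_search error]", regardless of role (sketch of the fixed check)."""
    return any(" result]" in m["content"] or " error]" in m["content"]
               for m in messages)

msgs = [
    {"role": "assistant", "content": "calling shell"},
    {"role": "user", "content": "[shell result] total 8"},
]
assert has_tool_output(msgs)
assert not has_tool_output([{"role": "system", "content": "plain text"}])
```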
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add engine v2 security model and audit Comprehensive security analysis of engine v2 covering: Threat model: 4 attacker profiles (malicious input, prompt injection via tools, poisoned memory, supply chain). Current state audit: 9 controls working (Monty sandbox, safety layer, policy engine, leases, provenance, events) and 9 gaps identified. Critical finding: ALL tools granted by default — CodeAct code can call shell, write_file, apply_patch without approval. Proposed fix: 3-tier tool classification (auto/approve-once/always-approve). CodeAct-specific threats: tool call amplification, prompt injection via search results, data exfiltration via tool chains, Monty escape. Self-improvement security: poisoned trace attacks, memory poisoning via reflection. Mitigations: edit validation, frequency caps, audit trail, auto-rollback, reflection output scanning. 6-layer security architecture proposed: input validation, capability gating, output sanitization, execution sandboxing, self-improvement controls, observability. Prioritized implementation plan with severity/effort ratings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(security): cross-reference v1 controls — use, don't reinvent Updated security plan with detailed audit of ALL existing v1 security controls and how they map to engine v2 bridge gaps: Key finding: v1 already has solutions for every security gap identified. 
The bridge just needs to wire them in: - Tool::requires_approval() exists but bridge doesn't call it - safety.wrap_for_llm() exists but tool results enter context unwrapped - RateLimiter exists but bridge doesn't check rate limits - BeforeToolCall hooks exist but bridge doesn't run them - redact_params() exists but bridge doesn't redact sensitive params - Shell risk classification (Low/Medium/High) is inherited but ignored Revised priority: most fixes are small wiring tasks in EffectBridgeAdapter, not new security infrastructure. The bridge is the security boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): add missions, reliability tracker, reflection executor, and provenance-aware policy - Add Mission type and MissionManager for recurring thread scheduling - Add ReliabilityTracker for per-capability success/failure/latency tracking - Add reflection executor that spawns CodeAct threads for post-completion reflection - Extend PolicyEngine with provenance-aware taint checking (LLM-generated data requires approval for financial/external-write effects) - Extend Store trait with mission CRUD methods - Add conversation surface tracking, compaction token fix, context memory injection - Wire new modules through lib.rs re-exports and bridge adapters Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): wire v1 security controls into engine v2 adapter Zero engine crate changes. All security controls enforced at the bridge boundary in EffectBridgeAdapter: 1. Tool approval (v1: Tool::requires_approval): - Checks each tool's approval requirement with actual params - Always → returns EngineError::LeaseDenied (blocks execution) - UnlessAutoApproved → checks auto_approved set, blocks if not approved - Never → proceeds - Per-session auto_approved HashSet (for future "always" handling) 2. 
Hook interception (v1: BeforeToolCall): - Runs HookEvent::ToolCall before every execution - HookOutcome::Reject → blocks with reason - HookError::Rejected → blocks with reason - Hook errors → fail-open (logged, execution continues) 3. Output sanitization (v1: sanitize_tool_output + wrap_for_llm): - Leak detection: API keys in tool output are redacted - Policy enforcement: content policy rules applied - Length truncation: output capped at 100KB - XML boundary protection: prevents injection via tool output 4. Sensitive param redaction (v1: redact_params): - Tool's sensitive_params() consulted before hooks see parameters - Redacted params sent to hooks, original params used for execution 5. available_actions() now sets requires_approval based on each tool's default approval requirement, so the engine's PolicyEngine can gate tools it hasn't seen before. 6. Actual execution timing measured via Instant::now() (replaces placeholder Duration::from_millis(1)). Accessor visibility: hooks() widened to pub(crate). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): implement tool approval flow for engine v2 Adds a complete approval flow that mirrors v1 behavior, using the existing v1 security controls (Tool::requires_approval, auto-approve sets, StatusUpdate::ApprovalNeeded). ## How it works ### Step 1: Tool blocked at execution When the LLM's code calls a tool (e.g., `shell("ls")`): 1. EffectBridgeAdapter.execute_action() looks up the Tool object 2. Calls tool.requires_approval(¶ms) — returns ApprovalRequirement 3. If Always → EngineError::LeaseDenied (always blocks) 4. If UnlessAutoApproved → checks auto_approved HashSet → if not in set, returns EngineError::LeaseDenied 5. 
If Never → proceeds to execution ### Step 2: Engine returns NeedApproval The LeaseDenied error propagates through: - CodeAct path: becomes Python RuntimeError, code halts, thread returns NeedApproval with action_name + parameters - Structured path: same via ActionResult.is_error ### Step 3: Router stores pending approval - PendingApproval { action_name, original_content } stored on EngineState - StatusUpdate::ApprovalNeeded sent to channel (shows approval card in CLI/web with tool name, parameters, yes/always/no buttons) - Returns text: "Tool 'shell' requires approval. Reply yes/always/no." ### Step 4: User responds handle_message() intercepts Submission::ApprovalResponse when ENGINE_V2: - 'yes' → auto_approve_tool(name) on EffectBridgeAdapter, re-processes original message (tool now passes the approval check on second run) - 'always' → same + logs for session persistence - 'no' → returns "Denied: tool was not executed." ### Key design choice Instead of pausing/resuming mid-execution (which needs engine changes to freeze/restore the Monty VM state), we auto-approve the tool and re-run the full message. The EffectBridgeAdapter's auto_approved set persists across runs, so the second execution passes immediately. This trades one extra LLM call for zero engine modifications. ## Files changed - src/bridge/router.rs: PendingApproval struct, handle_approval(), NeedApproval → StatusUpdate::ApprovalNeeded conversion - src/bridge/mod.rs: export handle_approval - src/agent/agent_loop.rs: intercept ApprovalResponse for engine v2 - src/bridge/effect_adapter.rs: fmt fixes 151 tests passing, clippy + fmt clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): demote trace/reflection logging from info to debug INFO-level log output from background tasks (trace analysis, reflection) corrupts the REPL terminal UI. The trace summary, issue warnings, and reflection doc previews were printing mid-approval-card, breaking the interactive display. 
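The approval gate and the yes/always re-run model from the approval-flow commit above can be sketched as follows — the enum and class names are illustrative, not the bridge's actual types:

```python
from enum import Enum, auto

class Approval(Enum):
    ALWAYS = auto()       # always blocks at the gate
    UNLESS_AUTO = auto()  # blocked unless previously auto-approved
    NEVER = auto()        # executes freely

class ApprovalGate:
    """Sketch of the bridge's approval check plus the re-run design:
    instead of pausing the VM, approve the tool and re-run the message."""

    def __init__(self):
        self.auto_approved = set()

    def check(self, tool: str, requirement: Approval) -> bool:
        if requirement is Approval.NEVER:
            return True
        if requirement is Approval.UNLESS_AUTO and tool in self.auto_approved:
            return True
        return False  # maps to EngineError::LeaseDenied in the real bridge

    def approve(self, tool: str):
        # User replied "yes"/"always": the second run passes immediately.
        self.auto_approved.add(tool)

gate = ApprovalGate()
assert not gate.check("shell", Approval.UNLESS_AUTO)  # first run blocks
gate.approve("shell")                                  # user approved
assert gate.check("shell", Approval.UNLESS_AUTO)       # re-run passes
assert not gate.check("transfer", Approval.ALWAYS)     # Always never auto-passes
```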
Fix: all logging in trace.rs changed from info!/warn! to debug!/warn!. Trace analysis and reflection results now only show when RUST_LOG=ironclaw_engine=debug is set. Also added logging discipline rule to global CLAUDE.md: - info! → user-facing status the REPL intentionally renders - debug! → internal diagnostics (traces, reflection, engine internals) - Background tasks must NEVER use info! — it breaks the TUI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): demote all router info! logging to debug! "engine v2: initializing" and "engine v2: handling message" were printing at INFO level, corrupting the REPL UI. All router logging now uses debug! — only visible with RUST_LOG=ironclaw=debug. Zero info! calls remain in crates/ironclaw_engine/ or src/bridge/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(safety): demote leak detector warn-action logs from warn! to debug! The leak detector's Warn-action matches (high_entropy_hex pattern on web search results containing commit SHAs, CSS colors, URL hashes) were logging at warn! level, corrupting the REPL UI with lines like: WARN Potential secret leak detected pattern=high_entropy_hex preview=a96f********cee5 These are informational false positives — real leaks use LeakAction::Redact which silently modifies the content. Warn-action matches only log for debugging purposes and should not appear in production output. Changed to debug! level — visible with RUST_LOG=ironclaw_safety=debug. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): strengthen CodeAct prompt to prevent shallow text answers The model was answering "Suggested 45 improvements" as a brief text summary from training data without actually searching or listing them. The trace showed: no code block, no tool calls, no FINAL(). Prompt changes: - Rule 1: "ALWAYS respond with a ```repl code block. NEVER answer with plain text only." (was: "Always write code... 
plain text for brief explanations") - Rule 2 (NEW): "NEVER answer from memory or training data alone. Always use tools to get real, current information before answering." - Rule 3: FINAL answer "should be detailed and complete — not just a summary like 'found 45 items'" - Rule 8 (NEW): "Include the actual content in your FINAL() answer, not just a count or summary. Users want to see the details." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): persist reflection docs to workspace for cross-session learning Replaces InMemoryStore with HybridStore: - Ephemeral data (threads, steps, events, leases) stays in-memory - MemoryDocs (lessons, specs, playbooks from reflection) persist to the workspace at engine/docs/{type}/{id}.json On engine init, load_docs_from_workspace() reads existing docs back into the in-memory cache. This means: - Lessons learned in session 1 are available in session 2 - The RetrievalEngine injects relevant past lessons into new threads - The engine genuinely improves over time as reflection accumulates Workspace paths: engine/docs/lessons/{uuid}.json engine/docs/specs/{uuid}.json engine/docs/playbooks/{uuid}.json engine/docs/summaries/{uuid}.json engine/docs/issues/{uuid}.json No new database tables. Uses existing workspace write/read/list. workspace() accessor widened to pub(crate). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(bridge): adapt to execute_tool_with_safety params-by-value change Staging merge changed execute_tool_with_safety to take params by value instead of by reference (perf optimization from PR #926). Updated bridge adapter to clone params before passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs(engine): add web gateway integration plan to Phase 6 Documents three gaps between engine v2 and the web gateway: 1. No SSE streaming (engine emits ThreadEvent, gateway expects SseEvent) 2. 
No conversation persistence (engine uses HybridStore, gateway reads v1 DB)
3. No cross-channel visibility (REPL ↔ web messages invisible to each other)

Implementation plan: bridge ThreadEvent→AppEvent, write messages to v1 conversation tables after thread completion. Prerequisite: AppEvent extraction PR (in progress separately).

Also updated DB persistence status: HybridStore with workspace-backed MemoryDocs is now implemented (partial persistence).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(engine): document routine/job gap and SIGKILL crash scenario

Routines are entirely v1 — not hooked up to engine v2. When a user asks "create a routine" as natural language, engine v2 tries to call routine_create via CodeAct, but the tool needs RoutineEngine + Database refs that the bridge's minimal JobContext doesn't provide. This caused a SIGKILL crash during testing.

Options documented: block routine tools in v2 (short term), pass refs through context (medium term), replace with the Mission system (long term).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract AppEvent to crates/ironclaw_common

SseEvent was defined in src/channels/web/types.rs but imported by 12+ modules across agent, orchestrator, worker, tools, and extensions — it had become the application-wide event protocol, not a web transport concern.

Create crates/ironclaw_common as a shared workspace crate and move the enum there as AppEvent. Also move the truncate_preview utility, which had similarly leaked from the web gateway into agent modules.

- New crate: crates/ironclaw_common (AppEvent, truncate_preview)
- Rename SseEvent → AppEvent, from_sse_event → from_app_event
- web/types.rs re-exports AppEvent for internal gateway use
- web/util.rs re-exports truncate_preview
- Wire format unchanged (serde renames are on variants, not the enum)

Aligned with the event bus direction on refactor/architectural-hardening, where DomainEvent (≡ AppEvent) is wrapped in a SystemEvent envelope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): integrate with web gateway via AppEvent + v1 conversation DB

Three changes to make engine v2 visible in the web gateway:

1. SSE event streaming (AppEvent broadcast):
- ThreadEvent → AppEvent conversion via thread_event_to_app_event()
- Events broadcast to SseManager during the poll loop
- Covers: Thinking, ToolCompleted (success/error), Status, Response
- Web gateway receives real-time progress without any gateway changes

2. Conversation persistence to v1 database:
- After a thread completes, writes user message + agent response to the v1 ConversationStore via add_conversation_message()
- Uses get_or_create_assistant_conversation() for per-user per-channel conversations
- Web gateway reads from the DB as usual — chat history appears

3. Final response broadcast:
- AppEvent::Response with full text + thread_id sent via SSE
- Web gateway renders the response in the chat UI

New EngineState fields: sse (Option<Arc<SseManager>>), db (Option<Arc<dyn Database>>). Both populated from Agent.deps. Agent.deps visibility widened to pub(crate).

Depends on: ironclaw_common crate with AppEvent type (PR #1615).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(bridge): complete Phase 6 — v1-only tool blocking, rate limiting, call limits

Three security/stability improvements in EffectBridgeAdapter:

1.
V1-only tool blocking: - routine_create, create_job, build_software (and hyphenated variants) return helpful error: "use the slash command instead" - Filtered out of available_actions() so system prompt doesn't list them - Prevents crash from tools needing RoutineEngine/Scheduler refs 2. Per-step tool call limit: - Max 50 tool calls per code block (AtomicU32 counter) - Prevents amplification: `for i in range(10000): shell(...)` - Returns "call limit reached, break into multiple steps" 3. Rate limiting: - Per-user per-tool sliding window via RateLimiter - Checks tool.rate_limit_config() before every execution - Returns "rate limited, try again in Ns" Architecture plan updated: - Gateway integration: DONE - Routines: BLOCKED (gracefully, with slash command fallback) - Rate limiting: DONE - Call limit: DONE - Phase 6 status: DONE (remaining: acceptance tests, two-phase commit) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add Mission system design — goal-oriented autonomous threads Missions replace routines with evolving, knowledge-accumulating autonomous agents. Unlike routines (fixed prompt, stateless), Missions: - Generate prompts from accumulated Project knowledge (lessons, playbooks, issues from prior threads) - Adapt approach when something fails repeatedly - Track progress toward a goal with success criteria - Self-manage: pause when stuck, complete when goal achieved Architecture: MissionManager with cron ticker spawns threads via ThreadManager. Meta-prompt built from mission goal + Project MemoryDocs via RetrievalEngine. Reflection feeds back automatically. 6-step implementation plan: cron trigger, meta-prompt builder, bridge wiring, CodeAct tools, progress tracking, persistence. Includes two worked examples: daily tech news briefing (ongoing) and test coverage improvement (goal-driven, self-completing). 
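Backing up to the Phase 6 bridge commit: the per-step call limit and sliding-window rate limiting can be sketched together — `StepGuard` is a hypothetical name, and the window/limit values here are placeholders, not the real rate_limit_config():

```python
import time
from collections import deque

MAX_CALLS_PER_STEP = 50  # per-code-block cap, as in the bridge

class StepGuard:
    """Per-step call counter plus a sliding-window rate limiter (sketch)."""

    def __init__(self, window_s=60.0, max_in_window=10):
        self.calls_this_step = 0
        self.window_s = window_s
        self.max_in_window = max_in_window
        self.timestamps = deque()

    def reset_step(self):
        # Counter reset before each thread execution
        self.calls_this_step = 0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.calls_this_step >= MAX_CALLS_PER_STEP:
            return False  # "call limit reached, break into multiple steps"
        while self.timestamps and now - self.timestamps[0] > self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_in_window:
            return False  # "rate limited, try again in Ns"
        self.calls_this_step += 1
        self.timestamps.append(now)
        return True

g = StepGuard(window_s=60.0, max_in_window=3)
assert all(g.allow(now=t) for t in (0.0, 1.0, 2.0))
assert not g.allow(now=3.0)   # fourth call inside the window is limited
assert g.allow(now=70.0)      # window has slid; allowed again
```

This caps amplification like `for i in range(10000): shell(...)` without touching the engine crate.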
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): extend Mission types with webhook/event triggers + evolving strategy Mission types updated to support external activation sources: MissionCadence expanded: - Cron { expression, timezone } — timezone-aware scheduling - OnEvent { event_pattern } — channel message pattern matching - OnSystemEvent { source, event_type } — structured events from tools - Webhook { path, secret } — external HTTP triggers (GitHub, email, etc.) - Manual — explicit triggering only The engine defines trigger TYPES. The bridge implements infrastructure (cron ticker, webhook endpoints, event matchers). GitHub issues, PRs, email, Slack events all use the generic Webhook cadence — no special-casing in the engine. Webhook payload injected as state["trigger_payload"] in the thread's Python context. Mission struct extended: - current_focus: what the next thread should work on (evolving) - approach_history: what we've tried (for adaptation) - max_threads_per_day / threads_today: daily budget - last_trigger_payload: webhook/event data for thread context Plan updated with trigger type table and webhook integration design. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): implement MissionManager execution with meta-prompts The MissionManager now builds evolving meta-prompts and processes thread outcomes for continuous learning: fire_mission() upgraded: - Loads Project MemoryDocs via RetrievalEngine for context - Builds meta-prompt from: goal, current_focus, approach_history, project knowledge docs, trigger payload, thread count - Spawns thread with meta-prompt as user message - Background task waits for completion and processes outcome - Daily thread budget enforcement (max_threads_per_day) Meta-prompt structure: # Mission: {name} Goal: {goal} ## Current Focus (evolves between threads) ## Previous Approaches (what we've tried) ## Knowledge from Prior Threads (lessons, playbooks, issues) ## Trigger Payload (webhook/event data if applicable) ## Instructions (accomplish step, report next focus, check goal) Outcome processing: - Extracts "next focus:" from FINAL() response → updates current_focus - Detects "goal achieved: yes" → completes mission - Records accomplishment in approach_history - Failed threads recorded as "FAILED: {error}" Cron ticker: - start_cron_ticker() spawns tokio task, ticks every 60s - Checks active Cron missions, fires those past next_fire_at 151 tests passing. 
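The outcome-processing rules above ("next focus:" extraction, "goal achieved: yes" detection, approach history) can be sketched as follows — function and field names are illustrative, not the MissionManager's actual API:

```python
import re

def process_outcome(final_text: str, mission: dict) -> dict:
    """Sketch of outcome processing on a FINAL() response."""
    # "Next focus: X" → update the mission's evolving focus
    m = re.search(r"next focus:\s*(.+)", final_text, re.IGNORECASE)
    if m:
        mission["current_focus"] = m.group(1).strip()
    # "Goal achieved: yes" → complete the mission
    if re.search(r"goal achieved:\s*yes", final_text, re.IGNORECASE):
        mission["status"] = "completed"
    # Record the accomplishment for future meta-prompts
    mission.setdefault("approach_history", []).append(final_text[:80])
    return mission

mission = {"status": "active"}
process_outcome("Wrote briefing. Next focus: cover hardware news.", mission)
assert mission["current_focus"] == "cover hardware news."
assert mission["status"] == "active"
process_outcome("All criteria met. Goal achieved: yes", mission)
assert mission["status"] == "completed"
```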
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): wire MissionManager into engine v2 for CodeAct access Missions are now callable from CodeAct Python code: ```python # Create a daily briefing mission result = mission_create( name="Tech News", goal="Daily AI/crypto/software news briefing", cadence="0 9 * * *" ) # List all missions missions = mission_list() # Manually fire a mission mission_fire(id="...") # Pause/resume mission_pause(id="...") mission_resume(id="...") ``` Implementation: - MissionManager created on engine init, cron ticker started - EffectBridgeAdapter intercepts mission_* function calls before tool lookup and routes to MissionManager - parse_cadence() handles: "manual", cron expressions, "event:pattern", "webhook:path" - Mission functions documented in CodeAct system prompt - MissionManager set on adapter via set_mission_manager() after init (avoids circular dependency) System prompt updated with mission_create, mission_list, mission_fire, mission_pause, mission_resume documentation. 151 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(bridge): map routine_* calls to mission operations in v2 When the model calls routine_create, routine_list, routine_fire, routine_pause, routine_resume, or routine_delete, the bridge now routes them to the MissionManager instead of blocking with an error. Mapping: routine_create → mission_create (with cadence parsing) routine_list → mission_list routine_fire → mission_fire routine_pause → mission_pause routine_resume → mission_resume routine_update → mission_pause/resume (based on params) routine_delete → mission_complete (marks as done) Routine tools removed from v1-only blocklist and restored in available_actions(). The model can use either "routine" or "mission" vocabulary — both work. Still blocked: create_job, cancel_job, build_software (need v1 Scheduler/ContainerJobManager refs). 
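The cadence parsing described above ("manual", cron expressions, "event:pattern", "webhook:path") can be sketched as follows — the tuple encoding is illustrative, standing in for the MissionCadence enum:

```python
def parse_cadence(spec: str):
    """Sketch of parse_cadence: 'manual', 'event:<pattern>',
    'webhook:<path>', otherwise treated as a cron expression."""
    if spec == "manual":
        return ("Manual",)
    if spec.startswith("event:"):
        return ("OnEvent", spec[len("event:"):])
    if spec.startswith("webhook:"):
        return ("Webhook", spec[len("webhook:"):])
    return ("Cron", spec)

assert parse_cadence("manual") == ("Manual",)
assert parse_cadence("event:thread_completed") == ("OnEvent", "thread_completed")
assert parse_cadence("webhook:/github") == ("Webhook", "/github")
assert parse_cadence("0 9 * * *") == ("Cron", "0 9 * * *")
```

Note the cron fallback is untyped here; the real parser presumably validates the expression before accepting it.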
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(engine): add E2E mission flow tests — 6 new tests

Comprehensive mission lifecycle tests:
- fire_mission_builds_meta_prompt_with_goal: verifies thread spawned with project context and recorded in history
- outcome_processing_extracts_next_focus: "Next focus: X" in FINAL() response → mission.current_focus updated
- outcome_processing_detects_goal_achieved: "Goal achieved: yes" → mission status transitions to Completed
- mission_evolves_via_direct_outcome_processing: 3-step evolution: step 1 sets focus to "db module", step 2 evolves to "tools module", step 3 detects goal achieved → mission completes. Tests the full learning loop without background-task timing dependencies.
- fire_with_trigger_payload: webhook payload stored on mission and threads_today counter incremented
- daily_budget_enforced: max_threads_per_day=1 → first fire succeeds, second returns None

157 tests passing (151 prior + 6 new mission E2E).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(engine): self-improving engine via Mission system

Wire the self-improvement loop as a Mission with OnSystemEvent cadence, inspired by karpathy/autoresearch's program.md approach. The mission fires when threads complete with issues, receives trace data as trigger payload, and uses tools directly to diagnose and fix problems.
Key changes: Engine self-improvement (Phase A+B from design doc): - Add fire_on_system_event() to MissionManager for OnSystemEvent cadence - Add start_event_listener() that subscribes to thread events and fires matching missions when non-Mission threads complete with trace issues - Add ensure_self_improvement_mission() with autoresearch-style goal prompt (concrete loop steps, not vague instructions) - Add process_self_improvement_output() for structured JSON fallback - Seed fix pattern database with 8 known patterns from debugging - Runtime prompt overlay via MemoryDoc (build_codeact_system_prompt now async + Store-aware, appends learned rules from prompt_overlay docs) - Pass Store to ExecutionLoop for overlay loading Bridge review fixes (P1/P2): - Scope engine v2 SSE events to requesting user (broadcast_for_user) - Per-user pending approvals via HashMap instead of global Option - Reset tool-call limit counter before each thread execution - Only persist auto-approval when user chose "always", not one-off "yes" - Remove dead store/mission_manager fields from EngineState Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add checkpoint-based engine thread recovery * feat(engine): add Python orchestrator module and host functions Add the orchestrator infrastructure for replacing the Rust execution loop with versioned Python code. This commit adds the module and host functions without switching over — the existing Rust loop is unchanged. 
New files: - orchestrator/default.py: v0 Python orchestrator (run_loop + helpers) - executor/orchestrator.rs: host function dispatch, orchestrator loading from Store with version selection, OrchestratorResult parsing Host functions exposed to orchestrator Python via Monty suspension: __llm_complete__, __execute_code_step__ (nested Monty VM), __execute_action__, __check_signals__, __emit_event__, __add_message__, __save_checkpoint__, __transition_to__, __retrieve_docs__, __check_budget__, __get_actions__ Also makes json_to_monty, monty_to_json, monty_to_string pub(crate) in scripting.rs for cross-module use. Design doc: docs/plans/2026-03-25-python-orchestrator.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): switch ExecutionLoop::run() to Python orchestrator Replace the 900-line Rust execution loop with a ~80-line bootstrap that loads and runs the versioned Python orchestrator via Monty VM. The orchestrator Python code (orchestrator/default.py) is the v0 compiled-in version. Runtime versions can override it via MemoryDoc storage (orchestrator:main with tag orchestrator_code). Key fixes during switchover: - Use ExtFunctionResult::NotFound for unknown functions so Monty falls through to Python-defined functions (extract_final, etc.) - Move helper function definitions above run_loop for Monty scoping - Use FINAL result value (not VM return value) in Complete handler - Rename 'final' variable to 'final_answer' to avoid Python keyword Status: 171/177 tests pass. 6 remaining failures are step_count and token tracking bookkeeping — the orchestrator manages these internally but doesn't yet update the thread's counters via host functions. 
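The NotFound fallthrough mentioned in the switchover commit can be sketched as a dispatch table with a sentinel — the table contents and `NotFound` class are illustrative, standing in for ExtFunctionResult::NotFound:

```python
class NotFound:
    """Sentinel: let the VM fall through to a Python-defined function."""

# Hypothetical host-function table; the real set is listed in the commit above.
HOST_FUNCTIONS = {
    "__llm_complete__": lambda prompt: f"completion for: {prompt}",
    "__check_budget__": lambda: True,
}

def dispatch(name, *args):
    """Sketch of host-function dispatch: unknown names return NotFound so
    orchestrator-defined helpers (e.g. extract_final) still resolve in
    Python rather than erroring at the host boundary."""
    fn = HOST_FUNCTIONS.get(name)
    if fn is None:
        return NotFound()
    return fn(*args)

assert dispatch("__check_budget__") is True
assert isinstance(dispatch("extract_final"), NotFound)
```

Returning a sentinel instead of raising is the key design choice: it lets the same name space be shared between host functions and orchestrator Python.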
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(engine): all 177 tests pass with Python orchestrator - Increment step_count and track tokens in __emit_event__("step_completed") so thread bookkeeping matches the old Rust loop behavior - Remove double-counting of tokens in bootstrap (orchestrator handles it) - Match nudge text to existing TOOL_INTENT_NUDGE constant - Fix FINAL result propagation (use stored final_result, not VM return) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): orchestrator versioning, auto-rollback, and tests Add version lifecycle for the Python orchestrator: - Failure tracking via MemoryDoc (orchestrator:failures) - Auto-rollback: after 3 consecutive failures, skip the latest version and fall back to previous (or compiled-in v0) - Success resets the failure counter - OrchestratorRollback event for observability Update self-improvement Mission goal with Level 1.5 instructions for orchestrator patches — the agent can now modify the execution loop itself via memory_write with versioned orchestrator docs. 12 new tests: version selection (highest wins), rollback after failures, rollback to default, failure counting/resetting, outcome parsing for all 5 ThreadOutcome variants. 189 tests pass, zero clippy warnings. 
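The version-selection and auto-rollback rule above (highest version wins, skip after 3 consecutive failures, fall back to the compiled-in v0) can be sketched as — the function name and fallback string are illustrative:

```python
MAX_CONSECUTIVE_FAILURES = 3

def select_orchestrator(versions, failures):
    """Pick the highest version whose consecutive-failure count is below the
    threshold; fall back to the compiled-in default (sketch of the rule).

    versions: {version_number: code}; failures: {version_number: count}.
    """
    for v in sorted(versions, reverse=True):
        if failures.get(v, 0) < MAX_CONSECUTIVE_FAILURES:
            return versions[v]
    return "v0-compiled-in-default"

versions = {1: "orch-v1", 2: "orch-v2"}
assert select_orchestrator(versions, {}) == "orch-v2"       # highest wins
assert select_orchestrator(versions, {2: 3}) == "orch-v1"   # auto-rollback
assert select_orchestrator(versions, {1: 3, 2: 3}) == "v0-compiled-in-default"
```

A success resetting the failure counter (as the commit describes) would simply set `failures[v] = 0` after a clean run.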
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add engine v2 architecture, self-improvement, and dev history Three new docs for contributors: - engine-v2-architecture.md: Two-layer architecture (Rust kernel + Python orchestrator), five primitives, execution model with nested Monty VMs, bridge layer, memory/reflection, missions, capabilities - self-improvement.md: Three improvement levels (prompt/orchestrator/ config/code), autoresearch-inspired Mission loop, versioned orchestrator with auto-rollback, fix pattern database, safety model - development-history.md: Summary of 6 Claude Code sessions that built the system, key design decisions and debugging moments, architecture evolution from 900-line Rust loop to Python orchestrator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(engine): complete v2 side-by-side integration with gateway API Wire engine v2 into the full submission pipeline and expose threads, projects, and missions through the web gateway REST API. Bridge routing — route ExecApproval, Interrupt, NewThread, and Clear submissions to engine v2 when ENGINE_V2=true. Previously only UserInput and ApprovalResponse were handled; all other control commands fell through to disconnected v1 sessions. Bridge query layer — add 11 read-only query functions and 6 DTO types so gateway handlers can inspect engine state (threads, steps, events, projects, missions) without direct access to the EngineState singleton. Gateway endpoints — new /api/engine/* routes: GET /threads, /threads/{id}, /threads/{id}/steps, /threads/{id}/events GET /projects, /projects/{id} GET /missions, /missions/{id} POST /missions/{id}/fire, /missions/{id}/pause, /missions/{id}/resume SSE events — add ThreadStateChanged, ChildThreadSpawned, and MissionThreadSpawned AppEvent variants. Expand the bridge event mapper to forward StateChanged and ChildSpawned engine events to the browser. 
  Engine crate — add ConversationManager::clear_conversation() for /new and /clear commands.

  Code quality — replace 10 .expect() calls with proper error returns, remove dead AgentConfig.engine_v2 field, log silent init errors, fix duplicate doc comment, improve fallthrough documentation.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): empty call_id on ActionResult and trace analyzer false positives

  Fix structured executor not stamping call_id onto ActionResult — the EffectExecutor trait doesn't receive call_id, so the structured executor must copy it from the original ActionCall after execution. Empty call_id caused OpenAI-compatible providers to reject the next LLM request with "Invalid 'input[2].call_id': empty string".

  Fix trace analyzer false positives:

  - code_error check now only scans User-role code output messages (prefixed with [stdout]/[stderr]/[code ]/Traceback), not System prompt which contains example error text
  - missing_tool_output check now recognizes ActionResult messages as valid tool output (Tier 0 structured path)
  - Add NotImplementedError to detected code error patterns

  New trace checks:

  - empty_call_id: detect ActionResult messages with missing/empty call_id before they reach the LLM API (severity: Error)
  - llm_error: extract LLM provider errors from Failed state reason
  - orchestrator_error: extract orchestrator errors from Failed state

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): add Missions tab to gateway UI

  Add a full Missions page to the web gateway with list view, detail view, and action buttons (Fire, Pause, Resume).

  Backend: add /api/engine/missions/summary endpoint returning counts by status (active/paused/completed/failed).
  Frontend:

  - New "Missions" tab between Jobs and Routines
  - Summary cards showing mission counts by status
  - Table with name, goal, cadence type, thread count, status, actions
  - Detail view with goal, cadence, current focus, success criteria, approach history, spawned thread list, and action buttons
  - Fire/Pause/Resume actions with toast notifications
  - i18n support (English + Chinese)
  - CSS following the existing routines/jobs patterns

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): eagerly initialize engine v2 at startup

  The gateway API endpoints (/api/engine/missions, etc.) call bridge query functions that return empty results when the engine state hasn't been initialized yet. Previously, initialization only happened lazily on the first chat message via handle_with_engine(). Now when ENGINE_V2=true, the engine is initialized in Agent::run() before channels start, so the self-improvement mission and other engine state are available to gateway API endpoints immediately.

  Also rename get_or_init_engine → init_engine and make it public so it can be called from agent_loop.rs at startup.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): improve mission detail with markdown goal and thread table

  - Goal rendered as full-width markdown block instead of plain-text meta item (uses existing renderMarkdown/marked)
  - Current focus and success criteria also rendered as markdown
  - Spawned threads shown as a clickable table with goal, type, state, steps, tokens, and created date instead of a UUID list
  - Clicking a thread row opens an inline thread detail view showing metadata grid and full message history with markdown rendering
  - Back button returns to the mission detail view
  - Backend: mission detail now returns full thread summaries (goal, state, step_count, tokens) instead of just thread IDs

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): close SSE connections on page unload to prevent connection starvation

  The browser limits concurrent HTTP/1.1 connections per origin to 6. Without cleanup, SSE connections from prior page loads linger after refresh/navigation, eating into the pool. After 2-3 refreshes, all 6 slots are consumed by stale SSE streams and new API fetch calls queue indefinitely — the UI shows "connected" (SSE works) but data never loads.

  Add a beforeunload handler that closes both eventSource (chat events) and logEventSource (log stream) so the browser can reuse connections immediately on page reload.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(web): support multiple gateway tabs by reducing SSE connections

  Each browser tab opened 2 SSE connections (chat events + log events). With the HTTP/1.1 per-origin limit of 6, the 3rd tab exhausted the pool and couldn't load any data. Three changes:

  1. Lazy log SSE — only connect when the logs tab is active, disconnect when switching away. Most users rarely view logs, so this saves a connection slot per tab.
  2. Visibility API — close SSE when the browser tab goes to background (user switches to another tab), reconnect when it becomes visible. Background tabs don't need real-time events.
  3. Combined with the existing beforeunload cleanup, this means:
     - Active foreground tab: 1 connection (chat SSE only, +1 if logs tab)
     - Background tabs: 0 connections
     - Closed/refreshed tabs: 0 connections (beforeunload cleanup)

  This allows many gateway tabs to coexist within the 6-connection limit.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(engine): route messages to correct conversation by thread scope

  Messages sent from a new conversation in the gateway always appeared in the default assistant conversation because handle_with_engine ignored the thread_id from the frontend. Two fixes:

  1. Engine conversation scoping — when the message carries a thread_id (from the frontend's conversation picker), use it as part of the engine conversation key: "gateway:<thread_id>" instead of just "gateway". This creates a distinct engine conversation per v1 thread, so messages don't cross-contaminate.
  2. V1 dual-write targeting — write user messages and assistant responses to the v1 conversation matching the thread_id (via ensure_conversation), not the hardcoded assistant conversation. Falls back to the assistant conversation when no thread_id is present (e.g., default chat).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(web): richer activity indicators for engine v2 execution

  The gateway UI showed only generic "Thinking..." during engine v2 execution with no visibility into CodeAct code execution, tool calls, or reflection. Now the event mapping produces detailed status updates:

  Step lifecycle:

  - "Calling LLM..." when a step starts (was "Thinki…
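The conversation-key scoping from the thread-routing fix above boils down to one small decision. A minimal sketch, assuming a helper function (the name `engine_conversation_key` is illustrative, not the real API; the `"gateway:<thread_id>"` key format is the one the commit describes):

```rust
// Sketch of thread-scoped conversation keying: a message carrying a
// frontend thread_id gets its own engine conversation key, so distinct
// v1 threads never share state; without one it falls back to the
// shared default key. Function name is hypothetical.
fn engine_conversation_key(thread_id: Option<&str>) -> String {
    match thread_id {
        Some(id) => format!("gateway:{id}"),
        None => "gateway".to_string(),
    }
}

fn main() {
    // Distinct v1 threads map to distinct engine conversations...
    assert_eq!(engine_conversation_key(Some("t-42")), "gateway:t-42");
    // ...while the default chat keeps the shared key.
    assert_eq!(engine_conversation_key(None), "gateway");
}
```

Keying by thread_id rather than channel name is what prevents the cross-contamination the commit fixes: two gateway conversations previously collapsed onto the single "gateway" key.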
Summary
- `crates/ironclaw_common` with `AppEvent` (renamed from `SseEvent`) and `truncate_preview` — shared types/utils that were leaking from the web gateway into 12+ modules
- `SseEvent` → `AppEvent` across 23 files — the enum was the app-wide event protocol, not a web transport concern
- `from_sse_event()` → `from_app_event()` on `WsServerMessage`
- `web/types.rs` and `web/util.rs` re-export from `ironclaw_common` for backward compat within the gateway

Motivation
`SseEvent` was defined in `src/channels/web/types.rs` but imported by agent, orchestrator, worker, tools, extensions, and CLI modules. This coupling would block extracting the web gateway into its own crate. Moving the event type to a neutral shared crate decouples the event protocol from the transport.

Aligned with the event bus direction on `refactor/architectural-hardening`, where `DomainEvent` (≡ `AppEvent`) is wrapped in a `SystemEvent` envelope with category-based sink routing.

Test plan
- `cargo check` — clean compile
- `cargo clippy --all --benches --tests --examples --all-features` — zero warnings
- `cargo fmt -- --check` — clean
- `cargo test -p ironclaw_common` — 11 tests pass (`truncate_preview`)
- `cargo test -p ironclaw --lib -- channels::web` — 127 gateway tests pass
- `cargo test -p ironclaw --lib -- job_monitor` — 9 tests pass
- `cargo test` suite — 3576+ tests pass, 0 failures

🤖 Generated with Claude Code
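The backward-compat re-export listed in the summary can be illustrated with a self-contained sketch. Module names here stand in for the real crates, and `AppEvent`'s variants and derives are placeholders (the real enum has many variants and derives `Serialize`); the point is only that old `web::types::AppEvent` import paths keep resolving to the moved type.

```rust
// Sketch of the re-export pattern: the shared "common" module owns the
// event type, and the gateway's types module re-exports it so existing
// import paths keep compiling. Variants and derives are illustrative.
mod ironclaw_common {
    #[derive(Debug, Clone, PartialEq)]
    pub enum AppEvent {
        ThreadStateChanged { thread_id: String, state: String },
    }
}

mod web {
    pub mod types {
        // Backward compat: old `web::types::AppEvent` imports still work.
        pub use crate::ironclaw_common::AppEvent;
    }
}

fn main() {
    // The old path and the new path name the same type.
    let old = web::types::AppEvent::ThreadStateChanged {
        thread_id: "t1".into(),
        state: "running".into(),
    };
    let new = ironclaw_common::AppEvent::ThreadStateChanged {
        thread_id: "t1".into(),
        state: "running".into(),
    };
    assert_eq!(old, new);
}
```

Because `pub use` re-exports the item rather than copying it, there is a single `AppEvent` type in the program, which is why gateway code and the 12+ other consumers stay interoperable without a migration.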