docs(plan): update engine v2 architecture to match verified reality by ilblackdragon · Pull Request #2801 · nearai/ironclaw

ilblackdragon · 2026-04-21T17:38:39Z

Rewrites sections of docs/plans/2026-03-20-engine-v2-architecture.md that had drifted from the code. No code changes.

What this fixes

A verification pass against the actual implementation showed that several items the plan marked as "pending" or "not yet implemented" were in fact shipped. Future readers relying on the plan would redo the same verification work.

Sections rewritten

| Section | Previous claim | Actual state |
|---|---|---|
| §4.3 Compaction | "orchestrator should own…" (aspirational) | IMPLEMENTED in `orchestrator/default.py:240-310` |
| §4.9 Tool reliability | "whether to inject them… remaining question" | `ReliabilityTracker` exists; integration tracked in #2800 PR-B |
| §6.7 Routines / Jobs | "blocked in engine v2 with a helpful error" | `routine_*` already aliased to `mission_*`; `create_job` aliasing in #2800 PR-C |
| §6.7 Two-phase commit | "NOT YET IMPLEMENTED" | IMPLEMENTED via unified gate (`policy.rs:126-169` + `structured.rs:139-171`) |
| §6.7 Acceptance testing | "IN PROGRESS" with 6-item coverage list | pointer to `with_engine_v2` harness; coverage list moved to #2800 PR-D |
| §7 Cleanup and Migration | "Planned" monolith | split: 7a engine-side DONE, 7b host-cleanup blocked on default flip |
| Status header + Implementation Progress table | phase 6 "PARTIAL", 7/8 "Planned" | phase 6 DONE, 7a DONE, 7b blocked on flip, 8 in progress |

Tracking

Part of umbrella issue #2800 (engine v2 default flip).

Test plan

- `grep -n "NOT YET IMPLEMENTED\|Planned\|aspirational" docs/plans/2026-03-20-engine-v2-architecture.md` → clean except intentional history references
- File paths and line numbers in the doc resolve to existing code

🤖 Generated with Claude Code

…ality

The plan doc claimed several items as missing/pending that are already implemented. Update to match ground truth so future readers don't redo the verification pass.

Changes:
- Compaction (§4.3): marked DONE, pointer to orchestrator/default.py:240-310
- Tool reliability (§4.9): tracker exists; integration tracked in #2800 PR-B
- Routines/Jobs (§6.7): routine_to_mission_alias already translates routine_* calls; create_job aliasing tracked in #2800 PR-C
- Two-phase commit (§6.7): marked IMPLEMENTED via unified gate (policy.rs:126-169 + structured.rs:139-171); simulate/preview intentionally not added at policy layer
- Acceptance testing (§6.7): pointer to with_engine_v2 harness; coverage expansion tracked in #2800 PR-D
- Phase 7: split into 7a (engine-side, DONE) and 7b (host cleanup, blocked on default flip)
- Status header + Implementation Progress table: updated to match current state; default-flip work consolidated under issue #2800

No code changes.

Refs: #2800

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the Engine v2 architecture documentation to reflect the current implementation status, specifically noting that Phase 7 engine-side cleanup is complete and Phase 8 infrastructure integration is underway. Key technical updates include documenting the Python-based compaction logic, the implementation of the ReliabilityTracker for tool performance, and the unified gate mechanism for two-phase commits. The review feedback identifies several technical inaccuracies in the documentation regarding token estimation ownership, non-existent file references, the current state of the tool denylist, and the specific provenance triggers for effect approvals.

gemini-code-assist · 2026-04-21T17:41:30Z

+- Stores a snapshot in state history for audit
+- Full prior trajectory stays searchable via workspace-backed retrieval; raw history is not replayed into the attention window
+
+Rust side provides token estimation, retrieval helpers, and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM.


The document states that the Rust side provides token estimation, but the implementation in orchestrator/default.py (lines 226-238) uses a Python-side heuristic (estimate_context_tokens) instead of a host function. The description should be updated to reflect that the orchestrator performs this estimation.

Suggested change

Rust side provides token estimation, retrieval helpers, and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM.

Rust side provides retrieval helpers and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM and performs token estimation via a heuristic.

Fixed in 1b9c47c — moved token estimation ownership to the Python orchestrator and left Rust with retrieval helpers + final transcript commit points only.
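
For readers unfamiliar with this kind of estimate, a character-count heuristic might look like the sketch below. The 4-characters-per-token ratio, the function name, and the message shape are all assumptions made for illustration; they are not taken from `estimate_context_tokens` in `orchestrator/default.py`.

```python
# Hypothetical sketch only: the ratio and message shape are assumptions,
# not the real orchestrator code.
CHARS_PER_TOKEN = 4  # assumed average; the actual heuristic may differ


def estimate_tokens_sketch(messages):
    """Roughly estimate the token count of a list of {"role", "content"} dicts."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    # Ceiling division, so a short non-empty message still counts as one token.
    return -(-total_chars // CHARS_PER_TOKEN)
```

Because the estimate needs no tokenizer and no host call, it can run entirely inside the Python orchestrator, which is consistent with the ownership split described in the fix.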

gemini-code-assist · 2026-04-21T17:41:30Z

+
+Rust side provides token estimation, retrieval helpers, and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM.
+
+Note: the crate-structure block above mentions `executor/compaction.rs` — that file was never created. Compaction lives entirely in Python; the Rust side only exposes the primitives the Python orchestrator calls.


The note mentions that the crate-structure block above includes executor/compaction.rs, but that file is not present in the structure listed in lines 97-103. This reference should be removed or updated to avoid confusion.

Suggested change

Note: the crate-structure block above mentions `executor/compaction.rs` — that file was never created. Compaction lives entirely in Python; the Rust side only exposes the primitives the Python orchestrator calls.

Note: Compaction lives entirely in Python; the Rust side only exposes the primitives the Python orchestrator calls.

Fixed in 1b9c47c — dropped the stale cross-reference. Now just notes that compaction lives entirely in Python and the Rust side exposes the primitives the orchestrator calls.

gemini-code-assist · 2026-04-21T17:41:30Z

- Routines still work via `/routine` slash commands (fall through to v1)
- Engine v2 Mission APIs exist and are wired through the bridge; remaining work is migration/UX convergence rather than greenfield implementation
+- `routine_create` / `routine_update` / `routine_list` / etc. are translated to mission_* dispatches via `routine_to_mission_alias()` in `src/bridge/effect_adapter.rs` before the v1-denylist check fires. The LLM-facing routine tools go through the mission manager in v2, not the v1 routine engine.
+- Tracked in issue #2800 (PR-C): extend the alias to cover `create_job` / `cancel_job` as well. Only `build_software` remains hard-denylisted as v1-specific infra.


The statement "Only build_software remains hard-denylisted" is slightly premature as it describes the desired state after PR-C is completed. Currently, create_job and cancel_job are also still hard-denylisted in src/bridge/effect_adapter.rs:2275-2283.

Suggested change

- Tracked in issue #2800 (PR-C): extend the alias to cover `create_job` / `cancel_job` as well. Only `build_software` remains hard-denylisted as v1-specific infra.

- Tracked in issue #2800 (PR-C): extend the alias to cover create_job / cancel_job as well, after which only build_software will remain hard-denylisted as v1-specific infra.

Fixed in 1b9c47c — reworded to "...after which only build_software will remain hard-denylisted" so it's clearly describing the post-PR-C state, not current staging.
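
The ordering that matters in this thread (alias first, denylist second) can be sketched as follows. This is a Python illustration of logic that lives in Rust in `src/bridge/effect_adapter.rs`; the plain prefix swap, the function names, and the `echo` tool are assumptions, and the denylist contents reflect the pre-PR-C state described in the review comment.

```python
# Hypothetical sketch: the real code is Rust; names and the simple prefix
# swap are assumptions for illustration.
V1_DENYLIST = {"create_job", "cancel_job", "build_software"}  # state before PR-C


def routine_to_mission_alias_sketch(tool_name):
    """Return the mission_* name for a routine_* tool, or None if not a routine tool."""
    prefix = "routine_"
    if tool_name.startswith(prefix):
        return "mission_" + tool_name[len(prefix):]
    return None


def resolve_tool(tool_name):
    """Apply the alias before the denylist check, as the doc describes."""
    alias = routine_to_mission_alias_sketch(tool_name)
    if alias is not None:
        return ("dispatch", alias)
    if tool_name in V1_DENYLIST:
        return ("denied", tool_name)
    return ("dispatch", tool_name)
```

Under this sketch, extending the alias to `create_job` / `cancel_job` in PR-C would shrink the effective denylist to `build_software` alone, which is the post-PR-C state the reworded bullet describes.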

gemini-code-assist · 2026-04-21T17:41:30Z

-1. Simulate → preview
-2. Approve → user/policy
-3. Execute → actual effect
+- `PolicyEngine::evaluate_with_provenance` injects `RequireApproval` for `WriteExternal` and `Financial` effects when triggered by `LlmGenerated` or `ToolOutput` provenance (`crates/ironclaw_engine/src/capability/policy.rs:126-169`).


The description of provenance-based approval is slightly inaccurate. According to the implementation in policy.rs:153-161, ToolOutput provenance currently only triggers RequireApproval for Financial effects, not WriteExternal effects.

Suggested change

- `PolicyEngine::evaluate_with_provenance` injects `RequireApproval` for `WriteExternal` and `Financial` effects when triggered by `LlmGenerated` or `ToolOutput` provenance (`crates/ironclaw_engine/src/capability/policy.rs:126-169`).

- PolicyEngine::evaluate_with_provenance injects RequireApproval for Financial effects (via LlmGenerated or ToolOutput provenance) and WriteExternal effects (via LlmGenerated provenance) (crates/ironclaw_engine/src/capability/policy.rs:126-169).

Fixed in 1b9c47c — rewrote to "injects RequireApproval for Financial effects (via LlmGenerated or ToolOutput provenance) and WriteExternal effects (via LlmGenerated provenance)" to match policy.rs:126-169 exactly (ToolOutput only taints Financial, not WriteExternal).
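
The corrected rule is easiest to read as a small decision table. The sketch below restates it in Python for illustration only; the actual logic is Rust in `policy.rs`, and the `READ_LOCAL` effect and `USER` provenance are hypothetical placeholders added so the table has non-triggering rows.

```python
from enum import Enum, auto


class Effect(Enum):
    READ_LOCAL = auto()      # hypothetical placeholder for a non-gated effect
    WRITE_EXTERNAL = auto()
    FINANCIAL = auto()


class Provenance(Enum):
    USER = auto()            # hypothetical placeholder for trusted input
    LLM_GENERATED = auto()
    TOOL_OUTPUT = auto()


# Which provenance values inject an approval requirement for each effect,
# as stated in the corrected doc text.
_APPROVAL_TRIGGERS = {
    Effect.FINANCIAL: {Provenance.LLM_GENERATED, Provenance.TOOL_OUTPUT},
    Effect.WRITE_EXTERNAL: {Provenance.LLM_GENERATED},
}


def requires_approval(effect, provenance):
    """Return True when the (effect, provenance) pair should require approval."""
    return provenance in _APPROVAL_TRIGGERS.get(effect, set())
```

The asymmetry is the point of the fix: `ToolOutput` provenance gates `Financial` effects but not `WriteExternal` ones.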

Copilot

Pull request overview

Updates the engine v2 architecture plan document to reflect the current, verified implementation state and re-align status/progress sections with shipped code (tracked under issue #2800).

Changes:

- Updates status header + implementation progress table to reflect completed phases (incl. Phase 6 + 7a) and current Phase 8 focus.
- Rewrites plan sections (compaction, tool reliability, routines/jobs, two-phase commit, acceptance testing, cleanup/migration) with references to concrete code locations.
- Adds clarifications where prior plan text described aspirational or outdated architecture.


Copilot · 2026-04-22T07:51:35Z

- `SkillSelector` / `LoadedSkill` → engine `Capability` (knowledge)
- `HookPipeline` → engine `Capability` (policies)
- `ApprovalRequirement` / `ApprovalContext` → engine `CapabilityLease` + `PolicyEngine`
+The `ironclaw_engine` crate contains zero references to `JobState`, `Session`, `Routine`, or v1 delegate types. The engine was built clean from day one on the five primitives (Thread, Step, Capability, MemoryDoc, Project). No migration work is needed inside the crate.


“The ironclaw_engine crate contains zero references to ... Session, Routine ...” is not accurate as-written: the terms “Session”/“Routine” appear in crate/module docs (e.g., crates/ironclaw_engine/src/lib.rs and crates/ironclaw_engine/src/types/thread.rs). If the intent is “no runtime dependency on v1 types/modules”, consider rephrasing to that (or narrowing to “no v1 type definitions/usages outside documentation”).

Suggested change

The `ironclaw_engine` crate contains zero references to `JobState`, `Session`, `Routine`, or v1 delegate types. The engine was built clean from day one on the five primitives (Thread, Step, Capability, MemoryDoc, Project). No migration work is needed inside the crate.

The `ironclaw_engine` crate has no runtime dependency on `JobState`, `Session`, `Routine`, or v1 delegate types; any remaining mentions are limited to documentation/comments. The engine was built clean from day one on the five primitives (Thread, Step, Capability, MemoryDoc, Project). No migration work is needed inside the crate.

Fixed in 1b9c47c — rephrased to "no runtime dependency on JobState, Session, Routine, or v1 delegate types; any remaining mentions are limited to documentation/comments". Agreed the prior "zero references" phrasing was too literal given the docs/comment mentions in lib.rs and types/thread.rs.

Copilot · 2026-04-22T07:51:35Z

-Rust should provide token estimates, retrieval helpers, checkpoints, and final transcript commit points. The compaction policy, timing, and prompt should live in the Python RLM loop, and the orchestrator should own the mutable working transcript it sends to the LLM.
+Compaction is orchestrator-owned, in Python. See `crates/ironclaw_engine/orchestrator/default.py:240-310`:
+
+- Triggers when token count exceeds 85% of the model limit


Compaction trigger is described as fixed at 85% of model limit, but compact_if_needed uses compaction_threshold from config (default 0.85). Consider wording this as “defaults to 85% (configurable via compaction_threshold)” to match the implementation.

Suggested change

- Triggers when token count exceeds 85% of the model limit

- Triggers when token count exceeds the configured `compaction_threshold` of the model limit (defaults to 85%)

Fixed in 1b9c47c — rebuilt the bullet as "Triggers when token count exceeds the configured compaction_threshold of the model limit (defaults to 85%)" so it matches compact_if_needed's config read.
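
A configurable trigger of this shape could be sketched as below. Only the `compaction_threshold` key and its 0.85 default come from the review; the function name, the dict-style config, and the strict `>` comparison are assumptions, not the body of `compact_if_needed`.

```python
DEFAULT_COMPACTION_THRESHOLD = 0.85  # default quoted in the review comment


def should_compact(token_count, model_limit, config=None):
    """Return True when the estimated token count exceeds the configured
    fraction of the model's context limit (hypothetical helper)."""
    config = config or {}
    threshold = config.get("compaction_threshold", DEFAULT_COMPACTION_THRESHOLD)
    return token_count > threshold * model_limit
```

For example, `should_compact(90_000, 100_000)` is true under the default, while passing `{"compaction_threshold": 0.95}` would defer compaction for the same inputs.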

Copilot · 2026-04-22T07:51:36Z

 ### 4.9 Tool reliability learning
-Track per-action EMA metrics (success rate, latency, failure patterns). Current remaining question: whether to inject them into context by default or only surface them through targeted retrieval/debugging.
+
+`ReliabilityTracker` (`crates/ironclaw_engine/src/reliability.rs`) records EMA-smoothed success rate and latency per action. Tracked in issue #2800 (PR-B): writes from `EffectBridgeAdapter` after every dispatch, reads from `build_step_context` to append a "recently unreliable actions" section to the system prompt when `call_count ≥ 10` and `success_rate < 0.7` (cap 5 entries, kill switch `ENGINE_V2_RELIABILITY_HINTS`).


This paragraph includes detailed behavior (writing from EffectBridgeAdapter, reading from build_step_context, thresholds, and kill switch ENGINE_V2_RELIABILITY_HINTS) that does not appear to exist in the codebase yet (e.g., no ENGINE_V2_RELIABILITY_HINTS usage and no record_success/record_failure calls from the bridge). To keep this plan “verified reality”, either remove the unimplemented details or clearly label them as proposed work for PR-B.

Suggested change

`ReliabilityTracker` (`crates/ironclaw_engine/src/reliability.rs`) records EMA-smoothed success rate and latency per action. Tracked in issue #2800 (PR-B): writes from `EffectBridgeAdapter` after every dispatch, reads from `build_step_context` to append a "recently unreliable actions" section to the system prompt when `call_count ≥ 10` and `success_rate < 0.7` (cap 5 entries, kill switch `ENGINE_V2_RELIABILITY_HINTS`).

`ReliabilityTracker` (`crates/ironclaw_engine/src/reliability.rs`) records EMA-smoothed success rate and latency per action. Proposed follow-up work tracked in issue #2800 (PR-B): wire `EffectBridgeAdapter` to record outcomes after dispatch, have `build_step_context` optionally surface a "recently unreliable actions" prompt section, and finalize any thresholds, entry caps, and feature-flag/kill-switch behavior (including a possible `ENGINE_V2_RELIABILITY_HINTS` control) once implemented.

Fixed in 1b9c47c — relabelled the paragraph as proposed PR-B follow-up work. The threshold/cap/kill-switch details are now described as design proposals to be finalized when EffectBridgeAdapter write-backs and build_step_context reads actually land.
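
To make the proposal concrete, here is a minimal Python sketch of EMA-smoothed tracking plus the proposed filter. The real `ReliabilityTracker` is Rust and its API is not shown in this thread, so the class, method names, and the smoothing factor are assumptions; the thresholds (`call_count ≥ 10`, `success_rate < 0.7`, cap of 5) are the proposed values from the original paragraph, not shipped behavior.

```python
from dataclasses import dataclass

ALPHA = 0.2             # assumed smoothing factor; not taken from the codebase
MIN_CALLS = 10          # proposed threshold from the PR-B description
MAX_SUCCESS_RATE = 0.7  # proposed threshold from the PR-B description
MAX_ENTRIES = 5         # proposed cap from the PR-B description


@dataclass
class ActionStats:
    call_count: int = 0
    success_rate: float = 1.0
    latency_ms: float = 0.0


class ReliabilityTrackerSketch:
    """Hypothetical stand-in for the Rust tracker."""

    def __init__(self, alpha=ALPHA):
        self.alpha = alpha
        self.stats = {}

    def record(self, action, success, latency_ms):
        """Fold one dispatch outcome into the per-action moving averages."""
        s = self.stats.setdefault(action, ActionStats())
        outcome = 1.0 if success else 0.0
        if s.call_count == 0:
            # Seed the averages with the first observation.
            s.success_rate, s.latency_ms = outcome, latency_ms
        else:
            a = self.alpha
            s.success_rate = a * outcome + (1 - a) * s.success_rate
            s.latency_ms = a * latency_ms + (1 - a) * s.latency_ms
        s.call_count += 1

    def unreliable_actions(self):
        """Return up to MAX_ENTRIES action names, least reliable first."""
        flagged = [
            (name, s) for name, s in self.stats.items()
            if s.call_count >= MIN_CALLS and s.success_rate < MAX_SUCCESS_RATE
        ]
        flagged.sort(key=lambda item: item[1].success_rate)
        return [name for name, _ in flagged[:MAX_ENTRIES]]
```

The minimum call count keeps a single early failure from flagging an action, which is presumably why the proposal pairs it with the success-rate cutoff.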

Apply accuracy fixes from PR #2801 review:

- Compaction threshold: describe as configurable via `compaction_threshold` (defaults to 85%), matching `compact_if_needed` in the Python orchestrator rather than claiming a fixed 85%.
- Token estimation: move ownership to the Python orchestrator (which runs the chars/token heuristic); Rust no longer claims to own this.
- Compaction cross-reference: drop the stale "crate-structure block above includes executor/compaction.rs" note: compaction lives entirely in Python.
- Reliability injection details (`ENGINE_V2_RELIABILITY_HINTS` kill switch, `EffectBridgeAdapter` write-backs, `build_step_context` reads) are labelled as proposed PR-B follow-up work rather than described as verified reality.
- Denylist phrasing: make it clear that `build_software` remains the only hard-denylisted v1 tool *after* PR-C lands, not before.
- Provenance rules: document accurately that `ToolOutput` provenance only injects `RequireApproval` on `Financial` effects; `WriteExternal` taint comes only from `LlmGenerated`, per policy.rs:126-169.
- Engine-side cleanup: acknowledge that `Session` / `Routine` identifiers still appear in engine docs/comments; the invariant is no runtime dependency, not zero string occurrences.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

serrrfirat · 2026-04-22T11:47:21Z

Paranoid Architect Review — Approve with Fixes

2 Medium findings. The verification pass is mostly accurate — claims about implemented functionality check out against the codebase. Two issues remain.

Medium

- Wrong line range: orchestrator/default.py:240-310 for compaction. compact_if_needed actually spans lines 206-276. Cited range starts mid-function and overflows into score_skill().
- Phase 6 progress table inconsistency: Progress table marks Phase 6 as "DONE", but the section header still reads "DONE (partial)" and two subsections (Routines/Jobs, Acceptance testing) are explicitly described as not-yet-complete. Consider "DONE (core)" or keeping "PARTIAL" to match.

Fixes needed

- Change orchestrator/default.py:240-310 → orchestrator/default.py:206-276
- Reconcile Phase 6 status between progress table and section header/body

serrrfirat · 2026-04-22T11:51:46Z

+### 4.3 Compaction (from RLM) — IMPLEMENTED

-Rust should provide token estimates, retrieval helpers, checkpoints, and final transcript commit points. The compaction policy, timing, and prompt should live in the Python RLM loop, and the orchestrator should own the mutable working transcript it sends to the LLM.
+Compaction is orchestrator-owned, in Python. See `crates/ironclaw_engine/orchestrator/default.py:240-310`:


Medium: Wrong line range. compact_if_needed() spans lines 206-276 in orchestrator/default.py, not 240-310. The cited range starts in the middle of the function (at the history.append() block) and overflows into the unrelated score_skill() function which starts at line 282.
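
A drift like this can be caught mechanically. The sketch below uses Python's standard `ast` module to look up the line span of a named function and compare it with a cited range; the helper names are hypothetical and no such script is known to exist in the repository.

```python
import ast


def function_span(source, name):
    """Return (first_line, last_line) of the named function, or None if absent."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # lineno is the `def` line; end_lineno requires Python 3.8+.
            return node.lineno, node.end_lineno
    return None


def citation_matches(source, name, cited_start, cited_end):
    """Check that a cited line range is exactly the function's span."""
    return function_span(source, name) == (cited_start, cited_end)
```

Run against the contents of `orchestrator/default.py`, a check such as `citation_matches(source, "compact_if_needed", 240, 310)` would be expected to fail if the function spans 206-276 as the review states.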

github-actions Bot added the labels scope: docs (Documentation), size: XS (< 10 changed lines, excluding docs), risk: low (changes to docs, tests, or low-risk modules), and contributor: core (20+ merged PRs) on Apr 21, 2026

gemini-code-assist Bot reviewed Apr 21, 2026


ilblackdragon mentioned this pull request Apr 21, 2026

[tracker] Engine v2 default flip — umbrella tracker #2800

Open

20 tasks

ilblackdragon marked this pull request as ready for review April 22, 2026 07:47

Copilot AI review requested due to automatic review settings April 22, 2026 07:47

Copilot started reviewing on behalf of ilblackdragon April 22, 2026 07:47

Copilot AI reviewed Apr 22, 2026


ilblackdragon merged commit 417ee61 into staging Apr 22, 2026
17 checks passed

ilblackdragon deleted the docs/engine-v2-plan-update-2800 branch April 22, 2026 10:00

github-actions Bot mentioned this pull request Apr 22, 2026

chore: promote staging to staging-promote/65380170-24767546819 (2026-04-22 10:09 UTC) #2848

Merged

serrrfirat reviewed Apr 22, 2026


This was referenced Apr 27, 2026

chore: promote staging to main (2026-04-27 16:38 UTC) #2990

Merged

chore: promote staging to main (2026-04-27 18:18 UTC) #2995

Merged

henrypark133 mentioned this pull request Apr 29, 2026

chore: release #3059

Merged

This was referenced May 7, 2026

chore: release #3371

Closed

chore: release #3373

Closed

chore: release #3376

Merged


3 participants