Skip to content

docs(plan): update engine v2 architecture to match verified reality#2801

Merged
ilblackdragon merged 2 commits into
stagingfrom
docs/engine-v2-plan-update-2800
Apr 22, 2026
Merged

docs(plan): update engine v2 architecture to match verified reality#2801
ilblackdragon merged 2 commits into
stagingfrom
docs/engine-v2-plan-update-2800

Conversation

@ilblackdragon
Copy link
Copy Markdown
Member

Rewrites sections of docs/plans/2026-03-20-engine-v2-architecture.md that had drifted from the code. No code changes.

What this fixes

A verification pass against the actual implementation showed that several items the plan marked as "pending" or "not yet implemented" were in fact shipped. Future readers relying on the plan would redo the same verification work.

Sections rewritten

Section Previous claim Actual state
§4.3 Compaction "orchestrator should own…" (aspirational) IMPLEMENTED in orchestrator/default.py:240-310
§4.9 Tool reliability "whether to inject them… remaining question" ReliabilityTracker exists; integration tracked #2800 PR-B
§6.7 Routines / Jobs "blocked in engine v2 with a helpful error" routine_* already aliased to mission_*; create_job aliasing in #2800 PR-C
§6.7 Two-phase commit "NOT YET IMPLEMENTED" IMPLEMENTED via unified gate (policy.rs:126-169 + structured.rs:139-171)
§6.7 Acceptance testing "IN PROGRESS" with 6-item coverage list pointer to with_engine_v2 harness; coverage list moved to #2800 PR-D
§7 Cleanup and Migration "Planned" monolith split: 7a engine-side DONE, 7b host-cleanup blocked on default flip
Status header + Implementation Progress table phase 6 "PARTIAL", 7/8 "Planned" phase 6 DONE, 7a DONE, 7b blocked on flip, 8 in progress

Tracking

Part of umbrella issue #2800 (engine v2 default flip).

Test plan

  • `grep -n "NOT YET IMPLEMENTED\|Planned\|aspirational" docs/plans/2026-03-20-engine-v2-architecture.md` → clean except intentional history references
  • file paths and line numbers in the doc resolve to existing code

🤖 Generated with Claude Code

…ality

The plan doc claimed several items as missing/pending that are already
implemented. Update to match ground truth so future readers don't redo
the verification pass.

Changes:
- Compaction (§4.3): marked DONE, pointer to orchestrator/default.py:240-310
- Tool reliability (§4.9): tracker exists; integration tracked in #2800 PR-B
- Routines/Jobs (§6.7): routine_to_mission_alias already translates routine_* calls;
  create_job aliasing tracked in #2800 PR-C
- Two-phase commit (§6.7): marked IMPLEMENTED via unified gate
  (policy.rs:126-169 + structured.rs:139-171); simulate/preview
  intentionally not added at policy layer
- Acceptance testing (§6.7): pointer to with_engine_v2 harness; coverage
  expansion tracked in #2800 PR-D
- Phase 7: split into 7a (engine-side, DONE) and 7b (host cleanup,
  blocked on default flip)
- Status header + Implementation Progress table: updated to match current
  state; default-flip work consolidated under issue #2800

No code changes.

Refs: #2800

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added scope: docs Documentation size: XS < 10 changed lines (excluding docs) risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Apr 21, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Engine v2 architecture documentation to reflect the current implementation status, specifically noting that Phase 7 engine-side cleanup is complete and Phase 8 infrastructure integration is underway. Key technical updates include documenting the Python-based compaction logic, the implementation of the ReliabilityTracker for tool performance, and the unified gate mechanism for two-phase commits. The review feedback identifies several technical inaccuracies in the documentation regarding token estimation ownership, non-existent file references, the current state of the tool denylist, and the specific provenance triggers for effect approvals.

- Stores a snapshot in state history for audit
- Full prior trajectory stays searchable via workspace-backed retrieval; raw history is not replayed into the attention window

Rust side provides token estimation, retrieval helpers, and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The document states that the Rust side provides token estimation, but the implementation in orchestrator/default.py (lines 226-238) uses a Python-side heuristic (estimate_context_tokens) instead of a host function. The description should be updated to reflect that the orchestrator performs this estimation.

Suggested change
Rust side provides token estimation, retrieval helpers, and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM.
Rust side provides retrieval helpers and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM and performs token estimation via a heuristic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b9c47c — moved token estimation ownership to the Python orchestrator and left Rust with retrieval helpers + final transcript commit points only.


Rust side provides token estimation, retrieval helpers, and final transcript commit points; the orchestrator owns the mutable working transcript it sends to the LLM.

Note: the crate-structure block above mentions `executor/compaction.rs` — that file was never created. Compaction lives entirely in Python; the Rust side only exposes the primitives the Python orchestrator calls.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The note mentions that the crate-structure block above includes executor/compaction.rs, but that file is not present in the structure listed in lines 97-103. This reference should be removed or updated to avoid confusion.

Suggested change
Note: the crate-structure block above mentions `executor/compaction.rs` — that file was never created. Compaction lives entirely in Python; the Rust side only exposes the primitives the Python orchestrator calls.
Note: Compaction lives entirely in Python; the Rust side only exposes the primitives the Python orchestrator calls.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b9c47c — dropped the stale cross-reference. Now just notes that compaction lives entirely in Python and the Rust side exposes the primitives the orchestrator calls.

- Routines still work via `/routine` slash commands (fall through to v1)
- Engine v2 Mission APIs exist and are wired through the bridge; remaining work is migration/UX convergence rather than greenfield implementation
- `routine_create` / `routine_update` / `routine_list` / etc. are translated to mission_* dispatches via `routine_to_mission_alias()` in `src/bridge/effect_adapter.rs` before the v1-denylist check fires. The LLM-facing routine tools go through the mission manager in v2, not the v1 routine engine.
- Tracked in issue #2800 (PR-C): extend the alias to cover `create_job` / `cancel_job` as well. Only `build_software` remains hard-denylisted as v1-specific infra.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The statement "Only build_software remains hard-denylisted" is slightly premature as it describes the desired state after PR-C is completed. Currently, create_job and cancel_job are also still hard-denylisted in src/bridge/effect_adapter.rs:2275-2283.

Suggested change
- Tracked in issue #2800 (PR-C): extend the alias to cover `create_job` / `cancel_job` as well. Only `build_software` remains hard-denylisted as v1-specific infra.
- Tracked in issue #2800 (PR-C): extend the alias to cover create_job / cancel_job as well, after which only build_software will remain hard-denylisted as v1-specific infra.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b9c47c — reworded to "...after which only build_software will remain hard-denylisted" so it's clearly describing the post-PR-C state, not current staging.

1. Simulate → preview
2. Approve → user/policy
3. Execute → actual effect
- `PolicyEngine::evaluate_with_provenance` injects `RequireApproval` for `WriteExternal` and `Financial` effects when triggered by `LlmGenerated` or `ToolOutput` provenance (`crates/ironclaw_engine/src/capability/policy.rs:126-169`).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The description of provenance-based approval is slightly inaccurate. According to the implementation in policy.rs:153-161, ToolOutput provenance currently only triggers RequireApproval for Financial effects, not WriteExternal effects.

Suggested change
- `PolicyEngine::evaluate_with_provenance` injects `RequireApproval` for `WriteExternal` and `Financial` effects when triggered by `LlmGenerated` or `ToolOutput` provenance (`crates/ironclaw_engine/src/capability/policy.rs:126-169`).
- PolicyEngine::evaluate_with_provenance injects RequireApproval for Financial effects (via LlmGenerated or ToolOutput provenance) and WriteExternal effects (via LlmGenerated provenance) (crates/ironclaw_engine/src/capability/policy.rs:126-169).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b9c47c — rewrote to "injects RequireApproval for Financial effects (via LlmGenerated or ToolOutput provenance) and WriteExternal effects (via LlmGenerated provenance)" to match policy.rs:126-169 exactly (ToolOutput only taints Financial, not WriteExternal).

@ilblackdragon ilblackdragon marked this pull request as ready for review April 22, 2026 07:47
Copilot AI review requested due to automatic review settings April 22, 2026 07:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the engine v2 architecture plan document to reflect the current, verified implementation state and re-align status/progress sections with shipped code (tracked under issue #2800).

Changes:

  • Updates status header + implementation progress table to reflect completed phases (incl. Phase 6 + 7a) and current Phase 8 focus.
  • Rewrites plan sections (compaction, tool reliability, routines/jobs, two-phase commit, acceptance testing, cleanup/migration) with references to concrete code locations.
  • Adds clarifications where prior plan text described aspirational or outdated architecture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- `SkillSelector` / `LoadedSkill` → engine `Capability` (knowledge)
- `HookPipeline` → engine `Capability` (policies)
- `ApprovalRequirement` / `ApprovalContext` → engine `CapabilityLease` + `PolicyEngine`
The `ironclaw_engine` crate contains zero references to `JobState`, `Session`, `Routine`, or v1 delegate types. The engine was built clean from day one on the five primitives (Thread, Step, Capability, MemoryDoc, Project). No migration work is needed inside the crate.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“The ironclaw_engine crate contains zero references to ... Session, Routine ...” is not accurate as-written: the terms “Session”/“Routine” appear in crate/module docs (e.g., crates/ironclaw_engine/src/lib.rs and crates/ironclaw_engine/src/types/thread.rs). If the intent is “no runtime dependency on v1 types/modules”, consider rephrasing to that (or narrowing to “no v1 type definitions/usages outside documentation”).

Suggested change
The `ironclaw_engine` crate contains zero references to `JobState`, `Session`, `Routine`, or v1 delegate types. The engine was built clean from day one on the five primitives (Thread, Step, Capability, MemoryDoc, Project). No migration work is needed inside the crate.
The `ironclaw_engine` crate has no runtime dependency on `JobState`, `Session`, `Routine`, or v1 delegate types; any remaining mentions are limited to documentation/comments. The engine was built clean from day one on the five primitives (Thread, Step, Capability, MemoryDoc, Project). No migration work is needed inside the crate.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b9c47c — rephrased to "no runtime dependency on JobState, Session, Routine, or v1 delegate types; any remaining mentions are limited to documentation/comments". Agreed the prior "zero references" phrasing was too literal given the docs/comment mentions in lib.rs and types/thread.rs.

Rust should provide token estimates, retrieval helpers, checkpoints, and final transcript commit points. The compaction policy, timing, and prompt should live in the Python RLM loop, and the orchestrator should own the mutable working transcript it sends to the LLM.
Compaction is orchestrator-owned, in Python. See `crates/ironclaw_engine/orchestrator/default.py:240-310`:

- Triggers when token count exceeds 85% of the model limit
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compaction trigger is described as fixed at 85% of model limit, but compact_if_needed uses compaction_threshold from config (default 0.85). Consider wording this as “defaults to 85% (configurable via compaction_threshold)” to match the implementation.

Suggested change
- Triggers when token count exceeds 85% of the model limit
- Triggers when token count exceeds the configured `compaction_threshold` of the model limit (defaults to 85%)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b9c47c — rebuilt the bullet as "Triggers when token count exceeds the configured compaction_threshold of the model limit (defaults to 85%)" so it matches compact_if_needed's config read.

### 4.9 Tool reliability learning
Track per-action EMA metrics (success rate, latency, failure patterns). Current remaining question: whether to inject them into context by default or only surface them through targeted retrieval/debugging.

`ReliabilityTracker` (`crates/ironclaw_engine/src/reliability.rs`) records EMA-smoothed success rate and latency per action. Tracked in issue #2800 (PR-B): writes from `EffectBridgeAdapter` after every dispatch, reads from `build_step_context` to append a "recently unreliable actions" section to the system prompt when `call_count ≥ 10` and `success_rate < 0.7` (cap 5 entries, kill switch `ENGINE_V2_RELIABILITY_HINTS`).
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph includes detailed behavior (writing from EffectBridgeAdapter, reading from build_step_context, thresholds, and kill switch ENGINE_V2_RELIABILITY_HINTS) that does not appear to exist in the codebase yet (e.g., no ENGINE_V2_RELIABILITY_HINTS usage and no record_success/record_failure calls from the bridge). To keep this plan “verified reality”, either remove the unimplemented details or clearly label them as proposed work for PR-B.

Suggested change
`ReliabilityTracker` (`crates/ironclaw_engine/src/reliability.rs`) records EMA-smoothed success rate and latency per action. Tracked in issue #2800 (PR-B): writes from `EffectBridgeAdapter` after every dispatch, reads from `build_step_context` to append a "recently unreliable actions" section to the system prompt when `call_count ≥ 10` and `success_rate < 0.7` (cap 5 entries, kill switch `ENGINE_V2_RELIABILITY_HINTS`).
`ReliabilityTracker` (`crates/ironclaw_engine/src/reliability.rs`) records EMA-smoothed success rate and latency per action. Proposed follow-up work tracked in issue #2800 (PR-B): wire `EffectBridgeAdapter` to record outcomes after dispatch, have `build_step_context` optionally surface a "recently unreliable actions" prompt section, and finalize any thresholds, entry caps, and feature-flag/kill-switch behavior (including a possible `ENGINE_V2_RELIABILITY_HINTS` control) once implemented.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1b9c47c — relabelled the paragraph as proposed PR-B follow-up work. The threshold/cap/kill-switch details are now described as design proposals to be finalized when EffectBridgeAdapter write-backs and build_step_context reads actually land.

Apply accuracy fixes from PR #2801 review:

- Compaction threshold: describe as configurable via `compaction_threshold`
  (defaults to 85%), matching `compact_if_needed` in the Python
  orchestrator rather than claiming a fixed 85%.
- Token estimation: move ownership to the Python orchestrator (which
  runs the chars/token heuristic); Rust no longer claims to own this.
- Compaction cross-reference: drop the stale "crate-structure block
  above includes executor/compaction.rs" note — compaction lives
  entirely in Python.
- Reliability injection details (`ENGINE_V2_RELIABILITY_HINTS` kill
  switch, `EffectBridgeAdapter` write-backs, `build_step_context`
  reads) are labelled as proposed PR-B follow-up work rather than
  described as verified reality.
- Denylist phrasing: make it clear that `build_software` remains the
  only hard-denylisted v1 tool *after* PR-C lands, not before.
- Provenance rules: document accurately that `ToolOutput` provenance
  only injects `RequireApproval` on `Financial` effects; `WriteExternal`
  taint comes only from `LlmGenerated`, per policy.rs:126-169.
- Engine-side cleanup: acknowledge that `Session` / `Routine`
  identifiers still appear in engine docs/comments; the invariant is
  no runtime dependency, not zero string occurrences.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ilblackdragon ilblackdragon merged commit 417ee61 into staging Apr 22, 2026
17 checks passed
@ilblackdragon ilblackdragon deleted the docs/engine-v2-plan-update-2800 branch April 22, 2026 10:00
@serrrfirat
Copy link
Copy Markdown
Collaborator

Paranoid Architect Review — Approve with Fixes

2 Medium findings. The verification pass is mostly accurate — claims about implemented functionality check out against the codebase. Two issues remain.

Medium

  • Wrong line range: orchestrator/default.py:240-310 for compaction — compact_if_needed actually spans lines 206-276. Cited range starts mid-function and overflows into score_skill().
  • Phase 6 progress table inconsistency: Progress table marks Phase 6 as "DONE", but the section header still reads "DONE (partial)" and two subsections (Routines/Jobs, Acceptance testing) are explicitly described as not-yet-complete. Consider "DONE (core)" or keeping "PARTIAL" to match.

Fixes needed

  1. Change orchestrator/default.py:240-310orchestrator/default.py:206-276
  2. Reconcile Phase 6 status between progress table and section header/body

### 4.3 Compaction (from RLM) — IMPLEMENTED

Rust should provide token estimates, retrieval helpers, checkpoints, and final transcript commit points. The compaction policy, timing, and prompt should live in the Python RLM loop, and the orchestrator should own the mutable working transcript it sends to the LLM.
Compaction is orchestrator-owned, in Python. See `crates/ironclaw_engine/orchestrator/default.py:240-310`:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Wrong line range. compact_if_needed() spans lines 206-276 in orchestrator/default.py, not 240-310. The cited range starts in the middle of the function (at the history.append() block) and overflows into the unrelated score_skill() function which starts at line 282.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: docs Documentation size: XS < 10 changed lines (excluding docs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants