Skip to content

fix(reborn): preserve provider tool roundtrip metadata#3722

Open
henrypark133 wants to merge 9 commits into
reborn-integrationfrom
reborn/3620-provider-tool-calls
Open

fix(reborn): preserve provider tool roundtrip metadata#3722
henrypark133 wants to merge 9 commits into
reborn-integrationfrom
reborn/3620-provider-tool-calls

Conversation

@henrypark133
Copy link
Copy Markdown
Collaborator

Summary

  • preserve provider tool-call metadata through the Reborn loop so provider-backed models can receive valid tool-result roundtrips
  • keep raw provider replay metadata out of product-visible tool-result transcript content; store it on the typed internal provider-call side channel instead
  • group consecutive tool-result refs from the same provider turn so multi-tool responses replay as one assistant tool-call message followed by matching tool results

Stacked on #3720 / reborn/3622-tool-result-evidence.

Verification

  • cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway
  • cargo test -p ironclaw_loop_support --test thread_loop_support_contract
  • cargo test -p ironclaw_threads --test session_thread_contract
  • cargo test -p ironclaw_threads --features libsql --test db_session_thread_contract
  • cargo test -p ironclaw_agent_loop
  • cargo test -p ironclaw_turns
  • cargo test -p ironclaw_reborn --test loop_driver_host
  • git diff --check origin/reborn/3622-tool-result-evidence..HEAD

@github-actions github-actions Bot added size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels May 17, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for provider-side tool calls within the agent loop, enabling the model to emit and track capability invocations with associated metadata. Key changes include the addition of ProviderToolCall and ProviderToolDefinition structures, updates to the LoopCapabilityPort and HostManagedModelGateway traits to handle tool-aware requests, and enhancements to the thread history logic for replaying tool-call turns. Feedback highlights critical issues regarding non-deterministic snapshot selection from HashMap collections in capability_port.rs, a missing import for sha256_digest_token, and fragile manual field copying when constructing tool completion requests in model_gateway.rs.

Comment thread crates/ironclaw_loop_support/src/capability_port.rs Outdated
Comment thread crates/ironclaw_loop_support/src/capability_port.rs Outdated
Comment thread crates/ironclaw_loop_support/src/capability_port.rs Outdated
Comment thread crates/ironclaw_reborn/src/model_gateway.rs Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 127b4249b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/ironclaw_loop_support/src/capability_port.rs Outdated
Comment thread crates/ironclaw_reborn/src/model_gateway.rs
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds provider-native tool-call metadata plumbing through the Reborn loop so tool-result transcript refs can be replayed to LLM providers as valid tool-call/result roundtrips.

Changes:

  • Adds provider tool definitions/calls/reference DTOs across turns, loop support, agent loop, and thread storage.
  • Updates model gateway conversion to call providers with tools and reconstruct provider tool roundtrips from stored result refs.
  • Extends tests and fixtures for tool-result metadata and new capability descriptor schemas.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
crates/ironclaw_turns/tests/agent_loop_host_contract.rs Updates capability descriptor fixtures with parameter schemas.
crates/ironclaw_turns/src/run_profile/mod.rs Re-exports provider tool metadata types.
crates/ironclaw_turns/src/run_profile/host.rs Adds provider tool DTOs and capability/transcript metadata fields.
crates/ironclaw_threads/tests/session_thread_contract.rs Updates tool-result append calls for new provider metadata field.
crates/ironclaw_threads/tests/db_session_thread_contract.rs Updates durable tool-result append calls for new field.
crates/ironclaw_threads/src/tool_result_reference.rs Adds provider tool-call reference envelope validation.
crates/ironclaw_threads/src/lib.rs Re-exports provider tool-call reference envelope.
crates/ironclaw_threads/src/in_memory.rs Persists provider metadata in in-memory thread records/context.
crates/ironclaw_threads/src/db.rs Persists provider metadata in durable thread records/context.
crates/ironclaw_threads/src/contract.rs Extends thread/message/context contracts with provider metadata.
crates/ironclaw_reborn/tests/loop_driver_host.rs Updates test candidates for new provider metadata fields.
crates/ironclaw_reborn/tests/llm_gateway.rs Adds provider tool-call and roundtrip reconstruction tests.
crates/ironclaw_reborn/src/model_gateway.rs Adds tool-aware provider requests and tool-result replay reconstruction.
crates/ironclaw_reborn/src/loop_exit_applier/tests/mod.rs Updates test records/requests for new metadata field.
crates/ironclaw_reborn/src/loop_driver_host.rs Passes capability ports into model gateway flow.
crates/ironclaw_loop_support/tests/thread_loop_support_contract.rs Adds metadata preservation checks for transcript/model ports.
crates/ironclaw_loop_support/src/lib.rs Carries provider metadata through transcript and model request adapters.
crates/ironclaw_loop_support/src/capability_surface_filter.rs Filters provider tool definitions/calls by capability allow-set.
crates/ironclaw_loop_support/src/capability_port.rs Builds provider tool definitions and maps provider calls to capability candidates.
crates/ironclaw_agent_loop/src/test_support/mod.rs Records provider metadata in mock host result refs.
crates/ironclaw_agent_loop/src/executor.rs Propagates provider call metadata when appending completed tool results.
Comments suppressed due to low confidence (5)

crates/ironclaw_loop_support/src/capability_port.rs:541

  • Provider tool calls cannot be used with the actual host-runtime capability port because this new path immediately calls LoopCapabilityInputResolver::register_provider_tool_call_input, but the trait only provides a default implementation that returns InvalidInvocation and there is no overriding implementation in the repository. As a result, any provider response containing tool calls will fail instead of producing capability candidates unless every host wires a custom resolver implementation.
        let input_ref = self
            .input_resolver
            .register_provider_tool_call_input(&self.run_context, &tool_call)
            .await?;

crates/ironclaw_loop_support/src/capability_port.rs:534

  • This lookup accepts any provider name recorded in the visible snapshot, including capabilities that tool_definitions() deliberately filtered out because their schemas were not usable for provider tool calling. A provider can still return one of those unadvertised tool names and this path will register it for invocation, so the registration path should enforce the same advertised-tool set as tool_definitions().
            let Some(capability_id) = snapshot.provider_names.get(&tool_call.name).cloned() else {
                return Err(AgentLoopHostError::new(
                    AgentLoopHostErrorKind::InvalidInvocation,
                    "provider tool call is outside the visible capability surface",
                ));

crates/ironclaw_loop_support/src/capability_port.rs:515

  • Provider-call metadata is not validated before it is turned into a capability input. The only new validation happens later when the tool result is appended to the thread, so an oversized or invalid provider call can execute the capability side effect and then fail transcript persistence, leaving the run unable to replay the roundtrip safely. Validate the provider id/name/arguments limits before registering the input ref.
    async fn register_provider_tool_call(
        &self,
        tool_call: ProviderToolCall,
    ) -> Result<ironclaw_turns::run_profile::CapabilityCallCandidate, AgentLoopHostError> {

crates/ironclaw_reborn/src/model_gateway.rs:855

  • Assistant-level provider reasoning is reconstructed from the first per-call reasoning value, but provider_tool_call_from_llm stores call-specific reasoning and response-level reasoning in the same field. When the first tool call has its own reasoning and the response also has shared reasoning, replay sends the call-specific text as the assistant message reasoning instead of the provider's response-level reasoning, which can break providers that require the exact reasoning artifact to be echoed.
    let reasoning = provider_results
        .iter()
        .find_map(|(provider_call, _)| provider_call.reasoning.clone());
    let assistant = ChatMessage::assistant_with_tool_calls(

crates/ironclaw_loop_support/src/capability_port.rs:769

  • Provider tool names are generated directly from the full capability id without enforcing the 256-byte limit that ProviderToolCallReferenceEnvelope::validate applies later. A long but valid capability id can therefore be advertised to the provider and executed, then fail when the result metadata is appended; the generated name should be bounded or rejected before advertising the tool.
    let base = capability_id.as_str().replace('.', "__");
    if !existing.contains_key(&base) {
        return base;
    }
    let digest = sha256_digest_token(capability_id.as_str().as_bytes());
    format!("{}__{}", base, &digest[..8])

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/ironclaw_loop_support/src/capability_port.rs Outdated
Comment thread crates/ironclaw_loop_support/src/capability_surface_filter.rs
Comment thread crates/ironclaw_threads/src/db.rs
Comment thread crates/ironclaw_reborn/src/model_gateway.rs Outdated
Comment thread crates/ironclaw_threads/src/contract.rs Outdated
Comment thread crates/ironclaw_agent_loop/src/executor.rs
Comment thread crates/ironclaw_threads/src/in_memory.rs
Comment thread crates/ironclaw_threads/src/db.rs
@henrypark133 henrypark133 force-pushed the reborn/3620-provider-tool-calls branch from 127b424 to 3f4e21c Compare May 17, 2026 02:02
@henrypark133
Copy link
Copy Markdown
Collaborator Author

Addressed the provider-tool-call review round in 3f4e21cfc88265af73dae48de99dc7b2feb6afee.

Summary:

  • replaced the expanded CapabilityCallCandidate provider fields with nested ProviderToolCallReplay metadata
  • made provider surface selection deterministic and provider registration reject invalid calls before staging input
  • preserved zero-argument capability schemas for provider tools
  • scoped synthesized provider turn ids by run/turn/model-call and fixed digest truncation
  • centralized CompletionRequest -> ToolCompletionRequest conversion
  • kept provider replay metadata internal to context loading while stripping/redacting it from product history surfaces
  • added executor, loop-support, in-memory thread, and durable libSQL contract coverage for the provider metadata path

Resolved the review threads that are covered by those changes.

Copy link
Copy Markdown
Collaborator

@zmanian zmanian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the diff carefully. The metadata side channel is structurally sound but a few correctness gaps in the replay path need attention before this lands.

Metadata isolation — looks correct

ThreadMessageRecord.tool_result_provider_call is stripped in history_messages() in both crates/ironclaw_threads/src/db.rs and crates/ironclaw_threads/src/in_memory.rs, while context_messages_with_summary_replacements retains it. So list_thread_history (product surface) sees only content, and load_context_window (model replay) sees the provider envelope. Redaction clears it. Test coverage on the in-memory and durable paths is real. Good.

One nit: nothing prevents a future caller from serializing a ThreadMessageRecord directly (it derives Serialize and the field has skip_serializing_if = Option::is_none but is otherwise public). A doc comment landed on the field, but it would be more defensible to make tool_result_provider_call pub(crate) or wrap accessors so the only emission path is the deliberately-scrubbing history fn. As-is, the isolation is enforced by convention, not by the type.

Multi-tool grouping — partial-failure replay is wrong

In crates/ironclaw_agent_loop/src/executor.rs, append_capability_result_ref only runs on CapabilityOutcome::Completed. Failures don't append a tool-result ref carrying provider_call. On the next turn, convert_messages in crates/ironclaw_reborn/src/model_gateway.rs groups consecutive ToolResult messages by provider_turn_id and reconstructs an assistant message containing only the completed tool calls.

That is structurally invalid for OpenAI / Anthropic: if the model originally emitted N tool calls in one turn, the provider expects exactly N matching tool-result messages. Replaying with K<N tool_calls and K tool_results drops the failed calls entirely from the assistant message — the provider sees a turn that never happened, and may produce different (or rejected) downstream behavior.

The two grouping tests (gateway_reconstructs_multi_tool_provider_turn_from_grouped_result_references, gateway_with_two_tool_calls_returns_two_candidates) only cover the all-success case. There is no test for partial-failure replay, no test for failure-only replay, and no test for what happens when a denied / suspended tool is in the batch.

Either:

  • emit a tool-result envelope with provider_call populated even on failure (with the safe summary describing the failure), so the assistant turn replays with its full original shape, or
  • record the assistant tool-call turn shape independently so reconstruction does not depend on which results survived.

Worth catching before this lands.

Cross-provider replay — no identity check

ProviderToolCallReferenceEnvelope carries provider_call_id, provider_tool_name, and signature verbatim into ChatMessage::assistant_with_tool_calls and ChatMessage::tool_result on replay (tool_result_replay_messageprovider_tool_call_from_reference). Nothing in the envelope, the ToolResultReferenceEnvelope, or ProviderToolCallReplay records the originating provider or model.

Real failure modes on resume against a different provider:

  • Anthropic call IDs are toolu_*, OpenAI are call_*. Replaying Anthropic IDs into an OpenAI request will be rejected by the API.
  • signature is an Anthropic reasoning-signature concept. Forwarding it into an OpenAI request is at best ignored, at worst rejected.
  • Tool-name normalization differs (demo__echo is fine for both, but the encoding rule lives in provider_tool_name() and is not provider-specific — a future provider with stricter constraints would silently mismatch).

Either bind the replay metadata to the producing provider/model and refuse cross-provider replay, or strip/translate provider-specific fields on emit. As-is, a resume that crosses providers is a latent runtime failure with no early signal.

Schema gate vs invocation surface drift

provider_schema_is_usable (crates/ironclaw_loop_support/src/capability_port.rs) silently drops capabilities whose parameters_schema isn't { "type": "object", ... } from tool_definitions(), but the same capabilities remain invokable through invoke_capability. The model sees a smaller surface than the host enforces. For capabilities authored without a schema this is a soft failure — they become permanently invisible to provider-tool-call paths until someone notices. Worth at least a debug! when a capability is filtered out, and ideally a contract that capabilities advertised as visible always carry a usable schema.

Stacking on #3720

Base is reborn/3622-tool-result-evidence (#3720's head). Does not stand alone — the ProviderToolCallReferenceEnvelope validation primitives, tool_result_provider_call field plumbing, and the tool-result-reference envelope itself are introduced on the parent branch. If #3720 changes shape during review (envelope schema, validation rules) this PR's tests will need a rebase pass. Worth confirming the merge order is locked.

Smaller items

  • provider_turn_id is a sha256-truncated digest of (run, turn, model_call_seq, tool_call ids+names). The 16-hex-char truncation gives ~64 bits of entropy. Collisions across long-running threads are vanishingly unlikely but the grouping in convert_messages relies entirely on string equality — a collision would silently merge two distinct provider turns into one assistant message. Either widen to 32 hex chars or pin grouping with a secondary check (e.g., contiguous sequence range in transcript).
  • AtomicU64::fetch_add(1, Ordering::Relaxed) for provider_turn_sequence is fine for grouping but means turn ids are not deterministic across process restarts. If a gateway is recreated mid-thread, sequence resets to 1 and the same (run, turn) produces a new provider_turn_id. Confirm this matches the intended replay semantics.
  • LlmProviderModelGateway::stream_model and RoutedLlmProviderModelGateway::stream_model no longer pass capabilities — only stream_model_with_capabilities does. The non-tool path is intentional, but means any code path that goes through plain stream_model will never get tool definitions even when capabilities are wired. Ensure no production caller path falls back to plain stream_model by accident.

Approve the metadata side-channel direction. Block on partial-failure replay correctness and cross-provider identity handling.

Zaki

@henrypark133
Copy link
Copy Markdown
Collaborator Author

Addressed Zaki’s follow-up review in c6c69252c45664374c6a3861e484350f616b1c7b.

What changed:

  • provider replay metadata is now bound to provider/model identity and replay fails closed if the selected route does not match
  • provider-originated terminal non-success outcomes now append a safe synthetic tool-result reference, so replay preserves the full assistant tool-call shape for mixed success/failure batches
  • provider turn ids now use 32 hex chars of digest entropy
  • provider tool schema drops now emit debug diagnostics
  • direct ThreadMessageRecord serialization skips the provider replay side channel, while durable DB persistence still writes/loads it explicitly

Validation run locally:

  • cargo test -p ironclaw_agent_loop
  • cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway
  • cargo test -p ironclaw_threads --test session_thread_contract
  • cargo test -p ironclaw_threads --features libsql --test db_session_thread_contract
  • cargo test -p ironclaw_loop_support
  • cargo test -p ironclaw_turns
  • cargo test -p ironclaw_reborn --test loop_driver_host
  • cargo check -p ironclaw_reborn --features root-llm-provider

@henrypark133 henrypark133 force-pushed the reborn/3620-provider-tool-calls branch from c6c6925 to be5888a Compare May 17, 2026 05:36
@henrypark133
Copy link
Copy Markdown
Collaborator Author

Follow-up after rebasing onto reborn/3622-tool-result-evidence and running another review pass.

A review subagent checked local head be5888af0965d514080df0b7bf6027533c3af319 against base a60b062b3 for PR scope, Reborn contract boundaries, and bug risk. It found no verified bugs or contract violations. The review specifically checked provider replay identity, provider metadata storage/public serialization boundaries, synthetic result refs for provider-backed non-success paths, and the cross-crate capability/result-reference contracts.

Additional fixes included in be5888af0:

  • static LlmProviderModelGateway now carries an explicit provider id instead of deriving provider identity from LlmProvider::model_name()
  • the all-features clippy large-enum warning in agent-loop test support is fixed
  • the post-rebase clippy warning in loop-exit-applier tests is fixed

Local verification after the rebase:

  • cargo fmt --check
  • cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway
  • cargo test -p ironclaw_agent_loop
  • cargo clippy --all --benches --tests --examples --all-features -- -D warnings

The rebased branch has been force-pushed with lease. GitHub checks have started on the new head and are still pending.

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness review focused on regressions from the broad provider-tool-call changes. I found two medium-or-higher issues and did a second pass over the touched replay, filtering, and persistence paths without finding additional medium findings.

Findings:

  1. P1 - Provider replay metadata is lost on history fallback. ThreadBackedLoopModelPort::resolve_model_messages falls back to list_thread_history when a requested message ref is outside the current context window. Both DB and in-memory list_thread_history deliberately clear tool_result_provider_call before returning history records, so history_messages_by_ref rebuilds a model message without the provider side channel. convert_messages then treats that tool result as a plain system summary instead of reconstructing the assistant tool-call + tool-result round trip. Any older tool-result ref resolved through history can therefore lose the provider call id/signature/reasoning and break the next provider request.

  2. P2 - Provider-native tool definitions ignore the executor's per-iteration capability filter. The executor computes a CapabilityFilter and applies it only to the local visible surface used for prompt/invocation checks. Later, the gateway fetches provider tool definitions from capabilities.tool_definitions(), which is built from the current snapshot without that local filter. The executor still denies filtered calls after the model asks for them, but the provider is being advertised tools that the current loop iteration intended to hide, leading to avoidable invalid tool calls and incorrect model context.

Verification run locally on the PR head:

  • cargo test -p ironclaw_threads --test session_thread_contract provider -- --nocapture
  • cargo test -p ironclaw_loop_support capability_surface_filter::tests::provider_tool_call_denied_before_inner_registration_when_not_allowed
  • cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway

The first attempted cargo test -p ironclaw_reborn --test llm_gateway failed only because that test target requires root-llm-provider; rerunning with the feature passed.

Copy link
Copy Markdown
Collaborator

@zmanian zmanian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of 3 commits since my prior review (6657d33d, 194ee470, be5888af) on top of the rebased base.

Prior findings — status

1. Metadata isolation enforced by convention not type — Partial. ThreadMessageRecord now has a doc note and DB persistence is the only explicit emitter; in-memory list_thread_history still scrubs at the same spot (in_memory.rs:612). The field is still public on a Serialize struct. Acceptable for now but I'd still prefer pub(crate) with explicit accessors.

2. Multi-tool partial-failure replay broken — Resolved. executor.rs now appends a tool-result ref on all terminal non-success outcomes (SkipResult, FailRun/FailTurn, PolicyDenied Retry, gate SkipAndContinue/Abort, unsupported process wait, final-fallback). append_capability_error_ref/append_capability_safe_summary_ref carry the provider_call envelope so reconstructed assistant turns match the original tool-call shape. New test gateway_reconstructs_multi_tool_provider_turn_with_failure_safe_summary covers mixed success/failure. Good fix.

3. Cross-provider replay no identity check — Resolved. ProviderToolCallReferenceEnvelope now carries provider_id + provider_model_id, validated on emit (capability_port.rs::validate_provider_identity) and again on replay (model_gateway.rs::validate_provider_replay_identity), which fails closed with PolicyDenied on mismatch. LlmProviderModelGateway::new now takes an explicit provider_id instead of deriving from model_name(). Good.

4. Schema gate vs invocation surface drift — Partial. Filter now emits tracing::debug! when a capability is dropped from tool_definitions(). The invocation surface still accepts capabilities that were silently filtered (no positive contract that "visible ⇒ provider-usable"). Acceptable for follow-up; not a blocker.

5. Stacking on #3720 — Resolved by rebase. Base is now a60b062b3 post-rebase; both PRs need to land in order.

6. 64-bit provider_turn_id entropy + non-deterministic sequence — Partially resolved. Digest widened from 16 to 32 hex chars (sha256_hex_prefix(..., 32)) — collision risk now adequate. AtomicU64 sequence reset on gateway recreation is unchanged; that means the same (run, turn) can produce different provider_turn_ids across restarts. Grouping is per-turn so a restart mid-turn is the only real failure window; flag for future hardening.

New findings (still blocking)

A. History-fallback path loses tool_result_provider_call (also raised by serrrfirat). ThreadBackedLoopModelPort::resolve_model_messages in crates/ironclaw_loop_support/src/lib.rs:862 falls back to list_thread_history when a result_ref is outside the active context window. Both in_memory.rs:611-613 and the DB equivalent in db.rs deliberately clear tool_result_provider_call before returning history records. history_messages_by_ref (line 1189) then propagates that None into the rebuilt ContextMessage. convert_messages falls back to the system-summary branch and the assistant tool-call turn cannot be reconstructed. Concretely: any tool-result older than the current window will lose its provider envelope on replay, regressing exactly the case this PR was meant to fix. Either retain the envelope in the fallback path (separate accessor that does not scrub), or refuse fallback for tool-result refs and require they live in the context window.

B. Per-iteration CapabilityFilter not applied to tool_definitions (also raised by serrrfirat). CanonicalAgentLoopExecutor::run_batch computes surface_filter = planner.capability().filter(&state) and applies it to the local VisibleCapabilitySurface via apply_capability_filter (executor.rs:232/250). The model gateway then independently calls capabilities.tool_definitions() on the unfiltered port. CapabilitySurfaceProfileFilter only enforces the static run-profile allow set, not the dynamic per-iteration filter. Net effect: provider gets advertised tools that the executor will then deny. The right fix is to either pass the per-iteration filter through to the capability port the gateway sees, or have tool_definitions() consult the same filter source the executor uses.

Validation

I did not re-run tests locally; relying on the run reported in the PR comments. The new loop_driver_host and gateway tests look targeted; the partial-failure test in particular exercises the right replay shape.

Verdict

Findings 2, 3, 5 cleanly resolved. Findings 1, 4, 6 acceptable as-is or partial. Block on A (history fallback) and B (per-iteration filter not propagated). Both are correctness regressions on the exact replay path this PR is meant to harden.

Zaki

@henrypark133 henrypark133 force-pushed the reborn/3620-provider-tool-calls branch from 464aa31 to 2eea296 Compare May 18, 2026 07:14
@henrypark133
Copy link
Copy Markdown
Collaborator Author

Addressed the provider-tool visibility concern in the latest push. The executor now passes the resolved model-visible capability IDs for the current iteration, and loop-support wraps the capability port with that resolved view before provider-native tool definitions/calls reach the model gateway. This keeps loop-specific filtering policy inside agent-loop while making provider tools respect the final visible surface.\n\nValidation run:\n- cargo test -p ironclaw_loop_support --test thread_loop_support_contract\n- cargo test -p ironclaw_agent_loop\n- cargo test -p ironclaw_turns --test agent_loop_host_contract\n- cargo test -p ironclaw_reborn --test loop_driver_host\n- cargo test -p ironclaw_reborn --test loop_milestone_event_projection\n- cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway\n- cargo clippy -p ironclaw_turns -p ironclaw_agent_loop -p ironclaw_loop_support -p ironclaw_reborn --tests -- -D warnings

@serrrfirat
Copy link
Copy Markdown
Collaborator

Correctness/security follow-up on the latest head (2eea296), avoiding the findings already raised in prior review threads.

Findings:

  1. P2 - Synthetic provider error refs are appended to transcript but not recorded in loop state.

In crates/ironclaw_agent_loop/src/executor.rs, provider-backed non-success paths call append_capability_error_ref / append_capability_safe_summary_ref, which writes a synthetic result:provider-error-* tool-result ref to the transcript. Unlike completed capability outcomes, those synthetic refs are never pushed into state.result_refs.

That leaves TurnSummary.batch_result_refs, final LoopExit.result_refs, and checkpointed loop state missing tool results that were required to satisfy provider replay. The transcript may replay correctly, but loop-level accounting still treats some provider tool outcomes as resultless. The append helper should return the synthetic ref and provider-backed terminal non-success paths should record it in state.result_refs.

  1. P2 - Multi-tool provider registration is not atomic.

In crates/ironclaw_reborn/src/model_gateway.rs, tool_response_to_host registers provider tool calls one at a time. If the first call stages input successfully and a later call fails validation/filtering, the whole model response is rejected while the earlier staged input remains orphaned. Retrying can stage duplicates.

The provider tool-call batch should be prevalidated before staging any inputs, or registration should be moved behind a batch API that can apply rollback/idempotency across the whole provider turn.

Extra paranoid pass: I did not find additional non-duplicate medium-or-higher correctness/security issues in the latest diff.

@henrypark133
Copy link
Copy Markdown
Collaborator Author

henrypark133 commented May 18, 2026

Code Review (multi-agent)

Intent: Preserve provider tool-call metadata through the Reborn loop so provider-backed models receive valid tool-result roundtrips without exposing raw metadata in user-visible transcripts.

Stats: 42 findings (from 42 raw, 42 after dedup) across 27 files

Security (7 issues)

  1. High ContextMessage serializes raw provider metadata without skip_serializing (crates/ironclaw_threads/src/contract.rs:220-228, confidence 90)
  2. High validate_provider_replay_identity only checks provider_id/model_id, not turn/call scope (crates/ironclaw_reborn/src/model_gateway.rs:963-976, confidence 85)
  3. Medium ProviderToolCallReferenceEnvelope arguments field lacks content sanitization (crates/ironclaw_threads/src/tool_result_reference.rs:68-83, confidence 75)
  4. Medium HostManagedModelMessage serializes tool_result_provider_call when present (crates/ironclaw_loop_support/src/lib.rs:1061-1068, confidence 70)
  5. Medium signature field stored and transmitted but never cryptographically verified (crates/ironclaw_threads/src/tool_result_reference.rs:68-107, confidence 65)
  6. Low validate_optional_provider_text lacks sensitive content checks (crates/ironclaw_threads/src/tool_result_reference.rs:181-193, confidence 60)
  7. Low ThreadMessageRecord Clone derives leak provider_call in memory across boundaries (crates/ironclaw_threads/src/contract.rs:78-98, confidence 55)

Bugs (5 issues)

  1. Medium validate_provider_tool_call skips turn_id validation when None (crates/ironclaw_loop_support/src/capability_port.rs:1053, confidence 80)
  2. Medium append_capability_safe_summary_ref accepts unvalidated String for safe_summary (crates/ironclaw_agent_loop/src/executor.rs:920, confidence 75)
  3. Medium invoke_filtered_batch truncation drops denials between last returned outcome and suspension point (crates/ironclaw_loop_support/src/capability_surface_filter.rs:101, confidence 70)
  4. Medium TOCTOU race in CapabilitySurfaceVisibleFilter::register_provider_tool_call (crates/ironclaw_loop_support/src/capability_surface_filter.rs:1298, confidence 60)
  5. Low provider_tool_name collision fallback produces suboptimal names for short capability IDs (crates/ironclaw_loop_support/src/capability_port.rs:1031, confidence 65)

Performance (11 issues)

  1. Medium context_messages_with_summary_replacements is O(n^2 * s) in db.rs (crates/ironclaw_threads/src/db.rs:1026-1075, confidence 75)
  2. Medium context_messages_with_summary_replacements is O(n^2 * s) in in_memory.rs (crates/ironclaw_threads/src/in_memory.rs:548-597, confidence 75)
  3. Low context_messages_by_id does O(n*m) linear scans in db.rs (crates/ironclaw_threads/src/db.rs:1077-1097, confidence 75)
  4. Low context_messages_by_id does O(n*m) linear scans in in_memory.rs (crates/ironclaw_threads/src/in_memory.rs:599-619, confidence 75)
  5. Low to_message_json does redundant triple serialization (crates/ironclaw_threads/src/db.rs:1864-1881, confidence 50)
  6. Low history_messages clones full records then nullifies provider_call in db.rs (crates/ironclaw_threads/src/db.rs:1118-1128, confidence 50)
  7. Low history_messages clones full records then nullifies provider_call in in_memory.rs (crates/ironclaw_threads/src/in_memory.rs:640-650, confidence 50)
  8. Low current_snapshot acquires two mutexes sequentially (crates/ironclaw_loop_support/src/capability_port.rs:309-340, confidence 50)
  9. Nit register_provider_tool_call calls tool_definitions() then inner.register (crates/ironclaw_loop_support/src/capability_surface_filter.rs:56-85, confidence 50)
  10. Nit create_summary_artifact scans messages twice for range validation in db.rs (crates/ironclaw_threads/src/db.rs:892-905, confidence 50)
  11. Nit create_summary_artifact scans messages twice for range validation in in_memory.rs (crates/ironclaw_threads/src/in_memory.rs:414-427, confidence 50)

Tests (15 issues)

  1. High ProviderToolCallReferenceEnvelope.validate() has zero test coverage (crates/ironclaw_threads/src/tool_result_reference.rs:86-106, confidence 100)
  2. High convert_messages() has zero test coverage for provider tool roundtrip (crates/ironclaw_reborn/src/model_gateway.rs:911-961, confidence 100)
  3. High Consecutive tool-result ref grouping not tested with mixed turn_ids (crates/ironclaw_reborn/src/model_gateway.rs:940-954, confidence 100)
  4. High tool_result_replay_message() error path for malformed JSON not tested (crates/ironclaw_reborn/src/model_gateway.rs:983-997, confidence 100)
  5. High tool_response_to_host() finish_reason error paths not tested (crates/ironclaw_reborn/src/model_gateway.rs:767-821, confidence 75)
  6. Medium validate_provider_replay_identity() mismatch error path not tested (crates/ironclaw_reborn/src/model_gateway.rs:963-976, confidence 100)
  7. Medium CapabilitySurfaceVisibleFilter has zero test coverage (crates/ironclaw_loop_support/src/capability_surface_filter.rs:48-118, confidence 100)
  8. Medium validate_provider_tool_call() has no direct unit tests (crates/ironclaw_loop_support/src/capability_port.rs:1052-1079, confidence 100)
  9. Medium history_messages() not tested to confirm provider_call is stripped in in_memory.rs (crates/ironclaw_threads/src/in_memory.rs:640-650, confidence 100)
  10. Medium history_messages() not tested to confirm provider_call is stripped in db.rs (crates/ironclaw_threads/src/db.rs:1118-1128, confidence 100)
  11. Medium context_messages_by_id() not tested with empty ids or all-missing ids in in_memory.rs (crates/ironclaw_threads/src/in_memory.rs:599-619, confidence 75)
  12. Medium context_messages_by_id() in db.rs not tested with empty/missing/mixed cases (crates/ironclaw_threads/src/db.rs:1077-1097, confidence 75)
  13. Medium to_message_json() not tested with and without provider_call in db.rs (crates/ironclaw_threads/src/db.rs:1864-1881, confidence 75)
  14. Medium synthetic_provider_error_result_ref() error paths not tested (crates/ironclaw_agent_loop/src/executor.rs:303-323, confidence 75)
  15. Medium complete_model_request() capabilities error path not tested (crates/ironclaw_reborn/src/model_gateway.rs:714-757, confidence 50)

Conventions (4 issues)

  1. Medium Duplicated provider validation functions across crates (crates/ironclaw_loop_support/src/capability_port.rs:865-937, confidence 90) — Anchor: AGENTS.md
  2. Medium Duplicated ProviderToolCallReference struct requires manual conversion (crates/ironclaw_loop_support/src/lib.rs:1288-1303, confidence 85) — Anchor: AGENTS.md
  3. Low New public types lack documentation (crates/ironclaw_turns/src/run_profile/host.rs:783-1167, confidence 95)
  4. Low Missing safety comment on infallible unwrap_or (crates/ironclaw_loop_support/src/capability_port.rs:823-824, confidence 80) — Anchor: AGENTS.md

Base automatically changed from reborn/3622-tool-result-evidence to reborn-integration May 18, 2026 14:29
@serrrfirat serrrfirat force-pushed the reborn/3620-provider-tool-calls branch from 764f5b9 to faa93b8 Compare May 18, 2026 15:29
@henrypark133
Copy link
Copy Markdown
Collaborator Author

Review findings summary

All 19 review comments from automated reviewers (Gemini, Codex, Copilot) and human reviewers (zmanian ×2, serrrfirat) have been triaged. No new changes are needed — every finding has been addressed by prior commits on this branch:

Already resolved by commits on this branch

Finding Status Resolved by
HashMap non-determinism in snapshots.values().next() ✅ Fixed 2eea296a7 — sorted/deterministic lookups
Missing sha256_digest_token import ✅ False positive Already imported at line 9
Manual field copying CompletionRequestToolCompletionRequest ✅ Fixed From impl added in crates/ironclaw_llm/src/provider.rs
provider_schema_is_usable rejects no-arg schemas ✅ Fixed Now uses .is_none_or() on properties
provider_turn_id collision risk (9 hex chars) ✅ Fixed Widened to 32 hex chars (128-bit entropy) + includes run_id/turn_id/model_call sequence
ThreadMessageRecord exposes provider metadata to product APIs ✅ Fixed #[serde(default, skip_serializing)] prevents serialization
Redaction doesn't clear tool_result_provider_call ✅ Fixed Both in_memory.rs and db.rs clear it on redact; history_messages() scrubs it
Allow-set check after inner delegation (security) ✅ Fixed Check moved before self.inner.register_provider_tool_call()
Missing test coverage for provider metadata roundtrip ✅ Fixed Tests with provider_replay: Some(...) added
Multi-tool partial-failure replay broken ✅ Fixed append_capability_error_ref/append_capability_safe_summary_ref carry provider_call
Cross-provider replay no identity check ✅ Fixed provider_id + provider_model_id validated on emit and replay
History-fallback loses tool_result_provider_call ✅ Fixed load_context_window preserves it; load_context_messages added
Per-iteration CapabilityFilter not applied to tool_definitions ✅ Fixed apply_capability_filter called; capability_view passed to model request
debug! logging when capability filtered ✅ Fixed tracing::debug! present at capability_port.rs:548

Duplicates (same issue, already resolved)

  • Schema filtering comment on capability_port.rs:500 — duplicate of 3253841725
  • Redaction comment on in_memory.rs:302 — duplicate of db.rs:779

Suppressed (low-confidence, no action needed)

  • Copilot suppressed findings — all deferred by design (fail-closed defaults, validation exists)

Verdict: All review findings are addressed. The PR is mergeable with a clean state.

@zmanian @serrrfirat — please re-review if you have any remaining concerns.

Copy link
Copy Markdown
Collaborator Author

@henrypark133 henrypark133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All review findings have been addressed by commits on this branch. Re-reviewing to confirm.

Copy link
Copy Markdown
Collaborator

@zmanian zmanian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review against head faa93b84 (the "address provider replay review findings" commit) on top of the prior reviewer rounds.

Prior findings — status check

# Finding Status
Zaki #1 Metadata isolation enforced by convention not type Partial — tool_result_provider_call is now skip_serializing on both ThreadMessageRecord (contract.rs:94) and HostManagedModelMessage (loop_support/lib.rs:1066). Acceptable.
Zaki #2 Multi-tool partial-failure replay broken Resolved via append_capability_safe_summary_ref on all terminal non-success paths.
Zaki #3 Cross-provider replay no identity check Resolved — validate_provider_replay_identity runs on each replayed envelope.
Zaki #4 Schema gate vs invocation surface drift Partial — tracing::debug! emitted on filter, no positive contract. Acceptable.
Zaki #6 Turn id entropy + non-determinism Resolved (32 hex chars). Restart-mid-turn sequence reset still latent.
Zaki A History fallback loses tool_result_provider_call Resolvedresolve_model_messages now calls load_context_messages (loop_support/lib.rs:881-890) which preserves the side channel via context_messages_by_id in both in_memory.rs:621 and db.rs:1096. list_thread_history is only used for summary artifacts.
Zaki B Per-iteration CapabilityFilter not applied to tool_definitions Resolved — new CapabilitySurfaceVisibleFilter (capability_surface_filter.rs:21-118) wraps the inner port; tool_definitions, register_provider_tool_call, visible_capabilities, and both invoke_capability* paths all enforce the model-visible set.
serrrfirat #1 (latest) Synthetic provider error refs not recorded in loop state Still open — see below.
serrrfirat #2 (latest) Multi-tool provider registration not atomic Partially addressed — see below.

Still blocking

A. Synthetic provider error refs are appended to the transcript but never pushed into state.result_refs

faa93b84 does not touch crates/ironclaw_agent_loop/src/executor.rs, so serrrfirat's earlier finding is unchanged. Concretely:

append_capability_safe_summary_ref (executor.rs:1710-1726) writes the synthetic result:provider-error-… ref to the host transcript via host.append_capability_result_ref(...), but the only writer of state.result_refs is push_completed_result (executor.rs:1776-1785), which runs solely on the CapabilityOutcome::Completed branch. Every call site at executor.rs:791, 803, 821, 937, 989, 1003, 1031 calls append_capability_*_ref for a provider-backed non-success outcome (SkipResult, Abort, policy-denied Retry, gate SkipAndContinue / Abort, final fallback, unsupported process wait) and only updates state.recovery_state / state.gate_state — never state.result_refs.

Net effect: the transcript contains the synthetic ref (so model replay reconstructs the assistant tool-call shape via provider_tool_roundtrip_messages), but loop accounting (TurnSummary.batch_result_refs derived at executor.rs:460 from state.result_refs[result_refs_start..], final LoopExit.result_refs at line 1387, and checkpointed loop state) is silently missing those refs. Mixed success/failure provider batches will produce a LoopExit.result_refs whose length differs from the number of provider tool-call results, and downstream consumers (telemetry, batch accounting, retry/resumption logic) see fewer refs than were actually emitted.

The fix is either to have append_capability_safe_summary_ref return the LoopResultRef and push it into state.result_refs at each provider-backed non-success call site, or to wrap the append in a helper that does both atomically.

Test gap that masked this: the test at executor.rs:3387 only asserts host.appended_result_refs() (transcript view), not completed.result_refs (loop-state view) on a denied provider call.

B. Multi-tool provider registration is still not atomic for non-name failure modes

faa93b84 adds a useful upfront check at model_gateway.rs:784-799 that all tool calls in the batch are in advertised_tool_names before any registration, and the new test gateway_rejects_unknown_provider_tool_call_before_registration confirms capabilities.registered.lock().unwrap().is_empty() after a name-mismatch rejection. That closes the unknown-name path cleanly.

But after that gate passes, tool_response_to_host still calls register_provider_tool_call in a sequential loop (model_gateway.rs:800-814), and register_provider_tool_call (capability_port.rs:566-623) has several failure modes that fire after prior calls in the batch have already staged input via input_resolver.register_provider_tool_call_input(...):

  1. validate_provider_tool_call (capability_port.rs:571) — turn-id/call-id/name length, 16 KiB arguments cap, JSON nesting depth, NUL/control chars, sensitive-marker rejection, sk- token rejection. Any of these only catch the offending call.
  2. current_snapshot() returning None between iterations (StaleSurface).
  3. provider_names/capabilities map miss (capability_port.rs:584-594).
  4. provider_schema_is_usable (capability_port.rs:596).
  5. input_resolver.register_provider_tool_call_input itself returning an error on a later call.

Scenario: a model emits two correctly-named tool calls, the second of which carries a 20 KiB arguments blob. Upfront name check passes. Call 1 validates and stages an input ref via the input resolver. Call 2 fails validate_provider_arguments. The whole response is rejected, but call 1's input ref is orphaned in the resolver. A retry can stage duplicates.

Either prevalidate the whole batch (run validate_provider_tool_call plus the visibility/schema checks on each tool_call in a first pass before any side effects), or move registration behind a batch API that can roll back / dedupe across the full provider turn. The current test only exercises the upfront name-check path; a partial-failure test against arguments-size / sensitive-content / unusable-schema would have caught this and should land with the fix.

Smaller observations (non-blocking)

  • validate_provider_replay_identity (model_gateway.rs:984-1003) now runs provider_call.validate() first — good. It only matches provider_id / provider_model_id, not turn/call scope, which is the right call (turn/call IDs are provider-emitted and have no host-side expected value), but worth pinning in a doc-comment so a future reader doesn't try to "tighten" it into something that breaks replay.
  • CapabilitySurfaceVisibleFilter::register_provider_tool_call (capability_surface_filter.rs:56-85) does a TOCTOU-style check: read tool_definitions() to map name→capability, then call inner.register_provider_tool_call. The post-call permits check on candidate.capability_id (line 78) catches the visibility-change race, but the inner registration's side effects (input-resolver staging) still happen — same orphan shape as B above, just one layer up. Easier to address once B has a batch-validation primitive.
  • The previously-flagged "synthetic ref written but state.result_refs not updated" test gap could be closed with a single new test asserting LoopExit::Completed { result_refs, .. } after a mixed-success provider batch.

Verification

I read the diff and the files at PR head faa93b84; I did not re-run the test suite locally. CI is still pending on this head (status pending, no checks reported yet).

Verdict

Findings A (history fallback) and B (per-iteration filter) from the previous round are cleanly resolved. The most recent serrrfirat findings are not fully resolved by faa93b84 despite the "all findings addressed" summary comment:

  • Block on A (loop-state result_refs missing synthetic provider error refs — correctness regression for loop accounting on every provider-backed non-success outcome).
  • Block on B (atomicity gap remains for argument-size / sensitive-content / unusable-schema failures after the first registered call, even with the new upfront name check).

Both are tractable; B in particular is largely a prevalidate-then-stage refactor of the existing per-call validation in register_provider_tool_call.


Generated by Claude Code

@henrypark133
Copy link
Copy Markdown
Collaborator Author

Addressed the open review feedback from #3722 (review) in 421a939a5.

What changed:

  • synthetic provider error refs now update loop-state result_refs at the same time they append transcript refs, so TurnSummary, checkpoints, and final LoopExit accounting include provider-backed non-success outcomes
  • provider tool-call batches now prevalidate every provider call through the capability port/filter stack before any registration or input staging, closing the partial-registration orphan case
  • duplicate append_tool_result_reference retries now return existing finalized tool-result refs before validating incoming provider metadata in both in-memory and durable stores
  • provider replay conversion now keeps same-turn provider tool results grouped even when a plain tool-result reference is interleaved

Regression coverage added for:

  • denied provider call final/checkpointed result_refs
  • invalid second provider tool call with no prior registration
  • in-memory and libSQL idempotent retry with invalid duplicate provider metadata
  • interleaved plain/provider tool-result replay grouping

Local validation:

  • cargo fmt --check
  • git diff --check
  • cargo test -p ironclaw_agent_loop
  • cargo test -p ironclaw_loop_support
  • cargo test -p ironclaw_reborn --test loop_driver_host
  • cargo test -p ironclaw_threads
  • cargo test -p ironclaw_threads --features libsql --test db_session_thread_contract libsql_persists_thread_history_and_context_across_reopen
  • cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway
  • cargo clippy -p ironclaw_turns -p ironclaw_agent_loop -p ironclaw_loop_support -p ironclaw_reborn --tests -- -D warnings
  • cargo clippy -p ironclaw_threads --features libsql --tests -- -D warnings
  • cargo clippy -p ironclaw_reborn --features root-llm-provider --test llm_gateway -- -D warnings

GitHub checks have started on the new head and are still in progress.

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-agent review of nearai/ironclaw#3722 at head 421a939a5bfd28bb21b425145fc5e2424e8b8f4a.

I found 4 medium-severity issues that should be addressed before merge:

  1. crates/ironclaw_agent_loop/src/executor.rs:273 - Filtered-out capabilities can still be materialized into the model prompt.

    The executor applies the planner capability filter to surface and passes the filtered IDs only as capability_view for provider tool definitions, but the prompt request still carries only the unchanged surface.version. Prompt construction can resolve that version through the host surface lookup and materialize the pre-strategy host/profile surface, so the model may receive names/descriptions for capabilities removed from this iteration. Bind prompt construction to the filtered surface as well, either by carrying filtered capability IDs/surface in the prompt request or by making the prompt authority resolve this request against the filtered descriptor set. Add a regression that asserts both the materialized prompt and provider tool definitions exclude a denied capability.

  2. crates/ironclaw_loop_support/src/capability_port.rs:694 - Capability surface snapshot locks are acquired in opposite orders.

    current_snapshot() locks current_surface_version and then snapshots, while visible_capabilities() locks snapshots and then current_surface_version. Concurrent model/tool surface access can deadlock the shared port. Put the current version and snapshots behind a single mutex, or enforce one lock acquisition order everywhere.

  3. crates/ironclaw_loop_support/src/capability_port.rs:863 - Provider tool names are not constrained to the LLM provider contract.

    provider_tool_name() only replaces dots and otherwise exposes the capability ID as the provider tool name when it is <= 256 bytes. The inbound validator even allows ., :, _, and -, but common provider tool/function name contracts are stricter than this. A valid IronClaw capability ID can therefore produce a tool definition that a provider rejects before the model sees the visible tool surface. Normalize provider tool names at this boundary to the strict shared provider-safe contract and use a digest suffix for truncation/collisions; validate inbound provider tool names against the same advertised-name contract.

  4. crates/ironclaw_threads/src/db.rs:769 - Idempotent tool-result replay can permanently drop provider replay metadata.

    append_tool_result_reference() returns an existing finalized record based only on turn_run_id and tool_result_ref before validating or comparing the new provider_call. If an older/fallback write created that record with tool_result_provider_call: None, a retry with provider metadata now available will return the stale record and never backfill the side channel. Backfill None -> Some(provider_call) on idempotent replay, and fail loudly if both sides are Some but differ. Mirror the same behavior in the in-memory store.

@henrypark133
Copy link
Copy Markdown
Collaborator Author

Addressed the four latest serrrfirat review findings in 3f78ca9be:

  • Prompt construction now receives the same filtered capability_view as provider tool definitions, so dynamic capability filters hide tools from both the materialized prompt and provider tool surface.
  • HostRuntimeLoopCapabilityPort::current_snapshot now uses the same lock order as visible_capabilities to avoid the snapshot/current-version deadlock.
  • Provider tool names are normalized and validated against a strict provider-safe contract ([A-Za-z0-9_-], 64 bytes max) with digest suffixes for truncation/collisions; inbound provider tool calls and stored provider refs use the same contract.
  • Idempotent tool-result append now backfills None -> Some(provider_call) on retry and rejects conflicting provider metadata in both durable and in-memory stores.

Validation run:

  • cargo fmt --check
  • git diff --check
  • cargo test -p ironclaw_turns -p ironclaw_agent_loop -p ironclaw_loop_support -p ironclaw_threads
  • cargo test -p ironclaw_threads --features libsql --test db_session_thread_contract libsql_persists_thread_history_and_context_across_reopen
  • cargo clippy -p ironclaw_turns -p ironclaw_agent_loop -p ironclaw_loop_support -p ironclaw_threads --tests -- -D warnings

@henrypark133 henrypark133 requested a review from serrrfirat May 18, 2026 20:30
Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review against current head 3f78ca9b, focused on the follow-up commits after the prior faa93b84 review and the latest automated review at 421a939a.

The shape is right: the follow-up now threads the per-iteration capability_view into both prompt construction and provider tool definitions, pre-validates provider tool batches before registration, constrains provider tool names to the stricter provider-safe contract, fixes the capability-port lock ordering, and backfills/rejects idempotent provider replay metadata consistently across in-memory and DB thread stores. The targeted test additions cover the intended invariants.

One thing still blocks merge:

  1. The current head does not compile under the PR checks

Both failing CI jobs are stopped by the same Rust compile error: LoopPromptBundleRequest gained required field capability_view, but crates/ironclaw_reborn/tests/loop_driver_host.rs:2750 still initializes it without that field in the inline-message rejection test.

This is caller-visible in the quality gate rather than a runtime design issue, but it blocks all downstream verification: Clippy (all-features) and Reborn CLI smoke tests both fail with E0063: missing field capability_view in initializer of LoopPromptBundleRequest at loop_driver_host.rs:2758 in the current CI run.

Please add capability_view: None to that initializer and rerun the affected checks. A good minimal verification is the failing loop_driver_host test target plus the existing provider-tool replay tests.

Verification I checked:

  • GitHub CI logs for run 26058596870: Clippy (all-features) and Reborn CLI smoke tests both fail on the missing field above.
  • Local focused check passed: cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway gateway_rejects_invalid_provider_tool_batch_before_any_registration -- --exact.

Net: prior provider replay correctness issues look addressed on current head, but this cannot merge until the test compile failure is fixed.

@henrypark133
Copy link
Copy Markdown
Collaborator Author

Fixed the compile blocker from review 4313712547 in 8c9511603.

Change: added the missing capability_view: None to the inline-message LoopPromptBundleRequest initializer in crates/ironclaw_reborn/tests/loop_driver_host.rs.

Validation run from an isolated worktree:

  • cargo fmt --check
  • git diff --check
  • cargo test -p ironclaw_reborn --test loop_driver_host text_only_host_prompt_rejects_inline_messages -- --exact
  • cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway gateway_rejects_invalid_provider_tool_batch_before_any_registration -- --exact

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-agent review of nearai/ironclaw#3722 at head 3f78ca9bee4a640b25ece2babcc2e3d0ba2fe144 with --force.

Reviewers completed: security, bugs, tests, conventions. The performance/concurrency reviewer failed to inspect the checkout due to a subprocess launch error, so I excluded it rather than treating it as clean.

Findings kept after validation:

  1. High - provider-backed tool calls cannot stage their input on the live HostRuntime capability path.
  2. High - current head fails CI because one LoopPromptBundleRequest initializer still omits capability_view.
  3. Medium - adjacent provider tool results from different turns need a regression test.
  4. Medium - FinishReason::ToolUse with an empty tool-call list needs a gateway regression test.
  5. Medium - run-profile provider tool advertisement filtering needs a regression test.
  6. Medium - provider replay identifiers remain raw Strings across internal module boundaries.

I dropped one conventions finding as a false positive: durable thread persistence does retain tool_result_provider_call through to_message_json, and the DB contract tests assert reload behavior.

let prepared = self.prepare_provider_tool_call(&tool_call)?;
let input_ref = self
.input_resolver
.register_provider_tool_call_input(&self.run_context, &tool_call)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HostRuntimeLoopCapabilityPort::register_provider_tool_call now calls register_provider_tool_call_input, but LoopCapabilityInputResolver only provides a default implementation that returns InvalidInvocation, and the concrete resolver in this checkout does not override it. The gateway can advertise provider tools and accept a provider tool call, but registration fails before producing a CapabilityCallCandidate, so the live provider-tool-call path cannot execute. Please implement provider tool-call argument staging for the resolver wired into HostRuntimeLoopCapabilityPortFactory, or make this trait method required and provide the concrete store-backed implementation.

continue;
};
validate_provider_replay_identity(&next_provider_call, replay_identity)?;
if next_provider_call.provider_turn_id != provider_turn_id {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replay converter breaks the current provider-tool group when the next tool result has a different provider_turn_id, but the added gateway tests cover same-turn grouping and plain-result interleaving only. Please add a regression such as gateway_separates_consecutive_provider_tool_results_from_different_turns with adjacent ToolResult messages for turn_1 then turn_2, so a future change cannot merge turns or skip the second replay.

HostManagedModelErrorKind::PolicyDenied,
"model response was blocked by provider policy",
)),
FinishReason::ToolUse => Err(HostManagedModelError::safe(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new error branch for FinishReason::ToolUse with no tool calls is not covered by the current gateway tests. Please add a regression such as gateway_with_tool_surface_rejects_tool_use_finish_without_tool_calls where complete_with_tools returns ToolUse and an empty tool_calls list.


#[async_trait]
impl LoopCapabilityPort for CapabilitySurfaceProfileFilter {
fn tool_definitions(&self) -> Result<Vec<ProviderToolDefinition>, AgentLoopHostError> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CapabilitySurfaceProfileFilter::tool_definitions now filters provider tool advertisements by the run-profile allow set, but the adjacent tests only verify denied registration. Please add a regression such as profile_filter_tool_definitions_excludes_denied_capabilities that asserts the provider tool definitions expose only allowlisted capabilities.

#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct ProviderToolCallReplay {
/// Provider identity selected by the host route.
pub provider_id: String,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provider replay metadata is now carried across turns, loop support, Reborn, and threads as raw String identifiers (provider_id, provider_model_id, turn/call IDs, and provider tool names). That conflicts with .claude/rules/types.md, which requires boundary strings to be converted into specialized domain types before flowing between internal modules. Please introduce validated newtypes for these provider replay identifiers in the owning contract layer and use them through the replay structs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants