fix(reborn): preserve provider tool roundtrip metadata#3722
fix(reborn): preserve provider tool roundtrip metadata#3722henrypark133 wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for provider-side tool calls within the agent loop, enabling the model to emit and track capability invocations with associated metadata. Key changes include the addition of ProviderToolCall and ProviderToolDefinition structures, updates to the LoopCapabilityPort and HostManagedModelGateway traits to handle tool-aware requests, and enhancements to the thread history logic for replaying tool-call turns. Feedback highlights critical issues regarding non-deterministic snapshot selection from HashMap collections in capability_port.rs, a missing import for sha256_digest_token, and fragile manual field copying when constructing tool completion requests in model_gateway.rs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 127b4249b5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR adds provider-native tool-call metadata plumbing through the Reborn loop so tool-result transcript refs can be replayed to LLM providers as valid tool-call/result roundtrips.
Changes:
- Adds provider tool definitions/calls/reference DTOs across turns, loop support, agent loop, and thread storage.
- Updates model gateway conversion to call providers with tools and reconstruct provider tool roundtrips from stored result refs.
- Extends tests and fixtures for tool-result metadata and new capability descriptor schemas.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/ironclaw_turns/tests/agent_loop_host_contract.rs | Updates capability descriptor fixtures with parameter schemas. |
| crates/ironclaw_turns/src/run_profile/mod.rs | Re-exports provider tool metadata types. |
| crates/ironclaw_turns/src/run_profile/host.rs | Adds provider tool DTOs and capability/transcript metadata fields. |
| crates/ironclaw_threads/tests/session_thread_contract.rs | Updates tool-result append calls for new provider metadata field. |
| crates/ironclaw_threads/tests/db_session_thread_contract.rs | Updates durable tool-result append calls for new field. |
| crates/ironclaw_threads/src/tool_result_reference.rs | Adds provider tool-call reference envelope validation. |
| crates/ironclaw_threads/src/lib.rs | Re-exports provider tool-call reference envelope. |
| crates/ironclaw_threads/src/in_memory.rs | Persists provider metadata in in-memory thread records/context. |
| crates/ironclaw_threads/src/db.rs | Persists provider metadata in durable thread records/context. |
| crates/ironclaw_threads/src/contract.rs | Extends thread/message/context contracts with provider metadata. |
| crates/ironclaw_reborn/tests/loop_driver_host.rs | Updates test candidates for new provider metadata fields. |
| crates/ironclaw_reborn/tests/llm_gateway.rs | Adds provider tool-call and roundtrip reconstruction tests. |
| crates/ironclaw_reborn/src/model_gateway.rs | Adds tool-aware provider requests and tool-result replay reconstruction. |
| crates/ironclaw_reborn/src/loop_exit_applier/tests/mod.rs | Updates test records/requests for new metadata field. |
| crates/ironclaw_reborn/src/loop_driver_host.rs | Passes capability ports into model gateway flow. |
| crates/ironclaw_loop_support/tests/thread_loop_support_contract.rs | Adds metadata preservation checks for transcript/model ports. |
| crates/ironclaw_loop_support/src/lib.rs | Carries provider metadata through transcript and model request adapters. |
| crates/ironclaw_loop_support/src/capability_surface_filter.rs | Filters provider tool definitions/calls by capability allow-set. |
| crates/ironclaw_loop_support/src/capability_port.rs | Builds provider tool definitions and maps provider calls to capability candidates. |
| crates/ironclaw_agent_loop/src/test_support/mod.rs | Records provider metadata in mock host result refs. |
| crates/ironclaw_agent_loop/src/executor.rs | Propagates provider call metadata when appending completed tool results. |
Comments suppressed due to low confidence (5)
crates/ironclaw_loop_support/src/capability_port.rs:541
- Provider tool calls cannot be used with the actual host-runtime capability port because this new path immediately calls
LoopCapabilityInputResolver::register_provider_tool_call_input, but the trait only provides a default implementation that returnsInvalidInvocationand there is no overriding implementation in the repository. As a result, any provider response containing tool calls will fail instead of producing capability candidates unless every host wires a custom resolver implementation.
let input_ref = self
.input_resolver
.register_provider_tool_call_input(&self.run_context, &tool_call)
.await?;
crates/ironclaw_loop_support/src/capability_port.rs:534
- This lookup accepts any provider name recorded in the visible snapshot, including capabilities that
tool_definitions()deliberately filtered out because their schemas were not usable for provider tool calling. A provider can still return one of those unadvertised tool names and this path will register it for invocation, so the registration path should enforce the same advertised-tool set astool_definitions().
let Some(capability_id) = snapshot.provider_names.get(&tool_call.name).cloned() else {
return Err(AgentLoopHostError::new(
AgentLoopHostErrorKind::InvalidInvocation,
"provider tool call is outside the visible capability surface",
));
crates/ironclaw_loop_support/src/capability_port.rs:515
- Provider-call metadata is not validated before it is turned into a capability input. The only new validation happens later when the tool result is appended to the thread, so an oversized or invalid provider call can execute the capability side effect and then fail transcript persistence, leaving the run unable to replay the roundtrip safely. Validate the provider id/name/arguments limits before registering the input ref.
async fn register_provider_tool_call(
&self,
tool_call: ProviderToolCall,
) -> Result<ironclaw_turns::run_profile::CapabilityCallCandidate, AgentLoopHostError> {
crates/ironclaw_reborn/src/model_gateway.rs:855
- Assistant-level provider reasoning is reconstructed from the first per-call reasoning value, but
provider_tool_call_from_llmstores call-specific reasoning and response-level reasoning in the same field. When the first tool call has its own reasoning and the response also has shared reasoning, replay sends the call-specific text as the assistant message reasoning instead of the provider's response-level reasoning, which can break providers that require the exact reasoning artifact to be echoed.
let reasoning = provider_results
.iter()
.find_map(|(provider_call, _)| provider_call.reasoning.clone());
let assistant = ChatMessage::assistant_with_tool_calls(
crates/ironclaw_loop_support/src/capability_port.rs:769
- Provider tool names are generated directly from the full capability id without enforcing the 256-byte limit that
ProviderToolCallReferenceEnvelope::validateapplies later. A long but valid capability id can therefore be advertised to the provider and executed, then fail when the result metadata is appended; the generated name should be bounded or rejected before advertising the tool.
let base = capability_id.as_str().replace('.', "__");
if !existing.contains_key(&base) {
return base;
}
let digest = sha256_digest_token(capability_id.as_str().as_bytes());
format!("{}__{}", base, &digest[..8])
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
127b424 to
3f4e21c
Compare
|
Addressed the provider-tool-call review round in Summary:
Resolved the review threads that are covered by those changes. |
zmanian
left a comment
There was a problem hiding this comment.
Reviewed the diff carefully. The metadata side channel is structurally sound but a few correctness gaps in the replay path need attention before this lands.
Metadata isolation — looks correct
ThreadMessageRecord.tool_result_provider_call is stripped in history_messages() in both crates/ironclaw_threads/src/db.rs and crates/ironclaw_threads/src/in_memory.rs, while context_messages_with_summary_replacements retains it. So list_thread_history (product surface) sees only content, and load_context_window (model replay) sees the provider envelope. Redaction clears it. Test coverage on the in-memory and durable paths is real. Good.
One nit: nothing prevents a future caller from serializing a ThreadMessageRecord directly (it derives Serialize and the field has skip_serializing_if = Option::is_none but is otherwise public). A doc comment landed on the field, but it would be more defensible to make tool_result_provider_call pub(crate) or wrap accessors so the only emission path is the deliberately-scrubbing history fn. As-is, the isolation is enforced by convention, not by the type.
Multi-tool grouping — partial-failure replay is wrong
In crates/ironclaw_agent_loop/src/executor.rs, append_capability_result_ref only runs on CapabilityOutcome::Completed. Failures don't append a tool-result ref carrying provider_call. On the next turn, convert_messages in crates/ironclaw_reborn/src/model_gateway.rs groups consecutive ToolResult messages by provider_turn_id and reconstructs an assistant message containing only the completed tool calls.
That is structurally invalid for OpenAI / Anthropic: if the model originally emitted N tool calls in one turn, the provider expects exactly N matching tool-result messages. Replaying with K<N tool_calls and K tool_results drops the failed calls entirely from the assistant message — the provider sees a turn that never happened, and may produce different (or rejected) downstream behavior.
The two grouping tests (gateway_reconstructs_multi_tool_provider_turn_from_grouped_result_references, gateway_with_two_tool_calls_returns_two_candidates) only cover the all-success case. There is no test for partial-failure replay, no test for failure-only replay, and no test for what happens when a denied / suspended tool is in the batch.
Either:
- emit a tool-result envelope with
provider_callpopulated even on failure (with the safe summary describing the failure), so the assistant turn replays with its full original shape, or - record the assistant tool-call turn shape independently so reconstruction does not depend on which results survived.
Worth catching before this lands.
Cross-provider replay — no identity check
ProviderToolCallReferenceEnvelope carries provider_call_id, provider_tool_name, and signature verbatim into ChatMessage::assistant_with_tool_calls and ChatMessage::tool_result on replay (tool_result_replay_message → provider_tool_call_from_reference). Nothing in the envelope, the ToolResultReferenceEnvelope, or ProviderToolCallReplay records the originating provider or model.
Real failure modes on resume against a different provider:
- Anthropic call IDs are
toolu_*, OpenAI arecall_*. Replaying Anthropic IDs into an OpenAI request will be rejected by the API. signatureis an Anthropic reasoning-signature concept. Forwarding it into an OpenAI request is at best ignored, at worst rejected.- Tool-name normalization differs (
demo__echois fine for both, but the encoding rule lives inprovider_tool_name()and is not provider-specific — a future provider with stricter constraints would silently mismatch).
Either bind the replay metadata to the producing provider/model and refuse cross-provider replay, or strip/translate provider-specific fields on emit. As-is, a resume that crosses providers is a latent runtime failure with no early signal.
Schema gate vs invocation surface drift
provider_schema_is_usable (crates/ironclaw_loop_support/src/capability_port.rs) silently drops capabilities whose parameters_schema isn't { "type": "object", ... } from tool_definitions(), but the same capabilities remain invokable through invoke_capability. The model sees a smaller surface than the host enforces. For capabilities authored without a schema this is a soft failure — they become permanently invisible to provider-tool-call paths until someone notices. Worth at least a debug! when a capability is filtered out, and ideally a contract that capabilities advertised as visible always carry a usable schema.
Stacking on #3720
Base is reborn/3622-tool-result-evidence (#3720's head). Does not stand alone — the ProviderToolCallReferenceEnvelope validation primitives, tool_result_provider_call field plumbing, and the tool-result-reference envelope itself are introduced on the parent branch. If #3720 changes shape during review (envelope schema, validation rules) this PR's tests will need a rebase pass. Worth confirming the merge order is locked.
Smaller items
provider_turn_idis a sha256-truncated digest of(run, turn, model_call_seq, tool_call ids+names). The 16-hex-char truncation gives ~64 bits of entropy. Collisions across long-running threads are vanishingly unlikely but the grouping inconvert_messagesrelies entirely on string equality — a collision would silently merge two distinct provider turns into one assistant message. Either widen to 32 hex chars or pin grouping with a secondary check (e.g., contiguous sequence range in transcript).AtomicU64::fetch_add(1, Ordering::Relaxed)forprovider_turn_sequenceis fine for grouping but means turn ids are not deterministic across process restarts. If a gateway is recreated mid-thread, sequence resets to 1 and the same(run, turn)produces a newprovider_turn_id. Confirm this matches the intended replay semantics.LlmProviderModelGateway::stream_modelandRoutedLlmProviderModelGateway::stream_modelno longer passcapabilities— onlystream_model_with_capabilitiesdoes. The non-tool path is intentional, but means any code path that goes through plainstream_modelwill never get tool definitions even when capabilities are wired. Ensure no production caller path falls back to plainstream_modelby accident.
Approve the metadata side-channel direction. Block on partial-failure replay correctness and cross-provider identity handling.
Zaki
|
Addressed Zaki’s follow-up review in What changed:
Validation run locally:
|
c6c6925 to
be5888a
Compare
|
Follow-up after rebasing onto A review subagent checked local head Additional fixes included in
Local verification after the rebase:
The rebased branch has been force-pushed with lease. GitHub checks have started on the new head and are still pending. |
serrrfirat
left a comment
There was a problem hiding this comment.
Correctness review focused on regressions from the broad provider-tool-call changes. I found two medium-or-higher issues and did a second pass over the touched replay, filtering, and persistence paths without finding additional medium findings.
Findings:
-
P1 - Provider replay metadata is lost on history fallback.
ThreadBackedLoopModelPort::resolve_model_messagesfalls back tolist_thread_historywhen a requested message ref is outside the current context window. Both DB and in-memorylist_thread_historydeliberately cleartool_result_provider_callbefore returning history records, sohistory_messages_by_refrebuilds a model message without the provider side channel.convert_messagesthen treats that tool result as a plain system summary instead of reconstructing the assistant tool-call + tool-result round trip. Any older tool-result ref resolved through history can therefore lose the provider call id/signature/reasoning and break the next provider request. -
P2 - Provider-native tool definitions ignore the executor's per-iteration capability filter. The executor computes a
CapabilityFilterand applies it only to the local visible surface used for prompt/invocation checks. Later, the gateway fetches provider tool definitions fromcapabilities.tool_definitions(), which is built from the current snapshot without that local filter. The executor still denies filtered calls after the model asks for them, but the provider is being advertised tools that the current loop iteration intended to hide, leading to avoidable invalid tool calls and incorrect model context.
Verification run locally on the PR head:
cargo test -p ironclaw_threads --test session_thread_contract provider -- --nocapturecargo test -p ironclaw_loop_support capability_surface_filter::tests::provider_tool_call_denied_before_inner_registration_when_not_allowedcargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway
The first attempted cargo test -p ironclaw_reborn --test llm_gateway failed only because that test target requires root-llm-provider; rerunning with the feature passed.
zmanian
left a comment
There was a problem hiding this comment.
Re-review of 3 commits since my prior review (6657d33d, 194ee470, be5888af) on top of the rebased base.
Prior findings — status
1. Metadata isolation enforced by convention not type — Partial. ThreadMessageRecord now has a doc note and DB persistence is the only explicit emitter; in-memory list_thread_history still scrubs at the same spot (in_memory.rs:612). The field is still public on a Serialize struct. Acceptable for now but I'd still prefer pub(crate) with explicit accessors.
2. Multi-tool partial-failure replay broken — Resolved. executor.rs now appends a tool-result ref on all terminal non-success outcomes (SkipResult, FailRun/FailTurn, PolicyDenied Retry, gate SkipAndContinue/Abort, unsupported process wait, final-fallback). append_capability_error_ref/append_capability_safe_summary_ref carry the provider_call envelope so reconstructed assistant turns match the original tool-call shape. New test gateway_reconstructs_multi_tool_provider_turn_with_failure_safe_summary covers mixed success/failure. Good fix.
3. Cross-provider replay no identity check — Resolved. ProviderToolCallReferenceEnvelope now carries provider_id + provider_model_id, validated on emit (capability_port.rs::validate_provider_identity) and again on replay (model_gateway.rs::validate_provider_replay_identity), which fails closed with PolicyDenied on mismatch. LlmProviderModelGateway::new now takes an explicit provider_id instead of deriving from model_name(). Good.
4. Schema gate vs invocation surface drift — Partial. Filter now emits tracing::debug! when a capability is dropped from tool_definitions(). The invocation surface still accepts capabilities that were silently filtered (no positive contract that "visible ⇒ provider-usable"). Acceptable for follow-up; not a blocker.
5. Stacking on #3720 — Resolved by rebase. Base is now a60b062b3 post-rebase; both PRs need to land in order.
6. 64-bit provider_turn_id entropy + non-deterministic sequence — Partially resolved. Digest widened from 16 to 32 hex chars (sha256_hex_prefix(..., 32)) — collision risk now adequate. AtomicU64 sequence reset on gateway recreation is unchanged; that means the same (run, turn) can produce different provider_turn_ids across restarts. Grouping is per-turn so a restart mid-turn is the only real failure window; flag for future hardening.
New findings (still blocking)
A. History-fallback path loses tool_result_provider_call (also raised by serrrfirat). ThreadBackedLoopModelPort::resolve_model_messages in crates/ironclaw_loop_support/src/lib.rs:862 falls back to list_thread_history when a result_ref is outside the active context window. Both in_memory.rs:611-613 and the DB equivalent in db.rs deliberately clear tool_result_provider_call before returning history records. history_messages_by_ref (line 1189) then propagates that None into the rebuilt ContextMessage. convert_messages falls back to the system-summary branch and the assistant tool-call turn cannot be reconstructed. Concretely: any tool-result older than the current window will lose its provider envelope on replay, regressing exactly the case this PR was meant to fix. Either retain the envelope in the fallback path (separate accessor that does not scrub), or refuse fallback for tool-result refs and require they live in the context window.
B. Per-iteration CapabilityFilter not applied to tool_definitions (also raised by serrrfirat). CanonicalAgentLoopExecutor::run_batch computes surface_filter = planner.capability().filter(&state) and applies it to the local VisibleCapabilitySurface via apply_capability_filter (executor.rs:232/250). The model gateway then independently calls capabilities.tool_definitions() on the unfiltered port. CapabilitySurfaceProfileFilter only enforces the static run-profile allow set, not the dynamic per-iteration filter. Net effect: provider gets advertised tools that the executor will then deny. The right fix is to either pass the per-iteration filter through to the capability port the gateway sees, or have tool_definitions() consult the same filter source the executor uses.
Validation
I did not re-run tests locally; relying on the run reported in the PR comments. The new loop_driver_host and gateway tests look targeted; the partial-failure test in particular exercises the right replay shape.
Verdict
Findings 2, 3, 5 cleanly resolved. Findings 1, 4, 6 acceptable as-is or partial. Block on A (history fallback) and B (per-iteration filter not propagated). Both are correctness regressions on the exact replay path this PR is meant to harden.
Zaki
464aa31 to
2eea296
Compare
|
Addressed the provider-tool visibility concern in the latest push. The executor now passes the resolved model-visible capability IDs for the current iteration, and loop-support wraps the capability port with that resolved view before provider-native tool definitions/calls reach the model gateway. This keeps loop-specific filtering policy inside agent-loop while making provider tools respect the final visible surface.\n\nValidation run:\n- cargo test -p ironclaw_loop_support --test thread_loop_support_contract\n- cargo test -p ironclaw_agent_loop\n- cargo test -p ironclaw_turns --test agent_loop_host_contract\n- cargo test -p ironclaw_reborn --test loop_driver_host\n- cargo test -p ironclaw_reborn --test loop_milestone_event_projection\n- cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway\n- cargo clippy -p ironclaw_turns -p ironclaw_agent_loop -p ironclaw_loop_support -p ironclaw_reborn --tests -- -D warnings |
|
Correctness/security follow-up on the latest head ( Findings:
In That leaves
In The provider tool-call batch should be prevalidated before staging any inputs, or registration should be moved behind a batch API that can apply rollback/idempotency across the whole provider turn. Extra paranoid pass: I did not find additional non-duplicate medium-or-higher correctness/security issues in the latest diff. |
Code Review (multi-agent)Intent: Preserve provider tool-call metadata through the Reborn loop so provider-backed models receive valid tool-result roundtrips without exposing raw metadata in user-visible transcripts. Stats: 42 findings (from 42 raw, 42 after dedup) across 27 files Security (7 issues)
Bugs (5 issues)
Performance (11 issues)
Tests (15 issues)
Conventions (4 issues)
|
764f5b9 to
faa93b8
Compare
Review findings summaryAll 19 review comments from automated reviewers (Gemini, Codex, Copilot) and human reviewers (zmanian ×2, serrrfirat) have been triaged. No new changes are needed — every finding has been addressed by prior commits on this branch: Already resolved by commits on this branch
Duplicates (same issue, already resolved)
Suppressed (low-confidence, no action needed)
Verdict: All review findings are addressed. The PR is mergeable with a clean state. @zmanian @serrrfirat — please re-review if you have any remaining concerns. |
henrypark133
left a comment
There was a problem hiding this comment.
All review findings have been addressed by commits on this branch. Re-reviewing to confirm.
zmanian
left a comment
There was a problem hiding this comment.
Re-review against head faa93b84 (the "address provider replay review findings" commit) on top of the prior reviewer rounds.
Prior findings — status check
| # | Finding | Status |
|---|---|---|
| Zaki #1 | Metadata isolation enforced by convention not type | Partial — tool_result_provider_call is now skip_serializing on both ThreadMessageRecord (contract.rs:94) and HostManagedModelMessage (loop_support/lib.rs:1066). Acceptable. |
| Zaki #2 | Multi-tool partial-failure replay broken | Resolved via append_capability_safe_summary_ref on all terminal non-success paths. |
| Zaki #3 | Cross-provider replay no identity check | Resolved — validate_provider_replay_identity runs on each replayed envelope. |
| Zaki #4 | Schema gate vs invocation surface drift | Partial — tracing::debug! emitted on filter, no positive contract. Acceptable. |
| Zaki #6 | Turn id entropy + non-determinism | Resolved (32 hex chars). Restart-mid-turn sequence reset still latent. |
| Zaki A | History fallback loses tool_result_provider_call |
Resolved — resolve_model_messages now calls load_context_messages (loop_support/lib.rs:881-890) which preserves the side channel via context_messages_by_id in both in_memory.rs:621 and db.rs:1096. list_thread_history is only used for summary artifacts. |
| Zaki B | Per-iteration CapabilityFilter not applied to tool_definitions |
Resolved — new CapabilitySurfaceVisibleFilter (capability_surface_filter.rs:21-118) wraps the inner port; tool_definitions, register_provider_tool_call, visible_capabilities, and both invoke_capability* paths all enforce the model-visible set. |
| serrrfirat #1 (latest) | Synthetic provider error refs not recorded in loop state | Still open — see below. |
| serrrfirat #2 (latest) | Multi-tool provider registration not atomic | Partially addressed — see below. |
Still blocking
A. Synthetic provider error refs are appended to the transcript but never pushed into state.result_refs
faa93b84 does not touch crates/ironclaw_agent_loop/src/executor.rs, so serrrfirat's earlier finding is unchanged. Concretely:
append_capability_safe_summary_ref (executor.rs:1710-1726) writes the synthetic result:provider-error-… ref to the host transcript via host.append_capability_result_ref(...), but the only writer of state.result_refs is push_completed_result (executor.rs:1776-1785), which runs solely on the CapabilityOutcome::Completed branch. Every call site at executor.rs:791, 803, 821, 937, 989, 1003, 1031 calls append_capability_*_ref for a provider-backed non-success outcome (SkipResult, Abort, policy-denied Retry, gate SkipAndContinue / Abort, final fallback, unsupported process wait) and only updates state.recovery_state / state.gate_state — never state.result_refs.
Net effect: the transcript contains the synthetic ref (so model replay reconstructs the assistant tool-call shape via provider_tool_roundtrip_messages), but loop accounting (TurnSummary.batch_result_refs derived at executor.rs:460 from state.result_refs[result_refs_start..], final LoopExit.result_refs at line 1387, and checkpointed loop state) is silently missing those refs. Mixed success/failure provider batches will produce a LoopExit.result_refs whose length differs from the number of provider tool-call results, and downstream consumers (telemetry, batch accounting, retry/resumption logic) see fewer refs than were actually emitted.
The fix is either to have append_capability_safe_summary_ref return the LoopResultRef and push it into state.result_refs at each provider-backed non-success call site, or to wrap the append in a helper that does both atomically.
Test gap that masked this: the test at executor.rs:3387 only asserts host.appended_result_refs() (transcript view), not completed.result_refs (loop-state view) on a denied provider call.
B. Multi-tool provider registration is still not atomic for non-name failure modes
faa93b84 adds a useful upfront check at model_gateway.rs:784-799 that all tool calls in the batch are in advertised_tool_names before any registration, and the new test gateway_rejects_unknown_provider_tool_call_before_registration confirms capabilities.registered.lock().unwrap().is_empty() after a name-mismatch rejection. That closes the unknown-name path cleanly.
But after that gate passes, tool_response_to_host still calls register_provider_tool_call in a sequential loop (model_gateway.rs:800-814), and register_provider_tool_call (capability_port.rs:566-623) has several failure modes that fire after prior calls in the batch have already staged input via input_resolver.register_provider_tool_call_input(...):
validate_provider_tool_call(capability_port.rs:571) — turn-id/call-id/name length, 16 KiB arguments cap, JSON nesting depth, NUL/control chars, sensitive-marker rejection,sk-token rejection. Any of these only catch the offending call.current_snapshot()returningNonebetween iterations (StaleSurface).provider_names/capabilitiesmap miss (capability_port.rs:584-594).provider_schema_is_usable(capability_port.rs:596).input_resolver.register_provider_tool_call_inputitself returning an error on a later call.
Scenario: a model emits two correctly-named tool calls, the second of which carries a 20 KiB arguments blob. Upfront name check passes. Call 1 validates and stages an input ref via the input resolver. Call 2 fails validate_provider_arguments. The whole response is rejected, but call 1's input ref is orphaned in the resolver. A retry can stage duplicates.
Either prevalidate the whole batch (run validate_provider_tool_call plus the visibility/schema checks on each tool_call in a first pass before any side effects), or move registration behind a batch API that can roll back / dedupe across the full provider turn. The current test only exercises the upfront name-check path; a partial-failure test against arguments-size / sensitive-content / unusable-schema would have caught this and should land with the fix.
Smaller observations (non-blocking)
validate_provider_replay_identity(model_gateway.rs:984-1003) now runsprovider_call.validate()first — good. It only matchesprovider_id/provider_model_id, not turn/call scope, which is the right call (turn/call IDs are provider-emitted and have no host-side expected value), but worth pinning in a doc-comment so a future reader doesn't try to "tighten" it into something that breaks replay.CapabilitySurfaceVisibleFilter::register_provider_tool_call(capability_surface_filter.rs:56-85) does a TOCTOU-style check: readtool_definitions()to map name→capability, then callinner.register_provider_tool_call. The post-call permits check oncandidate.capability_id(line 78) catches the visibility-change race, but the inner registration's side effects (input-resolver staging) still happen — same orphan shape as B above, just one layer up. Easier to address once B has a batch-validation primitive.- The previously-flagged "synthetic ref written but
state.result_refsnot updated" test gap could be closed with a single new test assertingLoopExit::Completed { result_refs, .. }after a mixed-success provider batch.
Verification
I read the diff and the files at PR head faa93b84; I did not re-run the test suite locally. CI is still pending on this head (status pending, no checks reported yet).
Verdict
Findings A (history fallback) and B (per-iteration filter) from the previous round are cleanly resolved. The most recent serrrfirat findings are not fully resolved by faa93b84 despite the "all findings addressed" summary comment:
- Block on A (loop-state result_refs missing synthetic provider error refs — correctness regression for loop accounting on every provider-backed non-success outcome).
- Block on B (atomicity gap remains for argument-size / sensitive-content / unusable-schema failures after the first registered call, even with the new upfront name check).
Both are tractable; B in particular is largely a prevalidate-then-stage refactor of the existing per-call validation in register_provider_tool_call.
Generated by Claude Code
|
Addressed the open review feedback from #3722 (review) in What changed:
Regression coverage added for:
Local validation:
GitHub checks have started on the new head and are still in progress. |
serrrfirat
left a comment
There was a problem hiding this comment.
Multi-agent review of nearai/ironclaw#3722 at head 421a939a5bfd28bb21b425145fc5e2424e8b8f4a.
I found 4 medium-severity issues that should be addressed before merge:
-
crates/ironclaw_agent_loop/src/executor.rs:273- Filtered-out capabilities can still be materialized into the model prompt.The executor applies the planner capability filter to
surfaceand passes the filtered IDs only ascapability_viewfor provider tool definitions, but the prompt request still carries only the unchangedsurface.version. Prompt construction can resolve that version through the host surface lookup and materialize the pre-strategy host/profile surface, so the model may receive names/descriptions for capabilities removed from this iteration. Bind prompt construction to the filtered surface as well, either by carrying filtered capability IDs/surface in the prompt request or by making the prompt authority resolve this request against the filtered descriptor set. Add a regression that asserts both the materialized prompt and provider tool definitions exclude a denied capability. -
crates/ironclaw_loop_support/src/capability_port.rs:694- Capability surface snapshot locks are acquired in opposite orders.current_snapshot()lockscurrent_surface_versionand thensnapshots, whilevisible_capabilities()lockssnapshotsand thencurrent_surface_version. Concurrent model/tool surface access can deadlock the shared port. Put the current version and snapshots behind a single mutex, or enforce one lock acquisition order everywhere. -
crates/ironclaw_loop_support/src/capability_port.rs:863- Provider tool names are not constrained to the LLM provider contract.provider_tool_name()only replaces dots and otherwise exposes the capability ID as the provider tool name when it is <= 256 bytes. The inbound validator even allows.,:,_, and-, but common provider tool/function name contracts are stricter than this. A valid IronClaw capability ID can therefore produce a tool definition that a provider rejects before the model sees the visible tool surface. Normalize provider tool names at this boundary to the strict shared provider-safe contract and use a digest suffix for truncation/collisions; validate inbound provider tool names against the same advertised-name contract. -
crates/ironclaw_threads/src/db.rs:769- Idempotent tool-result replay can permanently drop provider replay metadata.append_tool_result_reference()returns an existing finalized record based only onturn_run_idandtool_result_refbefore validating or comparing the newprovider_call. If an older/fallback write created that record withtool_result_provider_call: None, a retry with provider metadata now available will return the stale record and never backfill the side channel. BackfillNone -> Some(provider_call)on idempotent replay, and fail loudly if both sides areSomebut differ. Mirror the same behavior in the in-memory store.
|
Addressed the four latest serrrfirat review findings in
Validation run:
|
serrrfirat
left a comment
There was a problem hiding this comment.
Re-review against current head 3f78ca9b, focused on the follow-up commits after the prior faa93b84 review and the latest automated review at 421a939a.
The shape is right: the follow-up now threads the per-iteration capability_view into both prompt construction and provider tool definitions, pre-validates provider tool batches before registration, constrains provider tool names to the stricter provider-safe contract, fixes the capability-port lock ordering, and backfills/rejects idempotent provider replay metadata consistently across in-memory and DB thread stores. The targeted test additions cover the intended invariants.
One thing still blocks merge:
- The current head does not compile under the PR checks
Both failing CI jobs are stopped by the same Rust compile error: LoopPromptBundleRequest gained required field capability_view, but crates/ironclaw_reborn/tests/loop_driver_host.rs:2750 still initializes it without that field in the inline-message rejection test.
This is caller-visible in the quality gate rather than a runtime design issue, but it blocks all downstream verification: Clippy (all-features) and Reborn CLI smoke tests both fail with E0063: missing field capability_view in initializer of LoopPromptBundleRequest at loop_driver_host.rs:2758 in the current CI run.
Please add capability_view: None to that initializer and rerun the affected checks. A good minimal verification is the failing loop_driver_host test target plus the existing provider-tool replay tests.
Verification I checked:
- GitHub CI logs for run
26058596870:Clippy (all-features)andReborn CLI smoke testsboth fail on the missing field above. - Local focused check passed:
cargo test -p ironclaw_reborn --features root-llm-provider --test llm_gateway gateway_rejects_invalid_provider_tool_batch_before_any_registration -- --exact.
Net: prior provider replay correctness issues look addressed on current head, but this cannot merge until the test compile failure is fixed.
|
Fixed the compile blocker from review Change: added the missing Validation run from an isolated worktree:
|
serrrfirat
left a comment
There was a problem hiding this comment.
Multi-agent review of nearai/ironclaw#3722 at head 3f78ca9bee4a640b25ece2babcc2e3d0ba2fe144 with --force.
Reviewers completed: security, bugs, tests, conventions. The performance/concurrency reviewer failed to inspect the checkout due to a subprocess launch error, so I excluded it rather than treating it as clean.
Findings kept after validation:
- High - provider-backed tool calls cannot stage their input on the live HostRuntime capability path.
- High - current head fails CI because one
LoopPromptBundleRequestinitializer still omitscapability_view. - Medium - adjacent provider tool results from different turns need a regression test.
- Medium -
FinishReason::ToolUsewith an empty tool-call list needs a gateway regression test. - Medium - run-profile provider tool advertisement filtering needs a regression test.
- Medium - provider replay identifiers remain raw
Strings across internal module boundaries.
I dropped one conventions finding as a false positive: durable thread persistence does retain tool_result_provider_call through to_message_json, and the DB contract tests assert reload behavior.
| let prepared = self.prepare_provider_tool_call(&tool_call)?; | ||
| let input_ref = self | ||
| .input_resolver | ||
| .register_provider_tool_call_input(&self.run_context, &tool_call) |
There was a problem hiding this comment.
HostRuntimeLoopCapabilityPort::register_provider_tool_call now calls register_provider_tool_call_input, but LoopCapabilityInputResolver only provides a default implementation that returns InvalidInvocation, and the concrete resolver in this checkout does not override it. The gateway can advertise provider tools and accept a provider tool call, but registration fails before producing a CapabilityCallCandidate, so the live provider-tool-call path cannot execute. Please implement provider tool-call argument staging for the resolver wired into HostRuntimeLoopCapabilityPortFactory, or make this trait method required and provide the concrete store-backed implementation.
| continue; | ||
| }; | ||
| validate_provider_replay_identity(&next_provider_call, replay_identity)?; | ||
| if next_provider_call.provider_turn_id != provider_turn_id { |
There was a problem hiding this comment.
The replay converter breaks the current provider-tool group when the next tool result has a different provider_turn_id, but the added gateway tests cover same-turn grouping and plain-result interleaving only. Please add a regression such as gateway_separates_consecutive_provider_tool_results_from_different_turns with adjacent ToolResult messages for turn_1 then turn_2, so a future change cannot merge turns or skip the second replay.
| HostManagedModelErrorKind::PolicyDenied, | ||
| "model response was blocked by provider policy", | ||
| )), | ||
| FinishReason::ToolUse => Err(HostManagedModelError::safe( |
There was a problem hiding this comment.
This new error branch for FinishReason::ToolUse with no tool calls is not covered by the current gateway tests. Please add a regression such as gateway_with_tool_surface_rejects_tool_use_finish_without_tool_calls where complete_with_tools returns ToolUse and an empty tool_calls list.
|
|
||
| #[async_trait] | ||
| impl LoopCapabilityPort for CapabilitySurfaceProfileFilter { | ||
| fn tool_definitions(&self) -> Result<Vec<ProviderToolDefinition>, AgentLoopHostError> { |
There was a problem hiding this comment.
CapabilitySurfaceProfileFilter::tool_definitions now filters provider tool advertisements by the run-profile allow set, but the adjacent tests only verify denied registration. Please add a regression such as profile_filter_tool_definitions_excludes_denied_capabilities that asserts the provider tool definitions expose only allowlisted capabilities.
| #[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] | ||
| pub struct ProviderToolCallReplay { | ||
| /// Provider identity selected by the host route. | ||
| pub provider_id: String, |
There was a problem hiding this comment.
Provider replay metadata is now carried across turns, loop support, Reborn, and threads as raw String identifiers (provider_id, provider_model_id, turn/call IDs, and provider tool names). That conflicts with .claude/rules/types.md, which requires boundary strings to be converted into specialized domain types before flowing between internal modules. Please introduce validated newtypes for these provider replay identifiers in the owning contract layer and use them through the replay structs.
Summary
Stacked on #3720 / reborn/3622-tool-result-evidence.
Verification