fix(agent): preserve streamed reasoning content for tool replay#5606
fix(agent): preserve streamed reasoning content for tool replay#5606tompro wants to merge 6 commits intozeroclaw-labs:masterfrom
Conversation
CI Failure Analysis — #5606Comprehension summary: This PR fixes a bug where streamed tool-calling turns rebuilt a synthetic Why CI is failingThe Security Audit check is failing due to pre-existing
The CI Required Gate in the CI workflow passes. All other checks (Lint, Test, Build on all platforms, Check 32-bit, Strict Delta Lint, Docs Quality, Verify Benchmarks Compile) pass. What to fixNothing on the contributor's side. This PR's code compiles, passes all tests, passes lint, and builds on all targets. The Security Audit failure is a repo-wide pre-existing issue that affects all open PRs equally. The repository maintainers need to either:
This PR is not blocked by anything it introduced. |
Agent Review — Ready to MergeComprehension summary: This PR fixes a bug where streamed tool-calling turns rebuilt a synthetic Thank you, @tompro. The fix is minimal, correct, and well-targeted. What was reviewed and verified:
Security/performance assessment:
CI Status: The Template completeness: Fully completed with all required sections properly filled. Exemplary PR discipline. This PR is ready for maintainer merge.
|
|
This PR has merge conflicts with master. Could you rebase to resolve them? Once conflicts are cleared, this is agent-approved and ready to merge. |
2a40818 to
4eb27a1
Compare
singlerider
left a comment
There was a problem hiding this comment.
Agent Review — Routing: Needs Maintainer Review
@WareWolf-MoonWall @JordanTheJet — this PR modifies crates/zeroclaw-runtime/src/agent/agent.rs, a high-risk path requiring maintainer sign-off.
DRY note: @theonlyhennygod already fully reviewed and approved this PR ("Ready to Merge" verdict). No new code findings. This comment exists solely to ensure maintainer eyes land on the runtime path changes before merge.
What the change does: Fixes a bug where streamed tool-calling turns rebuilt a synthetic ChatResponse without preserving the streamed reasoning content, causing reasoning to be dropped during tool replay. The fix carries reasoning_content through the synthetic response assembly. 118 lines added, 1 deleted.
CI: all checks passing.
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
PR Review — #5606 fix(agent): preserve streamed reasoning content for tool replay
I've read the full diff, the linked issue (#5600), all prior review threads, and the relevant foundations.
What this change does
During turn_streamed, the agent accumulates streaming events into a synthetic ChatResponse before handing it to the tool dispatcher. Before this fix, that synthetic response hardcoded reasoning_content: None, so when Kimi-compatible providers received the replayed assistant tool-call message in the next turn, they returned HTTP 400: "thinking is enabled but reasoning_content is missing in assistant tool call message." The fix adds a streamed_reasoning accumulator alongside the existing streamed_text one, collects reasoning deltas as they arrive, and populates reasoning_content in the synthetic response using (!streamed_reasoning.is_empty()).then_some(streamed_reasoning) — preserving None for providers that don't emit reasoning events.
The fix is minimal, precisely targeted, and does not touch non-streaming paths, tool dispatch serialization, or any provider API contract.
The merge conflict flag raised by @theonlyhennygod has been resolved — the author has kept the branch current with three master merges, most recently April 16. The branch is clean and mergeable.
✅ Commendation
Two things done well:
(!streamed_reasoning.is_empty()).then_some(streamed_reasoning) is the right idiom here. It keeps reasoning_content absent — not an empty string — when no reasoning was streamed. An empty string and None are semantically different to a provider that validates this field: None means "not a thinking-enabled turn," Some("") could cause a different class of validation failure. The choice to use then_some rather than Some(streamed_reasoning) unconditionally shows the author thought about the provider contract, not just the Rust type.
The regression test constructs a StreamingReasoningProvider mock that correctly exercises the full two-turn path: stream_chat call 0 emits reasoning + tool call → tool executes → stream_chat call 1 returns empty → falls back to chat → seen_requests captures the message history. Asserting on the replayed assistant payload (the chat call) rather than on the intermediate streamed state means the test verifies what the provider actually receives, which is exactly what was broken in #5600.
🔴 Blocking — risk label mismatch
The PR body declares risk: medium. The applied label is risk: high. As with other PRs touching crates/zeroclaw-runtime/src/**, the high label is correct per AGENTS.md — that path is explicitly listed as high-risk. The label snapshot in the PR body needs to reflect risk: high so the audit trail is consistent. This is a one-line update.
🟡 Conditional — multi-chunk reasoning accumulation edge case needs a follow-up
The streamed_reasoning accumulator concatenates all reasoning deltas with push_str. For the single-chunk case in the regression test this is correct. The question worth tracking is what happens when a provider streams reasoning across many small chunks with leading/trailing whitespace or structural delimiters — the concatenation is lossless at the byte level, but some providers structure reasoning output with newlines or markers between "thinking steps" that may be meaningful for replay fidelity.
This isn't a reason to block the PR — the behavior is strictly better than None in every case, and the regression test covers the critical path from #5600. But I'd like to see a follow-up issue filed to track whether reasoning replay fidelity across multi-chunk reasoning streams needs any normalization, particularly as more providers adopt thinking modes. The current PR description notes this in the risks section; making it a tracked issue with an owner closes the loop on the conditional acceptance.
To @tompro
The fix is correct, the test is well-structured, and the PR discipline is strong — the template is one of the most completely filled I've seen. Two items before merge: update the risk label in the PR body from medium to high, and file a follow-up issue for the multi-chunk reasoning accumulation question (brief, just needs to capture the question and assign an owner). The blocking item is a one-line change; the conditional can be done concurrently.
Thank you for keeping the branch current through the merge conflicts. Kimi tool-use being entirely blocked (S1 severity on #5600) makes this worth getting across the line cleanly.
|
@WareWolf-MoonWall added the follow up here #5840 and changed the risk label accordingly. |
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
Re-review following rebase. All 20 CI checks green, including the new Quality Gate. Previous blockers (merge conflicts, wasmtime advisory CI failures) are fully resolved.
✅ The fix
Three production lines, perfectly symmetric with the existing streamed_text accumulation pattern. streamed_reasoning follows the same lifecycle: initialized empty, accumulated from deltas, conditionally present in the synthetic response. The (!streamed_reasoning.is_empty()).then_some(streamed_reasoning) idiom is idiomatic and correct — reasoning_content stays None for providers that don't emit reasoning, preserving all existing behaviour exactly.
✅ The test
The StreamingReasoningProvider mock is well-constructed — it emits a reasoning delta followed by a tool call event, which is exactly the sequence that triggered the Kimi 400. The assertion on the serialized assistant payload directly validates what the provider receives on replay. This is the right kind of regression test: it covers the specific failure path, it will catch any future regression immediately, and it documents the contract in executable form.
✅ Follow-up discipline
Filing #5840 for multi-chunk reasoning replay normalization before this merged was the right call — it keeps this PR focused on the confirmed regression and defers the broader question to a tracked, owned issue.
Thank you @tompro. Clean fix, strong test, good follow-up hygiene.
Summary
masterfor all contributions):masterChatResponsewithreasoning_content: None, so Kimi-compatible providers rejected the replayed assistant tool-call message.400when thinking is enabled butreasoning_contentis missing on assistant tool-call history, which blocks the entire tool-use workflow in issue [Bug]: Use kimi-code provider in streaming chat call tools, provider API reports an error #5600.Agent::turn_streamedand add a regression test that verifies assistant tool-call replay includesreasoning_content.Label Snapshot (required)
risk: low|medium|high):risk: highsize: XS|S|M|L|XL, auto-managed/read-only):size: S(expected)core|agent|channel|config|cron|daemon|doctor|gateway|health|heartbeat|integration|memory|observability|onboard|provider|runtime|security|service|skillforge|skills|tool|tunnel|docs|dependencies|ci|tests|scripts|dev, comma-separated):agent<module>: <component>, for examplechannel: telegram,provider: kimi,tool: shell):N.A.trusted contributor|experienced contributor|principal contributor|distinguished contributor, auto-managed/read-only; author merged PRs >=5/10/20/50):auto-managedNoneChange Metadata
bug|feature|refactor|docs|security|chore):bugruntime|provider|channel|memory|security|ci|docs|multi):runtimeLinked Issue
Supersede Attribution (required when
Supersedes #is used)#<pr> by @<author>, one per line):N.A.N.A.Co-authored-bytrailers added for materially incorporated contributors? (Yes/No):NoNo, explain why (for example: inspiration-only, no direct code/design carry-over):No superseded PR.\n): (Pass/Fail):PassValidation Evidence (required)
Commands and result summary:
cargo fmt --all -- --check cargo clippy --all-targets -- -D warnings cargo testAll commands passed locally. Also verified targeted regression test turn_streamed_preserves_reasoning_content_for_tool_call_replay and the user confirmed the local Kimi workflow succeeds with this patch.None.Security Impact (required)
Yes/No):NoYes/No):NoYes/No):NoYes/No):NoYes, describe risk and mitigation:N.A.Privacy and Data Hygiene (required)
pass|needs-follow-up):passNo user data, secrets, or identity-like fixtures added.Confirmed.Compatibility / Migration
Yes/No):YesYes/No):NoYes/No):NoN.A.i18n Follow-Through (required when docs or user-facing wording changes)
Yes/No):NoYes, locale navigation parity updated inREADME*,docs/README*, anddocs/SUMMARY.mdfor supported locales (en,zh-CN,ja,ru,fr,vi)? (Yes/No):N.A.Yes, localized runtime-contract docs updated where equivalents exist (minimum forfr/vi:commands-reference,config-reference,troubleshooting)? (Yes/No/N.A.):N.A.Yes, Vietnamese canonical docs underdocs/i18n/vi/**synced and compatibility shims underdocs/*.vi.mdvalidated? (Yes/No/N.A.):N.A.No/N.A., link follow-up issue/PR and explain scope decision:Rust-only bugfix; no user-facing docs or strings changed.Human Verification (required)
What was personally validated beyond CI:
streamed reasoning deltas are preserved into assistant tool-call replay history via the new regression test; full local fmt/clippy/test suite passed; user validated the local Kimi tool-call workflow after applying the patch.empty streamed reasoning still omits reasoning_content, existing non-streaming and dispatcher serialization behavior remain unchanged.A live Kimi API call was not executed directly from this environment.Side Effects / Blast Radius (required)
crates/zeroclaw-runtime/src/agentstreamed tool-call loop, especially providers that require reasoning parity on replayed assistant messages.`streaming providers now retain reasoning text in the synthetic response, which could expose provider replay assumptions if a backend emits malformed reasoning fragments.regression test coverage plus CI Required Gate; failures should surface as replay/tool-call test regressions or provider 400s in runtime logs.Agent Collaboration Notes (recommended)
Read, Bash, LSP diagnostics, Oracle-style read-only review.Traced the streamed tool-call path, confirmed dispatcher serialization already handled reasoning_content, patched the streamed response assembly, rebased the fix into the workspace layout, and added a focused regression test.reasoning_content preservation for replayed assistant tool-call history plus full repo Rust validation.AGENTS.md+CONTRIBUTING.md):YesRollback Plan (required)
git revert 4eb27a12or revert this PR before release.`None.streamed tool-call turns regress, replay payload assertions fail, or provider-specific tool-call workflows start returning 400 errors again.Risks and Mitigations
Providers may emit streamed reasoning in chunk patterns that concatenate into unexpected text.The change only preserves already-emitted reasoning data, keeps the field absent when empty, is covered by a regression test that exercises the replay path implicated in #5600, and follow-up issue #5840 tracks whether multi-chunk reasoning replay needs normalization.