fix(agent): preserve streamed reasoning content for tool replay#5606

Closed
tompro wants to merge 6 commits into zeroclaw-labs:master from tompro:fix/kimi-streamed-reasoning-replay

Conversation

@tompro tompro commented Apr 10, 2026

Summary

  • Base branch target (master for all contributions): master
  • Problem: streamed tool-calling turns rebuilt a synthetic ChatResponse with reasoning_content: None, so Kimi-compatible providers rejected the replayed assistant tool-call message.
  • Why it matters: Kimi returns 400 when thinking is enabled but reasoning_content is missing on assistant tool-call history, which blocks the entire tool-use workflow reported in issue #5600 ("[Bug]: Use kimi-code provider in streaming chat call tools, provider API reports an error").
  • What changed: preserve streamed reasoning deltas in Agent::turn_streamed and add a regression test that verifies assistant tool-call replay includes reasoning_content.
  • What did not change (scope boundary): no provider API contracts, tool-dispatch serialization logic, or non-streaming tool-call behavior were changed.

Label Snapshot (required)

  • Risk label (risk: low|medium|high): risk: high
  • Size label (size: XS|S|M|L|XL, auto-managed/read-only): size: S (expected)
  • Scope labels (core|agent|channel|config|cron|daemon|doctor|gateway|health|heartbeat|integration|memory|observability|onboard|provider|runtime|security|service|skillforge|skills|tool|tunnel|docs|dependencies|ci|tests|scripts|dev, comma-separated): agent
  • Module labels (<module>: <component>, for example channel: telegram, provider: kimi, tool: shell): N.A.
  • Contributor tier label (trusted contributor|experienced contributor|principal contributor|distinguished contributor, auto-managed/read-only; author merged PRs >=5/10/20/50): auto-managed
  • If any auto-label is incorrect, note requested correction: None

Change Metadata

  • Change type (bug|feature|refactor|docs|security|chore): bug
  • Primary scope (runtime|provider|channel|memory|security|ci|docs|multi): runtime

Linked Issue

Supersede Attribution (required when Supersedes # is used)

  • Superseded PRs + authors (#<pr> by @<author>, one per line): N.A.
  • Integrated scope by source PR (what was materially carried forward): N.A.
  • Co-authored-by trailers added for materially incorporated contributors? (Yes/No): No
  • If No, explain why (for example: inspiration-only, no direct code/design carry-over): No superseded PR.
  • Trailer format check (separate lines, no escaped \n): (Pass/Fail): Pass

Validation Evidence (required)

Commands and result summary:

cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo test
  • Evidence provided (test/log/trace/screenshot/perf): All commands passed locally. Also verified targeted regression test turn_streamed_preserves_reasoning_content_for_tool_call_replay and the user confirmed the local Kimi workflow succeeds with this patch.
  • If any command is intentionally skipped, explain why: None.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • New external network calls? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • File system access scope changed? (Yes/No): No
  • If any Yes, describe risk and mitigation: N.A.

Privacy and Data Hygiene (required)

  • Data-hygiene status (pass|needs-follow-up): pass
  • Redaction/anonymization notes: No user data, secrets, or identity-like fixtures added.
  • Neutral wording confirmation (use ZeroClaw/project-native labels if identity-like wording is needed): Confirmed.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N.A.

i18n Follow-Through (required when docs or user-facing wording changes)

  • i18n follow-through triggered? (Yes/No): No
  • If Yes, locale navigation parity updated in README*, docs/README*, and docs/SUMMARY.md for supported locales (en, zh-CN, ja, ru, fr, vi)? (Yes/No): N.A.
  • If Yes, localized runtime-contract docs updated where equivalents exist (minimum for fr/vi: commands-reference, config-reference, troubleshooting)? (Yes/No/N.A.): N.A.
  • If Yes, Vietnamese canonical docs under docs/i18n/vi/** synced and compatibility shims under docs/*.vi.md validated? (Yes/No/N.A.): N.A.
  • If any No/N.A., link follow-up issue/PR and explain scope decision: Rust-only bugfix; no user-facing docs or strings changed.

Human Verification (required)

What was personally validated beyond CI:

  • Verified scenarios: streamed reasoning deltas are preserved into assistant tool-call replay history via the new regression test; full local fmt/clippy/test suite passed; user validated the local Kimi tool-call workflow after applying the patch.
  • Edge cases checked: empty streamed reasoning still omits reasoning_content, existing non-streaming and dispatcher serialization behavior remain unchanged.
  • What was not verified: A live Kimi API call was not executed directly from this environment.

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: crates/zeroclaw-runtime/src/agent streamed tool-call loop, especially providers that require reasoning parity on replayed assistant messages.
  • Potential unintended effects: streaming providers now retain reasoning text in the synthetic response, which could expose provider replay assumptions if a backend emits malformed reasoning fragments.
  • Guardrails/monitoring for early detection: regression test coverage plus CI Required Gate; failures should surface as replay/tool-call test regressions or provider 400s in runtime logs.

Agent Collaboration Notes (recommended)

  • Agent tools used (if any): Read, Bash, LSP diagnostics, Oracle-style read-only review.
  • Workflow/plan summary (if any): Traced the streamed tool-call path, confirmed dispatcher serialization already handled reasoning_content, patched the streamed response assembly, rebased the fix into the workspace layout, and added a focused regression test.
  • Verification focus: reasoning_content preservation for replayed assistant tool-call history plus full repo Rust validation.
  • Confirmation: naming + architecture boundaries followed (AGENTS.md + CONTRIBUTING.md): Yes

Rollback Plan (required)

  • Fast rollback command/path: git revert 4eb27a12 or revert this PR before release.
  • Feature flags or config toggles (if any): None.
  • Observable failure symptoms: streamed tool-call turns regress, replay payload assertions fail, or provider-specific tool-call workflows start returning 400 errors again.

Risks and Mitigations

  • Risk: Providers may emit streamed reasoning in chunk patterns that concatenate into unexpected text.
    • Mitigation: The change only preserves already-emitted reasoning data, keeps the field absent when empty, is covered by a regression test that exercises the replay path implicated in #5600, and follow-up issue #5840 tracks whether multi-chunk reasoning replay needs normalization.

@singlerider
Collaborator

@tompro See #5559's summary for a guide on how to resolve the recent merge conflicts.

@theonlyhennygod
Collaborator

CI Failure Analysis — #5606

Comprehension summary: This PR fixes a bug where streamed tool-calling turns rebuilt a synthetic ChatResponse with reasoning_content: None, causing Kimi-compatible providers to reject the replayed assistant tool-call message with a 400 error. The fix preserves streamed reasoning deltas in Agent::turn_streamed and adds a regression test. Blast radius is limited to src/agent/agent.rs — specifically the streamed tool-call response assembly path.

Why CI is failing

The Security Audit check is failing due to pre-existing wasmtime crate advisories on master (RUSTSEC-2026-0088, 0089, 0092, 0095, 0096). These are inherited from the Tauri desktop app dependency and are not introduced by this PR. The cascade is:

  1. Security Audit fails (wasmtime advisories)
  2. Security Required Gate fails (depends on Security Audit)
  3. CI Required Gate (Quality Gate workflow) fails (depends on Security Required Gate)

The CI Required Gate in the CI workflow passes. All other checks (Lint, Test, Build on all platforms, Check 32-bit, Strict Delta Lint, Docs Quality, Verify Benchmarks Compile) pass.

What to fix

Nothing on the contributor's side. This PR's code compiles, passes all tests, passes lint, and builds on all targets. The Security Audit failure is a repo-wide pre-existing issue that affects all open PRs equally.

The repository maintainers need to either:

  • Update or pin wasmtime to a patched version
  • Add the advisories to an audit.toml ignore list if they are acknowledged/accepted

This PR is not blocked by anything it introduced.

@theonlyhennygod theonlyhennygod self-assigned this Apr 12, 2026
@theonlyhennygod
Collaborator

Agent Review — Ready to Merge

Comprehension summary: This PR fixes a bug where streamed tool-calling turns rebuilt a synthetic ChatResponse with reasoning_content: None, causing Kimi-compatible providers to reject the replayed assistant tool-call message with HTTP 400 (Kimi requires reasoning_content on assistant messages when thinking mode is enabled). The fix accumulates streamed reasoning deltas into a streamed_reasoning String during turn_streamed and populates reasoning_content in the synthetic response when non-empty. A focused regression test verifies the assistant tool-call replay includes reasoning_content. Blast radius: src/agent/agent.rs streamed tool-call loop only; no impact on non-streaming paths or providers that don't require reasoning replay.

Thank you, @tompro. The fix is minimal, correct, and well-targeted.

What was reviewed and verified:

  • Code correctness: The fix adds streamed_reasoning accumulation alongside the existing streamed_text pattern. The conditional (!streamed_reasoning.is_empty()).then_some(streamed_reasoning) correctly keeps reasoning_content as None when no reasoning was streamed, preserving existing behavior for non-reasoning providers.
  • Regression analysis: Non-streaming path is unchanged. Providers that don't emit reasoning events continue to get reasoning_content: None. The test explicitly verifies the replay payload structure.
  • Test coverage: New turn_streamed_preserves_reasoning_content_for_tool_call_replay test with a StreamingReasoningProvider mock that emits reasoning + tool call events. Test verifies assistant payload contains reasoning_content and tool_calls.
  • Privacy/data hygiene: Pass — no PII, test data uses system-scoped content.
  • Architecture alignment: Follows existing streaming event handling patterns.

Security/performance assessment:

  • Security: No security impact.
  • Performance: Negligible — one additional String allocation for reasoning accumulation during streamed tool-call turns.

CI Status: The CI Required Gate failure on Quality Gate is from the known wasmtime RUSTSEC-2026-04-09 Security Audit issue (not this PR). The CI workflow's CI Required Gate passes. All lint, test, and build checks pass.

Template completeness: Fully completed with all required sections properly filled. Exemplary PR discipline.

This PR is ready for maintainer merge.


| Field | Content |
| --- | --- |
| PR | #5606 (fix(agent): preserve streamed reasoning content for tool replay) |
| Author | @tompro |
| Summary | Accumulates streamed reasoning deltas for Kimi-compatible tool-call replay |
| Action | Ready to merge |
| Reason | Correct fix, focused scope, all tests pass, no outstanding findings |
| Security/performance | No security impact; negligible performance cost |
| Changes requested | None |
| Architectural notes | Follows existing streaming accumulation pattern; clean symmetry with streamed_text |
| Tests | Full suite passes; new regression test covers the exact failure mode |
| Notes | CI Required Gate failure on Quality Gate is known infra issue (wasmtime RUSTSEC), not this PR |

@theonlyhennygod theonlyhennygod added the agent-approved PR approved by automated review agent label Apr 12, 2026
@theonlyhennygod
Collaborator

This PR has merge conflicts with master. Could you rebase to resolve them? Once conflicts are cleared, this is agent-approved and ready to merge.

@tompro tompro force-pushed the fix/kimi-streamed-reasoning-replay branch from 2a40818 to 4eb27a1 Compare April 13, 2026 06:56
@github-actions github-actions bot removed the agent Auto scope: src/agent/** changed. label Apr 13, 2026
@singlerider singlerider added risk: high Auto risk: security/runtime/gateway/tools/workflows. size: S Auto size: 81-250 non-doc changed lines. needs-maintainer-review labels Apr 17, 2026
Collaborator

@singlerider singlerider left a comment

Agent Review — Routing: Needs Maintainer Review

@WareWolf-MoonWall @JordanTheJet — this PR modifies crates/zeroclaw-runtime/src/agent/agent.rs, a high-risk path requiring maintainer sign-off.

DRY note: @theonlyhennygod already fully reviewed and approved this PR ("Ready to Merge" verdict). No new code findings. This comment exists solely to ensure maintainer eyes land on the runtime path changes before merge.

What the change does: Fixes a bug where streamed tool-calling turns rebuilt a synthetic ChatResponse without preserving the streamed reasoning content, causing reasoning to be dropped during tool replay. The fix carries reasoning_content through the synthetic response assembly. 118 lines added, 1 deleted.

CI: all checks passing.

Collaborator

@WareWolf-MoonWall WareWolf-MoonWall left a comment

PR Review — #5606 fix(agent): preserve streamed reasoning content for tool replay

I've read the full diff, the linked issue (#5600), all prior review threads, and the relevant foundations.


What this change does

During turn_streamed, the agent accumulates streaming events into a synthetic ChatResponse before handing it to the tool dispatcher. Before this fix, that synthetic response hardcoded reasoning_content: None, so when Kimi-compatible providers received the replayed assistant tool-call message in the next turn, they returned HTTP 400: "thinking is enabled but reasoning_content is missing in assistant tool call message." The fix adds a streamed_reasoning accumulator alongside the existing streamed_text one, collects reasoning deltas as they arrive, and populates reasoning_content in the synthetic response using (!streamed_reasoning.is_empty()).then_some(streamed_reasoning) — preserving None for providers that don't emit reasoning events.
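The accumulation described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in agent.rs: the `StreamEvent` enum, the trimmed-down `ChatResponse` struct, and the `assemble_response` function are hypothetical stand-ins, and applying the same guard to `text` is an assumption. Only the `streamed_text` / `streamed_reasoning` names and the `then_some` expression come from the PR discussion.

```rust
/// Hypothetical stand-in for the provider's streaming events.
#[derive(Debug, Clone)]
pub enum StreamEvent {
    TextDelta(String),
    ReasoningDelta(String),
    ToolCall { name: String, arguments: String },
}

/// Hypothetical, trimmed-down version of the synthetic response.
#[derive(Debug, Default, PartialEq)]
pub struct ChatResponse {
    pub text: Option<String>,
    pub reasoning_content: Option<String>,
    /// (tool name, raw JSON arguments) pairs.
    pub tool_calls: Vec<(String, String)>,
}

/// Folds streamed events into one synthetic response.
pub fn assemble_response(events: impl IntoIterator<Item = StreamEvent>) -> ChatResponse {
    let mut streamed_text = String::new();
    let mut streamed_reasoning = String::new();
    let mut tool_calls = Vec::new();

    for event in events {
        match event {
            StreamEvent::TextDelta(delta) => streamed_text.push_str(&delta),
            // Before the fix, reasoning deltas were not carried into the response.
            StreamEvent::ReasoningDelta(delta) => streamed_reasoning.push_str(&delta),
            StreamEvent::ToolCall { name, arguments } => tool_calls.push((name, arguments)),
        }
    }

    ChatResponse {
        // Assumption: text uses the same guard; the PR only confirms it for reasoning.
        text: (!streamed_text.is_empty()).then_some(streamed_text),
        // None when nothing was streamed, Some(..) otherwise.
        reasoning_content: (!streamed_reasoning.is_empty()).then_some(streamed_reasoning),
        tool_calls,
    }
}
```

With this shape, a stream containing only text deltas still yields `reasoning_content: None`, which matches the stated goal of leaving non-reasoning providers unaffected.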

The fix is minimal, precisely targeted, and does not touch non-streaming paths, tool dispatch serialization, or any provider API contract.

The merge conflict flag raised by @theonlyhennygod has been resolved — the author has kept the branch current with three master merges, most recently April 16. The branch is clean and mergeable.


✅ Commendation

Two things done well:

(!streamed_reasoning.is_empty()).then_some(streamed_reasoning) is the right idiom here. It keeps reasoning_content absent — not an empty string — when no reasoning was streamed. An empty string and None are semantically different to a provider that validates this field: None means "not a thinking-enabled turn," Some("") could cause a different class of validation failure. The choice to use then_some rather than Some(streamed_reasoning) unconditionally shows the author thought about the provider contract, not just the Rust type.
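To make the `None` versus `Some("")` distinction concrete, here is a small sketch. The helper names (`reasoning_unconditional`, `reasoning_guarded`, `emitted_keys`) are hypothetical, and the key list only approximates a serializer that skips absent optional fields; it is not the project's dispatcher code.

```rust
/// The rejected alternative: always wraps, so an empty stream becomes Some("").
pub fn reasoning_unconditional(streamed: String) -> Option<String> {
    Some(streamed)
}

/// The idiom used in the PR: an empty stream becomes None.
pub fn reasoning_guarded(streamed: String) -> Option<String> {
    (!streamed.is_empty()).then_some(streamed)
}

/// Hypothetical helper: lists the keys an assistant tool-call message would
/// carry if the serializer omits `reasoning_content` when it is `None`.
pub fn emitted_keys(reasoning_content: &Option<String>) -> Vec<&'static str> {
    let mut keys = vec!["role", "tool_calls"];
    if reasoning_content.is_some() {
        keys.push("reasoning_content");
    }
    keys
}
```

Under that assumption, the unconditional variant would send a `reasoning_content` key holding an empty string for every streamed tool-call turn, including turns for providers that never emit reasoning, while the guarded variant leaves the key out entirely.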

The regression test constructs a StreamingReasoningProvider mock that correctly exercises the full two-turn path: stream_chat call 0 emits reasoning + tool call → tool executes → stream_chat call 1 returns empty → falls back to chat → seen_requests captures the message history. Asserting on the replayed assistant payload (the chat call) rather than on the intermediate streamed state means the test verifies what the provider actually receives, which is exactly what was broken in #5600.
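The general shape of such a test, recording what the provider sees on the second call and asserting on the replayed assistant message, might look like the sketch below. It is deliberately simplified and synchronous: `Message`, `RecordingProvider`, and `run_turn` are hypothetical names, and the real test goes through the async `stream_chat` / `chat` path with `StreamingReasoningProvider`. Only the idea of a `seen_requests` capture is taken from the review.

```rust
use std::cell::RefCell;

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: &'static str,
    pub content: String,
    pub reasoning_content: Option<String>,
    pub tool_calls: Vec<String>,
}

/// Hypothetical mock: records the history passed to every call.
pub struct RecordingProvider {
    pub seen_requests: RefCell<Vec<Vec<Message>>>,
}

impl RecordingProvider {
    pub fn new() -> Self {
        Self { seen_requests: RefCell::new(Vec::new()) }
    }

    /// Call 0 returns reasoning plus one tool call; later calls return a final answer.
    pub fn chat(&self, history: &[Message]) -> Message {
        let mut seen = self.seen_requests.borrow_mut();
        let call_index = seen.len();
        seen.push(history.to_vec());
        if call_index == 0 {
            Message {
                role: "assistant",
                content: String::new(),
                reasoning_content: Some("need the date".to_string()),
                tool_calls: vec!["get_date".to_string()],
            }
        } else {
            Message {
                role: "assistant",
                content: "done".to_string(),
                reasoning_content: None,
                tool_calls: Vec::new(),
            }
        }
    }
}

/// Simplified agent loop: replays the assistant message and tool results.
pub fn run_turn(provider: &RecordingProvider, user_input: &str) -> String {
    let mut history = vec![Message {
        role: "user",
        content: user_input.to_string(),
        reasoning_content: None,
        tool_calls: Vec::new(),
    }];
    loop {
        let reply = provider.chat(&history);
        if reply.tool_calls.is_empty() {
            return reply.content;
        }
        // Pretend to execute each tool, then append the results after the reply.
        let tool_results: Vec<Message> = reply
            .tool_calls
            .iter()
            .map(|name| Message {
                role: "tool",
                content: format!("result of {name}"),
                reasoning_content: None,
                tool_calls: Vec::new(),
            })
            .collect();
        history.push(reply);
        history.extend(tool_results);
    }
}
```

The assertion of interest is then made against `seen_requests[1]`, the history the provider receives on the second call, rather than against any intermediate state inside the loop.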


🔴 Blocking — risk label mismatch

The PR body declares risk: medium. The applied label is risk: high. As with other PRs touching crates/zeroclaw-runtime/src/**, the high label is correct per AGENTS.md — that path is explicitly listed as high-risk. The label snapshot in the PR body needs to reflect risk: high so the audit trail is consistent. This is a one-line update.


🟡 Conditional — multi-chunk reasoning accumulation edge case needs a follow-up

The streamed_reasoning accumulator concatenates all reasoning deltas with push_str. For the single-chunk case in the regression test this is correct. The question worth tracking is what happens when a provider streams reasoning across many small chunks with leading/trailing whitespace or structural delimiters — the concatenation is lossless at the byte level, but some providers structure reasoning output with newlines or markers between "thinking steps" that may be meaningful for replay fidelity.

This isn't a reason to block the PR — the behavior is strictly better than None in every case, and the regression test covers the critical path from #5600. But I'd like to see a follow-up issue filed to track whether reasoning replay fidelity across multi-chunk reasoning streams needs any normalization, particularly as more providers adopt thinking modes. The current PR description notes this in the risks section; making it a tracked issue with an owner closes the loop on the conditional acceptance.
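The byte-level losslessness mentioned above can be illustrated with a short sketch. The `accumulate` function is a hypothetical helper that mirrors the `push_str` pattern; it is not code from the PR and makes no claim about how any particular provider chunks its reasoning output.

```rust
/// Concatenates reasoning chunks exactly as received, with no trimming or separators.
pub fn accumulate(chunks: &[&str]) -> String {
    let mut streamed_reasoning = String::new();
    for chunk in chunks {
        streamed_reasoning.push_str(chunk);
    }
    streamed_reasoning
}
```

Because nothing is inserted or removed, the same reasoning text split at different chunk boundaries accumulates to identical bytes, whitespace and newlines included. Whether a provider expects extra structure between "thinking steps" on replay is the open question the follow-up is meant to track.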


To @tompro

The fix is correct, the test is well-structured, and the PR discipline is strong — the template is one of the most completely filled I've seen. Two items before merge: update the risk label in the PR body from medium to high, and file a follow-up issue for the multi-chunk reasoning accumulation question (brief, just needs to capture the question and assign an owner). The blocking item is a one-line change; the conditional can be done concurrently.

Thank you for keeping the branch current through the merge conflicts. Kimi tool-use being entirely blocked (S1 severity on #5600) makes this worth getting across the line cleanly.

@github-project-automation github-project-automation bot moved this from Backlog to Needs Changes in ZeroClaw Project Board Apr 17, 2026
@tompro
Author

tompro commented Apr 17, 2026

@WareWolf-MoonWall I filed the follow-up as #5840 and changed the risk label accordingly.

Collaborator

@WareWolf-MoonWall WareWolf-MoonWall left a comment

Re-review following rebase. All 20 CI checks green, including the new Quality Gate. Previous blockers (merge conflicts, wasmtime advisory CI failures) are fully resolved.

✅ The fix

Three production lines, perfectly symmetric with the existing streamed_text accumulation pattern. streamed_reasoning follows the same lifecycle: initialized empty, accumulated from deltas, conditionally present in the synthetic response. The (!streamed_reasoning.is_empty()).then_some(streamed_reasoning) idiom is idiomatic and correct — reasoning_content stays None for providers that don't emit reasoning, preserving all existing behaviour exactly.

✅ The test

The StreamingReasoningProvider mock is well-constructed — it emits a reasoning delta followed by a tool call event, which is exactly the sequence that triggered the Kimi 400. The assertion on the serialized assistant payload directly validates what the provider receives on replay. This is the right kind of regression test: it covers the specific failure path, it will catch any future regression immediately, and it documents the contract in executable form.

✅ Follow-up discipline

Filing #5840 for multi-chunk reasoning replay normalization before this merged was the right call — it keeps this PR focused on the confirmed regression and defers the broader question to a tracked, owned issue.

Thank you @tompro. Clean fix, strong test, good follow-up hygiene.

@github-project-automation github-project-automation bot moved this from Needs Changes to Ready to Merge in ZeroClaw Project Board Apr 18, 2026

Labels

  • agent-approved (PR approved by automated review agent)
  • needs-maintainer-review
  • risk: high (Auto risk: security/runtime/gateway/tools/workflows.)
  • size: S (Auto size: 81-250 non-doc changed lines.)

Projects

Status: Ready to Merge

Development

Successfully merging this pull request may close these issues.

[Bug]: Use kimi-code provider in streaming chat call tools, provider API reports an error

4 participants