
fix(providers): drop leading non-user turns before provider call#6303

Open
dmnkhorvath wants to merge 7 commits into zeroclaw-labs:master from dmnkhorvath:fix/history-sanitizer-leading-user-turn

Conversation

Contributor

@dmnkhorvath dmnkhorvath commented May 3, 2026

Summary

Closes #6302.

ZeroClaw constructed conversation histories that strict providers (Google Gemini) reject with HTTP 400 (Please ensure that function call turn comes immediately after a user turn or after a function response turn.). Permissive providers (Anthropic, GLM) silently tolerated the malformed shape. PR #960 had patched this for the Telegram channel runtime path; CLI / gateway / agent-loop paths were untouched.

This PR contains two layered fixes, both required: investigation revealed the bug had two distinct causes.

1. zeroclaw-providers/src/history_sanitizer.rs — provider-edge cleanup

A new provider-agnostic enforce_leading_user_turn pass, called from multimodal::prepare_messages_for_provider. Drops the contiguous sequence of leading non-user, non-system messages immediately before the LLM call. Conservative: if no user turn exists anywhere, leaves messages alone so the upstream provider error surfaces. Emits a tracing::warn! when it acts so the symptom is observable.
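The pass described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Message` struct, `Role` enum, and field names are assumptions standing in for whatever zeroclaw-providers really uses.

```rust
// Hypothetical types standing in for the real zeroclaw-providers message model.
#[derive(Debug, Clone, PartialEq)]
enum Role { System, User, Assistant, Tool }

struct Message { role: Role, content: String }

/// Drop the contiguous run of leading non-user, non-system messages.
/// Conservative: if no user turn exists anywhere, return the history
/// untouched so the provider's native error surfaces.
fn enforce_leading_user_turn(messages: Vec<Message>) -> Vec<Message> {
    if !messages.iter().any(|m| m.role == Role::User) {
        return messages; // no user turn at all: do nothing
    }
    let mut dropping = true;
    messages
        .into_iter()
        .filter(|m| {
            if m.role == Role::System {
                return true; // system messages are always kept
            }
            if dropping && m.role != Role::User {
                return false; // drop a leading non-user turn
            }
            dropping = false; // first user turn reached: keep the rest
            true
        })
        .collect()
}
```

The real pass additionally emits a `tracing::warn!` when it acts; that is omitted here to keep the sketch self-contained.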

2. zeroclaw-runtime/src/agent/history_pruner.rs — root-cause prevention

After commit 1 was deployed, captured Gemini requests showed the bug had a deeper layer: on later iterations of the agent loop, the messages list contained no user turn at all — only assistant.tool_calls and tool responses. Sanitizing that to drop everything would just produce an empty-contents error.

Root cause: protected_indices() only protected system messages and the last keep_recent items. When a multi-round tool-using conversation grew past max_tokens, phase-2 budget enforcement happily dropped the original user message because it had fallen outside the keep_recent window. remove_orphaned_tool_messages then cleaned dangling tool rows but left the assistant.tool_calls row in place (its only requirement was a preceding assistant, not a preceding user). The result was a history beginning with assistant.tool_calls — universally invalid for Gemini.

Two changes:

  • Prevention — protected_indices now also protects the first user turn after the leading system block, so pruning can never strip the canonical conversation prefix.
  • Defense in depth — remove_orphaned_tool_messages gains a third pass that drops a leading assistant/tool block lacking a preceding user. Existing persisted sessions that already contain the malformed shape get auto-healed on next load. Conservative when no user turn exists.
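The prevention side can be sketched like this, using role strings in place of the real message structs (the signature and `keep_recent` handling are assumptions; only the three protection rules mirror the description above):

```rust
use std::collections::HashSet;

/// Illustrative sketch of the pruner's protection set; the real
/// protected_indices in zeroclaw-runtime operates on full messages.
fn protected_indices(roles: &[&str], keep_recent: usize) -> HashSet<usize> {
    let mut protected: HashSet<usize> = HashSet::new();
    // 1. System messages are always protected.
    for (i, r) in roles.iter().enumerate() {
        if *r == "system" {
            protected.insert(i);
        }
    }
    // 2. The last keep_recent messages are protected.
    let start = roles.len().saturating_sub(keep_recent);
    protected.extend(start..roles.len());
    // 3. NEW: the first user turn is protected, so budget pruning can
    //    never strip the canonical conversation prefix.
    if let Some(first_user) = roles.iter().position(|r| *r == "user") {
        protected.insert(first_user);
    }
    protected
}
```

Before the fix, rule 3 did not exist, which is exactly how the original user message fell out of the window once the conversation outgrew `max_tokens`.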

Five existing pruner tests codified the previous (Gemini-invalid) behavior of remove_orphaned_tool_messages, which kept leading orphan assistants. They were updated to inject a leading user message, preserving their original orphan-removal intent without depending on the now-fixed broken shape.

Out of scope

Tool-call ↔ tool-response pairing (orphan tool messages without preceding assistant.tool_calls, empty tool_calls: [] arrays) — tracked in #6298.

Risk

Medium. Touches the hot path of every LLM call but behavior is strictly subtractive — only drops messages that would otherwise produce a 400 on Gemini and are uninterpretable on every other provider. No new config flags. No system messages affected.

Validation

Static:

  • 12 new unit tests across both crates (8 in history_sanitizer, 4 in history_pruner).
  • All 1611 zeroclaw-runtime lib tests pass.
  • All 791 zeroclaw-providers lib tests pass.
  • cargo fmt --all -- --check clean.
  • cargo clippy -p zeroclaw-runtime -p zeroclaw-providers --all-targets -- -D warnings clean.

End-to-end (LiteLLM-routed Gemini, on actual broken-history session that previously 400'd):

$ zeroclaw agent -m "respond with the literal text OK"
... Preemptive context trim: estimated tokens exceed budget estimated=96492 budget=32000 iteration=1
... WARN history_pruner: Removed 5 leading non-user turn(s) from history (#6302) count=5
OK   ← previously: Error: All providers/models failed. ... GeminiException BadRequestError 400

$ zeroclaw agent -m "be creative and do something based on the toolset you have, ask for permission when needed, go!"
I tried to build a custom terminal UI and write a fractal generator to the workspace, but it looks like those actions were blocked.
I've got a whole arsenal here: I can fetch the weather in Svalbard, aggregate the latest world news, check crypto balances, scrape web pages, or generate a custom PDF report.
What kind of trouble should we get into today?

Both prompts succeed; auto-heal warnings fire on the first iteration and the agent loop completes normally afterward.

Test plan

  • CI green
  • Manual: zeroclaw agent -m "..." against LiteLLM-routed Gemini with poisoned session history — 400 cleared, sanitizer + pruner warnings observable in logs.

🤖 Generated with Claude Code

Some providers (notably Google Gemini) reject conversation histories
whose first non-system turn is anything other than `user`. ZeroClaw
can produce such histories when context trimming, session restoration,
or native-tool-call serialization leaves an `assistant` turn (often
carrying `tool_calls`) at the head of the message list. Permissive
providers (Anthropic, GLM) silently accept the malformed shape; strict
providers return HTTP 400 — the symptom reported in zeroclaw-labs#6302.

Add a provider-agnostic `enforce_leading_user_turn` pass and call it
from `prepare_messages_for_provider` so every provider invocation
benefits regardless of multimodal content. The pass drops only the
contiguous sequence of leading non-user, non-system messages and is
conservative: if no user turn exists, messages are returned unchanged
so the provider's native error surfaces normally.

Tool-call/tool-response pairing (zeroclaw-labs#6298) is intentionally out of scope.

Closes zeroclaw-labs#6302

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dmnkhorvath and others added 3 commits May 3, 2026 09:32
…hans

After PR zeroclaw-labs#6303's leading-non-user sanitizer landed, captures of
post-fix Gemini requests revealed a deeper layer of the bug: the agent loop
emits LLM calls whose `messages` list contains *no* `user` turn at all
on later iterations — only assistant tool_calls and tool responses.
Sanitizing that to drop everything would just produce empty contents,
so the previous PR deliberately left it alone.

Root cause: `protected_indices()` in the history pruner only protects
`system` messages and the last `keep_recent` items. When a multi-round
tool-using conversation grows past `max_tokens`, phase-2 budget
enforcement happily drops the original `user` message because it has
fallen outside the keep_recent window. `remove_orphaned_tool_messages`
then cleans up dangling `tool` rows but leaves the `assistant.tool_calls`
row in place, since its only requirement is a preceding assistant —
not a preceding user. The result is a history that begins with
`assistant.tool_calls` and is universally invalid for Gemini.

Two-layer fix:

1. **Prevention** — `protected_indices` now also protects the first
   `user` turn following the leading system block, so pruning can never
   strip the canonical conversation prefix.

2. **Defense in depth** — `remove_orphaned_tool_messages` gains a
   third pass that drops any leading `assistant`/`tool` block lacking
   a preceding `user` turn. Existing persisted sessions that already
   contain the malformed shape get auto-healed on next load. Conservative
   when no `user` turn exists anywhere — leaves messages intact and
   lets the upstream provider error surface.
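A minimal sketch of that third pass, using role strings in place of the real message type (the function name and signature are illustrative, not the actual zeroclaw-runtime API):

```rust
// Illustrative only; the real remove_orphaned_tool_messages operates on
// full message structs and runs this as its third cleanup pass.
fn drop_leading_orphan_block<'a>(roles: Vec<&'a str>) -> Vec<&'a str> {
    // Conservative: no user turn anywhere -> leave messages intact so
    // the upstream provider error surfaces.
    let Some(first_user) = roles.iter().position(|r| *r == "user") else {
        return roles;
    };
    // Keep system messages; drop any other turn before the first user.
    roles
        .into_iter()
        .enumerate()
        .filter(|(i, r)| *r == "system" || *i >= first_user)
        .map(|(_, r)| r)
        .collect()
}
```

Because this runs on load, a persisted session that already begins with `assistant.tool_calls` is healed the next time it is read, without any migration step.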

Five existing pruner tests codified the previous (Gemini-invalid)
behavior of `remove_orphaned_tool_messages` keeping leading orphan
assistants. They were updated to inject a leading user message, which
preserves their original orphan-removal intent without depending on
the now-fixed broken shape.

Tests:
- 4 new regressions in `history_pruner::tests` covering the prevention
  invariant (`first_user_turn_is_protected_from_budget_pruning`) and
  the cleanup invariant (`remove_orphaned_drops_leading_assistant_tool_call_block`,
  `remove_orphaned_keeps_messages_when_no_user_exists`,
  `remove_orphaned_noop_when_user_already_first`).
- All 1611 `zeroclaw-runtime` lib tests pass.
- All 791 `zeroclaw-providers` lib tests pass.
- `cargo fmt --all -- --check` and `cargo clippy -p zeroclaw-runtime` clean.

Refs zeroclaw-labs#6302

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… STT provider

Adds two new media providers wired into the existing tool/transcription
plumbing:

- gemini_image_gen tool: generates / edits images through the LiteLLM
  /chat/completions endpoint against Gemini 2.5/3 Pro Image (Nano Banana
  / Nano Banana Pro). Reuses LiteLLM creds from env or
  [providers.models.litellm] (decrypts enc2:). Saves PNG to
  workspace/images and returns a [IMAGE:...] marker for channel
  delivery.
- ElevenLabs Scribe STT provider: implements TranscriptionProvider,
  registers in TranscriptionManager, and is selectable via
  default_provider = "elevenlabs". Reads key from
  [transcription.elevenlabs].api_key or ELEVENLABS_API_KEY env.

Config: new [gemini_image_gen] and [transcription.elevenlabs] sections
plus updated default_provider whitelist to accept "elevenlabs".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@singlerider singlerider added bug Something isn't working risk: high Auto risk: security/runtime/gateway/tools/workflows. provider Auto scope: src/providers/** changed. runtime Auto scope: src/runtime/** changed. agent Auto scope: src/agent/** changed. labels May 4, 2026
dmnkhorvath and others added 3 commits May 5, 2026 08:33
Scan <workspace>/sops/*/SOP.toml and list SOP names + descriptions in
the system prompt so the model can dispatch directly via sop_execute
without first calling sop_list. Updates sop_execute description to
point at the new section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Voice memos are intentionally directed at the assistant; when STT
yields only background noise the classifier was returning
NO_REPLY[INFO] and silently 👍-ing the message. Carve voice messages
(prefixed [Voice]) out of the no-reply heuristic so the user gets an
acknowledgement and a chance to clarify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot removed agent Auto scope: src/agent/** changed. provider Auto scope: src/providers/** changed. runtime Auto scope: src/runtime/** changed. labels May 5, 2026


Development

Successfully merging this pull request may close these issues.

[Bug]: Gemini 400 — assistant tool_call emitted as first non-system turn (history serializer invariant violation)
