
fix(providers): drop leading non-user turns before provider call#6303

Open
dmnkhorvath wants to merge 7 commits into zeroclaw-labs:master from dmnkhorvath:fix/history-sanitizer-leading-user-turn

Conversation

Contributor

@dmnkhorvath dmnkhorvath commented May 3, 2026

Summary

Closes #6302.

ZeroClaw constructed conversation histories that strict providers (Google Gemini) reject with HTTP 400 (Please ensure that function call turn comes immediately after a user turn or after a function response turn.). Permissive providers (Anthropic, GLM) silently tolerated the malformed shape. PR #960 had patched this for the Telegram channel runtime path; CLI / gateway / agent-loop paths were untouched.

This PR contains two layered fixes, both required: investigation revealed the bug had two distinct causes.

1. zeroclaw-providers/src/history_sanitizer.rs — provider-edge cleanup

A new provider-agnostic enforce_leading_user_turn pass, called from multimodal::prepare_messages_for_provider. Drops the contiguous sequence of leading non-user, non-system messages immediately before the LLM call. Conservative: if no user turn exists anywhere, leaves messages alone so the upstream provider error surfaces. Emits a tracing::warn! when it acts so the symptom is observable.
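The pass described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Message` struct, `Role` enum, and field names are assumptions standing in for whatever zeroclaw-providers really uses.

```rust
// Hypothetical types standing in for the real zeroclaw-providers message model.
#[derive(Debug, Clone, PartialEq)]
enum Role { System, User, Assistant, Tool }

struct Message { role: Role, content: String }

/// Drop the contiguous run of leading non-user, non-system messages.
/// Conservative: if no user turn exists anywhere, return the history
/// untouched so the provider's native error surfaces.
fn enforce_leading_user_turn(messages: Vec<Message>) -> Vec<Message> {
    if !messages.iter().any(|m| m.role == Role::User) {
        return messages; // no user turn at all: do nothing
    }
    let mut dropping = true;
    messages
        .into_iter()
        .filter(|m| {
            if m.role == Role::System {
                return true; // system messages are always kept
            }
            if dropping && m.role != Role::User {
                return false; // drop a leading non-user turn
            }
            dropping = false; // first user turn reached: keep the rest
            true
        })
        .collect()
}
```

The real pass additionally emits a `tracing::warn!` when it acts; that is omitted here to keep the sketch self-contained.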

2. zeroclaw-runtime/src/agent/history_pruner.rs — root-cause prevention

After commit 1 was deployed, captured Gemini requests showed the bug had a deeper layer: on later iterations of the agent loop, the messages list contained no user turn at all — only assistant.tool_calls and tool responses. Sanitizing that to drop everything would just produce an empty-contents error.

Root cause: protected_indices() only protected system messages and the last keep_recent items. When a multi-round tool-using conversation grew past max_tokens, phase-2 budget enforcement happily dropped the original user message because it had fallen outside the keep_recent window. remove_orphaned_tool_messages then cleaned dangling tool rows but left the assistant.tool_calls row in place (its only requirement was a preceding assistant, not a preceding user). The result was a history beginning with assistant.tool_calls — universally invalid for Gemini.

Two changes:

  • Prevention — protected_indices now also protects the first user turn after the leading system block, so pruning can never strip the canonical conversation prefix.
  • Defense in depth — remove_orphaned_tool_messages gains a third pass that drops a leading assistant/tool block lacking a preceding user. Existing persisted sessions that already contain the malformed shape get auto-healed on next load. Conservative when no user turn exists.
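The prevention side can be sketched like this, using role strings in place of the real message structs (the signature and `keep_recent` handling are assumptions; only the three protection rules mirror the description above):

```rust
use std::collections::HashSet;

/// Illustrative sketch of the pruner's protection set; the real
/// protected_indices in zeroclaw-runtime operates on full messages.
fn protected_indices(roles: &[&str], keep_recent: usize) -> HashSet<usize> {
    let mut protected: HashSet<usize> = HashSet::new();
    // 1. System messages are always protected.
    for (i, r) in roles.iter().enumerate() {
        if *r == "system" {
            protected.insert(i);
        }
    }
    // 2. The last keep_recent messages are protected.
    let start = roles.len().saturating_sub(keep_recent);
    protected.extend(start..roles.len());
    // 3. NEW: the first user turn is protected, so budget pruning can
    //    never strip the canonical conversation prefix.
    if let Some(first_user) = roles.iter().position(|r| *r == "user") {
        protected.insert(first_user);
    }
    protected
}
```

Before the fix, rule 3 did not exist, which is exactly how the original user message fell out of the window once the conversation outgrew `max_tokens`.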

Five existing pruner tests codified the previous (Gemini-invalid) behavior of remove_orphaned_tool_messages, which kept leading orphan assistants. They were updated to inject a leading user message, preserving their original orphan-removal intent without depending on the now-fixed broken shape.

Out of scope

Tool-call ↔ tool-response pairing (orphan tool messages without preceding assistant.tool_calls, empty tool_calls: [] arrays) — tracked in #6298.

Risk

Medium. Touches the hot path of every LLM call but behavior is strictly subtractive — only drops messages that would otherwise produce a 400 on Gemini and are uninterpretable on every other provider. No new config flags. No system messages affected.

Validation

Static:

  • 12 new unit tests across both crates (8 in history_sanitizer, 4 in history_pruner).
  • All 1611 zeroclaw-runtime lib tests pass.
  • All 791 zeroclaw-providers lib tests pass.
  • cargo fmt --all -- --check clean.
  • cargo clippy -p zeroclaw-runtime -p zeroclaw-providers --all-targets -- -D warnings clean.

End-to-end (LiteLLM-routed Gemini, on actual broken-history session that previously 400'd):

$ zeroclaw agent -m "respond with the literal text OK"
... Preemptive context trim: estimated tokens exceed budget estimated=96492 budget=32000 iteration=1
... WARN history_pruner: Removed 5 leading non-user turn(s) from history (#6302) count=5
OK   ← previously: Error: All providers/models failed. ... GeminiException BadRequestError 400

$ zeroclaw agent -m "be creative and do something based on the toolset you have, ask for permission when needed, go!"
I tried to build a custom terminal UI and write a fractal generator to the workspace, but it looks like those actions were blocked.
I've got a whole arsenal here: I can fetch the weather in Svalbard, aggregate the latest world news, check crypto balances, scrape web pages, or generate a custom PDF report.
What kind of trouble should we get into today?

Both prompts succeed; auto-heal warnings fire on the first iteration and the agent loop completes normally afterward.

Test plan

  • CI green
  • Manual: zeroclaw agent -m "..." against LiteLLM-routed Gemini with poisoned session history — 400 cleared, sanitizer + pruner warnings observable in logs.

🤖 Generated with Claude Code

Some providers (notably Google Gemini) reject conversation histories
whose first non-system turn is anything other than `user`. ZeroClaw
can produce such histories when context trimming, session restoration,
or native-tool-call serialization leaves an `assistant` turn (often
carrying `tool_calls`) at the head of the message list. Permissive
providers (Anthropic, GLM) silently accept the malformed shape; strict
providers return HTTP 400 — the symptom reported in zeroclaw-labs#6302.

Add a provider-agnostic `enforce_leading_user_turn` pass and call it
from `prepare_messages_for_provider` so every provider invocation
benefits regardless of multimodal content. The pass drops only the
contiguous sequence of leading non-user, non-system messages and is
conservative: if no user turn exists, messages are returned unchanged
so the provider's native error surfaces normally.

Tool-call/tool-response pairing (zeroclaw-labs#6298) is intentionally out of scope.

Closes zeroclaw-labs#6302

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dmnkhorvath and others added 3 commits May 3, 2026 09:32
…hans

After PR zeroclaw-labs#6303's leading-non-user sanitizer landed, captures of
post-fix Gemini requests revealed a deeper layer of the bug: the agent loop
emits LLM calls whose `messages` list contains *no* `user` turn at all
on later iterations — only assistant tool_calls and tool responses.
Sanitizing that to drop everything would just produce empty contents,
so the previous PR deliberately left it alone.

Root cause: `protected_indices()` in the history pruner only protects
`system` messages and the last `keep_recent` items. When a multi-round
tool-using conversation grows past `max_tokens`, phase-2 budget
enforcement happily drops the original `user` message because it has
fallen outside the keep_recent window. `remove_orphaned_tool_messages`
then cleans up dangling `tool` rows but leaves the `assistant.tool_calls`
row in place, since its only requirement is a preceding assistant —
not a preceding user. The result is a history that begins with
`assistant.tool_calls` and is universally invalid for Gemini.

Two-layer fix:

1. **Prevention** — `protected_indices` now also protects the first
   `user` turn following the leading system block, so pruning can never
   strip the canonical conversation prefix.

2. **Defense in depth** — `remove_orphaned_tool_messages` gains a
   third pass that drops any leading `assistant`/`tool` block lacking
   a preceding `user` turn. Existing persisted sessions that already
   contain the malformed shape get auto-healed on next load. Conservative
   when no `user` turn exists anywhere — leaves messages intact and
   lets the upstream provider error surface.
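A minimal sketch of that third pass, using role strings in place of the real message type (the function name and signature are illustrative, not the actual zeroclaw-runtime API):

```rust
// Illustrative only; the real remove_orphaned_tool_messages operates on
// full message structs and runs this as its third cleanup pass.
fn drop_leading_orphan_block<'a>(roles: Vec<&'a str>) -> Vec<&'a str> {
    // Conservative: no user turn anywhere -> leave messages intact so
    // the upstream provider error surfaces.
    let Some(first_user) = roles.iter().position(|r| *r == "user") else {
        return roles;
    };
    // Keep system messages; drop any other turn before the first user.
    roles
        .into_iter()
        .enumerate()
        .filter(|(i, r)| *r == "system" || *i >= first_user)
        .map(|(_, r)| r)
        .collect()
}
```

Because this runs on load, a persisted session that already begins with `assistant.tool_calls` is healed the next time it is read, without any migration step.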

Five existing pruner tests codified the previous (Gemini-invalid)
behavior of `remove_orphaned_tool_messages` keeping leading orphan
assistants. They were updated to inject a leading user message, which
preserves their original orphan-removal intent without depending on
the now-fixed broken shape.

Tests:
- 4 new regressions in `history_pruner::tests` covering the prevention
  invariant (`first_user_turn_is_protected_from_budget_pruning`) and
  the cleanup invariant (`remove_orphaned_drops_leading_assistant_tool_call_block`,
  `remove_orphaned_keeps_messages_when_no_user_exists`,
  `remove_orphaned_noop_when_user_already_first`).
- All 1611 `zeroclaw-runtime` lib tests pass.
- All 791 `zeroclaw-providers` lib tests pass.
- `cargo fmt --all -- --check` and `cargo clippy -p zeroclaw-runtime` clean.

Refs zeroclaw-labs#6302

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… STT provider

Adds two new media providers wired into the existing tool/transcription
plumbing:

- gemini_image_gen tool: generates / edits images through the LiteLLM
  /chat/completions endpoint against Gemini 2.5/3 Pro Image (Nano Banana
  / Nano Banana Pro). Reuses LiteLLM creds from env or
  [providers.models.litellm] (decrypts enc2:). Saves PNG to
  workspace/images and returns a [IMAGE:...] marker for channel
  delivery.
- ElevenLabs Scribe STT provider: implements TranscriptionProvider,
  registers in TranscriptionManager, and is selectable via
  default_provider = "elevenlabs". Reads key from
  [transcription.elevenlabs].api_key or ELEVENLABS_API_KEY env.

Config: new [gemini_image_gen] and [transcription.elevenlabs] sections
plus updated default_provider whitelist to accept "elevenlabs".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@singlerider singlerider added bug Something isn't working risk: high Auto risk: security/runtime/gateway/tools/workflows. provider Auto scope: src/providers/** changed. runtime Auto scope: src/runtime/** changed. agent Auto scope: src/agent/** changed. labels May 4, 2026
dmnkhorvath and others added 3 commits May 5, 2026 08:33
Scan <workspace>/sops/*/SOP.toml and list SOP names + descriptions in
the system prompt so the model can dispatch directly via sop_execute
without first calling sop_list. Updates sop_execute description to
point at the new section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Voice memos are intentionally directed at the assistant; when STT
yields only background noise the classifier was returning
NO_REPLY[INFO] and silently 👍-ing the message. Carve voice messages
(prefixed [Voice]) out of the no-reply heuristic so the user gets an
acknowledgement and a chance to clarify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot removed agent Auto scope: src/agent/** changed. provider Auto scope: src/providers/** changed. runtime Auto scope: src/runtime/** changed. labels May 5, 2026


Development

Successfully merging this pull request may close these issues.

[Bug]: Gemini 400 — assistant tool_call emitted as first non-system turn (history serializer invariant violation)
