Summary
The Gemini provider's Part struct is text-only, preventing image inputs from being sent via the native inlineData API — even though Gemini models natively support multimodal image inputs.
Problem statement
Other providers (Anthropic, OpenAI-compatible) already parse [IMAGE:...] markers and convert them to their native image formats. The Gemini provider ignores these markers and sends them as raw text, breaking multimodal workflows for Gemini users.
Proposed solution
Replace the flat Part struct with a serde untagged enum supporting Part::Text and Part::InlineData variants. Add a build_user_parts() helper that reuses the existing multimodal::parse_image_markers() to extract image markers and produce Gemini-native inlineData entries; multiple images per message are supported.
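A minimal sketch of the proposed shape. The inline marker scan below is a simplified stand-in for the existing multimodal::parse_image_markers() (whose real signature lives in multimodal.rs and is not changed here), and the serde untagged derive is only described in comments to keep the sketch dependency-free; the actual change would add #[derive(Serialize)] with #[serde(untagged)] so each variant serializes to its Gemini-native JSON key.

```rust
// Proposed enum replacing the flat text-only Part struct. In the real code
// this would derive serde::Serialize with #[serde(untagged)], so Part::Text
// serializes as {"text": ...} and Part::InlineData as an inlineData object.
#[derive(Debug, PartialEq)]
enum Part {
    Text { text: String },
    InlineData { mime_type: String, data: String },
}

// Sketch of build_user_parts(): splits a user message on
// [IMAGE:data:<mime>;base64,<data>] markers, emitting text parts for the
// surrounding prose and InlineData parts for each image. Malformed or
// unterminated markers fall back to plain text (the fallback path in the
// acceptance criteria).
fn build_user_parts(content: &str) -> Vec<Part> {
    let mut parts = Vec::new();
    let mut rest = content;
    while let Some(start) = rest.find("[IMAGE:data:") {
        if start > 0 {
            parts.push(Part::Text { text: rest[..start].to_string() });
        }
        let marker = &rest[start + "[IMAGE:".len()..];
        match marker.find(']') {
            Some(end) => {
                // url looks like: data:<mime>;base64,<data>
                let url = &marker[..end];
                if let Some((mime, data)) =
                    url.trim_start_matches("data:").split_once(";base64,")
                {
                    parts.push(Part::InlineData {
                        mime_type: mime.to_string(),
                        data: data.to_string(),
                    });
                } else {
                    // Malformed marker: pass it through unchanged as text.
                    parts.push(Part::Text { text: format!("[IMAGE:{}]", url) });
                }
                rest = &marker[end + 1..];
            }
            None => {
                // Unterminated marker: keep the remainder as raw text.
                parts.push(Part::Text { text: rest[start..].to_string() });
                rest = "";
            }
        }
    }
    if !rest.is_empty() {
        parts.push(Part::Text { text: rest.to_string() });
    }
    parts
}
```

Text-only input yields a single Part::Text, which preserves the current serialization behavior; multiple markers in one message each yield their own Part::InlineData.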
Non-goals / out of scope
- Files API support for large images (>20MB) — can be a follow-up
- Video/audio multimodal inputs
- Changes to multimodal.rs or provider traits
Alternatives considered
- Adding image support via the Files API — more complex, requires upload step, not needed for inline use cases under 20MB
Acceptance criteria
- User messages with [IMAGE:data:...] markers produce correct inlineData parts in the Gemini API request
- Multiple images per message are supported
- Text-only messages serialize identically to current behavior (backward compatible)
- Tests cover text-only, single image, multiple images, image-only, and fallback paths
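For reference, a message containing one image marker would serialize to a generateContent request body of roughly the following shape (illustrative values; field names inlineData, mimeType, and data follow Gemini's REST API):

```json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        { "text": "Describe this image:" },
        { "inlineData": { "mimeType": "image/png", "data": "<base64>" } }
      ]
    }
  ]
}
```

A text-only message would contain only the { "text": ... } part, matching current output byte-for-byte.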
Architecture impact
None. Change is scoped to src/providers/gemini.rs. No trait changes, no new dependencies, no config changes.
Risk and rollback
Low risk. Single-file, additive change. git revert <commit> for rollback.
Breaking change?
No
Data hygiene checks
- No personal or sensitive data in tests — synthetic base64 stubs used
- Neutral project-scoped wording confirmed