Add multimodal content support for Gemini models (PDF, video, audio) #143

@fede-kamel

Feature Request

Summary

Add support for multimodal content types (PDF, video, audio) in ChatOCIGenAI to leverage Google Gemini models' full capabilities on OCI.

Motivation

Google Gemini models on OCI support multimodal inputs including:

  • PDF documents - can be processed without page breaks
  • Video files - can analyze video content directly
  • Audio files - can transcribe and analyze audio

The OCI Python SDK already includes the necessary models (DocumentContent, VideoContent, AudioContent), but langchain-oci currently only supports text and images.
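One way to picture the gap is as a dispatch table from message content block types to the SDK model names listed above. The sketch below is illustrative only: the `BLOCK_TYPE_TO_SDK_MODEL` table and the `sdk_model_name` helper are hypothetical and not part of langchain-oci, the `audio_url` key is an assumed name chosen by analogy with the other block types, and `TextContent` / `ImageContent` are assumed to be the SDK models behind the text and image support that already exists. Model names are kept as plain strings so the snippet runs without the OCI SDK installed.

```python
# Hypothetical mapping from content block "type" values to the names of the
# OCI SDK content models that could carry them. Names are plain strings so
# this sketch runs without the OCI SDK installed.
BLOCK_TYPE_TO_SDK_MODEL = {
    "text": "TextContent",            # assumed: already supported today
    "image_url": "ImageContent",      # assumed: already supported today
    "document_url": "DocumentContent",
    "video_url": "VideoContent",
    "audio_url": "AudioContent",      # assumed block type name
}


def sdk_model_name(block: dict) -> str:
    """Return the SDK model name for a content block, or raise ValueError."""
    block_type = block.get("type")
    try:
        return BLOCK_TYPE_TO_SDK_MODEL[block_type]
    except KeyError:
        raise ValueError(f"Unsupported content block type: {block_type!r}") from None
```

A real converter would also need to build the SDK object from the block's data URL; that step is left out here because the constructor arguments are not described in this issue.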

Model Support Matrix

Model Images PDF Video Audio
Google Gemini
Meta Llama Vision
Cohere Vision

Tested with Real Files

Content Type                  File Size   Result
Architecture diagram (PNG)    478 KB      ✓ Full component analysis
Screen recording (MOV)        4.8 MB      ✓ Identified app, user actions
Photo with text (JPG)         1.1 MB      ✓ OCR + scene description
PDF document                  485 KB      ✓ Text extraction
Audio (WAV)                   10 KB       ✓ Sound identification

Usage Example

from langchain_oci import ChatOCIGenAI
from langchain_core.messages import HumanMessage
import base64

llm = ChatOCIGenAI(
    model_id="google.gemini-2.0-flash",
    compartment_id="your-compartment-id",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
)

# PDF Analysis
with open("document.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(content=[
    {"type": "text", "text": "Summarize this PDF"},
    {"type": "document_url", "document_url": {"url": f"data:application/pdf;base64,{pdf_b64}"}},
])
result = llm.invoke([message])

# Video Analysis
with open("video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(content=[
    {"type": "text", "text": "What happens in this video?"},
    {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
])
result = llm.invoke([message])
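The PDF and video examples above repeat the same read, encode, and wrap steps, so a small helper could cut the duplication. This is a minimal sketch, not part of langchain-oci: `media_block` is a hypothetical name, and it assumes the MIME type can be guessed from the file extension with the standard-library `mimetypes` module.

```python
import base64
import mimetypes
from pathlib import Path


def media_block(path: str, block_type: str) -> dict:
    """Build a content block such as {"type": "video_url", "video_url": {"url": ...}}.

    `block_type` is the block's type key, e.g. "document_url" or "video_url".
    The file is embedded as a base64 data URL.
    """
    # Guess the MIME type from the file extension (e.g. .pdf -> application/pdf).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {path!r}")

    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": block_type,
        block_type: {"url": f"data:{mime_type};base64,{encoded}"},
    }
```

With this helper, the video message could be written as `HumanMessage(content=[{"type": "text", "text": "What happens in this video?"}, media_block("video.mp4", "video_url")])`. Audio would presumably follow the same pattern with a block type such as `"audio_url"`, but that key name is an assumption; the examples in this issue only show `document_url` and `video_url`.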

Related PR

#142
