Feature Request
Summary
Add support for multimodal content types (PDF, video, audio) in ChatOCIGenAI to leverage Google Gemini models' full capabilities on OCI.
Motivation
Google Gemini models on OCI support multimodal inputs including:
- PDF documents - can be processed without page breaks
- Video files - can analyze video content directly
- Audio files - can transcribe and analyze audio
The OCI Python SDK already includes the necessary models (DocumentContent, VideoContent, AudioContent), but langchain-oci currently only supports text and images.
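The conversion this request implies can be sketched without the SDK installed. The snippet below is only an illustration, not the langchain-oci implementation: it parses a base64 data URL from a content block and picks the name of the SDK class that would carry it. `DocumentContent`, `VideoContent`, and `AudioContent` are the names given above; `ImageContent`, the `audio_url` block key, and the helper `route_block` are assumptions made for this sketch.

```python
import base64
import re

# Block type -> OCI SDK model class name.
# DocumentContent, VideoContent and AudioContent are named in this issue;
# ImageContent and the "audio_url" key are assumptions for illustration.
SDK_CLASS_FOR_BLOCK = {
    "image_url": "ImageContent",
    "document_url": "DocumentContent",
    "video_url": "VideoContent",
    "audio_url": "AudioContent",
}

# Matches "data:<mime>;base64,<payload>".
_DATA_URL = re.compile(
    r"^data:(?P<mime>[\w.+-]+/[\w.+-]+);base64,(?P<payload>.+)$", re.DOTALL
)


def route_block(block: dict) -> tuple[str, str, bytes]:
    """Return (SDK class name, MIME type, decoded bytes) for one content block.

    Hypothetical helper: it only shows the dispatch, it does not build SDK objects.
    """
    block_type = block["type"]
    if block_type not in SDK_CLASS_FOR_BLOCK:
        raise ValueError(f"unsupported content block type: {block_type!r}")
    url = block[block_type]["url"]
    match = _DATA_URL.match(url)
    if match is None:
        raise ValueError("expected a base64 data URL")
    payload = base64.b64decode(match.group("payload"), validate=True)
    return SDK_CLASS_FOR_BLOCK[block_type], match.group("mime"), payload
```

Keeping the dispatch in one table means a new content type would need one new entry rather than another branch.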
Model Support Matrix
| Model | Images | Video | Audio | PDF |
|---|---|---|---|---|
| Google Gemini | ✓ | ✓ | ✓ | ✓ |
| Meta Llama Vision | ✓ | ✗ | ✗ | ✗ |
| Cohere Vision | ✓ | ✗ | ✗ | ✗ |
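The matrix could also be encoded as a pre-flight check, so that an unsupported block fails locally instead of at the service. This is a rough sketch under stated assumptions: only the `google.` prefix appears in this issue (via `google.gemini-2.0-flash`), while the `meta.` and `cohere.` prefixes, the `audio_url` key, and the function `unsupported_blocks` are hypothetical.

```python
# Content block types each model family accepts, following the matrix above.
# The "meta." and "cohere." prefixes and the "audio_url" key are assumptions.
SUPPORTED_BLOCKS = {
    "google.": {"text", "image_url", "document_url", "video_url", "audio_url"},
    "meta.": {"text", "image_url"},
    "cohere.": {"text", "image_url"},
}


def unsupported_blocks(model_id: str, content: list[dict]) -> list[str]:
    """Return the block types in `content` that the model family does not accept."""
    for prefix, allowed in SUPPORTED_BLOCKS.items():
        if model_id.startswith(prefix):
            return [b["type"] for b in content if b["type"] not in allowed]
    raise ValueError(f"unknown model family for model_id {model_id!r}")
```

An empty list means every block in the message is covered by the matrix for that model family.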
Tested with Real Files
| Content Type | File Size | Result |
|---|---|---|
| Architecture diagram (PNG) | 478KB | ✓ Full component analysis |
| Screen recording (MOV) | 4.8MB | ✓ Identified app, user actions |
| Photo with text (JPG) | 1.1MB | ✓ OCR + scene description |
| PDF document | 485KB | ✓ Text extraction |
| Audio (WAV) | 10KB | ✓ Sound identification |
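Files like these all reach the model the same way: base64-encoded and wrapped in a data URL inside a typed content block. A small helper could pick the block type from the file's MIME type. This is a minimal sketch, not part of langchain-oci: `document_url` and `video_url` come from the usage example in this issue, `image_url` is the existing image format, and `audio_url` plus the helper names `block_for` and `file_to_block` are assumptions.

```python
import base64
import mimetypes
from pathlib import Path


def block_for(filename: str, data: bytes) -> dict:
    """Build a data-URL content block, choosing the block type from the MIME type.

    Hypothetical helper; the "audio_url" key is assumed by analogy with
    "document_url" and "video_url".
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot determine MIME type for {filename!r}")
    if mime == "application/pdf":
        key = "document_url"
    elif mime.startswith("image/"):
        key = "image_url"
    elif mime.startswith("video/"):
        key = "video_url"
    elif mime.startswith("audio/"):
        key = "audio_url"
    else:
        raise ValueError(f"unsupported MIME type: {mime}")
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": key, key: {"url": f"data:{mime};base64,{encoded}"}}


def file_to_block(path: str) -> dict:
    """Read a file from disk and wrap it with block_for."""
    return block_for(path, Path(path).read_bytes())
```

Separating `block_for` from the file read keeps the mapping testable with in-memory bytes.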
Usage Example
```python
from langchain_oci import ChatOCIGenAI
from langchain_core.messages import HumanMessage
import base64

llm = ChatOCIGenAI(
    model_id="google.gemini-2.0-flash",
    compartment_id="your-compartment-id",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
)

# PDF Analysis
with open("document.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(content=[
    {"type": "text", "text": "Summarize this PDF"},
    {"type": "document_url", "document_url": {"url": f"data:application/pdf;base64,{pdf_b64}"}},
])
result = llm.invoke([message])

# Video Analysis
with open("video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(content=[
    {"type": "text", "text": "What happens in this video?"},
    {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
])
result = llm.invoke([message])
```
Related PR