Add multimodal content support for Gemini models (PDF, video, audio) #143

## Feature Request

### Summary
Add support for multimodal content types (PDF, video, audio) in `ChatOCIGenAI` to leverage Google Gemini models' full capabilities on OCI.

### Motivation
Google Gemini models on OCI support multimodal inputs including:
- **PDF documents** - processed as whole documents, without splitting at page breaks
- **Video files** - video content is analyzed directly
- **Audio files** - audio is transcribed and analyzed

The OCI Python SDK already includes the necessary content models (`DocumentContent`, `VideoContent`, `AudioContent`), but langchain-oci currently supports only text and images.

### Model Support Matrix

| Model | Images | PDF | Video | Audio |
|-------|--------|-----|-------|-------|
| **Google Gemini** | ✓ | ✓ | ✓ | ✓ |
| **Meta Llama Vision** | ✓ | ✗ | ✗ | ✗ |
| **Cohere Vision** | ✓ | ✗ | ✗ | ✗ |
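
The matrix above could be used to reject unsupported inputs before a request is sent. The following is a minimal sketch, not part of langchain-oci: the `SUPPORTED_MODALITIES` table and the `supports` helper are hypothetical names, and it assumes that the model ID starts with a provider prefix (as in `google.gemini-2.0-flash`) and that the Meta and Cohere IDs refer to vision-capable models.

```python
# Capabilities copied from the support matrix above.
# Keys are provider prefixes of the model ID (an assumption of this sketch).
SUPPORTED_MODALITIES = {
    "google": {"image", "pdf", "video", "audio"},
    "meta": {"image"},    # assumes a Llama Vision model
    "cohere": {"image"},  # assumes a Cohere Vision model
}


def supports(model_id: str, modality: str) -> bool:
    """Return True if the provider of `model_id` accepts `modality`."""
    provider = model_id.split(".", 1)[0].lower()
    # Unknown providers are treated as supporting nothing.
    return modality in SUPPORTED_MODALITIES.get(provider, set())
```

For example, `supports("google.gemini-2.0-flash", "video")` is true, while any model ID starting with `meta.` is rejected for `"pdf"`.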

### Tested with Real Files

| Content Type | File Size | Result |
|--------------|-----------|--------|
| Architecture diagram (PNG) | 478KB | ✓ Full component analysis |
| Screen recording (MOV) | 4.8MB | ✓ Identified app, user actions |
| Photo with text (JPG) | 1.1MB | ✓ OCR + scene description |
| PDF document | 485KB | ✓ Text extraction |
| Audio (WAV) | 10KB | ✓ Sound identification |

### Usage Example

```python
from langchain_oci import ChatOCIGenAI
from langchain_core.messages import HumanMessage
import base64

llm = ChatOCIGenAI(
    model_id="google.gemini-2.0-flash",
    compartment_id="your-compartment-id",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
)

# PDF analysis: read the file and embed it as a base64 data URL
with open("document.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(content=[
    {"type": "text", "text": "Summarize this PDF"},
    {"type": "document_url", "document_url": {"url": f"data:application/pdf;base64,{pdf_b64}"}},
])
result = llm.invoke([message])
print(result.content)

# Video analysis: same pattern, with a video MIME type and the video_url block
with open("video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(content=[
    {"type": "text", "text": "What happens in this video?"},
    {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
])
result = llm.invoke([message])
print(result.content)
```
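
The PDF and video snippets above repeat the same read, encode, and wrap steps, so a small helper could build the content block from a file path. This is only a sketch: `to_content_block` and `_BLOCK_BY_EXTENSION` are hypothetical names, the `document_url` and `video_url` block types are taken from the example above, and the `audio_url` type is an assumption by analogy that the final implementation may name differently.

```python
import base64
from pathlib import Path

# Extension -> (block type, MIME type).
# "document_url" and "video_url" follow the usage example above;
# "audio_url" is an assumed name, chosen by analogy.
_BLOCK_BY_EXTENSION = {
    ".pdf": ("document_url", "application/pdf"),
    ".mp4": ("video_url", "video/mp4"),
    ".mov": ("video_url", "video/quicktime"),
    ".wav": ("audio_url", "audio/wav"),
}


def to_content_block(path):
    """Read a file and wrap it as a base64 data-URL content block."""
    path = Path(path)
    try:
        block_type, mime = _BLOCK_BY_EXTENSION[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path.suffix!r}") from None
    data = base64.b64encode(path.read_bytes()).decode("utf-8")
    return {"type": block_type, block_type: {"url": f"data:{mime};base64,{data}"}}
```

With such a helper, the message content would reduce to `[{"type": "text", "text": "Summarize this PDF"}, to_content_block("document.pdf")]`.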

### Related PR
#142
