
Conversation

@zucchini-nlp
Member

What does this PR do?

Passing images as a flat list does not give the same logits in mllama as passing them in the nested batch format. To get the model working correctly, one should pass as many images per batch as there are image tokens.

See the reproducer below:

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["<|image|><|begin_of_text|>What do you see here?", "<|image|><|begin_of_text|>What do you see here but longer?"]

repo_id = "mv11/11"
processor = AutoProcessor.from_pretrained(repo_id)
model = MllamaForConditionalGeneration.from_pretrained(repo_id, device_map='auto')

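# Flat list: two images passed as [image, image], one per prompt.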
batch = processor(text=texts, images=[image, image], return_tensors="pt", padding=True) # .to(model.device)
with torch.no_grad():
    model_output = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        pixel_values=batch["pixel_values"],
        aspect_ratio_ids=batch["aspect_ratio_ids"],
        aspect_ratio_mask=batch["aspect_ratio_mask"],
        cross_attention_mask=batch["cross_attention_mask"],
    )


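# Nested list: one sublist of images per prompt, [[image], [image]].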
batch = processor(text=texts, images=[[image], [image]], return_tensors="pt", padding=True) # .to(model.device)
with torch.no_grad():
    model_output_2 = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        pixel_values=batch["pixel_values"],
        aspect_ratio_ids=batch["aspect_ratio_ids"],
        aspect_ratio_mask=batch["aspect_ratio_mask"],
        cross_attention_mask=batch["cross_attention_mask"],
    )

print(torch.allclose(model_output_2.logits, model_output.logits))
>>> False
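
On versions without this fix, a minimal sketch of a workaround is to nest a flat image list so each prompt gets its own sublist, which is the format the model handles correctly. The nest_images helper and the one-image-per-prompt assumption are illustrative, not part of this PR:

# Hypothetical helper, not from this PR: group a flat image list so each
# prompt gets exactly one sublist (assumes one image per prompt, as above).
def nest_images(images_flat, num_prompts):
    assert len(images_flat) == num_prompts, "expected one image per prompt"
    return [[img] for img in images_flat]

nested = nest_images([image, image], num_prompts=len(texts))  # -> [[image], [image]]
batch = processor(text=texts, images=nested, return_tensors="pt", padding=True)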

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker left a comment

Okay! I think the tests need an update, no?
Otherwise LGTM

@zucchini-nlp
Member Author

@ArthurZucker yeah, updated the test and added a check for this condition. We might need to update the idefics models as well, but I will do that later in another PR.
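
For illustration only, a rough sketch of the kind of equivalence check such a test might perform; the function name, inputs, and use of torch.testing.assert_close are assumptions, not the actual test added in this PR:

import torch

# Hypothetical check: flat and nested image inputs should produce identical logits.
def check_flat_vs_nested_images(model, processor, texts, image):
    flat = processor(text=texts, images=[image, image], return_tensors="pt", padding=True)
    nested = processor(text=texts, images=[[image], [image]], return_tensors="pt", padding=True)
    with torch.no_grad():
        out_flat = model(**flat)
        out_nested = model(**nested)
    torch.testing.assert_close(out_flat.logits, out_nested.logits)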

Collaborator

@ArthurZucker left a comment

Thanks

@zucchini-nlp zucchini-nlp merged commit 97d2f9d into huggingface:main Mar 21, 2025
11 checks passed
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* fix mllama

* update test

* fix test