PaliGemmaProcessor fails due to missing return_tensors in tokenizer call #38393

@sergiosgatidis

Description
System Info

  • transformers version: 4.52.3
  • Platform: Linux-6.8.0-59-generic-x86_64-with-glibc2.39
  • Python version: 3.10.16
  • Huggingface_hub version: 0.32.1
  • Safetensors version: 0.5.3
  • Accelerate version: 1.7.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.6.0+cu118 (True)
  • Tensorflow version (GPU?): 2.19.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.10.6 (cpu)
  • Jax version: 0.6.0
  • JaxLib version: 0.6.0
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?: yes
  • GPU type: NVIDIA RTX 6000 Ada Generation

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Summary:
When using PaliGemmaProcessor for multimodal fine-tuning with a suffix, the processor crashes with:

AttributeError: 'list' object has no attribute 'masked_fill'

This happens because return_tensors="pt" is not passed to the tokenizer internally. As a result, the tokenizer returns Python lists for input_ids and token_type_ids, while the processor assumes they're tensors — leading to a crash at:

`inputs["input_ids"].masked_fill(inputs["token_type_ids"] == 0, -100)`

Example:

import PIL.Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = 'google/paligemma2-3b-pt-224'
processor = PaliGemmaProcessor.from_pretrained(model_id)

examples = [
    {
        "prefix": "caption <loc0412><loc0269><loc0644><loc0546><seg015>",
        "suffix": "RML",
        "image": PIL.Image.new("RGB", (224, 224)),
    },
    {
        "prefix": "detect Left Fourth Rib",
        "suffix": "<loc0234><loc0621><loc0495><loc0796> Left Fourth Rib",
        "image": PIL.Image.new("RGB", (224, 224)),
    }
]

texts = ["<image>" + ex["prefix"] for ex in examples]
labels = [ex["suffix"] for ex in examples]
images = [ex["image"] for ex in examples]

tokens = processor(
    text=texts,
    images=images,
    suffix=labels,
    return_tensors="pt",
    padding="longest"
)

This raises:

AttributeError: 'list' object has no attribute 'masked_fill'
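The error itself is mechanical: a plain Python list simply has no masked_fill attribute (only torch tensors do). A stand-alone check with toy values, independent of transformers:

```python
# Toy nested lists standing in for what the tokenizer returns when
# return_tensors is not forwarded (values are illustrative, not real IDs).
input_ids = [[2, 101, 102], [2, 103, 104]]

# Lists have no masked_fill, so calling it raises AttributeError.
print(hasattr(input_ids, "masked_fill"))  # False
```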

Proposed Fix:

In the `__call__` method of PaliGemmaProcessor, the return_tensors argument is popped from text_kwargs:

return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)

…but it is never passed to self.tokenizer(...). Adding this line to the tokenizer call may fix the issue:

return_tensors=return_tensors,
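To illustrate the intended data flow, here is a minimal sketch with a toy tokenizer and processor (the names toy_tokenizer and toy_processor are hypothetical and only mirror the pattern described above, not the real transformers internals):

```python
import torch

def toy_tokenizer(texts, return_tensors=None):
    # Stand-in for the HF tokenizer: returns Python lists unless a tensor
    # format is requested, mirroring the behaviour described in this issue.
    ids = [[2, 100 + i, 200 + i] for i in range(len(texts))]
    tts = [[0, 1, 1] for _ in texts]
    if return_tensors == "pt":
        return {"input_ids": torch.tensor(ids), "token_type_ids": torch.tensor(tts)}
    return {"input_ids": ids, "token_type_ids": tts}

def toy_processor(texts, text_kwargs):
    # The processor pops return_tensors from text_kwargs...
    return_tensors = text_kwargs.pop("return_tensors", None)
    # ...and must forward it to the tokenizer -- the argument the report
    # says is missing from the real tokenizer call.
    inputs = toy_tokenizer(texts, return_tensors=return_tensors)
    # With tensors, masking the prefix (token_type_ids == 0) to -100 works.
    return inputs["input_ids"].masked_fill(inputs["token_type_ids"] == 0, -100)

labels = toy_processor(["a", "b"], {"return_tensors": "pt"})
print(labels.tolist())  # [[-100, 100, 200], [-100, 101, 201]]
```

Without the forwarded argument, toy_tokenizer would return lists and the masked_fill line would raise the same AttributeError as in the reproduction above.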

Expected behavior

The processor should correctly pass return_tensors="pt" to the tokenizer so that all fields (e.g., input_ids, token_type_ids) are returned as PyTorch tensors, allowing downstream tensor operations like .masked_fill() to work without errors.
