PaliGemmaProcessor fails due to missing return_tensors in tokenizer call #38393

@sergiosgatidis

Description
System Info

  • transformers version: 4.52.3
  • Platform: Linux-6.8.0-59-generic-x86_64-with-glibc2.39
  • Python version: 3.10.16
  • Huggingface_hub version: 0.32.1
  • Safetensors version: 0.5.3
  • Accelerate version: 1.7.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.6.0+cu118 (True)
  • Tensorflow version (GPU?): 2.19.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.10.6 (cpu)
  • Jax version: 0.6.0
  • JaxLib version: 0.6.0
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?: yes
  • GPU type: NVIDIA RTX 6000 Ada Generation

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Summary:
When using PaliGemmaProcessor for multimodal fine-tuning with a suffix, the processor crashes with:

AttributeError: 'list' object has no attribute 'masked_fill'

This happens because return_tensors="pt" is not passed to the tokenizer internally. As a result, the tokenizer returns Python lists for input_ids and token_type_ids, while the processor assumes they're tensors — leading to a crash at:

`inputs["input_ids"].masked_fill(inputs["token_type_ids"] == 0, -100)`

Example:

import PIL.Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = 'google/paligemma2-3b-pt-224'
processor = PaliGemmaProcessor.from_pretrained(model_id)

examples = [
    {
        "prefix": "caption <loc0412><loc0269><loc0644><loc0546><seg015>",
        "suffix": "RML",
        "image": PIL.Image.new("RGB", (224, 224)),
    },
    {
        "prefix": "detect Left Fourth Rib",
        "suffix": "<loc0234><loc0621><loc0495><loc0796> Left Fourth Rib",
        "image": PIL.Image.new("RGB", (224, 224)),
    }
]

texts = ["<image>" + ex["prefix"] for ex in examples]
labels = [ex["suffix"] for ex in examples]
images = [ex["image"] for ex in examples]

tokens = processor(
    text=texts,
    images=images,
    suffix=labels,
    return_tensors="pt",
    padding="longest"
)

This raises:

AttributeError: 'list' object has no attribute 'masked_fill'
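The error itself is mechanical: a plain Python list simply has no masked_fill attribute (only torch tensors do). A stand-alone check with toy values, independent of transformers:

```python
# Toy nested lists standing in for what the tokenizer returns when
# return_tensors is not forwarded (values are illustrative, not real IDs).
input_ids = [[2, 101, 102], [2, 103, 104]]

# Lists have no masked_fill, so calling it raises AttributeError.
print(hasattr(input_ids, "masked_fill"))  # False
```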

Proposed Fix:

In the `__call__` method of PaliGemmaProcessor, the return_tensors argument is popped from text_kwargs:

return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)

…but it is never passed to self.tokenizer(...). Adding this line to the tokenizer call may fix the issue:

return_tensors=return_tensors,
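To illustrate the intended data flow, here is a minimal sketch with a toy tokenizer and processor (the names toy_tokenizer and toy_processor are hypothetical and only mirror the pattern described above, not the real transformers internals):

```python
import torch

def toy_tokenizer(texts, return_tensors=None):
    # Stand-in for the HF tokenizer: returns Python lists unless a tensor
    # format is requested, mirroring the behaviour described in this issue.
    ids = [[2, 100 + i, 200 + i] for i in range(len(texts))]
    tts = [[0, 1, 1] for _ in texts]
    if return_tensors == "pt":
        return {"input_ids": torch.tensor(ids), "token_type_ids": torch.tensor(tts)}
    return {"input_ids": ids, "token_type_ids": tts}

def toy_processor(texts, text_kwargs):
    # The processor pops return_tensors from text_kwargs...
    return_tensors = text_kwargs.pop("return_tensors", None)
    # ...and must forward it to the tokenizer -- the argument the report
    # says is missing from the real tokenizer call.
    inputs = toy_tokenizer(texts, return_tensors=return_tensors)
    # With tensors, masking the prefix (token_type_ids == 0) to -100 works.
    return inputs["input_ids"].masked_fill(inputs["token_type_ids"] == 0, -100)

labels = toy_processor(["a", "b"], {"return_tensors": "pt"})
print(labels.tolist())  # [[-100, 100, 200], [-100, 101, 201]]
```

Without the forwarded argument, toy_tokenizer would return lists and the masked_fill line would raise the same AttributeError as in the reproduction above.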

Expected behavior

The processor should correctly pass return_tensors="pt" to the tokenizer so that all fields (e.g., input_ids, token_type_ids) are returned as PyTorch tensors, allowing downstream tensor operations like .masked_fill() to work without errors.
