[BUG] OmDet-Turbo processor produces 640px inputs but the model expects 224px #44610

@harshaljanjani

Description

System Info

  • transformers version: 5.0.0.dev0
  • Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • huggingface_hub version: 1.3.2
  • safetensors version: 0.7.0
  • accelerate version: 1.12.0
  • Accelerate config: not installed
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • GPU type: NVIDIA L4
  • NVIDIA driver version: 550.90.07
  • CUDA version: 12.4

Who can help?

@zucchini-nlp (🚨 Delete duplicate code in backbone utils)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, OmDetTurboForObjectDetection
from PIL import Image
import requests
import torch

model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw).convert("RGB")
encoding = processor(images=image, text=["cat", "remote"], task="Detect cat, remote.", return_tensors="pt")
try:
    with torch.no_grad():
        outputs = model(**encoding)
    print(outputs.decoder_coord_logits.shape)
except Exception as e:
    print(e)

Current Repro Output:

(screenshot: AssertionError traceback)

OmDet-Turbo inference fails with an AssertionError: the processor produces 640×640 images while the model expects an input height of 224, so running the official loading and inference code raises `AssertionError: Input height (640) doesn't match model (224)` (see screenshot) instead of returning the expected output tensor. The same mismatch also breaks the official OmDet-Turbo CI run.
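For context, the error message matches the kind of fixed-size check found in vision-backbone patch embeddings. The following is a hypothetical simplification to illustrate the mismatch, not the actual transformers code; `expected_size=224` mirrors the error message:

```python
def patch_embed_size_check(pixel_values_hw, expected_size=224):
    """Mimic a fixed-size assertion in a vision backbone's patch embedding.

    pixel_values_hw: (height, width) of the processed images.
    expected_size: the image size baked into the backbone config.
    """
    height, width = pixel_values_hw
    if height != expected_size:
        raise AssertionError(
            f"Input height ({height}) doesn't match model ({expected_size})."
        )
    return True

# The processor emits 640x640 images, so a 224-expecting check fails:
try:
    patch_embed_size_check((640, 640))
except AssertionError as e:
    print(e)  # Input height (640) doesn't match model (224).
```

This suggests the backbone config's image size and the processor's output size have drifted apart; the fix presumably lies in whichever side no longer matches the checkpoint.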

Expected behavior

outputs.decoder_coord_logits.shape should be torch.Size([1, 900, 4]); the model should accept the 640×640 images the processor is configured to produce.
