🔴 [VLM] Add base model without head #37033
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
@@ -1044,7 +1081,7 @@ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.
     "The bare Qwen2VL Model outputting raw hidden-states without any specific head on top.",
     QWEN2VL_START_DOCSTRING,
 )
-class Qwen2VLModel(Qwen2VLPreTrainedModel):
+class Qwen2VLTextModel(Qwen2VLPreTrainedModel):
```
This might seem breaking, but in terms of what the new Qwen2VL module expects as input and what it returns, nothing changes. It can accept more arguments now, but text-only usage still works and returns the same hidden states.
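A minimal sketch of what "nothing changes for text-only use" means in practice; the checkpoint name is only an example and the loading details are assumptions, not something stated in this PR:

```python
import torch
from transformers import AutoProcessor, Qwen2VLModel

# Example checkpoint; any Qwen2-VL checkpoint should behave the same way here.
model = Qwen2VLModel.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Text-only call: no pixel_values are passed, and the output is raw hidden states as before.
inputs = processor(text=["Hello there"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```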
Not ready yet; marking as ready just to get a CI run.
@ArthurZucker ready for review. The failing tests are hub-related; everything passes locally.
I was looking through this and I noticed a few typos.

In `modeling_gemma3.py`, line 93:

```python
@dataclass
class Gemma3CausalLMOutputWithPast(ModelOutput):
    """
    Base class for Gemma3causal language model (or autoregressive) outputs.
```

`Gemma3causal` should be `Gemma3 causal`.

In `modeling_gemma3.py`, line 1132:

```python
@add_start_docstrings(
    """Base Gemma3 model which consists of a vision backbone and a language model withou language modeling head.""",
```

`withou` should be `without`.

These are very small typos, but if it is easy enough, it would be nice if they could be fixed. I couldn't figure out where the text is actually coming from, so my comment is more of a question: where are these typos coming from? Are they easy to patch, or is there some inheritance happening that would require deeper changes?
Thanks @jacob-danner! These are probably copied from the very first model, from which we copy all the others. In this case it might be LLaVA, I will check.
* i guess reverted all CdGen classes
* style
* llava onevision
* fix copies
* fix some tests
* some more tests
* dump
* skip these
* nevermind, i am dumb
* revert fix not needed
* fixup
* fixup
* another fixup
* more fixup to make ci finally happy
* fixup after rebasing
* fix qwen tests
* add internVL + typos here and there
* image token index -> id
* style
* fix init weights
* revert blip-2 not supported
* address comments
* fix copies
* revert blip2 test file as well
* as discussed internally, revert back CdGen models
* fix some tests
* fix more tests for compile
* CI red
* fix copies
* enumerate explicitly allowed models
* address comments
* fix tests
* fixup
* style again
* add tests for new model class
* another fixup ( x _ x )
* [fixup] unused attributes can be removed post-deprecation
[transformers PR #37033](huggingface/transformers#37033) re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things). This breaks the following test:

```python
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

The reason is that all soft-prompting methods need a task type, since each task type has specific handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM). We also can't simply use `task_type='FEATURE_EXTRACTION'`, as this would not deal with `labels` either. Luckily, the VLM behaves almost like a LM (e.g., `get_input_embeddings` refers to the underlying LM), therefore we can target the VLM itself and have the soft prompt methods detect that we're fine-tuning a VLM, so that the relevant config variables are taken from `base_model.text_config` instead of `base_model` directly.
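A short sketch of what targeting the VLM itself could look like after that change; the model class and checkpoint name are illustrative assumptions, not taken from the PR:

```python
from transformers import AutoModelForImageTextToText
from peft import PrefixTuningConfig, get_peft_model

# Hypothetical checkpoint; any VLM with the new base-model layout should work similarly.
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf")

peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)

# Wrap the whole VLM rather than model.language_model; with the fix described above,
# the prefix-tuning setup reads its dimensions from model.config.text_config.
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()
```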
@zucchini-nlp To detail this a bit more: when loading with adapters (assuming the class name matches the hardcoded list, which is quite an assumption, or that the user specified a `key_mapping`), the adapters will not benefit from the `key_mapping` that the base model uses.
Should we propagate this down?
```python
key_mapping = kwargs.pop("key_mapping", None)

# Load models with hardcoded key mapping on class for VLMs only, to keep BC and standardize model
if any(allowed_name in cls.__name__.lower() for allowed_name in VLMS):
```
This is quite brittle and breaks adapters (in peft). How would you go about this?
I'm thinking we could propagate the `key_mapping` to the peft integration in the `from_pretrained` function?
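To make the concern concrete, here is a self-contained sketch of what propagating the mapping would accomplish: adapter state dicts would receive the same key renames as the base checkpoint. The mapping contents and key names below are illustrative assumptions, not the actual transformers or peft internals.

```python
import re
from typing import Dict

import torch

# Hypothetical mapping in the spirit of the hardcoded per-class VLM mappings:
# old submodule paths -> new paths after the base-model refactor.
KEY_MAPPING = {
    r"^language_model\.model": "model.language_model",
    r"^language_model\.lm_head": "lm_head",
    r"^vision_tower": "model.vision_tower",
}


def remap_adapter_state_dict(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Apply the same key renames to an adapter state dict as to the base checkpoint."""
    remapped = {}
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, replacement in KEY_MAPPING.items():
            new_key = re.sub(pattern, replacement, new_key)
        remapped[new_key] = tensor
    return remapped


# Usage sketch: an adapter saved against the old layout gets its keys rewritten
# so it lines up with a model built from the new layout.
old_style = {"language_model.model.layers.0.self_attn.q_proj.lora_A.weight": torch.zeros(8, 16)}
print(list(remap_adapter_state_dict(old_style)))
# ['model.language_model.layers.0.self_attn.q_proj.lora_A.weight']
```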
Since a lot of people (including me) use adapters with VLMs, that's quite a big breaking change
For anyone looking, this was fixed!
What does this PR do?

Stage one of vLLM support for vision LLMs via the transformers backend. As discussed internally, we don't want to break existing models, so we sacrifice readability and duplicate code.

The PR adds base models for all models where they are missing. The base model is supposed to be the same as in LLMs: everything except the head. This allows us to make modeling more aligned with vLLM and to have a standard API for multimodal generation models (a rough sketch of the resulting layout follows below). It will be super helpful in the long run, for example for `AutoToAny` mapping.

Next stages for modeling to help vLLM and TGI:

* Return `mm-token-type-ids` if requested, to indicate where the image/video/audio placeholders are.
* Add `get_num_of_image_tokens` for all processors, which returns the placeholder length given an image.

Fixes #36940.
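To illustrate the "base model without head" layout, here is a minimal, self-contained toy sketch of the structure the PR moves towards. The class and attribute names are simplified stand-ins for illustration only; just the base-model / `language_model` / top-level `lm_head` split mirrors the PR, not the actual transformers code.

```python
import torch
from torch import nn


class TinyTextModel(nn.Module):
    """Stand-in for a text-only base model (e.g. Qwen2VLTextModel): returns hidden states."""

    def __init__(self, hidden_size=32, vocab_size=100):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layer = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids):
        return self.layer(self.embed_tokens(input_ids))


class TinyVLMBaseModel(nn.Module):
    """Stand-in for the new base VLM: vision backbone + text model, no LM head."""

    def __init__(self, hidden_size=32, vocab_size=100):
        super().__init__()
        self.vision_tower = nn.Linear(16, hidden_size)  # toy vision backbone
        self.language_model = TinyTextModel(hidden_size, vocab_size)

    def forward(self, input_ids, pixel_values=None):
        # In the real models, image features would be merged into the text embeddings here.
        return self.language_model(input_ids)


class TinyVLMForConditionalGeneration(nn.Module):
    """Generation class: base VLM plus the LM head, which now lives at the top level."""

    def __init__(self, hidden_size=32, vocab_size=100):
        super().__init__()
        self.model = TinyVLMBaseModel(hidden_size, vocab_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids, pixel_values=None):
        hidden_states = self.model(input_ids, pixel_values)
        return self.lm_head(hidden_states)


# Text-only usage still works end to end and returns logits over the vocabulary.
logits = TinyVLMForConditionalGeneration()(torch.randint(0, 100, (1, 5)))
print(logits.shape)  # torch.Size([1, 5, 100])
```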