[VLM] Add base model without head #37033


Merged: 50 commits, May 7, 2025

Conversation

@zucchini-nlp (Member) commented Mar 27, 2025

What does this PR do?

Stage one of supporting vision LLMs in vLLM via the transformers backend. As discussed internally, we don't want to break existing models, so we sacrifice some readability and duplicate code.

The PR adds base models for all models where they are missing. The base model is supposed to be the same as in LLMs: everything except the head. This allows us to make modeling more aligned with vLLM and to have a standard API for multimodal generation models. It will be super helpful in the long run, for example for AutoToAny mapping.
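
As a rough illustration only (not code from this PR; the checkpoint name and the auto classes used here are my assumptions), the split looks like this: the base class returns hidden states, while the head class keeps the LM head and returns logits.

```
from transformers import AutoModel, AutoModelForImageTextToText

ckpt = "Qwen/Qwen2-VL-2B-Instruct"  # example checkpoint, used only for illustration

# Base model: vision tower + text backbone, no LM head; forward returns hidden states.
base_model = AutoModel.from_pretrained(ckpt)

# Generation model: the same base model plus the language-modeling head; returns logits.
full_model = AutoModelForImageTextToText.from_pretrained(ckpt)
```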

Next stages for modeling to help vLLM and TGI:

  • Add new attention interface where still absent
  • Qwen2 config workaround without breaking
  • Processor standardization sprint
    • Add attributes for image_token_id/image_token if missing
    • Return mm-token-type-ids, if requested, to indicate where image/video/audio placeholders are
    • Add a get_num_of_image_tokens helper to all processors, which returns the placeholder length for a given image (a rough sketch follows this list)
    • Explore what else is missing for processors
  • Finalize and merge PR on vLLM repo, check correctness for different models
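
A rough sketch of what such a helper could look like. This is purely hypothetical: the method name comes from the list above, but the signature and the patch-based formula are my assumptions, not an actual transformers API.

```
# Hypothetical processor method, sketched only to illustrate the idea above.
# The real helper may end up with a different name, signature, and formula.
def get_num_of_image_tokens(self, image_height: int, image_width: int) -> int:
    """Return how many placeholder tokens a single image expands to."""
    patch_size = self.image_processor.patch_size                 # e.g. 14 or 16
    merge_size = getattr(self.image_processor, "merge_size", 1)  # spatial merge factor, if any
    patches_h = image_height // patch_size
    patches_w = image_width // patch_size
    return (patches_h * patches_w) // (merge_size * merge_size)
```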

Fixes #36940.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -1044,7 +1081,7 @@ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.
"The bare Qwen2VL Model outputting raw hidden-states without any specific head on top.",
QWEN2VL_START_DOCSTRING,
)
-class Qwen2VLModel(Qwen2VLPreTrainedModel):
+class Qwen2VLTextModel(Qwen2VLPreTrainedModel):
@zucchini-nlp (Member, Author) Mar 27, 2025

This might seem breaking, but in terms of what the new Qwen2VL module expects as input and what it returns, nothing changes. It can accept more args now, but text-only input is still fine and returns the same hidden states.
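
For example, something along these lines should keep working unchanged after the rename (the checkpoint name is an assumption, used only for illustration):

```
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "Qwen/Qwen2-VL-2B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)  # resolves to the multimodal base model, no LM head

# Text-only input: no pixel_values, and we still get hidden states back as before.
inputs = tokenizer("Hello from a text-only forward pass", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state
```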

@zucchini-nlp (Member, Author)

Not ready yet; marking as ready to get a CI run.

@zucchini-nlp marked this pull request as ready for review March 27, 2025 10:27
@zucchini-nlp (Member, Author) commented Mar 28, 2025

@ArthurZucker ready for review. The failing tests are hub-related; everything passes locally.

@jacob-danner

I was looking through this and noticed a few typos in modeling_gemma3.py. I dug in and saw `python utils/modular_model_converter.py --files_to_parse src/transformers/models/<your_model>/modular_<your_model>.py`, so I understand the file is generated from modular_gemma3.py. However, the typos don't exist in modular_gemma3.py.

# modeling_gemma3.py line 93

@dataclass
class Gemma3CausalLMOutputWithPast(ModelOutput):
    """
    Base class for Gemma3causal language model (or autoregressive) outputs.
   
# Gemma3causal should be Gemma3 causal
# modeling_gemma3.py line 1132

@add_start_docstrings(
    """Base Gemma3 model which consists of a vision backbone and a language model withou language modeling head.""",
    
# withou should be without

These are very small typos, but if it is easy enough, it would be nice if they could be fixed. However, I couldn't figure out where the text is actually coming from, so my comment is more of a question: where are these typos coming from? Are they easy to patch, or is there some inheritance happening that would require deeper changes?

@zucchini-nlp (Member, Author)

Thanks @jacob-danner! These are probably copied from the very first model, from which we copy all the others. In this case it might be LLaVA; I will check it.

@zucchini-nlp merged commit 17742bd into huggingface:main May 7, 2025
20 checks passed
@jiqing-feng mentioned this pull request May 14, 2025
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* i guess reverted all CdGen classes

* style

* llava onevision

* fix copies

* fix some tests

* some more tests

* dump

* skip these

* nevermind, i am dumb

* revert fix not needed

* fixup

* fixup

* another fixup

* more fixup to make ci finally happy

* fixup after rebasing

* fix qwen tests

* add internVL + typos here and there

* image token index -> id

* style

* fix init weights

* revert blip-2 not supported

* address comments

* fix copies

* revert blip2 test file as well

* as discussed internally, revert back CdGen models

* fix some tests

* fix more tests for compile

* CI red

* fix copies

* enumerate explicitly allowed models

* address comments

* fix tests

* fixup

* style again

* add tests for new model class

* another fixup ( x _ x )

* [fixup] unused attributes can be removed post-deprecation
githubnemo pushed a commit to githubnemo/peft that referenced this pull request May 26, 2025
[transformers PR #37033](huggingface/transformers#37033) re-arranges
the way visual language models are built by moving the LM head from the language model to
the top-level VLM (among other things).

This breaks the following test:

```
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

Reason being that all soft-prompting methods need a task type since each task type has specific
handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM).
We also can't simply use `task_type='FEATURE_EXTRACTION'` as this would not deal with `labels` either.

Luckily the VLM almost behaves like an LM (e.g., `get_input_embeddings` refers to the underlying LM),
therefore we can target the VLM itself and need to have the soft prompt methods detect if we're fine-tuning
a VLM so that we take the respective config variables from the `base_model.text_config` instead of `base_model`
directly.
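
A minimal sketch of the workaround described above, targeting the whole VLM instead of `model.language_model` (the `decoder_geometry` helper below is hypothetical and only illustrates the `text_config` fallback; it is not PEFT's actual code):

```
from peft import PrefixTuningConfig, get_peft_model

# `model` is the loaded VLM from the test above; target it directly
# rather than model.language_model.
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
peft_model = get_peft_model(model, peft_config)

# Hypothetical helper: read the decoder geometry from text_config when
# fine-tuning a VLM, otherwise from the config itself.
def decoder_geometry(config):
    cfg = getattr(config, "text_config", None) or config
    return cfg.num_hidden_layers, cfg.num_attention_heads, cfg.hidden_size
```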
githubnemo added a commit to huggingface/peft that referenced this pull request Jun 2, 2025
@ManuelFay (Contributor) left a comment

@zucchini-nlp To detail this a bit more: when loading with adapters (assuming the class name matches the hardcoded list, which is quite an assumption, or that the user specified a key_mapping), the adapters will not benefit from the key_mapping that the base model uses.

Should we propagate this down?

key_mapping = kwargs.pop("key_mapping", None)

# Load models with hardcoded key mapping on class for VLMs only, to keep BC and standardize model
if any(allowed_name in cls.__name__.lower() for allowed_name in VLMS):
@ManuelFay (Contributor) Jun 5, 2025

This is quite brittle and breaks adapters (in PEFT). How would you go about this?
I'm thinking we could propagate the key_mapping to the PEFT integration in the from_pretrained function?
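
For reference, a sketch of what passing a key_mapping explicitly at load time could look like. The checkpoint name and the regex-style patterns below are illustrative guesses, not the actual hardcoded mapping on the class.

```
from transformers import AutoModelForImageTextToText

# Illustrative mapping from old checkpoint key prefixes to the new layout.
key_mapping = {
    r"^language_model.model": "model.language_model",
    r"^vision_tower": "model.vision_tower",
    r"^language_model.lm_head": "lm_head",
}

model = AutoModelForImageTextToText.from_pretrained(
    "org/some-older-vlm-checkpoint",  # hypothetical checkpoint name
    key_mapping=key_mapping,
)
```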

A contributor commented:

Since a lot of people (including me) use adapters with VLMs, that's quite a big breaking change

A collaborator commented:

For anyone looking, this was fixed!

efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025

Successfully merging this pull request may close these issues.

Gemma3 not supported in main branch
7 participants