🔴 [VLM] Add base model without head #37033
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
@@ -1044,7 +1081,7 @@ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.
     "The bare Qwen2VL Model outputting raw hidden-states without any specific head on top.",
     QWEN2VL_START_DOCSTRING,
 )
-class Qwen2VLModel(Qwen2VLPreTrainedModel):
+class Qwen2VLTextModel(Qwen2VLPreTrainedModel):
```
This might seem breaking, but in terms of what the new Qwen2VL module expects as input and what it returns, nothing changes. It can accept more arguments now, but text-only usage still works and returns the same hidden states.
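A minimal sketch of what "nothing changes for text-only use" means in practice; the checkpoint name is only an example and the loading details are assumptions, not something stated in this PR:

```python
import torch
from transformers import AutoProcessor, Qwen2VLModel

# Example checkpoint; any Qwen2-VL checkpoint should behave the same way here.
model = Qwen2VLModel.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Text-only call: no pixel_values are passed, and the output is raw hidden states as before.
inputs = processor(text=["Hello there"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```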
Not ready yet; marking as ready just to get a CI run.
@ArthurZucker ready for review. The failing tests are hub-related; everything passes locally.
I was looking through this and I noticed a few typos.

In `modeling_gemma3.py`, line 93:

```python
@dataclass
class Gemma3CausalLMOutputWithPast(ModelOutput):
    """
    Base class for Gemma3causal language model (or autoregressive) outputs.
```

`Gemma3causal` should be `Gemma3 causal`.

In `modeling_gemma3.py`, line 1132:

```python
@add_start_docstrings(
    """Base Gemma3 model which consists of a vision backbone and a language model withou language modeling head.""",
```

`withou` should be `without`.

These are very small typos, but if it is easy enough, it would be nice if they could be fixed. I couldn't figure out where the text is actually coming from, so my comment is more of a question: where are these typos coming from? Are they easy to patch, or is there some inheritance happening that would require deeper changes?
Thanks @jacob-danner! These are probably copied from the very first model, from which we copy all the others. In this case it might be LLaVA, I will check.
* i guess reverted all CdGen classes
* style
* llava onevision
* fix copies
* fix some tests
* some more tests
* dump
* skip these
* nevermind, i am dumb
* revert fix not needed
* fixup
* fixup
* another fixup
* more fixup to make ci finally happy
* fixup after rebasing
* fix qwen tests
* add internVL + typos here and there
* image token index -> id
* style
* fix init weights
* revert blip-2 not supported
* address comments
* fix copies
* revert blip2 test file as well
* as discussed internally, revert back CdGen models
* fix some tests
* fix more tests for compile
* CI red
* fix copies
* enumerate explicitly allowed models
* address comments
* fix tests
* fixup
* style again
* add tests for new model class
* another fixup ( x _ x )
* [fixup] unused attributes can be removed post-deprecation
[transformers PR #37033](huggingface/transformers#37033) re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things). This breaks the following test:

```python
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

The reason is that all soft-prompting methods need a task type, since each task type has specific handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM). We also can't simply use `task_type='FEATURE_EXTRACTION'`, as this would not deal with `labels` either. Luckily, the VLM behaves almost like a LM (e.g., `get_input_embeddings` refers to the underlying LM), therefore we can target the VLM itself and have the soft prompt methods detect that we're fine-tuning a VLM, so that the relevant config variables are taken from `base_model.text_config` instead of `base_model` directly.
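A short sketch of what targeting the VLM itself could look like after that change; the model class and checkpoint name are illustrative assumptions, not taken from the PR:

```python
from transformers import AutoModelForImageTextToText
from peft import PrefixTuningConfig, get_peft_model

# Hypothetical checkpoint; any VLM with the new base-model layout should work similarly.
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf")

peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)

# Wrap the whole VLM rather than model.language_model; with the fix described above,
# the prefix-tuning setup reads its dimensions from model.config.text_config.
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()
```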
@zucchini-nlp To detail this a bit more: when loading with adapters (assuming the class name matches the hardcoded list, which is quite an assumption, or that the user specified a `key_mapping`), the adapters will not benefit from the `key_mapping` that the base model uses.
Should we propagate this down?
```python
key_mapping = kwargs.pop("key_mapping", None)

# Load models with hardcoded key mapping on class for VLMs only, to keep BC and standardize model
if any(allowed_name in cls.__name__.lower() for allowed_name in VLMS):
```
This is quite brittle and breaks adapters (in peft). How would you go about this?
I'm thinking we could propagate the `key_mapping` to the peft integration in the `from_pretrained` function?
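To make the concern concrete, here is a self-contained sketch of what propagating the mapping would accomplish: adapter state dicts would receive the same key renames as the base checkpoint. The mapping contents and key names below are illustrative assumptions, not the actual transformers or peft internals.

```python
import re
from typing import Dict

import torch

# Hypothetical mapping in the spirit of the hardcoded per-class VLM mappings:
# old submodule paths -> new paths after the base-model refactor.
KEY_MAPPING = {
    r"^language_model\.model": "model.language_model",
    r"^language_model\.lm_head": "lm_head",
    r"^vision_tower": "model.vision_tower",
}


def remap_adapter_state_dict(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Apply the same key renames to an adapter state dict as to the base checkpoint."""
    remapped = {}
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, replacement in KEY_MAPPING.items():
            new_key = re.sub(pattern, replacement, new_key)
        remapped[new_key] = tensor
    return remapped


# Usage sketch: an adapter saved against the old layout gets its keys rewritten
# so it lines up with a model built from the new layout.
old_style = {"language_model.model.layers.0.self_attn.q_proj.lora_A.weight": torch.zeros(8, 16)}
print(list(remap_adapter_state_dict(old_style)))
# ['model.language_model.layers.0.self_attn.q_proj.lora_A.weight']
```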
Since a lot of people (including me) use adapters with VLMs, that's quite a big breaking change
For anyone looking, this was fixed!
What does this PR do?

Stage one of vLLM support for vision LLMs via the transformers backend. As discussed internally, we don't want to break existing models, so we sacrifice readability and duplicate code.

The PR adds base models for all models where they are missing. The base model is supposed to be the same as in LLMs: everything except the head. This allows us to make modeling more aligned with vLLM and to have a standard API for multimodal generation models (a rough sketch of the resulting layout follows below). It will be super helpful in the long run, for example for `AutoToAny` mapping.

Next stages for modeling to help vLLM and TGI:

* Return `mm-token-type-ids` if requested, to indicate where the image/video/audio placeholders are.
* Add `get_num_of_image_tokens` for all processors, which returns the placeholder length given an image.

Fixes #36940.
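To illustrate the "base model without head" layout, here is a minimal, self-contained toy sketch of the structure the PR moves towards. The class and attribute names are simplified stand-ins for illustration only; just the base-model / `language_model` / top-level `lm_head` split mirrors the PR, not the actual transformers code.

```python
import torch
from torch import nn


class TinyTextModel(nn.Module):
    """Stand-in for a text-only base model (e.g. Qwen2VLTextModel): returns hidden states."""

    def __init__(self, hidden_size=32, vocab_size=100):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layer = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids):
        return self.layer(self.embed_tokens(input_ids))


class TinyVLMBaseModel(nn.Module):
    """Stand-in for the new base VLM: vision backbone + text model, no LM head."""

    def __init__(self, hidden_size=32, vocab_size=100):
        super().__init__()
        self.vision_tower = nn.Linear(16, hidden_size)  # toy vision backbone
        self.language_model = TinyTextModel(hidden_size, vocab_size)

    def forward(self, input_ids, pixel_values=None):
        # In the real models, image features would be merged into the text embeddings here.
        return self.language_model(input_ids)


class TinyVLMForConditionalGeneration(nn.Module):
    """Generation class: base VLM plus the LM head, which now lives at the top level."""

    def __init__(self, hidden_size=32, vocab_size=100):
        super().__init__()
        self.model = TinyVLMBaseModel(hidden_size, vocab_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids, pixel_values=None):
        hidden_states = self.model(input_ids, pixel_values)
        return self.lm_head(hidden_states)


# Text-only usage still works end to end and returns logits over the vocabulary.
logits = TinyVLMForConditionalGeneration()(torch.randint(0, 100, (1, 5)))
print(logits.shape)  # torch.Size([1, 5, 100])
```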