Skip to content

add multimodal support #81

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 50 commits into
base: main
Choose a base branch
from
Open

add multimodal support #81

wants to merge 50 commits into from

Conversation

nph4rd
Copy link

@nph4rd nph4rd commented Jun 10, 2025

Adds multimodal support. Main changes are:

  • accept a data_collator. I think this is useful because different VLMs have different processing requirements/tooling. For instance, in the case of Qwen, you can use qwen_vl_utils to handle resizing, etc.
  • prompts are formatted with format_oai_chat_msg for the rollouts. This handles encoding them as base64 imgs for the API, but leaves the image objects for further processing later on.
  • process_chat_format now has two branches: the multimodal case uses the processing_class to get the all the extra inputs necessary. These vary from model to model, so they are generically captured by remaining_inputs. I'm still unsure this is the best way to handle that, but this works.
  • updates many references to processing_class. This can now be either a multimodal processor (with the tokenizer attr) or a tokenizer in the text-only case.
  • handle conditional logits_to_keep. Not all models accept this.
  • added a generic_model_loader function to load models without handling specific classes (like Qwen2_5_VLForConditionalGeneration )
  • added logic to handle the Liger-Kernel patching of models that are not supported with AutoLigerKernelForCausalLM
  • added an example with Qwen2.5-VL, using DocVQA

Notes:

  • transformers changed the weight keys for VLMs. I added a version restriction for now, but should fix this too in the future. See this.
  • I tested gemma3 too. It works ok in the 1+1 GPU case (see this) but I've been getting some strange errors when using either DDP or Deepspeed.

@CLAassistant
Copy link

CLAassistant commented Jun 10, 2025

CLA assistant check
All committers have signed the CLA.

@nph4rd nph4rd marked this pull request as ready for review June 19, 2025 03:17
@nph4rd nph4rd changed the title wip: add multimodal support add multimodal support Jun 19, 2025
jdchawla29 added a commit to jdchawla29/verifiers that referenced this pull request Jul 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants