add multimodal support #81

nph4rd · 2025-06-10T02:59:40Z

Adds multimodal support. Main changes are:

accept a data_collator. I think this is useful because different VLMs have different processing requirements/tooling. For instance, in the case of Qwen, you can use qwen_vl_utils to handle resizing, etc.
prompts are formatted with format_oai_chat_msg for the rollouts. This handles encoding them as base64 imgs for the API, but leaves the image objects for further processing later on.
process_chat_format now has two branches: the multimodal case uses the processing_class to get the all the extra inputs necessary. These vary from model to model, so they are generically captured by remaining_inputs. I'm still unsure this is the best way to handle that, but this works.
updates many references to processing_class. This can now be either a multimodal processor (with the tokenizer attr) or a tokenizer in the text-only case.
handle conditional logits_to_keep. Not all models accept this.
added a generic_model_loader function to load models without handling specific classes (like Qwen2_5_VLForConditionalGeneration )
added logic to handle the Liger-Kernel patching of models that are not supported with AutoLigerKernelForCausalLM
added an example with Qwen2.5-VL, using DocVQA

Notes:

transformers changed the weight keys for VLMs. I added a version restriction for now, but should fix this too in the future. See this.
I tested gemma3 too. It works ok in the 1+1 GPU case (see this) but I've been getting some strange errors when using either DDP or Deepspeed.

CLAassistant · 2025-06-10T02:59:47Z

All committers have signed the CLA.

pass mm kwargs

nph4rd added 30 commits June 2, 2025 19:30

use flexible model loading

3774b56

start example

2fbd45a

use AutoProcessor class

ad19f9f

fix processor calls and pin transformers

88405ce

gather images

38062e0

format images

f91df35

fix rich log

85c3f31

update comments

d4e9818

update example

f38eee8

fix wandb logs

c40152a

resize

6eddfd6

model len

3be9f45

update example

65aeb1b

fix format and remove unused images

606027a

fix image unpacking

bd76eab

change format dataset

a89de9b

opt

6c464f9

fix format on text-only

5036f69

fix _gather_batch_data type

f3959f0

relax transformers condition

b8732f7

update comment / increase lr

e461258

liger monkey patch

58ac1c7

generic liger patch

2e25655

increase lr

0da5888

return to old naming

c9eaa02

remove todos

8aff5d1

restore padding side

9bb135c

remove padding side for completion_mask

a0323d7

fix wandb logging

6b630ba

logging format

edd9b29

nph4rd added 20 commits June 10, 2025 21:06

use data collator

f3f403e

format oai-api prompts

815e000

post-process images

6ed448d

fix text position

24ed694

process inputs in environment

ffee1d6

increase res and lr

e0026a8

fix

f51c665

Merge pull request #1 from nph4rd/mm-kwargs

9e175bc

pass mm kwargs

fix eval with data collator

4dd16b7

transform eval ds once

020d9d7

change eval steps

ac5682b

fix batch size in func call

2538d62

liger patch suffix opt

b1b90e4

load ref with generic_model_loader

5dbc658

set use_reentrant false

ee2e44c

reset format_dataset func

9ff8b79

format stuff

49c28bb

rase error

cc69cc6

Merge branch 'main' into multimodal

eed2fe2

update example

d41d00f

nph4rd marked this pull request as ready for review June 19, 2025 03:17

nph4rd changed the title ~~wip: add multimodal support~~ add multimodal support Jun 19, 2025

jdchawla29 added a commit to jdchawla29/verifiers that referenced this pull request Jul 21, 2025

squash PR willccbb#81 – adds VLM support

7620b70

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add multimodal support #81

add multimodal support #81

Uh oh!

nph4rd commented Jun 10, 2025 •

edited

Loading

Uh oh!

CLAassistant commented Jun 10, 2025 •

edited

Loading

Uh oh!

Uh oh!

add multimodal support #81

Are you sure you want to change the base?

add multimodal support #81

Uh oh!

Conversation

nph4rd commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CLAassistant commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

nph4rd commented Jun 10, 2025 •

edited

Loading

CLAassistant commented Jun 10, 2025 •

edited

Loading