[model] Support InternVL2.5-3 Series #7258
Conversation
Force-pushed 4c01ba8 to 20eeb05
|
I know, we need to use yonigozlan/InternVL2_5-1B-MPO-hf instead of the original OpenGVLab/InternVL2_5-1B-MPO! |
Could you test with this model card one more time? I can't guarantee that the latest version is available. Feel free to report bugs; this PR is not complete yet. :[ |
|
How do I convert an InternVL2_5-1B-MPO checkpoint to HF format if I have a customized-pretrained InternVL model? Please help!
|
In my case, just replace that code with `tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", return_token_type_ids=False, trust_remote_code=True)`. It will replace the tokenizer with InternLM2's. |
Thanks a lot for your timely response!! When I load the converted checkpoint I get:

```
Loading checkpoint shards: 100%|██████████| 4/4 [00:23<00:00, 5.83s/it]
Some weights of the model checkpoint at InternVL2_5-8B-MPO-hf were not used when initializing InternVLForConditionalGeneration: ['vision_tower.encoder.layer.0.attention.attention.key.bias', 'vision_tower.encoder.layer.0.attention.attention.key.weight', ... (the attention query/key/value, attention.output.dense, intermediate.dense and output.dense weights and biases of vision_tower.encoder.layer.0 through 23)]
- This IS expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of InternVLForConditionalGeneration were not initialized from the model checkpoint at InternVL2_5-8B-MPO-hf and are newly initialized: ['vision_tower.encoder.layer.0.attention.key.bias', 'vision_tower.encoder.layer.0.attention.key.weight', ... (the attention query/key/value/output and mlp.up_proj/down_proj weights and biases of vision_tower.encoder.layer.0 through 23)]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```

It seems that none of the vision parts are initialized? |
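For what it's worth, a quick way to confirm a parameter-name mismatch like this is to diff the checkpoint keys against the keys the installed modeling code expects. A minimal sketch, assuming a sharded safetensors checkpoint with an index file (the local path is a placeholder):

```python
# Compare parameter names stored in the -hf checkpoint with the names the
# installed InternVL modeling code expects.
import json

from transformers import InternVLForConditionalGeneration

ckpt_dir = "./InternVL2_5-8B-MPO-hf"  # placeholder: your converted folder
with open(f"{ckpt_dir}/model.safetensors.index.json") as f:
    ckpt_keys = set(json.load(f)["weight_map"])

model = InternVLForConditionalGeneration.from_pretrained(ckpt_dir)
model_keys = set(model.state_dict())

print("in checkpoint only:", sorted(ckpt_keys - model_keys)[:10])
print("in model only:", sorted(model_keys - ckpt_keys)[:10])
```

If the two sets disagree on the vision tower names, as in the warning above, the checkpoint was converted against a different transformers revision than the one installed.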
Could you check the `write_tokenizer` function?

```python
def write_tokenizer(save_dir: str, push_to_hub: bool = False, path: str = None, hub_dir: str = None):
    if LM_TYPE_CORRESPONDENCE[path] == "qwen2":
        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", return_token_type_ids=False)
        tokenizer.model_max_length = CONTEXT_LENGTH
```

For |
Actually, I am using a customized-pretrained model. According to your project:

```python
LM_TYPE_CORRESPONDENCE = {
    "OpenGVLab/InternVL2_5-1B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-2B-MPO": "llama",
    "OpenGVLab/InternVL2_5-4B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-8B-MPO": "llama",
    "OpenGVLab/InternVL2_5-26B-MPO": "llama",
    "OpenGVLab/InternVL2_5-38B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-78B-MPO": "qwen2",
}
```

I suppose I should use the path pointing at "llama"? |
|
Yes! Wait for a moment. I am going to reproduce it. |
Additionally, this error occurs when I directly attempt to load your published model, without running the weight-conversion script. I don't know what could cause the mismatch between parameter names. I installed the newest transformers version with the following pip install: |
A quick check. It seems that your
To fix: update transformers to align with the commit.

```python
# test with converting to InternVL3-8B-hf
from transformers import InternVLForConditionalGeneration

model = InternVLForConditionalGeneration.from_pretrained("./InternVL3-8B-hf", device_map="auto")
```
|
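To confirm that the installed build actually matches that commit, a quick hedged check is to print the version and make sure the InternVL classes are importable:

```python
import transformers

print(transformers.__version__)  # a ".dev0" suffix indicates an install from source
# Raises ImportError if the installed transformers is too old to include InternVL.
from transformers import InternVLForConditionalGeneration
```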
|
@Kuangdd01 we should add |
Yes, my bad. Wait for a second. |
|
An error is reported after the model is fine-tuned: "Unrecognized configuration class <class 'transformers.models.internvl.configuration_internvl.InternVLConfig'> for this kind of AutoModel: AutoModel." |
There was a mismatch between transformers and model ckpt. |
I followed these steps, but I get an error. |
See the version-to-model correspondence in #7258 (comment), update the llamafactory code, and try again. |
Commits in this PR:
- add internvl and rebase
- fix for internvl2&3
- remove lines
- fix video_inputs & lint
- nit
- add constants
- remove lines
- fix
- fix error
- pass ci
- pass ci
- skip internvl & nit
|
@
Hi, thanks for your contribution! May I ask if you have an example script for fine-tuning on a text-image-to-text dataset? |
|
@Elenore1997 use |
And the remaining hyper-parameters are the same as mentioned above? |
It depends on your fine-tuning method and your dataset size. |
I pulled the latest training code; the model repo I used is the 8B-hf one, and transformers was installed with pip install git+https://github.com/huggingface/transformers.git@main |
|
Unrelated; it means literally what it says: your pixel values are float32 while the ViT is bf16, so the dtypes don't match.

```python
batch_size, num_channels, height, width = pixel_values.shape
if num_channels != self.num_channels:
    raise ValueError(
        "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
    )
# cast pixel values to the projection weight's dtype (e.g. float32 -> bfloat16)
target_dtype = self.projection.weight.dtype
if pixel_values.dtype != target_dtype:
    pixel_values = pixel_values.to(dtype=target_dtype)
embeddings = self.projection(pixel_values)
```
|
Thanks for the quick reply; I can now train successfully~ |
Hello, I'd like to ask: I'm doing LoRA SFT with InternVL3-8B, using the OpenGVLab/InternVL3-8B model. Whether MODEL_PATH is set to my OpenGVLab/InternVL3-8B checkpoint converted to HF format or to the model exported after LoRA SFT training, this problem occurs. Could you help me with it? Many thanks @Kuangdd01 |
The InternVL model format currently supported by vLLM is InternVLChat rather than internvl-hf. If you want to serve it with vLLM, you need to convert the -hf checkpoint back to the original parameter naming and use the original config files. For now you can use the HF engine as the inference backend, though it is much slower. |
Hi, how do I convert it back? I couldn't find a corresponding script in transformers. Could you explain how to do it? I tried directly modifying model_type myself, but that didn't work. Many thanks @Kuangdd01 |
I haven't tried this myself, but you can follow transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py and reverse the process. If your InternVL LLM backbone is Qwen, you only need to rename the vision part, i.e. apply the inverse renaming to the parameter names of the current HF version. |
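As a rough illustration of that reverse process only (the renaming rules below are placeholders; take the authoritative name pairs from convert_internvl_weights_to_hf.py and invert them):

```python
# Walk an -hf shard and map parameter names back toward the original
# InternVLChat naming. Repeat for every shard of the checkpoint.
import re

from safetensors.torch import load_file, save_file

hf_to_orig = {
    # placeholder rules for illustration; verify against the conversion script
    r"^vision_tower\.": "vision_model.",
    r"^model\.language_model\.": "language_model.model.",
}

state_dict = load_file("model-00001-of-00004.safetensors")
renamed = {}
for name, tensor in state_dict.items():
    new_name = name
    for pattern, repl in hf_to_orig.items():
        new_name = re.sub(pattern, repl, new_name)
    renamed[new_name] = tensor
save_file(renamed, "converted-00001.safetensors")
```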
|
pip install git+https://github.com/huggingface/transformers.git@main now corresponds to transformers-4.55.0.dev0 |
Run with the following env var |
I also ran into this when training the InternVL3.5 series
|
What does this PR do?
Reopened PR #7077
May fix #6322 #6432 #6236 #3802
Before submitting
former version
## some demo experiment on `InternVL2_5-1B-MPO-hf`

1. video lora sft

```yaml
### model
model_name_or_path: kingsley01/InternVL3-1B-hf
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mllm_video_demo
# dataset: mllm_demo # text: identity,alpaca_en_demo # video: mllm_video_demo
template: intern_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 4

### output
output_dir: saves/internvl-1b/lora/sft-test-demo-video
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 30.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
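Assuming the config above is saved as a YAML file, e.g. `internvl_lora_sft.yaml` (the filename is just an example), it can be launched with `llamafactory-cli train internvl_lora_sft.yaml`.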
Align with the latest Transformers Now! 😄
NOW we support InternVL2.5-InternVL3 series post-training!
Important
We should use the latest Hugging Face `transformers` code instead of the released version, together with the OpenGVLab/InternVL3-xB-hf checkpoints.
For processor issues, please check your transformers version and the model/processor config.
For now, please install a specific version of the latest `transformers`:

```
pip install git+https://github.com/Kuangdd01/transformers.git@hf-internvl
```

We support direct use of several small-sized checkpoints: InternVL2.5-1/2/4/8B and InternVL3-1/2/8B. Download the InternVL models from Hugging Face or ModelScope.
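After installing that branch and downloading one of the -hf checkpoints, a minimal generation smoke test could look like the sketch below. The model id and image URL are placeholders, and the chat-template call follows the generic transformers multimodal pattern rather than anything mandated by this PR:

```python
import torch
from transformers import AutoProcessor, InternVLForConditionalGeneration

model_id = "OpenGVLab/InternVL3-1B-hf"  # placeholder -hf checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = InternVLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```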