Commit ca2bb89

HimariO and ngxson authored

clip : Add Qwen2.5VL support (#12402)

* implement vision model architecture, gguf convertor
* handle window attention inputs
* add debug utils
* fix a few incorrect tensor memory layouts
* move position id remap out of ggml to avoid int32 cuda operations
* cleaning up
* ignore transformers Qwen2_5_xxx type check
* remove rarely used `qwen2vl-cli` debug functions
* remove commented-out code blocks
* fix attn weight scaling after rebase
* add `PROJECTOR_TYPE_QWEN2_5_VL`
* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`
* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`
* remove `attn_window_size` from gguf
* fix model conversion
* clean up
* fix merging problem
* add test

Co-authored-by: Xuan Son Nguyen <[email protected]>
1 parent 2d451c8 commit ca2bb89
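The `KEY_WIN_ATTN_PATTERN` key introduced by this commit (`clip.vision.n_wa_pattern`) records how full-attention blocks are interleaved with window-attention blocks in the vision tower. A minimal Python sketch of one plausible interpretation (every `n_wa_pattern`-th block uses full attention, matching Qwen2.5-VL's four full-attention layers out of 32; the helper name is hypothetical, not from the commit):

```python
def full_attention_blocks(n_blocks: int, n_wa_pattern: int) -> list[int]:
    """Hypothetical helper: block indices that use full (non-windowed)
    attention when every n_wa_pattern-th block escapes the window."""
    if n_wa_pattern <= 0:
        # no pattern stored -> treat every block as full attention
        return list(range(n_blocks))
    return [i for i in range(n_blocks) if (i + 1) % n_wa_pattern == 0]

# Qwen2.5-VL-style config: 32 blocks, full attention on every 8th block
print(full_attention_blocks(32, 8))  # -> [7, 15, 23, 31]
```

Storing a repeat pattern instead of an explicit block-index list (the earlier `KEY_FULLATTN_BLK_IDX` approach mentioned in the commit message) keeps the GGUF metadata a single integer.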

File tree

6 files changed: +597 −105 lines changed

convert_hf_to_gguf.py

Lines changed: 6 additions & 5 deletions
```diff
@@ -2554,11 +2554,12 @@ def set_vocab(self):
         except FileNotFoundError:
             self._set_vocab_gpt2()
 
-    def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
-        for name, data in super().get_tensors():
-            if name.startswith("visual."):
-                continue
-            yield name, data
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+        del bid  # unused
+        if name.startswith("visual."):
+            # skip visual tensors
+            return []
+        return [(self.map_tensor_name(name), data_torch)]
 
 
 @ModelBase.register("WavTokenizerDec")
```

examples/llava/clip-impl.h

Lines changed: 8 additions & 0 deletions
```diff
@@ -34,9 +34,14 @@
 #define KEY_PROJ_SCALE_FACTOR "clip.vision.projector.scale_factor"
 #define KEY_PROJ_TYPE         "clip.projector_type"
 
+#define KEY_USE_GLU_MLP       "clip.use_glu_mlp"  // for qwen2.5vl
+#define KEY_USE_RMS_NORM      "clip.use_rms_norm" // for qwen2.5vl
+
 #define KEY_MM_PATCH_MERGE_TYPE    "clip.vision.mm_patch_merge_type"
 #define KEY_IMAGE_GRID_PINPOINTS   "clip.vision.image_grid_pinpoints"
 #define KEY_IMAGE_CROP_RESOLUTION  "clip.vision.image_crop_resolution"
+#define KEY_WIN_ATTN_PATTERN       "clip.vision.n_wa_pattern"
+#define KEY_ATTN_WINDOW_SIZE       "clip.vision.window_size"
 
 
 //
@@ -55,6 +60,7 @@
 #define TN_FFN_DOWN  "%s.blk.%d.ffn_down.%s"
 #define TN_FFN_GATE  "%s.blk.%d.ffn_gate.%s"
 #define TN_FFN_UP    "%s.blk.%d.ffn_up.%s"
+#define TN_FFN_GATE  "%s.blk.%d.ffn_gate.%s"
 #define TN_LN_1      "%s.blk.%d.ln1.%s"
 #define TN_LN_2      "%s.blk.%d.ln2.%s"
 #define TN_LN_PRE    "%s.pre_ln.%s"
@@ -95,6 +101,7 @@ enum projector_type {
     PROJECTOR_TYPE_GEMMA3,
     PROJECTOR_TYPE_IDEFICS3,
     PROJECTOR_TYPE_PIXTRAL,
+    PROJECTOR_TYPE_QWEN25VL,
     PROJECTOR_TYPE_UNKNOWN,
 };
 
@@ -105,6 +112,7 @@ static std::map<projector_type, std::string> PROJECTOR_TYPE_NAMES = {
     { PROJECTOR_TYPE_MINICPMV, "resampler"},
     { PROJECTOR_TYPE_GLM_EDGE, "adapter"},
     { PROJECTOR_TYPE_QWEN2VL,  "qwen2vl_merger"},
+    { PROJECTOR_TYPE_QWEN25VL, "qwen2.5vl_merger"},
     { PROJECTOR_TYPE_GEMMA3,   "gemma3"},
     { PROJECTOR_TYPE_IDEFICS3, "idefics3"},
     { PROJECTOR_TYPE_PIXTRAL,  "pixtral"},
```
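`PROJECTOR_TYPE_NAMES` maps each projector enum value to the string stored under the GGUF `clip.projector_type` key; at load time the lookup runs the other way, from the stored string back to the enum. A Python sketch of that reverse lookup (the names mirror the C++ table above; `projector_from_name` is a hypothetical helper, not code from the commit):

```python
from enum import Enum, auto

class ProjectorType(Enum):
    QWEN2VL = auto()
    QWEN25VL = auto()
    GEMMA3 = auto()
    UNKNOWN = auto()

# subset of the table in clip-impl.h, enum -> GGUF string
PROJECTOR_TYPE_NAMES = {
    ProjectorType.QWEN2VL:  "qwen2vl_merger",
    ProjectorType.QWEN25VL: "qwen2.5vl_merger",
    ProjectorType.GEMMA3:   "gemma3",
}

def projector_from_name(name: str) -> ProjectorType:
    """Hypothetical reverse lookup: GGUF string -> projector enum."""
    for ptype, pname in PROJECTOR_TYPE_NAMES.items():
        if pname == name:
            return ptype
    return ProjectorType.UNKNOWN  # unrecognized projector string

print(projector_from_name("qwen2.5vl_merger").name)  # -> QWEN25VL
```

Giving Qwen2.5VL its own projector string lets existing `qwen2vl_merger` GGUF files keep their old loading path while new conversions opt into the window-attention code.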
