
clip : refactor graph builder #13321

Merged
merged 13 commits into ggml-org:master on May 6, 2025

Conversation

@ngxson (Collaborator) commented May 5, 2025

Motivation:

In this PR:

  • Added struct clip_graph, which contains the various graph builders, for example clip_graph::build_llava() (see the sketch below this list)
  • Added clip_graph::build_attn and reused it in the arch-specific graphs --> we can experiment with flash-attention support in a follow-up PR
  • Before this PR, qwen2vl used build_llama while qwen2.5vl used another graph --> now both qwen2vl and qwen2.5vl use clip_graph::build_qwen2vl()
  • Dedicated build function for minicpm-v
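
To make the shape of the refactor concrete, here is a minimal sketch of what such a struct could look like (member names and types here are illustrative assumptions, not the actual clip.cpp declarations):

```cpp
// Hypothetical sketch of a clip_graph struct: shared building blocks plus one
// graph-builder entry point per architecture family. Forward declarations
// stand in for the real ggml / clip types.
struct ggml_tensor;
struct ggml_cgraph;
struct ggml_context;
struct clip_ctx;

struct clip_graph {
    const clip_ctx * ctx;   // model weights and hyperparameters
    ggml_context   * ctx0;  // context the graph nodes are allocated in
    ggml_cgraph    * gf;    // the compute graph being built

    // commonly used components, reused by every architecture
    ggml_tensor * build_attn(ggml_tensor * q, ggml_tensor * k, ggml_tensor * v);
    ggml_tensor * build_ffn (ggml_tensor * cur);
    ggml_tensor * build_norm(ggml_tensor * cur, ggml_tensor * w, ggml_tensor * b);

    // one entry point per architecture
    ggml_cgraph * build_llava();    // llava-style projectors
    ggml_cgraph * build_qwen2vl();  // qwen2vl and qwen2.5vl (both use M-RoPE)
    ggml_cgraph * build_minicpmv(); // dedicated minicpm-v builder
};
```

Centralizing build_attn like this is what makes it possible to swap in a flash-attention path later without touching every arch-specific graph.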

TODO:

  • move the graph builder into a struct
  • implement commonly used components like FFN, attn, norm
  • make a dedicated builder for minicpmv
  • add hook and debugging functions (only the placeholder is added for now)
  • remove load_image_size, a non-thread-safe hack introduced by minicpmv --> not doing it here; this will be a dedicated PR

Test:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M

@ngxson ngxson changed the title from "mtmd : refactor graph builder" to "clip : refactor graph builder" May 6, 2025
@ngxson ngxson marked this pull request as ready for review May 6, 2025 12:28
@ngxson ngxson requested review from ggerganov and slaren May 6, 2025 12:28
@ngxson (Collaborator, Author) commented May 6, 2025

Thanks @mattjcly from LM Studio for testing this PR; we confirmed that it produces exactly the same results as the master version.

I'm merging this PR once the CI is green.

@github-actions bot added the python (python script changes) label May 6, 2025
@lcarrere (Contributor) commented May 6, 2025

I benchmarked a few VL models, and they all perform well except for Qwen-VL v2 (2B and 8B). Currently, all accuracy scores are zero for both classification and extraction tasks.

I tested with llama-mtmd-cli on this branch and can reproduce the problem on Windows using the CUDA backend. It works fine on CPU.

To reproduce, simply use the attached image with the prompt “Give me the number.”

Every run with the new CLIP yields an incorrect result (for example, “The number is 3.”). With the previous version, it worked correctly 100% of the time.

List of tested models:

  • gemma-3-4b
  • gemma-3-12b
  • minicpm-o-2.6
  • minicpm-v-2.6
  • qwen2-vl-2b
  • qwen2.5-vl-3b
  • qwen2-vl-8.3b
  • qwen2.5-vl-7b
  • pixtral-12b

Let me know if you need any further information or testing from my end.

[attached test image showing the number 150]

@ngxson (Collaborator, Author) commented May 6, 2025

@lcarrere OK, thanks for testing. So if I understand correctly, you're getting this problem only with Qwen2 VL, while the other models run correctly, right?

@ngxson (Collaborator, Author) commented May 6, 2025

I tested on my side, but it always gives me the correct answer:

llama-mtmd-cli -m ../models/Qwen2-VL-7B-Instruct/model.gguf --mmproj ../models/Qwen2-VL-7B-Instruct/mmproj-model.gguf --image ../models/150.png -p "Give me the number."

The number is 150.

Running on Metal backend, mac M3 Max

(Same result with --no-mmproj-offload)

Which commit or build number on the master branch are you using?

@ngxson (Collaborator, Author) commented May 6, 2025

Just to be sure, I updated this PR to the latest master; feel free to give it a try.

@lcarrere (Contributor) commented May 6, 2025

I tested with master plus your new clip.cpp.
The problem occurs only on Qwen 2 VL (all sizes); Qwen 2.5 VL is OK.
It only occurs on CUDA; CPU (AVX2) was OK.
I just left the office and will be able to do more tests in 30 minutes using different Nvidia cards and the Metal backend on an M3.

@ngxson (Collaborator, Author) commented May 6, 2025

OK, hopefully 56b41af resolves the problem. I forgot that only Qwen 2.5 uses RMS norm; the older version uses the standard LayerNorm.
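
For readers following the fix, here is a minimal sketch of the distinction (the helper names are assumptions, not the PR's actual functions; only the ggml_norm / ggml_rms_norm ops are real):

```cpp
// LayerNorm subtracts the mean and divides by the standard deviation, and has
// a bias term; RMS norm only divides by the root-mean-square and has no bias.
#include "ggml.h"

static ggml_tensor * layer_norm_sketch(ggml_context * ctx0, ggml_tensor * cur,
                                       ggml_tensor * w, ggml_tensor * b, float eps) {
    cur = ggml_norm(ctx0, cur, eps); // (x - mean) / sqrt(var + eps)
    cur = ggml_mul (ctx0, cur, w);   // scale
    cur = ggml_add (ctx0, cur, b);   // shift
    return cur;
}

static ggml_tensor * rms_norm_sketch(ggml_context * ctx0, ggml_tensor * cur,
                                     ggml_tensor * w, float eps) {
    cur = ggml_rms_norm(ctx0, cur, eps); // x / sqrt(mean(x^2) + eps)
    cur = ggml_mul     (ctx0, cur, w);   // scale only, no bias
    return cur;
}
```

Both variants produce a valid graph, so mixing them up shows up only as degraded output rather than an error.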

@lcarrere (Contributor) commented May 6, 2025

I believe you got it! llama.cpp is OK and all my unit tests are now green on all backends. Thanks a lot @ngxson
I'm now running a benchmark on a larger dataset. I'll let you know how it goes in about 10 minutes.

@lcarrere (Contributor) commented May 6, 2025

OK, so everything I'm testing directly with llama.cpp is working on CUDA and the other backends.
My integration tests are still returning 0% success for Qwen 2 VL while 2.5 works perfectly, but that's a fairly different code base now. I would say llama.cpp is safe. I will continue digging into my codebase to understand what is going wrong, and if I find a problem in llama.cpp I will ping you.

@lcarrere (Contributor) commented May 6, 2025

OK, me again. @ngxson, please bear with me. I thought master was merged into this PR, but it actually isn't; it's only on the fork.
So basically, unless I am missing something, this PR's branch is OK, but if we merge it onto master we do get the CUDA inference problem. I'm now trying to identify which commit on master is conflicting with this clip version. This may take some time, as compilation takes a long while.

In other words:
git fetch origin pull/13321/head:pr-13321
git checkout pr-13321

-> OK

git checkout master
copy the new clip.cpp to tools/mtmd/clip.cpp

-> KO

@ngxson (Collaborator, Author) commented May 6, 2025

OK, thanks for testing. No problem, I'll investigate this more.

This is kind of expected: before this PR, qwen2vl used the same graph builder as llava, while in this PR I merged qwen2vl and qwen2.5vl into one builder since they both use M-RoPE. Probably just missing some nodes.
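
As a purely conceptual illustration of the M-RoPE idea mentioned above (standalone code, not the ggml implementation and not the PR's code): instead of one scalar position per token, the rotary dimensions are split into sections and each section is rotated with a different position component, e.g. temporal, height and width for vision tokens.

```cpp
#include <cmath>
#include <vector>

// x        : one attention head's vector (size d)
// pos      : one position value per section (e.g. {t, h, w})
// sections : number of rotary pairs assigned to each section; must sum to d/2
static void mrope_sketch(std::vector<float> & x,
                         const std::vector<int> & pos,
                         const std::vector<int> & sections,
                         float freq_base = 10000.0f) {
    const size_t d    = x.size();
    const size_t half = d / 2;
    size_t i = 0; // index of the rotary pair (x[i], x[i + half])
    for (size_t s = 0; s < sections.size(); ++s) {
        for (int k = 0; k < sections[s]; ++k, ++i) {
            const float theta = pos[s] * std::pow(freq_base, -2.0f * (float) i / (float) d);
            const float x0 = x[i];
            const float x1 = x[i + half];
            x[i]        = x0 * std::cos(theta) - x1 * std::sin(theta);
            x[i + half] = x0 * std::sin(theta) + x1 * std::cos(theta);
        }
    }
}
```

Since both qwen2vl and qwen2.5vl position their vision tokens this way, sharing a single build_qwen2vl() graph is the natural split.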

@ngxson (Collaborator, Author) commented May 6, 2025

OK, seems like I missed one spot; hopefully this fixes the problem: 37e24b1

@lcarrere (Contributor) commented May 6, 2025

Thanks, I just hope I'm not wasting your time...

I dug a little further. I've updated everything from the master branch (#13344), and I set the temperature to 0 to get greedy sampling, since I had observed noticeably higher perplexity with the new clip and my code base, which was more pronounced with smaller models and was leading to random conclusions...

See the screenshot:

  • on the left: master + new clip.
  • on the right: master.

tested model: qwen 2VL 2B.
tested app: llama-mtmd-cli, with temperature set to 0.

[screenshot: side-by-side comparison of the two outputs]

I hope this helps!
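
For reference, a greedy run of this kind can be reproduced with something along these lines (the model file names here are placeholders, not the exact paths used):

llama-mtmd-cli -m qwen2-vl-2b.gguf --mmproj mmproj-qwen2-vl-2b.gguf --image 150.png -p "Give me the number." --temp 0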

@lcarrere (Contributor) commented May 6, 2025

OK, seems like I missed one spot; hopefully this fixes the problem: 37e24b1

on it...

@lcarrere (Contributor) commented May 6, 2025

Okay, this seems to work!!
...at least with this latest test. I'm running some other ones; I'll keep you posted in probably 20 minutes.

Thanks again Xuan

@lcarrere (Contributor) commented May 6, 2025

OK, all regression tests pass: same accuracy, same speed. Thanks again for your hard work!!

@ngxson (Collaborator, Author) commented May 6, 2025

Perfect, thanks again for testing!

@ngxson ngxson merged commit 32916a4 into ggml-org:master May 6, 2025
48 checks passed
@ngxson (Collaborator, Author) commented May 6, 2025

Btw @lcarrere, if you are interested, the integration of the vision API in llama-server is pretty much functional now. Feel free to give it a try if you have time!

#12898

@lcarrere (Contributor) commented May 6, 2025

Okay, on my to-do list.
