
clip : refactor graph builder #13321

Merged
merged 13 commits into ggml-org:master on May 6, 2025

Conversation

@ngxson (Collaborator) commented May 5, 2025

Motivation:

In this PR:

  • Added struct clip_graph, which contains the various graph builders, for example clip_graph::build_llava() (see the sketch below this list)
  • Added clip_graph::build_attn and reused it in the arch-specific graphs --> we can experiment with flash-attention support in a follow-up PR
  • Before this PR, qwen2vl used build_llama while qwen2.5vl used another graph --> now both qwen2vl and qwen2.5vl use clip_graph::build_qwen2vl()
  • Dedicated build function for minicpm-v
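
To make the shape of the refactor concrete, here is a minimal sketch of what such a struct could look like (member names and types here are illustrative assumptions, not the actual clip.cpp declarations):

```cpp
// Hypothetical sketch of a clip_graph struct: shared building blocks plus one
// graph-builder entry point per architecture family. Forward declarations
// stand in for the real ggml / clip types.
struct ggml_tensor;
struct ggml_cgraph;
struct ggml_context;
struct clip_ctx;

struct clip_graph {
    const clip_ctx * ctx;   // model weights and hyperparameters
    ggml_context   * ctx0;  // context the graph nodes are allocated in
    ggml_cgraph    * gf;    // the compute graph being built

    // commonly used components, reused by every architecture
    ggml_tensor * build_attn(ggml_tensor * q, ggml_tensor * k, ggml_tensor * v);
    ggml_tensor * build_ffn (ggml_tensor * cur);
    ggml_tensor * build_norm(ggml_tensor * cur, ggml_tensor * w, ggml_tensor * b);

    // one entry point per architecture
    ggml_cgraph * build_llava();    // llava-style projectors
    ggml_cgraph * build_qwen2vl();  // qwen2vl and qwen2.5vl (both use M-RoPE)
    ggml_cgraph * build_minicpmv(); // dedicated minicpm-v builder
};
```

Centralizing build_attn like this is what makes it possible to swap in a flash-attention path later without touching every arch-specific graph.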

TODO:

  • move the graph builder into a struct
  • implement commonly used components like FFN, attn, norm
  • make a dedicated builder for minicpmv
  • add hook and debugging functions (only the placeholder is added for now)
  • remove load_image_size, a non-thread-safe hack introduced by minicpmv --> not doing it here; this will be a dedicated PR

Test:

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M

@ngxson ngxson changed the title from "mtmd : refactor graph builder" to "clip : refactor graph builder" May 6, 2025
@ngxson ngxson marked this pull request as ready for review May 6, 2025 12:28
@ngxson ngxson requested review from ggerganov and slaren May 6, 2025 12:28
@ngxson (Collaborator, Author) commented May 6, 2025

Thanks @mattjcly from LM Studio for testing this PR; we confirmed that it produces exactly the same results as the master version.

I'm merging this PR once the CI is green.

@github-actions bot added the python (python script changes) label May 6, 2025
@lcarrere (Contributor) commented May 6, 2025

I benchmarked a few VL models, and they all perform well except for Qwen-VL v2 (2B and 8B). Currently, all accuracy scores are zero for both classification and extraction tasks.

I tested with llama-mtmd-cli on this branch and can reproduce the problem on Windows using the CUDA backend. It works fine on CPU.

To reproduce, simply use the attached image with the prompt “Give me the number.”

Every run with the new CLIP yields an incorrect result (for example, “The number is 3.”). With the previous version, it worked correctly 100% of the time.

List of tested models:

  • gemma-3-4b
  • gemma-3-12b
  • minicpm-o-2.6
  • minicpm-v-2.6
  • qwen2-vl-2b
  • qwen2.5-vl-3b
  • qwen2-vl-8.3b
  • qwen2.5-vl-7b
  • pixtral-12b

Let me know if you need any further information or testing from my end.

[attached test image showing the number 150]

@ngxson (Collaborator, Author) commented May 6, 2025

@lcarrere OK, thanks for testing. So if I understand correctly, you're getting this problem only with Qwen2 VL, while the other models run correctly, right?

@ngxson (Collaborator, Author) commented May 6, 2025

I tested on my side, but it always gives me the correct answer:

llama-mtmd-cli -m ../models/Qwen2-VL-7B-Instruct/model.gguf --mmproj ../models/Qwen2-VL-7B-Instruct/mmproj-model.gguf --image ../models/150.png -p "Give me the number."

The number is 150.

Running on Metal backend, mac M3 Max

(Same result with --no-mmproj-offload)

Which commit or build number on the master branch are you using?

@ngxson (Collaborator, Author) commented May 6, 2025

Just to be sure, I updated this PR to the latest master; feel free to give it a try.

@lcarrere (Contributor) commented May 6, 2025

I tested with master plus your new clip.cpp.
The problem occurs only on Qwen 2 VL (all sizes); Qwen 2.5 VL is OK.
It only occurs on CUDA; CPU (AVX2) was OK.
I just left the office and will be able to do more tests in 30 minutes using different Nvidia cards and the Metal backend on an M3.

@ngxson (Collaborator, Author) commented May 6, 2025

OK, hopefully 56b41af resolves the problem. I forgot that only Qwen 2.5 uses RMS norm; the older version uses the standard LayerNorm.
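
For readers following the fix, here is a minimal sketch of the distinction (the helper names are assumptions, not the PR's actual functions; only the ggml_norm / ggml_rms_norm ops are real):

```cpp
// LayerNorm subtracts the mean and divides by the standard deviation, and has
// a bias term; RMS norm only divides by the root-mean-square and has no bias.
#include "ggml.h"

static ggml_tensor * layer_norm_sketch(ggml_context * ctx0, ggml_tensor * cur,
                                       ggml_tensor * w, ggml_tensor * b, float eps) {
    cur = ggml_norm(ctx0, cur, eps); // (x - mean) / sqrt(var + eps)
    cur = ggml_mul (ctx0, cur, w);   // scale
    cur = ggml_add (ctx0, cur, b);   // shift
    return cur;
}

static ggml_tensor * rms_norm_sketch(ggml_context * ctx0, ggml_tensor * cur,
                                     ggml_tensor * w, float eps) {
    cur = ggml_rms_norm(ctx0, cur, eps); // x / sqrt(mean(x^2) + eps)
    cur = ggml_mul     (ctx0, cur, w);   // scale only, no bias
    return cur;
}
```

Both variants produce a valid graph, so mixing them up shows up only as degraded output rather than an error.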

@lcarrere (Contributor) commented May 6, 2025

I believe you got it! llama.cpp is OK and all my unit tests are now green on all backends. Thanks a lot @ngxson
I'm now running a benchmark on a larger dataset. I'll let you know how it goes in about 10 minutes.

@lcarrere (Contributor) commented May 6, 2025

OK, so everything I'm testing directly with llama.cpp is working on CUDA and the other backends.
My integration tests are still returning 0% success for Qwen 2 VL while 2.5 works perfectly, but that's a fairly different code base now. I would say llama.cpp is safe. I will continue digging into my codebase to understand what is going wrong, and if I find a problem in llama.cpp I will ping you.

@lcarrere (Contributor) commented May 6, 2025

OK, me again. @ngxson, please bear with me. I thought master was merged into this PR, but it actually isn't; it's only on the fork.
So basically, unless I am missing something, this PR's branch is OK, but if we merge it onto master we do get the CUDA inference problem. I'm now trying to identify which commit on master is conflicting with this clip version. This may take some time, as compilation takes a long while.

In other words:
git fetch origin pull/13321/head:pr-13321
git checkout pr-13321

-> OK

git checkout master
copy the new clip.cpp to tools/mtmd/clip.cpp

-> KO

@ngxson (Collaborator, Author) commented May 6, 2025

OK, thanks for testing. No problem, I'll investigate this more.

This is kind of expected: before this PR, qwen2vl used the same graph builder as llava, while in this PR I merged qwen2vl and qwen2.5vl into one builder since they both use M-RoPE. Probably just missing some nodes.
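
As a purely conceptual illustration of the M-RoPE idea mentioned above (standalone code, not the ggml implementation and not the PR's code): instead of one scalar position per token, the rotary dimensions are split into sections and each section is rotated with a different position component, e.g. temporal, height and width for vision tokens.

```cpp
#include <cmath>
#include <vector>

// x        : one attention head's vector (size d)
// pos      : one position value per section (e.g. {t, h, w})
// sections : number of rotary pairs assigned to each section; must sum to d/2
static void mrope_sketch(std::vector<float> & x,
                         const std::vector<int> & pos,
                         const std::vector<int> & sections,
                         float freq_base = 10000.0f) {
    const size_t d    = x.size();
    const size_t half = d / 2;
    size_t i = 0; // index of the rotary pair (x[i], x[i + half])
    for (size_t s = 0; s < sections.size(); ++s) {
        for (int k = 0; k < sections[s]; ++k, ++i) {
            const float theta = pos[s] * std::pow(freq_base, -2.0f * (float) i / (float) d);
            const float x0 = x[i];
            const float x1 = x[i + half];
            x[i]        = x0 * std::cos(theta) - x1 * std::sin(theta);
            x[i + half] = x0 * std::sin(theta) + x1 * std::cos(theta);
        }
    }
}
```

Since both qwen2vl and qwen2.5vl position their vision tokens this way, sharing a single build_qwen2vl() graph is the natural split.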

@ngxson (Collaborator, Author) commented May 6, 2025

OK, seems like I missed one spot; hopefully this fixes the problem: 37e24b1

@lcarrere (Contributor) commented May 6, 2025

Thanks, I just hope I'm not wasting your time...

I dug a little further. I've updated everything from the master branch (#13344), and I set the temperature to 0 to get greedy sampling, since I had observed noticeably higher perplexity with the new clip and my code base, which was more pronounced with smaller models and was leading to random conclusions...

See the screenshot:

  • on the left: master + new clip.
  • on the right: master.

tested model: qwen 2VL 2B.
tested app: llama-mtmd-cli, with temperature set to 0.

[screenshot: side-by-side comparison of the two outputs]

I hope this helps!
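
For reference, a greedy run of this kind can be reproduced with something along these lines (the model file names here are placeholders, not the exact paths used):

llama-mtmd-cli -m qwen2-vl-2b.gguf --mmproj mmproj-qwen2-vl-2b.gguf --image 150.png -p "Give me the number." --temp 0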

@lcarrere (Contributor) commented May 6, 2025

OK, seems like I missed one spot; hopefully this fixes the problem: 37e24b1

on it...

@lcarrere (Contributor) commented May 6, 2025

Okay, this seems to work!!
...at least with this latest test. I'm running some other ones; I'll keep you posted in probably 20 minutes.

Thanks again Xuan

@lcarrere (Contributor) commented May 6, 2025

OK, all regression tests pass: same accuracy, same speed. Thanks again for your hard work!!

@ngxson (Collaborator, Author) commented May 6, 2025

Perfect, thanks again for testing!

@ngxson ngxson merged commit 32916a4 into ggml-org:master May 6, 2025
48 checks passed
@ngxson (Collaborator, Author) commented May 6, 2025

Btw @lcarrere, if you are interested, the integration of the vision API in llama-server is pretty much functional now. Feel free to give it a try if you have time!

#12898

@lcarrere (Contributor) commented May 6, 2025

Okay, on my to-do list.
