Eval bug: mtmd Qwen2.5VL 7B not seeing an image as expected #13394
Comments
Thanks for testing this! There is one thing that I currently have in mind. As discussed (via DM), Qwen2.5VL considers the whole image as one single token in the temporal direction. However, the original I actually had no good source of doc to verify (other than their paper, which may not be exactly what's going on in the code). So just to be sure, could you re-run the test with this version of mtmd_image_tokens_get_n_pos:
llama_pos mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) {
    if (image_tokens->use_mrope_pos) {
        return std::max(image_tokens->nx, image_tokens->ny); // max(nx, ny) instead of 1
    }
    return image_tokens->n_tokens();
}
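To make the two candidate behaviours easier to compare, here is a minimal sketch of how the position counter would advance after an image under each one. The struct `image_tokens_sketch` and the helpers `n_pos_single`, `n_pos_max` and `next_text_pos` are hypothetical names made up for illustration, not part of mtmd, and the sketch assumes that the token count of an image is nx * ny.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for mtmd_image_tokens; only the fields mentioned
// in the comment above are modelled.
struct image_tokens_sketch {
    uint32_t nx;            // patches along the width
    uint32_t ny;            // patches along the height
    bool     use_mrope_pos; // true for M-RoPE models such as Qwen2.5VL
};

// Assumption: the number of image tokens is nx * ny.
constexpr int32_t n_tokens_sketch(const image_tokens_sketch & img) {
    return static_cast<int32_t>(img.nx * img.ny);
}

// Variant A: the whole image counts as a single temporal position.
constexpr int32_t n_pos_single(const image_tokens_sketch & img) {
    return img.use_mrope_pos ? 1 : n_tokens_sketch(img);
}

// Variant B: the image advances the counter by max(nx, ny).
constexpr int32_t n_pos_max(const image_tokens_sketch & img) {
    return img.use_mrope_pos
        ? static_cast<int32_t>(std::max(img.nx, img.ny))
        : n_tokens_sketch(img);
}

// Position of the first text token after an image that starts at n_past.
constexpr int32_t next_text_pos(int32_t n_past, int32_t n_pos) {
    return n_past + n_pos;
}
```

For a 4x3 image that starts at position 10, variant A would place the next text token at 11 and variant B at 14, so the two only differ in how far the text after the image is shifted.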
Thanks for pointing this out. I'm not seeing this make a difference when I test (I get the same result). I'm noticing that the first element of pos is 0 in the batch embd view when the image eval decode occurs here:
so I'm thinking there may be an issue with the
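As a rough illustration of why a leading 0 in pos looks suspicious, the sketch below builds a position buffer for an image chunk that starts at a given offset. The function name `mrope_image_pos_sketch` is hypothetical, and the four-section layout (temporal, height, width, unused) is an assumption about how M-RoPE positions are arranged, not something confirmed in this thread.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: builds a 4-section position buffer for an
// nx-by-ny image chunk whose first position is pos_0.
// Assumed layout: [temporal | height | width | unused], each n entries long.
std::vector<int32_t> mrope_image_pos_sketch(int32_t pos_0, int nx, int ny) {
    const int n = nx * ny;
    std::vector<int32_t> pos(4 * n, 0);
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            const int i = y * nx + x;
            pos[i]         = pos_0;     // temporal: same for the whole image
            pos[i + n]     = pos_0 + y; // height: grows with the row
            pos[i + 2 * n] = pos_0 + x; // width: grows with the column
            pos[i + 3 * n] = 0;         // unused section stays zero
        }
    }
    return pos;
}
```

Under this assumed layout, an image that follows some text would have a first element equal to the number of positions already consumed, so a first element of 0 would only be expected when the image is the very first thing in the context.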
Qwen2.5VL 7B crashes on an Android phone, while llava inference with Gemma3-4B (https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF/tree/main) works perfectly on an Android phone equipped with a Qualcomm Snapdragon 8 Elite. By the way, thanks so much to ngxson for bringing the amazing llava inference feature and a uniform llava inference framework to llama.cpp.
@mattjcly ha ok thanks for the clue, turns out, in I'm cleaning up the whole thing now, will push a fix very soon!
Unbelievably quick fix!
Name and Version
llama-mtmd-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
version: 5317 (f05a6d7)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0
Operating systems
Mac
GGML backends
Metal
Hardware
MacBook M3 Pro 36GB
Models
Qwen2.5-VL-7B-Instruct (Q4_K_M)
Problem description & steps to reproduce
I noticed some situations where Qwen 2.5VL appears to "miss" or partially "miss" the image in certain prompts.
To create this situation in mtmd-cli.cpp without any fuss, I hardcoded the following for the formatted prompt:
and then ran the following command:
And got a response in which the model claims that no image has been provided:
Gemma 3 4B (https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF) does not have this issue, and responds with:
Image:

First Bad Commit
No response
Relevant log output