
Feature Request: Qwen 2.5 VL #11483


Open
4 tasks done
bold84 opened this issue Jan 29, 2025 · 72 comments
Labels
enhancement New feature or request

Comments

@bold84

bold84 commented Jan 29, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Is anybody implementing this?

If not, I may give it a go. But it will take some time as I am new to the source side of llama.cpp/ggml.

Motivation

Well, it's not currently working. :-)

Possible Implementation

Based on the existing Qwen 2 VL implementation.

bold84 added the enhancement (New feature or request) label on Jan 29, 2025
@HimariO
Contributor

HimariO commented Jan 29, 2025

I'm currently looking into Transformers' Qwen2.5VL implementation and waiting for the paper to drop so I can better assess the differences between Qwen2VL and Qwen2.5VL. 👀

@3unnycheung

cool

@samkoesnadi
Contributor

I support this!

@Shyryp

Shyryp commented Feb 2, 2025

Our world definitely needs this!

@peter-ch

Any progress on this? Who added support for Qwen 2 VL?

@pszemraj

pszemraj commented Feb 20, 2025

qwen2.5-vl report is up! https://huggingface.co/papers/2502.13923

edit: official codebase here: https://github.com/QwenLM/Qwen2.5-VL

@vladislavdonchev

I can start working on this if no one else is already.

@vladislavdonchev

vladislavdonchev commented Feb 22, 2025

OK then!

First order of business would be to build the GGUF file(s). It seems there is an issue with that on the latest official Transformers:

python convert_hf_to_gguf.py .\build\bin\Release\Qwen2.5-VL-7B-Instruct\
INFO:hf-to-gguf:Loading model: Qwen2.5-VL-7B-Instruct
ERROR:hf-to-gguf:Model Qwen2_5_VLForConditionalGeneration is not supported

This is being actively discussed upstream:
huggingface/transformers#36292
QwenLM/Qwen2.5-VL#723

It appears a temporary workaround is to use the old Qwen2 templates. People are reporting that this works, so I'll post an update in a bit.

@vladislavdonchev

vladislavdonchev commented Feb 22, 2025

Right, so this one is a bit of a rabbit hole...

I. Reverting the Qwen2.5 config files to:

"processor_class": "Qwen2VLProcessor"

and

  "architectures": [
    "Qwen2VLForConditionalGeneration"
  ]

Produces a (seemingly) working model! We've started testing and quantizing it here:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF/tree/main
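If anyone wants to script those two edits, here's a rough Python sketch (assuming the usual HF layout, where processor_class lives in preprocessor_config.json and architectures in config.json; the model directory is a placeholder):

import json
from pathlib import Path

model_dir = Path("Qwen2.5-VL-7B-Instruct")  # placeholder: local snapshot of the HF repo

# Point the processor back at the Qwen2-VL class.
pre_path = model_dir / "preprocessor_config.json"
pre_cfg = json.loads(pre_path.read_text())
pre_cfg["processor_class"] = "Qwen2VLProcessor"
pre_path.write_text(json.dumps(pre_cfg, indent=2))

# Swap the architecture name so convert_hf_to_gguf.py picks the Qwen2-VL path.
cfg_path = model_dir / "config.json"
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["Qwen2VLForConditionalGeneration"]
cfg_path.write_text(json.dumps(cfg, indent=2))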

[image]

II. In order to get a usable experience, you need to make sure CLIP is running with hardware acceleration. This currently requires you to revert this commit:
#10896

For more information refer to:
#11322

The following PR seems to correct (at least) some of the issues that led to disabling hardware acceleration in the first place:
#11902

So, it is now up to us to prove that everything is working properly.

I'll start a stress / perf eval test alongside the quantization process, so we have a better idea about what's going on.

@vladislavdonchev

vladislavdonchev commented Feb 23, 2025

UPDATE: A few 4-bit quants have been uploaded, including two that support online auto-repacking.

The latest main looks stable with Vulkan CLIP and any model thrown at it so far. Some preliminary insights:

  • 1200x1200 is the maximum you can encode with 16GB of VRAM. clip.cpp does not seem to support multi-GPU Vulkan yet.
  • A 4060Ti-class GPU delivers 20-30 t/s with the Q8_0 and double that on Q4 @ 16-32K context.
  • Batching (multiple images) in a single cli call seems to be working fine:
    llama-qwen2vl-cli --ctx-size 16000 -n 16000 -m ~/gguf/Qwen2.5-VL-7B-Instruct-Q4_0.gguf --mmproj ~/gguf/mmproj-Qwen2.5-VL-7B-Instruct-f32.gguf --n_gpu_layers 9999 -p "Describe the image in detail. Extract all textual information from it. Output as detailed JSON." -p "Analyze the image." --image ~/Pictures/test_small.png --image ~/Pictures/test_small.png

Output quality looks very promising! We'll release all of the benchmark code when ready, so the process can be streamlined for other models.

@hvico

hvico commented Feb 24, 2025

Hi! Excellent news, thank you very much for this!

I was able to run the model using code from git main on a 4 x Radeon 7900 XTX 24 GB workstation, but with CLIP on CPU. I tried to enable Vulkan acceleration for CLIP by uncommenting the lines in clip.cpp under examples, but in that case I get an OOM. I tried this with the FP16, Q4_K_M and IQ4_XS models. Telling the cli to use just one Vulkan device does not help with the OOM / CLIP GPU issue either.

@vladislavdonchev

vladislavdonchev commented Feb 24, 2025

Hi! Excellent news, thank you very much for this! […]

Hi, could you please confirm what the resolution of your input images is?

EDIT: As per the Qwen2.5 docs:
min_pixels = 256 x 28 x 28 = 200,704
max_pixels = 1280 x 28 x 28 = 1,003,520

A RTFM moment for me...
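If anyone else hits OOM on large inputs, here is a minimal pre-processing sketch with Pillow (the max_pixels value is taken from the docs above; file names are placeholders) that shrinks an image before handing it to the cli:

from PIL import Image  # pip install pillow

MAX_PIXELS = 1280 * 28 * 28  # 1,003,520 pixels, per the Qwen2.5-VL docs

def shrink_to_max_pixels(src: str, dst: str) -> None:
    # Downscale so width * height stays within MAX_PIXELS, preserving aspect ratio.
    img = Image.open(src)
    w, h = img.size
    if w * h > MAX_PIXELS:
        scale = (MAX_PIXELS / (w * h)) ** 0.5
        img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.LANCZOS)
    img.save(dst)

shrink_to_max_pixels("test.png", "test_small.png")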

@hvico

hvico commented Feb 24, 2025

Hi! Excellent news, thank you very much for this! […]

Hi, could you please confirm what the resolution of your input images is? With 24G VRAM, you can expect an OOM with images >1400x1400 pixels, so you need to make sure the files are pre-processed correctly.

Thanks.

My image was 1475x1062. I was able to run inference successfully using a 1077x671 sample, without OOM. Would it be possible to run CLIP and the VL model on separate GPUs? Thanks again.

@zrrraa

zrrraa commented Feb 25, 2025

Right, so this one is a bit of a rabbit hole... […]

Thank you very much for your research and sharing! I would like to ask how to get the mmproj from the Qwen2.5-VL model. The original qwen2_vl_surgery.py used for Qwen2-VL doesn't seem to work. Could you share your method? Thank you very much!

@vladislavdonchev


Get it from our HF:
https://huggingface.co/IAILabs/Qwen2.5-VL-7b-Instruct-GGUF

@ChmHsm

ChmHsm commented Feb 27, 2025

Thank you for the effort, a lot of people really need this.

Any updates on the progress? Will this still take a few days, or is it more like weeks or months?

Thanks a lot again, we appreciate you guys a lot!

@samkoesnadi
Contributor

@vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

@vladislavdonchev

@vladislavdonchev Great work! Have you done the 3B version? I can also do it myself if you provide the conversion script :)

Working on it as we speak, along with a quantization tool:

[image]

https://github.com/Independent-AI-Labs/local-super-agents/tree/feat/additional-output-formats/quantbench

@vladislavdonchev

UPDATE:

Opened a draft PR here: #12119

Long story short, I'll need some help debugging the vision models and llama-qwen2vl-cli as we're unable to produce anything reliably.

In addition, this still isn't resolved:
#11322

I've also asked the Qwen folks for help:
QwenLM/Qwen2.5-VL#869

@ChmHsm

ChmHsm commented Feb 28, 2025

Thanks @vladislavdonchev for the effort and the update.

I took a look at the issue you opened with the Qwen team. Is it only affecting the 3B model? Can we expect progress to continue with the 7B at least?

Thank you!

@vladislavdonchev

vladislavdonchev commented Feb 28, 2025

Thanks @vladislavdonchev for the effort and the update. […]

Unfortunately, we're unable to reliably produce a working vision model from either the 7B or the 3B. I am not sure how the one in the repo was exported, but it seems to be working, so it's either some weird coincidence or a mistake. I've verified the LM part, including the quants, and it appears to match what you'd expect from Qwen2.5 (the parameters in the .gguf look correct, responses are OK).
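For anyone who wants to eyeball those parameters themselves, here is a rough sketch using the gguf Python package from the llama.cpp tree (gguf-py); the file path is a placeholder:

from gguf import GGUFReader  # pip install gguf, or use gguf-py from the llama.cpp repo

reader = GGUFReader("Qwen2.5-VL-7B-Instruct-Q8_0.gguf")  # placeholder path

# List the metadata keys (qwen2vl.* hyperparameters, tokenizer settings, etc.).
for key in reader.fields:
    print(key)

# Spot-check a few tensor names, shapes and quant types.
for tensor in reader.tensors[:10]:
    print(tensor.name, tensor.shape, tensor.tensor_type.name)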

@David33706

Right, so this one is a bit of a rabbit hole... […]

I am getting the following error while trying to use Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf on Apple Silicon:

./llama-qwen2vl-cli -m "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --mmproj "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --n_gpu_layers 0 --image "wilma-7_oval.jpg" --image "wilma-7_oval.jpg" -p "Describe the image."

key general.description not found in file
libc++abi: terminating due to uncaught exception of type std::runtime_error: Missing required key: general.description
zsh: abort      ./llama-qwen2vl-cli -m  --mmproj  --n_gpu_layers 0 --image  --image  -p

Could somebody please help out?

@tomjpalamattam

I am getting the following error while trying to use Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf on Apple Silicon: […]

Did you figure this out?

@David33706


Did you figure this out?

Nope

@vladislavdonchev

vladislavdonchev commented Mar 3, 2025

Please stop spamming this thread. Qwen2.5 is still a WIP!

Regarding the issue above:
./llama-qwen2vl-cli -m "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --mmproj "Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" --n_gpu_layers 0 --image "wilma-7_oval.jpg" --image "wilma-7_oval.jpg" -p "Describe the image."
You cannot use the language model GGUF as the vision model: in your command, -m and --mmproj point to the same file. --mmproj needs the separate vision projector (mmproj) GGUF.

Please wait until the implementation has been finalized.

@CKwasd

CKwasd commented Mar 25, 2025

Works great with green-s's Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf and HimariO's llama.cpp qwen25-vl branches.

@iqddd

iqddd commented Mar 25, 2025

They just dropped the 32B VL version.

@HimariO
Contributor

HimariO commented Mar 25, 2025

I just ran a simple test using the 32B variant of the model with a smaller sample image (500x300 pixels, to be specific). It still took around 20 minutes to generate a single caption on my setup with CPU backend, but the result looked pretty decent.

I've uploaded the GGUF files to the Hugging Face Hub so that others with better hardware can give it a try.

[image]

Output: The image shows a serene and heartwrenching scene of a young woman sitting on a sandy beach, enjoying a moment of connection with her dog. Here are the details:
1. Location: The setting is a sandy beach with gentle, rolling surf in the background. The beach is calm, and the sand appears smooth, indicating a tranquil environment.
2. Time of Day: The warm, golden light suggests that it is either sunrise or, more likely, late afternoon, as the warm hues are consistent with the golden hour before the sun sets.
3. The Person: A young woman is sitting on the sand with her back slightly angled toward the camera. She has dark, long hair that is pulled back, and she is wearing a plaid flannel top and dark-colored bottoms. She has a calm, content expression on her face, looking at her dog.
4. The Animal: A large, light-colored Labrador Retriever is sitting close to the woman. The dog has a blue collar and is looking attentively at the woman. The dog’s body language appears calm and friendly, and it seems to be enjoying the moment.
5. Interactions: The woman is playfully reaching out toward the dog, as if offering or receiving something
llama-qwen2vl-cli -m qwen25-vl-32b-instruct-vision-00001-of-00002.gguf --mmproj qwen-qwen2.5-vl-32b-instruct-vision.gguf -p "Describe this image." --image demo_small.jpg --threads 24

@green-s

green-s commented Mar 25, 2025

@HimariO The llama-llava-clip-quantize-cli command doesn't seem to be working with the vision ggufs (I get no output and it just immediately exits) and that prevents the 32B at 4bit from being able to easily fit on one 24GB GPU. Any chance you could fix that?

@panda44312

Also, consider supporting Qwen2.5-Omni?

@jfernandrezj

What is the right way of running a batch of 4 images? When I include several --image arguments it just seems to run them sequentially.

@jfernandrezj

UPDATE: A few 4-bit quants have been uploaded, including two that support online auto-repacking. […]

@vladislavdonchev when you say batching, it does not really batch, right? It seems to load the model and run inference for each image sequentially. Am I missing something?

@HimariO
Contributor

HimariO commented Mar 30, 2025

@green-s According to my testing, the 7B and 32B models' vision encoder ggufs work fine with the clip-quantize-cli tool when quantized to Q4; only the 3B variant fails the conversion, because its hidden state size (channels) is not divisible by the block size.
Could you provide the 32B vision gguf file that is causing the problem?
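The constraint is just the usual ggml block-quant rule: every quantized row's element count has to be divisible by the block size (32 for Q4_0, 256 for the K-quants). A toy check, with purely illustrative dimensions:

# Block sizes come from ggml: QK4_0 = 32, QK_K = 256.
def quantizable(row_elems: int, block_size: int) -> bool:
    return row_elems % block_size == 0

for dim in (1280, 1176, 2048):  # illustrative dimensions, not the actual tensor sizes
    print(dim, "Q4_0:", quantizable(dim, 32), "K-quants:", quantizable(dim, 256))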

@green-s

green-s commented Mar 30, 2025

@green-s According to my testing, the 7B and 32B models' vision encoder ggufs work fine with the clip-quantize-cli tool when quantized to Q4 […]

Ah sorry I only tried running it on Windows. Tried it on Linux and it worked fine.

@Kreijstal

Please add Qwen 2.5 Omni support.

@ER-EPR

ER-EPR commented Apr 1, 2025

When will llama.cpp officially support Qwen 2.5 VL?

@HimariO
Contributor

HimariO commented Apr 4, 2025

@ER-EPR, I've just wrapped up the PR (#12402) for Qwen 2.5 VL support today. It should be integrated once the review process is complete.

@green-s

green-s commented Apr 4, 2025

@HimariO Will existing conversions/quants work after those changes or do they have to be redone?

@soldivelot

I got an error with this build: https://github.com/HimariO/llama.cpp.qwen2.5vl/releases/tag/b5043
and this model: https://huggingface.com/samgreen/Qwen2.5-VL-32B-Instruct-GGUF

command:

.\bin-qvl\llama-qwen2vl-cli `
-m .\models\samgreen\Qwen25-VL-32B-Instruct-Q4_K_M.gguf --mmproj .\models\samgreen\qwen2.5-vl-32b-instruct-vision-f16.gguf --image .\image.png `
-p "describe this image" `
-t 16 -ngl 32

output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
build: 5043 (c262bedd) with MSVC 19.29.30158.0 for
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070) - 11094 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 771 tensors from .\models\samgreen\Qwen25-VL-32B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 VL 32B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-VL
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv   8:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   9:                        qwen2vl.block_count u32              = 64
llama_model_loader: - kv  10:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv  11:                   qwen2vl.embedding_length u32              = 5120
llama_model_loader: - kv  12:                qwen2vl.feed_forward_length u32              = 27648
llama_model_loader: - kv  13:               qwen2vl.attention.head_count u32              = 40
llama_model_loader: - kv  14:            qwen2vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  16:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen2.5 VL 32B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloaded 32/65 layers to GPU
load_tensors:        CUDA0 model buffer size =  8949.62 MiB
load_tensors:   CPU_Mapped model buffer size =  9976.38 MiB
.................................................................................................
clip_init: model name:   Qwen2.5-VL-32B-Instruct
clip_init: description:  image encoder for Qwen2VL
clip_init: GGUF version: 3
clip_init: alignment:    32
clip_init: n_tensors:    520
clip_init: n_kv:         24
clip_init: ftype:        f16

clip_init: loaded meta data with 24 key-value pairs and 520 tensors from .\models\samgreen\qwen2.5-vl-32b-instruct-vision-f16.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv   0:                       general.architecture str              = clip
clip_init: - kv   1:                        general.description str              = image encoder for Qwen2VL
clip_init: - kv   2:                          general.file_type u32              = 1
clip_init: - kv   3:                      clip.has_text_encoder bool             = false
clip_init: - kv   4:                    clip.has_vision_encoder bool             = true
clip_init: - kv   5:                    clip.has_qwen2vl_merger bool             = true
clip_init: - kv   6:                        clip.projector_type str              = qwen2vl_merger
clip_init: - kv   7:                              clip.use_silu bool             = true
clip_init: - kv   8:                              clip.use_gelu bool             = false
clip_init: - kv   9:                           clip.use_glu_mlp bool             = true
clip_init: - kv  10:                          clip.use_rms_norm bool             = true
clip_init: - kv  11:          clip.vision.fullatt_block_indexes arr[i32,4]       = [7, 15, 23, 31]
clip_init: - kv  12:                    clip.vision.window_size u32              = 112
clip_init: - kv  13:               clip.vision.embedding_length u32              = 1280
clip_init: - kv  14:                 clip.vision.projection_dim u32              = 5120
clip_init: - kv  15:                     clip.vision.patch_size u32              = 14
clip_init: - kv  16:                     clip.vision.image_size u32              = 560
clip_init: - kv  17:           clip.vision.attention.head_count u32              = 16
clip_init: - kv  18:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_init: - kv  19:                    clip.vision.block_count u32              = 32
clip_init: - kv  20:            clip.vision.feed_forward_length u32              = 0
clip_init: - kv  21:                               general.name str              = Qwen2.5-VL-32B-Instruct
clip_init: - kv  22:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_init: - kv  23:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_init: - type  f32:  292 tensors
clip_init: - type  f16:  228 tensors
clip_ctx: CLIP using CUDA0 backend
clip_init: text_encoder:   0
clip_init: vision_encoder: 1
clip_init: llava_projector:  0
clip_init: minicpmv_projector:  0
clip_init: minicpmv_version:  2
clip_init: glm_projector:  0
clip_init: model size:     1314.85 MB
clip_init: metadata size:  0.18 MB
clip_init: params backend buffer size =  1314.85 MB (520 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file

@HimariO
Contributor

HimariO commented Apr 7, 2025

@green-s Previously converted gguf should work fine with recent changes; nothing has changed in the vision encoder gguf format.

@soldivelot Those "errors" are normal, since they are raised by non-essential parameters that Qwen2.5VL doesn't use.

@ColumbusAI

How are you guys using Qwen VL models to have a conversation about images?

So far I have only found the llama.cpp binaries to provide zero-shot prompting, not a true conversation or an OpenAI-compatible endpoint that I can use with Open-WebUI to incorporate images in my text conversations.

Appreciate the insight!

@zrrraa

zrrraa commented Apr 9, 2025

@green-s Previously converted gguf should work fine with recent changes; nothing has changed in the vision encoder gguf format.

@soldivelot Those "errors" are normal, since they are raised by non-essential parameters that Qwen2.5VL doesn't use.

@HimariO Then how do you quantize the vision encoder of the 3B variant? I also failed to quantize the vision encoders of the 7B and 32B. May I know which version of the code you are using?

@Melon-Bread

Melon-Bread commented Apr 11, 2025

How are you guys using Qwen VL models to have a conversation about images? […]

Koboldcpp is the only way I know of at this moment if you want to do it locally.

@kkptm

kkptm commented Apr 14, 2025

Is there any plan to merge code from whria78/llama-qwen-vl?

@gitlawr

gitlawr commented Apr 30, 2025

@ColumbusAI You can take a look at llama-box if you need a pure API server, or GPUStack if you need a UI, clustering, distributed inference and more.

@ServeurpersoCom

You can try https://github.com/HimariO/llama.cpp.qwen2.5vl/tree/qwen25-vl-20250404

./llama-qwen2vl-cli \
  -m Qwen2.5-VL-7B-Instruct-q6_k_l.gguf \
  --mmproj Qwen2.5-VL-7B-Instruct-mmproj-f16.gguf \
  --image image.png \
  --temp 0.1 \
  -p "Décrit l'image en détail"

https://huggingface.co/Mungert/Qwen2.5-VL-7B-Instruct-GGUF
https://huggingface.co/Mungert/Qwen2.5-VL-3B-Instruct-GGUF

This is the old, deprecated binary. The developers seem to be streamlining all of this into llama-mtmd-cli...

@Kreijstal

You can try https://github.com/HimariO/llama.cpp.qwen2.5vl/tree/qwen25-vl-20250404 […]

No pull request?

@Cerno-b

Cerno-b commented May 8, 2025

I just cloned the master of this repo and it looks to me like Qwen 2.5 VL is working fine out of the box. I just ran it via llama-mtmd-cli, and it worked perfectly. It seems this is a fairly recent change (committed 1 week ago: 074e42a)

All I had to do was run ./llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF

It flawlessly downloaded the Q4 and f16 versions from the ggml-org space on Hugging Face:
https://huggingface.co/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/resolve/main/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf

It's also all documented here: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd

I can load and ask questions about images no problem.

Not sure this is what OP wanted but it looks to me like this can be closed.

@Kreijstal

I just cloned the master of this repo and it looks to me like Qwen 2.5 VL is working fine out of the box. […]

Can you check Qwen 2.5 Omni?

@ServeurpersoCom

ServeurpersoCom commented May 8, 2025

I can confirm, the main branch here works now!!! The fork I found is obsolete.
I pull every few days to test it, and today Qwen2.5 VL runs without the segfault I got a few days ago :) The devs work hard!!!! It's awesome. This can be closed.

#!/bin/bash

./llama-mtmd-cli \
 -m Qwen2.5-VL-3B-Instruct-q6_k_l.gguf \
 --mmproj Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf \
 --image image.jpg \
 -p "Décrit l'image en détail"

build: 5308 (8733e0cf) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
llama_model_loader: loaded meta data with 29 key-value pairs and 434 tensors from Qwen2.5-VL-3B-Instruct-q6_k_l.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5-VL-3B-Instruct
llama_model_loader: - kv   3:                         general.size_label str              = 3.1B
llama_model_loader: - kv   4:                        qwen2vl.block_count u32              = 36
llama_model_loader: - kv   5:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv   6:                   qwen2vl.embedding_length u32              = 2048
llama_model_loader: - kv   7:                qwen2vl.feed_forward_length u32              = 11008
llama_model_loader: - kv   8:               qwen2vl.attention.head_count u32              = 16
llama_model_loader: - kv   9:            qwen2vl.attention.head_count_kv u32              = 2
llama_model_loader: - kv  10:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                          general.file_type u32              = 18
llama_model_loader: - kv  25:                      quantize.imatrix.file str              = /home/mahadeva/code/models/Qwen2.5-VL...
llama_model_loader: - kv  26:                   quantize.imatrix.dataset str              = imatrix-train-set
llama_model_loader: - kv  27:             quantize.imatrix.entries_count i32              = 252
llama_model_loader: - kv  28:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q6_K:  252 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 2.43 GiB (6.76 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.09 B
print_info: general.name     = Qwen2.5-VL-3B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  2486.77 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified:        CPU KV buffer size =   144.00 MiB
llama_kv_cache_unified: KV self size  =  144.00 MiB, K (f16):   72.00 MiB, V (f16):   72.00 MiB
llama_context:        CPU compute buffer size =   300.75 MiB
llama_context: graph nodes  = 1338
llama_context: graph splits = 578 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_ctx: CLIP using CPU backend
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

clip_model_loader: model name:   qwen7
clip_model_loader: description:  image encoder for Qwen2VL
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    520
clip_model_loader: n_kv:         24

load_hparams: projector:          qwen2vl_merger
load_hparams: n_embd:             1280
load_hparams: n_head:             16
load_hparams: n_ff:               0
load_hparams: n_layer:            32
load_hparams: projection_dim:     2048
load_hparams: image_size:         560
load_hparams: patch_size:         14

load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       0
load_hparams: ffn_op:             silu
load_hparams: model size:         1276.39 MiB
load_hparams: metadata size:      0.18 MiB
alloc_compute_meta:        CPU compute buffer size =   198.93 MiB
main: loading model: Qwen2.5-VL-3B-Instruct-q6_k_l.gguf
encoding image or slice...
image/slice encoded in 22250 ms
decoding image batch 1/1, n_tokens_batch = 726
image decoded (batch 1/1) in 13880 ms

L'affiche de la série télévisée "L'Évaluation" présente une composition graphique originale. Le titre "L'ÉVALUATION" est écrit en blanc, en majuscules, et est placé au centre de l'image. Il est entouré de motifs géométriques colorés, principalement en tons de bleu et de jaune, qui semblent être des bandes horizontales et verticales. Ces motifs créent un effet de perspective et de réfraction, donnant à l'image une apparence dynamique et captivante.

Au-dessus du titre, on peut voir une silhouette de deux personnages féminins. Elles sont représentées de manière stylisée, avec des traits simplifiés et des couleurs vives. L'une d'elles porte un haut bleu, tandis que l'autre porte un haut blanc. Leur visage est légèrement déformé, ce qui ajoute une touche de mystère à leur apparence.

Au-dessous du titre, on peut voir une silhouette d'un personnage masculin. Il porte un haut noir et a un visage plus réaliste, avec des traits plus détaillés. Son regard semble être dirigé vers l'objectif, ce qui donne à l'image une impression de perspective et d'attente.

En arrière-plan, le fond est un nuage de couleur sombre, ce qui ajoute une note de mystère et de tension à l'ensemble de l'image. Le contraste entre le fond sombre et les motifs colorés crée une atmosphère intense et captivante, qui reflète probablement le ton de la série.


llama_perf_context_print:        load time =     376.85 ms
llama_perf_context_print: prompt eval time =   36538.40 ms /   744 tokens (   49.11 ms per token,    20.36 tokens per second)
llama_perf_context_print:        eval time =   35867.17 ms /   353 runs   (  101.61 ms per token,     9.84 tokens per second)
llama_perf_context_print:       total time =   72829.70 ms /  1097 tokens

@Cerno-b

Cerno-b commented May 8, 2025

I just cloned the master of this repo and it looks to me like Qwen 2.5 VL is working fine out of the box. […]

Can you check Qwen 2.5 Omni?

I don't think it will work. There are no gguf files for Omni yet in the ggml-org space on Hugging Face (see https://huggingface.co/models?sort=trending&search=ggml-org+qwen2.5) and it's also not in the llama.cpp docs, so I would assume it isn't supported yet.

@ServeurpersoCom

ServeurpersoCom commented May 8, 2025

It's still a work in progress, because for now Qwen 2.5 VL (I tried 3B and 7B) is not as good as Qwen2 VL or Gemma 3.

@danielhanchen
Contributor

I'm not sure if any of you have found extremely high perplexity values for Qwen 2.5 VL 72B Instruct? I'm weirdly getting 20 to 70 after BF16 conversion.
