Eval bug: mtmd Qwen2.5VL 7B not seeing an image as expected #13394


Closed
mattjcly opened this issue May 8, 2025 · 5 comments · Fixed by #13397

@mattjcly
Contributor

mattjcly commented May 8, 2025

Name and Version

llama-mtmd-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
version: 5317 (f05a6d7)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0

Operating systems

Mac

GGML backends

Metal

Hardware

MacBook M3 Pro 36GB

Models

Qwen2.5-VL-7B-Instruct (Q4_K_M)

Problem description & steps to reproduce

I noticed some situations where Qwen 2.5VL appears to "miss" or partially "miss" the image in certain prompts.

To reproduce this in mtmd-cli.cpp without any fuss, I hardcoded the following as the formatted prompt:

-    LOG_DBG("formatted_chat.prompt: %s\n", formatted_chat.prompt.c_str());
-
-    // text.text          = formatted_chat.prompt.c_str();
+    text.text = R"(<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
I'm going to tell you a story:

STORY START
The rain hammered against the windows of the antique shop, mirroring the drumming in Leo’s chest. He'd been searching for weeks, driven by a faded photograph – his grandfather, Silas, holding a peculiar silver compass. Silas had vanished without a trace fifty years ago, leaving only this enigmatic object and a whispered legend about hidden treasures.

Old Mr. Finch, the shop owner, a man who smelled perpetually of dust and beeswax, finally pointed to a small, locked box tucked away in the darkest corner. “Silas brought this in,” he rasped, his voice like rustling parchment. "Said it held the key."

Leo bought the box, his hands trembling as he wrestled with the stubborn lock. Finally, it sprung open, revealing not gold or jewels, but the silver compass. As Leo picked it up, a tiny inscription on its base caught his eye: “Follow your heart.”

Suddenly, a faint scent of pine needles and saltwater filled the air, and Leo knew, instinctively, that Silas hadn’t vanished – he'd simply been waiting for someone to understand.
STORY START

Now, ignore that story and tell me what this image is of<__image__><|im_end|>
<|im_start|>assistant
)";
+    LOG_ERR("formatted_chat.prompt: %s\n", text.text);

and then ran the following command:

llama-mtmd-cli -m /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf --mmproj /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --temp 0 --image /Users/matt/Workspace/pikachu.png --prompt "unused dummy prompt"

I then get a response where the model claims that no image has been provided:

encoding image or slice...
image/slice encoded in 3214 ms
decoding image batch 1/1, n_tokens_batch = 289
image decoded (batch 1/1) in 1157 ms

I'm sorry, but you haven't provided an image for me to describe. If you have an image you'd like me to describe or analyze, please upload it or describe it, and I'll be happy to help!

Gemma 3 4B (https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF) does not have this issue, and responds with:

encoding image or slice...
image/slice encoded in 20257 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 472 ms

That image is of Pikachu, a popular Pokémon character from the Pokémon franchise! He's known for his yellow fur, red cheeks, and electric powers.

[Attached image: pikachu.png]
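For context, mtmd splits the formatted prompt on the image marker (<__image__> here) and substitutes the encoded image embeddings at that point. Below is a toy sketch of that splitting step, using a hypothetical helper rather than the actual mtmd tokenizer:

#include <string>
#include <vector>

// Toy sketch of marker-based prompt splitting (hypothetical helper, not
// the actual mtmd tokenizer): text around the marker becomes text chunks,
// and each marker occurrence stands in for an image chunk.
static std::vector<std::string> split_on_marker(const std::string & prompt,
                                                const std::string & marker) {
    std::vector<std::string> parts;
    size_t start = 0;
    size_t pos;
    while ((pos = prompt.find(marker, start)) != std::string::npos) {
        parts.push_back(prompt.substr(start, pos - start)); // text chunk
        parts.push_back(marker);                            // image chunk placeholder
        start = pos + marker.size();
    }
    parts.push_back(prompt.substr(start));                  // trailing text chunk
    return parts;
}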

First Bad Commit

No response

Relevant log output

/Users/matt/Workspace/llama.cpp/cmake-build-debug/bin/llama-mtmd-cli -m /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf --mmproj /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf --temp 0 --image /Users/matt/Workspace/pikachu.png --prompt dummy
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Pro)
build: 5317 (f05a6d71) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0 (debug)
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 VL 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-VL
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                        qwen2vl.block_count u32              = 28
llama_model_loader: - kv   7:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv   8:                   qwen2vl.embedding_length u32              = 3584
llama_model_loader: - kv   9:                qwen2vl.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:               qwen2vl.attention.head_count u32              = 28
llama_model_loader: - kv  11:            qwen2vl.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.36 GiB (4.91 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = Qwen2.5 VL 7B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: Metal_Mapped model buffer size =  4460.45 MiB
load_tensors:   CPU_Mapped model buffer size =   292.36 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 32
llama_kv_cache_unified:      Metal KV buffer size =   224.00 MiB
llama_kv_cache_unified: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      Metal compute buffer size =   304.00 MiB
llama_context:        CPU compute buffer size =    16.01 MiB
llama_context: graph nodes  = 1042
llama_context: graph splits = 114
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
clip_ctx: CLIP using Metal backend
clip_model_loader: model name:   Qwen2.5 VL 7B Instruct
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    519
clip_model_loader: n_kv:         22

load_hparams: projector:          qwen2.5vl_merger
load_hparams: n_embd:             1280
load_hparams: n_head:             16
load_hparams: n_ff:               3420
load_hparams: n_layer:            32
load_hparams: projection_dim:     3584
load_hparams: image_size:         560
load_hparams: patch_size:         14

load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       8
load_hparams: ffn_op:             silu
load_hparams: model size:         1291.40 MiB
load_hparams: metadata size:      0.18 MiB
alloc_compute_meta:      Metal compute buffer size =   200.86 MiB
alloc_compute_meta:        CPU compute buffer size =    29.01 MiB
main: loading model: /Users/matt/.cache/lm-studio/models/ggml-org/Qwen2.5-VL-7B-Instruct-GGUF/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf
formatted_chat.prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
I'm going to tell you a story:

STORY START
The rain hammered against the windows of the antique shop, mirroring the drumming in Leo’s chest. He'd been searching for weeks, driven by a faded photograph – his grandfather, Silas, holding a peculiar silver compass. Silas had vanished without a trace fifty years ago, leaving only this enigmatic object and a whispered legend about hidden treasures.

Old Mr. Finch, the shop owner, a man who smelled perpetually of dust and beeswax, finally pointed to a small, locked box tucked away in the darkest corner. “Silas brought this in,” he rasped, his voice like rustling parchment. "Said it held the key."

Leo bought the box, his hands trembling as he wrestled with the stubborn lock. Finally, it sprung open, revealing not gold or jewels, but the silver compass. As Leo picked it up, a tiny inscription on its base caught his eye: “Follow your heart.”

Suddenly, a faint scent of pine needles and saltwater filled the air, and Leo knew, instinctively, that Silas hadn’t vanished – he'd simply been waiting for someone to understand.
STORY START

Now, ignore that story and tell me what this image is of<__image__><|im_end|>
<|im_start|>assistant

encoding image or slice...
image/slice encoded in 3122 ms
decoding image batch 1/1, n_tokens_batch = 289
image decoded (batch 1/1) in 1056 ms

I'm sorry, but you haven't provided an image for me to describe. If you have an image you'd like me to describe or analyze, please upload it or describe it, and I'll be happy to help!


llama_perf_context_print:        load time =   19621.77 ms
llama_perf_context_print: prompt eval time =    5298.57 ms /   568 tokens (    9.33 ms per token,   107.20 tokens per second)
llama_perf_context_print:        eval time =    2360.08 ms /    45 runs   (   52.45 ms per token,    19.07 tokens per second)
llama_perf_context_print:       total time =   11014.97 ms /   613 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating

Process finished with exit code 0
@ngxson
Collaborator

ngxson commented May 8, 2025

Thanks for testing this! There is one thing currently on my mind. As discussed (via DM), Qwen2.5VL considers the whole image to be one single token in the temporal direction. However, the original qwen2vl-cli considers the image to occupy max(nx, ny) positions instead.

I actually had no good documentation to verify this against (other than their paper, which may not exactly reflect what's going on in the code). So just to be sure, could you re-run the test with this version of mtmd_image_tokens_get_n_pos?

llama_pos mtmd_image_tokens_get_n_pos(const mtmd_image_tokens * image_tokens) {
    if (image_tokens->use_mrope_pos) {
        return std::max(image_tokens->nx, image_tokens->ny); // max(nx, ny) instead of 1
    }
    return image_tokens->n_tokens();
}
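For context, the image occupies nx * ny embedding tokens in the decode batch either way; the question is only how far it advances the temporal position under M-RoPE: by 1 (treating the whole image as a single step, as Qwen2.5VL expects) or by max(nx, ny) (the original qwen2vl-cli behavior). A toy sketch of the two conventions, assuming a hypothetical 17x17 patch grid chosen to match the 289 tokens in the log above:

#include <algorithm>
#include <cstdio>

// Toy illustration of the two temporal-position conventions discussed
// above. The 17x17 grid is a hypothetical example chosen to match the
// 289 embedding tokens reported in the log.
int main() {
    const int nx = 17, ny = 17;
    const int n_tokens  = nx * ny;             // tokens fed to the decoder: 289
    const int n_pos_one = 1;                   // whole image = one temporal step
    const int n_pos_max = std::max(nx, ny);    // max(nx, ny) temporal steps
    printf("n_tokens = %d, n_pos (single step) = %d, n_pos (max) = %d\n",
           n_tokens, n_pos_one, n_pos_max);
    return 0;
}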

@mattjcly
Contributor Author

mattjcly commented May 9, 2025

Thanks for pointing this out. I'm not seeing this make a difference when I test (I get the same result). I'm noticing that the first element of pos is 0 in the batch embd view when the image eval decode occurs here:

        llama_batch batch_embd_view = batch_embd.get_view(pos_offset, n_tokens_batch);

        LOG_INF("decoding image batch %d/%d, n_tokens_batch = %d\n", i_batch+1, n_img_batches, n_tokens_batch);

        int64_t t1 = ggml_time_ms();
        int32_t ret = llama_decode(lctx, batch_embd_view);

so I'm thinking there may be an issue with batch_embd.get_view or something else in that area. Still investigating, but that seemed suspicious to me; it is clearly a non-zero number for other model types when given the same kinds of prompts.
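One quick way to confirm this symptom is to dump the first few positions of the view right before the decode. A minimal debug sketch, assuming it is placed somewhere LOG_INF is in scope (e.g. next to the decode call above); the n_tokens and pos fields are part of the public llama_batch struct in llama.h:

// Hypothetical debug helper: print the first few positions of a batch
// view before llama_decode() to check whether pos[0] is unexpectedly 0.
static void dump_batch_pos(const llama_batch & batch, int32_t n_max) {
    const int32_t n = batch.n_tokens < n_max ? batch.n_tokens : n_max;
    for (int32_t i = 0; i < n; i++) {
        LOG_INF("pos[%d] = %d\n", i, batch.pos[i]);
    }
}

// e.g. right before the decode call shown above:
//     dump_batch_pos(batch_embd_view, 8);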

@zhouwg
Contributor

zhouwg commented May 9, 2025

Qwen2.5VL 7B also crashes on an Android phone, while llava inference with Gemma3-4B (https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF/tree/main) works perfectly on an Android phone equipped with the Qualcomm Snapdragon 8 Elite.

By the way, thanks so much to ngxson for bringing the amazing llava inference feature and a uniform llava inference framework to llama.cpp.

@ngxson
Copy link
Collaborator

ngxson commented May 9, 2025

@mattjcly ha ok, thanks for the clue. It turns out that in get_view() I was supposed to use pos_view.reserve() and not pos_view.resize()

I'm cleaning up the whole thing now, will push a fix very soon!
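For anyone curious why that produces a leading zero position: if a std::vector is sized with resize() and then filled with insert(), the real data lands after a run of default-initialized zeros, whereas reserve() only allocates capacity. A standalone sketch of the pitfall, with hypothetical names rather than the actual get_view() code:

#include <cstdio>
#include <vector>

int main() {
    const std::vector<int> pos_src = {10, 11, 12, 13};

    // Bug pattern: resize() creates 4 zero-valued elements up front, so
    // insert() appends the real positions after them -> buggy[0] == 0.
    std::vector<int> buggy;
    buggy.resize(pos_src.size());
    buggy.insert(buggy.end(), pos_src.begin(), pos_src.end());

    // Fix pattern: reserve() only allocates capacity (size stays 0), so
    // insert() fills from the beginning -> fixed[0] == 10.
    std::vector<int> fixed;
    fixed.reserve(pos_src.size());
    fixed.insert(fixed.end(), pos_src.begin(), pos_src.end());

    printf("buggy[0] = %d (size %zu), fixed[0] = %d (size %zu)\n",
           buggy[0], buggy.size(), fixed[0], fixed.size());
    return 0;
}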

@zhouwg
Contributor

zhouwg commented May 9, 2025

unbelievably quick fix!
