
mtmd: Fix the calculation of n_tokens for smolvlm #13381


Merged
merged 1 commit into ggml-org:master on May 8, 2025

Conversation

Contributor

@awkrail awkrail commented May 8, 2025

@ngxson Thank you for supporting the SmolVLM family!

Summary

The recently added SmolVLM model failed to generate image descriptions correctly due to a miscalculation of n_patches.
This PR fixes the issue by adjusting the computation to match the original SmolVLM implementation.

Reproduction

./build/bin/llama-mtmd-cli \
  -m models/custom/SmolVLM2-256M-Video-Instruct-f16.gguf \
  --mmproj models/custom/mmproj-SmolVLM2-256M-Video-Instruct-f16.gguf \
  --image tools/mtmd/test-1.jpeg \
  -p "describe this image"

The output log (the model generates nothing in this example):

clip_ctx: CLIP using CPU backend
mtmd_cli_context: chat template example:
<|im_start|>You are a helpful assistant

User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:
clip_model_loader: model name:   SmolVLM2 256M Video Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    198
clip_model_loader: n_kv:         66

load_hparams: projector:          idefics3
load_hparams: n_embd:             768
load_hparams: n_head:             12
load_hparams: n_ff:               3072
load_hparams: n_layer:            12
load_hparams: projection_dim:     576
load_hparams: image_size:         512
load_hparams: patch_size:         16

load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  4
load_hparams: n_wa_pattern:       0
load_hparams: ffn_op:             gelu
load_hparams: model size:         181.22 MiB
load_hparams: metadata size:      0.07 MiB
alloc_compute_meta:        CPU compute buffer size =    60.00 MiB
main: loading model: models/custom/SmolVLM2-256M-Video-Instruct-f16.gguf
encoding image or slice...
image/slice encoded in 34985 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 13578 ms




llama_perf_context_print:        load time =    1483.94 ms
llama_perf_context_print: prompt eval time =   49688.54 ms /   271 tokens (  183.35 ms per token,     5.45 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   52743.32 ms /   272 tokens

Cause

The issue stems from the calculation of n_patches (= 256 in the log above).
The current implementation divides the patch count by params.proj_scale_factor, but the correct logic (as in the original Hugging Face SmolVLM implementation) is to divide by params.proj_scale_factor ** 2.
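
For illustration, a minimal standalone sketch of the arithmetic in question, using the values from the load_hparams log above. The variable names here mirror the logged hparams rather than the actual struct fields in clip.cpp, and this is not the literal patch, just the before/after formula:

#include <cstdio>

int main() {
    // Values taken from the load_hparams log above (SmolVLM2-256M-Video-Instruct).
    const int image_size        = 512;
    const int patch_size        = 16;
    const int proj_scale_factor = 4;

    // ViT output: (512 / 16)^2 = 32 * 32 = 1024 patches.
    const int n_patches = (image_size / patch_size) * (image_size / patch_size);

    // Old (buggy) token count: divide by the scale factor only once.
    const int n_tokens_old = n_patches / proj_scale_factor;                        // 256
    // Fixed token count: the pixel shuffle shrinks both H and W, so divide by its square.
    const int n_tokens_new = n_patches / (proj_scale_factor * proj_scale_factor);  // 64

    std::printf("old: %d tokens, fixed: %d tokens\n", n_tokens_old, n_tokens_new);
    return 0;
}

The two results match the n_tokens_batch values seen in the logs before (256) and after (64) the fix.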

Fix & Result

After fixing this, the model generates the description correctly.

Output:

clip_ctx: CLIP using CPU backend
mtmd_cli_context: chat template example:
<|im_start|>You are a helpful assistant

User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:
clip_model_loader: model name:   SmolVLM2 256M Video Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    198
clip_model_loader: n_kv:         66

load_hparams: projector:          idefics3
load_hparams: n_embd:             768
load_hparams: n_head:             12
load_hparams: n_ff:               3072
load_hparams: n_layer:            12
load_hparams: projection_dim:     576
load_hparams: image_size:         512
load_hparams: patch_size:         16

load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  4
load_hparams: n_wa_pattern:       0
load_hparams: ffn_op:             gelu
load_hparams: model size:         181.22 MiB
load_hparams: metadata size:      0.07 MiB
alloc_compute_meta:        CPU compute buffer size =    60.00 MiB
main: loading model: models/custom/SmolVLM2-256M-Video-Instruct-f16.gguf
encoding image or slice...
image/slice encoded in 26163 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 1767 ms

 A newspaper article from The New York Times dated March 28, 1955. The headline reads "MEN WALK ON MOON: ASTRONAUTS LAND ON PLAN; COLLECT ROCKS, PLANT FLAG; A POWERFUL SURFACE IS CLOSELY EXPLORED." The article is dated March 28, 1955, and is printed in a newspaper called The New York Times.


llama_perf_context_print:        load time =     377.15 ms
llama_perf_context_print: prompt eval time =   28288.96 ms /    79 tokens (  358.09 ms per token,     2.79 tokens per second)
llama_perf_context_print:        eval time =    4462.37 ms /   102 runs   (   43.75 ms per token,    22.86 tokens per second)
llama_perf_context_print:       total time =   33139.66 ms /   181 tokens

Collaborator

ngxson commented May 8, 2025

Thanks for fixing. Yes, because both the H and W directions are reduced by a factor of proj_scale_factor, the number of tokens is reduced by a factor of proj_scale_factor * proj_scale_factor == proj_scale_factor**2, i.e. H*W / proj_scale_factor**2.
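
(Numerically, with the logged hparams: (512 / 16)^2 = 1024 patches, so 1024 / 4 = 256 tokens before the fix and 1024 / 4^2 = 64 after, matching the n_tokens_batch values in the two logs above.)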

@ngxson ngxson merged commit 0ccc121 into ggml-org:master May 8, 2025
46 checks passed
Contributor Author

awkrail commented May 8, 2025

Thank you for merging!

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...