
mtmd: Fix the calculation of n_tokens for smolvlm #13381


Merged
merged 1 commit into ggml-org:master on May 8, 2025

Conversation

Contributor

@awkrail awkrail commented May 8, 2025

@ngxson Thank you for supporting the SmolVLM family!

Summary

The recently added SmolVLM model failed to generate image descriptions correctly due to a miscalculation of n_patches.
This PR fixes the issue by adjusting the computation to match the original SmolVLM implementation.

Reproduction

./build/bin/llama-mtmd-cli \
  -m models/custom/SmolVLM2-256M-Video-Instruct-f16.gguf \
  --mmproj models/custom/mmproj-SmolVLM2-256M-Video-Instruct-f16.gguf \
  --image tools/mtmd/test-1.jpeg \
  -p "describe this image"

The output log (the model generates nothing in this example):

clip_ctx: CLIP using CPU backend
mtmd_cli_context: chat template example:
<|im_start|>You are a helpful assistant

User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:
clip_model_loader: model name:   SmolVLM2 256M Video Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    198
clip_model_loader: n_kv:         66

load_hparams: projector:          idefics3
load_hparams: n_embd:             768
load_hparams: n_head:             12
load_hparams: n_ff:               3072
load_hparams: n_layer:            12
load_hparams: projection_dim:     576
load_hparams: image_size:         512
load_hparams: patch_size:         16

load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  4
load_hparams: n_wa_pattern:       0
load_hparams: ffn_op:             gelu
load_hparams: model size:         181.22 MiB
load_hparams: metadata size:      0.07 MiB
alloc_compute_meta:        CPU compute buffer size =    60.00 MiB
main: loading model: models/custom/SmolVLM2-256M-Video-Instruct-f16.gguf
encoding image or slice...
image/slice encoded in 34985 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 13578 ms




llama_perf_context_print:        load time =    1483.94 ms
llama_perf_context_print: prompt eval time =   49688.54 ms /   271 tokens (  183.35 ms per token,     5.45 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   52743.32 ms /   272 tokens

Cause

The issue stems from the calculation of n_patches (= 256 in the log above).
The current implementation divides the patch count by params.proj_scale_factor, but the correct logic (as in the original Hugging Face SmolVLM implementation) is to divide by params.proj_scale_factor ** 2.
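
For illustration, a minimal standalone sketch of the arithmetic in question, using the values from the load_hparams log above. The variable names here mirror the logged hparams rather than the actual struct fields in clip.cpp, and this is not the literal patch, just the before/after formula:

#include <cstdio>

int main() {
    // Values taken from the load_hparams log above (SmolVLM2-256M-Video-Instruct).
    const int image_size        = 512;
    const int patch_size        = 16;
    const int proj_scale_factor = 4;

    // ViT output: (512 / 16)^2 = 32 * 32 = 1024 patches.
    const int n_patches = (image_size / patch_size) * (image_size / patch_size);

    // Old (buggy) token count: divide by the scale factor only once.
    const int n_tokens_old = n_patches / proj_scale_factor;                        // 256
    // Fixed token count: the pixel shuffle shrinks both H and W, so divide by its square.
    const int n_tokens_new = n_patches / (proj_scale_factor * proj_scale_factor);  // 64

    std::printf("old: %d tokens, fixed: %d tokens\n", n_tokens_old, n_tokens_new);
    return 0;
}

The two results match the n_tokens_batch values seen in the logs before (256) and after (64) the fix.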

Fix & Result

After fixing this, the model generates the description correctly.

Output:

clip_ctx: CLIP using CPU backend
mtmd_cli_context: chat template example:
<|im_start|>You are a helpful assistant

User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:
clip_model_loader: model name:   SmolVLM2 256M Video Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    198
clip_model_loader: n_kv:         66

load_hparams: projector:          idefics3
load_hparams: n_embd:             768
load_hparams: n_head:             12
load_hparams: n_ff:               3072
load_hparams: n_layer:            12
load_hparams: projection_dim:     576
load_hparams: image_size:         512
load_hparams: patch_size:         16

load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  4
load_hparams: n_wa_pattern:       0
load_hparams: ffn_op:             gelu
load_hparams: model size:         181.22 MiB
load_hparams: metadata size:      0.07 MiB
alloc_compute_meta:        CPU compute buffer size =    60.00 MiB
main: loading model: models/custom/SmolVLM2-256M-Video-Instruct-f16.gguf
encoding image or slice...
image/slice encoded in 26163 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 1767 ms

 A newspaper article from The New York Times dated March 28, 1955. The headline reads "MEN WALK ON MOON: ASTRONAUTS LAND ON PLAN; COLLECT ROCKS, PLANT FLAG; A POWERFUL SURFACE IS CLOSELY EXPLORED." The article is dated March 28, 1955, and is printed in a newspaper called The New York Times.


llama_perf_context_print:        load time =     377.15 ms
llama_perf_context_print: prompt eval time =   28288.96 ms /    79 tokens (  358.09 ms per token,     2.79 tokens per second)
llama_perf_context_print:        eval time =    4462.37 ms /   102 runs   (   43.75 ms per token,    22.86 tokens per second)
llama_perf_context_print:       total time =   33139.66 ms /   181 tokens

Collaborator

ngxson commented May 8, 2025

Thanks for fixing. Yes, because both the H and W directions are reduced by a factor of proj_scale_factor, the number of tokens is reduced by a factor of proj_scale_factor * proj_scale_factor == proj_scale_factor**2, i.e. H*W / proj_scale_factor**2.
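
(Numerically, with the logged hparams: (512 / 16)^2 = 1024 patches, so 1024 / 4 = 256 tokens before the fix and 1024 / 4^2 = 64 after, matching the n_tokens_batch values in the two logs above.)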

@ngxson ngxson merged commit 0ccc121 into ggml-org:master May 8, 2025
46 checks passed
Contributor Author

awkrail commented May 8, 2025

Thank you for merging!

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...