mtmd: Fix the calculation of n_tokens for smolvlm #13381
Merged
@ngxson Thank you for supporting smolVLM family!
Summary
The recently-added SmolVLM model failed to generate image descriptions correctly due to a miscalculation in n_patches. This PR fixes the issue by adjusting the computation to match the original SmolVLM implementation.
Reproduction
./build/bin/llama-mtmd-cli \
  -m models/custom/SmolVLM2-256M-Video-Instruct-f16.gguf \
  --mmproj models/custom/mmproj-SmolVLM2-256M-Video-Instruct-f16.gguf \
  --image tools/mtmd/test-1.jpeg \
  -p "describe this image"
The output log is (the model does not generate anything in this example):
Cause
This issue stems from the calculation of n_patches (=256). The current implementation divides by params.proj_scale_factor, but the correct logic (as in the original Hugging Face SmolVLM implementation) should divide by params.proj_scale_factor ** 2.
Fix & Result
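The difference between the two divisors can be illustrated with a small arithmetic sketch. This is not the llama.cpp code: the function names are hypothetical, and the image size (512), patch size (16), and scale factor (4) are assumed values, chosen because they reproduce the n_patches = 256 figure quoted above for the old calculation.

```python
# Illustrative sketch only; names and input values are assumptions,
# not taken from the llama.cpp source.

def n_patches_before_fix(n_raw_patches: int, proj_scale_factor: int) -> int:
    """Old behaviour: divide the raw patch count by the scale factor once."""
    return n_raw_patches // proj_scale_factor


def n_patches_after_fix(n_raw_patches: int, proj_scale_factor: int) -> int:
    """Fixed behaviour: divide by the scale factor squared, since the
    projector shrinks both the width and the height of the patch grid."""
    return n_raw_patches // (proj_scale_factor ** 2)


# Assumed values: a 512x512 input split into 16x16 patches, scale factor 4.
image_size, patch_size, proj_scale_factor = 512, 16, 4
n_raw_patches = (image_size // patch_size) ** 2  # 32 * 32 = 1024

print(n_patches_before_fix(n_raw_patches, proj_scale_factor))  # 256
print(n_patches_after_fix(n_raw_patches, proj_scale_factor))   # 64
```

Under these assumptions the old calculation reports four times as many image tokens as the projector produces, which is consistent with the failed generation described above.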
After applying this fix, I confirmed that the model generates a description of the image.
Output: