Prerequisites
Feature Description
I have a work-in-progress integration of nvidia/LocateAnything-3B into the mtmd / clip.cpp framework. The model loads, warms up, and runs inference to completion, but spatial localization is completely broken: it always outputs <0><0><1000><1000> regardless of the prompt or image. There is also a token count discrepancy (252 produced vs 256 expected).
Architecture: MoonViT-SO-400M (vision encoder, head_dim=72, 2D RoPE, 27-layer ViT) + Eagle MLP projector (2-layer MLP with intermediate LayerNorm) + Qwen2.5-3B Instruct (LLM backbone), with a 2×2 spatial patch merge between encoder and projector.
Environment
| Item | Value |
| --- | --- |
| llama.cpp commit | e4406fed5 (build 8540) |
| GPU | NVIDIA GeForce GTX 1050 Ti (4030 MiB, compute 6.1) |
| OS / Arch | Linux x86_64, GNU 11.4.0 |
| LLM weights | Q4_K_M, 1.96 GiB |
| mmproj weights | BF16 GGUF |
Reproduction
./build/bin/llama-mtmd-cli \
-m ./LocateAnything-3B/LocateAnything-3B-LLM-Q4_K_M.gguf \
--mmproj ./LocateAnything-3B/mmproj-LocateAnything-3B-BF16.gguf \
--image ./Sample_dataset_312.jpg \
-p "Locate all the instances that matches the following description: person." \
--temp 0.1 -n 512 -c 2048 -ngl 10 --image-max-tokens 256
Actual output:
person<0><0><1000><1000>
Expected output:
person<120><45><480><790> (tight box around person)
Loaded hparams from log
projector: moonvit
image_size: 896
patch_size: 14
n_merge: 2
n_embd: 1152
n_head: 16
n_layer: 27
image_max_pixels: 200704
warmup image: 448 × 448
Bug 1 — Token count: 252 produced, 256 expected
Expected: (448/14)² / (2×2) = 1024 / 4 = 256 visual tokens
Actual log: decoding image batch 1/1, n_tokens_batch = 252
Suspected cause: letterboxing resizes the image to dimensions that are not a multiple of patch_size × merge_k = 28 (e.g. 444 instead of 448), so the merged grid ends up slightly smaller than the expected 16 × 16 = 256 tokens.
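To make the arithmetic checkable, here is a minimal sketch of the token count as a function of the preprocessed image size. The helper name `visual_tokens` and the floor-based handling of partial patches are assumptions for illustration, not what clip.cpp necessarily does. It also shows one non-square size (504 × 392, chosen purely as an example) that fits within image_max_pixels and gives exactly 252 tokens, so it may be worth logging the actual preprocessed dimensions before treating 252 as a rounding bug.

```python
def visual_tokens(width, height, patch_size=14, n_merge=2):
    """Post-merge token count for a preprocessed image.

    Assumption: partial patches and partial merge groups are dropped
    (floor division). The real preprocessing may pad instead.
    """
    grid_w = width // patch_size   # patches per row
    grid_h = height // patch_size  # patches per column
    return (grid_w // n_merge) * (grid_h // n_merge)


print(visual_tokens(448, 448))  # 256: the square warmup size
print(visual_tokens(504, 392))  # 252: an 18 x 14 merged grid
print(504 * 392 <= 200704)      # True: within image_max_pixels
```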
Bug 2 — Spatial output collapsed to full image
The <0><0><1000><1000> output is the classic symptom of lost spatial encoding. Three ranked suspects:
- pos_row / pos_col not filled before graph execution: the tensors are marked as inputs with ggml_set_input and named "pos_row" / "pos_col", but I am not certain clip.cpp writes the actual row/column I32 indices into them before ggml_backend_graph_compute. If they stay zero-initialized, every patch looks like position (0,0).
- build_patch_merge_permute() permutation order does not match the ordering LocateAnything-3B was trained with (PyTorch pixel-unshuffle ordering).
- v.position_embd.weight (shape [hidden_size, 4096]) is always sliced from offset 0, regardless of the actual image resolution.
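For the first suspect, the following sketch shows what the two index arrays would be expected to contain for a small grid. It assumes patches reach the encoder in plain row-major order; if the implementation reorders patches beforehand, the indices would need the same reordering. The function name `make_pos_indices` is hypothetical. On the C++ side, the equivalent int32 arrays would presumably be uploaded with ggml_backend_tensor_set once the graph has been allocated, but that placement is an assumption here.

```python
def make_pos_indices(grid_h, grid_w):
    """Row and column index for every patch, in row-major patch order."""
    pos_row = [r for r in range(grid_h) for _ in range(grid_w)]
    pos_col = [c for _ in range(grid_h) for c in range(grid_w)]
    return pos_row, pos_col


pos_row, pos_col = make_pos_indices(2, 3)
print(pos_row)  # [0, 0, 0, 1, 1, 1]
print(pos_col)  # [0, 1, 2, 0, 1, 2]

# Sanity check: correctly filled inputs give one distinct (row, col) per patch,
# while zero-filled inputs collapse every patch onto the same position.
filled = set(zip(pos_row, pos_col))
zeroed = set(zip([0] * 6, [0] * 6))
print(len(filled), len(zeroed))  # 6 1
```

Dumping both tensors right before compute and comparing them against arrays like these is a cheap way to confirm or rule out this suspect.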
Bug 3 — head_dim=72 flash attention assert (worked around)
MoonViT: hidden_size=1152, n_heads=16 → head_dim=72.
ggml_flash_attn_ext asserts head_dim ∈ {64,80,96,112,128,256} and crashes.
Worked around with manual scaled dot-product attention in moonvit_encoder_layer(). Warmup passes cleanly. Flagging in case there is a preferred framework-level way to handle non-standard head_dim.
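As a reference for checking the workaround numerically, here is a minimal single-head scaled dot-product attention in plain Python. It is only a sketch of the standard formula softmax(QKᵀ / √head_dim)·V, not the code in moonvit_encoder_layer(); a ggml version would presumably be built from ops such as ggml_mul_mat and ggml_soft_max_ext, which is an assumption about the workaround rather than a description of it.

```python
import math


def sdpa(q, k, v):
    """Scaled dot-product attention for one head.

    q, k, v are lists of vectors (lists of floats); q and k share head_dim.
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


hidden_size, n_heads = 1152, 16
head_dim = hidden_size // n_heads
print(head_dim)  # 72

# With a zero query all scores are equal, so the output is the mean of v.
result = sdpa([[0.0] * head_dim],
              [[1.0] * head_dim, [2.0] * head_dim],
              [[1.0] * head_dim, [3.0] * head_dim])
print(result[0][0])  # 2.0
```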
Questions for maintainers
- pos_row / pos_col fill path: where in clip.cpp is the correct place to write I32 row/column indices into named input tensors before compute? Is there a pattern in qwen2vl.cpp or pixtral.cpp to follow?
- clip_n_output_tokens: should this be set to the post-merge count (256) or pre-merge (1024)?
- Preprocessing rounding: should image dimensions be constrained to a multiple of patch_size × merge_k = 28 to guarantee an integer token count after merging?
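To illustrate the rounding question, a sketch of snapping each side to the nearest multiple of patch_size × merge_k follows. The helper name `round_to_multiple` and the round-half-up policy are assumptions; a real preprocessing step would also need to re-check the image_max_pixels budget after rounding, which this sketch leaves out.

```python
def round_to_multiple(x, factor=28):
    """Snap x to the nearest multiple of factor (ties round up), minimum one factor."""
    return max(factor, (x + factor // 2) // factor * factor)


print(round_to_multiple(444))  # 448
print(round_to_multiple(448))  # 448
print(round_to_multiple(430))  # 420
```

With both sides snapped this way, the patch grid is always even in each dimension, so the 2×2 merge never has leftover patches.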
Implementation gists
Related: #23784 (plain feature request for LocateAnything)
HuggingFace model: https://huggingface.co/nvidia/LocateAnything-3B
Motivation
LocateAnything-3B is the strongest text and character grounding model I have tested at this scale — outperforming Qwen2.5-VL 7B and Gemma3 12B on tight bounding-box localization tasks in my personal benchmarks. At Q4_K_M quantization it fits entirely within 4 GB VRAM, making it usable on consumer hardware (GTX 1050 Ti class and above). No llama.cpp-compatible integration currently exists. A working mtmd integration would make this model accessible for local inference without requiring PyTorch or the full HuggingFace stack.
Possible Implementation
A full WIP implementation already exists across the gists linked in the Feature Description above. The core additions required are:
- moonvit.cpp (ggml computation graph for MoonViT-SO-400M):
  - Patch embedding via ggml_conv_2d with [KW, KH, IC, OC] kernel layout
  - Learnable positional embedding sliced from v.position_embd.weight [hidden_size, 4096]
  - 27 encoder layers with LayerNorm → QKV projection → 2D RoPE → manual scaled dot-product attention (head_dim=72, bypassing ggml_flash_attn_ext) → FFN (GELU)
  - 2×2 patch merge via build_patch_merge_permute()
  - Eagle MLP projector: LayerNorm → Linear → GELU → Linear
- models.h: add clip_graph_moonvit struct declaration alongside existing encoders
- clip-impl.h: register PROJECTOR_TYPE_MOONVIT in the projector dispatch table
- clip.cpp: handle pos_row / pos_col I32 fill for 2D RoPE (the part currently unclear and blocking correct spatial output)
- convert_moonvit_to_gguf.py: Python conversion script to export MoonViT + Eagle MLP weights to BF16 GGUF
The remaining blockers are the three bugs described above, particularly the pos_row/pos_col fill path (the first suspect under Bug 2), which is the most likely root cause of the spatial collapse. Guidance from maintainers familiar with the 2D position ID fill path in qwen2vl.cpp or pixtral.cpp would unblock this immediately.
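For the permutation-order suspect under Bug 2, here is a sketch of one plausible 2×2 merge ordering to compare against build_patch_merge_permute(). Blocks are visited in row-major order, and within each block the patches are taken row by row. Whether LocateAnything-3B's reference code uses exactly this within-block order is an assumption that would need checking against the PyTorch implementation; the function name `merge_permutation` is hypothetical.

```python
def merge_permutation(grid_h, grid_w, n_merge=2):
    """Indices into a row-major patch list, regrouped so that each run of
    n_merge * n_merge consecutive entries forms one merged block."""
    order = []
    for by in range(grid_h // n_merge):          # block row
        for bx in range(grid_w // n_merge):      # block column
            for dy in range(n_merge):            # row inside the block
                for dx in range(n_merge):        # column inside the block
                    row = by * n_merge + dy
                    col = bx * n_merge + dx
                    order.append(row * grid_w + col)
    return order


print(merge_permutation(4, 4))
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Printing the same index list from the ggml side for a 4×4 grid and diffing it against the reference ordering should show quickly whether the permutation is involved.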