[WIP Integration] LocateAnything-3B (MoonViT-SO-400M + Eagle MLP + Qwen2.5-3B): spatial localization failure and token count mismatch in clip.cpp / mtmd #24020

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggml-org/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description

I have a work-in-progress integration of nvidia/LocateAnything-3B into the mtmd / clip.cpp framework. The model loads, warms up, and runs inference to completion, but spatial localization is completely broken: it always outputs `<box><0><0><1000><1000></box>` regardless of the prompt or image. There is also a token count discrepancy (252 visual tokens produced vs 256 expected).

Architecture: MoonViT-SO-400M (vision encoder, head_dim=72, 2D RoPE, 27-layer ViT) + Eagle MLP projector (2-layer MLP with intermediate LayerNorm) + Qwen2.5-3B Instruct (LLM backbone), with a 2×2 spatial patch merge between encoder and projector.

**Environment**

| Item | Value |
|---|---|
| llama.cpp commit | e4406fed5 (build 8540) |
| GPU | NVIDIA GeForce GTX 1050 Ti (4030 MiB, compute 6.1) |
| OS / Arch | Linux x86_64, GNU 11.4.0 |
| LLM weights | Q4_K_M, 1.96 GiB |
| mmproj weights | BF16 GGUF |

**Reproduction**

```bash
./build/bin/llama-mtmd-cli \
  -m  ./LocateAnything-3B/LocateAnything-3B-LLM-Q4_K_M.gguf \
  --mmproj ./LocateAnything-3B/mmproj-LocateAnything-3B-BF16.gguf \
  --image ./Sample_dataset_312.jpg \
  -p "Locate all the instances that matches the following description: person." \
  --temp 0.1 -n 512 -c 2048 -ngl 10 --image-max-tokens 256
```

Actual output: `<ref>person</ref><box><0><0><1000><1000></box>`

Expected output: `<ref>person</ref><box><120><45><480><790></box>` (a tight box around the person)

**Loaded hparams from log**

| Parameter | Value |
|---|---|
| projector | moonvit |
| image_size | 896 |
| patch_size | 14 |
| n_merge | 2 |
| n_embd | 1152 |
| n_head | 16 |
| n_layer | 27 |
| image_max_pixels | 200704 |
| warmup image | 448 × 448 |

**Bug 1 — Token count: 252 produced, 256 expected**

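- Expected: (448/14)² / (2×2) = 1024 / 4 = 256 visual tokens
- Actual log: decoding image batch 1/1, n_tokens_batch = 252

Suspected cause: preprocessing (letterboxing / resize) produces dimensions other than 448 × 448. Rounding a square image down to 444 would not give 252 (444 is not a multiple of 14, and 31 × 31 patches merge to 15 × 15 = 225 tokens). However, 252 = 18 × 14, which is exactly what a 504 × 392 input (36 × 28 patches, 197568 pixels, within image_max_pixels = 200704) yields after the 2×2 merge. So the count may come from an aspect-ratio-preserving resize of a non-square image, and the 256 figure may only apply to the 448 × 448 warmup image.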
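The token arithmetic for Bug 1 can be checked with a small sketch. This is plain Python, not clip.cpp code: the helper names are hypothetical, and it assumes each dimension is floored to whole patches and then to whole 2×2 merge windows, which is not confirmed against the actual preprocessing.

```python
PATCH_SIZE = 14  # patch_size from the loaded hparams
N_MERGE = 2      # 2x2 spatial merge


def n_visual_tokens(width, height, patch=PATCH_SIZE, merge=N_MERGE):
    """Post-merge token count, ASSUMING floor rounding at both stages."""
    cols = (width // patch) // merge
    rows = (height // patch) // merge
    return rows * cols


def merged_grids(n_tokens, max_aspect=2.0):
    """(cols, rows) post-merge grids that multiply to n_tokens.

    max_aspect is an arbitrary filter to drop implausibly thin images.
    """
    grids = []
    for cols in range(1, n_tokens + 1):
        if n_tokens % cols == 0:
            rows = n_tokens // cols
            if max(rows, cols) / min(rows, cols) <= max_aspect:
                grids.append((cols, rows))
    return grids


if __name__ == "__main__":
    print(n_visual_tokens(448, 448))  # 256, the warmup case
    print(n_visual_tokens(444, 444))  # 225, so 444 does not explain 252
    unit = PATCH_SIZE * N_MERGE       # 28 pixels per merged token side
    for cols, rows in merged_grids(252):
        print(f"{cols * unit} x {rows * unit} px -> {cols * rows} tokens")
```

Comparing the printed candidate sizes with the resized dimensions reported for the input image (if the log exposes them) would show whether 252 is simply the correct count for a non-square input.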

**Bug 2 — Spatial output collapsed to full image**

The `<box><0><0><1000><1000></box>` output is the classic symptom of lost spatial encoding. Three ranked suspects:

1. pos_row / pos_col not filled before graph execution. The tensors are marked as inputs with ggml_set_input and named "pos_row" / "pos_col", but I am not certain clip.cpp writes the actual row/column I32 indices into them before ggml_backend_graph_compute. If they stay zero-initialized, every patch appears to sit at position (0,0).

2. build_patch_merge_permute() permutation order not matching LocateAnything-3B's training (PyTorch pixel-unshuffle ordering).

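3. v.position_embd.weight (shape [hidden_size, 4096], which would correspond to a 64 × 64 position grid if the table is 2D) is always sliced from offset 0 regardless of the actual image resolution. If that reading is right, a patch grid narrower than 64 columns would pick up embeddings for the wrong rows and columns.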
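To make suspects 1 and 2 easier to test, here is a sketch of what the pos_row / pos_col buffers could contain for a given patch grid. It is plain Python with hypothetical helper names. Both orderings are shown because it is an open question which one the model was trained with: raster order, or an order where each 2×2 merge block is contiguous.

```python
def raster_positions(n_rows, n_cols):
    """Row-major patch order: index i maps to (i // n_cols, i % n_cols)."""
    pos_row = [r for r in range(n_rows) for _ in range(n_cols)]
    pos_col = [c for _ in range(n_rows) for c in range(n_cols)]
    return pos_row, pos_col


def block_major_positions(n_rows, n_cols, merge=2):
    """Order in which the patches of each merge x merge block are adjacent.

    ASSUMPTION: blocks in raster order, patches inside a block in
    (dy, dx) order. The reference implementation may differ.
    """
    assert n_rows % merge == 0 and n_cols % merge == 0
    pos_row, pos_col = [], []
    for br in range(0, n_rows, merge):
        for bc in range(0, n_cols, merge):
            for dy in range(merge):
                for dx in range(merge):
                    pos_row.append(br + dy)
                    pos_col.append(bc + dx)
    return pos_row, pos_col


def looks_unfilled(pos_row, pos_col):
    """True if every index is zero, i.e. the buffers were likely never written."""
    return not any(pos_row) and not any(pos_col)


if __name__ == "__main__":
    rows, cols = block_major_positions(4, 4)
    print(list(zip(rows, cols))[:8])
    print(looks_unfilled([0] * 16, [0] * 16))  # True
```

Dumping the real pos_row / pos_col contents just before graph compute and comparing them against one of these lists would separate "never filled" from "filled in the wrong order".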

**Bug 3 — head_dim=72 flash attention assert (worked around)**

MoonViT: hidden_size=1152, n_heads=16 → head_dim=72.
ggml_flash_attn_ext asserts head_dim ∈ {64,80,96,112,128,256} and crashes.
Worked around with manual scaled dot-product attention in moonvit_encoder_layer(). Warmup passes cleanly. Flagging in case there is a preferred framework-level way to handle non-standard head_dim.
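For checking the manual attention path numerically, a tiny single-head reference can help. The sketch below is plain Python with a hypothetical function name, not the code in moonvit_encoder_layer(). It computes softmax(QKᵀ / √head_dim) · V, so with head_dim = 72 the scale is 1/√72 ≈ 0.1179.

```python
import math


def sdpa_reference(q, k, v):
    """Single-head scaled dot-product attention on lists of float vectors.

    q, k: n vectors of length head_dim; v: n vectors of any fixed length.
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


if __name__ == "__main__":
    head_dim = 72
    q = [[0.0] * head_dim]                    # zero query -> uniform weights
    k = [[1.0] * head_dim, [2.0] * head_dim]
    v = [[1.0] * head_dim, [3.0] * head_dim]
    print(sdpa_reference(q, k, v)[0][0])      # mean of 1.0 and 3.0
```

Feeding the same small Q/K/V into the ggml graph and comparing outputs would confirm that the scale factor and softmax axis in the workaround are correct.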

**Questions for maintainers**

1. pos_row / pos_col fill path — where in clip.cpp is the correct place to write I32 row/column indices into named input tensors before compute? Is there a pattern in qwen2vl.cpp or pixtral.cpp to follow?

2. clip_n_output_tokens — should this be set to the post-merge count (256) or pre-merge (1024)?

3. Preprocessing rounding — should image dimensions be constrained to a multiple of patch_size × merge_k = 28 to guarantee an integer token count after merging?
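For question 3, one possible alignment rule can be sketched as follows. It is plain Python with a hypothetical function name, and it makes two assumptions that may not match what the model expects: the aspect ratio is preserved while scaling down to the pixel budget, and each side is rounded down (not up or to nearest) to a multiple of patch_size × merge_k = 28.

```python
import math


def align_size(width, height, patch=14, merge=2, max_pixels=200704):
    """Scale to fit max_pixels, then floor each side to a multiple of patch*merge."""
    unit = patch * merge
    # Only ever scale down; images already within budget keep their size.
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    w = max(unit, int(width * scale) // unit * unit)
    h = max(unit, int(height * scale) // unit * unit)
    return w, h


def tokens_for(width, height, patch=14, merge=2):
    """Post-merge token count for dimensions that are already aligned."""
    unit = patch * merge
    return (width // unit) * (height // unit)


if __name__ == "__main__":
    for size in [(448, 448), (1280, 960), (640, 480)]:
        w, h = align_size(*size)
        print(size, "->", (w, h), tokens_for(w, h), "tokens")
```

With this rule the token count is always an integer and never exceeds max_pixels / 28², but it varies with aspect ratio, so a fixed expectation of 256 would only hold for square inputs.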

**Implementation gists**

- moonvit.cpp (full ggml graph): https://gist.github.com/chandra-ps612/70b698866c7ecb6e60aeecb21114a1be
- models.h (struct declaration): https://gist.github.com/chandra-ps612/fee5400a3f08d68f8a64561a8f8b88f0
- clip-impl.h (projector hooks): https://gist.github.com/chandra-ps612/fee5400a3f08d68f8a64561a8f8b88f0
- clip.cpp (dispatch changes): https://gist.github.com/chandra-ps612/1542d764772ad3517d1c71a0d6bcf906
- convert_moonvit_to_gguf.py: https://gist.github.com/chandra-ps612/convert_moonvit_to_gguf
- extract_and_convert_llm.py: https://gist.github.com/chandra-ps612/90a478a584bfeafa7594c268ceaef8da

Related: #23784 (plain feature request for LocateAnything)
HuggingFace model: https://huggingface.co/nvidia/LocateAnything-3B

### Motivation

LocateAnything-3B is the strongest text and character grounding model I have tested at this scale — outperforming Qwen2.5-VL 7B and Gemma3 12B on tight bounding-box localization tasks in my personal benchmarks. At Q4_K_M quantization it fits entirely within 4 GB VRAM, making it usable on consumer hardware (GTX 1050 Ti class and above). No llama.cpp-compatible integration currently exists. A working mtmd integration would make this model accessible for local inference without requiring PyTorch or the full HuggingFace stack.

### Possible Implementation

A full WIP implementation already exists across the gists linked in the Feature Description above. The core additions required are:

1. moonvit.cpp — ggml computation graph for MoonViT-SO-400M:
   - Patch embedding via ggml_conv_2d with [KW, KH, IC, OC] kernel layout
   - Learnable positional embedding sliced from v.position_embd.weight [hidden_size, 4096]
   - 27 encoder layers with LayerNorm → QKV projection → 2D RoPE → manual scaled dot-product attention (head_dim=72, bypassing ggml_flash_attn_ext) → FFN (GELU)
   - 2×2 patch merge via build_patch_merge_permute()
   - Eagle MLP projector: LayerNorm → Linear → GELU → Linear

2. models.h — add clip_graph_moonvit struct declaration alongside existing encoders

3. clip-impl.h — register PROJECTOR_TYPE_MOONVIT in the projector dispatch table

4. clip.cpp — handle pos_row / pos_col I32 fill for 2D RoPE (the part currently unclear and blocking correct spatial output)

5. convert_moonvit_to_gguf.py — Python conversion script to export MoonViT + Eagle MLP weights to BF16 GGUF

The remaining blockers are the bugs described above, particularly the pos_row / pos_col fill path (Bug 2, suspect 1), which is the most likely root cause of the spatial collapse. Guidance from maintainers familiar with the 2D position ID fill path in qwen2vl.cpp or pixtral.cpp would unblock this immediately.
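As a reference for the 2×2 patch merge step, the sketch below shows one way the merge could be defined on raster-ordered patch features. It is plain Python with a hypothetical function name, not build_patch_merge_permute(). The concatenation order inside each block (top-left, top-right, bottom-left, bottom-right) is an assumption that has to be checked against the PyTorch reference.

```python
def merge_patches(tokens, n_rows, n_cols, merge=2):
    """Concatenate each merge x merge block of patch features into one token.

    tokens: raster-ordered list of feature vectors, length n_rows * n_cols.
    Returns (n_rows // merge) * (n_cols // merge) vectors, each of length
    merge * merge * dim, with blocks in raster order.
    """
    assert len(tokens) == n_rows * n_cols
    assert n_rows % merge == 0 and n_cols % merge == 0
    merged = []
    for br in range(0, n_rows, merge):
        for bc in range(0, n_cols, merge):
            block = []
            for dy in range(merge):
                for dx in range(merge):
                    block.extend(tokens[(br + dy) * n_cols + (bc + dx)])
            merged.append(block)
    return merged


if __name__ == "__main__":
    # Use the raster index as a 1-dim "feature" so the ordering is visible.
    grid = [[i] for i in range(16)]  # 4 x 4 patches
    for token in merge_patches(grid, 4, 4):
        print(token)
```

Running the same index-tagged grid through the PyTorch merge and through the ggml permute, then comparing against this listing, would show whether the two orderings agree.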
