Eval bug: -fit on ignoring the VRAM required by the built-in MTP leads to OOM crash

### Name and Version

./bin/llama-cli --version
version: 9274 (52fb93a2b)
built with GNU 12.2.0 for Linux x86_64

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

12th Gen Intel(R) Core(TM) i5-12600K
NVIDIA GeForce RTX 3090

### Models

[Qwen3.6-27B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/) Q4_K_M

### Problem description & steps to reproduce

`-fit on` only optimizes memory for the primary model context and completely ignores the VRAM required by the built-in MTP (Multi-Token Prediction) draft model, which leads to a server crash.

**compile flags**
`cmake -B . --fresh -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86" -DGGML_CUDA_FA_ALL_QUANTS=ON`

**-fit off launch command** works 100%
`./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 -fit off --ctx-size 131072 --verbose --log-verbosity 4`

**-fit on launch command** crashes at inference time
`./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 --fit on --fit-ctx 131072 --verbose --log-verbosity 4`

Immediately after the main model loads, llama.cpp creates a second context for MTP speculative decoding. That context defaults to the model's native `n_ctx = 262144` (should be 131072) and tries to allocate 990.95 MiB for compute buffers. Since VRAM is already exhausted by the main model, `cudaMalloc` fails and the server exits.


### First Bad Commit

git rev-parse --short HEAD
52fb93a2b

### Relevant log output

--fit successfully shrinks the main context:
<details>
<summary>Logs</summary>

```console
0.25.620.084 I common_params_fit_impl: context size reduced from 262144 to 170240 -> need 3283 MiB less memory in total
0.25.620.088 I common_params_fit_impl: entire model can be fit by reducing context
0.25.620.090 I common_fit_params: successfully fit params to free device memory
```
</details>
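
As a rough cross-check of the fit log above, the per-token memory cost it implies can be worked out directly. This is only a sketch: it assumes the saved memory scales linearly with context size, and the helper name `mib_per_token` is made up for illustration.

```python
def mib_per_token(ctx_before: int, ctx_after: int, mib_saved: float) -> float:
    """MiB freed per context token, assuming linear scaling (an assumption)."""
    return mib_saved / (ctx_before - ctx_after)

# Values taken from the common_params_fit_impl log line.
per_token = mib_per_token(262144, 170240, 3283)

# Hypothetical: extra memory freed if the main context were 131072 instead of 170240.
extra = per_token * (170240 - 131072)

print(f"{per_token * 1024:.2f} KiB per context token")  # 36.58 KiB
print(f"{extra:.0f} MiB more freed at n_ctx = 131072")  # 1399 MiB
```

Under that assumption, the fitted main context of 170240 leaves roughly 1.4 GiB less headroom than a 131072 context would, which is in the same range as the 990.95 MiB allocation that fails later.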

The MTP draft context tries to initialize with the original context length (`n_ctx` should be 131072, not 262144):
<details>
<summary>Logs</summary>

```console
srv    load_model: creating MTP draft context against the target model...
llama_context: n_ctx         = 262144
0.27.740.184 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 990.95 MiB on device 0: cudaMalloc failed: out of memory
```
</details>
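
The mismatch can be spotted mechanically in a log file. The following is a minimal sketch, not part of llama.cpp: the function name `check_draft_ctx` is hypothetical, and the patterns are written only against the log lines quoted in this issue.

```python
import re

def check_draft_ctx(log_text: str):
    """Return (fitted_ctx, draft_ctx, mismatch) parsed from a llama-server log."""
    fitted = None
    draft = None
    in_draft = False
    for line in log_text.splitlines():
        # Context size chosen by the fit step for the main context.
        m = re.search(r"context size reduced from (\d+) to (\d+)", line)
        if m:
            fitted = int(m.group(2))
        # Everything after this marker belongs to the MTP draft context.
        if "creating MTP draft context" in line:
            in_draft = True
        # First n_ctx reported after the draft context starts being created.
        m = re.search(r"llama_context: n_ctx\s*=\s*(\d+)", line)
        if m and in_draft and draft is None:
            draft = int(m.group(1))
    mismatch = fitted is not None and draft is not None and draft != fitted
    return fitted, draft, mismatch

sample = """\
0.25.620.084 I common_params_fit_impl: context size reduced from 262144 to 170240 -> need 3283 MiB less memory in total
srv    load_model: creating MTP draft context against the target model...
llama_context: n_ctx         = 262144
"""
print(check_draft_ctx(sample))  # (170240, 262144, True)
```

On the excerpts above, it reports that the draft context still uses 262144 while the main context was fitted to 170240.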

The server then fails at inference time: when `sched_reserve` tries to allocate the actual compute buffers for both the main and draft models, the request exceeds the remaining VRAM and causes an OOM crash:
<details>
<summary>Logs</summary>

```console
0.27.639.655 I srv  params_from_: Chat format: peg-native
0.27.639.826 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.27.639.828 I srv  get_availabl: updating prompt cache
0.27.639.830 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.27.639.832 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 170240 tokens, 8589934592 est)
0.27.639.834 I srv  get_availabl: prompt cache update took 0.01 ms
0.27.639.871 I slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
0.27.639.878 I slot launch_slot_: id  0 | task -1 | sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 171520
	top_k = 20, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 1.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.27.639.879 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.27.639.885 I slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 170240, n_keep = 0, task.n_tokens = 16
0.27.639.887 I slot update_slots: id  0 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
0.27.711.086 I sched_reserve: reserving ...
0.27.740.184 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 990.95 MiB on device 0: cudaMalloc failed: out of memory
0.27.740.189 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 1039083520
0.27.740.189 E graph_reserve: failed to allocate compute buffers
```
</details>
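
For reference, the two sizes in the error lines above describe the same buffer. A small conversion (the helper `to_mib` is just for illustration) confirms that the failed allocation is just under 1 GiB:

```python
MIB = 1024 * 1024

def to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / MIB

failed_bytes = 1039083520  # from ggml_gallocr_reserve_n_impl
print(f"{to_mib(failed_bytes):.2f} MiB")         # 990.95 MiB
print(f"{to_mib(failed_bytes) / 1024:.2f} GiB")  # 0.97 GiB
```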

**Long logs:**

[log-fit-off.1.log](https://github.com/user-attachments/files/28097414/log-fit-off.1.log)

[log-fit-on.1.log](https://github.com/user-attachments/files/28097476/log-fit-on.1.log)


Eval bug: -fit on ignoring the VRAM required by the built-in MTP leads to OOM crash #23472
