Eval bug: -fit on ignoring the VRAM required by the built-in MTP leads to OOM crash #23472

@ali0une

Name and Version

./bin/llama-cli --version
version: 9274 (52fb93a)
built with GNU 12.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

12th Gen Intel(R) Core(TM) i5-12600K
NVIDIA GeForce RTX 3090

Models

Qwen3.6-27B-MTP-GGUF Q4_K_M

Problem description & steps to reproduce

-fit on only optimizes memory for the primary model context and completely ignores the VRAM required by the built-in MTP (Multi-Token Prediction) draft model, which leads to a server crash.

compile flags
cmake -B . --fresh -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86" -DGGML_CUDA_FA_ALL_QUANTS=ON

Launch command with -fit off (works every time):
./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 -fit off --ctx-size 131072 --verbose --log-verbosity 4

Launch command with --fit on (crashes at inference time):
./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 --fit on --fit-ctx 131072 --verbose --log-verbosity 4

Immediately after the main model loads, llama.cpp creates a second context for MTP speculative decoding. It defaults to the model's native n_ctx = 262144 (should be 131072) and tries to allocate 990.95 MiB for compute buffers. Since the VRAM has already been filled by the main model, cudaMalloc fails and the server exits.
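
To make the suspected accounting gap concrete, here is a minimal sketch of the two budget checks involved. It is not llama.cpp code: `fit_report` and its parameters are hypothetical names, and `main_ctx_mib` is a made-up figure chosen only to show the effect. The 990.95 MiB value is the failed allocation from the log, and 24576 MiB is the RTX 3090's nominal 24 GiB (the memory actually free will be lower).

```python
def fit_report(free_mib, main_ctx_mib, draft_ctx_mib):
    """Compare a budget check that ignores the draft context with one that includes it.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    fits_main_only = main_ctx_mib <= free_mib
    fits_with_draft = main_ctx_mib + draft_ctx_mib <= free_mib
    return fits_main_only, fits_with_draft


# Illustrative numbers only: main_ctx_mib is invented, the rest is described above.
result = fit_report(free_mib=24576, main_ctx_mib=24000, draft_ctx_mib=990.95)
print(result)  # -> (True, False)
```

If the fit step only performs the first check, it can report success while the combined requirement still exceeds the device, which matches the behaviour described in this report.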

First Bad Commit

git rev-parse --short HEAD
52fb93a

Relevant log output

--fit successfully shrinks the main context:

Logs
0.25.620.084 I common_params_fit_impl: context size reduced from 262144 to 170240 -> need 3283 MiB less memory in total
0.25.620.088 I common_params_fit_impl: entire model can be fit by reducing context
0.25.620.090 I common_fit_params: successfully fit params to free device memory
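
As a rough cross-check, the two figures in the fit log imply a per-token memory cost for the main context. This is back-of-the-envelope arithmetic, assuming the saving scales linearly with context size (an assumption; the log does not say how the 3283 MiB is split between KV cache and compute buffers):

```python
# Figures copied from the common_params_fit_impl log line.
n_ctx_before = 262144
n_ctx_after = 170240
saved_mib = 3283

tokens_dropped = n_ctx_before - n_ctx_after
# Assumes memory grows linearly with context length.
kib_per_token = saved_mib * 1024 / tokens_dropped

print(f"tokens dropped: {tokens_dropped}")
print(f"implied cost:   {kib_per_token:.2f} KiB per token of context")
```

The draft context has its own, probably smaller, per-token cost, so this number only describes the main context.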

The MTP draft context then tries to initialize with the original context length (n_ctx should be 131072, not 262144):

Logs
srv    load_model: creating MTP draft context against the target model...
llama_context: n_ctx         = 262144
0.27.740.184 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 990.95 MiB on device 0: cudaMalloc failed: out of memory

It then fails at inference time: when sched_reserve tries to allocate the actual compute buffers for both the main and draft models, the request exceeds the remaining VRAM, causing an OOM crash:

Logs
0.27.639.655 I srv  params_from_: Chat format: peg-native
0.27.639.826 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.27.639.828 I srv  get_availabl: updating prompt cache
0.27.639.830 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.27.639.832 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 170240 tokens, 8589934592 est)
0.27.639.834 I srv  get_availabl: prompt cache update took 0.01 ms
0.27.639.871 I slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
0.27.639.878 I slot launch_slot_: id  0 | task -1 | sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 171520
	top_k = 20, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 1.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.27.639.879 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.27.639.885 I slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 170240, n_keep = 0, task.n_tokens = 16
0.27.639.887 I slot update_slots: id  0 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
0.27.711.086 I sched_reserve: reserving ...
0.27.740.184 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 990.95 MiB on device 0: cudaMalloc failed: out of memory
0.27.740.189 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 1039083520
0.27.740.189 E graph_reserve: failed to allocate compute buffers
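
The byte count in the ggml_gallocr_reserve_n_impl line and the MiB figure in the cudaMalloc line describe the same allocation. A quick conversion confirms the failed request is just under 1 GiB:

```python
failed_bytes = 1039083520  # size reported by ggml_gallocr_reserve_n_impl

mib = failed_bytes / (1024 ** 2)
gib = failed_bytes / (1024 ** 3)

print(f"{mib:.2f} MiB = {gib:.3f} GiB")  # -> 990.95 MiB = 0.968 GiB
```

So the crash is caused by a request of roughly 1 GiB that no longer fits, not by an unreasonably large request.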

Long logs:

log-fit-off.1.log

log-fit-on.1.log
