Name and Version
./bin/llama-cli --version
version: 9274 (52fb93a)
built with GNU 12.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
12th Gen Intel(R) Core(TM) i5-12600K
NVIDIA GeForce RTX 3090
Models
Qwen3.6-27B-MTP-GGUF Q4_K_M
Problem description & steps to reproduce
With -fit on, only the primary model context is sized to fit the available memory. The VRAM required by the built-in MTP (Multi-Token Prediction) draft context is ignored completely, which leads to a server crash.
Compile flags:
cmake -B . --fresh -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86" -DGGML_CUDA_FA_ALL_QUANTS=ON
Launch command with -fit off (works every time):
./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 -fit off --ctx-size 131072 --verbose --log-verbosity 4
Launch command with --fit on (crashes at inference time):
./bin/llama-server -m /whatever/Qwen3.6-27B-MTP-Q4_K_M.gguf --parallel 1 --n-gpu-layers all --host 127.0.0.1 --port 5000 --flash-attn on --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k q8_0 --spec-draft-type-v q8_0 --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --sleep-idle-seconds 10 --fit on --fit-ctx 131072 --verbose --log-verbosity 4
Immediately after the main model loads, llama.cpp creates a second context for MTP speculative decoding. It defaults to the model's native n_ctx = 262144 (should be 131072) and tries to allocate 990.95 MiB for compute buffers. Since VRAM is already exhausted by the main model, cudaMalloc fails and the server exits.
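To make the expected behavior concrete, here is a minimal Python sketch contrasting what the log suggests happens with what I would expect. This is not llama.cpp code: `ContextParams`, `draft_params_observed`, and `draft_params_expected` are hypothetical names used only for illustration. The two context sizes are taken from the log output in this report.

```python
from dataclasses import dataclass


@dataclass
class ContextParams:
    """Hypothetical stand-in for a context configuration (illustration only)."""
    n_ctx: int


N_CTX_NATIVE = 262144  # native context size reported for the draft context in the log
N_CTX_FITTED = 170240  # size the fit step reduced the main context to, per the log


def draft_params_observed(main: ContextParams) -> ContextParams:
    # What the log suggests: the draft context ignores the fitted size
    # and falls back to the model's native context length.
    return ContextParams(n_ctx=N_CTX_NATIVE)


def draft_params_expected(main: ContextParams) -> ContextParams:
    # Expected: the draft context is created with the main context's n_ctx.
    return ContextParams(n_ctx=main.n_ctx)


main = ContextParams(n_ctx=N_CTX_FITTED)
print("observed draft n_ctx:", draft_params_observed(main).n_ctx)
print("expected draft n_ctx:", draft_params_expected(main).n_ctx)
```

Whether the fix belongs in the fit step (budgeting for both contexts) or in the draft context setup (inheriting the fitted size) is an open question; the sketch only illustrates the mismatch.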
First Bad Commit
git rev-parse --short HEAD
52fb93a
Relevant log output
--fit successfully shrinks the main context:
Logs
0.25.620.084 I common_params_fit_impl: context size reduced from 262144 to 170240 -> need 3283 MiB less memory in total
0.25.620.088 I common_params_fit_impl: entire model can be fit by reducing context
0.25.620.090 I common_fit_params: successfully fit params to free device memory
The MTP draft context then tries to initialize with the original context length (n_ctx should be 131072, not 262144):
Logs
srv load_model: creating MTP draft context against the target model...
llama_context: n_ctx = 262144
0.27.740.184 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 990.95 MiB on device 0: cudaMalloc failed: out of memory
The server then fails at inference time: when sched_reserve tries to allocate the actual compute buffers for both the main and draft models, the request exceeds the remaining VRAM and causes an OOM crash:
Logs
0.27.639.655 I srv params_from_: Chat format: peg-native
0.27.639.826 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
0.27.639.828 I srv get_availabl: updating prompt cache
0.27.639.830 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.27.639.832 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 170240 tokens, 8589934592 est)
0.27.639.834 I srv get_availabl: prompt cache update took 0.01 ms
0.27.639.871 I slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
0.27.639.878 I slot launch_slot_: id 0 | task -1 | sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 171520
top_k = 20, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 1.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.27.639.879 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
0.27.639.885 I slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 170240, n_keep = 0, task.n_tokens = 16
0.27.639.887 I slot update_slots: id 0 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
0.27.711.086 I sched_reserve: reserving ...
0.27.740.184 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 990.95 MiB on device 0: cudaMalloc failed: out of memory
0.27.740.189 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 1039083520
0.27.740.189 E graph_reserve: failed to allocate compute buffers
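As a quick check of the units, the byte count reported by ggml_gallocr_reserve_n_impl can be converted by hand. The snippet below only does that arithmetic on the number from the log; the variable names are my own.

```python
# Byte count taken from the ggml_gallocr_reserve_n_impl log line above.
size_bytes = 1039083520

size_mib = size_bytes / 2**20  # mebibytes
size_gib = size_bytes / 2**30  # gibibytes

print(f"{size_mib:.2f} MiB")  # matches the 990.95 MiB in the cudaMalloc error
print(f"{size_gib:.3f} GiB")  # just under 1 GiB
```

So the failing allocation is a compute buffer of a bit under 1 GiB, which is small in absolute terms but still more than what is left once the main model and its KV cache have been fit into the 3090's VRAM.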
Long logs:
log-fit-off.1.log
log-fit-on.1.log