Skip to content

Eval bug: Regression: CUDA invalid argument after idle model reload (startup inference works normally) #24694

Description

@theLittleStone

Name and Version

version: 9668 (32120c1)
built with MSVC 19.44.35225.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 7950X & NVIDIA RTX 3090

Models

huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_NL.gguf

Problem description & steps to reproduce

Problem description

The model loads and runs normally after server startup. However, if the model remains idle long enough to be unloaded from VRAM (or otherwise released by the server's idle memory management), subsequent requests cause a CUDA failure during the first inference after the model is reloaded.

The error occurs consistently after the idle → unload (sleeping mode) → reload cycle and does not occur immediately after startup.

The failure happens during the forward pass with the following error:

ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: invalid argument

The model can be loaded successfully, all initialization steps complete normally, and the crash only appears when processing the first request after the model has been unloaded and reloaded.

Steps to reproduce

The full settings will be listed here:

Settings
[*]
models-max = 1
sleep-idle-seconds = 120

ctx-size = 72000

ngl = -1

threads = 12
threads-batch = 12

mmap = true
flash-attn = auto

cache-type-k = q8_0
cache-type-v = q8_0

[Qwen27B/Qwen3.6-27B-MTP]
m = ./...
mmproj = ./...

spec-type = ngram-mod,draft-mtp

spec-ngram-mod-n-match = 24
spec-ngram-mod-n-min = 16
spec-ngram-mod-n-max = 48

spec-draft-n-max = 3

temperature=1.0
top-p=0.95
top-k=20
min-p=0.0
presence-penalty=0.0
repeat-penalty=1.0

Then here is the steps:

  • Set sleep-idle-seconds for models, where the model I used is Qwen3.6-27B-MTP-IQ4_NL.
  • Load the model successfully and verify that inference works normally. (Triggering this issue may require a longer context).
  • Leave the server idle long enough for the model to be unloaded from VRAM (or enter sleeping state).
  • Send a new inference request.
  • Observe that the model reloads successfully.
  • During the first generation request after reload, CUDA will fail.

Additional problem in older version

In versions earlier than b9557 ( the version I used is b9555 ), the issue does not result in an immediate error. Instead, after exiting sleep mode and reloading the model, inference performance drops significantly, suggesting that the model may not be fully loaded onto the GPU or that some computations are no longer being executed on the GPU as expected. The corresponding logs are also included in the Log Output section below.

First Bad Commit

b9557

Relevant log output

Logs
�[34m4.14.029.720�[0m �[32mI �[0msrv  proxy_reques: proxying request to model Qwen27B/Qwen3.6-27B-MTP on port 55363
[55363] 3.57.931.987 I que    start_loop: exiting sleeping state
[55363] cmd_child_to_router:ready
[55363] 3.57.932.161 I srv  handle_sleep: server is exiting sleeping state
[55363] 3.57.932.163 I srv    load_model: loading model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[55363] 3.59.078.664 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 1161.02 MiB
[55363] 3.59.847.894 I srv    load_model: [spec] estimated memory usage of MTP context is 420.52 MiB
[55363] 3.59.847.907 I common_init_result: fitting params to device memory ...
[55363] 3.59.847.908 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[55363] 4.05.938.434 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[55363] 4.06.055.947 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[55363] 4.06.106.352 I srv    load_model: creating MTP draft context against the target model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[55363] 4.06.106.391 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[55363] 4.06.133.406 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
[55363] 4.06.133.409 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
[55363] 4.06.133.410 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
[55363]
[55363] 4.06.915.276 I srv    load_model: loaded multimodal model, 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL-mmproj.gguf'
[55363] 4.06.915.283 I srv    load_model: initializing slots, n_slots = 1
[55363] 4.06.938.038 I common_context_can_seq_rm: the context supports bounded partial sequence removal
[55363] 4.06.970.181 I common_speculative_impl_ngram_mod: adding speculative implementation 'ngram-mod'
[55363] 4.06.970.185 I common_speculative_impl_ngram_mod: - n_match=24, n_max=48, n_min=16
[55363] 4.06.970.189 I common_speculative_impl_ngram_mod: - mod size=4194304 (16.000 MB)
[55363] 4.06.970.194 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
[55363] 4.06.970.196 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
[55363] 4.06.970.198 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
[55363] 4.07.000.134 I srv    load_model: speculative decoding context initialized
[55363] 4.07.000.138 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 72192
[55363] 4.07.000.213 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
[55363] 4.07.000.216 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
[55363] 4.07.000.218 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
[55363] 4.07.000.219 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
[55363] 4.07.000.227 I srv  update_slots: all slots are idle
[55363] 4.07.028.119 I srv  params_from_: Chat format: peg-native
[55363] 4.07.028.243 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
[55363] 4.07.028.246 I srv  get_availabl: updating prompt cache
[55363] 4.07.028.248 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
[55363] 4.07.028.250 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 72192 tokens, 8589934592 est)
[55363] 4.07.028.251 I srv  get_availabl: prompt cache update took 0.00 ms
[55363] 4.07.028.280 I slot launch_slot_: id  0 | task 1680 | processing task, is_child = 0
[55363] 4.07.040.967 C:\Users\theLittleStone\Projects\AI\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:103: CUDA error
[55363] E ggml_cuda_compute_forward: MUL_MAT failed
[55363] 4.07.040.975 E CUDA error: invalid argument
[55363] 4.07.040.977 E   current device: 0, in function ggml_cuda_compute_forward at C:\Users\theLittleStone\Projects\AI\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:3163
[55363] 4.07.040.977 E   err
�[34m4.26.218.224�[0m �[31mE srv   operator (): http client error: Failed to read connection
Logs ( old version ) ```

[56280] 4.11.090.386 I srv handle_sleep: server is exiting sleeping state
[56280] 4.11.090.387 I srv load_model: loading model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[56280] 4.12.240.981 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 1161.02 MiB
[56280] 4.12.591.172 I srv load_model: [spec] estimated memory usage of MTP context is 420.52 MiB
[56280] 4.12.591.189 I common_init_result: fitting params to device memory ...
[56280] 4.12.591.190 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
[56280] 4.19.470.426 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[56280] 4.19.627.377 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[56280] 4.19.839.289 I srv load_model: creating MTP draft context against the target model 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL.gguf'
[56280] 4.19.839.333 W llama_context: n_ctx_seq (72192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
[56280] 4.19.867.800 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
[56280] 4.19.867.803 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
[56280] 4.19.867.803 W load_hparams: more info: #16842
[56280]
[56280] 4.20.655.546 I srv load_model: loaded multimodal model, 'D:/LLMs/Qwen3.6-27B-MTP-IQ4_NL-mmproj.gguf'
[56280] 4.20.655.552 I srv load_model: initializing slots, n_slots = 1
[56280] 4.20.733.251 I common_context_can_seq_rm: the context supports bounded partial sequence removal
[56280] 4.20.765.380 I common_speculative_impl_ngram_mod: adding speculative implementation 'ngram-mod'
[56280] 4.20.765.383 I common_speculative_impl_ngram_mod: - n_match=24, n_max=48, n_min=16
[56280] 4.20.765.386 I common_speculative_impl_ngram_mod: - mod size=4194304 (16.000 MB)
[56280] 4.20.765.389 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
[56280] 4.20.765.390 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
[56280] 4.20.765.391 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
[56280] 4.20.794.929 I srv load_model: speculative decoding context initialized
[56280] 4.20.794.933 I slot load_model: id 0 | task -1 | new slot, n_ctx = 72192
[56280] 4.20.794.998 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
[56280] 4.20.795.001 I srv load_model: use --cache-ram 0 to disable the prompt cache
[56280] 4.20.802.534 I srv load_model: for more info see #16391
[56280] 4.20.802.537 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
[56280] 4.20.802.547 I srv update_slots: all slots are idle
[56280] 4.20.838.673 I srv params_from_: Chat format: peg-native
[56280] 4.20.838.856 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
[56280] 4.20.838.859 I srv get_availabl: updating prompt cache
[56280] 4.20.838.863 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
[56280] 4.20.838.865 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 72192 tokens, 8589934592 est)
[56280] 4.20.838.866 I srv get_availabl: prompt cache update took 0.01 ms
[56280] 4.20.838.913 I slot launch_slot_: id 0 | task 1973 | processing task, is_child = 0
[56280] 4.25.371.386 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 4096, progress = 0.49, t = 4.53 s / 903.70 tokens per second
[56280] 4.27.696.266 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 6144, progress = 0.74, t = 6.86 s / 895.97 tokens per second
[56280] 4.29.735.945 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 7791, progress = 0.94, t = 8.90 s / 875.69 tokens per second
[56280] 4.29.763.765 I slot create_check: id 0 | task 1973 | created context checkpoint 1 of 32 (pos_min = 7790, pos_max = 7790, n_tokens = 7791, size = 180.208 MiB)
[56280] 4.30.355.618 I slot print_timing: id 0 | task 1973 | prompt processing, n_tokens = 8303, progress = 1.00, t = 9.52 s / 872.47 tokens per second
[56280] 4.30.383.988 I slot create_check: id 0 | task 1973 | created context checkpoint 2 of 32 (pos_min = 8302, pos_max = 8302, n_tokens = 8303, size = 182.218 MiB)
[56280] 4.30.485.285 I begin: ngram_mod occupancy = 5955/4194304 (0.00)
[56280] 4.34.885.250 I slot print_timing: id 0 | task 1973 | n_decoded = 100, tg = 22.73 t/s
[56280] 4.38.010.444 I slot print_timing: id 0 | task 1973 | n_decoded = 189, tg = 25.12 t/s
...
[56280] 6.03.464.444 I slot print_timing: id 0 | task 1973 | n_decoded = 3309, tg = 35.59 t/s
[56280] 6.05.774.810 I slot print_timing: id 0 | task 1973 | prompt eval time = 9646.85 ms / 8307 tokens ( 1.16 ms per token, 861.11 tokens per second)
[56280] 6.05.774.816 I slot print_timing: id 0 | task 1973 | eval time = 95288.92 ms / 3382 tokens ( 28.18 ms per token, 35.49 tokens per second)
[56280] 6.05.774.817 I slot print_timing: id 0 | task 1973 | total time = 104935.77 ms / 11689 tokens
[56280] 6.05.774.817 I slot print_timing: id 0 | task 1973 | graphs reused = 740
[56280] 6.05.774.818 I slot print_timing: id 0 | task 1973 | draft acceptance = 0.69849 ( 2590 accepted / 3708 generated)
[56280] 6.05.774.836 I statistics ngram-mod: #calls(b,g,a) = 1 791 31, #gen drafts = 31, #acc drafts = 31, #gen tokens = 1428, #acc tokens = 873, dur(b,g,a) = 0.370, 1.646, 0.015 ms
[56280] 6.05.774.841 I statistics draft-mtp: #calls(b,g,a) = 1 760 760, #gen drafts = 760, #acc drafts = 669, #gen tokens = 2280, #acc tokens = 1717, dur(b,g,a) = 0.001, 6319.715, 0.773 ms
[56280] 6.05.775.201 I slot release: id 0 | task 1973 | stop processing: n_tokens = 11688, truncated = 0
[56280] 6.05.775.225 I srv update_slots: all slots are idle �[34m6.18.294.761�[0m �[32mI �[0msrv proxy_reques: proxying request to model Qwen27B/Qwen3.6-27B-MTP on port 56280
[56280] 6.05.817.120 I srv params_from_: Chat format: peg-native
[56280] 6.05.819.299 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 111031212
[56280] 6.05.819.304 I srv get_availabl: updating prompt cache
[56280] 6.05.820.428 W srv prompt_save: - saving prompt with length 11688, total state size = 583.807 MiB (draft: 45.879 MiB)
[56280] 6.05.938.037 I srv load: - looking for better prompt, base f_keep = 0.000, sim = 0.001
[56280] 6.05.938.044 I srv update: - cache state: 1 prompts, 946.233 MiB (limits: 8192.000 MiB, 72192 tokens, 101188 est)
[56280] 6.05.938.045 I srv update: - prompt 00000163110B7CD0: 11688 tokens, checkpoints: 2, 946.233 MiB
[56280] 6.05.938.048 I srv get_availabl: prompt cache update took 118.74 ms
[56280] 6.05.938.177 I slot launch_slot_: id 0 | task 2790 | processing task, is_child = 0
[56280] 6.05.938.192 I slot update_slots: id 0 | task 2790 | Checking checkpoint with [8302, 8302] against 3...
[56280] 6.05.938.193 I slot update_slots: id 0 | task 2790 | Checking checkpoint with [7790, 7790] against 3...
[56280] 6.05.938.194 W slot update_slots: id 0 | task 2790 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com//pull/13194#issuecomment-2868343055)
[56280] 6.05.938.197 W slot update_slots: id 0 | task 2790 | erased invalidated context checkpoint (pos_min = 7790, pos_max = 7790, n_tokens = 7791, n_swa = 0, pos_next = 0, size = 180.208 MiB)
[56280] 6.05.942.834 W slot update_slots: id 0 | task 2790 | erased invalidated context checkpoint (pos_min = 8302, pos_max = 8302, n_tokens = 8303, n_swa = 0, pos_next = 0, size = 182.218 MiB)
[56280] 6.07.443.957 I slot create_check: id 0 | task 2790 | created context checkpoint 1 of 32 (pos_min = 1223, pos_max = 1223, n_tokens = 1224, size = 154.431 MiB)
[56280] 6.09.726.661 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 3272, progress = 0.79, t = 3.79 s / 863.68 tokens per second
[56280] 6.10.200.108 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 3629, progress = 0.88, t = 4.26 s / 851.50 tokens per second
[56280] 6.10.225.165 I slot create_check: id 0 | task 2790 | created context checkpoint 2 of 32 (pos_min = 3628, pos_max = 3628, n_tokens = 3629, size = 163.871 MiB)
[56280] 6.10.806.087 I slot print_timing: id 0 | task 2790 | prompt processing, n_tokens = 4141, progress = 1.00, t = 4.87 s / 850.68 tokens per second
[56280] 6.10.833.328 I slot create_check: id 0 | task 2790 | created context checkpoint 3 of 32 (pos_min = 4140, pos_max = 4140, n_tokens = 4141, size = 165.881 MiB)
[56280] 6.10.926.252 I begin: ngram_mod occupancy = 8862/4194304 (0.00)
[56280] 6.14.281.611 I slot print_timing: id 0 | task 2790 | n_decoded = 102, tg = 30.41 t/s
[56280] 6.17.340.648 I slot print_timing: id 0 | task 2790 | n_decoded = 180, tg = 28.07 t/s
...
[56280] 8.29.197.005 I slot print_timing: id 0 | task 2790 | n_decoded = 3707, tg = 26.81 t/s
[56280] 8.30.117.777 I slot print_timing: id 0 | task 2790 | prompt eval time = 4988.76 ms / 4145 tokens ( 1.20 ms per token, 830.87 tokens per second)
[56280] 8.30.117.781 I slot print_timing: id 0 | task 2790 | eval time = 139190.75 ms / 3737 tokens ( 37.25 ms per token, 26.85 tokens per second)
[56280] 8.30.117.782 I slot print_timing: id 0 | task 2790 | total time = 144179.51 ms / 7882 tokens
[56280] 8.30.117.782 I slot print_timing: id 0 | task 2790 | graphs reused = 1958
[56280] 8.30.117.783 I slot print_timing: id 0 | task 2790 | draft acceptance = 0.56290 ( 2479 accepted / 4404 generated)
[56280] 8.30.117.796 I statistics ngram-mod: #calls(b,g,a) = 2 2049 45, #gen drafts = 45, #acc drafts = 45, #gen tokens = 2100, #acc tokens = 1006, dur(b,g,a) = 0.631, 4.065, 1.182 ms
[56280] 8.30.117.800 I statistics draft-mtp: #calls(b,g,a) = 2 2004 2004, #gen drafts = 2004, #acc drafts = 1646, #gen tokens = 6012, #acc tokens = 4063, dur(b,g,a) = 0.002, 16293.805, 2.227 ms
[56280] 8.30.117.954 I slot release: id 0 | task 2790 | stop processing: n_tokens = 7882, truncated = 0
[56280] 8.30.117.980 I srv update_slots: all slots are idle
[56280] 8.50.327.489 I que start_loop: entering sleeping state
[56280] cmd_child_to_router:sleep
[56280] 8.50.327.750 I srv handle_sleep: server is entering sleeping state

</details>



Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions