Name and Version
llama-server --version
version: 9837 (b3fed31)
built with GNU 16.1.1 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
build/bin/llama-server -hf bartowski/Qwen_Qwen3.5-0.8B-GGUF:Q4_K_M -dev none --port 8899
Problem description & steps to reproduce
When llama-server is built with CUDA support and a CUDA device is present whose memory is fully or almost fully utilized, attempting to load a model on the CPU crashes the server.
Expected behavior: loading a model on the CPU with '-dev none' should not be affected by the CUDA0 device's memory being fully utilized.
Precondition:
Build llama-server with -DGGML_CUDA=ON and use that build for all commands below.
To reproduce the crash, CUDA0 must have its memory fully or almost fully utilized. To get CUDA0 into this state, I run the following command and increment the -ngl argument until it crashes, then back off by 1, which leaves CUDA0 memory nearly full.
Command I used to get CUDA0 near full memory:
/build/bin/llama-server -hf bartowski/Qwen_Qwen3.6-27B-GGUF:Q4_K_L -dev CUDA0 -c 4096 -ngl 29 --port 8888 --verbose --parallel 1
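The manual "increment -ngl until it fails, then back off by 1" procedure can be sketched as a small search loop. This is only an illustration: `fits` is a hypothetical callback standing in for "launch llama-server with this -ngl value and report whether it started", and `max_layers` is an assumed upper bound, not a value taken from the model.

```python
from typing import Callable


def largest_fitting_ngl(fits: Callable[[int], bool], max_layers: int) -> int:
    """Return the largest -ngl value in [0, max_layers] for which fits() is True.

    Mirrors the manual procedure: raise -ngl one step at a time until the
    load fails, then keep the last value that worked.
    """
    best = 0
    for ngl in range(1, max_layers + 1):
        if not fits(ngl):
            break  # first failing value; the previous one is the answer
        best = ngl
    return best


if __name__ == "__main__":
    # Hypothetical stand-in for a real launch check: pretend 29 layers fit
    # and 30 do not, as in this report.
    print(largest_fitting_ngl(lambda n: n <= 29, max_layers=64))
```

In a real run, `fits` would have to start the server, wait for it to either come up or abort, and shut it down again before the next attempt.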
Then use nvidia-smi to check that memory is near full:
nvidia-smi
Mon Jun 29 12:14:18 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.80                 Driver Version: 595.80         CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:0A:00.0 Off |                  N/A |
| 30%   39C    P2             96W /  320W |    9666MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            5153    C+G   /usr/bin/kwin_wayland                    12MiB |
|    0   N/A  N/A           73862      C   ./build/bin/llama-server               9610MiB |
+-----------------------------------------------------------------------------------------+
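To check "near full" without reading the table by eye, the used and total figures can be pulled from nvidia-smi's CSV query mode. The sketch below is a minimal helper, not part of llama.cpp; the function names are made up for illustration, and the example values are the ones from the table above.

```python
import subprocess
from typing import Tuple


def parse_memory_csv(line: str) -> Tuple[int, int]:
    """Parse one 'used, total' line (MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory(index: int = 0) -> Tuple[int, int]:
    """Ask nvidia-smi for (used MiB, total MiB) of one GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_csv(out.strip().splitlines()[0])


if __name__ == "__main__":
    # Values reported above: 9666MiB used of 10240MiB.
    used, total = parse_memory_csv("9666, 10240")
    print(f"{used / total:.1%} of GPU memory in use")
```

With the figures from this report the device is at roughly 94% utilization, leaving under 600 MiB free.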
Reproduce the crash:
Once you have verified that the CUDA device's memory is nearly full, run the following:
build/bin/llama-server -hf bartowski/Qwen_Qwen3.5-0.8B-GGUF:Q4_K_M -dev none --port 8899
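This should crash with a CUDA out-of-memory error despite the CUDA device not being selected. According to the backtrace in the log output, the abort is raised by the cudaMemGetInfo call in ggml_backend_cuda_device_get_memory, which is reached from common_params_print_info.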
First Bad Commit
No response
Relevant log output
output of build/bin/llama-server -hf bartowski/Qwen_Qwen3.5-0.8B-GGUF:Q4_K_M -dev none --port 8899
0.00.328.112 I cmn common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
/home/chris/git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:104: CUDA error
0.00.452.557 E CUDA error: out of memory
0.00.452.560 E current device: 0, in function ggml_backend_cuda_device_get_memory at /home/chris/git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:4991
0.00.452.561 E cudaMemGetInfo(free, total)
[New LWP 73980]
[New LWP 73979]
[New LWP 73978]
[New LWP 73977]
[New LWP 73976]
[New LWP 73975]
This GDB supports auto-downloading debuginfo from the following URLs:
<ima:enforcing>
<https://debuginfod.fedoraproject.org/>
<ima:ignore>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f5696c82412 in __syscall_cancel_arch () from /lib64/libc.so.6
#0 0x00007f5696c82412 in __syscall_cancel_arch () from /lib64/libc.so.6
#1 0x00007f5696c7662c in __internal_syscall_cancel () from /lib64/libc.so.6
#2 0x00007f5696c76674 in __syscall_cancel () from /lib64/libc.so.6
#3 0x00007f5696ce624f in wait4 () from /lib64/libc.so.6
#4 0x00007f56a2b3401b in ggml_print_backtrace () from /home/chris/git/llama.cpp/build/bin/libggml-base.so.0
#5 0x00007f56a2b3418d in ggml_abort () from /home/chris/git/llama.cpp/build/bin/libggml-base.so.0
#6 0x00007f56a04107e3 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/chris/git/llama.cpp/build/bin/libggml-cuda.so.0
#7 0x00007f56a0411618 in ggml_backend_cuda_device_get_memory(ggml_backend_device*, unsigned long*, unsigned long*) () from /home/chris/git/llama.cpp/build/bin/libggml-cuda.so.0
#8 0x00007f56a2d5786f in common_params_print_info(common_params const&, bool) () from /home/chris/git/llama.cpp/build/bin/libllama-common.so.0
#9 0x00007f56a3256771 in llama_server(int, char**) () from /home/chris/git/llama.cpp/build/bin/libllama-server-impl.so
#10 0x00007f5696c0a681 in __libc_start_call_main () from /lib64/libc.so.6
#11 0x00007f5696c0a798 in __libc_start_main_impl () from /lib64/libc.so.6
#12 0x00000000004003b5 in _start ()
[Inferior 1 (process 73973) detached]
Aborted (core dumped) /home/chris/git/llama.cpp/build/bin/llama-server -hf bartowski/Qwen_Qwen3.5-0.8B-GGUF:Q4_K_M -dev none --port 8899