Name and Version
230d116 (HEAD -> master, tag: b6962, origin/master, origin/HEAD) improve CUDA cpy memory bandwidth when copying transposed tensor (#16841)
Select-String -Path CMakeCache.txt -Pattern "GGML_VULKAN"
CMakeCache.txt:804:GGML_VULKAN:BOOL=OFF
CMakeCache.txt:807:GGML_VULKAN_CHECK_RESULTS:BOOL=OFF
CMakeCache.txt:810:GGML_VULKAN_DEBUG:BOOL=OFF
CMakeCache.txt:813:GGML_VULKAN_MEMORY_DEBUG:BOOL=OFF
CMakeCache.txt:816:GGML_VULKAN_RUN_TESTS:BOOL=OFF
CMakeCache.txt:819:GGML_VULKAN_SHADERS_GEN_TOOLCHAIN:FILEPATH=
CMakeCache.txt:822:GGML_VULKAN_SHADER_DEBUG_INFO:BOOL=OFF
CMakeCache.txt:825:GGML_VULKAN_VALIDATE:BOOL=OFF
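Note that the cache above reports GGML_VULKAN:BOOL=OFF. For reference, a minimal reconfigure sketch for a Vulkan-enabled build, using the standard llama.cpp CMake option (the build directory name is just an assumption), would be:

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release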
Operating systems
Windows
GGML backends
Vulkan
Hardware
System Model HP ZBook Ultra G1a 14 inch Mobile Workstation PC
Processor AMD RYZEN AI MAX+ PRO 395 w/ Radeon 8060S, 3000 MHz, 16 Core(s), 32 Logical Processor(s)
BIOS Version/Date HP X89 Ver. 01.03.11, 28/8/2025
Models
lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf
Problem description & steps to reproduce
When running
..\llama.cpp\build\bin\Release\llama-server.exe -m C:/Users/franc/.lmstudio/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -c 16384 --temp 0.2 --port 8033 --host 127.0.0.1 -ngl -1
the model loads and then simply exits without any warning or error message.
See the attached file for the full output.
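To capture the last lines before the silent exit, a sketch of the same invocation with all PowerShell output streams redirected to a file (the log file name is just an example) is:

..\llama.cpp\build\bin\Release\llama-server.exe -m C:/Users/franc/.lmstudio/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -c 16384 --temp 0.2 --port 8033 --host 127.0.0.1 -ngl -1 *> server-exit-log.txt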
First Bad Commit
No response
Relevant log output
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 16384
llama_context: n_ctx_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 3.07 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 16384 cells
llama_kv_cache: Vulkan0 KV buffer size = 384.00 MiB
llama_kv_cache: size = 384.00 MiB ( 16384 cells, 12 layers, 4/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells
llama_kv_cache: Vulkan0 KV buffer size = 24.00 MiB
llama_kv_cache: size = 24.00 MiB ( 1024 cells, 12 layers, 4/1 seqs), K (f16): 12.00 MiB, V (f16): 12.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 398.38 MiB
llama_context: Vulkan_Host compute buffer size = 39.65 MiB
llama_context: graph nodes = 1352
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)