Name and Version
llama-server.exe --model "h:\DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf"
Version 6f4f53f
Windows 10, CUDA + Vulkan
Bad performance.
6.09.148.279 I sched_reserve: CUDA0 compute buffer size = 16699.85 MiB
6.09.148.286 I sched_reserve: Vulkan1 compute buffer size = 786.86 MiB
6.09.148.288 I sched_reserve: Vulkan2 compute buffer size = 781.35 MiB
6.09.148.289 I sched_reserve: CUDA_Host compute buffer size = 273.84 MiB
6.09.148.290 I sched_reserve: graph nodes = 33605
6.09.148.291 I sched_reserve: graph splits = 56 (with bs=512), 14 (with bs=1)
6.09.148.293 I sched_reserve: reserve took 519.97 ms, sched copies = 1
6.18.524.862 I cmn common_conte: the context does not support partial sequence removal
6.18.596.611 I srv load_model: speculative decoding will use checkpoints
6.18.596.619 I srv load_model: initializing, n_slots = 1, n_ctx_slot = 258560, kv_unified = 'false'
6.18.598.013 I spec common_specu: no implementations specified for speculative decoding
6.18.599.410 I slot load_model: id 0 | task -1 | new slot, n_ctx = 258560
6.18.599.500 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
6.18.599.500 I srv load_model: use --cache-ram 0 to disable the prompt cache
6.18.599.502 I srv load_model: for more info see #16391
6.18.599.503 I srv load_model: context checkpoints enabled, max = 32, min spacing = 8192
6.18.599.527 I srv init: idle slots will be saved to prompt cache upon starting a new task
6.18.650.147 I srv init: init: chat template, example_format: '<|begin▁of▁sentence|>You are a helpful assistant<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
6.18.673.746 I srv init: init: chat template, thinking = 1
6.18.674.367 I srv llama_server: model loaded
6.18.674.376 I srv llama_server: listening on http://127.0.0.1:8080
6.18.674.900 I srv update_slots: all slots are idle
6.23.557.777 I srv operator (): chat format: peg-native
6.23.558.485 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
6.23.558.490 I srv get_availabl: updating prompt cache
6.23.558.497 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
6.23.558.503 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 258560 tokens, 8589934592 est)
6.23.558.505 I srv get_availabl: prompt cache update took 0.01 ms
6.23.558.790 I cmn common_reaso: activated, budget=512 tokens
6.23.559.008 I slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
6.23.559.023 I slot launch_slot_: id 0 | task -1 | sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 258560
top_k = 20, top_p = 0.950, min_p = 0.000, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 1.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
6.23.559.026 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
6.23.559.241 I slot operator (): id 0 | task 0 | new prompt, n_ctx_slot = 258560, n_keep = 0, task.n_tokens = 6
6.23.559.256 I slot operator (): id 0 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
6.23.561.307 I srv stream_sessi: conv_id=de6e77d1-76eb-4974-b749-eac9c4e27342 (empty=0)
6.25.263.009 I slot operator (): id 0 | task 0 | cached n_tokens = 1, memory_seq_rm [1, end)
6.25.279.560 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 32 (pos_min = 0, pos_max = 0, n_tokens = 1, size = 11.687 MiB)
6.25.900.482 I slot operator (): id 0 | task 0 | cached n_tokens = 2, memory_seq_rm [2, end)
6.25.900.513 I slot init_sampler: id 0 | task 0 | init sampler, took 0.00 ms, tokens: text = 6, total = 6
7.19.001.577 I cmn common_reaso: forced into forcing state (manual transition)
7.19.565.944 I cmn common_reaso: forced sequence complete, done
7.27.492.457 I slot print_timing: id 0 | task 0 | n_decoded = 100, tg = 1.77 t/s, tg_3s = 1.77 t/s
7.30.928.038 I slot print_timing: id 0 | task 0 | n_decoded = 106, tg = 1.77 t/s, tg_3s = 1.75 t/s
7.34.397.876 I slot print_timing: id 0 | task 0 | n_decoded = 112, tg = 1.77 t/s, tg_3s = 1.73 t/s
7.37.779.867 I slot print_timing: id 0 | task 0 | n_decoded = 118, tg = 1.77 t/s, tg_3s = 1.77 t/s
7.41.477.249 I slot print_timing: id 0 | task 0 | n_decoded = 124, tg = 1.76 t/s, tg_3s = 1.62 t/s
7.42.109.973 I slot print_timing: id 0 | task 0 | prompt eval time = 7447.31 ms / 6 tokens ( 1241.22 ms per token, 0.81 tokens per second)
7.42.109.981 I slot print_timing: id 0 | task 0 | eval time = 71103.40 ms / 125 tokens ( 568.83 ms per token, 1.76 tokens per second)
7.42.109.982 I slot print_timing: id 0 | task 0 | total time = 78550.71 ms / 131 tokens
7.42.109.983 I slot print_timing: id 0 | task 0 | graphs reused = 122
7.42.109.999 I slot release: id 0 | task 0 | stop processing: n_tokens = 130, truncated = 0
7.42.110.010 I srv update_slots: all slots are idle
7.42.120.361 I srv close: stream_pipe close: skip drain (done=1 cancelled=0) conv=de6e77d1-76eb-4974-b749-eac9c4e27342
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
start C:\llm\llamads4\build\bin\Release\llama-server.exe --model "h:\DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf" --temp 1 --top-p 0.95 --ctx-size 258320 --top-k 20 --min-p 0.00 --no-warmup --no-mmap --fit on --parallel 1 --cont-batching --reasoning on --n-cpu-moe 0 -lv 4
Problem description & steps to reproduce
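Generation runs at only about 1.7 tokens/sec (the log above reports tg = 1.76 t/s, with prompt eval at 0.81 t/s for a 6-token prompt).
I assume only CUDA is supported for this model for now and that the Vulkan devices are what slow it down, but this isn't stated anywhere.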
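For reference, the throughput can be cross-checked directly from the two summary `print_timing` lines in the log. This is only a minimal sketch: the regular expression and the helper name `tokens_per_second` are my own and assume the log line format shown above; they are not part of llama.cpp.

```python
import re

# The two summary timing lines, copied verbatim from the log above.
LOG = """\
7.42.109.973 I slot print_timing: id 0 | task 0 | prompt eval time = 7447.31 ms / 6 tokens ( 1241.22 ms per token, 0.81 tokens per second)
7.42.109.981 I slot print_timing: id 0 | task 0 | eval time = 71103.40 ms / 125 tokens ( 568.83 ms per token, 1.76 tokens per second)
"""

# Assumed line shape: "| <kind> time = <ms> ms / <n> tokens".
TIMING_RE = re.compile(
    r"\|\s*(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens"
)


def tokens_per_second(log_text):
    """Return {kind: tokens per second}, recomputed from elapsed ms and token count."""
    rates = {}
    for kind, ms, n_tokens in TIMING_RE.findall(log_text):
        rates[kind] = int(n_tokens) / (float(ms) / 1000.0)
    return rates


rates = tokens_per_second(LOG)
for kind, rate in rates.items():
    print(f"{kind}: {rate:.2f} tokens/sec")
```

The recomputed values agree with what the server prints, so the slow rate is a property of the run itself and not of how the rate is reported.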
First Bad Commit
No response
Relevant log output
Logs