
Misc. bug: DS4 performance with cuda+vulkan #25146

Description

@Fringe210

Name and Version

llama-server.exe --model "h:\DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf"

Version 6f4f53f

Windows 10, CUDA + Vulkan.

Performance is poor. Startup and generation log:

6.09.148.279 I sched_reserve: CUDA0 compute buffer size = 16699.85 MiB
6.09.148.286 I sched_reserve: Vulkan1 compute buffer size = 786.86 MiB
6.09.148.288 I sched_reserve: Vulkan2 compute buffer size = 781.35 MiB
6.09.148.289 I sched_reserve: CUDA_Host compute buffer size = 273.84 MiB
6.09.148.290 I sched_reserve: graph nodes = 33605
6.09.148.291 I sched_reserve: graph splits = 56 (with bs=512), 14 (with bs=1)
6.09.148.293 I sched_reserve: reserve took 519.97 ms, sched copies = 1
6.18.524.862 I cmn common_conte: the context does not support partial sequence removal
6.18.596.611 I srv load_model: speculative decoding will use checkpoints
6.18.596.619 I srv load_model: initializing, n_slots = 1, n_ctx_slot = 258560, kv_unified = 'false'
6.18.598.013 I spec common_specu: no implementations specified for speculative decoding
6.18.599.410 I slot load_model: id 0 | task -1 | new slot, n_ctx = 258560
6.18.599.500 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
6.18.599.500 I srv load_model: use --cache-ram 0 to disable the prompt cache
6.18.599.502 I srv load_model: for more info see #16391
6.18.599.503 I srv load_model: context checkpoints enabled, max = 32, min spacing = 8192
6.18.599.527 I srv init: idle slots will be saved to prompt cache upon starting a new task
6.18.650.147 I srv init: init: chat template, example_format: '<|begin▁of▁sentence|>You are a helpful assistant<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
6.18.673.746 I srv init: init: chat template, thinking = 1
6.18.674.367 I srv llama_server: model loaded
6.18.674.376 I srv llama_server: listening on http://127.0.0.1:8080
6.18.674.900 I srv update_slots: all slots are idle
6.23.557.777 I srv operator (): chat format: peg-native
6.23.558.485 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
6.23.558.490 I srv get_availabl: updating prompt cache
6.23.558.497 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
6.23.558.503 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 258560 tokens, 8589934592 est)
6.23.558.505 I srv get_availabl: prompt cache update took 0.01 ms
6.23.558.790 I cmn common_reaso: activated, budget=512 tokens
6.23.559.008 I slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
6.23.559.023 I slot launch_slot_: id 0 | task -1 | sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 258560
top_k = 20, top_p = 0.950, min_p = 0.000, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 1.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
6.23.559.026 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
6.23.559.241 I slot operator (): id 0 | task 0 | new prompt, n_ctx_slot = 258560, n_keep = 0, task.n_tokens = 6
6.23.559.256 I slot operator (): id 0 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
6.23.561.307 I srv stream_sessi: conv_id=de6e77d1-76eb-4974-b749-eac9c4e27342 (empty=0)
6.25.263.009 I slot operator (): id 0 | task 0 | cached n_tokens = 1, memory_seq_rm [1, end)
6.25.279.560 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 32 (pos_min = 0, pos_max = 0, n_tokens = 1, size = 11.687 MiB)
6.25.900.482 I slot operator (): id 0 | task 0 | cached n_tokens = 2, memory_seq_rm [2, end)
6.25.900.513 I slot init_sampler: id 0 | task 0 | init sampler, took 0.00 ms, tokens: text = 6, total = 6
7.19.001.577 I cmn common_reaso: forced into forcing state (manual transition)
7.19.565.944 I cmn common_reaso: forced sequence complete, done
7.27.492.457 I slot print_timing: id 0 | task 0 | n_decoded = 100, tg = 1.77 t/s, tg_3s = 1.77 t/s
7.30.928.038 I slot print_timing: id 0 | task 0 | n_decoded = 106, tg = 1.77 t/s, tg_3s = 1.75 t/s
7.34.397.876 I slot print_timing: id 0 | task 0 | n_decoded = 112, tg = 1.77 t/s, tg_3s = 1.73 t/s
7.37.779.867 I slot print_timing: id 0 | task 0 | n_decoded = 118, tg = 1.77 t/s, tg_3s = 1.77 t/s
7.41.477.249 I slot print_timing: id 0 | task 0 | n_decoded = 124, tg = 1.76 t/s, tg_3s = 1.62 t/s
7.42.109.973 I slot print_timing: id 0 | task 0 | prompt eval time = 7447.31 ms / 6 tokens ( 1241.22 ms per token, 0.81 tokens per second)
7.42.109.981 I slot print_timing: id 0 | task 0 | eval time = 71103.40 ms / 125 tokens ( 568.83 ms per token, 1.76 tokens per second)
7.42.109.982 I slot print_timing: id 0 | task 0 | total time = 78550.71 ms / 131 tokens
7.42.109.983 I slot print_timing: id 0 | task 0 | graphs reused = 122
7.42.109.999 I slot release: id 0 | task 0 | stop processing: n_tokens = 130, truncated = 0
7.42.110.010 I srv update_slots: all slots are idle
7.42.120.361 I srv close: stream_pipe close: skip drain (done=1 cancelled=0) conv=de6e77d1-76eb-4974-b749-eac9c4e27342
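
The throughput figures in the `print_timing` lines above can be recomputed from the raw millisecond and token counts. The following is a minimal sketch, assuming the summary lines keep the exact `... time = X ms / N tokens` shape shown in this log; the helper name `parse_timings` and the regular expression are illustrative and not part of llama.cpp.

```python
import re

# Matches the summary lines emitted by print_timing, for example:
# "eval time = 71103.40 ms / 125 tokens ( 568.83 ms per token, ...)"
# Assumption: the line format is the one shown in the log of this report.
TIMING_RE = re.compile(
    r"(?P<phase>prompt eval time|eval time)\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens"
)


def parse_timings(log_text):
    """Return {phase: (milliseconds, tokens, tokens_per_second)}."""
    results = {}
    for match in TIMING_RE.finditer(log_text):
        ms = float(match.group("ms"))
        tokens = int(match.group("tokens"))
        results[match.group("phase")] = (ms, tokens, tokens / (ms / 1000.0))
    return results


# The two summary lines copied from the log above.
sample = """\
7.42.109.973 I slot print_timing: id 0 | task 0 | prompt eval time = 7447.31 ms / 6 tokens ( 1241.22 ms per token, 0.81 tokens per second)
7.42.109.981 I slot print_timing: id 0 | task 0 | eval time = 71103.40 ms / 125 tokens ( 568.83 ms per token, 1.76 tokens per second)
"""

for phase, (ms, tokens, tps) in parse_timings(sample).items():
    print(f"{phase}: {ms / tokens:.2f} ms/token, {tps:.2f} tokens/s")
```

Run against the two lines above, this reproduces the reported values: roughly 0.81 tokens/s for prompt processing and 1.76 tokens/s for generation.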

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

start C:\llm\llamads4\build\bin\Release\llama-server.exe  --model "h:\DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf" --temp 1 --top-p 0.95 --ctx-size 258320 --top-k 20  --min-p 0.00 --no-warmup --no-mmap --fit on --parallel 1 --cont-batching --reasoning on --n-cpu-moe 0  -lv 4

Problem description & steps to reproduce

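Generation runs at only about 1.7 tokens/sec (the log reports 1.76 t/s for generation and 0.81 t/s for prompt processing).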

I imagine only CUDA is supported for this model for now, but that isn't stated anywhere.
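
One way to check whether the mixed CUDA + Vulkan split is the cause would be to run the same command restricted to the CUDA device and compare the `tg` figures. The sketch below only builds the two argument lists. It assumes the build accepts a `--device` option taking a comma-separated device list (check `llama-server --help`), and that the device name matches the `CUDA0` buffer name in the log. Both are assumptions, and `build_command` is a hypothetical helper. Whether the model fits on the CUDA device alone is not established by this report.

```python
# Arguments copied from the command line in this report.
BASE_ARGS = [
    r"C:\llm\llamads4\build\bin\Release\llama-server.exe",
    "--model",
    r"h:\DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf",
    "--temp", "1", "--top-p", "0.95", "--ctx-size", "258320",
    "--top-k", "20", "--min-p", "0.00",
    "--no-warmup", "--no-mmap", "--fit", "on",
    "--parallel", "1", "--cont-batching",
    "--reasoning", "on", "--n-cpu-moe", "0", "-lv", "4",
]


def build_command(devices=None):
    """Return the argument list, optionally restricted to the given devices."""
    cmd = list(BASE_ARGS)
    if devices:
        # Assumption: --device takes a comma-separated list of device names.
        cmd += ["--device", ",".join(devices)]
    return cmd


mixed = build_command()               # all detected devices, as in the report
cuda_only = build_command(["CUDA0"])  # hypothetical CUDA-only comparison run

print(" ".join(cuda_only[-2:]))
```

Either list can be passed to `subprocess.run` to launch the server; comparing the resulting `print_timing` lines between the two runs would show whether the Vulkan devices are the bottleneck.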

First Bad Commit

No response

Relevant log output

Logs
