Misc. bug: DS4 performance with cuda+vulkan #25146

### Name and Version

llama-server.exe  --model "h:\DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf"

Version 6f4f53f2b7da54fcdbbecaaa734337c337ad6176

Windows 10, CUDA + Vulkan.

Performance is poor: generation runs at about 1.76 tokens/s, as the server log shows:

```console
6.09.148.279 I sched_reserve:      CUDA0 compute buffer size = 16699.85 MiB
6.09.148.286 I sched_reserve:    Vulkan1 compute buffer size =   786.86 MiB
6.09.148.288 I sched_reserve:    Vulkan2 compute buffer size =   781.35 MiB
6.09.148.289 I sched_reserve:  CUDA_Host compute buffer size =   273.84 MiB
6.09.148.290 I sched_reserve: graph nodes  = 33605
6.09.148.291 I sched_reserve: graph splits = 56 (with bs=512), 14 (with bs=1)
6.09.148.293 I sched_reserve: reserve took 519.97 ms, sched copies = 1
6.18.524.862 I cmn  common_conte: the context does not support partial sequence removal
6.18.596.611 I srv    load_model: speculative decoding will use checkpoints
6.18.596.619 I srv    load_model: initializing, n_slots = 1, n_ctx_slot = 258560, kv_unified = 'false'
6.18.598.013 I spec common_specu: no implementations specified for speculative decoding
6.18.599.410 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 258560
6.18.599.500 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
6.18.599.500 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
6.18.599.502 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
6.18.599.503 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 8192
6.18.599.527 I srv          init: idle slots will be saved to prompt cache upon starting a new task
6.18.650.147 I srv          init: init: chat template, example_format: '<｜begin▁of▁sentence｜>You are a helpful assistant<｜User｜>Hello<｜Assistant｜></think>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜><think>'
6.18.673.746 I srv          init: init: chat template, thinking = 1
6.18.674.367 I srv  llama_server: model loaded
6.18.674.376 I srv  llama_server: listening on http://127.0.0.1:8080
6.18.674.900 I srv  update_slots: all slots are idle
6.23.557.777 I srv   operator (): chat format: peg-native
6.23.558.485 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
6.23.558.490 I srv  get_availabl: updating prompt cache
6.23.558.497 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
6.23.558.503 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 258560 tokens, 8589934592 est)
6.23.558.505 I srv  get_availabl: prompt cache update took 0.01 ms
6.23.558.790 I cmn  common_reaso: activated, budget=512 tokens
6.23.559.008 I slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
6.23.559.023 I slot launch_slot_: id  0 | task -1 | sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 258560
        top_k = 20, top_p = 0.950, min_p = 0.000, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 1.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
6.23.559.026 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
6.23.559.241 I slot  operator (): id  0 | task 0 | new prompt, n_ctx_slot = 258560, n_keep = 0, task.n_tokens = 6
6.23.559.256 I slot  operator (): id  0 | task 0 | cached n_tokens = 0, memory_seq_rm [0, end)
6.23.561.307 I srv  stream_sessi: conv_id=de6e77d1-76eb-4974-b749-eac9c4e27342 (empty=0)
6.25.263.009 I slot  operator (): id  0 | task 0 | cached n_tokens = 1, memory_seq_rm [1, end)
6.25.279.560 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 0, pos_max = 0, n_tokens = 1, size = 11.687 MiB)
6.25.900.482 I slot  operator (): id  0 | task 0 | cached n_tokens = 2, memory_seq_rm [2, end)
6.25.900.513 I slot init_sampler: id  0 | task 0 | init sampler, took 0.00 ms, tokens: text = 6, total = 6
7.19.001.577 I cmn  common_reaso: forced into forcing state (manual transition)
7.19.565.944 I cmn  common_reaso: forced sequence complete, done
7.27.492.457 I slot print_timing: id  0 | task 0 | n_decoded =    100, tg =   1.77 t/s, tg_3s =   1.77 t/s
7.30.928.038 I slot print_timing: id  0 | task 0 | n_decoded =    106, tg =   1.77 t/s, tg_3s =   1.75 t/s
7.34.397.876 I slot print_timing: id  0 | task 0 | n_decoded =    112, tg =   1.77 t/s, tg_3s =   1.73 t/s
7.37.779.867 I slot print_timing: id  0 | task 0 | n_decoded =    118, tg =   1.77 t/s, tg_3s =   1.77 t/s
7.41.477.249 I slot print_timing: id  0 | task 0 | n_decoded =    124, tg =   1.76 t/s, tg_3s =   1.62 t/s
7.42.109.973 I slot print_timing: id  0 | task 0 | prompt eval time =    7447.31 ms /     6 tokens ( 1241.22 ms per token,     0.81 tokens per second)
7.42.109.981 I slot print_timing: id  0 | task 0 |        eval time =   71103.40 ms /   125 tokens (  568.83 ms per token,     1.76 tokens per second)
7.42.109.982 I slot print_timing: id  0 | task 0 |       total time =   78550.71 ms /   131 tokens
7.42.109.983 I slot print_timing: id  0 | task 0 |    graphs reused =        122
7.42.109.999 I slot      release: id  0 | task 0 | stop processing: n_tokens = 130, truncated = 0
7.42.110.010 I srv  update_slots: all slots are idle
7.42.120.361 I srv         close: stream_pipe close: skip drain (done=1 cancelled=0) conv=de6e77d1-76eb-4974-b749-eac9c4e27342
```

### Operating systems

Windows

### Which llama.cpp modules do you know to be affected?

llama-server

### Command line

```shell
start C:\llm\llamads4\build\bin\Release\llama-server.exe  --model "h:\DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf" --temp 1 --top-p 0.95 --ctx-size 258320 --top-k 20  --min-p 0.00 --no-warmup --no-mmap --fit on --parallel 1 --cont-batching --reasoning on --n-cpu-moe 0  -lv 4
```

### Problem description & steps to reproduce

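Generation runs at only about 1.7 tokens/s (the log reports 1.76 t/s over 125 tokens), and prompt processing at 0.81 t/s for a 6-token prompt.

I imagine only the CUDA backend is supported for DS4 for now, but this isn't stated anywhere.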
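For anyone double-checking the numbers, the rates can be re-derived from the `print_timing` totals in the log (71103.40 ms for 125 generated tokens, 7447.31 ms for 6 prompt tokens). This is only a small sketch, assuming a POSIX shell with `awk` available (for example Git Bash on Windows); `tps` is a hypothetical helper name, not something llama.cpp provides:

```shell
# tps MS TOKENS: print tokens per second, rounded to two decimals.
# Hypothetical helper for checking the log's figures; not part of llama.cpp.
tps() {
    awk -v ms="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

tps 71103.40 125   # eval time (ms), generated tokens
tps 7447.31 6      # prompt eval time (ms), prompt tokens
```

Both results agree with the rates the server printed for generation and prompt processing.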

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>
