Compile bug: RTX 4060 Ti 16 GB gets poor performance with default compile options (GGML_CUDA_GRAPHS=ON)

### Git commit

master commit: 328874d054e0eb44591202a23c209cf02c18e3cb

### Operating systems

Linux

### GGML backends

CUDA

### Problem description & steps to reproduce

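Generation speed for qwen2.5-1.5b-instruct-q4_k_m.gguf goes from about 37 t/s with the default build to about 194 t/s after rebuilding with `-DGGML_CUDA_GRAPHS=OFF` and `-DCMAKE_CUDA_ARCHITECTURES="89"`.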

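Configuring with the default options
```bash
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.4
```
gives poor performance, and the GPU monitor shows `GPU 210MHz` while utilization is at 99%. Server log from that build: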


1.09.031.870 I srv  update_slots: all slots are idle
2.12.670.942 I srv  params_from_: Chat format: peg-native
2.12.671.015 I slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.481 (> 0.100 thold), f_keep = 0.238
2.12.671.017 I srv  get_availabl: updating prompt cache
2.12.671.033 W srv   prompt_save:  - saving prompt with length 105, total state size = 2.873 MiB (draft: 0.000 MiB)
2.12.678.125 I srv          load:  - looking for better prompt, base f_keep = 0.238, sim = 0.481
2.12.678.129 I srv        update:  - cache state: 2 prompts, 6.211 MiB (limits: 8192.000 MiB, 4096 tokens, 299403 est)
2.12.678.130 I srv        update:    - prompt 0x5bf2c6927e50:     122 tokens, checkpoints:  0,     3.338 MiB
2.12.678.131 I srv        update:    - prompt 0x5bf2c6edfb70:     105 tokens, checkpoints:  0,     2.873 MiB
2.12.678.131 I srv  get_availabl: prompt cache update took 7.11 ms
2.12.678.172 I slot launch_slot_: id  3 | task 153 | processing task, is_child = 0
2.15.212.751 I slot print_timing: id  3 | task 153 | n_decoded =    100, tg =  39.60 t/s
2.18.226.607 I slot print_timing: id  3 | task 153 | n_decoded =    212, tg =  38.28 t/s
2.21.230.563 I slot print_timing: id  3 | task 153 | n_decoded =    322, tg =  37.69 t/s
2.24.234.525 I slot print_timing: id  3 | task 153 | n_decoded =    432, tg =  37.41 t/s
2.27.256.171 I slot print_timing: id  3 | task 153 | n_decoded =    541, tg =  37.14 t/s
2.29.456.489 I slot print_timing: id  3 | task 153 | prompt eval time =       9.62 ms /    27 tokens (    0.36 ms per token,  2806.94 tokens per second)
2.29.456.494 I slot print_timing: id  3 | task 153 |        eval time =   16768.69 ms /   620 tokens (   27.05 ms per token,    36.97 tokens per second)
2.29.456.495 I slot print_timing: id  3 | task 153 |       total time =   16778.31 ms /   647 tokens
2.29.456.496 I slot print_timing: id  3 | task 153 |    graphs reused =        766
2.29.456.526 I slot      release: id  3 | task 153 | stop processing: n_tokens = 671, truncated = 0
2.29.456.534 I srv  update_slots: all slots are idle
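
The `GPU 210MHz` reading can be watched while a request is running. A minimal sketch, assuming `nvidia-smi` is available with the driver; `flag_low_clock` and its 1000 MHz and 90% thresholds are hypothetical choices for illustration, not part of llama.cpp:

```bash
# flag_low_clock [min_mhz]: read "sm_clock_mhz, util_pct" CSV lines on stdin and
# print a warning for samples where the GPU is busy but the SM clock is low.
# The 1000 MHz default and the 90% threshold are arbitrary, illustrative values.
flag_low_clock() {
  awk -F', *' -v min_mhz="${1:-1000}" '
    $2 + 0 >= 90 && $1 + 0 < min_mhz {
      print "low clock under load: " $1 " MHz at " $2 "% util"
    }'
}

# Live use (needs an NVIDIA driver; samples once per second until interrupted):
#   nvidia-smi --query-gpu=clocks.sm,utilization.gpu \
#              --format=csv,noheader,nounits -l 1 | flag_low_clock 1000

# Offline check using the reading from this report (210 MHz at 99% utilization):
echo '210, 99' | flag_low_clock 1000
```

Running the same watch against both builds during generation would show whether the low clock is specific to the default configuration.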

Reconfiguring with `GGML_CUDA_GRAPHS=OFF` and an explicit CUDA architecture, then rebuilding, fixes it:

```bash
cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=OFF \
  -DCMAKE_CUDA_ARCHITECTURES="89" \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.4

cmake --build build --config Release -j$(nproc)
```
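
To confirm which options a build tree was actually configured with, the values can be read back from the `CMakeCache.txt` that CMake writes into the build directory. A minimal sketch; `cache_value` is a hypothetical helper name, and the temporary file only stands in for a real cache:

```bash
# cache_value <cache-file> <variable>: print the value CMake recorded for a
# cache variable (entries look like NAME:TYPE=VALUE), or nothing if it is absent.
cache_value() {
  sed -n -E "s/^$2:[A-Z]+=(.*)$/\1/p" "$1"
}

# After configuring a real build tree:
#   cache_value build/CMakeCache.txt GGML_CUDA_GRAPHS
#   cache_value build/CMakeCache.txt CMAKE_CUDA_ARCHITECTURES

# Self-contained demo with a stand-in cache file:
tmp="$(mktemp)"
printf 'GGML_CUDA:BOOL=ON\nGGML_CUDA_GRAPHS:BOOL=OFF\n' > "$tmp"
cache_value "$tmp" GGML_CUDA_GRAPHS
rm -f "$tmp"
```

Checking both values before each benchmark run makes it easier to tell which of the two changed options is responsible for the difference.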

(base) ➜  mtp.llama.cpp git:(master) # start the server with the newly compiled build
./build/bin/llama-server \
  --model /home/albin/models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8081 \
  -ngl 999 \
  --ctx-size 4096 \
  -n 512 \
  -t 4 \
  -ub 512 \
  --api-key sk-local-qwen

0.00.013.517 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.013.519 I device_info:
0.00.068.836 I   - CUDA0   : NVIDIA GeForce RTX 4060 Ti (16193 MiB, 16051 MiB free)
0.00.068.846 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (23965 MiB, 23965 MiB free)
0.00.068.915 I system_info: n_threads = 4 (n_threads_batch = 4) / 6 | CUDA : ARCHS = 890 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.068.921 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.068.974 I srv          init: api_keys: ****qwen
0.00.068.984 I srv          init: using 8 threads for HTTP server
0.00.069.089 I srv         start: binding port with default address family
0.00.070.293 I srv  llama_server: loading model
0.00.070.297 I srv    load_model: loading model '/home/albin/models/qwen2.5-1.5b-instruct-q4_k_m.gguf'
0.00.070.348 I common_init_result: fitting params to device memory ...
0.00.070.348 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.495.469 W load: control-looking token: 128247 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.810.506 W llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
0.00.820.770 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.00.842.588 I srv    load_model: initializing slots, n_slots = 4
0.00.848.995 W common_speculative_init: no implementations specified for speculative decoding
0.00.848.998 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.00.849.002 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.00.849.002 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.00.849.002 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
0.00.849.106 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.00.849.107 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.00.849.108 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.00.849.132 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.00.854.147 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
0.00.857.202 I srv          init: init: chat template, thinking = 0
0.00.857.230 I srv  llama_server: model loaded
0.00.857.233 I srv  llama_server: server is listening on http://0.0.0.0:8081
0.00.857.238 I srv  update_slots: all slots are idle
0.20.465.242 I srv  params_from_: Chat format: peg-native
0.20.465.349 I slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
0.20.465.352 I srv  get_availabl: updating prompt cache
0.20.465.355 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.20.465.361 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 4096 tokens, 8589934592 est)
0.20.465.362 I srv  get_availabl: prompt cache update took 0.01 ms
0.20.465.411 I slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
0.20.913.347 I slot print_timing: id  3 | task 0 | prompt eval time =      27.47 ms /    39 tokens (    0.70 ms per token,  1419.94 tokens per second)
0.20.913.352 I slot print_timing: id  3 | task 0 |        eval time =     420.45 ms /    81 tokens (    5.19 ms per token,   192.65 tokens per second)
0.20.913.352 I slot print_timing: id  3 | task 0 |       total time =     447.92 ms /   120 tokens
0.20.913.370 I slot print_timing: id  3 | task 0 |    graphs reused =         80
0.20.913.395 I slot      release: id  3 | task 0 | stop processing: n_tokens = 119, truncated = 0
0.20.913.400 I srv  update_slots: all slots are idle
0.32.947.917 I srv  params_from_: Chat format: peg-native
0.32.948.051 I slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.328
0.32.948.053 I srv  get_availabl: updating prompt cache
0.32.948.075 W srv   prompt_save:  - saving prompt with length 119, total state size = 3.256 MiB (draft: 0.000 MiB)
0.32.950.844 I srv          load:  - looking for better prompt, base f_keep = 0.328, sim = 1.000
0.32.950.848 I srv        update:  - cache state: 1 prompts, 3.256 MiB (limits: 8192.000 MiB, 4096 tokens, 299406 est)
0.32.950.849 I srv        update:    - prompt 0x63ba952db360:     119 tokens, checkpoints:  0,     3.256 MiB
0.32.950.850 I srv  get_availabl: prompt cache update took 2.80 ms
0.32.950.896 I slot launch_slot_: id  3 | task 82 | processing task, is_child = 0
0.32.950.900 W slot update_slots: id  3 | task 82 | need to evaluate at least 1 token for each active slot (n_past = 39, task.n_tokens() = 39)
0.32.950.901 W slot update_slots: id  3 | task 82 | n_past was set to 38
0.33.470.903 I slot print_timing: id  3 | task 82 | n_decoded =    100, tg = 194.37 t/s
0.33.599.415 I slot print_timing: id  3 | task 82 | prompt eval time =       5.52 ms /     1 tokens (    5.52 ms per token,   181.06 tokens per second)
0.33.599.419 I slot print_timing: id  3 | task 82 |        eval time =     642.99 ms /   125 tokens (    5.14 ms per token,   194.41 tokens per second)
0.33.599.419 I slot print_timing: id  3 | task 82 |       total time =     648.51 ms /   126 tokens
0.33.599.420 I slot print_timing: id  3 | task 82 |    graphs reused =        205
0.33.599.450 I slot      release: id  3 | task 82 | stop processing: n_tokens = 163, truncated = 0
0.33.599.455 I srv  update_slots: all slots are idle
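
The speedup can be computed directly from the `eval time` lines of the two runs. A minimal sketch, assuming the `print_timing` log format shown in this report; `eval_tps` and `speedup` are hypothetical helper names:

```bash
# eval_tps: read llama-server log lines on stdin and print the generation speed
# (tokens per second) from each "eval time" line, skipping "prompt eval time".
eval_tps() {
  grep 'eval time =' | grep -v 'prompt eval time' |
    sed -E 's/.*, *([0-9.]+) tokens per second.*/\1/'
}

# speedup <slow_tps> <fast_tps>: print fast/slow rounded to two decimals.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

# Values taken from the two runs in this report (36.97 t/s and 194.41 t/s):
speedup 36.97 194.41   # prints 5.26
```

With the server output saved to a file, `eval_tps < server.log` would list the generation speed of every request in that log (the file name is only an example).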

### First Bad Commit

unknown

### Compile command

```shell
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.4

cmake --build build --config Release -j$(nproc)
```

### Relevant log output

```shell
qwen2.5-1.5b-instruct-q4_k_m.gguf speed: 30+ t/s -> 190+ t/s

Default configuration:

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.4

Result: poor performance, GPU 210MHz:

 Device 0 [NVIDIA GeForce RTX 4060 Ti] PCIe GEN 3@ 8x RX: 1.191 MiB/s TX: 32.19 MiB/s
 GPU 210MHz  MEM 8751MHz TEMP  37°C FAN  30% POW 150 / 165 W
 GPU[|||||||||||||||||||||||||||||||||99%] MEM[||||                 1.663Gi/15.996Gi]
 [utilization graph omitted: "GPU0 %" and "GPU0 mem%" plotted over time on a 0-100 scale]
    PID  USER DEV    TYPE  GPU        GPU MEM    CPU  HOST MEM Command
   3698 albin   0 Compute  98%   1494MiB   9%   101%    477MiB ./ai_models/mtp.llama.cpp/build/bin/llama-server --model /home/albin/models/qwen2.5-1.5b-instruct-q4_k_m.gguf --host 0.0.0.0 --port 8081 -ngl 999 --ctx-size 4096 -n 512 -t 8 -ub 512 --api-key sk-local-qwen


```

Compile bug: RTX 4060 Ti 16 GB gets poor performance with default compile options (GGML_CUDA_GRAPHS=ON) #23957
