Git commit
master commit: 328874d
Operating systems
Linux
GGML backends
CUDA
Problem description & steps to reproduce
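Generation speed for qwen2.5-1.5b-instruct-q4_k_m.gguf goes from 37 t/s to 194 t/s depending on the build configuration.
Building with
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
-DCUDAToolkit_ROOT=/usr/local/cuda-12.4
cmake --build build --config Release -j$(nproc)
gives bad performance (about 37 t/s), with the GPU clock reading 210 MHz: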
1.09.031.870 I srv update_slots: all slots are idle
2.12.670.942 I srv params_from_: Chat format: peg-native
2.12.671.015 I slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.481 (> 0.100 thold), f_keep = 0.238
2.12.671.017 I srv get_availabl: updating prompt cache
2.12.671.033 W srv prompt_save: - saving prompt with length 105, total state size = 2.873 MiB (draft: 0.000 MiB)
2.12.678.125 I srv load: - looking for better prompt, base f_keep = 0.238, sim = 0.481
2.12.678.129 I srv update: - cache state: 2 prompts, 6.211 MiB (limits: 8192.000 MiB, 4096 tokens, 299403 est)
2.12.678.130 I srv update: - prompt 0x5bf2c6927e50: 122 tokens, checkpoints: 0, 3.338 MiB
2.12.678.131 I srv update: - prompt 0x5bf2c6edfb70: 105 tokens, checkpoints: 0, 2.873 MiB
2.12.678.131 I srv get_availabl: prompt cache update took 7.11 ms
2.12.678.172 I slot launch_slot_: id 3 | task 153 | processing task, is_child = 0
2.15.212.751 I slot print_timing: id 3 | task 153 | n_decoded = 100, tg = 39.60 t/s
2.18.226.607 I slot print_timing: id 3 | task 153 | n_decoded = 212, tg = 38.28 t/s
2.21.230.563 I slot print_timing: id 3 | task 153 | n_decoded = 322, tg = 37.69 t/s
2.24.234.525 I slot print_timing: id 3 | task 153 | n_decoded = 432, tg = 37.41 t/s
2.27.256.171 I slot print_timing: id 3 | task 153 | n_decoded = 541, tg = 37.14 t/s
2.29.456.489 I slot print_timing: id 3 | task 153 | prompt eval time = 9.62 ms / 27 tokens ( 0.36 ms per token, 2806.94 tokens per second)
2.29.456.494 I slot print_timing: id 3 | task 153 | eval time = 16768.69 ms / 620 tokens ( 27.05 ms per token, 36.97 tokens per second)
2.29.456.495 I slot print_timing: id 3 | task 153 | total time = 16778.31 ms / 647 tokens
2.29.456.496 I slot print_timing: id 3 | task 153 | graphs reused = 766
2.29.456.526 I slot release: id 3 | task 153 | stop processing: n_tokens = 671, truncated = 0
2.29.456.534 I srv update_slots: all slots are idle
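To compare runs, the generation speed can be pulled out of the `print_timing` lines with a small filter. This is only a sketch and not part of llama.cpp: the helper names `gen_tps` and `speedup` are made up here, and it assumes the log layout shown above, where the final `eval time` line ends in `tokens per second)`. The value 194.41 is the eval speed reported for the fixed build.

```shell
#!/bin/sh
# Hypothetical helpers, not part of llama.cpp.

# Print the tokens-per-second value of each generation "eval time" line on stdin.
# "prompt eval time" lines are skipped: they do not contain "| eval time =".
gen_tps() {
    awk 'index($0, "| eval time =") { print $(NF - 3) }'
}

# Print fast/slow as a ratio with two decimals.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

# Two lines copied from the slow run above.
slow=$(gen_tps <<'EOF'
2.29.456.489 I slot print_timing: id 3 | task 153 | prompt eval time = 9.62 ms / 27 tokens ( 0.36 ms per token, 2806.94 tokens per second)
2.29.456.494 I slot print_timing: id 3 | task 153 | eval time = 16768.69 ms / 620 tokens ( 27.05 ms per token, 36.97 tokens per second)
EOF
)
echo "slow build: $slow t/s"
speedup "$slow" 194.41
```

With a saved server log, the same filter could be used as `gen_tps < server.log`, where `server.log` is a placeholder file name.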
Building with
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_GRAPHS=OFF \
-DCMAKE_CUDA_ARCHITECTURES="89" \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
-DCUDAToolkit_ROOT=/usr/local/cuda-12.4
cmake --build build --config Release -j$(nproc)
fixes it (about 194 t/s in the run below).
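The fixed build changes two options at once (`GGML_CUDA_GRAPHS=OFF` and `CMAKE_CUDA_ARCHITECTURES=89`), so it is not clear from this report which one matters. A possible way to narrow it down is to configure one build directory per combination and benchmark each. The sketch below only prints the configure commands; the build directory names and the `configure_cmd` helper are hypothetical, and the options are the ones already used above.

```shell
#!/bin/sh
# Sketch: print one configure command per option combination so each change
# can be benchmarked on its own. Build directory names are hypothetical.
CUDA_ROOT=/usr/local/cuda-12.4

# $1 = build directory, remaining arguments = extra -D options
configure_cmd() {
    dir=$1
    shift
    printf 'cmake -B %s -DGGML_CUDA=ON' "$dir"
    for opt in "$@"; do
        printf ' %s' "$opt"
    done
    printf ' -DCMAKE_CUDA_COMPILER=%s/bin/nvcc -DCUDAToolkit_ROOT=%s\n' "$CUDA_ROOT" "$CUDA_ROOT"
}

configure_cmd build-default
configure_cmd build-nographs -DGGML_CUDA_GRAPHS=OFF
configure_cmd build-arch89 -DCMAKE_CUDA_ARCHITECTURES=89
configure_cmd build-both -DGGML_CUDA_GRAPHS=OFF -DCMAKE_CUDA_ARCHITECTURES=89
```

Each printed line can be run as-is, followed by `cmake --build <dir> --config Release -j$(nproc)`, and the same request then sent to each resulting `llama-server` binary.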
(base) ➜ mtp.llama.cpp git:(master) # start the server with the newly compiled build
./build/bin/llama-server \
--model /home/albin/models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8081 \
-ngl 999 \
--ctx-size 4096 \
-n 512 \
-t 4 \
-ub 512 \
--api-key sk-local-qwen
0.00.013.517 I log_info: verbosity = 3 (adjust with the -lv N CLI arg)
0.00.013.519 I device_info:
0.00.068.836 I - CUDA0 : NVIDIA GeForce RTX 4060 Ti (16193 MiB, 16051 MiB free)
0.00.068.846 I - CPU : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (23965 MiB, 23965 MiB free)
0.00.068.915 I system_info: n_threads = 4 (n_threads_batch = 4) / 6 | CUDA : ARCHS = 890 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.068.921 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.068.974 I srv init: api_keys: ****qwen
0.00.068.984 I srv init: using 8 threads for HTTP server
0.00.069.089 I srv start: binding port with default address family
0.00.070.293 I srv llama_server: loading model
0.00.070.297 I srv load_model: loading model '/home/albin/models/qwen2.5-1.5b-instruct-q4_k_m.gguf'
0.00.070.348 I common_init_result: fitting params to device memory ...
0.00.070.348 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.495.469 W load: control-looking token: 128247 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.810.506 W llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
0.00.820.770 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.00.842.588 I srv load_model: initializing slots, n_slots = 4
0.00.848.995 W common_speculative_init: no implementations specified for speculative decoding
0.00.848.998 I slot load_model: id 0 | task -1 | new slot, n_ctx = 4096
0.00.849.002 I slot load_model: id 1 | task -1 | new slot, n_ctx = 4096
0.00.849.002 I slot load_model: id 2 | task -1 | new slot, n_ctx = 4096
0.00.849.002 I slot load_model: id 3 | task -1 | new slot, n_ctx = 4096
0.00.849.106 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
0.00.849.107 I srv load_model: use --cache-ram 0 to disable the prompt cache
0.00.849.108 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.00.849.132 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.00.854.147 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
0.00.857.202 I srv init: init: chat template, thinking = 0
0.00.857.230 I srv llama_server: model loaded
0.00.857.233 I srv llama_server: server is listening on http://0.0.0.0:8081
0.00.857.238 I srv update_slots: all slots are idle
0.20.465.242 I srv params_from_: Chat format: peg-native
0.20.465.349 I slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
0.20.465.352 I srv get_availabl: updating prompt cache
0.20.465.355 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.20.465.361 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 4096 tokens, 8589934592 est)
0.20.465.362 I srv get_availabl: prompt cache update took 0.01 ms
0.20.465.411 I slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
0.20.913.347 I slot print_timing: id 3 | task 0 | prompt eval time = 27.47 ms / 39 tokens ( 0.70 ms per token, 1419.94 tokens per second)
0.20.913.352 I slot print_timing: id 3 | task 0 | eval time = 420.45 ms / 81 tokens ( 5.19 ms per token, 192.65 tokens per second)
0.20.913.352 I slot print_timing: id 3 | task 0 | total time = 447.92 ms / 120 tokens
0.20.913.370 I slot print_timing: id 3 | task 0 | graphs reused = 80
0.20.913.395 I slot release: id 3 | task 0 | stop processing: n_tokens = 119, truncated = 0
0.20.913.400 I srv update_slots: all slots are idle
0.32.947.917 I srv params_from_: Chat format: peg-native
0.32.948.051 I slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.328
0.32.948.053 I srv get_availabl: updating prompt cache
0.32.948.075 W srv prompt_save: - saving prompt with length 119, total state size = 3.256 MiB (draft: 0.000 MiB)
0.32.950.844 I srv load: - looking for better prompt, base f_keep = 0.328, sim = 1.000
0.32.950.848 I srv update: - cache state: 1 prompts, 3.256 MiB (limits: 8192.000 MiB, 4096 tokens, 299406 est)
0.32.950.849 I srv update: - prompt 0x63ba952db360: 119 tokens, checkpoints: 0, 3.256 MiB
0.32.950.850 I srv get_availabl: prompt cache update took 2.80 ms
0.32.950.896 I slot launch_slot_: id 3 | task 82 | processing task, is_child = 0
0.32.950.900 W slot update_slots: id 3 | task 82 | need to evaluate at least 1 token for each active slot (n_past = 39, task.n_tokens() = 39)
0.32.950.901 W slot update_slots: id 3 | task 82 | n_past was set to 38
0.33.470.903 I slot print_timing: id 3 | task 82 | n_decoded = 100, tg = 194.37 t/s
0.33.599.415 I slot print_timing: id 3 | task 82 | prompt eval time = 5.52 ms / 1 tokens ( 5.52 ms per token, 181.06 tokens per second)
0.33.599.419 I slot print_timing: id 3 | task 82 | eval time = 642.99 ms / 125 tokens ( 5.14 ms per token, 194.41 tokens per second)
0.33.599.419 I slot print_timing: id 3 | task 82 | total time = 648.51 ms / 126 tokens
0.33.599.420 I slot print_timing: id 3 | task 82 | graphs reused = 205
0.33.599.450 I slot release: id 3 | task 82 | stop processing: n_tokens = 163, truncated = 0
0.33.599.455 I srv update_slots: all slots are idle
First Bad Commit
unknown
Compile command
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
-DCUDAToolkit_ROOT=/usr/local/cuda-12.4
cmake --build build --config Release -j$(nproc)
Relevant log output
GPU monitor output while the default build is generating (GPU clock 210 MHz at 99% utilization):
Device 0 [NVIDIA GeForce RTX 4060 Ti] PCIe GEN 3@ 8x RX: 1.191 MiB/s TX: 32.19 MiB/s
GPU 210MHz MEM 8751MHz TEMP 37°C FAN 30% POW 150 / 165 W
GPU[|||||||||||||||||||||||||||||||||99%] MEM[|||| 1.663Gi/15.996Gi]
[Utilization graph: GPU0 % and GPU0 mem% over time, y-axis 0 to 100]
PID USER DEV TYPE GPU GPU MEM CPU HOST MEM Command
3698 albin 0 Compute 98% 1494MiB 9% 101% 477MiB ./ai_models/mtp.llama.cpp/build/bin/llama-server --model /home/albin/models/qwen2.5-1.5b-instruct-q4_k_m.gguf --host 0.0.0.0 --port 8081 -ngl 999 --ctx-size 4096 -n 512 -t 8 -ub 512 --api-key sk-local-qwen
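The monitor readout above (210 MHz at 99% utilization) is the kind of state that can be checked for automatically. The following is a rough sketch under assumptions: `check_clock` and the 1000 MHz threshold are invented for illustration, and the input format is the `csv,noheader,nounits` output of `nvidia-smi --query-gpu=clocks.sm,utilization.gpu`.

```shell
#!/bin/sh
# Hypothetical check, not part of llama.cpp or any GPU monitor.
# Input: CSV lines "sm_clock_mhz, gpu_util_percent".
# Prints "low-clock" when the GPU is busy but the SM clock is under the threshold.
MIN_BUSY_MHZ=1000   # assumed threshold, adjust for the card in use

check_clock() {
    awk -F', *' -v min="$MIN_BUSY_MHZ" '{
        if ($2 + 0 >= 90 && $1 + 0 < min) print "low-clock"
        else print "ok"
    }'
}

# Values read off the monitor output above: 210 MHz at 99% utilization.
echo "210, 99" | check_clock

# On a machine with the NVIDIA driver installed, sample the live values instead.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=clocks.sm,utilization.gpu --format=csv,noheader,nounits | check_clock
fi
```

Running this while a request is being generated on each build would show whether the low clock only appears with the default configuration.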