Misc. bug: llama-server tokens per second slow down significantly after release b5450 (#13642) #13735
Comments
#13642: Additionally, the CPU backend and llama.cpp now are only built once when creating a release. My llama-server.exe is downloaded from GitHub. Could the slowdown be related to the clang-compiled executable using the CPU less efficiently?
Likely duplicate of #13664. Try using fewer threads.
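One way to try the fewer-threads suggestion is to pass an explicit thread count with `-t` instead of relying on the default. The sketch below is only an illustration: the helper names, the `model.gguf` path, and the "one thread per physical core" heuristic (logical CPUs divided by an assumed SMT factor of 2) are assumptions, not something taken from this thread.

```python
import os


def suggested_threads(logical_cpus, smt_ways=2):
    """Guess the physical core count, assuming `smt_ways` hardware threads per core."""
    return max(1, logical_cpus // smt_ways)


def server_command(model_path, threads, exe="llama-server"):
    """Build an argv list: -m selects the model file, -t sets the CPU thread count."""
    return [exe, "-m", model_path, "-t", str(threads)]


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs; it can return None, so fall back to 1.
    logical = os.cpu_count() or 1
    # "model.gguf" is a placeholder path, not a file mentioned in this issue.
    print(" ".join(server_command("model.gguf", suggested_threads(logical))))
```

On a Ryzen 9 3950X (16 cores, 32 hardware threads) this heuristic would suggest `-t 16`. Trying a few smaller values as well and comparing tokens per second is probably more informative than any single guess.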
Thank you for your suggestion.
For your reference, in the case of b5449, the MSVC build seems to be faster than the clang build.
I've experienced the same issue. From b5450 through the latest version, the token generation rate for Qwen3-30B-A3B drops to ~5 tok/s, while on b5449 and earlier it is about 22 tok/s. I'm using the Windows Vulkan x64 llama.cpp binaries downloaded from GitHub. My notebook PC: Lenovo ThinkBook 14 G7+ IAH, Intel Core Ultra 7 255H CPU, Intel Arc 140T integrated GPU, 32 GB RAM, Windows 11 24H2.
Name and Version
llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from ggml-cuda.dll
load_backend: loaded RPC backend from ggml-rpc.dll
load_backend: loaded CPU backend from ggml-cpu-haswell.dll
version: 5450 (d643bb2)
built with clang version 18.1.8 for x86_64-pc-windows-msvc
Operating systems
Windows 10
CPU: AMD Ryzen 9 3950X
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
With release b5449, I could achieve around 26 tokens per second.
After release b5450, I could only achieve around 16 tokens per second.
Has anyone else experienced the same issue?
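To put the reported numbers in relative terms, here is a quick arithmetic sketch. The function name is illustrative, and the inputs are the approximate figures quoted in this issue (26 to 16 tok/s on the CUDA setup, and about 22 to about 5 tok/s in the Vulkan comment).

```python
def slowdown_percent(before_tps, after_tps):
    """Relative drop in throughput, as a percentage of the earlier rate."""
    return (before_tps - after_tps) / before_tps * 100.0


# Figures as reported in this issue (approximate, tokens per second).
print(round(slowdown_percent(26, 16), 1))  # 38.5
print(round(slowdown_percent(22, 5), 1))   # 77.3
```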
First Bad Commit
Release b5450, commit d643bb2.
All releases after b5450 seem to suffer from the same slowdown.
Relevant log output