
Misc. bug: llama-server tokens per second slow down significantly after release b5450 (#13642) #13735

Open
BVEsun opened this issue May 24, 2025 · 5 comments

Comments


BVEsun commented May 24, 2025

Name and Version

llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from ggml-cuda.dll
load_backend: loaded RPC backend from ggml-rpc.dll
load_backend: loaded CPU backend from ggml-cpu-haswell.dll
version: 5450 (d643bb2)
built with clang version 18.1.8 for x86_64-pc-windows-msvc

Operating systems

Windows 10
CPU: AMD Ryzen 9 3950X

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe -m "..\models\Qwen3-30B-A3B-UD-IQ3_XXS.gguf" -ngl 99 -fa -c 40960 --cache-type-k q8_0 --cache-type-v q8_0 --override-tensor "([0-9]|[13579]).ffn_.*_exps.*=CPU"

Problem description & steps to reproduce

For release b5449, I could achieve around 26 tokens per second.
After release b5450, I could only achieve around 16 tokens per second.

Has anyone experienced the same issue?

First Bad Commit

Release b5450, commit d643bb2.
After b5450, all releases seem to suffer the same slowdown.

Relevant log output

Author

BVEsun commented May 24, 2025

From #13642:
By building the CPU backend separately it is possible to use clang to compile it even with backends that require a different C/C++ compiler, such as CUDA, SYCL and Vulkan. Using clang to build the CPU backend is preferred since it tends to result in better performance than MSVC.

Additionally, the CPU backend and llama.cpp now are only built once when creating a release.


My llama-server.exe was downloaded from GitHub; could this be related to the clang-compiled exe not using the CPU efficiently?
My CPU load in b5449 was around 67% when running,
while the CPU load in b5450 was around 33% when running;
both were using 16 threads.

Member

slaren commented May 24, 2025

Likely duplicate of #13664. Try using fewer threads.
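
For example, taking the command from the report and adding an explicit thread count (the -t value below is only a starting point to experiment with, not a recommendation):

llama-server.exe -m "..\models\Qwen3-30B-A3B-UD-IQ3_XXS.gguf" -ngl 99 -fa -c 40960 --cache-type-k q8_0 --cache-type-v q8_0 --override-tensor "([0-9]|[13579]).ffn_.*_exps.*=CPU" -t 8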

Author

BVEsun commented May 24, 2025

Thank you for your suggestion.
I have tried reducing the threads, testing from 2 to 16; it seems to improve a bit, but it is still no match for the b5449 speed.

> Likely duplicate of #13664. Try using fewer threads.
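
For a more systematic comparison, something like llama-bench can sweep several thread counts in a single run (the thread values below are only examples, and the model path is taken from the command above). Note that this sketch does not include the --override-tensor CPU offload, so -ngl may need to be lowered to fit the model in VRAM:

llama-bench.exe -m "..\models\Qwen3-30B-A3B-UD-IQ3_XXS.gguf" -ngl 99 -t 2,4,8,16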

Author

BVEsun commented May 24, 2025

For your reference, in the case of b5449:
llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from ggml-cuda.dll
load_backend: loaded RPC backend from ggml-rpc.dll
load_backend: loaded CPU backend from ggml-cpu-haswell.dll
version: 5449 (8e186ef)
built with MSVC 19.29.30159.0 for Windows AMD64

It seems the build with MSVC is faster than the one with clang.
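
A rough way to check would be to build the same commit locally with MSVC (the default Visual Studio toolset) and compare it against the downloaded release binaries, whose CPU backend is now compiled with clang. A minimal sketch, assuming CMake, Visual Studio, and the CUDA toolkit are installed:

rem configure with the default MSVC toolset, CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
rem compile in Release mode; llama-server.exe typically ends up under build\bin\Release
cmake --build build --config Release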

@xmgsincere

I've experienced the same issue. Starting from b5450 up to the latest version, the token generation rate for the model Qwen3-30B-A3B is reduced to ~5 tok/s, while with b5449 or earlier versions the token generation rate is about 22 tok/s. I'm using the Windows Vulkan x64 llama.cpp binaries downloaded from GitHub. My notebook PC platform: Lenovo ThinkBook 14 G7+ IAH, Intel Core Ultra 7 255H CPU, Intel Arc 140T integrated GPU, 32 GB RAM, Windows 11 24H2.
