
Misc. bug: llama-server tokens per second slow down significantly after release b5450 (#13642) #13735

Open
BVEsun opened this issue May 24, 2025 · 5 comments

Comments


BVEsun commented May 24, 2025

Name and Version

llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from ggml-cuda.dll
load_backend: loaded RPC backend from ggml-rpc.dll
load_backend: loaded CPU backend from ggml-cpu-haswell.dll
version: 5450 (d643bb2)
built with clang version 18.1.8 for x86_64-pc-windows-msvc

Operating systems

Windows 10
CPU: AMD Ryzen 9 3950X

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe -m "..\models\Qwen3-30B-A3B-UD-IQ3_XXS.gguf" -ngl 99 -fa -c 40960 --cache-type-k q8_0 --cache-type-v q8_0 --override-tensor "([0-9]|[13579]).ffn_.*_exps.*=CPU"

Problem description & steps to reproduce

For release b5449, I could achieve around 26 tokens per second.
After release b5450, I could only achieve around 16 tokens per second.

Has anyone experienced the same issue?

First Bad Commit

Release b5450, commit d643bb2.
After b5450, all releases seem to suffer the same slowdown.

Relevant log output

Author

BVEsun commented May 24, 2025

From #13642:
By building the CPU backend separately it is possible to use clang to compile it even with backends that require a different C/C++ compiler, such as CUDA, SYCL and Vulkan. Using clang to build the CPU backend is preferred since it tends to result in better performance than MSVC.

Additionally, the CPU backend and llama.cpp now are only built once when creating a release.


My llama-server.exe was downloaded from GitHub; could this be related to the clang-compiled exe not using the CPU efficiently?
My CPU load in b5449 was around 67% when running,
while the CPU load in b5450 was around 33% when running;
both were using 16 threads.

Member

slaren commented May 24, 2025

Likely duplicate of #13664. Try using fewer threads.
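
For example, taking the command from the report and adding an explicit thread count (the -t value below is only a starting point to experiment with, not a recommendation):

llama-server.exe -m "..\models\Qwen3-30B-A3B-UD-IQ3_XXS.gguf" -ngl 99 -fa -c 40960 --cache-type-k q8_0 --cache-type-v q8_0 --override-tensor "([0-9]|[13579]).ffn_.*_exps.*=CPU" -t 8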

Author

BVEsun commented May 24, 2025

Thank you for your suggestion.
I have tried reducing the threads, testing from 2 to 16; it seems to improve a bit, but it is still no match for the b5449 speed.

> Likely duplicate of #13664. Try using fewer threads.
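
For a more systematic comparison, something like llama-bench can sweep several thread counts in a single run (the thread values below are only examples, and the model path is taken from the command above). Note that this sketch does not include the --override-tensor CPU offload, so -ngl may need to be lowered to fit the model in VRAM:

llama-bench.exe -m "..\models\Qwen3-30B-A3B-UD-IQ3_XXS.gguf" -ngl 99 -t 2,4,8,16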

Author

BVEsun commented May 24, 2025

For your reference, in the case of b5449:
llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from ggml-cuda.dll
load_backend: loaded RPC backend from ggml-rpc.dll
load_backend: loaded CPU backend from ggml-cpu-haswell.dll
version: 5449 (8e186ef)
built with MSVC 19.29.30159.0 for Windows AMD64

It seems the build with MSVC is faster than the one with clang.
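
A rough way to check would be to build the same commit locally with MSVC (the default Visual Studio toolset) and compare it against the downloaded release binaries, whose CPU backend is now compiled with clang. A minimal sketch, assuming CMake, Visual Studio, and the CUDA toolkit are installed:

rem configure with the default MSVC toolset, CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
rem compile in Release mode; llama-server.exe typically ends up under build\bin\Release
cmake --build build --config Release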

@xmgsincere

I've experienced the same issue. Starting from b5450 up to the latest version, the token generation rate for the model Qwen3-30B-A3B is reduced to ~5 tok/s, while with b5449 or earlier versions the token generation rate is about 22 tok/s. I'm using the Windows Vulkan x64 llama.cpp binaries downloaded from GitHub. My notebook PC platform: Lenovo ThinkBook 14 G7+ IAH, Intel Core Ultra 7 255H CPU, Intel Arc 140T integrated GPU, 32 GB RAM, Windows 11 24H2.
