
Misc. bug: Starting from b5450 to latest version, token generation rate for model Qwen3-30B-A3B is reduced to ~5 tok/s. #13738


Closed
xmgsincere opened this issue May 24, 2025 · 2 comments

Comments

@xmgsincere

Name and Version

from b5450 to latest version

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m H:\models\Sowkwndms\Qwen3-30B-A3B-abliterated-Q4_K_M-GGUF\qwen3-30b-a3b-abliterated-q4_k_m.gguf  --port 1234 -c 4096 -ngl 46 -t 16 --no-warmup

Problem description & steps to reproduce

Starting from b5450 up to the latest version, the token generation rate for Qwen3-30B-A3B drops to ~5 tok/s; with b5449 or earlier it is about 22 tok/s. I'm using the Windows Vulkan x64 binary. My notebook platform: Lenovo ThinkBook 14 G7+ IAH, Intel Core Ultra 7 255H CPU, Intel Arc 140T iGPU, 32 GB RAM, Windows 11 24H2.
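One way to narrow this down is to compare the last-good and first-bad release binaries directly with llama-bench, which reports tokens per second for prompt processing and generation. A minimal sketch, assuming the b5449 and b5450 Vulkan release archives are unpacked into sibling folders (folder names are illustrative; the model path is taken from the command line above):

```shell
# Measure generation speed (-n 128 generated tokens, no prompt benchmark)
# on each build with the same model and offload settings as the report.
.\llama.cpp-b5449\llama-bench.exe -m "H:\models\Sowkwndms\Qwen3-30B-A3B-abliterated-Q4_K_M-GGUF\qwen3-30b-a3b-abliterated-q4_k_m.gguf" -ngl 46 -p 0 -n 128
.\llama.cpp-b5450\llama-bench.exe -m "H:\models\Sowkwndms\Qwen3-30B-A3B-abliterated-Q4_K_M-GGUF\qwen3-30b-a3b-abliterated-q4_k_m.gguf" -ngl 46 -p 0 -n 128
```

If the tg numbers differ by the reported ~4x between the two builds, that localizes the regression to the commits between those tags.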

First Bad Commit

No response

Relevant log output

@JohannesGaessler changed the title from "Misc. bug:" to "Misc. bug: Starting from b5450 to latest version, token generation rate for model Qwen3-30B-A3B is reduced to ~5 tok/s." May 24, 2025

2dameneko commented May 24, 2025

I can confirm the same behavior with Qwen3-235B-UD2 with partial offloading to CPU (-ot ".[8-9].ffn_.*_exps.=CPU"). Generation dropped from ~10 to ~3 t/s.

Win11, 14700/128 GB/5090+4090+3090. Tested on CUDA 12.4 and CPU-only builds; both are affected.

Cmd
llama-server --model "Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf" -fa --mlock -ctk q8_0 -ctv q8_0 -c 32768 --batch-size 512 -ts 24,32,21 -ot ".[8-9].ffn_.*_exps.=CPU" -ngl 99 --threads 27 --host 0.0.0.0 --port 5000
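For reference, the pattern on the left of `=CPU` in the -ot argument is an ordinary regex matched against GGUF tensor names, so `.` matches any character rather than a literal dot. A small sketch of which expert tensors it keeps on the CPU, assuming llama.cpp's usual MoE tensor naming (`blk.<layer>.ffn_{gate,down,up}_exps.weight`) and regex-search semantics; the tensor names below are illustrative:

```python
import re

# Illustrative tensor names in llama.cpp's GGUF MoE naming scheme.
tensor_names = [
    "blk.7.ffn_gate_exps.weight",
    "blk.8.ffn_gate_exps.weight",
    "blk.9.ffn_up_exps.weight",
    "blk.18.ffn_down_exps.weight",
    "blk.20.attn_q.weight",
]

# The pattern from the -ot argument above (left of '=CPU').
pattern = re.compile(r".[8-9].ffn_.*_exps.")

kept_on_cpu = [name for name in tensor_names if pattern.search(name)]
```

Note that because the leading `.` is a wildcard, the pattern also matches layers like `blk.18` (the `1` satisfies the `.`), not only layers 8-9; escaping the dots (`\.[8-9]\.`) would restrict it to exactly those two layers.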


BVEsun commented May 24, 2025

I also confirm this issue; performance is significantly worse with partial offloading to CPU via -ot.

@slaren slaren closed this as completed May 24, 2025
4 participants