
Misc. bug: Starting from b5450 to latest version, token generation rate for model Qwen3-30B-A3B is reduced to ~5 tok/s. #13738


Closed
xmgsincere opened this issue May 24, 2025 · 2 comments

Comments

@xmgsincere

Name and Version

from b5450 to latest version

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m H:\models\Sowkwndms\Qwen3-30B-A3B-abliterated-Q4_K_M-GGUF\qwen3-30b-a3b-abliterated-q4_k_m.gguf  --port 1234 -c 4096 -ngl 46 -t 16 --no-warmup

Problem description & steps to reproduce

Starting from b5450 up to the latest version, the token generation rate for Qwen3-30B-A3B drops to ~5 tok/s; with b5449 or earlier it is about 22 tok/s. I'm using the Windows Vulkan x64 binary. My notebook platform: Lenovo ThinkBook 14 G7+ IAH, Intel Core Ultra 7 255H CPU, Intel Arc 140T iGPU, 32 GB RAM, Windows 11 24H2.
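One way to narrow this down is to compare the last-good and first-bad release binaries directly with llama-bench, which reports tokens per second for prompt processing and generation. A minimal sketch, assuming the b5449 and b5450 Vulkan release archives are unpacked into sibling folders (folder names are illustrative; the model path is taken from the command line above):

```shell
# Measure generation speed (-n 128 generated tokens, no prompt benchmark)
# on each build with the same model and offload settings as the report.
.\llama.cpp-b5449\llama-bench.exe -m "H:\models\Sowkwndms\Qwen3-30B-A3B-abliterated-Q4_K_M-GGUF\qwen3-30b-a3b-abliterated-q4_k_m.gguf" -ngl 46 -p 0 -n 128
.\llama.cpp-b5450\llama-bench.exe -m "H:\models\Sowkwndms\Qwen3-30B-A3B-abliterated-Q4_K_M-GGUF\qwen3-30b-a3b-abliterated-q4_k_m.gguf" -ngl 46 -p 0 -n 128
```

If the tg numbers differ by the reported ~4x between the two builds, that localizes the regression to the commits between those tags.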

First Bad Commit

No response

Relevant log output

@JohannesGaessler changed the title from "Misc. bug:" to "Misc. bug: Starting from b5450 to latest version, token generation rate for model Qwen3-30B-A3B is reduced to ~5 tok/s." May 24, 2025

2dameneko commented May 24, 2025

I can confirm the same behavior with Qwen3-235B-UD2 with partial offloading to CPU (-ot ".[8-9].ffn_.*_exps.=CPU"). Generation dropped from ~10 to ~3 t/s.

Win11, 14700/128 GB/5090+4090+3090. Tested on CUDA 12.4 and CPU-only builds; both are affected.

Cmd
llama-server --model "Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf" -fa --mlock -ctk q8_0 -ctv q8_0 -c 32768 --batch-size 512 -ts 24,32,21 -ot ".[8-9].ffn_.*_exps.=CPU" -ngl 99 --threads 27 --host 0.0.0.0 --port 5000
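For reference, the pattern on the left of `=CPU` in the -ot argument is an ordinary regex matched against GGUF tensor names, so `.` matches any character rather than a literal dot. A small sketch of which expert tensors it keeps on the CPU, assuming llama.cpp's usual MoE tensor naming (`blk.<layer>.ffn_{gate,down,up}_exps.weight`) and regex-search semantics; the tensor names below are illustrative:

```python
import re

# Illustrative tensor names in llama.cpp's GGUF MoE naming scheme.
tensor_names = [
    "blk.7.ffn_gate_exps.weight",
    "blk.8.ffn_gate_exps.weight",
    "blk.9.ffn_up_exps.weight",
    "blk.18.ffn_down_exps.weight",
    "blk.20.attn_q.weight",
]

# The pattern from the -ot argument above (left of '=CPU').
pattern = re.compile(r".[8-9].ffn_.*_exps.")

kept_on_cpu = [name for name in tensor_names if pattern.search(name)]
```

Note that because the leading `.` is a wildcard, the pattern also matches layers like `blk.18` (the `1` satisfies the `.`), not only layers 8-9; escaping the dots (`\.[8-9]\.`) would restrict it to exactly those two layers.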


BVEsun commented May 24, 2025

I also confirm this issue; performance is significantly worse with partial offloading to CPU via -ot.

@slaren slaren closed this as completed May 24, 2025
4 participants