Description
This is a weird issue that I'm currently looking into. I'll keep this open to share whatever I find, and to see if anyone else is hitting it!
I'm running llama.cpp's Vulkan backend on two GPUs, a FirePro W8100 and an RX 470, on Ubuntu 24 with RADV. On certain models, prompt processing speed drops sharply once the prompt reaches a certain size. Text generation is perfectly fine; only prompt processing is affected.
For example, here's Mistral Small 22B quantized with Q4_K_M and Q4_K_S, fully offloaded with `-sm layer`. Once the prompt hits 385 tokens for Q4_K_M or 202 tokens for Q4_K_S, prompt processing slows down by more than half:
| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp384 | 40.37 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp385 | 16.00 ± 0.00 |
| llama ?B Q4_K - Small | 11.79 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp201 | 32.42 ± 0.00 |
| llama ?B Q4_K - Small | 11.79 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp202 | 9.17 ± 0.00 |
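For reference, these numbers come from `llama-bench`; a command along these lines should reproduce the first table (the GGUF filename here is an assumption — substitute your own):

```shell
# Hypothetical reproduction command; model path is a placeholder.
# -ngl 100: full offload, -t 8: threads, -mg 1: main GPU, -p: prompt sizes to test.
./llama-bench -m Mistral-Small-22B-Q4_K_M.gguf \
  -ngl 100 -t 8 -mg 1 -p 384,385
```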
My GPUs aren't running out of VRAM, and they're not spilling over to GTT. Normally the GPUs run right up against their power and TDC limits during prompt processing, but as soon as the prompt size hits that threshold, the entire prompt-processing phase runs at much lower power. For example, my RX 470 draws 120 W during pp384 but only around 80 W during pp385. The memory clock also runs at a lower speed during pp385, but locking it at its maximum value didn't change anything. It's almost as if the GPU is running a different shader there.
Now, if I decrease `-ub`, prompt processing speeds magically return to normal. That's not a real fix, though: if I increase the prompt size further, it breaks again at pp577.
| model | size | params | backend | ngl | threads | n_ubatch | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp385 | 35.52 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp576 | 39.88 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp577 | 21.13 ± 0.00 |
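The second table is the same run with the micro-batch size reduced via `-ub`; something like this should reproduce it (model path again assumed):

```shell
# Hypothetical reproduction command; model path is a placeholder.
# -ub 256 halves the default micro-batch size of 512.
./llama-bench -m Mistral-Small-22B-Q4_K_M.gguf \
  -ngl 100 -t 8 -ub 256 -mg 1 -p 385,576,577
```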
This also happens regardless of whether I set the main GPU to something else, and regardless of whether flash attention is on or off.