Description
This is a weird issue that I'm currently looking into. I'll keep this open to share whatever I find, and to see if anyone else is hitting it!
I'm running llama.cpp's Vulkan backend on two GPUs, a FirePro W8100 and an RX 470, on Ubuntu 24 with RADV. On certain models, prompt processing speed drops sharply once the prompt reaches a certain size. Text generation is perfectly fine; only prompt processing is affected.
For example, here's Mistral Small 22B quantized with Q4_K_M and Q4_K_S, fully offloaded with `-sm layer`. Once the prompt hits 385 tokens for Q4_K_M or 202 tokens for Q4_K_S, prompt processing slows down by more than half:
| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp384 | 40.37 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp385 | 16.00 ± 0.00 |
| llama ?B Q4_K - Small | 11.79 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp201 | 32.42 ± 0.00 |
| llama ?B Q4_K - Small | 11.79 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp202 | 9.17 ± 0.00 |
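For reference, these numbers come from `llama-bench`; a command along these lines should reproduce the first table (the GGUF filename here is an assumption — substitute your own):

```shell
# Hypothetical reproduction command; model path is a placeholder.
# -ngl 100: full offload, -t 8: threads, -mg 1: main GPU, -p: prompt sizes to test.
./llama-bench -m Mistral-Small-22B-Q4_K_M.gguf \
  -ngl 100 -t 8 -mg 1 -p 384,385
```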
My GPUs aren't running out of VRAM, and they're not spilling over to GTT. Normally the GPUs run right up against their power and TDC limits during prompt processing, but as soon as the prompt size hits that threshold, the entire prompt-processing phase runs at much lower power. For example, my RX 470 draws 120 W during pp384 but only around 80 W during pp385. The memory clock also runs at a lower speed during pp385, but locking it at its maximum value didn't change anything. It's almost as if the GPU is running a different shader there.
Now, if I decrease `-ub`, prompt processing speeds magically return to normal. That's not a real fix, though: if I increase the prompt size further, it breaks again at pp577.
| model | size | params | backend | ngl | threads | n_ubatch | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp385 | 35.52 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp576 | 39.88 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp577 | 21.13 ± 0.00 |
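The second table is the same run with the micro-batch size reduced via `-ub`; something like this should reproduce it (model path again assumed):

```shell
# Hypothetical reproduction command; model path is a placeholder.
# -ub 256 halves the default micro-batch size of 512.
./llama-bench -m Mistral-Small-22B-Q4_K_M.gguf \
  -ngl 100 -t 8 -ub 256 -mg 1 -p 385,576,577
```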
This also happens regardless of whether I set the main GPU to something else, and regardless of whether flash attention is on or off.