Misc. bug: vulkan prompt processing suddenly slows down once I reach a certain prompt size #13765

@netrunnereve

Description

This is a weird issue that I'm currently looking into, and I'll keep this open to share whatever I can find and also to see if anyone else is seeing this!

I'm running llama.cpp Vulkan on two GPUs, a FirePro W8100 and a RX 470 using Ubuntu 24 and RADV. On certain models prompt processing speed drops significantly once the prompt size reaches a certain amount. Text generation is perfectly fine and this only affects prompt processing.

For example here's Mistral Small 22B quantized with Q4_K_M and Q4_K_S, fully offloaded with -sm layer. Once the prompt hits 385 tokens for Q4_K_M or 202 tokens for Q4_K_S it slows down by more than half.

| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp384 | 40.37 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp385 | 16.00 ± 0.00 |
| llama ?B Q4_K - Small | 11.79 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp201 | 32.42 ± 0.00 |
| llama ?B Q4_K - Small | 11.79 GiB | 22.25 B | Vulkan | 100 | 8 | 1 | pp202 | 9.17 ± 0.00 |

My GPUs aren't running out of VRAM and they're not swapping to GTT. They normally run right up to their power and TDC limits during prompt processing, but as soon as the prompt size hits that threshold the entire prompt processing segment runs at much lower power. For example, my 470 draws 120W during pp384 but only around 80W during pp385. The memory clock also runs slower during pp385, and locking it at its maximum value didn't change anything. It's almost as if the GPU is running a different shader there.

Now if I decrease -ub the prompt processing speeds magically return to normal. However, that's not a real fix: if I increase the prompt size further it breaks again at pp577.

| model | size | params | backend | ngl | threads | n_ubatch | main_gpu | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp385 | 35.52 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp576 | 39.88 ± 0.00 |
| llama ?B Q4_K - Medium | 12.42 GiB | 22.25 B | Vulkan | 100 | 8 | 256 | 1 | pp577 | 21.13 ± 0.00 |

This also happens regardless of which GPU I set as the main GPU, and whether flash attention is on or off.
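For anyone trying to reproduce this, the tables above correspond to llama-bench runs along these lines (the model filename is a placeholder; adjust it to your local quantized GGUF):

```shell
# Fully offloaded, layers split across both GPUs, main GPU 1.
# pp384 runs at full speed, pp385 drops by more than half:
./llama-bench -m mistral-small-22b-q4_k_m.gguf \
    -ngl 100 -t 8 -sm layer -mg 1 \
    -p 384,385

# Same setup with a smaller ubatch, which moves the breakpoint to pp577:
./llama-bench -m mistral-small-22b-q4_k_m.gguf \
    -ngl 100 -t 8 -sm layer -mg 1 -ub 256 \
    -p 385,576,577
```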
