Conversation

JohannesGaessler (Collaborator)

For models such as Qwen 2.5 3B, which has 8 Q heads per K/V head, it seems to be better to use the mma FlashAttention kernel than the vector kernel:

| GPU | Model | Microbatch size | Test | t/s master | t/s 069d410 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128 | 287.82 | 287.43 | 1.00 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d4096 | 252.76 | 253.17 | 1.00 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d8192 | 230.98 | 228.35 | 0.99 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d12288 | 212.58 | 211.96 | 1.00 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d16384 | 196.78 | 196.28 | 1.00 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d20480 | 183.69 | 183.53 | 1.00 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d24576 | 171.84 | 172.22 | 1.00 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d28672 | 161.60 | 161.61 | 1.00 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d32768 | 151.92 | 148.61 | 0.98 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d36864 | 143.42 | 144.39 | 1.01 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d40960 | 135.77 | 139.50 | 1.03 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d45056 | 128.66 | 133.57 | 1.04 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d49152 | 121.80 | 127.93 | 1.05 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d53248 | 115.74 | 122.90 | 1.06 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d57344 | 109.87 | 118.40 | 1.08 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d61440 | 104.90 | 113.65 | 1.08 |
| RTX 4090 | qwen2 3B Q4_0 | 512 | tg128@d65536 | 100.22 | 107.77 | 1.08 |
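As a rough illustration of the kind of dispatch heuristic these numbers suggest, here is a minimal C++ sketch. All names and thresholds are hypothetical (chosen to match the crossover visible in the table above), not the actual llama.cpp kernel-selection logic:

```cpp
#include <cassert>

// Hypothetical FlashAttention kernel choices; illustrative only.
enum class fa_kernel { vector, mma };

// Pick a kernel from the GQA ratio (Q heads per K/V head) and KV cache depth.
// The table above shows the mma kernel overtaking the vector kernel for a
// GQA ratio of 8 once the context depth reaches roughly 40k tokens.
static fa_kernel choose_fa_kernel(int n_head_q, int n_head_kv, int kv_depth) {
    const int gqa_ratio = n_head_q / n_head_kv;
    if (gqa_ratio >= 8 && kv_depth >= 40960) {
        return fa_kernel::mma;
    }
    return fa_kernel::vector;
}
```

As the comment below notes, a depth threshold alone is incomplete: the crossover point also shifts with GPU clock frequency, which is what motivates querying the clocks at runtime.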

Notably, if the GPU is frequency-limited this difference is even larger, up to +25%. With a frequency limit it would also be better to use the mma kernel in other cases, such as with LLaMA, which has 4 Q heads per K/V head. And since datacenter GPUs run at lower frequencies than consumer GPUs, this implies that choosing kernels solely based on compute capability is suboptimal. It's currently unclear to me how to best retrieve the GPU clocks in a way that considers user-defined limits; I opened a thread in the NVIDIA developer forums and will make a PR once I get a reply.

@github-actions github-actions bot added the "Nvidia GPU" and "ggml" labels Aug 2, 2025
@JohannesGaessler JohannesGaessler merged commit 03d4698 into ggml-org:master Aug 2, 2025
45 of 47 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025