CUDA: use mma FA kernel for gqa > 4 on RTX 4000 #15035

JohannesGaessler · 2025-08-02T11:43:28Z

For models such as Qwen 2.5 3b with 8 Q heads per K/V head it seems to be better to use the mma FlashAttention kernel than the vector kernel:

GPU	Model	Microbatch size	Test	t/s master	t/s `069d410`	Speedup
RTX 4090	qwen2 3B Q4_0	512	tg128	287.82	287.43	1.00
RTX 4090	qwen2 3B Q4_0	512	tg128@d4096	252.76	253.17	1.00
RTX 4090	qwen2 3B Q4_0	512	tg128@d8192	230.98	228.35	0.99
RTX 4090	qwen2 3B Q4_0	512	tg128@d12288	212.58	211.96	1.00
RTX 4090	qwen2 3B Q4_0	512	tg128@d16384	196.78	196.28	1.00
RTX 4090	qwen2 3B Q4_0	512	tg128@d20480	183.69	183.53	1.00
RTX 4090	qwen2 3B Q4_0	512	tg128@d24576	171.84	172.22	1.00
RTX 4090	qwen2 3B Q4_0	512	tg128@d28672	161.60	161.61	1.00
RTX 4090	qwen2 3B Q4_0	512	tg128@d32768	151.92	148.61	0.98
RTX 4090	qwen2 3B Q4_0	512	tg128@d36864	143.42	144.39	1.01
RTX 4090	qwen2 3B Q4_0	512	tg128@d40960	135.77	139.50	1.03
RTX 4090	qwen2 3B Q4_0	512	tg128@d45056	128.66	133.57	1.04
RTX 4090	qwen2 3B Q4_0	512	tg128@d49152	121.80	127.93	1.05
RTX 4090	qwen2 3B Q4_0	512	tg128@d53248	115.74	122.90	1.06
RTX 4090	qwen2 3B Q4_0	512	tg128@d57344	109.87	118.40	1.08
RTX 4090	qwen2 3B Q4_0	512	tg128@d61440	104.90	113.65	1.08
RTX 4090	qwen2 3B Q4_0	512	tg128@d65536	100.22	107.77	1.08

Notably if the GPU is frequency limited this difference is even larger and up to +25%. With a frequency limit it would also be better to use the mma kernel in other cases such as with LLaMA which has 4 Q heads per K/V head. And since datacenter GPUs have lower frequencies than consumer GPUs this implies that choosing kernels solely based on compute capability is suboptimal. It's currently unclear to me how to best retrieve the GPU clocks in a way that considers user-defined limits, I opened a thread in the NVIDIA developer forums and will make a PR once I get a reply.

…)" This reverts commit 03d4698.

)" This reverts commit 2a6fe6e.

CUDA: use mma FA kernel for gqa > 4 on RTX 4000

069d410

github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Aug 2, 2025

ggerganov approved these changes Aug 2, 2025

View reviewed changes

JohannesGaessler merged commit 03d4698 into ggml-org:master Aug 2, 2025
45 of 47 checks passed

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025

Revert "CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (ggml-org#15035…

2a6fe6e

…)" This reverts commit 03d4698.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025

Reapply "CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (ggml-org#15035

b73633d

)" This reverts commit 2a6fe6e.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CUDA: use mma FA kernel for gqa > 4 on RTX 4000 #15035

CUDA: use mma FA kernel for gqa > 4 on RTX 4000 #15035

Uh oh!

JohannesGaessler commented Aug 2, 2025

Uh oh!

Uh oh!

Uh oh!

CUDA: use mma FA kernel for gqa > 4 on RTX 4000 #15035

CUDA: use mma FA kernel for gqa > 4 on RTX 4000 #15035

Uh oh!

Conversation

JohannesGaessler commented Aug 2, 2025

Uh oh!

Uh oh!

Uh oh!