Hip fattn expf approx by a-huk · Pull Request #23441 · ggml-org/llama.cpp

a-huk · 2026-05-20T20:45:32Z

Overview

Two fixes for flash attention on AMD RDNA3 GPUs (gfx1100/gfx1101/gfx1151):

1. Fix RDNA3 MMA dispatch to prevent register-spill regression

The AMD WMMA flash-attention path was dispatched for all head_dim ≤ 128, but only head_dim = 64 configs fit within the 256-VGPR wavefront budget on RDNA3 wave32. Configs with head_dim = 80–128 require 320–480+ bytes of scratch memory (91–114 spilled VGPRs for head_dim = 128), making them slower than the non-MMA tile path.

This PR tightens the dispatch guard from Q->ne[0] <= 128 to Q->ne[0] == 64, preventing the regression on models with head_dim = 80–128 (Llama 3.x, Mistral, Phi, etc.). It also sets Q_in_reg = false for the head_dim = 64, ncols = 32/64 configs, which sat exactly at the 256-VGPR limit; this saves 12 VGPRs and removes their minor scratch usage.
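The guard change can be pictured with a minimal host-side sketch. The struct `tensor_sketch` and both function names are hypothetical stand-ins, not the identifiers used in fattn.cu; only the comparison on `ne[0]` (the head dimension) is taken from the description above.

```cpp
#include <cstdint>

// Hypothetical stand-in for the tensor type; only ne[0] (head_dim) matters here.
struct tensor_sketch {
    int64_t ne[4];
};

// Old guard: the WMMA path is taken for every head_dim up to 128.
static bool use_rdna3_wmma_old(const tensor_sketch * Q) {
    return Q->ne[0] <= 128;
}

// New guard: the WMMA path is taken only for head_dim == 64.
static bool use_rdna3_wmma_new(const tensor_sketch * Q) {
    return Q->ne[0] == 64;
}
```

Under this sketch a head_dim = 128 model no longer reaches the WMMA path and falls through to whatever path the dispatcher tries next (the tile path, per the description above).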

2. Replace expf / __expf with hardware-native exp2f in all flash attention kernels

All three FA kernel implementations (fattn-mma-f16.cuh, fattn-tile.cuh, fattn-vec.cuh) used expf or __expf in the softmax inner loop. On AMD hardware exp2f is native (single instruction), while expf requires a multi-step approximation.

The substitution is mathematically exact: exp(x) = exp2(x · log₂e). Q values are pre-scaled by log₂e on load, so all subsequent KQ dot products are already in base-2 space, and the softmax exp(x − m) calls become exp2(x − m) without introducing any approximation beyond ordinary floating-point rounding. Rescaling factors for the running-max update are adjusted to match.
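To make the identity concrete, here is a small host-side C++ sketch (not the kernel code). It compares a plain base-e softmax with a tiled base-2 variant that keeps a running max and rescales the running sum with exp2. The function names, the tile loop, and the choice to scale the scores directly (rather than pre-scaling Q) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// log2(e), written out so the sketch does not depend on M_LOG2E being defined.
static const float LOG2E = 1.4426950408889634f;

// Reference: softmax in base e with the max subtracted for stability.
static std::vector<float> softmax_base_e(const std::vector<float> & kq) {
    float m = -INFINITY;
    for (float x : kq) { m = std::fmax(m, x); }
    std::vector<float> w(kq.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < kq.size(); ++i) {
        w[i] = std::exp(kq[i] - m);
        sum += w[i];
    }
    for (float & v : w) { v /= sum; }
    return w;
}

// Base-2 variant processed in tiles. Assumes tile > 0 and finite scores.
static std::vector<float> softmax_base_2_tiled(const std::vector<float> & kq, std::size_t tile) {
    // Scale once into base-2 space (the PR achieves this by pre-scaling Q).
    std::vector<float> s(kq.size());
    for (std::size_t i = 0; i < kq.size(); ++i) { s[i] = kq[i] * LOG2E; }

    float m   = -INFINITY; // running max in base-2 space
    float sum = 0.0f;      // running denominator relative to m
    for (std::size_t start = 0; start < s.size(); start += tile) {
        const std::size_t end = std::min(start + tile, s.size());
        float m_new = m;
        for (std::size_t i = start; i < end; ++i) { m_new = std::fmax(m_new, s[i]); }
        sum *= std::exp2(m - m_new); // rescale the old sum to the new max
        for (std::size_t i = start; i < end; ++i) { sum += std::exp2(s[i] - m_new); }
        m = m_new;
    }

    std::vector<float> w(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) { w[i] = std::exp2(s[i] - m) / sum; }
    return w;
}
```

Both functions should agree to within float rounding for any finite input, since every exp(x − m) term is replaced by exp2((x − m) · log₂e).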

The change is applied identically to all three kernels. On CUDA the difference is negligible (both paths use SFU hardware); on AMD it removes multi-step emulation from the softmax hotpath.

Additional information

VGPR budget analysis — RDNA3 wave32 (256-VGPR hard cap)

| head_dim | ncols | VGPRs | Scratch | Notes |
|----------|-------|-------|---------|-------|
| 64 | 8/16 | 244–246 | 0 | `Q_in_reg=true` |
| 64 | 32/64 | ~244 | 0 | `Q_in_reg=false` (this PR) |
| 80 | all | - | 52–368 B | not dispatched |
| 96 | all | - | 164–408 B | not dispatched |
| 112 | all | - | 480–720 B | not dispatched |
| 128 | all | - | 320–484 B | not dispatched (91–114 VGPRs spilling) |

The inner loop body alone (KQ_C tiles, K/V load temporaries, WMMA compiler-allocated registers) accounts for ~300 of the ~370 VGPRs needed for head_dim = 128. Fixing head_dim ≥ 80 requires kernel restructuring and is tracked separately.

exp2f — benchmark (gfx1151, Qwen3.5-27B Q4_K_M)

Measured with llama-bench -p <N> -n 0 -fa 0 -fa 1. FA=1 here uses the TILE path (DKQ=256 is not MMA-dispatched); the exp2f gains shown below apply to the TILE and VEC kernels on any GPU.

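| context | FA=0 (t/s) | FA=1 before (t/s) | FA=1 after (t/s) | delta vs before |
|---------|------------|-------------------|------------------|-----------------|
| pp512 | 310.01 | ~308 | 322.99 | +4.9% |
| pp4096 | 291.78 | 303.54 | 309.87 | +2.1% |
| pp8192 | 283.95 | 280.66 | 300.54 | +7.1% |
| pp32768 | 203.51 | 195.53 | 238.71 | +22.1% |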

The gain scales with context length because the softmax loop runs once per KV tile — more tiles means more exp calls. The 99.5% capture of the theoretical maximum (measured by temporarily removing all exp calls) confirms that exp2f is essentially free on this hardware.

MMA dispatch fix — benchmark (gfx1151, Qwen3-0.6B BF16, head_dim=64, GQA ratio=2)

| context | FA=0 (t/s) | FA=1 MMA (t/s) | delta |
|---------|------------|----------------|-------|
| pp512 | 7442 | 10256 | +38% |
| pp2048 | 4500 | 9690 | +115% |
| tg128 | ~40 | ~40 | ±0% (memory-bound, expected) |

Related: issue #21284 (gfx1151 inefficient defaults).

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

Replace expf() with __expf() in the softmax rescaling loops of fattn-mma-f16.cuh, fattn-tile.cuh, and fattn-vec.cuh.

__expf is the hardware fast-path approximation (~4x faster than the IEEE-754 expf on both CUDA SFUs and AMD hardware). The accuracy difference (~1-2 ULP) is irrelevant in the softmax context: the subtracted max already bounds the input range, and LLM logit distributions are not sensitive to this precision level.

Affected call sites (all in the per-tile softmax rescaling loop):
- fattn-mma-f16.cuh: 6 sites in flash_attn_ext_f16_iter and flash_attn_ext_f16_process_tile
- fattn-tile.cuh: 4 sites
- fattn-vec.cuh: 5 sites

Three improvements to the RDNA3 WMMA flash attention path on gfx1100/gfx1101/gfx1151:

1. Fix register spill in RDNA3 configs (fattn-mma-f16.cuh):
   - DKQ=64, ncols=32,64: Q_in_reg true→false (eliminates 8-44 byte scratch spill)
   - DKQ=80-256: Q_in_reg true→false (reduces 52-720 byte spill to <36 bytes)
2. Restrict RDNA3 WMMA dispatch to DKQ=64 (fattn.cu):
   - DKQ=80-128: 320-480+ bytes scratch remain due to VKQ accumulator pressure in the WMMA tile loop, even with Q_in_reg=false. Excluded until the inner loop can be restructured to fit within gfx1151's 256 VGPR hard cap.
   - DKQ=256: no throughput benefit on gfx1151 (Q_in_reg=false forces nbatch_fa=32, giving 1024 inner iterations at 32K context; overhead dominates).
3. Replace __expf with exp2f in MMA softmax (fattn-mma-f16.cuh):
   - Q values are pre-scaled by log2(e) on load into shared memory, converting the softmax from base-e to base-2. All exp() -> exp2() with no approximation: exp(x) == exp2(x * log2(e)) exactly.
   - exp2f is hardware-native on AMD RDNA (single v_exp_f32 instruction). Captures essentially 100% of the theoretical gain from removing exp overhead, which accounts for ~18% of FA compute at 32K context.
   - Attention sinks (sinks_f) scaled by log2(e) to stay consistent with the shifted KQ_max tracking space.

Benchmarks on Radeon 8060S (gfx1151, 40 CUs), Qwen3.5-27B Q4_K_M:

| context | FA=0 (rocBLAS) | FA=1 (before) | FA=1 (after) | delta |
|---------|----------------|---------------|--------------|--------|
| pp512 | 310 t/s | ~308 | 323 t/s | +5% |
| pp4096 | 292 t/s | 304 | 310 t/s | +2% |
| pp8192 | 284 t/s | 281 | 301 t/s | +7% |
| pp32768 | 204 t/s | 196 | 239 t/s | +22% |

FA=1 now outperforms FA=0 at all tested context lengths.
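The sink-scaling point in the commit message can be illustrated with a short host-side sketch. It assumes the sink acts as one extra logit whose term is added to the softmax denominator relative to the running max; the function names are hypothetical and the kernel's actual handling of `sinks_f` may differ in detail.

```cpp
#include <cmath>

// log2(e), written out so the sketch does not depend on M_LOG2E being defined.
static const float LOG2E = 1.4426950408889634f;

// Sink contribution in base e: exp(sink - kq_max).
static float sink_term_base_e(float sink, float kq_max) {
    return std::exp(sink - kq_max);
}

// Same contribution when the running max is tracked in base-2 space
// (kq_max2 == kq_max * log2(e)): the sink has to be scaled the same way,
// otherwise the two operands of the subtraction live in different spaces.
static float sink_term_base_2(float sink, float kq_max2) {
    return std::exp2(sink * LOG2E - kq_max2);
}
```

With `kq_max2 = kq_max * LOG2E`, the two functions should return the same value up to float rounding; dropping the `sink * LOG2E` scaling would silently change the sink's weight.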

ggml-gh-bot · 2026-05-20T20:49:55Z

Hi @a-huk, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

JohannesGaessler · 2026-05-20T21:15:51Z

As previously suggested, try -fast-math first and instead of multiple optimizations per PR please make one PR per optimization.

Also according to the llama.cpp AI usage policy:

It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

reujea0 and others added 2 commits May 19, 2026 14:39

a-huk requested a review from a team as a code owner May 20, 2026 20:45

a-huk mentioned this pull request May 20, 2026

ggml-cuda: use __expf in flash attention softmax hotpath #23339

Closed

1 task

github-actions Bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels May 21, 2026

a-huk wants to merge 2 commits into ggml-org:master from a-huk:hip-fattn-expf-approx
