[KVCache] Add Triton software NVFP4 KV cache support by lesj0610 · Pull Request #44389 · vllm-project/vllm

lesj0610 · 2026-06-03T06:52:09Z

Purpose

This PR adds Triton backend support for --kv-cache-dtype nvfp4.

NVFP4 stores KV cache as E2M1 FP4 data plus E4M3 block scales. This Triton path performs the FP4 packing and unpack/dequant in software, without relying on native FP4 conversion instructions.
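
As a rough illustration of what software packing and unpack/dequant can look like, here is a minimal pure-Python sketch for a single block. It is not the Triton kernel from this PR. The 16-element block size follows the usual NVFP4 layout; the low-nibble-first packing order, the nearest-value rounding (ties go to the smaller magnitude), and keeping the block scale as a plain Python float instead of rounding it to E4M3 are assumptions made for brevity. All function names are hypothetical.

```python
# Representable E2M1 magnitudes, indexed by the low 3 bits of the 4-bit code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements sharing one block scale


def encode_e2m1(x: float) -> int:
    """Map a float to a 4-bit code: bit 3 is the sign, bits 0-2 the magnitude index."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate at the largest E2M1 magnitude
    idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | idx


def decode_e2m1(code: int) -> float:
    """Inverse of encode_e2m1 for a 4-bit code."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def quantize_block(values):
    """Return (packed_bytes, scale) for one block of BLOCK floats."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    # Assumption: scale kept as a float; a real NVFP4 path stores it as E4M3.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = [encode_e2m1(v / scale) for v in values]
    # Assumption: even element in the low nibble, odd element in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return packed, scale


def dequantize_block(packed, scale):
    """Unpack two codes per byte and rescale them."""
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

In this sketch, `dequantize_block(*quantize_block(values))` returns 16 floats from 8 packed bytes plus one scale, which is where the storage saving comes from.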

The practical benefit observed below is roughly 3x KV cache capacity on Qwen3.6-27B and Qwen3.6-35B-A3B (about 3.6x on the Gemma4 models), with MRCR quality comparable to the auto KV baseline. On Qwen3.6-35B-A3B, the only model where TQ 4bit_nc also completed the 30-sample run, NVFP4 is closer to that baseline than the TQ 4bit_nc reference.

Usage

# Auto-selects the Triton path when native FlashInfer NVFP4 KV is unavailable.
vllm serve <model> --kv-cache-dtype nvfp4

# Or select it explicitly.
vllm serve <model> --kv-cache-dtype nvfp4 --attention-backend TRITON_ATTN

LLM(model=..., kv_cache_dtype="nvfp4")
LLM(model=..., kv_cache_dtype="nvfp4", attention_backend="TRITON_ATTN")

Changes

Add a Triton KV cache write path that packs FP16/BF16 K/V into software E2M1 FP4 data plus E4M3 block scales.
Add an inline NVFP4 unpack/dequant path in Triton unified attention.
Use bytewise packed loads for NVFP4 decode to avoid duplicate packed-byte reads in 2D and segmented decode paths.
Wire NVFP4 cache shape handling, backend dtype dispatch, and per-spec cache dtype reshape logic.
Add prefill/raw-current K/V handling and softcap/sliding-window tile-bound support needed by the Triton path.
Add tests for NVFP4 cache write/read, MM prefix/tile bounds, and reshape dtype dispatch.
Warm up Triton NVFP4 attention before the JIT monitor is activated so the decode attention variant is not first compiled during inference.
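
The idea behind the bytewise packed loads can be sketched in plain Python by counting reads. This is only an illustration under assumptions, not the kernel code: `CountingBuffer`, `decode_per_element`, and `decode_bytewise` are hypothetical names, the low-nibble-first order is assumed, and block scales are left out. Since each packed byte holds two E2M1 codes, addressing the cache per element touches every byte twice, while addressing it per byte touches each byte once and splits it into both nibbles.

```python
# 16-entry lookup table: 4-bit E2M1 code -> value (bit 3 is the sign).
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


class CountingBuffer:
    """Wraps packed bytes and counts how many single-byte loads are issued."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def load(self, index: int) -> int:
        self.reads += 1
        return self.data[index]


def decode_per_element(buf: CountingBuffer, n_elements: int):
    """One load per element: every packed byte is read twice."""
    out = []
    for e in range(n_elements):
        byte = buf.load(e // 2)
        nibble = (byte >> 4) if e % 2 else (byte & 0xF)
        out.append(E2M1_LUT[nibble])
    return out


def decode_bytewise(buf: CountingBuffer, n_elements: int):
    """One load per packed byte: both nibbles come from the same read."""
    out = []
    for b in range(n_elements // 2):
        byte = buf.load(b)
        out.append(E2M1_LUT[byte & 0xF])
        out.append(E2M1_LUT[byte >> 4])
    return out
```

Both functions return the same values; only the number of loads differs (`n_elements` versus `n_elements // 2`).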

KV Cache Capacity

Observed GPU KV cache size from the serving runs below. TQ 4bit_nc is included as a 4-bit KV reference point.

Model	KV dtype	GPU KV cache size	vs auto
Qwen3.6-27B	auto	294,183 tokens	1.00x
Qwen3.6-27B	nvfp4	882,551 tokens	3.00x
Qwen3.6-27B	TQ 4bit_nc	943,250 tokens	3.21x
Qwen3.6-35B-A3B	auto	745,237 tokens	1.00x
Qwen3.6-35B-A3B	nvfp4	2,187,264 tokens	2.93x
Qwen3.6-35B-A3B	TQ 4bit_nc	2,380,148 tokens	3.19x
Gemma4-31B	auto	122,360 tokens	1.00x
Gemma4-31B	nvfp4	438,681 tokens	3.59x
Gemma4-31B	TQ 4bit_nc	startup failed*	-
Gemma4-26B-A4B	auto	590,793 tokens	1.00x
Gemma4-26B-A4B	nvfp4	2,109,264 tokens	3.57x
Gemma4-26B-A4B	TQ 4bit_nc	startup failed*	-

* Gemma4 models force TRITON_ATTN; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend, so no capacity number was available.
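
For orientation, the storage arithmetic behind these ratios can be sketched as follows. The sketch assumes the auto baseline stores 16-bit K/V elements and that each NVFP4 block holds 16 four-bit elements plus one 8-bit E4M3 scale; it ignores padding and any other per-layer memory, and it does not attempt to explain why the observed ratios differ between models. The token counts are copied from the table above; the function names are hypothetical.

```python
def nvfp4_bits_per_element(block_size=16, data_bits=4, scale_bits=8):
    """Average storage per element: FP4 payload plus its share of the block scale."""
    return data_bits + scale_bits / block_size


def ideal_capacity_ratio(baseline_bits=16):
    """Upper bound on the capacity gain over an assumed 16-bit KV cache."""
    return baseline_bits / nvfp4_bits_per_element()


# GPU KV cache size in tokens: model -> (auto, nvfp4), from the table above.
OBSERVED_TOKENS = {
    "Qwen3.6-27B": (294_183, 882_551),
    "Qwen3.6-35B-A3B": (745_237, 2_187_264),
    "Gemma4-31B": (122_360, 438_681),
    "Gemma4-26B-A4B": (590_793, 2_109_264),
}


def observed_ratio(model):
    """Observed nvfp4 / auto capacity ratio for one model."""
    auto_tokens, nvfp4_tokens = OBSERVED_TOKENS[model]
    return nvfp4_tokens / auto_tokens


if __name__ == "__main__":
    print(f"ideal: {ideal_capacity_ratio():.2f}x")
    for name in OBSERVED_TOKENS:
        print(f"{name}: {observed_ratio(name):.2f}x")
```

Under these assumptions the per-element cost is 4.5 bits, so the ideal gain is 16 / 4.5, about 3.56x.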

Quality Benchmark

MRCR, 30 samples, 32K context. Prefix hit rate was 1.0000 for all completed runs.

Model	KV dtype	Match ratio	n8 match ratio	Output tok/s	Status
Qwen3.6-27B	auto	0.9388	0.8243	16.60	completed
Qwen3.6-27B	nvfp4	0.9433	0.8378	10.34	completed
Qwen3.6-27B	TQ 4bit_nc	-	-	-	failed
Qwen3.6-35B-A3B	auto	0.9258	0.7855	66.81	completed
Qwen3.6-35B-A3B	nvfp4	0.9242	0.7807	38.10	completed
Qwen3.6-35B-A3B	TQ 4bit_nc	0.8983	0.7028	64.61	completed
Gemma4-31B	auto	0.7109	0.4714	14.59	completed
Gemma4-31B	nvfp4	0.7090	0.4612	8.78	completed
Gemma4-31B	TQ 4bit_nc	-	-	-	startup failed*
Gemma4-26B-A4B	auto	0.4337	0.3584	50.59	completed
Gemma4-26B-A4B	nvfp4	0.4537	0.3222	29.44	completed
Gemma4-26B-A4B	TQ 4bit_nc	-	-	-	startup failed*

The main quality signal here is the hardest MRCR slice (n8). On Qwen3.6-35B-A3B, NVFP4 keeps a higher n8 match ratio than TQ 4bit_nc (0.7807 vs 0.7028, +7.79 pp) while providing a similar ~3x KV capacity increase. On Qwen3.6-27B, NVFP4 completed the 32K MRCR run with no quality regression versus auto, while TQ 4bit_nc failed during continuation prefill under the same setup. On Gemma4-31B and Gemma4-26B-A4B, NVFP4 also completed the same 30-sample MRCR run with prefix hit rate 1.0000; match ratios stay close to auto, except that the n8 ratio on Gemma4-26B-A4B drops from 0.3584 to 0.3222.

* Gemma4 models force TRITON_ATTN in this setup; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend.

These results do not claim quality coverage beyond 32K. They show that at the tested 32K long-context setting, NVFP4 preserves retrieval quality better than the 4-bit TQ reference point on Qwen3.6-35B-A3B, the one model where both runs completed.
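
The percentage-point comparisons above can be reproduced from the table with a few lines of Python. This is only a convenience sketch: the numbers are copied from the MRCR table, rows without a completed run are omitted, and `delta_pp` is a hypothetical helper name.

```python
# (match ratio, n8 match ratio) for each completed run in the MRCR table.
MRCR = {
    ("Qwen3.6-27B", "auto"): (0.9388, 0.8243),
    ("Qwen3.6-27B", "nvfp4"): (0.9433, 0.8378),
    ("Qwen3.6-35B-A3B", "auto"): (0.9258, 0.7855),
    ("Qwen3.6-35B-A3B", "nvfp4"): (0.9242, 0.7807),
    ("Qwen3.6-35B-A3B", "TQ 4bit_nc"): (0.8983, 0.7028),
    ("Gemma4-31B", "auto"): (0.7109, 0.4714),
    ("Gemma4-31B", "nvfp4"): (0.7090, 0.4612),
    ("Gemma4-26B-A4B", "auto"): (0.4337, 0.3584),
    ("Gemma4-26B-A4B", "nvfp4"): (0.4537, 0.3222),
}


def delta_pp(model, kv_dtype, column=1):
    """Difference versus the auto baseline in percentage points.

    column=0 selects the overall match ratio, column=1 the n8 slice.
    """
    value = MRCR[(model, kv_dtype)][column]
    baseline = MRCR[(model, "auto")][column]
    return (value - baseline) * 100


if __name__ == "__main__":
    for (model, kv_dtype) in MRCR:
        if kv_dtype != "auto":
            print(f"{model} {kv_dtype}: n8 {delta_pp(model, kv_dtype):+.2f} pp vs auto")
```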

Serving Benchmark

vllm bench serve, random 8K prompts, output length 64, 16 requests, and max_num_seqs=64. Qwen runs used default max_num_batched_tokens; Gemma4 runs used TRITON_ATTN with max_num_batched_tokens=2496 for the model's multimodal budget/backend constraints.

Model	KV dtype	Completed	Mean TTFT	Mean TPOT	Output tok/s	Notes
Qwen3.6-27B	auto	16/16	74.89 s	1023.87 ms	7.30	-
Qwen3.6-27B	nvfp4	16/16	93.33 s	1284.25 ms	5.85	-
Qwen3.6-27B	TQ 4bit_nc	16/16	75.03 s	1038.18 ms	7.24	-
Qwen3.6-35B-A3B	auto	16/16	17.47 s	244.57 ms	30.73	-
Qwen3.6-35B-A3B	nvfp4	16/16	22.61 s	317.83 ms	23.76	-
Qwen3.6-35B-A3B	TQ 4bit_nc	16/16	18.12 s	260.88 ms	29.20	-
Gemma4-31B	auto	16/16	83.60 s	1040.14 ms	6.47	-
Gemma4-31B	nvfp4	16/16	112.80 s	1596.84 ms	4.77	-
Gemma4-31B	TQ 4bit_nc	-	-	-	-	startup failed*
Gemma4-26B-A4B	auto	16/16	25.45 s	362.40 ms	21.00	-
Gemma4-26B-A4B	nvfp4	16/16	32.67 s	470.47 ms	16.27	-
Gemma4-26B-A4B	TQ 4bit_nc	-	-	-	-	startup failed*

* Gemma4 models force TRITON_ATTN; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend.
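
To put the serving cost of the software path in relative terms, the following sketch derives NVFP4 throughput and TPOT relative to auto from the table above. The values are copied from the table; the helper names are hypothetical and the sketch adds no new measurements.

```python
# model -> {kv_dtype: (mean TPOT in ms, output tok/s)}, from the serving table.
SERVE = {
    "Qwen3.6-27B": {"auto": (1023.87, 7.30), "nvfp4": (1284.25, 5.85)},
    "Qwen3.6-35B-A3B": {"auto": (244.57, 30.73), "nvfp4": (317.83, 23.76)},
    "Gemma4-31B": {"auto": (1040.14, 6.47), "nvfp4": (1596.84, 4.77)},
    "Gemma4-26B-A4B": {"auto": (362.40, 21.00), "nvfp4": (470.47, 16.27)},
}


def relative_throughput(model, kv_dtype="nvfp4"):
    """Output tok/s of kv_dtype as a fraction of the auto baseline."""
    return SERVE[model][kv_dtype][1] / SERVE[model]["auto"][1]


def tpot_overhead_pct(model, kv_dtype="nvfp4"):
    """Increase in mean TPOT over the auto baseline, in percent."""
    return (SERVE[model][kv_dtype][0] / SERVE[model]["auto"][0] - 1) * 100


if __name__ == "__main__":
    for name in SERVE:
        print(
            f"{name}: {relative_throughput(name):.2f}x throughput, "
            f"+{tpot_overhead_pct(name):.1f}% TPOT"
        )
```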

Notes

The bytewise NVFP4 decode path reduces duplicate packed-byte reads in 2D and segmented decode. In an 8K-prompt / 256-output steady decode check on Qwen3.6-27B, NVFP4 output throughput improved from 20.36 to 22.35 tok/s and mean TPOT decreased from 418.63 to 350.21 ms; the corresponding auto baseline from the earlier run was 24.88 tok/s and 329.49 ms.
The intended trade-off is much larger KV cache capacity while keeping MRCR quality comparable to the auto KV baseline, at the cost of lower throughput than auto in the benchmarks above.
The validation above is CUDA-only; ROCm and XPU were not tested.
AI assistance: Codex and Claude were used in preparing this PR.

Validation

git diff --check HEAD
python3 -m py_compile on changed Python files
ruff check / ruff format --check on changed Python files
pytest tests/kernels/attention/test_attention_selector.py -k "nvfp4 or flash_attn_rejects" -q -> 8 passed
pytest tests/kernels/attention/test_triton_unified_attention.py -k nvfp4 -q -s -> 34 passed
pytest tests/kernels/attention/test_cache.py -k "reshape_and_cache_flash and nvfp4 and triton" -q -s -> 24 passed, 264 skipped
pytest tests/kernels/attention/test_triton_prefill_attention.py -q -> 113 passed
pytest tests/v1/worker/test_gpu_model_runner.py -k "attn_backend_cache_dtype_str or reshape_skipped_attention" -q -> 2 passed, 34 deselected
pytest tests/v1/worker/test_gpu_model_runner.py -k 'triton_nvfp4_attention_warmup' -q -> 4 passed, 36 deselected

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

mergify · 2026-06-03T06:57:37Z

Documentation preview: https://vllm--44389.org.readthedocs.build/en/44389/

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77084a163b

mergify · 2026-06-11T16:39:17Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lesj0610.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-06-12T06:14:50Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lesj0610.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

…fork-20260602 Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com> # Conflicts: # docs/design/attention_backends.md # vllm/v1/attention/backends/triton_attn.py # vllm/v1/attention/ops/triton_attention_helpers.py # vllm/v1/attention/ops/triton_unified_attention.py

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Route NVFP4 scalar pure-prefill through context attention, disable raw-current for pure-prefill and mismatched current K/V shapes, and avoid reading stale current slots in boundary tiles. Add mixed-causal NVFP4 coverage for the affected paths. Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

lesj0610 and others added 6 commits June 3, 2026 00:29

Add Triton NVFP4 KV cache support

a5e1679

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

459d818

Drop overlapping worker stride test from NVFP4 PR

15362ac

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Route non-SM100 NVFP4 KV to Triton

af8ba7a

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

463a8b9

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

77084a1

lesj0610 requested review from AndreasKaratzas, LucasWilkinson, MatthewBonanni, WoosukKwon, heheda12345, mgoin, njhill, pavanimajety, tdoublep, tlrmchlsmth, vadiklyutiy, yewentao256 and zyongye as code owners June 3, 2026 06:52

mergify Bot added documentation Improvements or additions to documentation nvidia v1 labels Jun 3, 2026

github-project-automation Bot added this to NVIDIA Jun 3, 2026

chatgpt-codex-connector Bot reviewed Jun 3, 2026

Comment thread vllm/v1/attention/backends/triton_attn.py

lesj0610 added 2 commits June 3, 2026 16:57

Tune Triton NVFP4 decode launch config

74046e1

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Reject Triton NVFP4 diff-KV configs

c6507d1

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

lesj0610 requested review from alexm-redhat, youkaichao and zhuohan123 as code owners June 3, 2026 11:28

lesj0610 and others added 2 commits June 11, 2026 13:32

Fix FlashInfer attention capability signature

c5a2724

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com> (cherry picked from commit a567de7)

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

1775ea8

mergify Bot added the needs-rebase label Jun 11, 2026

Merge remote-tracking branch 'origin/main' into lesj/triton-nvfp4-kv-…

9360044

…fork-20260602 Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com> # Conflicts: # docs/design/attention_backends.md

mergify Bot removed the needs-rebase label Jun 12, 2026

mergify Bot added the needs-rebase label Jun 12, 2026

mergify Bot removed the needs-rebase label Jun 14, 2026

Handle DiffusionGemma sampler warmup capabilities

87efa4a

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

lesj0610 requested a review from NickLucche as a code owner June 15, 2026 23:10

lesj0610 and others added 18 commits June 16, 2026 08:17

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

bcface4

Fix NVFP4 warmup for V2 model runner

0c2605f

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

e6506e1

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

e51caa5

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

bd0cff4

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

0593a7f

Optimize Triton NVFP4 decode paths

f74ef20

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Tune Triton NVFP4 sliding prefill launch config

d868807

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Optimize NVFP4 raw-current V loads

4d9f533

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Tune NVFP4 full h256 launch config

9216dde

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Use LUT decode for h256 NVFP4 values

f0d49ec

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Coalesce NVFP4 bytewise V scale loads

242d88f

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Optimize NVFP4 E2M1 decode

d4280a2

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Optimize NVFP4 prefill kernels

041e674

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Keep only NVFP4 prefill warmup in PR 44389

8cbb231

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Optimize NVFP4 raw-current split attention

ea576a0

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

Merge branch 'main' into lesj/triton-nvfp4-kv-fork-20260602

4431af5

lesj0610 wants to merge 47 commits into vllm-project:main from lesj0610:lesj/triton-nvfp4-kv-fork-20260602