
[KVCache] Add Triton software NVFP4 KV cache support#44389

Open
lesj0610 wants to merge 47 commits into vllm-project:main from lesj0610:lesj/triton-nvfp4-kv-fork-20260602


@lesj0610 lesj0610 commented Jun 3, 2026

Contributor

Purpose

This PR adds Triton backend support for --kv-cache-dtype nvfp4.

NVFP4 stores KV cache as E2M1 FP4 data plus E4M3 block scales. This Triton path performs the FP4 packing and unpack/dequant in software, without relying on native FP4 conversion instructions.
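The storage layout described above can be illustrated in plain Python. This is a minimal sketch, not the PR's Triton kernel: the names `E2M1_VALUES`, `quantize_block`, and `dequantize_block` are hypothetical, the 16-element block size and low-nibble-first packing order are assumptions, and the block scale is kept as a Python float instead of being rounded to E4M3.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit),
# indexed by the 3-bit magnitude code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one scale per 16 elements


def quantize_block(block):
    """Quantize one block of floats to packed 4-bit codes plus one scale."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude onto 6.0, the largest E2M1 value.
    # A real NVFP4 path would also round this scale to E4M3 (omitted here).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Nearest representable magnitude; ties go to the smaller code.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((8 if x < 0 else 0) | idx)
    # Two 4-bit codes per byte, low nibble first (assumed order).
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_block(packed, scale):
    """Expand packed 4-bit codes back to floats using the block scale."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 7]
            out.append((-mag if code & 8 else mag) * scale)
    return out


if __name__ == "__main__":
    values = [0.1 * i - 0.8 for i in range(BLOCK_SIZE)]
    packed, scale = quantize_block(values)
    print(len(packed), "bytes of FP4 data, scale =", scale)
    print(dequantize_block(packed, scale))
```

Under these assumptions a 16-element block occupies 8 bytes of FP4 data plus one scale, which is where the capacity gain over a 16-bit cache comes from.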

The practical benefit observed below is roughly 3x KV cache capacity on Qwen3.6-27B and Qwen3.6-35B-A3B, with MRCR quality comparable to the auto KV baseline. On Qwen3.6-35B-A3B, where the TQ 4bit_nc reference also completed the 30-sample run, NVFP4 stays closer to that baseline than TQ 4bit_nc does.

Usage

# Auto-selects the Triton path when native FlashInfer NVFP4 KV is unavailable.
vllm serve <model> --kv-cache-dtype nvfp4

# Or select it explicitly.
vllm serve <model> --kv-cache-dtype nvfp4 --attention-backend TRITON_ATTN

# Equivalent settings through the offline Python API.
from vllm import LLM
LLM(model=..., kv_cache_dtype="nvfp4")
LLM(model=..., kv_cache_dtype="nvfp4", attention_backend="TRITON_ATTN")

Changes

  1. Add a Triton KV cache write path that packs FP16/BF16 K/V into software E2M1 FP4 data plus E4M3 block scales.
  2. Add an inline NVFP4 unpack/dequant path in Triton unified attention.
  3. Use bytewise packed loads for NVFP4 decode to avoid duplicate packed-byte reads in 2D and segmented decode paths.
  4. Wire NVFP4 cache shape handling, backend dtype dispatch, and per-spec cache dtype reshape logic.
  5. Add prefill/raw-current K/V handling and softcap/sliding-window tile-bound support needed by the Triton path.
  6. Add tests for NVFP4 cache write/read, MM prefix/tile bounds, and reshape dtype dispatch.
  7. Warm up Triton NVFP4 attention before the JIT monitor is activated so the decode attention variant is not first compiled during inference.
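
Item 3 above concerns how packed bytes are read during decode. The following is a minimal sketch of the idea only, assuming two FP4 codes per byte with the low nibble first; the function names are hypothetical and the real change lives in the Triton kernels, not in Python.

```python
def unpack_per_element(packed, num_elements):
    """Element-wise gather: element i reads byte i // 2, so each byte is read twice."""
    codes, loads = [], 0
    for i in range(num_elements):
        byte = packed[i // 2]
        loads += 1
        codes.append((byte >> 4) if i % 2 else (byte & 0xF))
    return codes, loads


def unpack_bytewise(packed):
    """Bytewise load: read each byte once and expand both nibbles from it."""
    codes, loads = [], 0
    for byte in packed:
        loads += 1
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes, loads


if __name__ == "__main__":
    packed = bytes([0x21, 0x43])
    print(unpack_per_element(packed, 4))
    print(unpack_bytewise(packed))
```

Both functions return the same codes; the bytewise version issues half as many loads in this toy model.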

KV Cache Capacity

Observed GPU KV cache size from the serving runs below. TQ 4bit_nc is included as a 4-bit KV reference point.

| Model | KV dtype | GPU KV cache size | vs auto |
| --- | --- | --- | --- |
| Qwen3.6-27B | auto | 294,183 tokens | 1.00x |
| Qwen3.6-27B | nvfp4 | 882,551 tokens | 3.00x |
| Qwen3.6-27B | TQ 4bit_nc | 943,250 tokens | 3.21x |
| Qwen3.6-35B-A3B | auto | 745,237 tokens | 1.00x |
| Qwen3.6-35B-A3B | nvfp4 | 2,187,264 tokens | 2.93x |
| Qwen3.6-35B-A3B | TQ 4bit_nc | 2,380,148 tokens | 3.19x |
| Gemma4-31B | auto | 122,360 tokens | 1.00x |
| Gemma4-31B | nvfp4 | 438,681 tokens | 3.59x |
| Gemma4-31B | TQ 4bit_nc | startup failed* | - |
| Gemma4-26B-A4B | auto | 590,793 tokens | 1.00x |
| Gemma4-26B-A4B | nvfp4 | 2,109,264 tokens | 3.57x |
| Gemma4-26B-A4B | TQ 4bit_nc | startup failed* | - |

\* Gemma4 models force TRITON_ATTN; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend, so no capacity number was available.
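
As a rough cross-check of these ratios, the sketch below compares the observed token counts with an idealized per-element storage cost. It assumes a 16-bit auto baseline, 4 bits of data per element, and one 1-byte scale per 16 elements; those layout details and the names used are assumptions, not taken from the PR.

```python
# Idealized storage cost per cached element, in bytes.
BASELINE_BYTES = 2.0      # assumption: auto resolves to a 16-bit dtype
FP4_DATA_BYTES = 0.5      # 4 bits of E2M1 data per element
SCALE_BYTES = 1.0         # assumption: one E4M3 scale byte per block
BLOCK_SIZE = 16           # assumption: elements per scale block

nvfp4_bytes = FP4_DATA_BYTES + SCALE_BYTES / BLOCK_SIZE
ideal_ratio = BASELINE_BYTES / nvfp4_bytes

# (auto tokens, nvfp4 tokens) as reported in the capacity table.
OBSERVED = {
    "Qwen3.6-27B": (294_183, 882_551),
    "Qwen3.6-35B-A3B": (745_237, 2_187_264),
    "Gemma4-31B": (122_360, 438_681),
    "Gemma4-26B-A4B": (590_793, 2_109_264),
}


def observed_ratio(model):
    auto_tokens, nvfp4_tokens = OBSERVED[model]
    return nvfp4_tokens / auto_tokens


if __name__ == "__main__":
    print(f"idealized ratio: {ideal_ratio:.2f}x")
    for model in OBSERVED:
        print(f"{model}: {observed_ratio(model):.2f}x")
```

Under these assumptions the idealized ratio is 32/9, about 3.56x. The Gemma4 rows land close to that figure, while the Qwen rows are nearer 3x; the PR does not state the reason for the difference.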

Quality Benchmark

MRCR, 30 samples, 32K context. Prefix hit rate was 1.0000 for all completed runs.

| Model | KV dtype | Match ratio | n8 match ratio | Output tok/s | Status |
| --- | --- | --- | --- | --- | --- |
| Qwen3.6-27B | auto | 0.9388 | 0.8243 | 16.60 | completed |
| Qwen3.6-27B | nvfp4 | 0.9433 | 0.8378 | 10.34 | completed |
| Qwen3.6-27B | TQ 4bit_nc | - | - | - | failed |
| Qwen3.6-35B-A3B | auto | 0.9258 | 0.7855 | 66.81 | completed |
| Qwen3.6-35B-A3B | nvfp4 | 0.9242 | 0.7807 | 38.10 | completed |
| Qwen3.6-35B-A3B | TQ 4bit_nc | 0.8983 | 0.7028 | 64.61 | completed |
| Gemma4-31B | auto | 0.7109 | 0.4714 | 14.59 | completed |
| Gemma4-31B | nvfp4 | 0.7090 | 0.4612 | 8.78 | completed |
| Gemma4-31B | TQ 4bit_nc | - | - | - | startup failed* |
| Gemma4-26B-A4B | auto | 0.4337 | 0.3584 | 50.59 | completed |
| Gemma4-26B-A4B | nvfp4 | 0.4537 | 0.3222 | 29.44 | completed |
| Gemma4-26B-A4B | TQ 4bit_nc | - | - | - | startup failed* |

The main quality signal here is the hardest MRCR slice (n8). On Qwen3.6-35B-A3B, NVFP4 keeps a higher n8 match ratio than TQ 4bit_nc (0.7807 vs 0.7028, +7.79 pp) while providing a similar ~3x KV capacity increase. On Qwen3.6-27B, NVFP4 completed the 32K MRCR run with no quality regression versus auto, whereas TQ 4bit_nc failed during continuation prefill under the same setup. On Gemma4-31B and Gemma4-26B-A4B, NVFP4 also completed the same 30-sample MRCR run with prefix hit rate 1.0000. On Gemma4-26B-A4B the overall match ratio rose (0.4337 to 0.4537) while the n8 match ratio fell (0.3584 to 0.3222).

\* Gemma4 models force TRITON_ATTN in this setup; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend.

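These results do not claim quality coverage beyond 32K. They show that at the tested 32K long-context setting, NVFP4 stays close to the auto baseline and, on Qwen3.6-35B-A3B (the only model where TQ 4bit_nc completed), preserves retrieval quality better than the 4-bit TQ reference point.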
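
The n8 comparison can be restated as percentage-point deltas against each model's auto baseline. This is a small worked calculation using only the n8 values from the quality table; the dictionary and function names are hypothetical.

```python
# n8 match ratios copied from the MRCR table (completed runs only).
N8 = {
    "Qwen3.6-27B": {"auto": 0.8243, "nvfp4": 0.8378},
    "Qwen3.6-35B-A3B": {"auto": 0.7855, "nvfp4": 0.7807, "TQ 4bit_nc": 0.7028},
    "Gemma4-31B": {"auto": 0.4714, "nvfp4": 0.4612},
    "Gemma4-26B-A4B": {"auto": 0.3584, "nvfp4": 0.3222},
}


def delta_pp(model, kv_dtype):
    """Difference from the auto baseline in percentage points."""
    row = N8[model]
    return round((row[kv_dtype] - row["auto"]) * 100, 2)


if __name__ == "__main__":
    for model, row in N8.items():
        for kv_dtype in row:
            if kv_dtype != "auto":
                print(f"{model} {kv_dtype}: {delta_pp(model, kv_dtype):+.2f} pp")
```

With 30 samples per run, differences of a point or two are within what a single run can resolve only loosely, so the deltas are best read as indicative.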

Serving Benchmark

vllm bench serve, random 8K prompts, output length 64, 16 requests, and max_num_seqs=64. Qwen runs used default max_num_batched_tokens; Gemma4 runs used TRITON_ATTN with max_num_batched_tokens=2496 for the model's multimodal budget/backend constraints.

| Model | KV dtype | Completed | Mean TTFT | Mean TPOT | Output tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.6-27B | auto | 16/16 | 74.89 s | 1023.87 ms | 7.30 | - |
| Qwen3.6-27B | nvfp4 | 16/16 | 93.33 s | 1284.25 ms | 5.85 | - |
| Qwen3.6-27B | TQ 4bit_nc | 16/16 | 75.03 s | 1038.18 ms | 7.24 | - |
| Qwen3.6-35B-A3B | auto | 16/16 | 17.47 s | 244.57 ms | 30.73 | - |
| Qwen3.6-35B-A3B | nvfp4 | 16/16 | 22.61 s | 317.83 ms | 23.76 | - |
| Qwen3.6-35B-A3B | TQ 4bit_nc | 16/16 | 18.12 s | 260.88 ms | 29.20 | - |
| Gemma4-31B | auto | 16/16 | 83.60 s | 1040.14 ms | 6.47 | - |
| Gemma4-31B | nvfp4 | 16/16 | 112.80 s | 1596.84 ms | 4.77 | - |
| Gemma4-31B | TQ 4bit_nc | - | - | - | - | startup failed* |
| Gemma4-26B-A4B | auto | 16/16 | 25.45 s | 362.40 ms | 21.00 | - |
| Gemma4-26B-A4B | nvfp4 | 16/16 | 32.67 s | 470.47 ms | 16.27 | - |
| Gemma4-26B-A4B | TQ 4bit_nc | - | - | - | - | startup failed* |

\* Gemma4 models force TRITON_ATTN; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend.
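
The throughput cost of NVFP4 in this benchmark can be expressed relative to auto. The sketch below only reworks the output tok/s column from the serving table; the names are hypothetical.

```python
# (auto output tok/s, nvfp4 output tok/s) from the serving benchmark table.
SERVING_TOK_S = {
    "Qwen3.6-27B": (7.30, 5.85),
    "Qwen3.6-35B-A3B": (30.73, 23.76),
    "Gemma4-31B": (6.47, 4.77),
    "Gemma4-26B-A4B": (21.00, 16.27),
}


def relative_change_pct(model):
    """Percentage change in output tok/s when switching from auto to nvfp4."""
    auto_tps, nvfp4_tps = SERVING_TOK_S[model]
    return (nvfp4_tps / auto_tps - 1.0) * 100.0


if __name__ == "__main__":
    for model in SERVING_TOK_S:
        print(f"{model}: {relative_change_pct(model):+.1f}%")
```

Across the four models the change falls between roughly -20% and -26%, which is the price paid here for the larger KV cache.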

Notes

  • The bytewise NVFP4 decode path reduces duplicate packed-byte reads in 2D and segmented decode. In an 8K-prompt / 256-output steady decode check on Qwen3.6-27B, NVFP4 output throughput improved from 20.36 to 22.35 tok/s and mean TPOT decreased from 418.63 to 350.21 ms; the corresponding auto baseline from the earlier run was 24.88 tok/s and 329.49 ms.
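  • The intended trade-off is much larger KV cache capacity while keeping MRCR quality comparable to the auto KV baseline, at the cost of lower throughput: in the serving benchmark above, NVFP4 output tok/s was about 20% to 26% below auto.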
  • The validation above is CUDA-only; ROCm and XPU were not tested.
  • AI assistance: Codex and Claude were used in preparing this PR.

Validation

  • git diff --check HEAD
  • python3 -m py_compile on changed Python files
  • ruff check / ruff format --check on changed Python files
  • pytest tests/kernels/attention/test_attention_selector.py -k "nvfp4 or flash_attn_rejects" -q -> 8 passed
  • pytest tests/kernels/attention/test_triton_unified_attention.py -k nvfp4 -q -s -> 34 passed
  • pytest tests/kernels/attention/test_cache.py -k "reshape_and_cache_flash and nvfp4 and triton" -q -s -> 24 passed, 264 skipped
  • pytest tests/kernels/attention/test_triton_prefill_attention.py -q -> 113 passed
  • pytest tests/v1/worker/test_gpu_model_runner.py -k "attn_backend_cache_dtype_str or reshape_skipped_attention" -q -> 2 passed, 34 deselected
  • pytest tests/v1/worker/test_gpu_model_runner.py -k 'triton_nvfp4_attention_warmup' -q -> 4 passed, 36 deselected


mergify Bot commented Jun 3, 2026


Documentation preview: https://vllm--44389.org.readthedocs.build/en/44389/

@mergify mergify Bot added documentation Improvements or additions to documentation nvidia v1 labels Jun 3, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77084a163b


Comment thread vllm/v1/attention/backends/triton_attn.py
lesj0610 added 2 commits June 3, 2026 16:57
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
lesj0610 and others added 2 commits June 11, 2026 13:32
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
(cherry picked from commit a567de7)

mergify Bot commented Jun 11, 2026


This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lesj0610.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 11, 2026
…fork-20260602

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

# Conflicts:
#	docs/design/attention_backends.md
@mergify mergify Bot removed the needs-rebase label Jun 12, 2026

mergify Bot commented Jun 12, 2026


This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lesj0610.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 12, 2026
…fork-20260602

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>

# Conflicts:
#	docs/design/attention_backends.md
#	vllm/v1/attention/backends/triton_attn.py
#	vllm/v1/attention/ops/triton_attention_helpers.py
#	vllm/v1/attention/ops/triton_unified_attention.py
@mergify mergify Bot removed the needs-rebase label Jun 14, 2026
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
@lesj0610 lesj0610 requested a review from NickLucche as a code owner June 15, 2026 23:10
lesj0610 and others added 18 commits June 16, 2026 08:17
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Route NVFP4 scalar pure-prefill through context attention, disable raw-current for pure-prefill and mismatched current K/V shapes, and avoid reading stale current slots in boundary tiles. Add mixed-causal NVFP4 coverage for the affected paths.

Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>