[KVCache] Add Triton software NVFP4 KV cache support #44389
Open
lesj0610 wants to merge 47 commits into
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Documentation preview: https://vllm--44389.org.readthedocs.build/en/44389/
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 77084a163b
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com> (cherry picked from commit a567de7)
This pull request has merge conflicts that must be resolved before it can be
…fork-20260602
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
# Conflicts:
# docs/design/attention_backends.md
…fork-20260602
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
# Conflicts:
# docs/design/attention_backends.md
# vllm/v1/attention/backends/triton_attn.py
# vllm/v1/attention/ops/triton_attention_helpers.py
# vllm/v1/attention/ops/triton_unified_attention.py
Route NVFP4 scalar pure-prefill through context attention, disable raw-current for pure-prefill and mismatched current K/V shapes, and avoid reading stale current slots in boundary tiles. Add mixed-causal NVFP4 coverage for the affected paths.
Signed-off-by: lesj0610 <lesj0610@users.noreply.github.com>
Purpose
This PR adds Triton backend support for `--kv-cache-dtype nvfp4`.
NVFP4 stores the KV cache as E2M1 FP4 data plus E4M3 block scales. This Triton path performs the FP4 packing and the unpack/dequantization in software, without relying on native FP4 conversion instructions.
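To make the storage format concrete, here is a minimal pure-Python sketch of software E2M1 packing and dequantization for a single block. It is not the Triton kernel from this PR. The function names, the 16-element block size (the usual NVFP4 choice), the low-nibble-first packing order, and keeping the block scale as a plain Python float instead of rounding it to E4M3 are all assumptions made for illustration.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed block size; one scale is shared per block


def quantize_block(values):
    """Quantize one block to 4-bit codes and pack two codes per byte."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude onto 6.0, the largest E2M1 value.
    # Assumption: the scale stays a float here; real NVFP4 stores it as E4M3.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        # Nearest representable magnitude; ties go to the smaller value.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 0x8 if v < 0 else 0x0
        codes.append(sign | idx)
    # Assumption: even-indexed element in the low nibble, odd in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_block(packed, scale):
    """Unpack two 4-bit codes per byte and rescale to floats."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7]
            out.append((-mag if code & 0x8 else mag) * scale)
    return out


block = [6.0, -3.0, 0.5, 0.0] + [0.0] * (BLOCK_SIZE - 4)
packed, scale = quantize_block(block)
restored = dequantize_block(packed, scale)
```

A 16-element block packs into 8 bytes of payload plus one scale, which is where the capacity gain comes from.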
The practical benefit observed below is roughly 3x KV cache capacity on Qwen3.6-27B and Qwen3.6-35B-A3B, with MRCR quality comparable to the `auto` KV baseline and closer to that baseline than the TQ 4bit_nc reference in the same completed 30-sample runs.
Usage
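A minimal sketch of enabling the new cache dtype when launching a server, assuming the standard `vllm serve` entry point. Only `--kv-cache-dtype nvfp4` comes from this PR description; the helper name and the `MODEL_ID` placeholder are hypothetical, and any backend selection needed to reach the Triton path is deliberately left out.

```python
import shlex


def build_serve_command(model_id: str, kv_cache_dtype: str = "nvfp4") -> list[str]:
    """Build the argv for a vLLM server using the given KV cache dtype."""
    return ["vllm", "serve", model_id, "--kv-cache-dtype", kv_cache_dtype]


# MODEL_ID is a placeholder, not a real model name.
print(shlex.join(build_serve_command("MODEL_ID")))
# prints: vllm serve MODEL_ID --kv-cache-dtype nvfp4
```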
Changes
KV Cache Capacity
Observed GPU KV cache size from the serving runs below. TQ 4bit_nc is included as a 4-bit KV reference point.
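As a rough cross-check on the capacity figures, the per-element storage cost can be estimated by hand. The sketch below assumes a 16-bit `auto` baseline (for example BF16), 4-bit payloads, and one 8-bit E4M3 scale per 16-element block. These are assumptions about the layout, not numbers taken from this PR, and the estimate ignores any global scale, padding, or allocator overhead.

```python
def bits_per_element(payload_bits: float, scale_bits: float, block_size: int) -> float:
    """Average storage per element when one scale is shared by a block."""
    return payload_bits + scale_bits / block_size


BASELINE_BITS = 16.0                      # assumed 16-bit baseline KV cache
nvfp4_bits = bits_per_element(4, 8, 16)   # 4-bit data + 8-bit scale per 16 values
ratio = BASELINE_BITS / nvfp4_bits

print(f"{nvfp4_bits:.2f} bits/element, {ratio:.2f}x")
# prints: 4.50 bits/element, 3.56x
```

Under these assumptions the idealized ratio of about 3.56x acts as an upper bound, which is in line with the roughly 3x observed in the serving runs.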
`TRITON_ATTN`; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend, so no capacity number was available.
Quality Benchmark
MRCR, 30 samples, 32K context. Prefix hit rate was 1.0000 for all completed runs.
The main quality signal here is the hardest MRCR slice (`n8`). On Qwen3.6-35B-A3B, NVFP4 keeps a higher n8 match ratio than TQ 4bit_nc (0.7807 vs 0.7028, +7.79 pp), while providing a similar ~3x KV capacity increase. On Qwen3.6-27B, NVFP4 completed the 32K MRCR run with no quality regression versus `auto`, while TQ 4bit_nc failed during continuation prefill under the same setup. On Gemma4-31B and Gemma4-26B-A4B, NVFP4 also completed the same 30-sample MRCR run with prefix hit rate 1.0000.
These results do not claim quality coverage beyond 32K. They show that at the tested 32K long-context setting, NVFP4 preserves retrieval quality better than the 4-bit TQ reference point.
Serving Benchmark
`vllm bench serve`, random 8K prompts, output length 64, 16 requests, and `max_num_seqs=64`. Qwen runs used default `max_num_batched_tokens`; Gemma4 runs used `TRITON_ATTN` with `max_num_batched_tokens=2496` for the model's multimodal budget/backend constraints.
`TRITON_ATTN`; TQ 4bit_nc did not start because that KV dtype is not supported by the selected backend.
Notes
`auto` KV baseline.
Validation
- `git diff --check HEAD`
- `python3 -m py_compile` on changed Python files
- `ruff check` / `ruff format --check` on changed Python files
- `pytest tests/kernels/attention/test_attention_selector.py -k "nvfp4 or flash_attn_rejects" -q` -> 8 passed
- `pytest tests/kernels/attention/test_triton_unified_attention.py -k nvfp4 -q -s` -> 34 passed
- `pytest tests/kernels/attention/test_cache.py -k "reshape_and_cache_flash and nvfp4 and triton" -q -s` -> 24 passed, 264 skipped
- `pytest tests/kernels/attention/test_triton_prefill_attention.py -q` -> 113 passed
- `pytest tests/v1/worker/test_gpu_model_runner.py -k "attn_backend_cache_dtype_str or reshape_skipped_attention" -q` -> 2 passed, 34 deselected
- `pytest tests/v1/worker/test_gpu_model_runner.py -k 'triton_nvfp4_attention_warmup' -q` -> 4 passed, 36 deselected
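For anyone re-running the pytest portion of this validation, the commands can be collected into a small driver. This is only a convenience sketch: the test paths, `-k` expressions, and flags are copied from the validation commands above, while `PYTEST_RUNS`, `build_pytest_commands`, and `run_all` are hypothetical names that do not exist in the repository.

```python
import subprocess

# (test file, -k expression or None, extra flags), copied from the validation list.
PYTEST_RUNS = [
    ("tests/kernels/attention/test_attention_selector.py",
     "nvfp4 or flash_attn_rejects", ["-q"]),
    ("tests/kernels/attention/test_triton_unified_attention.py",
     "nvfp4", ["-q", "-s"]),
    ("tests/kernels/attention/test_cache.py",
     "reshape_and_cache_flash and nvfp4 and triton", ["-q", "-s"]),
    ("tests/kernels/attention/test_triton_prefill_attention.py",
     None, ["-q"]),
    ("tests/v1/worker/test_gpu_model_runner.py",
     "attn_backend_cache_dtype_str or reshape_skipped_attention", ["-q"]),
    ("tests/v1/worker/test_gpu_model_runner.py",
     "triton_nvfp4_attention_warmup", ["-q"]),
]


def build_pytest_commands():
    """Return one argv list per validation run, in the order listed."""
    commands = []
    for path, k_expr, flags in PYTEST_RUNS:
        cmd = ["pytest", path]
        if k_expr is not None:
            cmd += ["-k", k_expr]
        commands.append(cmd + flags)
    return commands


def run_all():
    """Run every command in order and return the ones that exited non-zero."""
    failed = []
    for cmd in build_pytest_commands():
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd)
    return failed
```

Calling `run_all()` from the repository root on a machine with a supported GPU should return an empty list when every run passes; passing each command as an argv list avoids shell quoting issues with the `-k` expressions.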