
[Attention] Add head_dim=512 support for FlashInfer trtllm attention backend#38822

Open
djmmoss wants to merge 3 commits into vllm-project:main from djmmoss:dmoss/trtllm-fmha-head-dim-512

Conversation

@djmmoss
Contributor

@djmmoss commented Apr 2, 2026

Add 512 to the FlashInfer backend's supported head sizes, enabling models with head_dim=512 attention layers to use the FlashInfer trtllm attention kernels on Blackwell GPUs.

Companion PR that enables the head_dim=512 cubin support in FlashInfer: flashinfer-ai/flashinfer#2959

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for a head size of 512 to the FlashInfer attention backend. The review suggests restricting this head size to Blackwell GPUs (SM100+) by implementing a supports_combination check, since older architectures do not natively support this dimension and may crash at runtime.

Comment on lines 388 to +390
 def get_supported_head_sizes(cls) -> list[int]:
     # https://github.com/flashinfer-ai/flashinfer/blob/3d55c71a62052c590c130897d3a3db49b14fcc34/include/flashinfer/utils.cuh#L157
-    return [64, 128, 256]
+    return [64, 128, 256, 512]
Severity: high

Adding 512 to the supported head sizes without hardware-specific validation can lead to runtime crashes on non-Blackwell GPUs. The PR description states that head_dim=512 is intended for the TRTLLM attention kernels on Blackwell (SM100+). However, FlashInferBackend is also used on earlier architectures (SM75+) where the native FlashInfer kernels do not support this head dimension.

To prevent invalid configurations from reaching the execution stage, you should override supports_combination to ensure head_dim=512 is only permitted when the device capability is SM100 or higher.

Suggested change

    def get_supported_head_sizes(cls) -> list[int]:
        # FlashInfer native kernels support 64, 128, 256.
        # 512 is supported via TRTLLM kernels on Blackwell.
        return [64, 128, 256, 512]

    @classmethod
    def supports_combination(
        cls,
        head_size: int,
        dtype: torch.dtype,
        kv_cache_dtype: "CacheDType | None",
        block_size: int | None,
        use_mla: bool,
        has_sink: bool,
        use_sparse: bool,
        device_capability: DeviceCapability,
    ) -> str | None:
        if head_size == 512 and device_capability.major < 10:
            return "head_dim=512 is only supported on Blackwell GPUs (SM100+)"
        return None

@djmmoss
Contributor Author

djmmoss commented Apr 2, 2026

The concern about runtime crashes on non-Blackwell GPUs is already handled by vLLM's existing backend selection and validation system:

  1. Backend priority routing (cuda.py): On pre-SM100 GPUs, FLASH_ATTN is prioritized over FLASHINFER, and FLASH_ATTN already supports head_dim=512. FlashInfer would only be selected on SM100+ where the trtllm cubins are available.

  2. Cubin-level validation: Even if FlashInfer were explicitly forced on an older GPU, the trtllm cubin loader validates kernel availability at initialization and raises a clear error — no silent crash.

  3. Precedent: Other backends already list 512 without architecture gating (e.g., cpu_attn returns [32, 64, 80, 96, 112, 128, 160, 192, 224, 256, 512]), and flex_attention returns [] (all sizes accepted).

get_supported_head_sizes() declares what the backend can support across all its code paths, not what a specific GPU supports. The architecture-specific filtering happens in supports_compute_capability() and the backend priority logic.
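
To make that separation concrete, here is a minimal sketch of the two-stage check described above. The names FlashInferLikeBackend and select_backend are hypothetical simplifications for illustration, not vLLM's actual selection code in cuda.py:

    # Hypothetical, simplified illustration of the points above; not vLLM's
    # actual backend-selection code.
    class FlashInferLikeBackend:
        @classmethod
        def get_supported_head_sizes(cls) -> list[int]:
            # Everything the backend can handle across all of its code paths
            # (native FlashInfer kernels plus the trtllm-gen cubins).
            return [64, 128, 256, 512]

        @classmethod
        def supports_compute_capability(cls, major: int, minor: int) -> bool:
            # Architecture gating lives here and in the priority routing,
            # not in the head-size list.
            return (major, minor) >= (7, 5)


    def select_backend(head_size: int, capability_major: int) -> str:
        # Stand-in for the priority routing described in point 1: on pre-SM100
        # GPUs FLASH_ATTN is preferred (and it already supports head_dim=512),
        # so FlashInfer only sees head_dim=512 on SM100+.
        if capability_major < 10:
            return "FLASH_ATTN"
        if head_size in FlashInferLikeBackend.get_supported_head_sizes():
            return "FLASHINFER"
        return "FLASH_ATTN"

In this sketch, select_backend(512, 9) resolves to FLASH_ATTN and select_backend(512, 10) resolves to FLASHINFER, which is the behavior points 1 and 2 rely on.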

Add 512 to the list of supported head sizes in the FlashInfer
attention backend. This enables models with head_dim=512 (used
in global/full attention layers) to use the FlashInfer trtllm-gen
attention kernels on Blackwell GPUs.

The trtllm-gen cubins for head_dim=512 are available in the
FlashInfer cubin repository for SM100f (BF16, FP8, and FP16
dtypes).

Co-authored-by: Claude
Signed-off-by: Daniel Moss <dmoss@nvidia.com>
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
@djmmoss force-pushed the dmoss/trtllm-fmha-head-dim-512 branch from f2e851e to 038ab75 on April 4, 2026 00:21
@djmmoss marked this pull request as ready for review on April 4, 2026 14:27
@mergify
Contributor

mergify Bot commented Apr 4, 2026

Documentation preview: https://vllm--38822.org.readthedocs.build/en/38822/

Signed-off-by: Duncan Moss <djm.moss@gmail.com>
@mergify
Contributor

mergify Bot commented Apr 30, 2026

Hi @djmmoss, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@vadiklyutiy
Collaborator

May we add a unit test that covers 512?

@djmmoss
Contributor Author

djmmoss commented May 6, 2026

Waiting for: #41711

| May we add a unit test that covers 512?

@vadiklyutiy this is covered in the FlashInfer repo.


Labels

documentation, nvidia, v1
