[Attention] Add head_dim=512 support for FlashInfer trtllm attention backend #38822

djmmoss wants to merge 3 commits into vllm-project:main
Conversation
Code Review
This pull request adds support for a head size of 512 to the FlashInfer attention backend. Feedback suggests that this specific head size should be restricted to Blackwell GPUs (SM100+) by implementing a supports_combination check, as older architectures do not natively support this dimension and may experience runtime crashes.
```diff
 def get_supported_head_sizes(cls) -> list[int]:
     # https://github.com/flashinfer-ai/flashinfer/blob/3d55c71a62052c590c130897d3a3db49b14fcc34/include/flashinfer/utils.cuh#L157
-    return [64, 128, 256]
+    return [64, 128, 256, 512]
```
Adding 512 to the supported head sizes without hardware-specific validation can lead to runtime crashes on non-Blackwell GPUs. The PR description states that head_dim=512 is intended for the TRTLLM attention kernels on Blackwell (SM100+). However, FlashInferBackend is also used on earlier architectures (SM75+) where the native FlashInfer kernels do not support this head dimension.
To prevent invalid configurations from reaching the execution stage, you should override supports_combination to ensure head_dim=512 is only permitted when the device capability is SM100 or higher.
Suggested change:

```diff
-def get_supported_head_sizes(cls) -> list[int]:
-    # https://github.com/flashinfer-ai/flashinfer/blob/3d55c71a62052c590c130897d3a3db49b14fcc34/include/flashinfer/utils.cuh#L157
-    return [64, 128, 256]
-    return [64, 128, 256, 512]
+def get_supported_head_sizes(cls) -> list[int]:
+    # FlashInfer native kernels support 64, 128, 256.
+    # 512 is supported via TRTLLM kernels on Blackwell.
+    return [64, 128, 256, 512]
+
+@classmethod
+def supports_combination(
+    cls,
+    head_size: int,
+    dtype: torch.dtype,
+    kv_cache_dtype: "CacheDType | None",
+    block_size: int | None,
+    use_mla: bool,
+    has_sink: bool,
+    use_sparse: bool,
+    device_capability: DeviceCapability,
+) -> str | None:
+    if head_size == 512 and device_capability.major < 10:
+        return "head_dim=512 is only supported on Blackwell GPUs (SM100+)"
+    return None
```
The concern about runtime crashes on non-Blackwell GPUs is already handled by vLLM's existing backend selection and validation system.
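For illustration, the kind of capability gate such a selection layer applies might look like the sketch below; `can_use_flashinfer` is a hypothetical helper written for this example, not vLLM's actual selector code.

```python
# Sketch only: illustrates a capability gate at backend-selection time.
# `can_use_flashinfer` is a hypothetical helper, not vLLM's real API.
import torch

def can_use_flashinfer(head_size: int) -> bool:
    major, _ = torch.cuda.get_device_capability()
    if head_size == 512 and major < 10:
        # The head_dim=512 path relies on trtllm-gen cubins that only
        # exist for SM100+ (Blackwell), so fall back on older GPUs.
        return False
    return head_size in (64, 128, 256, 512)
```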
Add 512 to the list of supported head sizes in the FlashInfer attention backend. This enables models with head_dim=512 (used in global/full attention layers) to use the FlashInfer trtllm-gen attention kernels on Blackwell GPUs. The trtllm-gen cubins for head_dim=512 are available in the FlashInfer cubin repository for SM100f (BF16, FP8, and FP16 dtypes).

Co-authored-by: Claude
Signed-off-by: Daniel Moss <dmoss@nvidia.com>
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
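As a rough usage sketch of what this enables (the model path is a placeholder, and this assumes an SM100+ GPU with a FlashInfer build that ships these cubins):

```python
# Rough usage sketch: the model path is a placeholder and this assumes an
# SM100+ (Blackwell) GPU plus a FlashInfer build that includes the cubins.
import os

# Pin vLLM's attention backend to FlashInfer before creating the engine.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="org/model-with-head-dim-512")  # placeholder model
outputs = llm.generate("Hello, world")
print(outputs[0].outputs[0].text)
```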
djmmoss force-pushed from f2e851e to 038ab75
Documentation preview: https://vllm--38822.org.readthedocs.build/en/38822/
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Hi @djmmoss, the pre-commit checks have failed. Please run:

```
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
May we add a unit test that covers 512?
> May we add a unit test that covers 512?

Waiting for: #41711
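A test along these lines could cover the new head size. This is only an illustrative sketch: the import path and test shape are assumptions based on this PR's discussion, not the actual tests tracked in #41711.

```python
# Illustrative sketch only: the import path and test shape are assumptions,
# not the actual tests tracked in #41711.
import pytest
import torch

from vllm.v1.attention.backends.flashinfer import FlashInferBackend  # assumed path


@pytest.mark.parametrize("head_size", [64, 128, 256, 512])
def test_flashinfer_supported_head_sizes(head_size: int):
    # After this PR, 512 should appear alongside the existing sizes.
    assert head_size in FlashInferBackend.get_supported_head_sizes()


@pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability()[0] < 10,
    reason="head_dim=512 trtllm-gen kernels require Blackwell (SM100+)",
)
def test_head_dim_512_on_blackwell():
    # Placeholder for an end-to-end correctness check; a real test would
    # build and run the attention op against a reference implementation.
    assert 512 in FlashInferBackend.get_supported_head_sizes()
```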
Add `512` to the FlashInfer backend's supported head sizes, enabling models with `head_dim=512` attention layers to use the FlashInfer trtllm attention kernels on Blackwell GPUs.

This companion PR enables the head_dim=512 cubin support in FlashInfer: flashinfer-ai/flashinfer#2959