Environment
GPU NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GB)
Driver 595.71.05
CPU Ampere Altra - ARM Neoverse-N1, 64 cores (aarch64)
OS Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
vllm 0.21.1rc1.dev269+gb06813e87
torch 2.11.0+cu130 (CUDA 13.0)
flashinfer 0.6.11.post2
transformers 5.9.0 · triton 3.6.0
(vllm collect-env can't run here — it shells out to pip and crashes on uv-only envs: get_pip_packages -> 'NoneType' object has no attribute 'splitlines'. Minor, separate.)
Bug
--kv-cache-dtype nvfp4 on an NVFP4 checkpoint starts up clean, captures graphs, then kills the engine on the first request (not at config validation). NVFP4 weights are fine; only the NVFP4 KV-cache attention path fails. The root cause is upstream: trtllm-gen FP4 FMHA has no sm_120 kernel (NVIDIA/TensorRT-LLM#10241, #11799). vLLM nonetheless accepts the flag and then dies with a cryptic error instead of rejecting it up front.
The first request fails in flashinfer.prefill.plan, because vLLM passes the literal string "nvfp4" as kv_data_type and flashinfer resolves it with getattr(torch, "nvfp4"):
File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build
prefill_wrapper.plan(...)
File ".../flashinfer/prefill.py", line 1859, in plan
kv_data_type = canonicalize_torch_dtype(kv_data_type)
File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype
return getattr(torch, dtype)
AttributeError: module 'torch' has no attribute 'nvfp4'
Stock torch has no nvfp4 attribute (the packed FP4 dtype is torch.float4_e2m1fn_x2, and the KV buffer is allocated as torch.uint8). Aliasing torch.nvfp4 = torch.uint8 clears plan() and the strict _check_cached_qkv_data_type (buffer is uint8), and reaches the real wall:
File ".../flashinfer/prefill.py", line 254, in _paged_run
op.trtllm_paged_attention_context(...)
RuntimeError: Error in function 'TllmGenFmhaRunner' at .../trtllm/fmha/fmhaRunner.cuh:
Unsupported architecture
vLLM forces this path (vllm/v1/attention/backends/flashinfer.py:773: backend = "trtllm-gen" if self.is_kvcache_nvfp4 else "auto"), and trtllm-gen FMHA has no sm_120 build.
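The string lookup and the effect of the alias can be modelled without torch installed. This is a minimal sketch, not flashinfer's code: `fake_torch` and `canonicalize_dtype` are hypothetical stand-ins, and the only behavior taken from the traceback is the `getattr(torch, dtype)` call. That non-string inputs pass through unchanged is an assumption.

```python
from types import SimpleNamespace

# Stand-in for the torch module (hypothetical): it carries only the two
# dtype names mentioned in this report, represented as plain strings.
fake_torch = SimpleNamespace(
    uint8="torch.uint8",
    float4_e2m1fn_x2="torch.float4_e2m1fn_x2",
)


def canonicalize_dtype(dtype, namespace=fake_torch):
    """Simplified model of the lookup seen in the traceback.

    Strings are resolved with getattr on the namespace; anything else is
    assumed to pass through unchanged.
    """
    if isinstance(dtype, str):
        return getattr(namespace, dtype)
    return dtype


# Without an alias, the literal string "nvfp4" cannot be resolved.
try:
    canonicalize_dtype("nvfp4")
    failure = None
except AttributeError as exc:
    failure = str(exc)

# The diagnostic alias from this report: point "nvfp4" at the uint8 storage dtype.
fake_torch.nvfp4 = fake_torch.uint8
aliased = canonicalize_dtype("nvfp4")
```

The alias only gets past dtype resolution. It does nothing about the missing sm_120 kernel, so it is a diagnostic step rather than a fix.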
Reproducer
vllm serve <an-NVFP4-checkpoint> \
--kv-cache-dtype nvfp4 \
--max-model-len 8192
# starts fine; first /v1/chat/completions request -> EngineDeadError
Expected behavior
- Fail fast at init. If kv_cache_dtype=nvfp4 is selected on an arch without a trtllm-gen FP4 FMHA kernel (sm_120/sm_121), reject it at engine init with a clear, actionable error (e.g. "nvfp4 KV cache requires sm_100/sm_103; use fp8"), instead of an AttributeError on the first token followed by Unsupported architecture.
- Resolve the dtype internally. Map the nvfp4 kv-cache-dtype string to the actual storage dtype (torch.uint8) vLLM hands flashinfer, rather than relying on getattr(torch, "nvfp4") resolving in whatever torch/flashinfer happens to be pinned. Today that only works by accident of versions and breaks silently otherwise.
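The two points above could look roughly like the sketch below. Every name in it (`NVFP4_KV_SUPPORTED_CAPABILITIES`, `KV_CACHE_STORAGE_DTYPE`, `validate_kv_cache_dtype`, `storage_dtype_name`) is hypothetical and not part of vLLM. The supported set is taken from the example error message in this report (sm_100/sm_103) and would need to follow whatever trtllm-gen actually ships.

```python
# Hypothetical sketch; none of these names exist in vLLM.

# Compute capabilities assumed to have a trtllm-gen FP4 FMHA kernel
# (sm_100 and sm_103, per the error message proposed in this report).
NVFP4_KV_SUPPORTED_CAPABILITIES = {(10, 0), (10, 3)}

# kv-cache-dtype strings whose storage dtype differs from their name.
# The value is the attribute name to look up on torch (torch.uint8).
KV_CACHE_STORAGE_DTYPE = {"nvfp4": "uint8"}


def validate_kv_cache_dtype(kv_cache_dtype: str, capability: tuple[int, int]) -> None:
    """Reject an nvfp4 KV cache at init on architectures without a kernel."""
    if kv_cache_dtype != "nvfp4":
        return
    if capability not in NVFP4_KV_SUPPORTED_CAPABILITIES:
        major, minor = capability
        raise ValueError(
            f"kv_cache_dtype=nvfp4 is not supported on sm_{major}{minor}: "
            "nvfp4 KV cache requires sm_100/sm_103; use --kv-cache-dtype fp8"
        )


def storage_dtype_name(kv_cache_dtype: str) -> str:
    """Return the torch attribute name of the buffer dtype for a kv-cache-dtype.

    Strings without an entry are passed through unchanged in this sketch.
    """
    return KV_CACHE_STORAGE_DTYPE.get(kv_cache_dtype, kv_cache_dtype)
```

In a real engine the capability tuple would presumably come from `torch.cuda.get_device_capability()`, and the check would run during config validation so the failure surfaces before graph capture rather than on the first request.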
Workaround: --kv-cache-dtype fp8.
Related