[Bug]: --kv-cache-dtype nvfp4 crashes at first request on SM120 instead of failing fast at init #43562

@0xAlcibiades

Environment

GPU          NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GB)
Driver       595.71.05
CPU          Ampere Altra - ARM Neoverse-N1, 64 cores (aarch64)
OS           Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
vllm         0.21.1rc1.dev269+gb06813e87
torch        2.11.0+cu130 (CUDA 13.0)
flashinfer   0.6.11.post2
transformers 5.9.0 · triton 3.6.0

(vllm collect-env can't run here: it shells out to pip and crashes on uv-only envs with get_pip_packages -> 'NoneType' object has no attribute 'splitlines'. That is a minor, separate issue.)

Bug

--kv-cache-dtype nvfp4 on an NVFP4 checkpoint starts up clean, captures graphs, then kills the engine on the first request (not at config validation). NVFP4 weights are fine; only the NVFP4 KV-cache attention path fails. The root cause is upstream: trtllm-gen FP4 FMHA has no sm_120 kernel (NVIDIA/TensorRT-LLM#10241, #11799). But vLLM, which sits in front of it, accepts the flag and then dies with a cryptic error rather than rejecting it.

The first request fails in flashinfer.prefill.plan, because vLLM passes the literal string "nvfp4" as kv_data_type and flashinfer resolves it with getattr(torch, "nvfp4"):

File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build
    prefill_wrapper.plan(...)
File ".../flashinfer/prefill.py", line 1859, in plan
    kv_data_type = canonicalize_torch_dtype(kv_data_type)
File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype
    return getattr(torch, dtype)
AttributeError: module 'torch' has no attribute 'nvfp4'
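
The lookup failure can be reproduced in isolation, without vLLM or flashinfer installed. The following is a minimal sketch, assuming the string-to-dtype step is a bare getattr on the torch module, as the traceback suggests. `fake_torch` is a stand-in namespace rather than the real module, and `canonicalize_by_getattr` and `lookup_error` are hypothetical names used only for illustration.

```python
from types import SimpleNamespace

# Stand-in for the torch module (assumption: it lists only a few attributes
# that stock torch exposes; there is deliberately no "nvfp4" attribute).
fake_torch = SimpleNamespace(uint8="uint8", float16="float16", bfloat16="bfloat16")


def canonicalize_by_getattr(dtype):
    """Hypothetical mirror of a getattr-based string-to-dtype lookup."""
    if isinstance(dtype, str):
        return getattr(fake_torch, dtype)
    return dtype


def lookup_error(name):
    """Return the AttributeError message for an unknown dtype name, else None."""
    try:
        canonicalize_by_getattr(name)
    except AttributeError as exc:
        return str(exc)
    return None


print(canonicalize_by_getattr("uint8"))  # a known name resolves
print(lookup_error("nvfp4"))             # an unknown name surfaces as AttributeError
```

Any user-facing dtype string that is not also an attribute of the module fails this way, and only when the lookup is first executed.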

Stock torch has no nvfp4 attribute (the packed FP4 dtype is torch.float4_e2m1fn_x2, and the KV buffer is allocated as torch.uint8). Aliasing torch.nvfp4 = torch.uint8 clears plan() and the strict _check_cached_qkv_data_type (buffer is uint8), and reaches the real wall:

File ".../flashinfer/prefill.py", line 254, in _paged_run
    op.trtllm_paged_attention_context(...)
RuntimeError: Error in function 'TllmGenFmhaRunner' at .../trtllm/fmha/fmhaRunner.cuh:
Unsupported architecture

vLLM forces this path (vllm/v1/attention/backends/flashinfer.py:773: backend = "trtllm-gen" if self.is_kvcache_nvfp4 else "auto"), and trtllm-gen FMHA has no sm_120 build.

Reproducer

vllm serve <an-NVFP4-checkpoint> \
    --kv-cache-dtype nvfp4 \
    --max-model-len 8192
# starts fine; first /v1/chat/completions request -> EngineDeadError

Expected behavior

  1. Fail fast at init. If kv_cache_dtype=nvfp4 is selected on an arch without a trtllm-gen FP4 FMHA kernel (sm_120/sm_121), reject it at engine init with a clear, actionable error (e.g. "nvfp4 KV cache requires sm_100/sm_103; use fp8"), instead of an AttributeError on the first token followed by Unsupported architecture.
  2. Resolve the dtype internally. Map the nvfp4 kv-cache-dtype string to the actual storage dtype (torch.uint8) vLLM hands flashinfer, rather than relying on getattr(torch, "nvfp4") resolving in whatever torch/flashinfer happens to be pinned. Today that only works by accident of versions and breaks silently otherwise.
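
The two requests above could look roughly like the sketch below. This is not vLLM's code: `validate_kv_cache_dtype`, `resolve_kv_storage_dtype`, and both tables are hypothetical names. The supported-architecture set is an assumption taken from the sm_100/sm_103 wording in this issue, and dtypes are represented as plain strings so the snippet runs without torch.

```python
# Assumption: compute capabilities (major, minor) that have a trtllm-gen
# FP4 FMHA kernel. This issue names sm_100/sm_103 as the supported targets.
NVFP4_KV_SUPPORTED_CAPABILITIES = {(10, 0), (10, 3)}

# Assumption: storage dtype name handed to the attention backend for each
# kv-cache-dtype string. Only the nvfp4 -> uint8 entry comes from this issue.
KV_CACHE_STORAGE_DTYPE = {"nvfp4": "uint8"}


def validate_kv_cache_dtype(kv_cache_dtype, capability):
    """Raise at init time if an nvfp4 KV cache is requested on an unsupported arch."""
    major, minor = capability
    if kv_cache_dtype == "nvfp4" and (major, minor) not in NVFP4_KV_SUPPORTED_CAPABILITIES:
        raise ValueError(
            f"kv_cache_dtype='nvfp4' is not supported on sm_{major}{minor}: "
            "nvfp4 KV cache requires sm_100/sm_103; use --kv-cache-dtype fp8"
        )


def resolve_kv_storage_dtype(kv_cache_dtype):
    """Map the user-facing string to a storage dtype name; pass others through."""
    return KV_CACHE_STORAGE_DTYPE.get(kv_cache_dtype, kv_cache_dtype)


if __name__ == "__main__":
    print(resolve_kv_storage_dtype("nvfp4"))
    try:
        validate_kv_cache_dtype("nvfp4", (12, 0))
    except ValueError as exc:
        print(exc)
```

In a real integration the capability tuple would presumably come from something like torch.cuda.get_device_capability(), and the check would run during engine config validation, before graph capture, so the error appears at startup rather than on the first request.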

Workaround: --kv-cache-dtype fp8.
