[Bug]: --kv-cache-dtype nvfp4 crashes at first request on SM120 instead of failing fast at init

### Environment

```
GPU          NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GB)
Driver       595.71.05
CPU          Ampere Altra - ARM Neoverse-N1, 64 cores (aarch64)
OS           Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
vllm         0.21.1rc1.dev269+gb06813e87
torch        2.11.0+cu130 (CUDA 13.0)
flashinfer   0.6.11.post2
transformers 5.9.0 · triton 3.6.0
```

(`vllm collect-env` can't run here — it shells out to `pip` and crashes on uv-only envs: `get_pip_packages` -> `'NoneType' object has no attribute 'splitlines'`. Minor, separate.)

### Bug

`--kv-cache-dtype nvfp4` on an NVFP4 checkpoint starts up clean, captures graphs, then kills the engine on the **first request** (not at config validation). NVFP4 *weights* are fine; only the NVFP4 *KV-cache attention* path fails. Root cause is upstream — trtllm-gen FP4 FMHA has no sm_120 kernel (NVIDIA/TensorRT-LLM#10241, #11799) — but vLLM in front of it accepts the flag and dies cryptically rather than rejecting it.

The first request fails in `flashinfer.prefill.plan`, because vLLM passes the literal string `"nvfp4"` as `kv_data_type` and flashinfer resolves it with `getattr(torch, "nvfp4")`:

```
File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build
    prefill_wrapper.plan(...)
File ".../flashinfer/prefill.py", line 1859, in plan
    kv_data_type = canonicalize_torch_dtype(kv_data_type)
File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype
    return getattr(torch, dtype)
AttributeError: module 'torch' has no attribute 'nvfp4'
```

Stock torch has no `nvfp4` attribute (the packed FP4 dtype is `torch.float4_e2m1fn_x2`, and the KV buffer is allocated as `torch.uint8`). Aliasing `torch.nvfp4 = torch.uint8` clears `plan()` and the strict `_check_cached_qkv_data_type` (buffer is uint8), and reaches the real wall:

```
File ".../flashinfer/prefill.py", line 254, in _paged_run
    op.trtllm_paged_attention_context(...)
RuntimeError: Error in function 'TllmGenFmhaRunner' at .../trtllm/fmha/fmhaRunner.cuh:
Unsupported architecture
```

vLLM forces this path (`vllm/v1/attention/backends/flashinfer.py:773`: `backend = "trtllm-gen" if self.is_kvcache_nvfp4 else "auto"`), and trtllm-gen FMHA has no sm_120 build.

### Reproducer

```bash
vllm serve <an-NVFP4-checkpoint> \
    --kv-cache-dtype nvfp4 \
    --max-model-len 8192
# starts fine; first /v1/chat/completions request -> EngineDeadError
```

### Expected behavior

1. **Fail fast at init.** If `kv_cache_dtype=nvfp4` is selected on an arch without a trtllm-gen FP4 FMHA kernel (sm_120/sm_121), reject it at engine init with a clear, actionable error (e.g. *"nvfp4 KV cache requires sm_100/sm_103; use fp8"*), instead of an `AttributeError` on the first token followed by `Unsupported architecture`.
2. **Resolve the dtype internally.** Map the `nvfp4` kv-cache-dtype string to the actual storage dtype (`torch.uint8`) vLLM hands flashinfer, rather than relying on `getattr(torch, "nvfp4")` resolving in whatever torch/flashinfer happens to be pinned. Today that only works by accident of versions and breaks silently otherwise.

Workaround: `--kv-cache-dtype fp8`.

### Related

- #32220 — NVFP4 KV Cache Support (the feature this sits behind; closed/implemented)
- flashinfer-ai/flashinfer#2143 (fp4 KV for trtllm paged attention, P0), #2294 (SM120 nvfp4 KV **decode** via XQA — done), #2555 (SM120 attention backend validation), #2207 (fp4 KV head_dim 128 gap), #2577 (SM120 fp4 GEMM)
- NVIDIA/TensorRT-LLM#10241, #11799 (SM120 trtllm-gen FP4 FMHA kernel / cubins)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bug]: --kv-cache-dtype nvfp4 crashes at first request on SM120 instead of failing fast at init #43562

Environment

Bug

Reproducer

Expected behavior

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

[Bug]: --kv-cache-dtype nvfp4 crashes at first request on SM120 instead of failing fast at init #43562

Description

Environment

Bug

Reproducer

Expected behavior

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions