
[Quark] Support loading Quark NVFP4 checkpoints in vLLM #35859

Open
fxmarty-amd wants to merge 110 commits into vllm-project:main from fxmarty-amd:upstream-nvfp4-simulated-quark

Conversation

@fxmarty-amd
Contributor

@fxmarty-amd fxmarty-amd commented Mar 3, 2026

Purpose

https://github.com/amd/Quark/ has experimental NVFP4 support that will be extended in future releases. This PR enables loading NVFP4 models (dense and MoE) quantized with the Quark library into vLLM.

Todo:

  • Port the parallel layer scale recomputation logic [won't do; instead, an error is raised when the weight global scales of the q/k/v projections or the gate/up projections are not equal].
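The equality check described in the item above can be sketched as follows. This is an illustrative sketch only, not the PR's actual code; the function name and error message are assumptions.

```python
# Illustrative sketch: fused parallel layers (e.g. q/k/v or gate/up
# projections) share one weight global scale after merging, so loading
# verifies the per-partition scales from the checkpoint agree instead of
# recomputing a merged scale.
def merged_global_scale(partition_scales: list[float]) -> float:
    first = partition_scales[0]
    if any(s != first for s in partition_scales[1:]):
        raise ValueError(
            "Quark NVFP4 expects equal weight global scales across fused "
            f"partitions (q/k/v or gate/up); got {partition_scales}"
        )
    return first
```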

Test Plan

pytest tests/quantization/test_quark.py -s -vvvvv -k "test_nvfp4_wikitext_correctness"

Signed-off-by: Felix Marty <Felix.Marty@amd.com>
@mergify mergify Bot added the rocm Related to AMD ROCm label Mar 3, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 3, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.
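As background on the emulation path the review refers to: NVFP4 stores 4-bit E2M1 codes in small blocks, each block carrying its own scale, with one global scale per tensor; emulated dequantization decodes the codes via a lookup table, applies both scales, and then falls back to an ordinary high-precision matmul. A minimal sketch, with illustrative names (the block layout and function signature are assumptions, not the PR's API):

```python
# Illustrative sketch of NVFP4 emulated dequantization. The eight
# non-negative E2M1 magnitudes are fixed by the format; bit 3 of each
# 4-bit code is the sign.
E2M1_TO_FLOAT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_nvfp4_block(codes, block_scale, global_scale):
    """Decode 4-bit E2M1 codes and apply the per-block and global scales."""
    out = []
    for c in codes:
        sign = -1.0 if c & 0b1000 else 1.0
        out.append(sign * E2M1_TO_FLOAT[c & 0b0111] * block_scale * global_scale)
    return out
```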

Comment thread vllm/model_executor/layers/quantization/quark/schemes/quark_nvfp4.py Outdated
@fxmarty-amd fxmarty-amd force-pushed the upstream-nvfp4-simulated-quark branch from 6e11ec3 to affdda7 Compare March 3, 2026 12:03
@fxmarty-amd fxmarty-amd marked this pull request as ready for review March 3, 2026 15:42
@mergify mergify Bot removed the needs-rebase label Apr 13, 2026
@mergify
Contributor

mergify Bot commented Apr 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 13, 2026
Contributor

@BowenBao BowenBao left a comment


Overall looks good, will take a look again once #35737 is landed.

@mergify mergify Bot removed the needs-rebase label Apr 16, 2026
# Move the E2M1 lookup table to the device now, because
# `.to(device)` is not allowed during CUDA graph capture.
kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)
kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.w13_weight.device)
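For context on the kE2M1ToFloat table in the snippet above: the table is moved to the target device once, during weight processing, because `.to(device)` allocates and copies and so cannot run during CUDA graph capture. At runtime, emulation unpacks two 4-bit codes per byte and looks each up in the table. A pure-Python sketch of the unpacking (the low-nibble-first packing order here is an assumption; the actual order depends on the kernel):

```python
# Fixed E2M1 magnitude table; bit 3 of each 4-bit code is the sign.
E2M1_TO_FLOAT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble: int) -> float:
    """Bits 0-2 index the magnitude table; bit 3 is the sign."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_TO_FLOAT[nibble & 0b0111]

def unpack_fp4_byte(b: int) -> tuple[float, float]:
    # Low nibble first is assumed for illustration.
    return decode_e2m1(b & 0xF), decode_e2m1(b >> 4)
```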
Contributor Author


Typo from #40033 - surprised it slipped in.

@fxmarty-amd fxmarty-amd requested a review from kylesayrs May 4, 2026 12:41
@fxmarty-amd
Contributor Author

Hi @kylesayrs @mgoin, this PR should be in a good state; I'd appreciate it if you could have a look again, thank you!

Member

@mgoin mgoin left a comment


Looks straightforward, thanks for adopting the kernel/moe interface!

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA May 4, 2026
@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed quantization labels May 4, 2026
@mergify
Contributor

mergify Bot commented May 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 4, 2026
@mergify mergify Bot removed the needs-rebase label May 5, 2026
…sk for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'

@fxmarty-amd
Contributor Author

Getting a seemingly unrelated CI failure https://buildkite.com/vllm/ci/builds/64486#019df8a9-125b-43c1-af7f-679765bfef60:

#23 76.57 /opt/rocm/lib/llvm/bin/clang++  -DCUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1 -DHIPBLASLT_USE_ROCROLLER -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_C -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_PROF_API=1 -DUSE_RPC -DUSE_TENSORPIPE -D_C_EXPORTS -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -I/app/vllm/build/temp.linux-x86_64-cpython-312/csrc -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /opt/rocm-7.2.2/include/hiprand -isystem /opt/rocm-7.2.2/include/rocrand -Wno-unused-result -Wno-unused-value -O2 -g -DNDEBUG -std=gnu++17 --offload-arch=gfx90a --offload-arch=gfx942 --offload-arch=gfx950 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -DUSE_ROCM -DENABLE_FP8 -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF_OPERATORS__ -Werror=unused-variable -fno-gpu-rdc -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -MF CMakeFiles/_C.dir/csrc/cache_kernels.hip.o.d -o CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -x hip -c /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip
#23 76.57 In file included from /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip:13:
#23 76.57 /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/concat_mla_q.cuh:4:10: fatal error: 'cuda_bf16.h' file not found
#23 76.57     4 | #include <cuda_bf16.h>
#23 76.57       |          ^~~~~~~~~~~~~
#23 76.57 1 error generated when compiling for gfx90a.


Labels

nvidia · quantization · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Todo
Status: Ready


4 participants