
[Quark] Support loading Quark NVFP4 checkpoints in vLLM #35859

Open
fxmarty-amd wants to merge 110 commits into vllm-project:main from fxmarty-amd:upstream-nvfp4-simulated-quark

Conversation

@fxmarty-amd
Contributor

@fxmarty-amd fxmarty-amd commented Mar 3, 2026

Purpose

https://github.com/amd/Quark/ has experimental NVFP4 support that will be extended in future releases. This PR enables loading NVFP4 models (dense and MoE) quantized with the Quark library into vLLM.

Todo:

  • Port the parallel layer scale recomputation logic [won't do; instead, an error is raised when the weight global scales of the q/k/v projections or the gate/up projections are not equal].
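The equality check described in the item above can be sketched as follows. This is an illustrative sketch only, not the PR's actual code; the function name and error message are assumptions.

```python
# Illustrative sketch: fused parallel layers (e.g. q/k/v or gate/up
# projections) share one weight global scale after merging, so loading
# verifies the per-partition scales from the checkpoint agree instead of
# recomputing a merged scale.
def merged_global_scale(partition_scales: list[float]) -> float:
    first = partition_scales[0]
    if any(s != first for s in partition_scales[1:]):
        raise ValueError(
            "Quark NVFP4 expects equal weight global scales across fused "
            f"partitions (q/k/v or gate/up); got {partition_scales}"
        )
    return first
```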

Test Plan

pytest tests/quantization/test_quark.py -s -vvvvv -k "test_nvfp4_wikitext_correctness"

Signed-off-by: Felix Marty <Felix.Marty@amd.com>
@mergify mergify Bot added the rocm Related to AMD ROCm label Mar 3, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 3, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.
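As background on the emulation path the review refers to: NVFP4 stores 4-bit E2M1 codes in small blocks, each block carrying its own scale, with one global scale per tensor; emulated dequantization decodes the codes via a lookup table, applies both scales, and then falls back to an ordinary high-precision matmul. A minimal sketch, with illustrative names (the block layout and function signature are assumptions, not the PR's API):

```python
# Illustrative sketch of NVFP4 emulated dequantization. The eight
# non-negative E2M1 magnitudes are fixed by the format; bit 3 of each
# 4-bit code is the sign.
E2M1_TO_FLOAT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_nvfp4_block(codes, block_scale, global_scale):
    """Decode 4-bit E2M1 codes and apply the per-block and global scales."""
    out = []
    for c in codes:
        sign = -1.0 if c & 0b1000 else 1.0
        out.append(sign * E2M1_TO_FLOAT[c & 0b0111] * block_scale * global_scale)
    return out
```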

Comment thread vllm/model_executor/layers/quantization/quark/schemes/quark_nvfp4.py Outdated
@fxmarty-amd fxmarty-amd force-pushed the upstream-nvfp4-simulated-quark branch from 6e11ec3 to affdda7 Compare March 3, 2026 12:03
@fxmarty-amd fxmarty-amd marked this pull request as ready for review March 3, 2026 15:42
@mergify mergify Bot removed the needs-rebase label Apr 13, 2026
@mergify
Contributor

mergify Bot commented Apr 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 13, 2026
Contributor

@BowenBao BowenBao left a comment


Overall looks good, will take a look again once #35737 is landed.

@mergify mergify Bot removed the needs-rebase label Apr 16, 2026
# Move the E2M1 lookup table to the device now, because
# `.to(device)` is not allowed during CUDA graph capture.
kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)
kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.w13_weight.device)
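For context on the kE2M1ToFloat table in the snippet above: the table is moved to the target device once, during weight processing, because `.to(device)` allocates and copies and so cannot run during CUDA graph capture. At runtime, emulation unpacks two 4-bit codes per byte and looks each up in the table. A pure-Python sketch of the unpacking (the low-nibble-first packing order here is an assumption; the actual order depends on the kernel):

```python
# Fixed E2M1 magnitude table; bit 3 of each 4-bit code is the sign.
E2M1_TO_FLOAT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(nibble: int) -> float:
    """Bits 0-2 index the magnitude table; bit 3 is the sign."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_TO_FLOAT[nibble & 0b0111]

def unpack_fp4_byte(b: int) -> tuple[float, float]:
    # Low nibble first is assumed for illustration.
    return decode_e2m1(b & 0xF), decode_e2m1(b >> 4)
```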
Contributor Author


Typo from #40033 - surprised it slipped in.

@fxmarty-amd fxmarty-amd requested a review from kylesayrs May 4, 2026 12:41
@fxmarty-amd
Contributor Author

Hi @kylesayrs @mgoin, this PR should be in a good state; I'd appreciate it if you could have a look again, thank you!

Member

@mgoin mgoin left a comment


Looks straightforward, thanks for adopting the kernel/moe interface!

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA May 4, 2026
@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed quantization labels May 4, 2026
@mergify
Contributor

mergify Bot commented May 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 4, 2026
@mergify mergify Bot removed the needs-rebase label May 5, 2026
…sk for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'

@fxmarty-amd
Contributor Author

Getting a seemingly unrelated CI failure https://buildkite.com/vllm/ci/builds/64486#019df8a9-125b-43c1-af7f-679765bfef60:

#23 76.57 /opt/rocm/lib/llvm/bin/clang++  -DCUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1 -DHIPBLASLT_USE_ROCROLLER -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_C -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_PROF_API=1 -DUSE_RPC -DUSE_TENSORPIPE -D_C_EXPORTS -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -I/app/vllm/build/temp.linux-x86_64-cpython-312/csrc -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /opt/rocm-7.2.2/include/hiprand -isystem /opt/rocm-7.2.2/include/rocrand -Wno-unused-result -Wno-unused-value -O2 -g -DNDEBUG -std=gnu++17 --offload-arch=gfx90a --offload-arch=gfx942 --offload-arch=gfx950 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -DUSE_ROCM -DENABLE_FP8 -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF_OPERATORS__ -Werror=unused-variable -fno-gpu-rdc -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -MF CMakeFiles/_C.dir/csrc/cache_kernels.hip.o.d -o CMakeFiles/_C.dir/csrc/cache_kernels.hip.o -x hip -c /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip
#23 76.57 In file included from /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/cache_kernels.hip:13:
#23 76.57 /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/concat_mla_q.cuh:4:10: fatal error: 'cuda_bf16.h' file not found
#23 76.57     4 | #include <cuda_bf16.h>
#23 76.57       |          ^~~~~~~~~~~~~
#23 76.57 1 error generated when compiling for gfx90a.


Labels

nvidia · quantization · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Todo
Status: Ready


4 participants