[Quark] Support loading Quark NVFP4 checkpoints in vLLM#35859
fxmarty-amd wants to merge 110 commits into vllm-project:main from
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
…vfp4-simulation-support-moe
…fp4-simulation-aot-weight-dequantization
Code Review
This pull request adds support for loading Quark NVFP4 checkpoints in vLLM, including an emulation path for hardware that doesn't natively support NVFP4. The changes are extensive, touching configuration, quantization layers, and tests. A significant part of the work involves refactoring to accommodate the new emulation backend for both dense and MoE layers. While the overall approach is sound, I've identified a critical issue in the handling of quantization scales for the new QuarkNVFP4 scheme which could lead to incorrect model outputs.
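To make the scale handling discussed in this review concrete, here is a minimal pure-Python sketch of how an NVFP4 block could be dequantized in an emulation path. It is not the code from this PR: the function names are hypothetical, the low-nibble-first packing order and the multiplicative global scale are assumptions (some checkpoint formats store the reciprocal instead), and real kernels operate on tensors with FP8 block scales rather than Python floats.

```python
# Hypothetical sketch of NVFP4 dequantization; names are illustrative only.

# E2M1 (4-bit float) magnitudes indexed by the low 3 bits; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one block scale across 16 consecutive elements


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code into a Python float via the lookup table."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]


def unpack_fp4(packed):
    """Unpack bytes holding two FP4 codes each (assumed low nibble first)."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_nvfp4(packed, block_scales, global_scale):
    """Apply the per-block scale, then the per-tensor global scale.

    Assumption: the global scale multiplies the result. If a checkpoint
    stores its reciprocal, the caller must invert it first; mixing up the
    two conventions silently produces wrong outputs.
    """
    values = unpack_fp4(packed)
    if len(values) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("expected one block scale per 16 elements")
    return [
        value * block_scales[index // BLOCK_SIZE] * global_scale
        for index, value in enumerate(values)
    ]
```

For example, `dequantize_nvfp4(bytes([0x21, 0xF9] + [0] * 6), [2.0], 0.5)` decodes one 16-element block whose first four values are 0.5, 1.0, -0.5 and -6.0. Getting either scale level or its convention wrong changes every value in the block, which is why scale handling is worth a careful review.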
This pull request has merge conflicts that must be resolved before it can be merged.
…fp4-simulated-quark
  # Move the E2M1 lookup table to the device now, because
  # `.to(device)` is not allowed during CUDA graph capture.
- kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.weight.device)
+ kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(layer.w13_weight.device)
Typo from #40033 - surprised it slipped in.
Hi @kylesayrs @mgoin, this PR should be in a good state. I'd appreciate it if you could have another look. Thank you!
…sk for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Getting a seemingly unrelated CI failure: https://buildkite.com/vllm/ci/builds/64486#019df8a9-125b-43c1-af7f-679765bfef60
Purpose
Quark (https://github.com/amd/Quark/) has experimental NVFP4 support that will be extended in future releases. This PR enables vLLM to load NVFP4 models (dense and MoE) quantized with the Quark library.
Todo:
Test Plan
pytest tests/quantization/test_quark.py -s -vvvvv -k "test_nvfp4_wikitext_correctness"