Support 4over6 nvfp4 for quantizer and fused MoE #3264

Open
zianglih wants to merge 19 commits into flashinfer-ai:main from zianglih:4over6

Conversation

@zianglih (Contributor) commented May 7, 2026

📌 Description


Implement 4over6 nvfp4 from the TE PR:

For original nvfp4, only cutlass_fused_moe is supported.
For per-token nvfp4, only trtllm_fp4_block_scale_routed_moe and trtllm_fp4_block_scale_moe are supported.

The results are bitwise exact with the reference implementation when the following are enabled:

  • FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1
  • TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1

Note: Set `FLASHINFER_NVFP4_4OVER6=1` to enable the CUDA backend's 4over6 MSE scale-candidate mode for fp16/bf16 NVFP4 quantization. This mode uses the fouroversix adaptive NV scale range, `256 * 6`, instead of the standard NVFP4 range, `448 * 6`. For non-per-token outputs, downstream dequantization or GEMM code must use the corresponding adjusted global scale (a sketch follows below). Set `FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1` to use the bitwise-exact MSE comparison path.

Under strict no-fast-math mode, the quantizer is bitwise exact with the PyTorch reference implementation.
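
As a sketch (not the library API), enabling the mode and deriving the adjusted global encode scale might look like the following; global_encode_scale is a hypothetical helper, loosely mirroring nvfp4_global_encode_scale_te in tests/test_helpers/utils_fp4.py.

import os

# Assumption: flags must be set before the quantizer runs in this process.
os.environ["FLASHINFER_NVFP4_4OVER6"] = "1"
os.environ["FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH"] = "1"  # bitwise-exact MSE path
os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = "1"

def global_encode_scale(amax: float, use_4over6: bool) -> float:
    # 4over6 narrows the NV scale range from 448 * 6 to 256 * 6, so
    # non-per-token consumers must use the matching adjusted global scale.
    nv_range = (256.0 if use_4over6 else 448.0) * 6.0
    return nv_range / amax if amax > 0.0 else 1.0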

Need to rebase after:

Future work:

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or via my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced NVFP4 "4-over-6" quantization mode for improved FP4 precision, configurable via environment variables
    • Added MSE-based scale candidate selection to enhance quantization accuracy
    • Implemented runtime toggles for FP4 fast-math and optimization control
  • Improvements

    • Enhanced FP4 quantization kernel dispatch for flexible runtime configuration


@coderabbitai (Bot) commented May 7, 2026

📝 Walkthrough

This PR extends NVFP4 per-token quantization with a runtime-configurable 4-over-6 MSE-based scale-candidate selection mode. It adds environment-controlled dispatch helpers, dual-candidate FP4 conversion utilities with per-block MSE comparison, kernel-level scale derivation, and comprehensive test coverage across unit tests and MoE integration tests.
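
A minimal PyTorch sketch of the dual-candidate idea follows. The two candidate targets (amax mapped to 6.0 vs. amax mapped to 4.0) and the tie-break are illustrative assumptions here; the exact derivation lives in cvt_warp_fp16_to_fp4 and the TE reference.

import torch

E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 e2m1 magnitudes
GRID = torch.cat([-E2M1.flip(0), E2M1])  # all signed representable values

def quant_dequant(block: torch.Tensor, amax_target: float) -> torch.Tensor:
    # Scale so the block's absolute max maps to amax_target, round each
    # element to the nearest representable e2m1 value, then scale back.
    scale = block.abs().max().clamp(min=1e-12) / amax_target
    idx = ((block / scale).unsqueeze(-1) - GRID).abs().argmin(dim=-1)
    return GRID[idx] * scale

def pick_candidate(block: torch.Tensor) -> torch.Tensor:
    dq6 = quant_dequant(block, 6.0)  # candidate 1: standard NVFP4 target
    dq4 = quant_dequant(block, 4.0)  # candidate 2: assumed "4 over 6" target
    # Keep the candidate with the lower per-block squared error.
    return dq4 if ((dq4 - block) ** 2).sum() < ((dq6 - block) ** 2).sum() else dq6

# e.g. pick_candidate(torch.randn(16)) selects per 16-element block; the CUDA
# kernel performs the same comparison with a warp reduction instead of .sum().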

Changes

NVFP4 4-over-6 per-token quantization

Layer / File(s) / Summary

  • Environment Configuration (csrc/nv_internal/tensorrt_llm/common/envUtils.h, csrc/nv_internal/cpp/common/envUtils.cpp): Add getEnvNVFP4Use4Over6() and getEnvNVFP4Disable4Over6MSEFastMath() accessors; remove static caching from getEnvDisableFP4QuantFastMath() for fresh reads.
  • Template Signatures (csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh): Extend quantize_with_block_size, quantize_with_block_size_tma, cvt_fp16_to_fp4_expert, and block_scale_interleave_kernel with USE_4OVER6 and DISABLE_4OVER6_MSE_FAST_MATH template parameters; add compile-time constraints.
  • FP16→FP4 Conversion Core (csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh): Add e2m1_code_to_float() helper; extend cvt_warp_fp16_to_fp4 with dual-candidate scale generation (6.0 and 4-over-6), per-element MSE computation, warp reduction, and lower-error selection.
  • Dispatch Helpers (csrc/nv_internal/cpp/kernels/quantization.cu): Introduce dispatchBool, dispatchSFLayout, dispatchFP4QuantMathMode, and dispatchFP4KernelConfig to route env-driven template selections (see the sketch after this table); refactor invokeNvfp4QuantAndPerTokenScale and launchFP4QuantizationTma.
  • Quantization Kernel Logic (csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh, csrc/nv_internal/cpp/kernels/quantization.cu): Update SF inversion to use precise division when fast-math is disabled; add an nvfp4QuantAndPerTokenScaleKernel branch for the 4-over-6-adjusted per-token scale with zero/denormal handling; wire USE_4OVER6 and DISABLE_4OVER6_MSE_FAST_MATH through cvt_warp_fp16_to_fp4 calls.
  • Cutlass MoE Integration (csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh): Add dispatchNVFP4QuantConfig dispatch helper; extend quantizePackedFPXValue, expandInputRowsKernel, and doActivationKernel with 4-over-6 template parameters and a constraint that USE_4OVER6 applies only to NVFP4.
  • Python Reference Helpers (tests/test_helpers/utils_fp4.py): Update nvfp4_global_encode_scale_te() and nvfp4_global_decode_scale_te() to accept use_4over6; add _ref_fp4_quant_te_with_decode_scale() and ref_fp4_quant_4over6_te() for dual-candidate MSE-based selection.
  • Test Environment Management (tests/moe/utils.py, tests/utils/test_fp4_quantize.py): Add an nvfp4_4over6_env() context manager and an autouse set_nvfp4_4over6_env() fixture to manage env variables; add _te_ref_scale_bytes_for_layout() to convert reference scales to per-layout byte tensors.
  • Test Coverage (tests/utils/test_fp4_quantize.py, tests/moe/test_trtllm_cutlass_fused_moe.py, tests/moe/test_trtllm_gen_per_token_moe.py): Parametrize core and MoE tests with use_4over6 and weights_use_4over6; compute expected scales using use_4over6-aware helpers; apply MSE tolerances for 4-over-6 mode; wrap weight quantization in the env context.
  • Runner Integration (csrc/trtllm_fused_moe_runner.cu): Update globalScaleInv computation to conditionally select 1/(256*6) or 1/(448*6) via getEnvNVFP4Use4Over6().
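
A Python analogue of the dispatch-helper and runner rows above; the real code is C++ (envUtils.cpp, quantization.cu, trtllm_fused_moe_runner.cu), and the kernel names below are placeholders, not the library's API.

import os

def env_flag(name: str) -> bool:
    # Fresh read on every call, mirroring the removal of static caching from
    # getEnvDisableFP4QuantFastMath() so tests can flip flags in-process.
    return os.environ.get(name, "0") == "1"

def kernel_standard(x): ...      # placeholder: non-4over6 instantiation
def kernel_4over6_fast(x): ...   # placeholder: USE_4OVER6, fast-math MSE
def kernel_4over6_exact(x): ...  # placeholder: USE_4OVER6, bitwise-exact MSE

def dispatch_fp4_kernel():
    # Env-driven selection of a pre-instantiated kernel; in CUDA these are
    # compile-time template parameters routed by dispatchBool and friends.
    use_4over6 = env_flag("FLASHINFER_NVFP4_4OVER6")
    exact_mse = env_flag("FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH")
    if not use_4over6:
        return kernel_standard
    return kernel_4over6_exact if exact_mse else kernel_4over6_fast

# Runner-side scale selection per the Runner Integration row:
global_scale_inv = (
    1.0 / (256.0 * 6.0) if env_flag("FLASHINFER_NVFP4_4OVER6") else 1.0 / (448.0 * 6.0)
)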

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • yzh119
  • sricketts
  • IwakuraRein
  • cyx-6
  • samuellees
  • bkryu
  • saltyminty

Poem

🐰 Four-over-six, a wily choice to make,
Two candidates for every FP4 stake,
MSE whispers which is best,
Per-token scaling passes the test,
Environment flags guide the way,
Making NVFP4 smarter every day! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 44.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The PR title clearly and concisely describes the main change: adding support for 4over6 nvfp4 quantization to both the quantizer and fused MoE components.
  • Description check: ✅ Passed. The PR description includes a detailed 📌 Description section with implementation details, references to papers/code, environment variable documentation, and notes on support coverage. All required checklist items are marked as complete.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


⚠️ Review ran into problems: timed out fetching pipeline failures after 30000ms.


@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a new NVFP4 quantization mode called '4/6 MSE scale-candidate mode,' which is activated via the FLASHINFER_NVFP4_FOUR_OVER_SIX environment variable. The implementation includes updates to CUDA kernels for per-token scaling and quantization, as well as corresponding Python tests and documentation. Reviewer feedback suggests several optimizations for the CUDA code, including refactoring duplicated logic into helper functions, precalculating values to reduce redundant arithmetic operations within loops, and replacing switch statements with lookup tables to improve performance and readability.

@coderabbitai (Bot) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/utils/test_fp4_quantize.py (1)

706-747: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pin FOUR_OVER_SIX off for the baseline TE-reference test.

Line 706 validates the non-4/6 reference path, but this test can be affected by an externally set FLASHINFER_NVFP4_FOUR_OVER_SIX. Make the mode explicit in-test to avoid environment-coupled failures.

🔧 Proposed fix
 def test_nvfp4_per_token_quantize_te_reference(
     dtype: torch.dtype,
     shape: tuple[int, int],
     sf_layout: SfLayout,
     init_data: str,
     device: str,
+    monkeypatch: pytest.MonkeyPatch,
 ) -> None:
     """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
     if not _is_fp4_supported(torch.device(device)):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
+    monkeypatch.setenv("FLASHINFER_NVFP4_FOUR_OVER_SIX", "0")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/test_fp4_quantize.py` around lines 706 - 747, In
test_nvfp4_per_token_quantize_te_reference ensure the FOUR_OVER_SIX mode is
pinned off so the TE-reference path is deterministic: at the start of
test_nvfp4_per_token_quantize_te_reference set the environment flag
FLASHINFER_NVFP4_FOUR_OVER_SIX="0" (or call your library’s setter if available)
before creating x and running ref_fp4_quant_te/nvfp4_quantize, and restore the
previous value at the end of the test to avoid leaking global state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9dcc260-81db-4c62-b9e1-585a7ba243bb

📥 Commits

Reviewing files that changed from the base of the PR and between c5c089b and 0b79d4f.

📒 Files selected for processing (5)
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • flashinfer/quantization/fp4_quantization.py
  • tests/utils/test_fp4_quantize.py

@aleozlx (Collaborator) left a comment

looks good to me so far!

thx for the contrib. pls address conflicts

zianglih marked this pull request as draft May 8, 2026 19:08
@coderabbitai (Bot) left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 064e42d2-1286-4387-8bd1-9c66fe18ddac

📥 Commits

Reviewing files that changed from the base of the PR and between 0b79d4f and b36e9a6.

📒 Files selected for processing (7)
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • flashinfer/quantization/fp4_quantization.py
  • tests/utils/test_fp4_quantize.py
✅ Files skipped from review due to trivial changes (2)
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • flashinfer/quantization/fp4_quantization.py

zianglih changed the title from "Implement 4 over 6 nvfp4 quantizer for per-token nvfp4" to "Implement 4over6 nvfp4 quantizer for per-token nvfp4" on May 8, 2026
zianglih changed the title from "Implement 4over6 nvfp4 quantizer for per-token nvfp4" to "Implement 4over6 nvfp4 quantizer" on May 8, 2026
zianglih changed the title from "Implement 4over6 nvfp4 quantizer" to "Support 4over6 nvfp4 for quantizer and fused MoE" on May 9, 2026
zianglih marked this pull request as ready for review May 9, 2026 01:06
@coderabbitai (Bot) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/moe/test_trtllm_gen_per_token_moe.py (1)

114-134: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This changes the scales, but not the backend mode.

The new use_4over6 branch only rewrites the Python-side NVFP4 scale factors. The test never enables 4over6 via set_nvfp4_4over6_env before calling nvfp4_quantize() and trtllm_fp4_block_scale_routed_moe(), so the True cases are not validating the actual 4over6 implementation. Apply the shared env helper around the quantize + kernel section so both sides run in the same mode.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_trtllm_gen_per_token_moe.py` around lines 114 - 134, The test
only updates Python-side scales via nvfp4_global_decode_scale_te but never flips
the backend mode, so wrap the quantize+kernel calls with the shared helper
set_nvfp4_4over6_env(use_4over6) so the backend is actually in 4over6 mode when
calling nvfp4_quantize and trtllm_fp4_block_scale_routed_moe; specifically, call
set_nvfp4_4over6_env(use_4over6) around the block that computes
hidden_states/hidden_states_scale/per_token_scale_inv with nvfp4_quantize and
the subsequent trtllm_fp4_block_scale_routed_moe invocation so both scale
computation and kernel execution use the same mode (references:
nvfp4_global_decode_scale_te, nvfp4_quantize, set_nvfp4_4over6_env,
trtllm_fp4_block_scale_routed_moe).
♻️ Duplicate comments (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)

338-362: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

FP32 input still aborts when FLASHINFER_NVFP4_4OVER6=1.

use4Over6 is read unconditionally from the process-global env var, and the if constexpr (std::is_same_v<T, float>) branch then aborts via TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...). Any caller that quantizes a float input in a process where the env var is set (e.g. an MoE test running after a 4-over-6 test set the env in the same process) will fail, even though the legacy FP32 kernel is unchanged and capable of handling the request. Force use4Over6=false for T=float at the env-read site instead of aborting downstream.

💡 Suggested fix
-  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
-  bool const use4Over6 = tensorrt_llm::common::getEnvNVFP4Use4Over6();
-  bool const disable4Over6MSEFastMath = tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();
+  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
+  bool const use4Over6 =
+      !std::is_same_v<T, float> && tensorrt_llm::common::getEnvNVFP4Use4Over6();
+  bool const disable4Over6MSEFastMath =
+      use4Over6 && tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();

With that, the TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) inside the T=float branch becomes unreachable and can be dropped (or kept defensively).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@csrc/nv_internal/cpp/kernels/quantization.cu` around lines 338 - 362, The
code reads the process-global use4Over6 unconditionally which causes FP32
instantiations to abort; fix by making the env-read T-aware: move or re-evaluate
tensorrt_llm::common::getEnvNVFP4Use4Over6() into the template/lambda scope
where T is visible (the launchKernel capture/instantiation) and force it false
for T=float (e.g. compute auto const use4Over6 =
tensorrt_llm::common::getEnvNVFP4Use4Over6() && !std::is_same_v<T,float> and
pass that as the use4Over6Tag/std::bool_constant), then remove or leave the
now-unreachable TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) in the float branch.
🧹 Nitpick comments (1)
tests/test_helpers/utils_fp4.py (1)

295-302: ⚡ Quick win

Vectorize the per-element MSE accumulation.

The explicit Python loop over block_size=16 is unnecessary work and obscures the intent. A vectorized form is shorter, faster, and (because the reduction order across the last dim is implementation-defined either way) preserves the strict < tiebreak on pick_four.

♻️ Proposed refactor
-    err4 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    err6 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    for i in range(block_size):
-        diff4 = dq4[:, :, i] - x_blocks[:, :, i]
-        diff6 = dq6[:, :, i] - x_blocks[:, :, i]
-        err4 += diff4 * diff4
-        err6 += diff6 * diff6
-    pick_four = err4 < err6
+    err4 = ((dq4 - x_blocks) ** 2).sum(dim=-1)
+    err6 = ((dq6 - x_blocks) ** 2).sum(dim=-1)
+    pick_four = err4 < err6
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_helpers/utils_fp4.py` around lines 295 - 302, The loop computes
per-block MSE by accumulating squared differences across the last dim; replace
the explicit for-loop with a vectorized reduction: compute diff4 = dq4 -
x_blocks and diff6 = dq6 - x_blocks, square them and sum over the last axis to
produce err4 and err6, then set pick_four = err4 < err6 (preserving the strict <
tiebreak). Update variables err4, err6, diff4, diff6 and use the existing dq4,
dq6, x_blocks, and pick_four names so the change is localized to that block.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2680-2687: run_moe_test currently only uses use_4over6 for
skipping but never actually sets the process env, so 4over6 paths may not be
exercised; wrap the quantize/reference/production section inside the
set_nvfp4_4over6_env context by calling set_nvfp4_4over6_env(use_4over6) (and
ensure the helper is imported) before entering the FP4
quantize/reference/production logic in run_moe_test and restore/unset it after
that block so the FLASHINFER_NVFP4_4OVER6 env state is consistently applied only
for those test cases.

In `@tests/moe/test_trtllm_gen_moe_autotune_tactics.py`:
- Around line 160-169: The test never actually enables the 4over6 NVFP4 runtime
flag because set_nvfp4_4over6_env is never applied; update the test harness so
that when _quant_mode_config is called with use_4over6=True the runtime
environment is toggled for the kernel run: call set_nvfp4_4over6_env(True)
before invoking _run_kernel_with_tactic (and set_nvfp4_4over6_env(False) or
restore the previous state after) so the launched kernel uses the 4over6 path;
adjust every place that constructs the use_4over6=True matrix (including the
other occurrences you noted) to wrap the kernel invocation with the env setter
rather than only changing scales.

In `@tests/moe/test_trtllm_gen_routed_fused_moe.py`:
- Around line 82-83: The test toggles use_4over6 but never actually flips the
NVFP4 4over6 environment, so fp4_quantize() and the routed/non-routed MoE kernel
calls still use the global env; fix by wrapping the sections that perform FP4
quantization and invoke the MoE kernels (references: fp4_quantize, the routed
MoE kernel call(s) and the non-routed MoE kernel call(s)) in the
set_nvfp4_4over6_env context when use_4over6 is True (e.g., with
set_nvfp4_4over6_env(): ...) so the env is applied for those operations and is
restored afterward; apply this same wrapping to the other similar test blocks
currently duplicated later in the file.

In `@tests/moe/utils.py`:
- Around line 40-65: The fixture set_nvfp4_4over6_env currently force-sets
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH and
FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH unconditionally; change it so
those two env vars are only set when request.getfixturevalue("use_4over6") is
truthy (i.e., set them inside the branch where use_4over6 is True and leave them
untouched when False), while still recording original_values and restoring them
after yield; keep FLASHINFER_NVFP4_4OVER6 set to "1"/"0" based on use_4over6 as
before.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 07674709-c056-41fd-8bc9-27c3e59e1102

📥 Commits

Reviewing files that changed from the base of the PR and between b36e9a6 and 7d2f214.

📒 Files selected for processing (14)
  • csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • tests/moe/test_trtllm_cutlass_fused_moe.py
  • tests/moe/test_trtllm_gen_fused_moe.py
  • tests/moe/test_trtllm_gen_moe_autotune_tactics.py
  • tests/moe/test_trtllm_gen_per_token_moe.py
  • tests/moe/test_trtllm_gen_routed_fused_moe.py
  • tests/moe/utils.py
  • tests/test_helpers/utils_fp4.py
  • tests/utils/test_fp4_quantize.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh

@IwakuraRein (Collaborator) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !655 has been created, and the CI pipeline #50739016 is currently running. I'll report back once the pipeline job completes.

@zianglih (Contributor, Author) commented May 11, 2026

Some fused MoE APIs may use quantizers within their cubins. Only (1) per-token fused MoE and (2) cutlass fused MoE are correctly supported.

zianglih marked this pull request as draft May 11, 2026 05:24
@zianglih (Contributor, Author) commented:

# flashinfer_trtllm_routed without 4over6
SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE=1 \
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1 \
python -m sglang.launch_server \
  --kv-cache-dtype bf16 \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --disable-piecewise-cuda-graph \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.937
Invalid: 0.001
Latency: 10.785 s
Output throughput: 13958.097 token/s
Accuracy: 0.937
Invalid: 0.001
Latency: 10.849 s
Output throughput: 13875.159 token/s

# flashinfer_trtllm_routed with 4over6 after fix
FLASHINFER_NVFP4_4OVER6=1 \
FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1 \
SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE=1 \
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1 \
python -m sglang.launch_server \
  --kv-cache-dtype bf16 \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --disable-piecewise-cuda-graph \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.934
Invalid: 0.001
Latency: 10.575 s
Output throughput: 13768.806 token/s
Accuracy: 0.934
Invalid: 0.001
Latency: 10.628 s
Output throughput: 13700.039 token/s

zianglih marked this pull request as ready for review May 11, 2026 06:39
@coderabbitai (Bot) left a comment

Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ed55cda-2c4c-4507-9a5c-3ddcbb07ab4a

📥 Commits

Reviewing files that changed from the base of the PR and between 7d2f214 and c37ec1f.

📒 Files selected for processing (3)
  • csrc/trtllm_fused_moe_runner.cu
  • tests/moe/test_trtllm_gen_per_token_moe.py
  • tests/moe/utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/moe/utils.py

Comment on lines +116 to 125

    hidden_states_global_scale_inv = nvfp4_global_decode_scale_te(
        torch.ones((), dtype=torch.float32, device=device),
        use_4over6=use_4over6,
    )
    hidden_states, hidden_states_scale, per_token_scale_inv = nvfp4_quantize(
        hidden_states_bf16,
        1.0 / (448.0 * 6.0),
        hidden_states_global_scale_inv,
        sfLayout=SfLayout.layout_linear,
        per_token_activation=True,
    )
@coderabbitai (Bot) commented May 11, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply use_4over6 runtime env to hidden-state quantization as well.

use_4over6=True updates decode-scale selection (Line 116-Line 119), but nvfp4_quantize on Line 120 still runs outside nvfp4_4over6_env(...). That can miss activation-side 4over6 quantizer dispatch and reduce test validity.

Suggested patch
     hidden_states_global_scale_inv = nvfp4_global_decode_scale_te(
         torch.ones((), dtype=torch.float32, device=device),
         use_4over6=use_4over6,
     )
-    hidden_states, hidden_states_scale, per_token_scale_inv = nvfp4_quantize(
-        hidden_states_bf16,
-        hidden_states_global_scale_inv,
-        sfLayout=SfLayout.layout_linear,
-        per_token_activation=True,
-    )
+    with moe_utils.nvfp4_4over6_env(use_4over6):
+        hidden_states, hidden_states_scale, per_token_scale_inv = nvfp4_quantize(
+            hidden_states_bf16,
+            hidden_states_global_scale_inv,
+            sfLayout=SfLayout.layout_linear,
+            per_token_activation=True,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_trtllm_gen_per_token_moe.py` around lines 116 - 125, The nvfp4
activation quantization call is executed outside the 4over6 runtime env so when
use_4over6=True the activation-side quantizer dispatch may not be applied; wrap
the nvfp4_quantize invocation in the same nvfp4_4over6_env(use_4over6) context
used for nvfp4_global_decode_scale_te (or otherwise set the 4over6 runtime for
the scope) so hidden_states, hidden_states_scale, per_token_scale_inv are
produced under nvfp4_4over6_env(use_4over6) using the same inputs
(hidden_states_bf16, hidden_states_global_scale_inv, SfLayout.layout_linear,
per_token_activation).

@zianglih (Contributor, Author) replied:

tests/moe/utils.py defines set_nvfp4_4over6_env as an autouse=True pytest fixture; each commented file imports/binds it into module scope.
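
For context, a sketch of the autouse-fixture pattern being referenced; the actual fixture lives in tests/moe/utils.py and may differ in detail.

import os
import pytest

@pytest.fixture(autouse=True)
def set_nvfp4_4over6_env(request):
    # Derive the mode from the test's use_4over6 parameter when present.
    use_4over6 = (
        request.getfixturevalue("use_4over6")
        if "use_4over6" in request.fixturenames
        else False
    )
    keys = (
        "FLASHINFER_NVFP4_4OVER6",
        "TRTLLM_DISABLE_FP4_QUANT_FAST_MATH",
        "FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH",
    )
    original = {k: os.environ.get(k) for k in keys}
    os.environ["FLASHINFER_NVFP4_4OVER6"] = "1" if use_4over6 else "0"
    if use_4over6:
        os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = "1"
        os.environ["FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH"] = "1"
    yield
    for k, v in original.items():  # restore prior env after the test
        if v is None:
            os.environ.pop(k, None)
        else:
            os.environ[k] = v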

@coderabbitai (Bot) replied:

@zianglih, you're right — I missed that. Since set_nvfp4_4over6_env is an autouse=True fixture that activates whenever the use_4over6 parameter fixture is present, the entire test body (including the hidden_states nvfp4_quantize call) already executes under the correct FLASHINFER_NVFP4_4OVER6 environment derived from use_4over6. The separate with moe_utils.nvfp4_4over6_env(weights_use_4over6) block makes sense only for weight quantization because weights_use_4over6 can differ from use_4over6. My suggestion was incorrect — no change needed here.


✏️ Learnings added
Learnt from: zianglih
Repo: flashinfer-ai/flashinfer PR: 3264
File: tests/moe/test_trtllm_gen_per_token_moe.py:116-125
Timestamp: 2026-05-11T06:57:06.760Z
Learning: In the flashinfer repo (tests/moe/), `set_nvfp4_4over6_env` is an `autouse=True` pytest fixture defined in `tests/moe/utils.py`. When a test file binds it at module scope (e.g., `set_nvfp4_4over6_env = moe_utils.set_nvfp4_4over6_env`), the fixture automatically sets `FLASHINFER_NVFP4_4OVER6`, `TRTLLM_DISABLE_FP4_QUANT_FAST_MATH`, and `FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH` for any test parametrized with `use_4over6`. This means the entire test body already runs under the correct 4over6 environment — no additional context manager wrapping is needed for activation-side quantization.



🧠 Learnings used
Learnt from: yzh119
Repo: flashinfer-ai/flashinfer PR: 2370
File: tests/gdn/conftest.py:25-34
Timestamp: 2026-01-21T21:26:00.701Z
Learning: Tests in the repository assume CUDA is available and do not require torch.cuda.is_available() guards in pytest fixtures. Ensure test files under tests/ follow this convention and avoid adding CPU-only guards in fixtures unless explicitly handling a non-CUDA environment.

@aleozlx (Collaborator) commented May 11, 2026

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !655 has been updated with latest changes, and the CI pipeline #50962827 is currently running. I'll report back once the pipeline job completes.
