Support 4over6 nvfp4 for quantizer and fused MoE #3264

Open
zianglih wants to merge 19 commits into flashinfer-ai:main from zianglih:4over6

Conversation

@zianglih (Contributor) commented May 7, 2026

📌 Description


Implement 4over6 nvfp4 from the TE PR:

For original nvfp4, only cutlass_fused_moe is supported.
For per-token nvfp4, only trtllm_fp4_block_scale_routed_moe and trtllm_fp4_block_scale_moe are supported.

The results are bitwise exact with the reference implementation when the following are enabled:

  • FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1
  • TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1

Note: Set `FLASHINFER_NVFP4_4OVER6=1` to enable the CUDA backend's 4over6 MSE scale-candidate mode for fp16/bf16 NVFP4 quantization. This mode uses the fouroversix adaptive NV scale range, `256 * 6`, instead of the standard NVFP4 range, `448 * 6`. For non-per-token outputs, downstream dequantization or GEMM code must use the corresponding adjusted global scale (a sketch follows below). Set `FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1` to use the bitwise-exact MSE comparison path.

Under strict no-fast-math mode, the quantizer is bitwise exact with the PyTorch reference implementation.
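
As a sketch (not the library API), enabling the mode and deriving the adjusted global encode scale might look like the following; global_encode_scale is a hypothetical helper, loosely mirroring nvfp4_global_encode_scale_te in tests/test_helpers/utils_fp4.py.

import os

# Assumption: flags must be set before the quantizer runs in this process.
os.environ["FLASHINFER_NVFP4_4OVER6"] = "1"
os.environ["FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH"] = "1"  # bitwise-exact MSE path
os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = "1"

def global_encode_scale(amax: float, use_4over6: bool) -> float:
    # 4over6 narrows the NV scale range from 448 * 6 to 256 * 6, so
    # non-per-token consumers must use the matching adjusted global scale.
    nv_range = (256.0 if use_4over6 else 448.0) * 6.0
    return nv_range / amax if amax > 0.0 else 1.0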

Need to rebase after:

Future work:

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or via my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced NVFP4 "4-over-6" quantization mode for improved FP4 precision, configurable via environment variables
    • Added MSE-based scale candidate selection to enhance quantization accuracy
    • Implemented runtime toggles for FP4 fast-math and optimization control
  • Improvements

    • Enhanced FP4 quantization kernel dispatch for flexible runtime configuration


@coderabbitai (Bot) commented May 7, 2026

📝 Walkthrough

This PR extends NVFP4 per-token quantization with a runtime-configurable 4-over-6 MSE-based scale-candidate selection mode. It adds environment-controlled dispatch helpers, dual-candidate FP4 conversion utilities with per-block MSE comparison, kernel-level scale derivation, and comprehensive test coverage across unit tests and MoE integration tests.
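
A minimal PyTorch sketch of the dual-candidate idea follows. The two candidate targets (amax mapped to 6.0 vs. amax mapped to 4.0) and the tie-break are illustrative assumptions here; the exact derivation lives in cvt_warp_fp16_to_fp4 and the TE reference.

import torch

E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 e2m1 magnitudes
GRID = torch.cat([-E2M1.flip(0), E2M1])  # all signed representable values

def quant_dequant(block: torch.Tensor, amax_target: float) -> torch.Tensor:
    # Scale so the block's absolute max maps to amax_target, round each
    # element to the nearest representable e2m1 value, then scale back.
    scale = block.abs().max().clamp(min=1e-12) / amax_target
    idx = ((block / scale).unsqueeze(-1) - GRID).abs().argmin(dim=-1)
    return GRID[idx] * scale

def pick_candidate(block: torch.Tensor) -> torch.Tensor:
    dq6 = quant_dequant(block, 6.0)  # candidate 1: standard NVFP4 target
    dq4 = quant_dequant(block, 4.0)  # candidate 2: assumed "4 over 6" target
    # Keep the candidate with the lower per-block squared error.
    return dq4 if ((dq4 - block) ** 2).sum() < ((dq6 - block) ** 2).sum() else dq6

# e.g. pick_candidate(torch.randn(16)) selects per 16-element block; the CUDA
# kernel performs the same comparison with a warp reduction instead of .sum().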

Changes

NVFP4 4-over-6 per-token quantization

Layer / File(s) / Summary

  • Environment Configuration (csrc/nv_internal/tensorrt_llm/common/envUtils.h, csrc/nv_internal/cpp/common/envUtils.cpp): Add getEnvNVFP4Use4Over6() and getEnvNVFP4Disable4Over6MSEFastMath() accessors; remove static caching from getEnvDisableFP4QuantFastMath() for fresh reads.
  • Template Signatures (csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh): Extend quantize_with_block_size, quantize_with_block_size_tma, cvt_fp16_to_fp4_expert, and block_scale_interleave_kernel with USE_4OVER6 and DISABLE_4OVER6_MSE_FAST_MATH template parameters; add compile-time constraints.
  • FP16→FP4 Conversion Core (csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh): Add e2m1_code_to_float() helper; extend cvt_warp_fp16_to_fp4 with dual-candidate scale generation (6.0 and 4-over-6), per-element MSE computation, warp reduction, and lower-error selection.
  • Dispatch Helpers (csrc/nv_internal/cpp/kernels/quantization.cu): Introduce dispatchBool, dispatchSFLayout, dispatchFP4QuantMathMode, and dispatchFP4KernelConfig to route env-driven template selections (see the sketch after this table); refactor invokeNvfp4QuantAndPerTokenScale and launchFP4QuantizationTma.
  • Quantization Kernel Logic (csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh, csrc/nv_internal/cpp/kernels/quantization.cu): Update SF inversion to use precise division when fast-math is disabled; add an nvfp4QuantAndPerTokenScaleKernel branch for the 4-over-6-adjusted per-token scale with zero/denormal handling; wire USE_4OVER6 and DISABLE_4OVER6_MSE_FAST_MATH through cvt_warp_fp16_to_fp4 calls.
  • Cutlass MoE Integration (csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh): Add dispatchNVFP4QuantConfig dispatch helper; extend quantizePackedFPXValue, expandInputRowsKernel, and doActivationKernel with 4-over-6 template parameters and a constraint that USE_4OVER6 applies only to NVFP4.
  • Python Reference Helpers (tests/test_helpers/utils_fp4.py): Update nvfp4_global_encode_scale_te() and nvfp4_global_decode_scale_te() to accept use_4over6; add _ref_fp4_quant_te_with_decode_scale() and ref_fp4_quant_4over6_te() for dual-candidate MSE-based selection.
  • Test Environment Management (tests/moe/utils.py, tests/utils/test_fp4_quantize.py): Add an nvfp4_4over6_env() context manager and an autouse set_nvfp4_4over6_env() fixture to manage env variables; add _te_ref_scale_bytes_for_layout() to convert reference scales to per-layout byte tensors.
  • Test Coverage (tests/utils/test_fp4_quantize.py, tests/moe/test_trtllm_cutlass_fused_moe.py, tests/moe/test_trtllm_gen_per_token_moe.py): Parametrize core and MoE tests with use_4over6 and weights_use_4over6; compute expected scales using use_4over6-aware helpers; apply MSE tolerances for 4-over-6 mode; wrap weight quantization in the env context.
  • Runner Integration (csrc/trtllm_fused_moe_runner.cu): Update globalScaleInv computation to conditionally select 1/(256*6) or 1/(448*6) via getEnvNVFP4Use4Over6().
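
A Python analogue of the dispatch-helper and runner rows above; the real code is C++ (envUtils.cpp, quantization.cu, trtllm_fused_moe_runner.cu), and the kernel names below are placeholders, not the library's API.

import os

def env_flag(name: str) -> bool:
    # Fresh read on every call, mirroring the removal of static caching from
    # getEnvDisableFP4QuantFastMath() so tests can flip flags in-process.
    return os.environ.get(name, "0") == "1"

def kernel_standard(x): ...      # placeholder: non-4over6 instantiation
def kernel_4over6_fast(x): ...   # placeholder: USE_4OVER6, fast-math MSE
def kernel_4over6_exact(x): ...  # placeholder: USE_4OVER6, bitwise-exact MSE

def dispatch_fp4_kernel():
    # Env-driven selection of a pre-instantiated kernel; in CUDA these are
    # compile-time template parameters routed by dispatchBool and friends.
    use_4over6 = env_flag("FLASHINFER_NVFP4_4OVER6")
    exact_mse = env_flag("FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH")
    if not use_4over6:
        return kernel_standard
    return kernel_4over6_exact if exact_mse else kernel_4over6_fast

# Runner-side scale selection per the Runner Integration row:
global_scale_inv = (
    1.0 / (256.0 * 6.0) if env_flag("FLASHINFER_NVFP4_4OVER6") else 1.0 / (448.0 * 6.0)
)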

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • yzh119
  • sricketts
  • IwakuraRein
  • cyx-6
  • samuellees
  • bkryu
  • saltyminty

Poem

🐰 Four-over-six, a wily choice to make,
Two candidates for every FP4 stake,
MSE whispers which is best,
Per-token scaling passes the test,
Environment flags guide the way,
Making NVFP4 smarter every day! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 44.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The PR title clearly and concisely describes the main change: adding support for 4over6 nvfp4 quantization to both the quantizer and fused MoE components.
  • Description check: ✅ Passed. The PR description includes a detailed 📌 Description section with implementation details, references to papers/code, environment variable documentation, and notes on support coverage. All required checklist items are marked as complete.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


⚠️ Review ran into problems: timed out fetching pipeline failures after 30000ms.


@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a new NVFP4 quantization mode called '4/6 MSE scale-candidate mode,' which is activated via the FLASHINFER_NVFP4_FOUR_OVER_SIX environment variable. The implementation includes updates to CUDA kernels for per-token scaling and quantization, as well as corresponding Python tests and documentation. Reviewer feedback suggests several optimizations for the CUDA code, including refactoring duplicated logic into helper functions, precalculating values to reduce redundant arithmetic operations within loops, and replacing switch statements with lookup tables to improve performance and readability.

@coderabbitai (Bot) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/utils/test_fp4_quantize.py (1)

706-747: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pin FOUR_OVER_SIX off for the baseline TE-reference test.

Line 706 validates the non-4/6 reference path, but this test can be affected by an externally set FLASHINFER_NVFP4_FOUR_OVER_SIX. Make the mode explicit in-test to avoid environment-coupled failures.

🔧 Proposed fix
 def test_nvfp4_per_token_quantize_te_reference(
     dtype: torch.dtype,
     shape: tuple[int, int],
     sf_layout: SfLayout,
     init_data: str,
     device: str,
+    monkeypatch: pytest.MonkeyPatch,
 ) -> None:
     """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
     if not _is_fp4_supported(torch.device(device)):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
+    monkeypatch.setenv("FLASHINFER_NVFP4_FOUR_OVER_SIX", "0")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/test_fp4_quantize.py` around lines 706 - 747, In
test_nvfp4_per_token_quantize_te_reference ensure the FOUR_OVER_SIX mode is
pinned off so the TE-reference path is deterministic: at the start of
test_nvfp4_per_token_quantize_te_reference set the environment flag
FLASHINFER_NVFP4_FOUR_OVER_SIX="0" (or call your library’s setter if available)
before creating x and running ref_fp4_quant_te/nvfp4_quantize, and restore the
previous value at the end of the test to avoid leaking global state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9dcc260-81db-4c62-b9e1-585a7ba243bb

📥 Commits

Reviewing files that changed from the base of the PR and between c5c089b and 0b79d4f.

📒 Files selected for processing (5)
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • flashinfer/quantization/fp4_quantization.py
  • tests/utils/test_fp4_quantize.py

@aleozlx (Collaborator) left a comment

looks good to me so far!

thx for the contrib. pls address conflicts

zianglih marked this pull request as draft May 8, 2026 19:08
@coderabbitai (Bot) left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 064e42d2-1286-4387-8bd1-9c66fe18ddac

📥 Commits

Reviewing files that changed from the base of the PR and between 0b79d4f and b36e9a6.

📒 Files selected for processing (7)
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • flashinfer/quantization/fp4_quantization.py
  • tests/utils/test_fp4_quantize.py
✅ Files skipped from review due to trivial changes (2)
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • flashinfer/quantization/fp4_quantization.py

zianglih changed the title from "Implement 4 over 6 nvfp4 quantizer for per-token nvfp4" to "Implement 4over6 nvfp4 quantizer for per-token nvfp4" on May 8, 2026
zianglih changed the title from "Implement 4over6 nvfp4 quantizer for per-token nvfp4" to "Implement 4over6 nvfp4 quantizer" on May 8, 2026
zianglih changed the title from "Implement 4over6 nvfp4 quantizer" to "Support 4over6 nvfp4 for quantizer and fused MoE" on May 9, 2026
zianglih marked this pull request as ready for review May 9, 2026 01:06
@coderabbitai (Bot) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/moe/test_trtllm_gen_per_token_moe.py (1)

114-134: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This changes the scales, but not the backend mode.

The new use_4over6 branch only rewrites the Python-side NVFP4 scale factors. The test never enables 4over6 via set_nvfp4_4over6_env before calling nvfp4_quantize() and trtllm_fp4_block_scale_routed_moe(), so the True cases are not validating the actual 4over6 implementation. Apply the shared env helper around the quantize + kernel section so both sides run in the same mode.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_trtllm_gen_per_token_moe.py` around lines 114 - 134, The test
only updates Python-side scales via nvfp4_global_decode_scale_te but never flips
the backend mode, so wrap the quantize+kernel calls with the shared helper
set_nvfp4_4over6_env(use_4over6) so the backend is actually in 4over6 mode when
calling nvfp4_quantize and trtllm_fp4_block_scale_routed_moe; specifically, call
set_nvfp4_4over6_env(use_4over6) around the block that computes
hidden_states/hidden_states_scale/per_token_scale_inv with nvfp4_quantize and
the subsequent trtllm_fp4_block_scale_routed_moe invocation so both scale
computation and kernel execution use the same mode (references:
nvfp4_global_decode_scale_te, nvfp4_quantize, set_nvfp4_4over6_env,
trtllm_fp4_block_scale_routed_moe).
♻️ Duplicate comments (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)

338-362: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

FP32 input still aborts when FLASHINFER_NVFP4_4OVER6=1.

use4Over6 is read unconditionally from the process-global env var, and the if constexpr (std::is_same_v<T, float>) branch then aborts via TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...). Any caller that quantizes a float input in a process where the env var is set (e.g. an MoE test running after a 4-over-6 test set the env in the same process) will fail, even though the legacy FP32 kernel is unchanged and capable of handling the request. Force use4Over6=false for T=float at the env-read site instead of aborting downstream.

💡 Suggested fix
-  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
-  bool const use4Over6 = tensorrt_llm::common::getEnvNVFP4Use4Over6();
-  bool const disable4Over6MSEFastMath = tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();
+  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
+  bool const use4Over6 =
+      !std::is_same_v<T, float> && tensorrt_llm::common::getEnvNVFP4Use4Over6();
+  bool const disable4Over6MSEFastMath =
+      use4Over6 && tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();

With that, the TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) inside the T=float branch becomes unreachable and can be dropped (or kept defensively).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@csrc/nv_internal/cpp/kernels/quantization.cu` around lines 338 - 362, The
code reads the process-global use4Over6 unconditionally which causes FP32
instantiations to abort; fix by making the env-read T-aware: move or re-evaluate
tensorrt_llm::common::getEnvNVFP4Use4Over6() into the template/lambda scope
where T is visible (the launchKernel capture/instantiation) and force it false
for T=float (e.g. compute auto const use4Over6 =
tensorrt_llm::common::getEnvNVFP4Use4Over6() && !std::is_same_v<T,float> and
pass that as the use4Over6Tag/std::bool_constant), then remove or leave the
now-unreachable TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) in the float branch.
🧹 Nitpick comments (1)
tests/test_helpers/utils_fp4.py (1)

295-302: ⚡ Quick win

Vectorize the per-element MSE accumulation.

The explicit Python loop over block_size=16 is unnecessary work and obscures the intent. A vectorized form is shorter, faster, and (because the reduction order across the last dim is implementation-defined either way) preserves the strict < tiebreak on pick_four.

♻️ Proposed refactor
-    err4 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    err6 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    for i in range(block_size):
-        diff4 = dq4[:, :, i] - x_blocks[:, :, i]
-        diff6 = dq6[:, :, i] - x_blocks[:, :, i]
-        err4 += diff4 * diff4
-        err6 += diff6 * diff6
-    pick_four = err4 < err6
+    err4 = ((dq4 - x_blocks) ** 2).sum(dim=-1)
+    err6 = ((dq6 - x_blocks) ** 2).sum(dim=-1)
+    pick_four = err4 < err6
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_helpers/utils_fp4.py` around lines 295 - 302, The loop computes
per-block MSE by accumulating squared differences across the last dim; replace
the explicit for-loop with a vectorized reduction: compute diff4 = dq4 -
x_blocks and diff6 = dq6 - x_blocks, square them and sum over the last axis to
produce err4 and err6, then set pick_four = err4 < err6 (preserving the strict <
tiebreak). Update variables err4, err6, diff4, diff6 and use the existing dq4,
dq6, x_blocks, and pick_four names so the change is localized to that block.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2680-2687: run_moe_test currently only uses use_4over6 for
skipping but never actually sets the process env, so 4over6 paths may not be
exercised; wrap the quantize/reference/production section inside the
set_nvfp4_4over6_env context by calling set_nvfp4_4over6_env(use_4over6) (and
ensure the helper is imported) before entering the FP4
quantize/reference/production logic in run_moe_test and restore/unset it after
that block so the FLASHINFER_NVFP4_4OVER6 env state is consistently applied only
for those test cases.

In `@tests/moe/test_trtllm_gen_moe_autotune_tactics.py`:
- Around line 160-169: The test never actually enables the 4over6 NVFP4 runtime
flag because set_nvfp4_4over6_env is never applied; update the test harness so
that when _quant_mode_config is called with use_4over6=True the runtime
environment is toggled for the kernel run: call set_nvfp4_4over6_env(True)
before invoking _run_kernel_with_tactic (and set_nvfp4_4over6_env(False) or
restore the previous state after) so the launched kernel uses the 4over6 path;
adjust every place that constructs the use_4over6=True matrix (including the
other occurrences you noted) to wrap the kernel invocation with the env setter
rather than only changing scales.

In `@tests/moe/test_trtllm_gen_routed_fused_moe.py`:
- Around line 82-83: The test toggles use_4over6 but never actually flips the
NVFP4 4over6 environment, so fp4_quantize() and the routed/non-routed MoE kernel
calls still use the global env; fix by wrapping the sections that perform FP4
quantization and invoke the MoE kernels (references: fp4_quantize, the routed
MoE kernel call(s) and the non-routed MoE kernel call(s)) in the
set_nvfp4_4over6_env context when use_4over6 is True (e.g., with
set_nvfp4_4over6_env(): ...) so the env is applied for those operations and is
restored afterward; apply this same wrapping to the other similar test blocks
currently duplicated later in the file.

In `@tests/moe/utils.py`:
- Around line 40-65: The fixture set_nvfp4_4over6_env currently force-sets
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH and
FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH unconditionally; change it so
those two env vars are only set when request.getfixturevalue("use_4over6") is
truthy (i.e., set them inside the branch where use_4over6 is True and leave them
untouched when False), while still recording original_values and restoring them
after yield; keep FLASHINFER_NVFP4_4OVER6 set to "1"/"0" based on use_4over6 as
before.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 07674709-c056-41fd-8bc9-27c3e59e1102

📥 Commits

Reviewing files that changed from the base of the PR and between b36e9a6 and 7d2f214.

📒 Files selected for processing (14)
  • csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • tests/moe/test_trtllm_cutlass_fused_moe.py
  • tests/moe/test_trtllm_gen_fused_moe.py
  • tests/moe/test_trtllm_gen_moe_autotune_tactics.py
  • tests/moe/test_trtllm_gen_per_token_moe.py
  • tests/moe/test_trtllm_gen_routed_fused_moe.py
  • tests/moe/utils.py
  • tests/test_helpers/utils_fp4.py
  • tests/utils/test_fp4_quantize.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh

@IwakuraRein (Collaborator) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !655 has been created, and the CI pipeline #50739016 is currently running. I'll report back once the pipeline job completes.

@zianglih (Contributor, Author) commented May 11, 2026

Some fused MoE APIs may use quantizers within their cubins. Only (1) per-token fused MoE and (2) cutlass fused MoE are correctly supported.

zianglih marked this pull request as draft May 11, 2026 05:24
@zianglih (Contributor, Author) commented:

# flashinfer_trtllm_routed without 4over6
SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE=1 \
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1 \
python -m sglang.launch_server \
  --kv-cache-dtype bf16 \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --disable-piecewise-cuda-graph \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.937
Invalid: 0.001
Latency: 10.785 s
Output throughput: 13958.097 token/s
Accuracy: 0.937
Invalid: 0.001
Latency: 10.849 s
Output throughput: 13875.159 token/s

# flashinfer_trtllm_routed with 4over6 after fix
FLASHINFER_NVFP4_4OVER6=1 \
FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1 \
SGLANG_FLASHINFER_PER_TOKEN_NVFP4_MOE=1 \
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1 \
python -m sglang.launch_server \
  --kv-cache-dtype bf16 \
  --model-path nvidia/Qwen3-30B-A3B-NVFP4 \
  --disable-piecewise-cuda-graph \
  --moe-runner-backend flashinfer_trtllm_routed
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.934
Invalid: 0.001
Latency: 10.575 s
Output throughput: 13768.806 token/s
Accuracy: 0.934
Invalid: 0.001
Latency: 10.628 s
Output throughput: 13700.039 token/s

zianglih marked this pull request as ready for review May 11, 2026 06:39
@coderabbitai (Bot) left a comment

Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ed55cda-2c4c-4507-9a5c-3ddcbb07ab4a

📥 Commits

Reviewing files that changed from the base of the PR and between 7d2f214 and c37ec1f.

📒 Files selected for processing (3)
  • csrc/trtllm_fused_moe_runner.cu
  • tests/moe/test_trtllm_gen_per_token_moe.py
  • tests/moe/utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/moe/utils.py

Comment on lines +116 to 125

    hidden_states_global_scale_inv = nvfp4_global_decode_scale_te(
        torch.ones((), dtype=torch.float32, device=device),
        use_4over6=use_4over6,
    )
    hidden_states, hidden_states_scale, per_token_scale_inv = nvfp4_quantize(
        hidden_states_bf16,
        1.0 / (448.0 * 6.0),
        hidden_states_global_scale_inv,
        sfLayout=SfLayout.layout_linear,
        per_token_activation=True,
    )
@coderabbitai (Bot) commented May 11, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply use_4over6 runtime env to hidden-state quantization as well.

use_4over6=True updates decode-scale selection (Line 116-Line 119), but nvfp4_quantize on Line 120 still runs outside nvfp4_4over6_env(...). That can miss activation-side 4over6 quantizer dispatch and reduce test validity.

Suggested patch
     hidden_states_global_scale_inv = nvfp4_global_decode_scale_te(
         torch.ones((), dtype=torch.float32, device=device),
         use_4over6=use_4over6,
     )
-    hidden_states, hidden_states_scale, per_token_scale_inv = nvfp4_quantize(
-        hidden_states_bf16,
-        hidden_states_global_scale_inv,
-        sfLayout=SfLayout.layout_linear,
-        per_token_activation=True,
-    )
+    with moe_utils.nvfp4_4over6_env(use_4over6):
+        hidden_states, hidden_states_scale, per_token_scale_inv = nvfp4_quantize(
+            hidden_states_bf16,
+            hidden_states_global_scale_inv,
+            sfLayout=SfLayout.layout_linear,
+            per_token_activation=True,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_trtllm_gen_per_token_moe.py` around lines 116 - 125, The nvfp4
activation quantization call is executed outside the 4over6 runtime env so when
use_4over6=True the activation-side quantizer dispatch may not be applied; wrap
the nvfp4_quantize invocation in the same nvfp4_4over6_env(use_4over6) context
used for nvfp4_global_decode_scale_te (or otherwise set the 4over6 runtime for
the scope) so hidden_states, hidden_states_scale, per_token_scale_inv are
produced under nvfp4_4over6_env(use_4over6) using the same inputs
(hidden_states_bf16, hidden_states_global_scale_inv, SfLayout.layout_linear,
per_token_activation).

@zianglih (Contributor, Author) replied:

tests/moe/utils.py defines set_nvfp4_4over6_env as an autouse=True pytest fixture; each commented file imports/binds it into module scope.
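
For context, a sketch of the autouse-fixture pattern being referenced; the actual fixture lives in tests/moe/utils.py and may differ in detail.

import os
import pytest

@pytest.fixture(autouse=True)
def set_nvfp4_4over6_env(request):
    # Derive the mode from the test's use_4over6 parameter when present.
    use_4over6 = (
        request.getfixturevalue("use_4over6")
        if "use_4over6" in request.fixturenames
        else False
    )
    keys = (
        "FLASHINFER_NVFP4_4OVER6",
        "TRTLLM_DISABLE_FP4_QUANT_FAST_MATH",
        "FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH",
    )
    original = {k: os.environ.get(k) for k in keys}
    os.environ["FLASHINFER_NVFP4_4OVER6"] = "1" if use_4over6 else "0"
    if use_4over6:
        os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = "1"
        os.environ["FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH"] = "1"
    yield
    for k, v in original.items():  # restore prior env after the test
        if v is None:
            os.environ.pop(k, None)
        else:
            os.environ[k] = v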

@coderabbitai (Bot) replied:

@zianglih, you're right — I missed that. Since set_nvfp4_4over6_env is an autouse=True fixture that activates whenever the use_4over6 parameter fixture is present, the entire test body (including the hidden_states nvfp4_quantize call) already executes under the correct FLASHINFER_NVFP4_4OVER6 environment derived from use_4over6. The separate with moe_utils.nvfp4_4over6_env(weights_use_4over6) block makes sense only for weight quantization because weights_use_4over6 can differ from use_4over6. My suggestion was incorrect — no change needed here.


✏️ Learnings added
Learnt from: zianglih
Repo: flashinfer-ai/flashinfer PR: 3264
File: tests/moe/test_trtllm_gen_per_token_moe.py:116-125
Timestamp: 2026-05-11T06:57:06.760Z
Learning: In the flashinfer repo (tests/moe/), `set_nvfp4_4over6_env` is an `autouse=True` pytest fixture defined in `tests/moe/utils.py`. When a test file binds it at module scope (e.g., `set_nvfp4_4over6_env = moe_utils.set_nvfp4_4over6_env`), the fixture automatically sets `FLASHINFER_NVFP4_4OVER6`, `TRTLLM_DISABLE_FP4_QUANT_FAST_MATH`, and `FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH` for any test parametrized with `use_4over6`. This means the entire test body already runs under the correct 4over6 environment — no additional context manager wrapping is needed for activation-side quantization.



🧠 Learnings used
Learnt from: yzh119
Repo: flashinfer-ai/flashinfer PR: 2370
File: tests/gdn/conftest.py:25-34
Timestamp: 2026-01-21T21:26:00.701Z
Learning: Tests in the repository assume CUDA is available and do not require torch.cuda.is_available() guards in pytest fixtures. Ensure test files under tests/ follow this convention and avoid adding CPU-only guards in fixtures unless explicitly handling a non-CUDA environment.

@aleozlx (Collaborator) commented May 11, 2026

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !655 has been updated with latest changes, and the CI pipeline #50962827 is currently running. I'll report back once the pipeline job completes.
