perf(autotuner): replace power-of-2 token buckets with hybrid spacing & fix missing routing_replay_out arg #3115
Conversation
📝 Walkthrough

Replaces power-of-2 token bucketing with a new four-phase hybrid bucketing scheme across MoE and GEMM autotuning callsites; threads an optional `routing_replay_out` argument through the MoE forward path down to the `trtllm_fp8_per_tensor_scale_moe` kernel call.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client/Caller
    participant Tuner as Autotuner/Tuner
    participant Utils as Bucketing Utils
    participant Runner as MoE Runner
    participant Kernel as trtllm_fp8_per_tensor_scale_moe_op
    Client->>Tuner: request tuning / forward(input with num_tokens)
    Tuner->>Utils: get_hybrid_num_tokens_buckets(max_tokens)
    Tuner->>Utils: map_to_hybrid_bucket(num_tokens, max_tokens)
    Tuner->>Runner: select tactic / provide mapped bucket
    Client->>Runner: forward(..., routing_replay_out=?)
    Runner->>Kernel: call trtllm_fp8_per_tensor_scale_moe_op(..., routing_replay_out)
    Kernel-->>Runner: result
    Runner-->>Client: output
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request replaces the power-of-2 token bucket generation logic with a hybrid approach across several modules to improve autotuning for MoE workloads. The new logic uses four phases with varying spacing, including power-of-2 and linear steps. Additionally, a routing_replay_out parameter is added to the MoE forward functions. A logic inconsistency was identified in get_hybrid_num_tokens_buckets where the generated buckets may not align with the mapping function when min_num_tokens is greater than one, which could lead to autotuner failures.
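For orientation, this is roughly the bucket list the four-phase scheme yields for a cap of 8192 tokens (hand-derived from the phase boundaries discussed in this PR, not produced by running the library):

```python
# Illustrative hybrid bucket list for min_num_tokens=1, max_num_tokens=8192,
# hand-derived from the phase boundaries (_PHASE1_END=256, _PHASE2_END=2048,
# _PHASE3_END=4096) described in this PR.
hybrid_buckets_8192 = [
    1, 2, 4, 8, 16, 32, 64, 128, 256,        # Phase 1: power-of-2 up to 256
    512, 768, 1024, 1280, 1536, 1792, 2048,  # Phase 2: linear, step 256
    2560, 3072, 3584, 4096,                  # Phase 3: linear, step 512
    8192,                                    # Phase 4: power-of-2 beyond 4096
]
```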
```python
m = max(min_num_tokens, 1)
while m <= min(max_num_tokens, _PHASE1_END):
    buckets.append(m)
    m *= 2

# Phase 2: linear step 256 in (_PHASE1_END, _PHASE2_END]
m = _PHASE1_END + _PHASE2_STEP
while m <= min(max_num_tokens, _PHASE2_END):
    buckets.append(m)
    m += _PHASE2_STEP

# Phase 3: linear step 512 in (_PHASE2_END, _PHASE3_END]
m = _PHASE2_END + _PHASE3_STEP
while m <= min(max_num_tokens, _PHASE3_END):
    buckets.append(m)
    m += _PHASE3_STEP

# Phase 4: power-of-2 beyond _PHASE3_END
m = _PHASE3_END * 2
while m <= max_num_tokens:
    buckets.append(m)
    m *= 2

if not buckets or buckets[-1] != max_num_tokens:
    buckets.append(max_num_tokens)

return tuple(sorted(set(buckets)))
```
The implementation of get_hybrid_num_tokens_buckets has a critical inconsistency with map_to_hybrid_bucket when min_num_tokens > 1.
- **Phase 1 mismatch:** If `min_num_tokens` is not a power of 2 (e.g. 10), Phase 1 currently generates buckets starting from that value (e.g. `[10, 20, 40, ...]`). However, `map_to_hybrid_bucket` uses `next_positive_power_of_2(x)` for Phase 1, so an input of size 10 maps to bucket 16. Since 16 is not in the generated list, the autotuner will fail to find a tuned tactic for this size.
- **Phase 2-4 filtering:** The loops for the subsequent phases use fixed starting points (e.g. `_PHASE1_END + _PHASE2_STEP`), which results in buckets smaller than `min_num_tokens` being added to the list when `min_num_tokens` is large.
The robust fix is to always generate the full set of potential buckets starting from 1 (to ensure consistency with the mapping logic) and then filter the final result to keep only those within the [min_num_tokens, max_num_tokens] range.
```python
buckets: List[int] = []
# Phase 1: power-of-2 up to _PHASE1_END
m = 1
while m <= min(max_num_tokens, _PHASE1_END):
    buckets.append(m)
    m *= 2
# Phase 2: linear step 256 in (_PHASE1_END, _PHASE2_END]
m = _PHASE1_END + _PHASE2_STEP
while m <= min(max_num_tokens, _PHASE2_END):
    buckets.append(m)
    m += _PHASE2_STEP
# Phase 3: linear step 512 in (_PHASE2_END, _PHASE3_END]
m = _PHASE2_END + _PHASE3_STEP
while m <= min(max_num_tokens, _PHASE3_END):
    buckets.append(m)
    m += _PHASE3_STEP
# Phase 4: power-of-2 beyond _PHASE3_END
m = _PHASE3_END * 2
while m <= max_num_tokens:
    buckets.append(m)
    m *= 2
if not buckets or buckets[-1] != max_num_tokens:
    buckets.append(max_num_tokens)
return tuple(sorted(set(b for b in buckets if b >= min_num_tokens and b <= max_num_tokens)))
```
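A hand-traced example of how the proposed change restores agreement between generation and mapping (values are illustrative, derived from the snippets above rather than executed against the library):

```python
# min_num_tokens=10, max_num_tokens=8192 (hand-traced, illustrative).

# Current code: Phase 1 starts at min_num_tokens itself...
original_phase1 = [10, 20, 40, 80, 160]        # loop stops once 320 > _PHASE1_END
# ...while the mapper rounds 10 up to a power of 2:
mapped = 16                                    # next_positive_power_of_2(10)
assert mapped not in original_phase1           # autotuner finds no tactic for size 10

# Proposed code: generate Phase 1 from 1, then filter to [min, max]:
fixed_phase1 = [b for b in [1, 2, 4, 8, 16, 32, 64, 128, 256] if b >= 10]
assert mapped in fixed_phase1                  # mapper and bucket list now agree
```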
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/fused_moe/utils.py`:
- Around line 217-223: The docstring in fused_moe/utils.py (around the function
that describes the four phases) contains Unicode multiplication characters "×"
which trigger Ruff; replace those with the ASCII letter "x" (e.g., change "step
×2" to "step x2") so the docstring uses plain ASCII and pre-commit passes;
update all occurrences in that docstring text accordingly.
- Around line 224-253: get_hybrid_num_tokens_buckets is not honoring
min_num_tokens across phases: phase1 starts at min_num_tokens without rounding
up to the next power-of-2, and phases 2/3 always start at fixed boundaries
(e.g., _PHASE1_END+_PHASE2_STEP) which can emit values below min_num_tokens. Fix
by computing phase starts relative to min_num_tokens: for phase1 set m to the
smallest power-of-2 >= min_num_tokens (use bit math or loop) and then multiply
by 2; for phase2 set m to the smallest value >= min_num_tokens and >=
(_PHASE1_END+_PHASE2_STEP) that aligns to the _PHASE2_STEP grid (ceil to next
multiple of _PHASE2_STEP); for phase3 do the same alignment with _PHASE3_STEP
and _PHASE2_END; and for phase4 start at max(min_num_tokens, _PHASE3_END*2) then
multiply by 2; ensure every appended bucket >= min_num_tokens and <=
max_num_tokens and keep the final sorting/unique logic intact (variables:
get_hybrid_num_tokens_buckets, min_num_tokens, max_num_tokens, _PHASE1_END,
_PHASE2_STEP, _PHASE2_END, _PHASE3_STEP, _PHASE3_END).
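For reference, a minimal sketch of the alignment helpers this prompt calls for (helper names are illustrative, not from the codebase; the constants are the `_PHASE*` values named in the prompt):

```python
def _next_pow2_at_least(n: int) -> int:
    """Smallest power of 2 that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def _ceil_to_multiple(n: int, step: int) -> int:
    """Smallest multiple of step that is >= n."""
    return ((n + step - 1) // step) * step


# Phase starts computed relative to min_num_tokens, as the prompt suggests:
#   phase1_start = _next_pow2_at_least(max(min_num_tokens, 1))
#   phase2_start = max(_PHASE1_END + _PHASE2_STEP, _ceil_to_multiple(min_num_tokens, _PHASE2_STEP))
#   phase3_start = max(_PHASE2_END + _PHASE3_STEP, _ceil_to_multiple(min_num_tokens, _PHASE3_STEP))
#   phase4_start = max(_PHASE3_END * 2, min_num_tokens)
```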
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f83d9649-3a2d-43ac-98e8-a7f2b490a9f6
📥 Commits
Reviewing files that changed from the base of the PR and between 7d0f68e and 5027cbff0ca18416abce41dd16c53a30e4cf9d1e.
📒 Files selected for processing (5)
- flashinfer/fused_moe/core.py
- flashinfer/fused_moe/cute_dsl/tuner.py
- flashinfer/fused_moe/utils.py
- flashinfer/gemm/gemm_base.py
- flashinfer/trtllm_low_latency_gemm.py
```python
This function uses four phases with progressively coarser spacing::

    Phase 1: [min .. 256] — power-of-2 (step ×2)
    Phase 2: (256 .. 2048] — linear step 256
    Phase 3: (2048 .. 4096] — linear step 512
    Phase 4: (4096 .. max] — power-of-2 (step ×2)
"""
```
Replace ambiguous multiplication signs in the docstring.
Ruff flags the Unicode × characters here; use plain x to keep pre-commit clean.
Proposed fix

```diff
-    Phase 1: [min .. 256] — power-of-2 (step ×2)
+    Phase 1: [min .. 256] — power-of-2 (step x2)
     Phase 2: (256 .. 2048] — linear step 256
     Phase 3: (2048 .. 4096] — linear step 512
-    Phase 4: (4096 .. max] — power-of-2 (step ×2)
+    Phase 4: (4096 .. max] — power-of-2 (step x2)
```

📝 Committable suggestion
```python
This function uses four phases with progressively coarser spacing::

    Phase 1: [min .. 256] — power-of-2 (step x2)
    Phase 2: (256 .. 2048] — linear step 256
    Phase 3: (2048 .. 4096] — linear step 512
    Phase 4: (4096 .. max] — power-of-2 (step x2)
"""
```
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 219-219: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF002)
[warning] 222-222: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/fused_moe/utils.py` around lines 217 - 223, The docstring in
fused_moe/utils.py (around the function that describes the four phases) contains
Unicode multiplication characters "×" which trigger Ruff; replace those with the
ASCII letter "x" (e.g., change "step ×2" to "step x2") so the docstring uses
plain ASCII and pre-commit passes; update all occurrences in that docstring text
accordingly.
/bot run
samuellees
left a comment
There was a problem hiding this comment.
LGTM, waiting for the CI pass
/bot run

/bot run
[FAILED] Pipeline #49156002: 1/20 passed
Hi @StudyingShao, could you please:

Thx!
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Head branch was pushed to by a user without write access
ee52bd9 to d2de788 (compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
flashinfer/fused_moe/utils.py (1)
236-239: ⚠️ Potential issue | 🟡 Minor

Replace Unicode `×` with ASCII `x` in docstring (RUF002). Ruff 0.15.10 still flags lines 236 and 239; swap `×` for `x` to keep pre-commit clean.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/utils.py` around lines 236 - 239, Replace the Unicode multiplication symbol '×' with ASCII 'x' in the docstring block that lists "Phase 1" through "Phase 4" (the lines showing steps like "step ×2") in fused_moe.utils so the text reads "step x2" (and similarly for any other occurrences), commit the change to satisfy the RUF002 warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/fused_moe/utils.py`:
- Around line 273-289: The branch in map_to_hybrid_bucket currently returns
next_positive_power_of_2(x) when x <= _PHASE1_END, which can exceed
max_num_tokens; change that branch to clamp the result (return
min(next_positive_power_of_2(x), max_num_tokens)) so the function always honors
the [1, max_num_tokens] contract referenced in the docstring (update the branch
handling x <= _PHASE1_END in map_to_hybrid_bucket to use min(...,
max_num_tokens) and keep other branches unchanged).
---
Duplicate comments:
In `@flashinfer/fused_moe/utils.py`:
- Around line 236-239: Replace the Unicode multiplication symbol '×' with ASCII
'x' in the docstring block that lists "Phase 1" through "Phase 4" (the lines
showing steps like "step ×2") in fused_moe.utils so the text reads "step x2"
(and similarly for any other occurrences), commit the change to satisfy the
RUF002 warning.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dc7330a7-f65b-476d-98ef-c2db6d3f60ee
📥 Commits
Reviewing files that changed from the base of the PR and between ee52bd937b2dd434e797ae72d3d47ec89b37d387 and d2de788.
📒 Files selected for processing (5)
- flashinfer/fused_moe/core.py
- flashinfer/fused_moe/cute_dsl/tuner.py
- flashinfer/fused_moe/utils.py
- flashinfer/gemm/gemm_base.py
- flashinfer/trtllm_low_latency_gemm.py
🚧 Files skipped from review as they are similar to previous changes (3)
- flashinfer/trtllm_low_latency_gemm.py
- flashinfer/gemm/gemm_base.py
- flashinfer/fused_moe/core.py
```python
def map_to_hybrid_bucket(x: int, max_num_tokens: int) -> int:
    """Map an arbitrary num_tokens to the nearest hybrid bucket (rounding up).

    Mirrors the four-phase spacing of :func:`get_hybrid_num_tokens_buckets`.
    The result is clamped to ``[1, max_num_tokens]``.
    """
    if x <= 0:
        return 1
    if x >= max_num_tokens:
        return max_num_tokens
    if x <= _PHASE1_END:
        return next_positive_power_of_2(x)
    if x <= _PHASE2_END:
        return min(_ceil_to_step(x, _PHASE2_STEP), max_num_tokens)
    if x <= _PHASE3_END:
        return min(_ceil_to_step(x, _PHASE3_STEP), max_num_tokens)
    return min(next_positive_power_of_2(x), max_num_tokens)
```
Edge case: map_to_hybrid_bucket can exceed max_num_tokens when max_num_tokens < 256.
For x in (0, max_num_tokens) with max_num_tokens <= _PHASE1_END, the branch at line 283-284 returns next_positive_power_of_2(x) without the max_num_tokens clamp, which can exceed the stated [1, max_num_tokens] contract. Example: map_to_hybrid_bucket(70, 100) returns 128.
All current callsites in this PR pass 8192, so this is not actively exploited — but the docstring guarantees clamping unconditionally, and the returned value won't exist in get_hybrid_num_tokens_buckets(100)'s output, which could cause silent autotuner profile mismatches if someone adopts the API with a small cap in the future.
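A small hand-worked illustration of the gap (values traced from the quoted code above, not executed against the library):

```python
# Hypothetical small cap: max_num_tokens = 100 (< _PHASE1_END = 256), traced by hand.
buckets_100 = (1, 2, 4, 8, 16, 32, 64, 100)   # what get_hybrid_num_tokens_buckets(100) would emit
mapped = 128                                   # next_positive_power_of_2(70) from the Phase-1 branch
assert mapped not in buckets_100               # 128 > 100: no tuned profile exists for this bucket
```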
🛡️ Proposed fix

```diff
 if x <= _PHASE1_END:
-    return next_positive_power_of_2(x)
+    return min(next_positive_power_of_2(x), max_num_tokens)
 if x <= _PHASE2_END:
     return min(_ceil_to_step(x, _PHASE2_STEP), max_num_tokens)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/fused_moe/utils.py` around lines 273 - 289, The branch in
map_to_hybrid_bucket currently returns next_positive_power_of_2(x) when x <=
_PHASE1_END, which can exceed max_num_tokens; change that branch to clamp the
result (return min(next_positive_power_of_2(x), max_num_tokens)) so the function
always honors the [1, max_num_tokens] contract referenced in the docstring
(update the branch handling x <= _PHASE1_END in map_to_hybrid_bucket to use
min(..., max_num_tokens) and keep other branches unchanged).
/bot run

/bot run

/bot run

/bot run
…nfiguration
Adds six no-GPU pytest cases at
`tests/moe/test_cute_dsl_fused_moe.py::TestAutotunerBucketConfig`
guarding the autotuner bucket-cap fix and locking in the load-bearing
behavioral parity with TRT-LLM's pattern at
`cute_dsl_custom_ops.py:2390-2391` and `2700-2703`.
Three "no hardcoded cap" regression guards (the load-bearing
property of the fix):
1. `test_gen_tuning_buckets_is_callable_not_static_tuple` — pins
`gen_tuning_buckets` on the runner's `tuning_config` to be a bare
callable, not a pre-computed tuple.
2. `test_gen_tuning_buckets_no_hardcoded_8192_cap` — verifies that
calling the configured `gen_tuning_buckets` with input dims 8192,
16384, and 32768 produces bucket sets whose maximum reflects the
input value.
3. `test_map_to_tuning_buckets_above_8192_not_capped` — verifies
that `map_to_tuning_buckets(x)` for x ∈ {16384, 32768, 65536}
doesn't cap at 8192. Ensures we use `map_to_hybrid_bucket_uncapped`
instead of `lambda x: map_to_hybrid_bucket(x, 8192)`.
Three TRT-LLM-parity regression guards (lock in the
behavioral-equivalence-where-achievable):
4. `test_map_to_tuning_buckets_phase1_matches_trtllm_at_powers_of_2` —
pins fi/trt-llm parity at power-of-2 inputs ≤ 256 (hybrid Phase 1,
where pure power-of-2 spacing is preserved). At these inputs,
fi's `map_to_tuning_buckets(x)` must equal x and equal
`last_positive_power_of_2(x)` (TRT-LLM's pattern).
5. `test_map_to_tuning_buckets_is_monotonic` — pins monotonic
non-decreasing behavior across hybrid Phases 1-4. TRT-LLM's
`last_positive_power_of_2` and fi's `map_to_hybrid_bucket_uncapped`
both satisfy this; catches a regression that would introduce
non-monotonic mapping.
6. `test_gen_tuning_buckets_covers_trtllm_power_of_2_points` — pins
that fi's hybrid bucket set is a SUPERSET of TRT-LLM's power-of-2
bucket set at every max_n tested. The hybrid scheme intentionally
adds intermediate linear-step buckets in Phase 2/3 (per PR flashinfer-ai#3115's
perf rationale) but must preserve the coarse-grained power-of-2
coverage TRT-LLM has.
These six tests together enforce: (a) no hardcoded cap, (b) callable
form, (c) TRT-LLM-equivalence at power-of-2 probe points, (d)
monotonicity, (e) coarse-grained coverage parity with TRT-LLM. The
hybrid-vs-power-of-2 deviation in Phase 2/3/4 is intentional and
documented (PR flashinfer-ai#3115); the tests don't enforce parity in those phases
because that would regress fi's deliberate perf optimization.
All tests are pure-Python and run without a GPU. They construct a
`CuteDslFusedMoENvfp4Runner` with a no-op `forward_impl` to inspect
its `tuning_config`; no GPU, no CuteDSL kernel binaries, no autotune
side effects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
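For context, a minimal sketch of what guard (5) above might look like (the import path and single-argument signature of `map_to_hybrid_bucket_uncapped` are assumed from this commit message; the real test lives in `tests/moe/test_cute_dsl_fused_moe.py`):

```python
# Illustrative sketch only; not the actual test file.
from flashinfer.fused_moe.utils import map_to_hybrid_bucket_uncapped  # assumed import path


def test_map_to_tuning_buckets_is_monotonic():
    # Guard 5: mapping must be non-decreasing across hybrid Phases 1-4.
    prev = 0
    for x in range(1, 8193):
        cur = map_to_hybrid_bucket_uncapped(x)  # assumed single-arg signature
        assert cur >= prev
        prev = cur
```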
Address gemini-code-assist review on PR flashinfer-ai#3216: the test was importing `get_last_power_of_2_num_tokens_buckets` from `flashinfer.fused_moe.utils`, but PR flashinfer-ai#3115 (merged 2026-04-24) removed that function in favor of the hybrid bucket scheme. The import would have caused an ImportError when the test was collected. Replace the call with an equivalent inline construction that mirrors TRT-LLM's `get_last_power_of_2_num_tokens_buckets` (in `tensorrt_llm/_torch/utils.py:291`): powers of 2 from 1 up to `last_positive_power_of_2(max_n)`. `last_positive_power_of_2` is still available in `flashinfer.fused_moe.utils`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…me input (#3216)

## 📌 Description

The autotuner's `DynamicTensorSpec` in `flashinfer/fused_moe/cute_dsl/tuner.py` declared `gen_tuning_buckets` as the pre-computed tuple `get_hybrid_num_tokens_buckets(8192)` and `map_to_tuning_buckets` as `lambda x: map_to_hybrid_bucket(x, 8192)`. The hardcoded 8192 cap silently clamped any runtime workload larger than that to the 8192-bucket's cached tactic — at DeepSeek-V3 prefill (N=16384) fi profiled at half the per-expert workload and used a tactic optimized for the wrong shape.

This PR replaces the pre-computed tuple with the bare callable form (`get_hybrid_num_tokens_buckets`) and switches the mapper to the uncapped variant `map_to_hybrid_bucket_uncapped` (added alongside the hybrid-bucket scheme for exactly this case). The autotuner now invokes them with the actual input dim at autotune time, matching TRT-LLM's pattern at `cute_dsl_custom_ops.py:2390-2391` and flashinfer's own pattern at `gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG`.

## 🔍 Related Issues

#3171 #3198 #3115

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Bug Fixes**
  * MoE autotuner now uses uncapped dynamic hybrid bucket mapping instead of a fixed-bounded set, improving adaptation to varying input token sizes.
* **Tests**
  * Added offline tests validating autotuner bucket configuration: dynamic bucket generation, responsiveness to input size, monotonic mapping behavior, large-input scaling, and alignment with expected power-of-2 bucket values.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
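In code, the mapper rewiring described above amounts to roughly the following (a sketch; `map_to_hybrid_bucket_uncapped` is assumed to take only the token count, and the expected output of the new mapper is an assumption rather than observed behavior):

```python
from flashinfer.fused_moe.utils import map_to_hybrid_bucket, map_to_hybrid_bucket_uncapped


def old_map(x: int) -> int:
    # Previous wiring: hardcoded 8192 cap baked into the tuning config.
    return map_to_hybrid_bucket(x, 8192)


new_map = map_to_hybrid_bucket_uncapped  # new wiring: follows the actual input dim

print(old_map(16384))  # 8192 (clamped), so the cached tactic targets half the workload
print(new_map(16384))  # expected to track the real size (a bucket >= 16384)
```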
📌 Description
This PR includes two improvements:
1. **perf(autotuner): Replace power-of-2 token buckets with hybrid spacing** — Pure power-of-2 spacing creates huge gaps at large values (e.g. a jump from 1024 to 2048), forcing the autotuner to pick a kernel optimised for a very different workload size. The new hybrid scheme uses four phases with progressively coarser spacing:
   - `[min .. 256]` — power-of-2 (step ×2)
   - `(256 .. 2048]` — linear step 256
   - `(2048 .. 4096]` — linear step 512
   - `(4096 .. max]` — power-of-2 (step ×2)

   All callsites in MoE, GEMM, and low-latency GEMM autotuners are updated to use the new `get_hybrid_num_tokens_buckets` / `map_to_hybrid_bucket` API.

2. **fix: Pass missing `routing_replay_out` arg to `trtllm_fp8_per_tensor_scale_moe`** — Two call sites in `fused_moe/core.py` were missing the `routing_replay_out` argument, causing it to be silently dropped.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes
Changed files:
- `flashinfer/fused_moe/utils.py` — Core implementation: new `get_hybrid_num_tokens_buckets`, `map_to_hybrid_bucket`, `map_to_hybrid_bucket_uncapped`; removed old `get_last_power_of_2_num_tokens_buckets`
- `flashinfer/fused_moe/core.py` — Updated all MoE autotuner callsites + added missing `routing_replay_out` arg
- `flashinfer/fused_moe/cute_dsl/tuner.py` — Updated CuTe DSL FP4 MoE tuner callsite
- `flashinfer/gemm/gemm_base.py` — Updated GEMM (FP8, BF16, FP4, MXFP8, TGV) autotuner configs
- `flashinfer/trtllm_low_latency_gemm.py` — Updated low-latency GEMM autotuner config

Summary by CodeRabbit
Improvements
New Features