<!-- .github/pull_request_template.md -->
## 📌 Description
This refreshes `benchmarks/bench_moe_deepseek.py` so it runs cleanly on
current main. It rebases Yunzhe Qiu's bench rewrite (`5677a080` from
#2886, which restructures the bench so autotune runs inside the
`bench_gpu_time` measurement region) onto post-#3252 main, plus a small
follow-up that fixes a stale `RoutingMethodType` import.
Two commits from the original #2886 are intentionally dropped because
their fixes have since landed independently. `c0b80b64`'s `num_tokens <=
max_num_tokens` prealloc guard is now subsumed by #3252 — the
`use_prealloc` predicate in `cute_dsl/fused_moe.py` already includes
that check. And `f3beb602`'s `_force_autotune_off()` bench-side
workaround for CUPTI measurement pollution is no longer needed: #3126
moved the cache lookup ahead of `_prepare_input_tensors` synthesis in
the autotuner's tuning-mode loop, eliminating the pollution at the
source. The only remaining mismatch with current main was
`RoutingMethodType`, which moved from `flashinfer.fused_moe.core` to
`flashinfer.tllm_enums` (and is re-exported via `flashinfer.fused_moe`)
— fixed in the second commit here.
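As a hedged simplification of why the first dropped commit is redundant: the predicate described above already folds in the token-count bound, so a bench-side guard adds nothing. This is a hypothetical standalone sketch, not the actual predicate in `cute_dsl/fused_moe.py`.

```python
def use_prealloc(num_tokens: int, max_num_tokens: int,
                 enable_prealloc: bool = True) -> bool:
    # The num_tokens <= max_num_tokens guard that c0b80b64 added on the
    # bench side is part of the predicate itself after #3252, so callers
    # need no separate check before opting into preallocation.
    return enable_prealloc and num_tokens <= max_num_tokens
```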
Verified on B200 inside `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc14`:
DeepSeek-V3 at bs=128 ep=8 measures CuteDSL=0.147 ms / TRTLLM=0.144 ms —
in the clean band that matches prior post-pollution-fix measurements
(~0.157 / ~0.142). An 18-cell matrix (N=1, 8, 128, 512, 2048, 16384 ×
EP=1, 8, 16) and an 8-cell gen-phase decode sweep also ran without
errors. Closes #2886.
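For reference, the 18-cell matrix quoted above is just the cross product of the token counts and expert-parallel sizes listed; a small sketch makes the arithmetic explicit (the variable names here are illustrative, not from the bench script):

```python
from itertools import product

Ns = [1, 8, 128, 512, 2048, 16384]  # token counts swept
EPs = [1, 8, 16]                    # expert-parallel sizes swept
cells = list(product(Ns, EPs))      # 6 x 3 = 18 configurations
```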
## 🔍 Related Issues
#2886, #3126, #3252
## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).
## Reviewer Notes
<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **Refactor**
  * Improved MoE throughput benchmarking methodology with enhanced
    pre-warm invocation and synchronization for accurate timing capture
  * Refactored autotuning strategy to occur inline during benchmark
    warmup phase
  * Reorganized benchmark output display for clearer result presentation
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Yunzhe Qiu <yunzheq@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>