update moe_smooth_quant dispatch and v2 supports block_m is a multiple of 16#2333

Merged
valarLip merged 1 commit into main from jun/moe_smoothquant3 on Mar 19, 2026
Conversation

@junhaha666
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@junhaha666 junhaha666 requested review from a team and Copilot March 18, 2026 15:42
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: all of the above

Add labels via the sidebar or gh pr edit 2333 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the MoE smooth-per-token scaled quantization dispatch to better support block_m values that are multiples of 16, and adjusts the Python-side routing logic for small-token MoE stage1 cases.

Changes:

  • Update moe_smooth_per_token_scaled_quant_v2 dispatch math to use a fixed block_split=16 and enforce block_m % 16 == 0.
  • Add an is_balanced flag to moe_smooth_per_token_scaled_quant to control whether stage1 uses the v1 kernel or a smooth_per_token_scaled_quant fallback for small M.
  • Remove outdated commented call examples in fused_moe_bf16_asm.py.
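Taken together, the new dispatch rules can be sketched in Python. This is a hypothetical illustration, not the actual aiter code: the selector name, the small-M cutoff value, and the string return convention are all assumptions; only the block_m % 16 constraint and the is_balanced fallback mirror what the PR describes.

```python
BLOCK_SPLIT = 16  # v2 now always splits blocks into 16-row sub-blocks

def select_quant_kernel(m, block_m, is_balanced=True, small_m_threshold=256):
    """Hypothetical sketch of the dispatch described in this PR.

    The function name, small_m_threshold value, and return strings are
    illustrative assumptions. The block_m % 16 check and the
    is_balanced-controlled fallback follow the PR description.
    """
    if block_m % BLOCK_SPLIT != 0:
        raise ValueError(f"block_m must be a multiple of {BLOCK_SPLIT}, got {block_m}")
    if m < small_m_threshold:
        # Small-M stage1: is_balanced keeps the v1 kernel; otherwise
        # fall back to plain smooth_per_token_scaled_quant.
        return "v1" if is_balanced else "smooth_per_token_scaled_quant"
    # Large M: the v2 kernel splits each block into block_m // 16 sub-blocks.
    return f"v2 ({block_m // BLOCK_SPLIT} sub-blocks per block)"
```

Note that block_m = 48 is now legal under this scheme (3 sub-blocks of 16 rows), whereas a power-of-two requirement would have rejected it.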

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • csrc/kernels/quant_kernels.cu: Enforces block_m divisibility by 16 and changes v2 block splitting/indexing to be consistent for non-power-of-2 block_m (as long as it is divisible by 16).
  • aiter/ops/quant.py: Adds the is_balanced flag and changes stage1 small-M dispatch behavior (v1 vs smooth-quant fallback).
  • aiter/fused_moe_bf16_asm.py: Removes commented-out alternative quant call paths around asm_moe.
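The grid math implied by the quant_kernels.cu change can be illustrated as follows. This is a minimal sketch under the assumption of a fixed 16-row split; the helper name and return shape are hypothetical, not taken from the kernel source.

```python
def v2_tiling(m, block_m, block_split=16):
    """Illustrative v2 launch-shape math (helper name is hypothetical).

    With a fixed block_split of 16, the sub-block count is a plain
    integer division, so block_m only needs to be a multiple of 16
    rather than a power of two.
    """
    if block_m % block_split != 0:
        raise ValueError("block_m must be a multiple of 16")
    num_blocks = (m + block_m - 1) // block_m      # ceil-div over rows
    subblocks_per_block = block_m // block_split   # e.g. block_m=48 -> 3
    return num_blocks, subblocks_per_block
```

Using division and modulo instead of shift/mask indexing is what makes the non-power-of-2 block_m cases consistent with the power-of-2 ones.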

@valarLip valarLip merged commit 3fb9c71 into main Mar 19, 2026
29 checks passed
@valarLip valarLip deleted the jun/moe_smoothquant3 branch March 19, 2026 06:20

3 participants