
[Bugfix] Fix MoE flatten_tp_size unconditionally including dp_size#36240

Open
AjAnubolu wants to merge 4 commits into vllm-project:main from AjAnubolu:fix/moe-flatten-tp-36222

Conversation

@AjAnubolu
Contributor

DP was folded into TP for non-EP MoE, causing incorrect partition sizes (e.g. 64 instead of 128 with TP=4, DP=2). Only PCP should be flattened into TP; DP ranks should stay replicated.

Closes #36222
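
For reference, a minimal sketch of the flattening arithmetic (the function below is illustrative; it is not the exact vLLM signature):

```python
# Illustrative sketch, not the exact vLLM code: the function flattens
# whatever dimensions the caller selects into one TP-like world size.
def flatten_tp_size(tp_size: int, dp_size: int, pcp_size: int) -> int:
    return tp_size * dp_size * pcp_size

# Before the fix, non-EP MoE passed the real dp_size:
assert flatten_tp_size(4, 2, 1) == 8    # 512 // 8 = 64, too thin

# After the fix, non-EP MoE passes dp_size=1 (DP stays replicated):
assert flatten_tp_size(4, 1, 1) == 4    # 512 // 4 = 128, as intended
```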

@mergify (bot) added the bug (Something isn't working) label on Mar 6, 2026
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a bug in the Mixture of Experts (MoE) layer configuration where the data parallelism (DP) size was incorrectly flattened into the tensor parallelism (TP) size for non-expert-parallel (non-EP) MoE setups. The fix correctly separates the configuration logic for EP and non-EP cases. For non-EP MoE, it now ensures that DP ranks are treated as holding replicated weights and are not folded into the TP dimension, which aligns with the intended design. The logic for EP MoE, where DP is correctly flattened into the expert dimension, is preserved and moved into its respective code block for clarity. The documentation has also been updated to reflect this corrected behavior with clear examples. The changes appear to be correct and effectively resolve the issue.

Split flatten_tp_across_dp_and_pcp into separate EP and non-EP paths so
DP ranks hold replicated MoE weights instead of being folded into TP.

Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
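
A simplified sketch of what the split might look like; the real FusedMoEParallelConfig.make() has more fields, and the rank layout shown here is an assumption:

```python
from dataclasses import dataclass

@dataclass
class FlatTP:
    size: int
    rank: int

def flatten(tp_size: int, tp_rank: int, dp_size: int, dp_rank: int,
            pcp_size: int, pcp_rank: int, use_ep: bool) -> FlatTP:
    if not use_ep:
        # Non-EP path: neutral DP values keep DP as a replica
        # dimension; only PCP is folded into TP.
        dp_size, dp_rank = 1, 0
    # EP path keeps the real DP size/rank so DP participates in the
    # EP rank space. The rank ordering below is illustrative.
    size = dp_size * pcp_size * tp_size
    rank = (dp_rank * pcp_size + pcp_rank) * tp_size + tp_rank
    return FlatTP(size, rank)

assert flatten(4, 0, 2, 1, 1, 0, use_ep=False).size == 4  # DP replicated
assert flatten(4, 0, 2, 1, 1, 0, use_ep=True).size == 8   # DP joins EP space
```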
@AjAnubolu force-pushed the fix/moe-flatten-tp-36222 branch from fad2e99 to ba22024 on March 13, 2026 03:08
@aoshen02
Collaborator

Thanks for the contribution; @andakai will take a look.

@andakai

andakai commented May 14, 2026

Thanks for the fix. I checked and it fixes #36222. Now with enable_expert_parallel=False, TP=4, DP=2, PCP=1, the config returns tp_size=4 instead of tp_size=8. This preserves DP as a replica dimension for non-EP MoE, so an expert intermediate_size=512 is partitioned as 512 / 4 = 128 rather than 512 / 8 = 64.
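
Writing the check out as arithmetic:

```python
intermediate_size = 512
tp, dp, pcp = 4, 2, 1

# Before: DP folded into the flattened TP size (tp_size=8).
assert intermediate_size // (tp * dp * pcp) == 64

# After: DP is a replica dimension for non-EP MoE (tp_size=4).
assert intermediate_size // (tp * pcp) == 128
```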

There is another file, gpt_oss.py, with a few custom MoE weight-loading paths that call
FusedMoEParallelConfig.flatten_tp_across_dp_and_pcp():

  • vllm/model_executor/models/gpt_oss.py:_load_weights_mxfp4
  • vllm/model_executor/models/gpt_oss.py:_load_weights_quark
  • vllm/model_executor/models/gpt_oss.py:_load_weights_other

Comment on lines 984 to 985

Now the comment inside flatten_tp_across_dp_and_pcp() may be a bit misleading:

In the non-EP path, the function is intentionally called with neutral DP values (dp_size=1, dp_rank=0) so that DP is not included in the flattened TP size. Could we reword this comment to make the EP vs non-EP behavior explicit? For example:

# Flatten the DP/PCP/TP dimensions selected by the caller into a
# single TP-like rank space. In non-EP mode callers pass dp_size=1
# and dp_rank=0 so DP remains a replica dimension; in EP mode callers
# pass the real DP size/rank so DP participates in the EP rank space.

I think this would make it clearer.

Clarify the EP vs non-EP behavior: in non-EP mode callers pass
dp_size=1, dp_rank=0 so DP stays a replica dimension; in EP mode
they pass real DP size/rank so DP joins the EP rank space.

Co-authored-by: Claude
Co-authored-by: andakai
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
@AjAnubolu requested a review from zyongye as a code owner on May 15, 2026 21:54
The three weight-loading paths in gpt_oss.py (_load_weights_mxfp4,
_load_weights_quark, _load_weights_other) call
FusedMoEParallelConfig.flatten_tp_across_dp_and_pcp() with the real
dp_size / dp_rank regardless of whether expert parallelism is enabled.
In non-EP mode this folds DP into the TP-like rank space, slicing each
expert's weights too thinly (e.g. TP=4, DP=2 gave intermediate_size//8
instead of intermediate_size//4).

Match the canonical pattern in FusedMoEParallelConfig.make(): in non-EP
mode pass dp_size=1, dp_rank=0 so DP stays a replica dimension. EP mode
behavior is unchanged.

Reported-by: andakai
Co-authored-by: Claude
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
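
A sketch of the call-site pattern this commit describes; the stand-in below mimics flatten_tp_across_dp_and_pcp() because its real argument list isn't shown in this thread, so treat the signature as an assumption:

```python
# Stand-in for FusedMoEParallelConfig.flatten_tp_across_dp_and_pcp();
# the real method's signature is not reproduced here.
def flatten_tp_across_dp_and_pcp(tp_size, tp_rank, dp_size, dp_rank,
                                 pcp_size, pcp_rank):
    size = dp_size * pcp_size * tp_size
    rank = (dp_rank * pcp_size + pcp_rank) * tp_size + tp_rank
    return size, rank

def loader_tp_view(tp_size, tp_rank, dp_size, dp_rank,
                   pcp_size, pcp_rank, use_ep):
    # The pattern applied to each gpt_oss.py loader: in non-EP mode,
    # pass neutral DP values so DP stays a replica dimension,
    # matching FusedMoEParallelConfig.make().
    if not use_ep:
        dp_size, dp_rank = 1, 0
    return flatten_tp_across_dp_and_pcp(tp_size, tp_rank,
                                        dp_size, dp_rank,
                                        pcp_size, pcp_rank)

# TP=4, DP=2, PCP=1 in non-EP mode: size 4, so 512 // 4 = 128 per rank.
assert loader_tp_view(4, 0, 2, 1, 1, 0, use_ep=False) == (4, 0)
```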
@andakai

andakai commented May 16, 2026

LGTM, thanks for addressing the comments.

@ywang96 added the ready label on May 16, 2026
@aoshen02 added the help wanted and ci/build labels on May 16, 2026
@AjAnubolu
Contributor Author

> LGTM, thanks for addressing the comments.

thanks for the feedback!


Labels

  • bug: Something isn't working
  • ci/build
  • gpt-oss: Related to GPT-OSS models
  • help wanted: Extra attention is needed
  • ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: To Triage

Development

Successfully merging this pull request may close these issues.

[Usage]: MoE flatten_tp_size should not unconditionally include dp_size — DP loses its original semantics for MoE layers
