[Bugfix] Fix MoE flatten_tp_size unconditionally including dp_size #36240
AjAnubolu wants to merge 4 commits
Conversation
Code Review
This pull request addresses a bug in the Mixture of Experts (MoE) layer configuration where the data parallelism (DP) size was incorrectly flattened into the tensor parallelism (TP) size for non-expert-parallel (non-EP) MoE setups. The fix correctly separates the configuration logic for EP and non-EP cases. For non-EP MoE, it now ensures that DP ranks are treated as holding replicated weights and are not folded into the TP dimension, which aligns with the intended design. The logic for EP MoE, where DP is correctly flattened into the expert dimension, is preserved and moved into its respective code block for clarity. The documentation has also been updated to reflect this corrected behavior with clear examples. The changes appear to be correct and effectively resolve the issue.
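For readers following along, here is a minimal, self-contained sketch of the flattening arithmetic described above. The function name mirrors `FusedMoEParallelConfig.flatten_tp_across_dp_and_pcp`, but the body and the rank ordering are illustrative assumptions, not the vLLM implementation:

```python
# Illustrative sketch only -- the rank ordering below is an assumption,
# not vLLM's actual implementation.
def flatten_tp_across_dp_and_pcp(tp_size: int, tp_rank: int,
                                 dp_size: int, dp_rank: int,
                                 pcp_size: int, pcp_rank: int):
    # Fold the DP/PCP dimensions selected by the caller into a single
    # TP-like rank space.
    flat_tp_size = dp_size * pcp_size * tp_size
    flat_tp_rank = (dp_rank * pcp_size + pcp_rank) * tp_size + tp_rank
    return flat_tp_size, flat_tp_rank

# Non-EP MoE: the caller passes neutral DP values, so DP stays a
# replica dimension and each expert is sharded only tp_size ways.
assert flatten_tp_across_dp_and_pcp(4, 0, 1, 0, 1, 0) == (4, 0)

# EP MoE: the caller passes the real DP size/rank, so DP participates
# in the EP rank space.
assert flatten_tp_across_dp_and_pcp(4, 0, 2, 1, 1, 0) == (8, 4)
```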
Split flatten_tp_across_dp_and_pcp into separate EP and non-EP paths so DP ranks hold replicated MoE weights instead of being folded into TP.

Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
Force-pushed from fad2e99 to ba22024
Thanks for the contribution, @andakai will take a look.
Thanks for the fix. I checked and it can fix #36222. Here is another file with the same issue: the weight-loading paths in gpt_oss.py.
Now the comment inside flatten_tp_across_dp_and_pcp() may be a bit misleading:
In the non-EP path, the function is intentionally called with neutral DP values (dp_size=1, dp_rank=0) so that DP is not included in the flattened TP size. Could we reword this comment to make the EP vs non-EP behavior explicit? For example:
```python
# Flatten the DP/PCP/TP dimensions selected by the caller into a
# single TP-like rank space. In non-EP mode callers pass dp_size=1
# and dp_rank=0 so DP remains a replica dimension; in EP mode callers
# pass the real DP size/rank so DP participates in the EP rank space.
```
I think this would make it clearer.
Clarify the EP vs non-EP behavior: in non-EP mode callers pass dp_size=1, dp_rank=0 so DP stays a replica dimension; in EP mode they pass the real DP size/rank so DP joins the EP rank space.

Co-authored-by: Claude
Co-authored-by: andakai
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
The three weight-loading paths in gpt_oss.py (_load_weights_mxfp4, _load_weights_quark, _load_weights_other) call FusedMoEParallelConfig.flatten_tp_across_dp_and_pcp() with the real dp_size / dp_rank regardless of whether expert parallelism is enabled. In non-EP mode this folds DP into the TP-like rank space, slicing each expert's weights too thinly (e.g. TP=4, DP=2 gave intermediate_size//8 instead of intermediate_size//4). Match the canonical pattern in FusedMoEParallelConfig.make(): in non-EP mode pass dp_size=1, dp_rank=0 so DP stays a replica dimension. EP mode behavior is unchanged.

Reported-by: andakai
Co-authored-by: Claude
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
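A hedged, self-contained sketch of the call pattern that commit describes; `flatten_tp()` below stands in for `FusedMoEParallelConfig.flatten_tp_across_dp_and_pcp`, and the parameter names and the `use_ep` guard are assumptions for illustration, not the actual gpt_oss.py code:

```python
# Stand-in for FusedMoEParallelConfig.flatten_tp_across_dp_and_pcp;
# the rank ordering is an illustrative assumption.
def flatten_tp(tp_size, tp_rank, dp_size, dp_rank, pcp_size, pcp_rank):
    size = dp_size * pcp_size * tp_size
    rank = (dp_rank * pcp_size + pcp_rank) * tp_size + tp_rank
    return size, rank

def moe_flat_tp(use_ep, tp_size, tp_rank, dp_size, dp_rank,
                pcp_size, pcp_rank):
    if use_ep:
        # EP mode: DP participates in the EP rank space.
        return flatten_tp(tp_size, tp_rank, dp_size, dp_rank,
                          pcp_size, pcp_rank)
    # Non-EP mode: neutral DP values keep DP a replica dimension,
    # matching the canonical pattern in FusedMoEParallelConfig.make().
    return flatten_tp(tp_size, tp_rank, 1, 0, pcp_size, pcp_rank)

# TP=4, DP=2: the non-EP flat size is 4 (DP replicates), not 8.
assert moe_flat_tp(False, 4, 0, 2, 1, 1, 0)[0] == 4
assert moe_flat_tp(True, 4, 0, 2, 1, 1, 0)[0] == 8
```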
LGTM, thanks for addressing the comments.
Thanks for the feedback!
DP was folded into TP for non-EP MoE, causing incorrect partition sizes (e.g. 64 instead of 128 with TP=4 DP=2). Only flatten PCP into TP; keep DP ranks replicated.
Closes #36222
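A quick worked check of the partition sizes in the description, assuming intermediate_size = 512 (a hypothetical value chosen so the shards come out to the 64 and 128 quoted above):

```python
intermediate_size = 512  # hypothetical; any multiple of 8 works
tp_size, dp_size = 4, 2

# Buggy non-EP path: DP folded into TP, so each expert shard was
# 512 // (4 * 2) = 64 -- twice as thin as intended.
assert intermediate_size // (tp_size * dp_size) == 64

# Fixed non-EP path: DP replicates and only TP shards the weights,
# giving the intended 512 // 4 = 128.
assert intermediate_size // tp_size == 128
```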