[main] [3/5] Qwen3.5 support: SharedExpertMLP meta init by wplf · Pull Request #4754 · NVIDIA/Megatron-LM

wplf · 2026-05-12T07:34:03Z

Qwen3.5 support series

This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.

Main PRs (this series):

[1/5] MTP packed-seq CP+THD fix — fix(mtp): use padded cu_seqlens in MTP roll for THD with CP #4495
[2/5] FSDP DTensor Bridge checkpoint compatibility — [main] [2/5] Qwen3.5 support: FSDP DTensor Bridge checkpoint compatibility #4753
[3/5] SharedExpertMLP meta init — [main] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4754 ← this PR
[4/5] Interleaved MRoPE layout — [main] [4/5] Qwen3.5 support: Interleaved MRoPE layout #4755
[5/5] Qwen3.5-VL training example — [main] [5/5] Qwen3.5 support: Qwen3.5-VL training example #4756

Dev PRs (corresponding mirrors):

Summary

Add _reset_parameters to SharedExpertMLP so the directly-owned gate_weight is materialized off the meta device when use_shared_expert_gate=True.

Why

Without this, meta-init leaves gate_weight on the meta device and the first forward fails with a meta-tensor error. Submodules already have their own _reset_parameters, so only the directly-owned gate_weight needs handling.
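As a rough illustration of the failure mode described above (not code from this PR), the sketch below shows what a meta-device parameter is in plain PyTorch and what materializing it involves. The `(1, 8)` shape and the sigmoid-gated linear projection are assumptions chosen for the example, not details confirmed by the PR.

```python
import torch
import torch.nn as nn

# A parameter allocated on the meta device carries shape and dtype but no storage.
gate_weight = nn.Parameter(torch.empty(1, 8, device="meta"))
print(gate_weight.is_meta, tuple(gate_weight.shape))

# Materializing means allocating real storage and then initializing it.
# torch.empty_like only allocates; the values are undefined until an init runs.
real = nn.Parameter(torch.empty_like(gate_weight, device="cpu"))
nn.init.normal_(real, mean=0.0, std=0.02)

# Assumed usage for illustration: one gate value per token from a linear projection.
hidden = torch.randn(4, 8)
gate = torch.sigmoid(torch.nn.functional.linear(hidden, real))
```

A forward pass that reaches a parameter still on the meta device has no data to compute with, which is why each directly-owned parameter needs an explicit materialization step.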

The implementation mirrors the standard per-parameter init pattern:

run init_method when config.perform_initialization
cast to config.params_dtype
set sequence_parallel attribute
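The three steps above can be sketched roughly as follows. This is a minimal stand-in, not the Megatron-LM implementation: `ToyConfig`, `ToySharedExpertGate`, and `toy_init` are hypothetical names, and only the field names `perform_initialization`, `params_dtype`, `sequence_parallel`, `init_method`, `use_shared_expert_gate`, and `gate_weight` come from the PR description. The use of `to_empty` to move off the meta device is likewise an assumption about the surrounding flow.

```python
from dataclasses import dataclass
from typing import Callable

import torch
import torch.nn as nn


def toy_init(weight: torch.Tensor) -> torch.Tensor:
    # Hypothetical init function standing in for config.init_method.
    return nn.init.normal_(weight, mean=0.0, std=0.02)


@dataclass
class ToyConfig:
    # Hypothetical stand-in for the config fields named in the PR description.
    hidden_size: int = 8
    perform_initialization: bool = True
    params_dtype: torch.dtype = torch.bfloat16
    sequence_parallel: bool = True
    init_method: Callable[[torch.Tensor], torch.Tensor] = toy_init


class ToySharedExpertGate(nn.Module):
    def __init__(self, config: ToyConfig, use_shared_expert_gate: bool = True):
        super().__init__()
        self.config = config
        self.use_shared_expert_gate = use_shared_expert_gate
        if use_shared_expert_gate:
            # Directly-owned parameter, created without storage (meta init).
            self.gate_weight = nn.Parameter(
                torch.empty(1, config.hidden_size, device="meta")
            )
        else:
            self.register_parameter("gate_weight", None)

    def _reset_parameters(self) -> None:
        # Guard: nothing to do unless the gate exists.
        if not self.use_shared_expert_gate or self.gate_weight is None:
            return
        # 1. Run init_method when config.perform_initialization.
        if self.config.perform_initialization:
            self.config.init_method(self.gate_weight)
        # 2. Cast to config.params_dtype.
        self.gate_weight.data = self.gate_weight.data.to(self.config.params_dtype)
        # 3. Set the sequence_parallel attribute on the parameter.
        setattr(self.gate_weight, "sequence_parallel", self.config.sequence_parallel)


# Usage: allocate real storage first, then initialize the owned parameter.
module = ToySharedExpertGate(ToyConfig())
module.to_empty(device="cpu")
module._reset_parameters()
```

The early return in `_reset_parameters` reflects the guard described under Risk: when the gate is disabled, the method does nothing.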

Risk

Only fires when use_shared_expert_gate=True and gate_weight is not None — no effect on existing paths.

Notes

Mirror of #4749 (same patch, targeting main instead of dev).

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-12T07:34:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Add `_reset_parameters` to `SharedExpertMLP` so the directly-owned `gate_weight` is materialized off the meta device when `use_shared_expert_gate=True`. Without this, meta-init leaves `gate_weight` on the meta device and forward fails. Mirrors the per-parameter init pattern already used in other Megatron modules (run `init_method` if `perform_initialization`, cast to `params_dtype`, set `sequence_parallel`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>

wplf · 2026-06-04T11:06:35Z

/ok to test eab5149

wplf added the Run tests label May 12, 2026

wplf changed the title ~~fix(moe): initialize SharedExpertMLP gate_weight under meta init~~ [main] [3/5] Qwen3.5 support: SharedExpertMLP meta init May 12, 2026

This was referenced May 12, 2026

[main] [4/5] Qwen3.5 support: Interleaved MRoPE layout #4755

Open

[main] [5/5] Qwen3.5 support: Qwen3.5-VL training example #4756

Open

[main] [follow-up] Qwen3.5 support: MoE aux loss padding_mask #4777

Open

wplf force-pushed the fix/shared-experts-meta-init-main branch from 067dd9c to eab5149 May 13, 2026 10:24

wplf marked this pull request as ready for review June 4, 2026 11:06

wplf requested review from a team as code owners June 4, 2026 11:06

svcnvidia-nemo-ci added the complexity: low label Jun 4, 2026

copy-pr-bot Bot temporarily deployed to public June 4, 2026 11:07 Inactive

copy-pr-bot Bot temporarily deployed to test June 4, 2026 11:07 Inactive

copy-pr-bot Bot temporarily deployed to public June 4, 2026 11:10 Inactive

wplf wants to merge 1 commit into NVIDIA:main from wplf:fix/shared-experts-meta-init-main

2 participants
