[main] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4754
Open
wplf wants to merge 1 commit into
This was referenced May 12, 2026
Add `_reset_parameters` to `SharedExpertMLP` so the directly-owned `gate_weight` is materialized off the meta device when `use_shared_expert_gate=True`. Without this, meta-init leaves `gate_weight` on the meta device and forward fails.

Mirrors the per-parameter init pattern already used in other Megatron modules (run `init_method` if `perform_initialization`, cast to `params_dtype`, set `sequence_parallel`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>
067dd9c to eab5149
/ok to test eab5149
Qwen3.5 support series
This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.
Main PRs (this series):
Dev PRs (corresponding mirrors):
Summary
Add `_reset_parameters` to `SharedExpertMLP` so the directly-owned `gate_weight` is materialized off the meta device when `use_shared_expert_gate=True`.

Why
Without this, meta-init leaves `gate_weight` on the meta device and the first forward fails with a meta-tensor error. Submodules already have their own `_reset_parameters`, so only the directly-owned `gate_weight` needs handling.

The implementation mirrors the standard per-parameter init pattern:
- run `init_method` when `config.perform_initialization` is set
- cast to `config.params_dtype`
- set the `sequence_parallel` attribute

Risk
Only fires when `use_shared_expert_gate=True` and `gate_weight is not None`; no effect on existing paths.

Notes
Mirror of #4749 (same patch, targeting `main` instead of `dev`).

🤖 Generated with Claude Code
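For reviewers who want to see the pattern in isolation, here is a minimal sketch in plain PyTorch. It is not the Megatron code from this patch: `ToySharedExpert`, its constructor arguments, and the normal init with `std=0.02` (standing in for `init_method`) are all hypothetical. It only illustrates the three steps described above, applied to a directly-owned parameter that was created on the meta device.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySharedExpert(nn.Module):
    """Hypothetical stand-in for a module that directly owns a gate weight."""

    def __init__(self, hidden_size, use_gate=True, perform_initialization=True,
                 params_dtype=torch.float32, sequence_parallel=False, device=None):
        super().__init__()
        self.perform_initialization = perform_initialization
        self.params_dtype = params_dtype
        self.sequence_parallel = sequence_parallel
        if use_gate:
            # With device="meta" this allocates shape/dtype metadata only, no storage.
            self.gate_weight = nn.Parameter(
                torch.empty(1, hidden_size, dtype=params_dtype, device=device)
            )
        else:
            self.register_parameter("gate_weight", None)

    def _reset_parameters(self):
        # Nothing to do when the gate is disabled.
        if self.gate_weight is None:
            return
        with torch.no_grad():
            if self.perform_initialization:
                # Assumed stand-in for the configured init_method.
                nn.init.normal_(self.gate_weight, mean=0.0, std=0.02)
            self.gate_weight.data = self.gate_weight.data.to(self.params_dtype)
        # Tag the parameter so downstream code can read the flag.
        setattr(self.gate_weight, "sequence_parallel", self.sequence_parallel)

    def forward(self, x):
        return torch.sigmoid(F.linear(x, self.gate_weight))


def materialize(module, device="cpu"):
    """Allocate real storage for meta parameters, then run the module's init."""
    module.to_empty(device=device)
    module._reset_parameters()
    return module


toy = ToySharedExpert(hidden_size=8, device="meta")
was_meta = toy.gate_weight.is_meta      # True before materialization
materialize(toy)
gate = toy(torch.randn(2, 8))           # forward now runs on real storage
```

In this sketch `to_empty` only allocates uninitialized storage; the values come from `_reset_parameters`, which is why a module that owns a parameter directly needs its own hook rather than relying on its submodules.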