
[model] feat: support qwen35 mtp sft/rl #5898

Open

zpltys wants to merge 2 commits into verl-project:main from meituan-search:feature/qwen_35

Conversation

@zpltys
Contributor

@zpltys zpltys commented Apr 7, 2026

What does this PR do?

To run qwen35 with MTP, you need mbridge with this PR: ISEEKYAN/mbridge#98

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Qwen3.5 MoE models with Multi-Token Prediction (MTP) for Megatron-based SFT and GRPO training. Changes include new example scripts, model registry updates, and enhanced configuration conversion logic to handle Qwen3.5-specific MTP attributes. Review feedback recommends making the internal MTP configuration helpers public for better cross-module reuse, addressing a potential bug where user-defined MTP loss scaling factors are ignored during configuration conversion, and eliminating redundant logic in the Megatron workers by consistently using the new helper functions.

Comment on lines +1477 to +1501
def _get_mtp_num_layers(hf_config):
    """Get MTP layer count from various config formats.

    Supports:
        - num_nextn_predict_layers (DeepSeek, Qwen3 style)
        - mtp_num_hidden_layers (Qwen3.5 style, in hf_config or text_config)
    """
    if hasattr(hf_config, "num_nextn_predict_layers") and hf_config.num_nextn_predict_layers > 0:
        return hf_config.num_nextn_predict_layers
    if hasattr(hf_config, "mtp_num_hidden_layers") and hf_config.mtp_num_hidden_layers > 0:
        return hf_config.mtp_num_hidden_layers
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        if hf_config.text_config.mtp_num_hidden_layers > 0:
            return hf_config.text_config.mtp_num_hidden_layers
    return 0


def _set_mtp_num_layers(hf_config, value: int):
    """Set MTP layer count in the appropriate config field."""
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = value
    elif hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = value
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = value
Contributor


high

The MTP configuration helpers _get_mtp_num_layers and _set_mtp_num_layers are useful across different modules (e.g., in config_converter.py and megatron_workers.py). They should be made public by removing the leading underscore to follow Python naming conventions for shared utilities and to avoid linting issues when accessed from other packages.

def get_mtp_num_layers(hf_config):
    """Get MTP layer count from various config formats.

    Supports:
        - num_nextn_predict_layers (DeepSeek, Qwen3 style)
        - mtp_num_hidden_layers (Qwen3.5 style, in hf_config or text_config)
    """
    if hasattr(hf_config, "num_nextn_predict_layers") and hf_config.num_nextn_predict_layers > 0:
        return hf_config.num_nextn_predict_layers
    if hasattr(hf_config, "mtp_num_hidden_layers") and hf_config.mtp_num_hidden_layers > 0:
        return hf_config.mtp_num_hidden_layers
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        if hf_config.text_config.mtp_num_hidden_layers > 0:
            return hf_config.text_config.mtp_num_hidden_layers
    return 0


def set_mtp_num_layers(hf_config, value: int):
    """Set MTP layer count in the appropriate config field."""
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = value
    elif hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = value
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = value
References
  1. According to PEP-8, a leading underscore is used to indicate that a method or variable is intended for internal use within a module or class. If the utility is intended to be shared across different modules, the leading underscore should be removed to make it public. (link)
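As a quick sanity check, the proposed public helper can be exercised against mock configs for each layout it supports. This is an illustrative sketch only: `types.SimpleNamespace` stands in for a real `transformers` config object, and the function body mirrors the suggestion above.

```python
from types import SimpleNamespace


def get_mtp_num_layers(hf_config):
    """Return the MTP layer count, checking each supported config layout."""
    if hasattr(hf_config, "num_nextn_predict_layers") and hf_config.num_nextn_predict_layers > 0:
        return hf_config.num_nextn_predict_layers
    if hasattr(hf_config, "mtp_num_hidden_layers") and hf_config.mtp_num_hidden_layers > 0:
        return hf_config.mtp_num_hidden_layers
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        if hf_config.text_config.mtp_num_hidden_layers > 0:
            return hf_config.text_config.mtp_num_hidden_layers
    return 0


# Mock configs standing in for the three layouts the helper supports.
deepseek_style = SimpleNamespace(num_nextn_predict_layers=1)  # DeepSeek / Qwen3
qwen35_style = SimpleNamespace(text_config=SimpleNamespace(mtp_num_hidden_layers=1))  # Qwen3.5
no_mtp = SimpleNamespace()  # no MTP fields at all

print(get_mtp_num_layers(deepseek_style))  # 1
print(get_mtp_num_layers(qwen35_style))    # 1
print(get_mtp_num_layers(no_mtp))          # 0
```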

            else False
        )
        hf_config = model_config.hf_config
        mtp_num_layers = _get_mtp_num_layers(hf_config)
Contributor


high

Update the call to use the public helper name.

Suggested change
mtp_num_layers = _get_mtp_num_layers(hf_config)
mtp_num_layers = get_mtp_num_layers(hf_config)

            return
        elif not enable_mtp and has_mtp:
            model_config.hf_config.num_nextn_predict_layers = 0
            _set_mtp_num_layers(hf_config, 0)
Contributor


high

Update the call to use the public helper name.

Suggested change
_set_mtp_num_layers(hf_config, 0)
set_mtp_num_layers(hf_config, 0)

Comment on lines +291 to +302
    transformer_config = check_and_construct_configs(args, TransformerConfig)

    # MTP support: mtp_num_hidden_layers may be in hf_config or hf_config.text_config
    mtp_num_layers = 0
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.mtp_num_hidden_layers
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.text_config.mtp_num_hidden_layers

    if mtp_num_layers > 0:
        transformer_config.mtp_num_layers = mtp_num_layers
        transformer_config.mtp_loss_scaling_factor = getattr(hf_config, "mtp_loss_scaling_factor", 0.1)
Contributor


high

The current implementation has two issues:

  1. It duplicates the MTP layer detection logic and is less comprehensive than the get_mtp_num_layers helper (e.g., it misses num_nextn_predict_layers used by Qwen3 models).
  2. It potentially ignores user overrides for mtp_loss_scaling_factor. If the scaling factor is provided via override_transformer_config_kwargs, it is added to args but then removed by check_and_construct_configs if the TransformerConfig class doesn't explicitly define it. The subsequent getattr(hf_config, ...) call would then revert to the default value from the model config, ignoring the user's intent.
Suggested change
    transformer_config = check_and_construct_configs(args, TransformerConfig)

    # MTP support: mtp_num_hidden_layers may be in hf_config or hf_config.text_config
    mtp_num_layers = 0
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.mtp_num_hidden_layers
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.text_config.mtp_num_hidden_layers

    if mtp_num_layers > 0:
        transformer_config.mtp_num_layers = mtp_num_layers
        transformer_config.mtp_loss_scaling_factor = getattr(hf_config, "mtp_loss_scaling_factor", 0.1)
    # Capture MTP scaling factor before it's potentially removed by check_and_construct_configs
    mtp_loss_scaling_factor = override_transformer_config_kwargs.get(
        "mtp_loss_scaling_factor", getattr(hf_config, "mtp_loss_scaling_factor", 0.1)
    )
    transformer_config = check_and_construct_configs(args, TransformerConfig)

    # MTP support
    from verl.utils.megatron_utils import get_mtp_num_layers

    mtp_num_layers = get_mtp_num_layers(hf_config)
    if mtp_num_layers > 0:
        transformer_config.mtp_num_layers = mtp_num_layers
        transformer_config.mtp_loss_scaling_factor = mtp_loss_scaling_factor

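The override precedence this comment describes can be illustrated in isolation. In this sketch, `override_kwargs` and the mock `hf_config` are hypothetical stand-ins for `override_transformer_config_kwargs` and the real HF config; the dict/getattr chain shows the intended lookup order:

```python
from types import SimpleNamespace

# Hypothetical stand-ins: hf_config carries the model's default scaling factor,
# override_kwargs models a user override from override_transformer_config_kwargs.
hf_config = SimpleNamespace(mtp_loss_scaling_factor=0.1)
override_kwargs = {"mtp_loss_scaling_factor": 0.3}

# User override wins; the model config is the fallback; 0.1 is the last resort.
factor = override_kwargs.get(
    "mtp_loss_scaling_factor",
    getattr(hf_config, "mtp_loss_scaling_factor", 0.1),
)
print(factor)  # 0.3; reading hf_config first would silently yield 0.1
```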
Comment on lines +162 to +169
    # DeepSeek-style MTP field
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = 0
    # Qwen3.5-style MTP field: mtp_num_hidden_layers
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = 0
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = 0
Contributor


high

This logic for disabling MTP fields is redundant with the set_mtp_num_layers helper. Using the helper ensures consistency across the codebase when handling different MTP attribute names (DeepSeek vs Qwen style).

Suggested change
    # DeepSeek-style MTP field
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = 0
    # Qwen3.5-style MTP field: mtp_num_hidden_layers
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = 0
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = 0
    # Disable MTP fields
    from verl.utils.megatron_utils import set_mtp_num_layers

    set_mtp_num_layers(hf_config, 0)

ArronHZG
ArronHZG previously approved these changes Apr 8, 2026
Collaborator

@ArronHZG ArronHZG left a comment


Please follow Gemini's recommendations and refactor the code.

@zpltys zpltys changed the title support qwen35 mtp sft/rl [model][feat]support qwen35 mtp sft/rl Apr 8, 2026
@zpltys zpltys changed the title [model][feat]support qwen35 mtp sft/rl [model]feat:support qwen35 mtp sft/rl Apr 8, 2026
@zpltys zpltys changed the title [model]feat:support qwen35 mtp sft/rl [model] feat: support qwen35 mtp sft/rl Apr 8, 2026
@@ -0,0 +1,157 @@
#!/usr/bin/env bash
Collaborator


Please reuse examples/sft/gsm8k/run_qwen3_5_megatron.sh with additional MTP option.

    # DeepSeek-style MTP field
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = 0
    # Qwen3.5-style MTP field: mtp_num_hidden_layers
Collaborator


megatron_workers.py has been deprecated, please do not modify it.
