[model] feat: support qwen35 mtp sft/rl #5898
zpltys wants to merge 2 commits into verl-project:main
Conversation
Code Review
This pull request introduces support for Qwen3.5 MoE models with Multi-Token Prediction (MTP) for Megatron-based SFT and GRPO training. Changes include new example scripts, model registry updates, and enhanced configuration conversion logic to handle Qwen3.5-specific MTP attributes. Review feedback recommends making the internal MTP configuration helpers public for better cross-module reuse, addressing a potential bug where user-defined MTP loss scaling factors are ignored during configuration conversion, and eliminating redundant logic in the Megatron workers by consistently using the new helper functions.
def _get_mtp_num_layers(hf_config):
    """Get MTP layer count from various config formats.

    Supports:
    - num_nextn_predict_layers (DeepSeek, Qwen3 style)
    - mtp_num_hidden_layers (Qwen3.5 style, in hf_config or text_config)
    """
    if hasattr(hf_config, "num_nextn_predict_layers") and hf_config.num_nextn_predict_layers > 0:
        return hf_config.num_nextn_predict_layers
    if hasattr(hf_config, "mtp_num_hidden_layers") and hf_config.mtp_num_hidden_layers > 0:
        return hf_config.mtp_num_hidden_layers
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        if hf_config.text_config.mtp_num_hidden_layers > 0:
            return hf_config.text_config.mtp_num_hidden_layers
    return 0


def _set_mtp_num_layers(hf_config, value: int):
    """Set MTP layer count in the appropriate config field."""
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = value
    elif hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = value
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = value
The MTP configuration helpers _get_mtp_num_layers and _set_mtp_num_layers are useful across different modules (e.g., in config_converter.py and megatron_workers.py). They should be made public by removing the leading underscore to follow Python naming conventions for shared utilities and to avoid linting issues when accessed from other packages.
def get_mtp_num_layers(hf_config):
    """Get MTP layer count from various config formats.

    Supports:
    - num_nextn_predict_layers (DeepSeek, Qwen3 style)
    - mtp_num_hidden_layers (Qwen3.5 style, in hf_config or text_config)
    """
    if hasattr(hf_config, "num_nextn_predict_layers") and hf_config.num_nextn_predict_layers > 0:
        return hf_config.num_nextn_predict_layers
    if hasattr(hf_config, "mtp_num_hidden_layers") and hf_config.mtp_num_hidden_layers > 0:
        return hf_config.mtp_num_hidden_layers
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        if hf_config.text_config.mtp_num_hidden_layers > 0:
            return hf_config.text_config.mtp_num_hidden_layers
    return 0


def set_mtp_num_layers(hf_config, value: int):
    """Set MTP layer count in the appropriate config field."""
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = value
    elif hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = value
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = value

References
- According to PEP-8, a leading underscore is used to indicate that a method or variable is intended for internal use within a module or class. If the utility is intended to be shared across different modules, the leading underscore should be removed to make it public. (link)
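As a quick sanity check of the helper's precedence across the config shapes described above, here is a self-contained sketch using `types.SimpleNamespace` to stand in for HF config objects (the mock attribute values are illustrative, not taken from real model configs):

```python
from types import SimpleNamespace


def get_mtp_num_layers(hf_config):
    """Get MTP layer count from various config formats."""
    if hasattr(hf_config, "num_nextn_predict_layers") and hf_config.num_nextn_predict_layers > 0:
        return hf_config.num_nextn_predict_layers
    if hasattr(hf_config, "mtp_num_hidden_layers") and hf_config.mtp_num_hidden_layers > 0:
        return hf_config.mtp_num_hidden_layers
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        if hf_config.text_config.mtp_num_hidden_layers > 0:
            return hf_config.text_config.mtp_num_hidden_layers
    return 0


# DeepSeek/Qwen3-style: MTP count lives in num_nextn_predict_layers
deepseek_style = SimpleNamespace(num_nextn_predict_layers=1)
# Qwen3.5-style: MTP count nested under text_config
qwen35_style = SimpleNamespace(text_config=SimpleNamespace(mtp_num_hidden_layers=2))
# No MTP fields at all -> falls through to 0
plain = SimpleNamespace()

print(get_mtp_num_layers(deepseek_style))  # 1
print(get_mtp_num_layers(qwen35_style))    # 2
print(get_mtp_num_layers(plain))           # 0
```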
        else False
    )
    hf_config = model_config.hf_config
    mtp_num_layers = _get_mtp_num_layers(hf_config)
        return
    elif not enable_mtp and has_mtp:
-       model_config.hf_config.num_nextn_predict_layers = 0
+       _set_mtp_num_layers(hf_config, 0)
    transformer_config = check_and_construct_configs(args, TransformerConfig)

    # MTP support: mtp_num_hidden_layers may be in hf_config or hf_config.text_config
    mtp_num_layers = 0
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.mtp_num_hidden_layers
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.text_config.mtp_num_hidden_layers

    if mtp_num_layers > 0:
        transformer_config.mtp_num_layers = mtp_num_layers
        transformer_config.mtp_loss_scaling_factor = getattr(hf_config, "mtp_loss_scaling_factor", 0.1)
The current implementation has two issues:
- It duplicates the MTP layer detection logic and is less comprehensive than the `get_mtp_num_layers` helper (e.g., it misses `num_nextn_predict_layers` used by Qwen3 models).
- It potentially ignores user overrides for `mtp_loss_scaling_factor`. If the scaling factor is provided via `override_transformer_config_kwargs`, it is added to `args` but then removed by `check_and_construct_configs` if the `TransformerConfig` class doesn't explicitly define it. The subsequent `getattr(hf_config, ...)` call would then revert to the default value from the model config, ignoring the user's intent.
Suggested change:

    # Current
    transformer_config = check_and_construct_configs(args, TransformerConfig)

    # MTP support: mtp_num_hidden_layers may be in hf_config or hf_config.text_config
    mtp_num_layers = 0
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.mtp_num_hidden_layers
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        mtp_num_layers = hf_config.text_config.mtp_num_hidden_layers

    if mtp_num_layers > 0:
        transformer_config.mtp_num_layers = mtp_num_layers
        transformer_config.mtp_loss_scaling_factor = getattr(hf_config, "mtp_loss_scaling_factor", 0.1)

    # Suggested
    # Capture MTP scaling factor before it's potentially removed by check_and_construct_configs
    mtp_loss_scaling_factor = override_transformer_config_kwargs.get(
        "mtp_loss_scaling_factor", getattr(hf_config, "mtp_loss_scaling_factor", 0.1)
    )
    transformer_config = check_and_construct_configs(args, TransformerConfig)

    # MTP support
    from verl.utils.megatron_utils import get_mtp_num_layers

    mtp_num_layers = get_mtp_num_layers(hf_config)
    if mtp_num_layers > 0:
        transformer_config.mtp_num_layers = mtp_num_layers
        transformer_config.mtp_loss_scaling_factor = mtp_loss_scaling_factor
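The capture-before-filter pattern recommended here can be isolated into a small resolver; a minimal sketch (the function name and the dict-based override argument are illustrative, not part of the PR):

```python
from types import SimpleNamespace


def resolve_mtp_loss_scaling_factor(override_kwargs, hf_config, default=0.1):
    """User override wins; otherwise fall back to the model config, then a hard default."""
    if "mtp_loss_scaling_factor" in override_kwargs:
        return override_kwargs["mtp_loss_scaling_factor"]
    return getattr(hf_config, "mtp_loss_scaling_factor", default)


hf_config = SimpleNamespace(mtp_loss_scaling_factor=0.3)

# User override takes priority over the value in hf_config
print(resolve_mtp_loss_scaling_factor({"mtp_loss_scaling_factor": 0.5}, hf_config))  # 0.5
# No override -> model config value
print(resolve_mtp_loss_scaling_factor({}, hf_config))  # 0.3
# Neither -> hard default
print(resolve_mtp_loss_scaling_factor({}, SimpleNamespace()))  # 0.1
```

Resolving the value before any config filtering runs means the user's intent survives even if `mtp_loss_scaling_factor` is stripped from `args` later.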
    # DeepSeek-style MTP field
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = 0
    # Qwen3.5-style MTP field: mtp_num_hidden_layers
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = 0
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = 0
This logic for disabling MTP fields is redundant with the set_mtp_num_layers helper. Using the helper ensures consistency across the codebase when handling different MTP attribute names (DeepSeek vs Qwen style).
Suggested change:

    # Current
    # DeepSeek-style MTP field
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = 0
    # Qwen3.5-style MTP field: mtp_num_hidden_layers
    if hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = 0
    if hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = 0

    # Suggested
    # Disable MTP fields
    from verl.utils.megatron_utils import set_mtp_num_layers

    set_mtp_num_layers(hf_config, 0)
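As a standalone check that the helper covers both config shapes when disabling MTP, a minimal sketch with mock configs (attribute values are illustrative):

```python
from types import SimpleNamespace


def set_mtp_num_layers(hf_config, value: int):
    """Set MTP layer count in the appropriate config field."""
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = value
    elif hasattr(hf_config, "mtp_num_hidden_layers"):
        hf_config.mtp_num_hidden_layers = value
    elif hasattr(hf_config, "text_config") and hasattr(hf_config.text_config, "mtp_num_hidden_layers"):
        hf_config.text_config.mtp_num_hidden_layers = value


deepseek_style = SimpleNamespace(num_nextn_predict_layers=1)
qwen35_style = SimpleNamespace(text_config=SimpleNamespace(mtp_num_hidden_layers=1))

# One call handles both the DeepSeek-style and the Qwen3.5-style field
set_mtp_num_layers(deepseek_style, 0)
set_mtp_num_layers(qwen35_style, 0)

print(deepseek_style.num_nextn_predict_layers)         # 0
print(qwen35_style.text_config.mtp_num_hidden_layers)  # 0
```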
ArronHZG left a comment
Please follow Gemini's recommendation to refactor the code.
@@ -0,0 +1,157 @@
#!/usr/bin/env bash
Please reuse examples/sft/gsm8k/run_qwen3_5_megatron.sh with an additional MTP option.
    # DeepSeek-style MTP field
    if hasattr(hf_config, "num_nextn_predict_layers"):
        hf_config.num_nextn_predict_layers = 0
    # Qwen3.5-style MTP field: mtp_num_hidden_layers
megatron_workers.py has been deprecated; please do not modify it.
What does this PR do?
To run qwen35 with MTP, you should use mbridge with PR ISEEKYAN/mbridge#98.
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If this PR breaks any API, add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.