
[Qwen3.5] Enable MTP spec_v2 and add test for nvidia/Qwen3.5-397B-A17B-NVFP4 #19391

Merged: Fridge003 merged 2 commits into sgl-project:main from hlu1:qwen35_test on Mar 4, 2026

Conversation

@hlu1 (Collaborator) commented Feb 26, 2026

Motivation

  • Make MTP_v2 work for Qwen3.5 by passing mm_input_embeds to the MTP head.
  • Add MTP_v1/v2 and non-MTP accuracy tests for nvidia/Qwen3.5-397B-A17B-NVFP4, and check the acceptance length in the MTP tests. Note that the tests use the eval harness (`from sglang.test.run_eval import run_eval`), which applies the chat_template; without the chat_template, accuracy is very poor. The sampling parameters follow the official recommendation at https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8
  • Remove the two extra_buffer and mtp_v2 checks. They were incorrect because mtp_v2 only requires extra_buffer when radix cache is on. To make the behavior more user-friendly, the previous default of silently turning off the radix cache when spec decoding, no_buffer, and radix cache are all enabled now raises an exception instead, in case the user wants to enable spec decoding (v1 or v2) together with the radix cache but forgot to enable extra_buffer.
  • Remove duplicated code from server_args.py
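The mm_input_embeds change in the first bullet can be sketched as follows. This is an illustrative toy, not sglang's actual module layout: the point is that the MTP draft head consumes the target model's precomputed input embeddings (which already contain multimodal features) instead of re-embedding raw token ids.

```python
# Hypothetical sketch of passing mm_input_embeds to the MTP head. The
# function names, shapes, and the elementwise fuse are stand-ins for the
# real draft head's concat + projection; see sglang's Qwen3.5 MTP code
# for the actual implementation.

def fuse(hidden_state, input_embed):
    """Combine one position's hidden state and input embedding
    (stand-in for the draft head's concat + linear projection)."""
    return [h + e for h, e in zip(hidden_state, input_embed)]

def mtp_draft_forward(hidden_states, input_embeds):
    # One fused vector per position; a real head would then run a small
    # transformer layer and an lm_head over these.
    return [fuse(h, e) for h, e in zip(hidden_states, input_embeds)]

hs = [[0.1, 0.2], [0.3, 0.4]]       # target hidden states, 2 positions
emb = [[1.0, 1.0], [2.0, 2.0]]      # stand-in for mm_input_embeds
print(mtp_draft_forward(hs, emb))   # [[1.1, 1.2], [2.3, 2.4]]
```

If the draft head re-embedded token ids instead, the multimodal features injected into `emb` would be lost, which is why the fix forwards the embeddings directly.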

co-author: @vincentzed #18906
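The stricter check described in the third bullet can be sketched as below. Flag names here are illustrative; the real options live in python/sglang/srt/server_args.py.

```python
# Sketch of the new behavior: instead of silently disabling the radix
# cache, raise when spec decoding and the radix cache are enabled but
# extra_buffer is not. All argument names are assumptions for this sketch.

def check_spec_decoding_buffers(spec_enabled: bool,
                                radix_cache_enabled: bool,
                                extra_buffer_enabled: bool) -> None:
    if spec_enabled and radix_cache_enabled and not extra_buffer_enabled:
        raise ValueError(
            "Speculative decoding (v1 or v2) with the radix cache requires "
            "extra_buffer. Enable extra_buffer, or disable the radix cache."
        )

check_spec_decoding_buffers(True, False, False)   # ok: radix cache off
check_spec_decoding_buffers(True, True, True)     # ok: extra_buffer on
try:
    check_spec_decoding_buffers(True, True, False)
except ValueError as e:
    print("raised:", e)
```

Raising instead of silently flipping a flag makes the misconfiguration visible to users who intended to run spec decoding together with the radix cache.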

Accuracy

Without radix cache

gpqa:
Repeat: 8, mean: 0.866
Scores: ['0.859', '0.869', '0.874', '0.869', '0.884', '0.848', '0.869', '0.859']

With radix cache

gpqa:
Repeat: 8, mean: 0.861
Scores: ['0.848', '0.843', '0.874', '0.854', '0.864', '0.859', '0.894', '0.854']
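As a quick sanity check, the reported means can be recomputed from the per-repeat scores above:

```python
# Recompute the reported GPQA means from the per-repeat scores.
scores_no_radix = [0.859, 0.869, 0.874, 0.869, 0.884, 0.848, 0.869, 0.859]
scores_radix    = [0.848, 0.843, 0.874, 0.854, 0.864, 0.859, 0.894, 0.854]

def mean3(xs):
    return round(sum(xs) / len(xs), 3)

print(mean3(scores_no_radix))  # 0.866
print(mean3(scores_radix))     # 0.861
```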

Benchmark

radix cache off

(Figure: tps_vs_throughput_v1_vs_v2, tokens/s vs. throughput for spec_v1 and spec_v2)



@hlu1 (Collaborator, Author) commented Feb 26, 2026

/tag-and-rerun-ci

@hlu1 hlu1 changed the title [Qwen3.5] Add test for nvidia/Qwen3.5-397B-A17B-NVFP4 [Qwen3.5] Enable MTP_v2 and add test for nvidia/Qwen3.5-397B-A17B-NVFP4 Feb 28, 2026
@hlu1 hlu1 requested a review from hanming-lu February 28, 2026 02:15
@hlu1 (Collaborator, Author) commented Feb 28, 2026

/tag-and-rerun-ci

Comment thread python/sglang/srt/server_args.py
@vincentzed (Contributor) commented Mar 1, 2026

I tested trtllm_mha under this as well:

| Backend | Latency (s) | Tokens | Acc Length | Speed (token/s) |
|---|---|---|---|---|
| trtllm_mha | 3.088 | 512 | 3.413 | 165.82 |
| triton | 2.177 | 512 | 3.303 | 235.23 |
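As a rough cross-check of the benchmark numbers above, the reported speed should be close to tokens divided by latency:

```python
# Cross-check: speed (token/s) ~ tokens / latency for each backend row.
rows = {
    "trtllm_mha": {"latency_s": 3.088, "tokens": 512, "reported_tps": 165.82},
    "triton":     {"latency_s": 2.177, "tokens": 512, "reported_tps": 235.23},
}
for name, r in rows.items():
    derived = r["tokens"] / r["latency_s"]
    # Small deviation is expected; the reported figure is measured end to
    # end, not derived from the rounded latency.
    print(f"{name}: derived {derived:.2f} vs reported {r['reported_tps']}")
```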

@hlu1 hlu1 changed the title [Qwen3.5] Enable MTP_v2 and add test for nvidia/Qwen3.5-397B-A17B-NVFP4 [Qwen3.5] Enable MTP spec_v2 and add test for nvidia/Qwen3.5-397B-A17B-NVFP4 Mar 2, 2026
Comment thread python/sglang/srt/disaggregation/decode.py
Comment thread python/sglang/srt/mem_cache/memory_pool.py
Comment thread test/registered/4-gpu-models/test_qwen35_models.py
@ShangmingCai (Collaborator) left a comment:
LGTM as long as the CI passes. CC: @yizhang2077 Please double check.

@hzh0425 (Collaborator) left a comment:
LGTM

@hlu1 (Collaborator, Author) commented Mar 2, 2026

The gb200 CI is temporarily disabled. I ran the tests locally and they all pass.

Co-authored-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: lzy <tomlzy213@gmail.com>
@hlu1 (Collaborator, Author) commented Mar 3, 2026

/rerun-failed-ci

@hlu1 (Collaborator, Author) commented Mar 4, 2026

/rerun-failed-ci

@hlu1 (Collaborator, Author) commented Mar 4, 2026

/rerun-failed-ci


@hlu1 (Collaborator, Author) commented Mar 4, 2026

Both test/registered/4-gpu-models/test_qwen3_next_models_mtp.py and test/registered/4-gpu-models/test_qwen35_models.py have passed in the latest CI run.

@Fridge003 Fridge003 merged commit 9457c04 into sgl-project:main Mar 4, 2026
175 of 203 checks passed
qeternity pushed a commit to qeternity/sglang that referenced this pull request Mar 6, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Mar 6, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026