
Add Spec V2 for Qwen 3 next #15591

Closed
vincentzed wants to merge 7 commits into sgl-project:main from bzhng-development:vz/clean-spec-v2

Conversation

@vincentzed (Contributor) commented Dec 22, 2025

Motivation

Modifications

Here’s a command we want to make work:

CUDA_VISIBLE_DEVICES=4,5,6,7 SGLANG_ENABLE_JIT_DEEPGEMM=0 SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-algorithm NEXTN \
  --tp 4 \
  --model-loader-extra-config "{\"enable_multithread_load\": true, \"num_threads\": 8}"

The key option is SGLANG_ENABLE_SPEC_V2=1, which enables overlapping the draft and verify stages.

Accuracy is tested (no gibberish in the outputs).

Speedup, by total token throughput:
16373 / 14293 × 100 − 100 ≈ 14.55%
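The speedup figure can be reproduced from the "Total token throughput" lines of the two benchmark tables below:

```python
# Total token throughput (tok/s), taken from the benchmark tables below.
v2_total = 16373.79   # Spec V2, this branch
v1_total = 14293.23   # Spec V1 on main

speedup_pct = (v2_total / v1_total - 1) * 100
print(f"{speedup_pct:.2f}%")  # 14.56%
```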

GSP benchmark result:

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --dataset-name generated-shared-prefix \
  --num-prompts 512 \
  --gsp-num-groups 32 \
  --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 \
  --gsp-question-len 128 \
  --gsp-output-len 256 \
  --request-rate 10 \
  --flush-cache \
  --output-file qwen3_next_bench_v2.jsonl

Spec V2, this branch:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    10.0      
Max request concurrency:                 not set   
Successful requests:                     512       
Benchmark duration (s):                  79.69     
Total input tokens:                      1173686   
Total input text tokens:                 1173686   
Total input vision tokens:               0         
Total generated tokens:                  131072    
Total generated tokens (retokenized):    130889    
Request throughput (req/s):              6.43      
Input token throughput (tok/s):          14728.93  
Output token throughput (tok/s):         1644.86   
Peak output token throughput (tok/s):    1964.00   
Peak concurrent requests:                219       
Total token throughput (tok/s):          16373.79  
Concurrency:                             121.38    
Accept length:                           2.38      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18891.69  
Median E2E Latency (ms):                 19308.95  
---------------Time to First Token----------------
Mean TTFT (ms):                          11952.38  
Median TTFT (ms):                        11721.14  
P99 TTFT (ms):                           24736.56  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.21     
Median TPOT (ms):                        28.08     
P99 TPOT (ms):                           35.03     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           28.23     
Median ITL (ms):                         12.32     
P95 ITL (ms):                            94.34     
P99 ITL (ms):                            185.87    
Max ITL (ms):                            831.23    
==================================================

Spec V1 on this branch:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    10.0      
Max request concurrency:                 not set   
Successful requests:                     512       
Benchmark duration (s):                  91.59     
Total input tokens:                      1173686   
Total input text tokens:                 1173686   
Total input vision tokens:               0         
Total generated tokens:                  131072    
Total generated tokens (retokenized):    130870    
Request throughput (req/s):              5.59      
Input token throughput (tok/s):          12814.32  
Output token throughput (tok/s):         1431.05   
Peak output token throughput (tok/s):    1766.00   
Peak concurrent requests:                252       
Total token throughput (tok/s):          14245.36  
Concurrency:                             139.15    
Accept length:                           2.37      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   24891.99  
Median E2E Latency (ms):                 24959.58  
---------------Time to First Token----------------
Mean TTFT (ms):                          16858.14  
Median TTFT (ms):                        16453.77  
P99 TTFT (ms):                           36294.50  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.51     
Median TPOT (ms):                        32.57     
P99 TPOT (ms):                           39.77     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.69     
Median ITL (ms):                         13.20     
P95 ITL (ms):                            101.28    
P99 ITL (ms):                            194.41    
Max ITL (ms):                            979.18    
==================================================

To make sure Spec V1 did not regress, we also test Spec V1 on main:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    10.0      
Max request concurrency:                 not set   
Successful requests:                     512       
Benchmark duration (s):                  91.29     
Total input tokens:                      1173686   
Total input text tokens:                 1173686   
Total input vision tokens:               0         
Total generated tokens:                  131072    
Total generated tokens (retokenized):    130931    
Request throughput (req/s):              5.61      
Input token throughput (tok/s):          12857.37  
Output token throughput (tok/s):         1435.85   
Peak output token throughput (tok/s):    1769.00   
Peak concurrent requests:                241       
Total token throughput (tok/s):          14293.23  
Concurrency:                             133.99    
Accept length:                           2.37      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23888.86  
Median E2E Latency (ms):                 23660.63  
---------------Time to First Token----------------
Mean TTFT (ms):                          15890.54  
Median TTFT (ms):                        15069.81  
P99 TTFT (ms):                           35798.34  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.37     
Median TPOT (ms):                        32.31     
P99 TPOT (ms):                           40.22     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.53     
Median ITL (ms):                         13.17     
P95 ITL (ms):                            108.71    
P99 ITL (ms):                            204.41    
Max ITL (ms):                            967.28    
==================================================

The result is roughly equal to Spec V1 on this branch, which shows the V2 changes introduce no regression for Spec V1.

Traces for all three runs:
https://drive.google.com/drive/folders/17HodwEV2LPuyPri7W01b1_8pDid9JCzD?usp=sharing

In this PR we also fixed a CUDA stream sync that initially caused a perf regression (see the inline comments); the profiler traces show several examples of the unwanted stream sync.

Accuracy Tests

Benchmarking and Profiling

Checklist

@vincentzed vincentzed changed the title Vz/clean spec v2 Add Spec V2 for Qwen 3 next Dec 22, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on refining and optimizing the speculative decoding V2 implementation within SGLang. The core changes involve introducing a specialized KV cache management mode for EAGLEWorkerV2 and optimizing Mamba state updates by conditionally bypassing masking operations. These improvements collectively aim to boost the overall performance of speculative decoding, as evidenced by the provided benchmark results showing enhanced throughput and lower latencies.

Highlights

  • Speculative Decoding V2 Enhancements: Introduced a new DRAFT_EXTEND_V2 forward mode specifically for EAGLEWorkerV2 to optimize the filling of the draft KV cache after verification, improving efficiency in speculative decoding.
  • Optimized Mamba State Update: Implemented a skip_masking flag in the update_mamba_state_after_mtp_verify function. This optimization prevents unnecessary cudaStreamSynchronize calls when accepted_steps are guaranteed to be non-negative, leading to faster Mamba state updates in specific V2 paths.
  • Integration of Mamba State Optimization: Integrated the optimized Mamba state update logic into the EAGLEWorkerV2's verification process for hybrid GDN models, leveraging the new skip_masking=True parameter to enhance performance.
  • Performance Improvements: Benchmark results demonstrate significant performance gains for Spec V2 on this branch compared to Spec V1, with improved request throughput and reduced end-to-end latency.
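Based on the description above, the skip_masking flag likely exploits a common PyTorch pattern: boolean-mask indexing on CUDA tensors internally runs a nonzero(), which forces a cudaStreamSynchronize so the host can learn the output size, while plain integer indexing stays fully asynchronous. A hypothetical sketch of the pattern; the function, arguments, and shapes are illustrative, not the actual SGLang API:

```python
import torch

def select_accepted_state(states, accepted_steps, skip_masking=False):
    """Illustrative sketch (not the actual SGLang code): per request, pick
    the Mamba state after its last accepted draft step.

    states:         [batch, num_steps, state_dim]
    accepted_steps: [batch] int64; may contain -1 ("nothing accepted")
                    unless the caller sets skip_masking=True.
    """
    batch_idx = torch.arange(states.size(0), device=states.device)
    if skip_masking:
        # Caller guarantees accepted_steps >= 0, so plain integer indexing
        # suffices; the kernel is enqueued asynchronously with no host
        # round-trip (no cudaStreamSynchronize).
        return states[batch_idx, accepted_steps]
    # General path: boolean-mask indexing. On CUDA tensors this forces a
    # stream sync, since the host must learn how many rows the mask selects.
    valid = accepted_steps >= 0
    out = states[:, 0].clone()  # fallback: keep the step-0 state
    out[valid] = states[batch_idx[valid], accepted_steps[valid]]
    return out
```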



@gemini-code-assist Bot left a comment

Code Review

This pull request introduces Spec V2 for Qwen 3 next, an optimization that enables overlapping the draft and verify stages in speculative decoding. The changes primarily focus on handling a new forward mode, DRAFT_EXTEND_V2, and optimizing Mamba state updates. While the implementation is mostly sound and demonstrates performance improvements, I have identified a critical bug in the calculation of accepted steps within eagle_worker_v2.py that needs to be addressed. Additionally, I've provided a suggestion to improve code clarity by removing a redundant variable.

Comment thread python/sglang/srt/speculative/eagle_worker_v2.py Outdated
):
# Calculate accepted_steps for mamba state update
# Include the bonus token (+1)
accepted_length_with_bonus = accept_length
Severity: medium

The variable accepted_length_with_bonus is redundant because accept_length already includes the bonus token, as mentioned in the comment on line 764. You can remove this line and use accept_length directly in torch.cumsum on line 770 and in the calculation on line 785. This will make the code clearer and less prone to confusion.
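The suggested simplification amounts to dropping the alias. A hypothetical reconstruction, since only a fragment of the file is quoted here and the tensor values are made up:

```python
import torch

accept_length = torch.tensor([3, 1, 2])  # hypothetical per-request values

# Before: redundant alias; accept_length already includes the bonus token.
accepted_length_with_bonus = accept_length
before = torch.cumsum(accepted_length_with_bonus, dim=0)

# After: use accept_length directly, as the review suggests.
after = torch.cumsum(accept_length, dim=0)

assert torch.equal(before, after)
print(after.tolist())  # [3, 4, 6]
```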

@vincentzed (Author) commented Dec 22, 2025

Benchmark across various batch sizes (max concurrency):

Cmd:

OUTPUT=before_spec_v2.jsonl && rm -f $OUTPUT && \
for ((N=1; N<=128; N*=2)); do \
  python3 -m sglang.bench_serving --backend sglang --flush-cache \
    --dataset-name random --random-input-len 1024 --random-output-len 1024 \
    --random-range-ratio 1.0 --num-prompts $((6*N)) --max-concurrency $N \
    --output-file $OUTPUT; \
done && python test/srt/parse_results.py $OUTPUT

Summary of below:

+-------------+----------------------+---------+---------+-----------+------------------------+
| concurrency | metric               | before  | after   | delta     | gain                   |
+=============+======================+=========+=========+===========+========================+
| 1           | throughput            | 259.9   | 332.7   | +72.8     | +28.0%                 |
|             | p99_ttft_ms           | 332.0   | 206.2   | −125.8    | −37.9%                 |
|             | mean_tpot_ms          | 3.67    | 2.85    | −0.82     | −22.3%                 |
+-------------+----------------------+---------+---------+-----------+------------------------+
| 8           | throughput            | 1237.5  | 1533.9  | +296.4    | +24.0%                 |
|             | p99_ttft_ms           | 500.5   | 482.9   | −17.6     | −3.5%                  |
|             | mean_tpot_ms          | 5.80    | 4.66    | −1.14     | −19.7%                 |
+-------------+----------------------+---------+---------+-----------+------------------------+
| 32          | throughput            | 2874.2  | 3313.8  | +439.6    | +15.3%                 |
|             | p99_ttft_ms           | 702.8   | 764.3   | +61.5     | +8.7% (regression)     |
|             | mean_tpot_ms          | 10.25   | 8.87    | −1.38     | −13.5%                 |
+-------------+----------------------+---------+---------+-----------+------------------------+
| 128         | throughput            | 3414.6  | 3760.6  | +346.0    | +10.1%                 |
|             | p99_ttft_ms           | 26155.6 | 24057.7 | −2097.9   | −8.0%                  |
|             | mean_tpot_ms          | 13.63   | 12.38   | −1.25     | −9.2%                  |
+-------------+----------------------+---------+---------+-----------+------------------------+

Before

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |            259.911 |             259.911 |        186.638 |          159.810 |       331.980 |          3.666 |            3.530 |         4.365 |               259.911 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             2.000 |            427.746 |             427.746 |        184.218 |          163.498 |       297.547 |          4.414 |            4.205 |         5.888 |               213.873 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |             4.000 |            758.685 |             758.685 |        202.884 |          155.606 |       480.605 |          4.798 |            4.453 |         6.812 |               189.671 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |             8.000 |           1237.523 |            1237.523 |        223.833 |          158.601 |       500.534 |          5.800 |            5.514 |         8.861 |               154.690 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  4 |            16.000 |           1923.267 |            1923.267 |        235.574 |          160.461 |       590.376 |          7.517 |            6.950 |        11.465 |               120.204 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  5 |            32.000 |           2874.232 |            2874.232 |        260.423 |          161.570 |       702.761 |         10.250 |            9.756 |        16.054 |                89.820 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  6 |            64.000 |           3421.672 |            3421.672 |       4652.512 |         5072.364 |      8589.624 |         13.332 |           12.608 |        21.103 |                53.464 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  7 |           128.000 |           3414.624 |            3414.624 |      22230.409 |        24696.099 |     26155.598 |         13.630 |           12.627 |        21.162 |                26.677 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

After


+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |            332.667 |             332.667 |        163.454 |          154.984 |       206.172 |          2.846 |            2.718 |         3.276 |               332.667 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             2.000 |            543.485 |             543.485 |        201.973 |          168.965 |       380.087 |          3.376 |            3.096 |         4.600 |               271.743 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |             4.000 |            991.061 |             991.061 |        191.516 |          152.331 |       414.234 |          3.672 |            3.391 |         5.266 |               247.765 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |             8.000 |           1533.888 |            1533.888 |        241.940 |          164.845 |       482.935 |          4.656 |            4.239 |         7.163 |               191.736 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  4 |            16.000 |           2302.798 |            2302.798 |        241.296 |          163.455 |       545.563 |          6.222 |            5.759 |         9.776 |               143.925 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  5 |            32.000 |           3313.815 |            3313.815 |        300.407 |          175.229 |       764.259 |          8.869 |            8.493 |        14.249 |               103.557 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  6 |            64.000 |           3890.366 |            3890.366 |       4120.522 |         4512.661 |      7544.377 |         11.741 |           11.151 |        18.609 |                60.787 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  7 |           128.000 |           3760.559 |            3760.559 |      20254.542 |        22428.491 |     24057.687 |         12.378 |           11.476 |        19.190 |                29.379 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

@vincentzed (Author):

/tag-and-rerun-ci

@vincentzed (Author):

/rerun-failed-ci

1 similar comment
@vincentzed (Author):

/rerun-failed-ci

Comment thread python/sglang/srt/speculative/eagle_worker_v2.py Outdated
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@vincentzed (Author):

@yizhang2077, could you take another look?

@whybeyoung (Collaborator):

LGTM

Comment thread python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
@vincentzed (Author):

@yizhang2077 Hello, can you give it an approval? Then, in #18808, we can apply the improvements on top.

@b8zhong b8zhong closed this Feb 28, 2026
@b8zhong b8zhong deleted the vz/clean-spec-v2 branch March 9, 2026 17:04

5 participants