
Add Spec V2 for Qwen 3 next #15591

Closed
vincentzed wants to merge 7 commits into sgl-project:main from bzhng-development:vz/clean-spec-v2

Conversation

@vincentzed (Contributor) commented Dec 22, 2025

Motivation

Modifications

Here’s a command we want to make work:

CUDA_VISIBLE_DEVICES=4,5,6,7 SGLANG_ENABLE_JIT_DEEPGEMM=0 SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-algorithm NEXTN \
  --tp 4 \
  --model-loader-extra-config "{\"enable_multithread_load\": true, \"num_threads\": 8}"

The key option is SGLANG_ENABLE_SPEC_V2=1, which enables overlapping the draft and verify stages.

Accuracy is tested (no gibberish in the outputs).

Speedup, by total token throughput:
16373 / 14293 × 100 − 100 ≈ 14.55%
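The speedup figure can be reproduced from the "Total token throughput" lines of the two benchmark tables below:

```python
# Total token throughput (tok/s), taken from the benchmark tables below.
v2_total = 16373.79   # Spec V2, this branch
v1_total = 14293.23   # Spec V1 on main

speedup_pct = (v2_total / v1_total - 1) * 100
print(f"{speedup_pct:.2f}%")  # 14.56%
```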

GSP benchmark result:

python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --dataset-name generated-shared-prefix \
  --num-prompts 512 \
  --gsp-num-groups 32 \
  --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 \
  --gsp-question-len 128 \
  --gsp-output-len 256 \
  --request-rate 10 \
  --flush-cache \
  --output-file qwen3_next_bench_v2.jsonl

Spec V2, this branch:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    10.0      
Max request concurrency:                 not set   
Successful requests:                     512       
Benchmark duration (s):                  79.69     
Total input tokens:                      1173686   
Total input text tokens:                 1173686   
Total input vision tokens:               0         
Total generated tokens:                  131072    
Total generated tokens (retokenized):    130889    
Request throughput (req/s):              6.43      
Input token throughput (tok/s):          14728.93  
Output token throughput (tok/s):         1644.86   
Peak output token throughput (tok/s):    1964.00   
Peak concurrent requests:                219       
Total token throughput (tok/s):          16373.79  
Concurrency:                             121.38    
Accept length:                           2.38      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18891.69  
Median E2E Latency (ms):                 19308.95  
---------------Time to First Token----------------
Mean TTFT (ms):                          11952.38  
Median TTFT (ms):                        11721.14  
P99 TTFT (ms):                           24736.56  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.21     
Median TPOT (ms):                        28.08     
P99 TPOT (ms):                           35.03     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           28.23     
Median ITL (ms):                         12.32     
P95 ITL (ms):                            94.34     
P99 ITL (ms):                            185.87    
Max ITL (ms):                            831.23    
==================================================

Spec V1 on this branch:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    10.0      
Max request concurrency:                 not set   
Successful requests:                     512       
Benchmark duration (s):                  91.59     
Total input tokens:                      1173686   
Total input text tokens:                 1173686   
Total input vision tokens:               0         
Total generated tokens:                  131072    
Total generated tokens (retokenized):    130870    
Request throughput (req/s):              5.59      
Input token throughput (tok/s):          12814.32  
Output token throughput (tok/s):         1431.05   
Peak output token throughput (tok/s):    1766.00   
Peak concurrent requests:                252       
Total token throughput (tok/s):          14245.36  
Concurrency:                             139.15    
Accept length:                           2.37      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   24891.99  
Median E2E Latency (ms):                 24959.58  
---------------Time to First Token----------------
Mean TTFT (ms):                          16858.14  
Median TTFT (ms):                        16453.77  
P99 TTFT (ms):                           36294.50  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.51     
Median TPOT (ms):                        32.57     
P99 TPOT (ms):                           39.77     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.69     
Median ITL (ms):                         13.20     
P95 ITL (ms):                            101.28    
P99 ITL (ms):                            194.41    
Max ITL (ms):                            979.18    
==================================================

To make sure Spec V1 did not regress, we also test Spec V1 on main:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    10.0      
Max request concurrency:                 not set   
Successful requests:                     512       
Benchmark duration (s):                  91.29     
Total input tokens:                      1173686   
Total input text tokens:                 1173686   
Total input vision tokens:               0         
Total generated tokens:                  131072    
Total generated tokens (retokenized):    130931    
Request throughput (req/s):              5.61      
Input token throughput (tok/s):          12857.37  
Output token throughput (tok/s):         1435.85   
Peak output token throughput (tok/s):    1769.00   
Peak concurrent requests:                241       
Total token throughput (tok/s):          14293.23  
Concurrency:                             133.99    
Accept length:                           2.37      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23888.86  
Median E2E Latency (ms):                 23660.63  
---------------Time to First Token----------------
Mean TTFT (ms):                          15890.54  
Median TTFT (ms):                        15069.81  
P99 TTFT (ms):                           35798.34  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.37     
Median TPOT (ms):                        32.31     
P99 TPOT (ms):                           40.22     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.53     
Median ITL (ms):                         13.17     
P95 ITL (ms):                            108.71    
P99 ITL (ms):                            204.41    
Max ITL (ms):                            967.28    
==================================================

The result is roughly equal to Spec V1 on this branch, which shows the V2 changes introduce no regression for Spec V1.

Traces for all three runs:
https://drive.google.com/drive/folders/17HodwEV2LPuyPri7W01b1_8pDid9JCzD?usp=sharing

In this PR we also fixed a CUDA stream sync that initially caused a perf regression (see the inline comments); the profiler traces show several examples of the unwanted stream sync.

Accuracy Tests

Benchmarking and Profiling

Checklist

@vincentzed vincentzed changed the title Vz/clean spec v2 Add Spec V2 for Qwen 3 next Dec 22, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on refining and optimizing the speculative decoding V2 implementation within SGLang. The core changes involve introducing a specialized KV cache management mode for EAGLEWorkerV2 and optimizing Mamba state updates by conditionally bypassing masking operations. These improvements collectively aim to boost the overall performance of speculative decoding, as evidenced by the provided benchmark results showing enhanced throughput and lower latencies.

Highlights

  • Speculative Decoding V2 Enhancements: Introduced a new DRAFT_EXTEND_V2 forward mode specifically for EAGLEWorkerV2 to optimize the filling of the draft KV cache after verification, improving efficiency in speculative decoding.
  • Optimized Mamba State Update: Implemented a skip_masking flag in the update_mamba_state_after_mtp_verify function. This optimization prevents unnecessary cudaStreamSynchronize calls when accepted_steps are guaranteed to be non-negative, leading to faster Mamba state updates in specific V2 paths.
  • Integration of Mamba State Optimization: Integrated the optimized Mamba state update logic into the EAGLEWorkerV2's verification process for hybrid GDN models, leveraging the new skip_masking=True parameter to enhance performance.
  • Performance Improvements: Benchmark results demonstrate significant performance gains for Spec V2 on this branch compared to Spec V1, with improved request throughput and reduced end-to-end latency.
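Based on the description above, the skip_masking flag likely exploits a common PyTorch pattern: boolean-mask indexing on CUDA tensors internally runs a nonzero(), which forces a cudaStreamSynchronize so the host can learn the output size, while plain integer indexing stays fully asynchronous. A hypothetical sketch of the pattern; the function, arguments, and shapes are illustrative, not the actual SGLang API:

```python
import torch

def select_accepted_state(states, accepted_steps, skip_masking=False):
    """Illustrative sketch (not the actual SGLang code): per request, pick
    the Mamba state after its last accepted draft step.

    states:         [batch, num_steps, state_dim]
    accepted_steps: [batch] int64; may contain -1 ("nothing accepted")
                    unless the caller sets skip_masking=True.
    """
    batch_idx = torch.arange(states.size(0), device=states.device)
    if skip_masking:
        # Caller guarantees accepted_steps >= 0, so plain integer indexing
        # suffices; the kernel is enqueued asynchronously with no host
        # round-trip (no cudaStreamSynchronize).
        return states[batch_idx, accepted_steps]
    # General path: boolean-mask indexing. On CUDA tensors this forces a
    # stream sync, since the host must learn how many rows the mask selects.
    valid = accepted_steps >= 0
    out = states[:, 0].clone()  # fallback: keep the step-0 state
    out[valid] = states[batch_idx[valid], accepted_steps[valid]]
    return out
```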



@gemini-code-assist Bot left a comment

Code Review

This pull request introduces Spec V2 for Qwen 3 next, an optimization that enables overlapping the draft and verify stages in speculative decoding. The changes primarily focus on handling a new forward mode, DRAFT_EXTEND_V2, and optimizing Mamba state updates. While the implementation is mostly sound and demonstrates performance improvements, I have identified a critical bug in the calculation of accepted steps within eagle_worker_v2.py that needs to be addressed. Additionally, I've provided a suggestion to improve code clarity by removing a redundant variable.

Comment thread python/sglang/srt/speculative/eagle_worker_v2.py Outdated
):
# Calculate accepted_steps for mamba state update
# Include the bonus token (+1)
accepted_length_with_bonus = accept_length
Severity: medium

The variable accepted_length_with_bonus is redundant because accept_length already includes the bonus token, as mentioned in the comment on line 764. You can remove this line and use accept_length directly in torch.cumsum on line 770 and in the calculation on line 785. This will make the code clearer and less prone to confusion.
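The suggested simplification amounts to dropping the alias. A hypothetical reconstruction, since only a fragment of the file is quoted here and the tensor values are made up:

```python
import torch

accept_length = torch.tensor([3, 1, 2])  # hypothetical per-request values

# Before: redundant alias; accept_length already includes the bonus token.
accepted_length_with_bonus = accept_length
before = torch.cumsum(accepted_length_with_bonus, dim=0)

# After: use accept_length directly, as the review suggests.
after = torch.cumsum(accept_length, dim=0)

assert torch.equal(before, after)
print(after.tolist())  # [3, 4, 6]
```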

@vincentzed (Author) commented Dec 22, 2025

Benchmark across various batch sizes (max concurrency):

Cmd:

OUTPUT=before_spec_v2.jsonl && rm -f $OUTPUT && \
for ((N=1; N<=128; N*=2)); do \
  python3 -m sglang.bench_serving --backend sglang --flush-cache \
    --dataset-name random --random-input-len 1024 --random-output-len 1024 \
    --random-range-ratio 1.0 --num-prompts $((6*N)) --max-concurrency $N \
    --output-file $OUTPUT; \
done && python test/srt/parse_results.py $OUTPUT

Summary of below:

+-------------+----------------------+---------+---------+-----------+------------------------+
| concurrency | metric               | before  | after   | delta     | gain                   |
+=============+======================+=========+=========+===========+========================+
| 1           | throughput            | 259.9   | 332.7   | +72.8     | +28.0%                 |
|             | p99_ttft_ms           | 332.0   | 206.2   | −125.8    | −37.9%                 |
|             | mean_tpot_ms          | 3.67    | 2.85    | −0.82     | −22.3%                 |
+-------------+----------------------+---------+---------+-----------+------------------------+
| 8           | throughput            | 1237.5  | 1533.9  | +296.4    | +24.0%                 |
|             | p99_ttft_ms           | 500.5   | 482.9   | −17.6     | −3.5%                  |
|             | mean_tpot_ms          | 5.80    | 4.66    | −1.14     | −19.7%                 |
+-------------+----------------------+---------+---------+-----------+------------------------+
| 32          | throughput            | 2874.2  | 3313.8  | +439.6    | +15.3%                 |
|             | p99_ttft_ms           | 702.8   | 764.3   | +61.5     | +8.7% (regression)     |
|             | mean_tpot_ms          | 10.25   | 8.87    | −1.38     | −13.5%                 |
+-------------+----------------------+---------+---------+-----------+------------------------+
| 128         | throughput            | 3414.6  | 3760.6  | +346.0    | +10.1%                 |
|             | p99_ttft_ms           | 26155.6 | 24057.7 | −2097.9   | −8.0%                  |
|             | mean_tpot_ms          | 13.63   | 12.38   | −1.25     | −9.2%                  |
+-------------+----------------------+---------+---------+-----------+------------------------+

Before

+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |            259.911 |             259.911 |        186.638 |          159.810 |       331.980 |          3.666 |            3.530 |         4.365 |               259.911 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             2.000 |            427.746 |             427.746 |        184.218 |          163.498 |       297.547 |          4.414 |            4.205 |         5.888 |               213.873 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |             4.000 |            758.685 |             758.685 |        202.884 |          155.606 |       480.605 |          4.798 |            4.453 |         6.812 |               189.671 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |             8.000 |           1237.523 |            1237.523 |        223.833 |          158.601 |       500.534 |          5.800 |            5.514 |         8.861 |               154.690 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  4 |            16.000 |           1923.267 |            1923.267 |        235.574 |          160.461 |       590.376 |          7.517 |            6.950 |        11.465 |               120.204 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  5 |            32.000 |           2874.232 |            2874.232 |        260.423 |          161.570 |       702.761 |         10.250 |            9.756 |        16.054 |                89.820 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  6 |            64.000 |           3421.672 |            3421.672 |       4652.512 |         5072.364 |      8589.624 |         13.332 |           12.608 |        21.103 |                53.464 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  7 |           128.000 |           3414.624 |            3414.624 |      22230.409 |        24696.099 |     26155.598 |         13.630 |           12.627 |        21.162 |                26.677 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

After


+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|    |   max_concurrency |   input_throughput |   output_throughput |   mean_ttft_ms |   median_ttft_ms |   p99_ttft_ms |   mean_tpot_ms |   median_tpot_ms |   p99_tpot_ms |   per_user_throughput |
+====+===================+====================+=====================+================+==================+===============+================+==================+===============+=======================+
|  0 |             1.000 |            332.667 |             332.667 |        163.454 |          154.984 |       206.172 |          2.846 |            2.718 |         3.276 |               332.667 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  1 |             2.000 |            543.485 |             543.485 |        201.973 |          168.965 |       380.087 |          3.376 |            3.096 |         4.600 |               271.743 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  2 |             4.000 |            991.061 |             991.061 |        191.516 |          152.331 |       414.234 |          3.672 |            3.391 |         5.266 |               247.765 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  3 |             8.000 |           1533.888 |            1533.888 |        241.940 |          164.845 |       482.935 |          4.656 |            4.239 |         7.163 |               191.736 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  4 |            16.000 |           2302.798 |            2302.798 |        241.296 |          163.455 |       545.563 |          6.222 |            5.759 |         9.776 |               143.925 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  5 |            32.000 |           3313.815 |            3313.815 |        300.407 |          175.229 |       764.259 |          8.869 |            8.493 |        14.249 |               103.557 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  6 |            64.000 |           3890.366 |            3890.366 |       4120.522 |         4512.661 |      7544.377 |         11.741 |           11.151 |        18.609 |                60.787 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+
|  7 |           128.000 |           3760.559 |            3760.559 |      20254.542 |        22428.491 |     24057.687 |         12.378 |           11.476 |        19.190 |                29.379 |
+----+-------------------+--------------------+---------------------+----------------+------------------+---------------+----------------+------------------+---------------+-----------------------+

@vincentzed (Author):

/tag-and-rerun-ci

@vincentzed (Author):

/rerun-failed-ci

1 similar comment
@vincentzed (Author):

/rerun-failed-ci

Comment thread python/sglang/srt/speculative/eagle_worker_v2.py Outdated
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@vincentzed (Author):

@yizhang2077, could you take another look?

@whybeyoung (Collaborator):

LGTM

Comment thread python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
@vincentzed (Author):

@yizhang2077 Hello, can you give it an approval? Then, in #18808, we can apply the improvements on top.

@b8zhong b8zhong closed this Feb 28, 2026
@b8zhong b8zhong deleted the vz/clean-spec-v2 branch March 9, 2026 17:04

5 participants