[NVIDIA] Support flashinfer a2a with flashinfer_trtllm_routed moe by trevor-m · Pull Request #22394 · sgl-project/sglang

trevor-m · 2026-04-08T23:28:31Z

Motivation

Flashinfer a2a can now be used in combination with flashinfer_trtllm_routed moe.

Modifications

Add a "fused_func" path from the flashinfer dispatcher to the flashinfer_trtllm_routed MoE runner.
Support FP4 quantization before communication.
Remove the dummy-token workaround for zero local tokens in the flashinfer dispatcher; flashinfer a2a now supports this case.
Update server argument validation.
Add FP4 and FP8 end-to-end tests.
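
As a rough illustration of why quantizing to FP4 before communication can matter, the sketch below estimates the per-token dispatch payload. It is not code from this PR. It assumes a hidden size of 7168 (DeepSeek-R1), BF16 activations at 2 bytes per element, and an NVFP4-style layout with two 4-bit values per byte plus one 1-byte scale per 16-element block. The function names are hypothetical and are not part of sglang or flashinfer.

```python
# Back-of-the-envelope payload estimate; all constants are assumptions.
HIDDEN_SIZE = 7168  # assumed hidden dimension (DeepSeek-R1)
FP4_BLOCK = 16      # assumed number of elements sharing one 1-byte scale


def bf16_bytes_per_token(hidden: int) -> int:
    """Bytes needed to send one token's hidden state in BF16."""
    return hidden * 2


def fp4_bytes_per_token(hidden: int, block: int = FP4_BLOCK) -> int:
    """Bytes needed for packed 4-bit values plus per-block scales."""
    packed_values = hidden // 2   # two 4-bit values per byte
    block_scales = hidden // block  # one 1-byte scale per block
    return packed_values + block_scales


bf16_payload = bf16_bytes_per_token(HIDDEN_SIZE)
fp4_payload = fp4_bytes_per_token(HIDDEN_SIZE)
print(f"BF16: {bf16_payload} B/token, FP4: {fp4_payload} B/token, "
      f"reduction: {bf16_payload / fp4_payload:.2f}x")
```

Under these assumptions the dispatch payload shrinks by roughly 3.5x per token. The actual layout used by the dispatcher may differ.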

Accuracy Tests

FP8

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --tp 4 --moe-a2a-backend flashinfer --moe-runner-backend flashinfer_trtllm_routed --ep 4 --dp 4 --enable-dp-attention --quantization fp8 --mem-fraction-static 0.95 --max-running-requests 512
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.984
Invalid: 0.000
Latency: 102.780 s
Output throughput: 1170.071 token/s

FP4

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-v2 --tp 4 --moe-a2a-backend flashinfer --moe-runner-backend flashinfer_trtllm_routed --ep 4 --dp 4 --enable-dp-attention --quantization modelopt_fp4
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.978
Invalid: 0.000
Latency: 16.056 s
Output throughput: 7645.199 token/s
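
To put the FP8 (0.984) and FP4 (0.978) accuracy figures side by side, the sketch below estimates the sampling noise on each number. It assumes all 1209 questions were scored and treats each question as an independent Bernoulli trial. Both runs use the same question set, so treating the two runs as independent is only a crude approximation.

```python
import math

NUM_QUESTIONS = 1209  # from --num-questions; assumes every question was scored
FP8_ACCURACY = 0.984
FP4_ACCURACY = 0.978


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n Bernoulli trials."""
    return math.sqrt(p * (1.0 - p) / n)


se_fp8 = binomial_standard_error(FP8_ACCURACY, NUM_QUESTIONS)
se_fp4 = binomial_standard_error(FP4_ACCURACY, NUM_QUESTIONS)

gap = FP8_ACCURACY - FP4_ACCURACY
# Crude combined error that treats the two runs as independent samples.
se_gap = math.sqrt(se_fp8 ** 2 + se_fp4 ** 2)

print(f"FP8 SE: {se_fp8:.4f}, FP4 SE: {se_fp4:.4f}")
print(f"gap: {gap:.3f} ({gap / se_gap:.2f} combined standard errors)")
```

Under these assumptions the 0.006 gap is about one combined standard error, so on its own it does not separate the two configurations.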

Speed Tests and Profiling

GB200 FP4 Disagg 4xDEP4 Prefill + 1xDEP32 Decode srt-slurm config

Maximum request concurrency: 512
============ Serving Benchmark Result ============
Successful requests:                     5120   
Benchmark duration (s):                  166.88
Total input tokens:                      4717859 
Total generated tokens:                  4722209  
Request throughput (req/s):              30.68
Output token throughput (tok/s):         28296.74
Total Token throughput (tok/s):          56567.42
---------------Time to First Token----------------
Mean TTFT (ms):                          818.76 
Median TTFT (ms):                        638.47 
P99 TTFT (ms):                           3433.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.44   
Median TPOT (ms):                        16.33   
P99 TPOT (ms):                           17.42   
---------------Inter-token Latency----------------
Mean ITL (ms):                           800.45
Median ITL (ms):                         805.71
P99 ITL (ms):                            1058.22
----------------End-to-end Latency----------------
Mean E2EL (ms):                          15969.19  
Median E2EL (ms):                        15930.51   
P99 E2EL (ms):                           19609.53  
==================================================

Maximum request concurrency: 2048
============ Serving Benchmark Result ============
Successful requests:                     20480
Benchmark duration (s):                  333.05
Total input tokens:                      18880692
Total generated tokens:                  18888974
Request throughput (req/s):              61.49
Output token throughput (tok/s):         56715.98
Total Token throughput (tok/s):          113407.09
---------------Time to First Token----------------
Mean TTFT (ms):                          1461.76
Median TTFT (ms):                        723.74
P99 TTFT (ms):                           12048.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.50
Median TPOT (ms):                        33.54
P99 TPOT (ms):                           35.81
---------------Inter-token Latency----------------
Mean ITL (ms):                           1631.86
Median ITL (ms):                         1695.71
P99 ITL (ms):                            2038.74
----------------End-to-end Latency----------------
Mean E2EL (ms):                          32327.96
Median E2EL (ms):                        31939.77
P99 E2EL (ms):                           45540.03
==================================================

Maximum request concurrency: 4096
============ Serving Benchmark Result ============
Successful requests:                     40960
Benchmark duration (s):                  513.58
Total input tokens:                      37769666
Total generated tokens:                  37742239
Request throughput (req/s):              79.75
Output token throughput (tok/s):         73488.77
Total Token throughput (tok/s):          147030.95
---------------Time to First Token----------------
Mean TTFT (ms):                          2173.38
Median TTFT (ms):                        771.77
P99 TTFT (ms):                           22110.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.97
Median TPOT (ms):                        52.47
P99 TPOT (ms):                           56.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           2530.38
Median ITL (ms):                         2675.99
P99 ITL (ms):                            4023.23
----------------End-to-end Latency----------------
Mean E2EL (ms):                          50001.87
Median E2EL (ms):                        49519.06
P99 E2EL (ms):                           69921.02
==================================================
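
As a quick consistency check on the three result blocks above, the sketch below recomputes output token throughput as generated tokens divided by benchmark duration and compares it with the reported value. The numbers are copied from the logs. The helper names are illustrative only and are not part of the benchmark tooling.

```python
# (max concurrency, duration in s, generated tokens, reported output tok/s)
FLASHINFER_RUNS = [
    (512, 166.88, 4722209, 28296.74),
    (2048, 333.05, 18888974, 56715.98),
    (4096, 513.58, 37742239, 73488.77),
]


def derived_output_throughput(generated_tokens: int, duration_s: float) -> float:
    """Output tokens per second implied by the totals in the log."""
    return generated_tokens / duration_s


def relative_error(derived: float, reported: float) -> float:
    """Relative difference between a derived and a reported value."""
    return abs(derived - reported) / reported


for concurrency, duration_s, generated, reported in FLASHINFER_RUNS:
    derived = derived_output_throughput(generated, duration_s)
    print(f"concurrency={concurrency}: derived={derived:.2f} tok/s, "
          f"reported={reported:.2f} tok/s, "
          f"rel_err={relative_error(derived, reported):.1e}")
```

The small differences are consistent with the duration being rounded to two decimals in the log.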

Baseline (deepep + flashinfer_cutedsl + SBO) srt-slurm config

Maximum request concurrency: 512
============ Serving Benchmark Result ============
Successful requests:                     5120
Benchmark duration (s):                  154.96
Total input tokens:                      4717859
Total generated tokens:                  4722209
Request throughput (req/s):              33.04
Output token throughput (tok/s):         30473.36
Total Token throughput (tok/s):          60918.64
---------------Time to First Token----------------
Mean TTFT (ms):                          857.72
Median TTFT (ms):                        670.33
P99 TTFT (ms):                           3632.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.24
Median TPOT (ms):                        15.24
P99 TPOT (ms):                           15.66
---------------Inter-token Latency----------------
Mean ITL (ms):                           742.04
Median ITL (ms):                         759.97
P99 ITL (ms):                            920.63
----------------End-to-end Latency----------------
Mean E2EL (ms):                          14902.56
Median E2EL (ms):                        14830.25
P99 E2EL (ms):                           18559.29
==================================================

Maximum request concurrency: 2048
============ Serving Benchmark Result ============
Successful requests:                     20480   
Benchmark duration (s):                  186.33  
Total input tokens:                      18880692
Total generated tokens:                  18888974 
Request throughput (req/s):              109.91
Output token throughput (tok/s):         101374.98
Total Token throughput (tok/s):          202705.51
---------------Time to First Token----------------
Mean TTFT (ms):                          1617.71
Median TTFT (ms):                        786.29
P99 TTFT (ms):                           11880.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.58   
Median TPOT (ms):                        17.56   
P99 TPOT (ms):                           18.43   
---------------Inter-token Latency----------------
Mean ITL (ms):                           856.31
Median ITL (ms):                         871.37
P99 ITL (ms):                            1543.57
----------------End-to-end Latency----------------
Mean E2EL (ms):                          17814.64  
Median E2EL (ms):                        17273.63  
P99 E2EL (ms):                           28485.44  
==================================================

Maximum request concurrency: 4096
============ Serving Benchmark Result ============
Successful requests:                     40960
Benchmark duration (s):                  324.86
Total input tokens:                      37769666
Total generated tokens:                  37742239
Request throughput (req/s):              126.08
Output token throughput (tok/s):         116179.28
Total Token throughput (tok/s):          232442.98
---------------Time to First Token----------------
Mean TTFT (ms):                          13768.68
Median TTFT (ms):                        15420.10
P99 TTFT (ms):                           22753.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.25
Median TPOT (ms):                        18.87
P99 TPOT (ms):                           25.37
---------------Inter-token Latency----------------
Mean ITL (ms):                           937.55
Median ITL (ms):                         884.23
P99 ITL (ms):                            2456.50
----------------End-to-end Latency----------------
Mean E2EL (ms):                          31489.99
Median E2EL (ms):                        31819.01
P99 E2EL (ms):                           44965.56
==================================================
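
To make the two configurations easier to compare, the sketch below divides the reported output token throughput of the flashinfer a2a + flashinfer_trtllm_routed runs by the baseline (deepep + flashinfer_cutedsl + SBO) runs at each concurrency. The numbers are copied from the logs above. The helper name is illustrative only.

```python
# max concurrency -> (flashinfer a2a output tok/s, baseline output tok/s)
REPORTED_OUTPUT_TOK_S = {
    512: (28296.74, 30473.36),
    2048: (56715.98, 101374.98),
    4096: (73488.77, 116179.28),
}


def throughput_ratio(candidate: float, baseline: float) -> float:
    """Candidate throughput as a fraction of the baseline throughput."""
    return candidate / baseline


ratios = {
    concurrency: throughput_ratio(candidate, baseline)
    for concurrency, (candidate, baseline) in REPORTED_OUTPUT_TOK_S.items()
}

for concurrency, ratio in ratios.items():
    print(f"concurrency={concurrency}: {ratio:.3f}x of baseline output throughput")
```

A ratio below 1 means the flashinfer a2a configuration produced fewer output tokens per second than the baseline in these particular runs.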

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-04-08T23:28:36Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

trevor-m added 5 commits April 8, 2026 13:19

a782ea9 Support flashinfer_trtllm_routed with flashinfer a2a
dcecf16 fix
0ba3f4b Remove dummy token
c2d0b21 Add test
8e5f392 Fixes

trevor-m requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners April 8, 2026 23:28

nvpohanh mentioned this pull request Apr 17, 2026

[Tracking] Qwen3.5-397B (G)B200 Functional Support and Optimizations #20024

Open

trevor-m wants to merge 5 commits into sgl-project:main from trevor-m:a2a-trtllm-routed
