[NVIDIA] Support flashinfer a2a with flashinfer_trtllm_routed moe #22394

Open
trevor-m wants to merge 5 commits into sgl-project:main from trevor-m:a2a-trtllm-routed

@trevor-m trevor-m commented Apr 8, 2026

Motivation

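The flashinfer a2a (all-to-all) backend can now be used together with the flashinfer_trtllm_routed moe runner backend, i.e. --moe-a2a-backend flashinfer combined with --moe-runner-backend flashinfer_trtllm_routed.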

Modifications

  • Adds a "fused_func" for the flashinfer dispatcher -> flashinfer_trtllm_routed moe runner path
  • Supports FP4 quantization before communication
  • Removes the dummy-token workaround for zero local tokens in the flashinfer dispatcher, since flashinfer a2a now supports this case
  • Updates server arg validation
  • Adds FP4 and FP8 e2e tests
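
To illustrate why quantizing to FP4 before communication matters, here is a rough back-of-the-envelope sketch of the per-token dispatch payload. It is not code from this PR. It assumes a hidden size of 7168 (DeepSeek-R1), a BF16 baseline, and an NVFP4-style layout with 4-bit values plus one 1-byte scale per 16-element block; the helper name payload_bytes_per_token is hypothetical.

```python
def payload_bytes_per_token(hidden_size, bits_per_elem, block_size=None, scale_bytes=0):
    """Bytes sent per token for one hidden-state vector (hypothetical helper)."""
    data = hidden_size * bits_per_elem // 8
    # Block-scaled formats also ship one scale factor per block of elements.
    scales = (hidden_size // block_size) * scale_bytes if block_size else 0
    return data + scales


HIDDEN = 7168  # assumption: DeepSeek-R1 hidden size

# Assumption: the unquantized activations would be sent as BF16 (16 bits each).
bf16_bytes = payload_bytes_per_token(HIDDEN, 16)

# Assumption: NVFP4-style layout, 4-bit values with a 1-byte scale per 16 values.
fp4_bytes = payload_bytes_per_token(HIDDEN, 4, block_size=16, scale_bytes=1)

print(f"bf16: {bf16_bytes} B/token, fp4: {fp4_bytes} B/token, "
      f"reduction: {bf16_bytes / fp4_bytes:.2f}x")
```

Under these assumptions the a2a payload per token shrinks by roughly 3.5x. The actual layout and any padding used by the dispatcher may differ.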

Accuracy Tests

FP8

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-0528 --tp 4 --moe-a2a-backend flashinfer --moe-runner-backend flashinfer_trtllm_routed --ep 4 --dp 4 --enable-dp-attention --quantization fp8 --mem-fraction-static 0.95 --max-running-requests 512
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.984
Invalid: 0.000
Latency: 102.780 s
Output throughput: 1170.071 token/s

FP4

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4-v2 --tp 4 --moe-a2a-backend flashinfer --moe-runner-backend flashinfer_trtllm_routed --ep 4 --dp 4 --enable-dp-attention --quantization modelopt_fp4
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.978
Invalid: 0.000
Latency: 16.056 s
Output throughput: 7645.199 token/s

Speed Tests and Profiling

GB200 FP4 Disagg 4xDEP4 Prefill + 1xDEP32 Decode srt-slurm config

Maximum request concurrency: 512
============ Serving Benchmark Result ============
Successful requests:                     5120   
Benchmark duration (s):                  166.88
Total input tokens:                      4717859 
Total generated tokens:                  4722209  
Request throughput (req/s):              30.68
Output token throughput (tok/s):         28296.74
Total Token throughput (tok/s):          56567.42
---------------Time to First Token----------------
Mean TTFT (ms):                          818.76 
Median TTFT (ms):                        638.47 
P99 TTFT (ms):                           3433.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.44   
Median TPOT (ms):                        16.33   
P99 TPOT (ms):                           17.42   
---------------Inter-token Latency----------------
Mean ITL (ms):                           800.45
Median ITL (ms):                         805.71
P99 ITL (ms):                            1058.22
----------------End-to-end Latency----------------
Mean E2EL (ms):                          15969.19  
Median E2EL (ms):                        15930.51   
P99 E2EL (ms):                           19609.53  
==================================================

Maximum request concurrency: 2048
============ Serving Benchmark Result ============
Successful requests:                     20480
Benchmark duration (s):                  333.05
Total input tokens:                      18880692
Total generated tokens:                  18888974
Request throughput (req/s):              61.49
Output token throughput (tok/s):         56715.98
Total Token throughput (tok/s):          113407.09
---------------Time to First Token----------------
Mean TTFT (ms):                          1461.76
Median TTFT (ms):                        723.74
P99 TTFT (ms):                           12048.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.50
Median TPOT (ms):                        33.54
P99 TPOT (ms):                           35.81
---------------Inter-token Latency----------------
Mean ITL (ms):                           1631.86
Median ITL (ms):                         1695.71
P99 ITL (ms):                            2038.74
----------------End-to-end Latency----------------
Mean E2EL (ms):                          32327.96
Median E2EL (ms):                        31939.77
P99 E2EL (ms):                           45540.03
==================================================

Maximum request concurrency: 4096
============ Serving Benchmark Result ============
Successful requests:                     40960
Benchmark duration (s):                  513.58
Total input tokens:                      37769666
Total generated tokens:                  37742239
Request throughput (req/s):              79.75
Output token throughput (tok/s):         73488.77
Total Token throughput (tok/s):          147030.95
---------------Time to First Token----------------
Mean TTFT (ms):                          2173.38
Median TTFT (ms):                        771.77
P99 TTFT (ms):                           22110.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.97
Median TPOT (ms):                        52.47
P99 TPOT (ms):                           56.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           2530.38
Median ITL (ms):                         2675.99
P99 ITL (ms):                            4023.23
----------------End-to-end Latency----------------
Mean E2EL (ms):                          50001.87
Median E2EL (ms):                        49519.06
P99 E2EL (ms):                           69921.02
==================================================

Baseline (deepep + flashinfer_cutedsl + SBO) srt-slurm config

Maximum request concurrency: 512
============ Serving Benchmark Result ============
Successful requests:                     5120
Benchmark duration (s):                  154.96
Total input tokens:                      4717859
Total generated tokens:                  4722209
Request throughput (req/s):              33.04
Output token throughput (tok/s):         30473.36
Total Token throughput (tok/s):          60918.64
---------------Time to First Token----------------
Mean TTFT (ms):                          857.72
Median TTFT (ms):                        670.33
P99 TTFT (ms):                           3632.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.24
Median TPOT (ms):                        15.24
P99 TPOT (ms):                           15.66
---------------Inter-token Latency----------------
Mean ITL (ms):                           742.04
Median ITL (ms):                         759.97
P99 ITL (ms):                            920.63
----------------End-to-end Latency----------------
Mean E2EL (ms):                          14902.56
Median E2EL (ms):                        14830.25
P99 E2EL (ms):                           18559.29
==================================================

Maximum request concurrency: 2048
============ Serving Benchmark Result ============
Successful requests:                     20480   
Benchmark duration (s):                  186.33  
Total input tokens:                      18880692
Total generated tokens:                  18888974 
Request throughput (req/s):              109.91
Output token throughput (tok/s):         101374.98
Total Token throughput (tok/s):          202705.51
---------------Time to First Token----------------
Mean TTFT (ms):                          1617.71
Median TTFT (ms):                        786.29
P99 TTFT (ms):                           11880.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.58   
Median TPOT (ms):                        17.56   
P99 TPOT (ms):                           18.43   
---------------Inter-token Latency----------------
Mean ITL (ms):                           856.31
Median ITL (ms):                         871.37
P99 ITL (ms):                            1543.57
----------------End-to-end Latency----------------
Mean E2EL (ms):                          17814.64  
Median E2EL (ms):                        17273.63  
P99 E2EL (ms):                           28485.44  
==================================================

Maximum request concurrency: 4096
============ Serving Benchmark Result ============
Successful requests:                     40960
Benchmark duration (s):                  324.86
Total input tokens:                      37769666
Total generated tokens:                  37742239
Request throughput (req/s):              126.08
Output token throughput (tok/s):         116179.28
Total Token throughput (tok/s):          232442.98
---------------Time to First Token----------------
Mean TTFT (ms):                          13768.68
Median TTFT (ms):                        15420.10
P99 TTFT (ms):                           22753.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.25
Median TPOT (ms):                        18.87
P99 TPOT (ms):                           25.37
---------------Inter-token Latency----------------
Mean ITL (ms):                           937.55
Median ITL (ms):                         884.23
P99 ITL (ms):                            2456.50
----------------End-to-end Latency----------------
Mean E2EL (ms):                          31489.99
Median E2EL (ms):                        31819.01
P99 E2EL (ms):                           44965.56
==================================================
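
For a quick side-by-side view of the two configs, the output token throughput numbers reported above can be compared per concurrency level. This is only a small arithmetic sketch over the figures in this description (single runs each); the dictionary and function names are hypothetical.

```python
# Output token throughput (tok/s), copied from the benchmark results above.
# Keys are the "Maximum request concurrency" values.
pr_tput = {512: 28296.74, 2048: 56715.98, 4096: 73488.77}
baseline_tput = {512: 30473.36, 2048: 101374.98, 4096: 116179.28}


def relative_throughput(candidate, baseline):
    """Return candidate/baseline throughput ratio for each concurrency level."""
    return {c: candidate[c] / baseline[c] for c in sorted(candidate)}


ratios = relative_throughput(pr_tput, baseline_tput)
for concurrency, ratio in ratios.items():
    print(f"concurrency {concurrency}: {ratio:.3f}x baseline output throughput")
```

A ratio below 1.0 means the flashinfer a2a + flashinfer_trtllm_routed config produced fewer output tokens per second than the deepep + flashinfer_cutedsl + SBO baseline at that concurrency.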

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.