## Summary
DeepSeek R1 (NVFP4) prefill throughput with Expert Parallelism (EP) has regressed by ~6x between v0.5.9 and dev-0401.
- v0.5.8 DEP4 (TP4+DP4+EP4): 60,398 tok/s
- v0.5.9 DEP4 (TP4+DP4+EP4): 57,877 tok/s ← still healthy
- dev-0401 DEP4 (TP4+DP4+EP4): 9,530 tok/s ← 6.1x slower
DP4 (the same config but without EP, `ep_size=1`) shows no regression across all three versions (~40k tok/s), confirming the regression is isolated to the EP code path and was introduced after v0.5.9.
## Environment
- Hardware: 4x NVIDIA GB200 (single node, NVLink)
- Model: DeepSeek R1 NVFP4 (`deepseek-ai/DeepSeek-R1`, 671B, 256 experts)
- Quantization: `modelopt_fp4`, `kv-cache-dtype: fp8_e4m3`
- Mode: Aggregated (agg) serving, prefill-only benchmark (ISL=1000, OSL=1)
## Container versions tested

| Version  | Container                           | SGLang source                        |
|----------|-------------------------------------|--------------------------------------|
| v0.5.8   | `lmsysorg/sglang:v0.5.8`            | built-in                             |
| v0.5.9   | `lmsysorg/sglang:v0.5.9`            | built-in                             |
| dev-0401 | `lmsysorg/sglang:dev` (2026-04-01)  | mounted sglang source @ `19d7f2d40`  |
## Reproduction

### Server launch args (common)
```bash
python -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1-NVFP4 \
  --served-model-name deepseek-ai/DeepSeek-R1 \
  --trust-remote-code \
  --attention-backend trtllm_mla \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --moe-runner-backend flashinfer_cutlass \
  --fp4-gemm-backend flashinfer_cutlass \
  --disable-radix-cache \
  --max-running-requests 1024 \
  --mem-fraction-static 0.80 \
  --chunked-prefill-size 81920 \
  --max-prefill-tokens 81920 \
  --context-length 1100 \
  --cuda-graph-max-bs 1024 \
  --tensor-parallel-size 4 \
  --data-parallel-size 4 \
  --enable-dp-attention
```
### DEP4 (with EP — the regressed config)

Add `--expert-parallel-size 4` to the above args.
### DP4 (without EP — the baseline, no regression)

No additional args (`ep_size` defaults to 1).
### Benchmark command

```bash
python benchmark_serving.py \
  --backend sglang \
  --model deepseek-ai/DeepSeek-R1 \
  --tokenizer /path/to/DeepSeek-R1-NVFP4 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1 \
  --random-range-ratio 1.0 \
  --num-prompts 12288 \
  --request-rate inf \
  --max-concurrency 4096 \
  --ignore-eos \
  --disable-tqdm
```
Note: SGLang automatically adjusts `chunked_prefill_size` to 20480 when `--enable-dp-attention` is set ("adjusted to avoid MoE kernel issues"). All tested versions apply this adjustment, so the actual `chunked_prefill_size` is 20480 in all runs.
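For clarity, the adjustment behaves like the clamp below. This is a hypothetical sketch written from the observed behavior, not SGLang's actual server-args code; the function name and constant are made up for illustration.

```python
# Hypothetical sketch of the chunked-prefill clamp described above.
# The constant and function name are illustrative; SGLang's real logic
# lives in its server-args handling and may differ in detail.
DP_ATTENTION_CHUNKED_PREFILL_CAP = 20480  # value observed in the server log

def effective_chunked_prefill_size(requested: int, enable_dp_attention: bool) -> int:
    """Return the chunked_prefill_size the server actually uses."""
    if enable_dp_attention:
        # "adjusted to avoid MoE kernel issues"
        return min(requested, DP_ATTENTION_CHUNKED_PREFILL_CAP)
    return requested

print(effective_chunked_prefill_size(81920, True))   # 20480 — all runs here
print(effective_chunked_prefill_size(81920, False))  # 81920
```

This means the requested `--chunked-prefill-size 81920` is never in effect for these benchmarks, so it cannot explain the DEP4-only regression.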
## Results

DeepSeek R1 NVFP4 (4x GB200, agg mode, prefill throughput)

| Config                | v0.5.8       | v0.5.9       | dev-0401    | Delta (v0.5.9 → dev-0401) |
|-----------------------|--------------|--------------|-------------|----------------------------|
| DP4 (tp4 dp4, no EP)  | 40,437 tok/s | 40,795 tok/s | 39,620 tok/s | -3% (no regression)       |
| DEP4 (tp4 dp4 ep4)    | 60,398 tok/s | 57,877 tok/s | 9,530 tok/s  | -84% (6.1x regression)    |
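The delta figures follow directly from the raw throughputs; a quick arithmetic check:

```python
# Recompute the regression figures from the raw throughput numbers above.
v059_dep4, dev_dep4 = 57_877, 9_530   # DEP4 tok/s, v0.5.9 vs dev-0401
v059_dp4, dev_dp4 = 40_795, 39_620    # DP4 tok/s, v0.5.9 vs dev-0401

dep4_slowdown = v059_dep4 / dev_dep4                        # ~6.07x
dep4_delta_pct = (dev_dep4 - v059_dep4) / v059_dep4 * 100   # ~-83.5%
dp4_delta_pct = (dev_dp4 - v059_dp4) / v059_dp4 * 100       # ~-2.9%

print(f"DEP4: {dep4_slowdown:.1f}x slower ({dep4_delta_pct:.0f}%)")  # 6.1x, -84%
print(f"DP4: {dp4_delta_pct:.0f}% (noise)")                          # -3%
```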
## Expected behavior
DEP4 should achieve at least the same throughput as DP4 (~40k tok/s), ideally matching v0.5.9 performance (~58k tok/s) where EP provided 1.4x speedup.