
[Performance][Small-EP] 6x prefill throughput regression with Expert Parallelism on DeepSeek R1 between v0.5.9 and dev #22379

@YAMY1234

Description


Summary

DeepSeek R1 (NVFP4) prefill throughput with Expert Parallelism (EP) has regressed by ~6x between v0.5.9 and dev-0401.

  • v0.5.8 DEP4 (TP4+DP4+EP4): 60,398 tok/s
  • v0.5.9 DEP4 (TP4+DP4+EP4): 57,877 tok/s ← still healthy
  • dev-0401 DEP4 (TP4+DP4+EP4): 9,530 tok/s ← 6.1x slower

DP4 (the same configuration without EP, i.e. ep_size=1) shows no regression across all three versions (~40k tok/s). This confirms the regression is isolated to the EP code path and was introduced after v0.5.9.

Environment

  • Hardware: 4x NVIDIA GB200 (single node, NVLink)
  • Model: DeepSeek R1 NVFP4 (deepseek-ai/DeepSeek-R1, 671B, 256 experts)
  • Quantization: modelopt_fp4, kv-cache-dtype: fp8_e4m3
  • Mode: Aggregated (agg) serving, prefill-only benchmark (ISL=1000, OSL=1)

Container versions tested

| Version | Container | SGLang source |
|---|---|---|
| v0.5.8 | lmsysorg/sglang:v0.5.8 | built-in |
| v0.5.9 | lmsysorg/sglang:v0.5.9 | built-in |
| dev-0401 | lmsysorg/sglang:dev (2026-04-01) | mounted sglang source @ 19d7f2d40 |

Reproduction

Server launch args (common)

```shell
python -m sglang.launch_server \
    --model-path /path/to/DeepSeek-R1-NVFP4 \
    --served-model-name deepseek-ai/DeepSeek-R1 \
    --trust-remote-code \
    --attention-backend trtllm_mla \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8_e4m3 \
    --moe-runner-backend flashinfer_cutlass \
    --fp4-gemm-backend flashinfer_cutlass \
    --disable-radix-cache \
    --max-running-requests 1024 \
    --mem-fraction-static 0.80 \
    --chunked-prefill-size 81920 \
    --max-prefill-tokens 81920 \
    --context-length 1100 \
    --cuda-graph-max-bs 1024 \
    --tensor-parallel-size 4 \
    --data-parallel-size 4 \
    --enable-dp-attention
```

DEP4 (with EP — the regressed config)

Add --expert-parallel-size 4 to the above args.

DP4 (without EP — the baseline, no regression)

No additional args (ep_size defaults to 1).
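To keep the two configs directly comparable, the only difference between the DP4 and DEP4 launch commands can be expressed as a single appended flag. A minimal sketch (the variable names here are illustrative; the full argument list is given above):

```shell
#!/usr/bin/env bash
# Common launch args shared by DP4 and DEP4 (abbreviated; see the full list above).
COMMON_ARGS="--tensor-parallel-size 4 --data-parallel-size 4 --enable-dp-attention"

# DEP4 adds only the EP flag; DP4 passes nothing extra (ep_size defaults to 1).
EP_ARGS="--expert-parallel-size 4"

echo "DP4 args:  $COMMON_ARGS"
echo "DEP4 args: $COMMON_ARGS $EP_ARGS"
```

Keeping the delta to one flag rules out any other launch-arg difference as the cause of the regression.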

Benchmark command

```shell
python benchmark_serving.py \
    --backend sglang \
    --model deepseek-ai/DeepSeek-R1 \
    --tokenizer /path/to/DeepSeek-R1-NVFP4 \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 12288 \
    --request-rate inf \
    --max-concurrency 4096 \
    --ignore-eos \
    --disable-tqdm
```

Note: SGLang automatically adjusts chunked_prefill_size to 20480 when enable-dp-attention is set ("adjusted to avoid MoE kernel issues"). Both v0.5.8 and dev apply this adjustment, so the actual chunked_prefill_size is 20480 in all runs.
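Given the adjusted chunked_prefill_size of 20480 and the benchmark's fixed ISL of 1000, each prefill chunk holds the same number of full prompts in every run, so chunk batching is identical across versions. A quick sanity check:

```python
# Effective prefill batch sizing under the adjusted chunk size (sanity check only).
chunked_prefill_size = 20480  # value SGLang adjusts to when --enable-dp-attention is set
isl = 1000                    # --random-input-len used in the benchmark

prompts_per_chunk = chunked_prefill_size // isl
print(prompts_per_chunk)  # -> 20 full 1000-token prompts per prefill chunk
```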

Results

DeepSeek R1 NVFP4 (4x GB200, agg mode, prefill throughput)

| Config | v0.5.8 | v0.5.9 | dev-0401 | Delta (v0.5.9 → dev-0401) |
|---|---|---|---|---|
| DP4 (tp4 dp4, no EP) | 40,437 tok/s | 40,795 tok/s | 39,620 tok/s | -3% (no regression) |
| DEP4 (tp4 dp4 ep4) | 60,398 tok/s | 57,877 tok/s | 9,530 tok/s | -84% (6.1x regression) |
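The deltas quoted in the table follow directly from the raw throughput numbers:

```python
# Throughput figures (tok/s) copied from the results table above.
dp4  = {"v0.5.8": 40437, "v0.5.9": 40795, "dev-0401": 39620}
dep4 = {"v0.5.8": 60398, "v0.5.9": 57877, "dev-0401": 9530}

def delta(old, new):
    """Relative change from old to new, as a percentage."""
    return 100.0 * (new - old) / old

print(f"DP4  v0.5.9 -> dev-0401: {delta(dp4['v0.5.9'], dp4['dev-0401']):+.0f}%")    # -3%
print(f"DEP4 v0.5.9 -> dev-0401: {delta(dep4['v0.5.9'], dep4['dev-0401']):+.0f}%")  # -84%
print(f"DEP4 slowdown factor: {dep4['v0.5.9'] / dep4['dev-0401']:.1f}x")            # 6.1x
```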

Expected behavior

DEP4 should achieve at least the same throughput as DP4 (~40k tok/s), and ideally match v0.5.9 performance (~58k tok/s), where EP provided a ~1.4x speedup over DP4.
