
[Performance][Small-EP] 6x prefill throughput regression with Expert Parallelism on DeepSeek R1 between v0.5.9 and dev #22379

@YAMY1234

Description


Summary

DeepSeek R1 (NVFP4) prefill throughput with Expert Parallelism (EP) has regressed by ~6x between v0.5.9 and dev-0401.

  • v0.5.8 DEP4 (TP4+DP4+EP4): 60,398 tok/s
  • v0.5.9 DEP4 (TP4+DP4+EP4): 57,877 tok/s ← still healthy
  • dev-0401 DEP4 (TP4+DP4+EP4): 9,530 tok/s ← 6.1x slower

DP4 (the same configuration without EP, i.e. ep_size=1) shows no regression across all three versions (~40k tok/s). This confirms the regression is isolated to the EP code path and was introduced after v0.5.9.

Environment

  • Hardware: 4x NVIDIA GB200 (single node, NVLink)
  • Model: DeepSeek R1 NVFP4 (deepseek-ai/DeepSeek-R1, 671B, 256 experts)
  • Quantization: modelopt_fp4, kv-cache-dtype: fp8_e4m3
  • Mode: Aggregated (agg) serving, prefill-only benchmark (ISL=1000, OSL=1)

Container versions tested

| Version | Container | SGLang source |
|---|---|---|
| v0.5.8 | lmsysorg/sglang:v0.5.8 | built-in |
| v0.5.9 | lmsysorg/sglang:v0.5.9 | built-in |
| dev-0401 | lmsysorg/sglang:dev (2026-04-01) | mounted sglang source @ 19d7f2d40 |

Reproduction

Server launch args (common)

```shell
python -m sglang.launch_server \
    --model-path /path/to/DeepSeek-R1-NVFP4 \
    --served-model-name deepseek-ai/DeepSeek-R1 \
    --trust-remote-code \
    --attention-backend trtllm_mla \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8_e4m3 \
    --moe-runner-backend flashinfer_cutlass \
    --fp4-gemm-backend flashinfer_cutlass \
    --disable-radix-cache \
    --max-running-requests 1024 \
    --mem-fraction-static 0.80 \
    --chunked-prefill-size 81920 \
    --max-prefill-tokens 81920 \
    --context-length 1100 \
    --cuda-graph-max-bs 1024 \
    --tensor-parallel-size 4 \
    --data-parallel-size 4 \
    --enable-dp-attention
```

DEP4 (with EP — the regressed config)

Add --expert-parallel-size 4 to the above args.

DP4 (without EP — the baseline, no regression)

No additional args (ep_size defaults to 1).
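To keep the two configs directly comparable, the only difference between the DP4 and DEP4 launch commands can be expressed as a single appended flag. A minimal sketch (the variable names here are illustrative; the full argument list is given above):

```shell
#!/usr/bin/env bash
# Common launch args shared by DP4 and DEP4 (abbreviated; see the full list above).
COMMON_ARGS="--tensor-parallel-size 4 --data-parallel-size 4 --enable-dp-attention"

# DEP4 adds only the EP flag; DP4 passes nothing extra (ep_size defaults to 1).
EP_ARGS="--expert-parallel-size 4"

echo "DP4 args:  $COMMON_ARGS"
echo "DEP4 args: $COMMON_ARGS $EP_ARGS"
```

Keeping the delta to one flag rules out any other launch-arg difference as the cause of the regression.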

Benchmark command

```shell
python benchmark_serving.py \
    --backend sglang \
    --model deepseek-ai/DeepSeek-R1 \
    --tokenizer /path/to/DeepSeek-R1-NVFP4 \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1 \
    --random-range-ratio 1.0 \
    --num-prompts 12288 \
    --request-rate inf \
    --max-concurrency 4096 \
    --ignore-eos \
    --disable-tqdm
```

Note: SGLang automatically adjusts chunked_prefill_size to 20480 when enable-dp-attention is set ("adjusted to avoid MoE kernel issues"). Both v0.5.8 and dev apply this adjustment, so the actual chunked_prefill_size is 20480 in all runs.
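Given the adjusted chunked_prefill_size of 20480 and the benchmark's fixed ISL of 1000, each prefill chunk holds the same number of full prompts in every run, so chunk batching is identical across versions. A quick sanity check:

```python
# Effective prefill batch sizing under the adjusted chunk size (sanity check only).
chunked_prefill_size = 20480  # value SGLang adjusts to when --enable-dp-attention is set
isl = 1000                    # --random-input-len used in the benchmark

prompts_per_chunk = chunked_prefill_size // isl
print(prompts_per_chunk)  # -> 20 full 1000-token prompts per prefill chunk
```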

Results

DeepSeek R1 NVFP4 (4x GB200, agg mode, prefill throughput)

| Config | v0.5.8 | v0.5.9 | dev-0401 | Delta (v0.5.9 → dev-0401) |
|---|---|---|---|---|
| DP4 (tp4 dp4, no EP) | 40,437 tok/s | 40,795 tok/s | 39,620 tok/s | -3% (no regression) |
| DEP4 (tp4 dp4 ep4) | 60,398 tok/s | 57,877 tok/s | 9,530 tok/s | -84% (6.1x regression) |
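The deltas quoted in the table follow directly from the raw throughput numbers:

```python
# Throughput figures (tok/s) copied from the results table above.
dp4  = {"v0.5.8": 40437, "v0.5.9": 40795, "dev-0401": 39620}
dep4 = {"v0.5.8": 60398, "v0.5.9": 57877, "dev-0401": 9530}

def delta(old, new):
    """Relative change from old to new, as a percentage."""
    return 100.0 * (new - old) / old

print(f"DP4  v0.5.9 -> dev-0401: {delta(dp4['v0.5.9'], dp4['dev-0401']):+.0f}%")    # -3%
print(f"DEP4 v0.5.9 -> dev-0401: {delta(dep4['v0.5.9'], dep4['dev-0401']):+.0f}%")  # -84%
print(f"DEP4 slowdown factor: {dep4['v0.5.9'] / dep4['dev-0401']:.1f}x")            # 6.1x
```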

Expected behavior

DEP4 should achieve at least the same throughput as DP4 (~40k tok/s), and ideally match v0.5.9 performance (~58k tok/s), where EP provided a ~1.4x speedup over DP4.
