Reenable MNNVL backend for FlashInfer allreduce fusion #23402

Open: wenscarl wants to merge 25 commits into sgl-project:main from wenscarl:ar_debug2

Conversation

wenscarl (Collaborator) commented Apr 21, 2026

Summary

Reintroduces the FlashInfer unified fused allreduce + residual + RMSNorm path with an explicit backend flag, --flashinfer-allreduce-fusion-backend (auto | trtllm | mnnvl), and fixes its interaction with piecewise CUDA graph when the MNNVL path is used.

Background

  • Original integration: PR #12787.
  • Reverted due to hangs (e.g. some models / CI): PR #20792.
  • Related report (fused AR + RMSNorm + piecewise graph): vLLM #35772.

Problem

Hangs were correlated with piecewise CUDA graph capture/replay while MNNVL-style allreduce fusion was active, including configurations where the fused op was already a torch.compile split op relative to PCG. Reducing the number of piecewise graphs made the hang much rarer. Eager execution and the decode (non–piecewise) CUDA graph path did not reproduce the issue in our testing.
When stuck, stacks pointed to lack of progress in the Lamport-style wait loop in FlashInfer’s MNNVL allreduce implementation, e.g. trtllm_mnnvl_allreduce.cuh (lines ~557–573), consistent with oneshot/twoshot progress assumptions conflicting with PCG replay.

What this PR does

  • Adds --flashinfer-allreduce-fusion-backend and gates fusion on flashinfer_allreduce_fusion_backend is not None (see communicator.py, server_args.py); a simplified gating sketch follows this list.
  • mnnvl or auto on SM100 (where MNNVL may be selected): disables piecewise CUDA graph in model_runner.py so MNNVL fusion is not replayed inside PCG; keeps flashinfer_allreduce_residual_rmsnorm registered as a split op so it runs eagerly between graph pieces.
  • trtllm: removes that split-op name from PCG split_ops when MNNVL split is not required, so fusion can stay in-graph for piecewise compile.
  • LayerNorm: if FlashInfer fusion returns (None, None), always performs tensor_model_parallel_all_reduce before RMSNorm (fixes a missing allreduce on the fallback path).
  • Deprecates --enable-flashinfer-allreduce-fusion: if set with no backend, maps to --flashinfer-allreduce-fusion-backend=auto and logs a warning.
  • Workspace creation uses FlashInfer create_allreduce_fusion_workspace with backend=…, optional gpus_per_node, and preserves the existing NCCL device + GLOO cpu TorchDistBackend workaround where applicable.
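For context, here is a simplified sketch of that gating (illustrative only: the flag names, the deprecation mapping, and flashinfer_ar_needs_piecewise_cuda_graph_split come from this PR; the helper _resolve_flashinfer_ar_backend and the exact checks are hypothetical):

def _resolve_flashinfer_ar_backend(server_args):
    # Deprecated boolean flag maps to the new backend field ("auto"); the real
    # code also logs a deprecation warning.
    if getattr(server_args, "enable_flashinfer_allreduce_fusion", False) and \
            server_args.flashinfer_allreduce_fusion_backend is None:
        server_args.flashinfer_allreduce_fusion_backend = "auto"
    # None means FlashInfer allreduce fusion is disabled entirely.
    return server_args.flashinfer_allreduce_fusion_backend


def flashinfer_ar_needs_piecewise_cuda_graph_split(server_args):
    # MNNVL (and "auto" on SM100, where MNNVL may be selected) is not safe to
    # replay inside piecewise CUDA graphs, so model_runner skips PCG for it and
    # the fused op stays registered as a split op, running eagerly between pieces.
    backend = _resolve_flashinfer_ar_backend(server_args)
    return backend in ("mnnvl", "auto")  # the real check also inspects SM/topology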

Benchmarks (GB200, 4 GPUs)

Server:

sglang serve \
  --model-path openai/gpt-oss-120b \
  --tensor-parallel-size 2 \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss \
  --flashinfer-allreduce-fusion-backend <mnnvl|trtllm> \
  --disable-flashinfer-autotune

Client (bench_serving):

  python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --random-range-ratio 1.0 \
    --num-prompts 128 \
    --max-concurrency ${BS} \
    --request-rate inf \
    --disable-ignore-eos
| Max request concurrency | MNNVL TPOT (ms) | TRT-LLM TPOT (ms), piecewise CUDA graph | Speedup vs TRT-LLM (piecewise CUDA graph) | TRT-LLM TPOT (ms), without piecewise CUDA graph | Speedup vs TRT-LLM (without piecewise CUDA graph) |
|---|---|---|---|---|---|
| 1 | 3.30 | 3.79 | 1.15x | 3.79 | 1.15x |
| 4 | 4.25 | 4.75 | 1.12x | 4.83 | 1.14x |
| 16 | 5.93 | 6.76 | 1.14x | 6.87 | 1.16x |
| 32 | 7.16 | 8.34 | 1.17x | 8.57 | 1.20x |
| 64 | 8.93 | 10.60 | 1.19x | 10.83 | 1.21x |

Notes for reviewers

  • MNNVL + PCG is treated as unsafe (similar policy to other non–graph-safe comm); TRT-LLM path remains the default for single-node auto where applicable.
  • FlashInfer / topology details are documented in comments in flashinfer_comm_fusion.py.

Accuracy:

python3 -m sglang.test.few_shot_gsm8k --num-questions 200

Accuracy: 0.875
Invalid: 0.020
Latency: 27.135 s
Output throughput: 2344.613 token/s

How to reproduce the hang

  1. Goal
    You want both:
      • FlashInfer MNNVL (or auto on SM100 where MNNVL is chosen) for fused allreduce + residual + RMSNorm, and
      • piecewise CUDA graph capture/replay still enabled (the path this PR disables in model_runner.py).
    On this PR's default code, piecewise CUDA graph is skipped when flashinfer_ar_needs_piecewise_cuda_graph_split(server_args) is true, so you will not see the hang until you relax that guard.

  2. Code change (force the bad combination)
    Option A — turn piecewise CUDA graph back on (most direct)

In python/sglang/srt/model_executor/model_runner.py, comment out or remove the early return that runs when flashinfer_ar_needs_piecewise_cuda_graph_split is true (the block that logs “Disable piecewise CUDA graph because MNNVL allreduce fusion is enabled”).

Option B — make the “disable PCG” helper always false

In python/sglang/srt/layers/flashinfer_comm_fusion.py, change flashinfer_ar_needs_piecewise_cuda_graph_split so it always returns False (or only returns False for your MNNVL / auto case). Then model_runner will not skip piecewise CUDA graph, while piecewise_cuda_graph_runner can still treat the fused op as a split op for MNNVL when your branch logic keeps it in split_ops.

You reported that even Option B still hung; that is why the shipped fix disables PCG entirely for MNNVL instead of relying on split-op alone.
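
For illustration, Option B boils down to an edit like the following (a sketch only; the shipped helper's body is more involved than shown here):

# python/sglang/srt/layers/flashinfer_comm_fusion.py -- sketch of Option B, not the shipped code
def flashinfer_ar_needs_piecewise_cuda_graph_split(server_args) -> bool:
    # Shipped behavior: return True for mnnvl (or auto on SM100) so that
    # model_runner.py skips piecewise CUDA graph capture.
    # To reproduce the hang, force the guard off so PCG stays enabled while
    # MNNVL fusion is still selected:
    return False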
@nvpohanh


CI States

Latest PR Test (Base): Run #25962704277
Latest PR Test (Extra): ⚠️ Not enabled — add run-ci-extra label to opt in.

@wenscarl wenscarl marked this pull request as ready for review May 5, 2026 22:04

@wenscarl wenscarl requested a review from b8zhong May 6, 2026 03:38
Comment thread: python/sglang/srt/layers/flashinfer_comm_fusion.py (outdated)
Comment thread: python/sglang/srt/server_args.py (outdated), on the backend help text:

  "Enable FlashInfer allreduce fusion and choose backend. "
  "When not set the feature is disabled. "
  "'auto': choose best backend (trtllm single-node, mnnvl multi-node). "
  "'trtllm': single-node only, supports fused quantization. "

Collaborator: Currently I'm not sure we use the quantize fusion in the code (maybe only in benchmarks); can you help verify?

wenscarl (Author):
Verified — only kARResidualRMSNorm (no quant) is called in SGLang. The quant patterns appear only in flashinfer benchmarks. Removed the misleading line from the help text.

- simplify flashinfer.comm imports (drop legacy fallbacks; rely on pinned version)
- drop "supports fused quantization" from trtllm backend help text
@wenscarl wenscarl requested a review from b8zhong May 7, 2026 14:31
b8zhong (Collaborator) left a comment:
/tag-and-rerun-ci

b8zhong commented May 8, 2026

@wenscarl could you resolve the conflicts?

@b8zhong b8zhong added the run-ci label May 8, 2026
@wenscarl wenscarl requested a review from wisclmy0611 as a code owner May 8, 2026 15:32
auto-merge was automatically disabled May 12, 2026 17:56

Head branch was pushed to by a user without write access

Comment thread: python/sglang/srt/layers/flashinfer_comm_fusion.py (outdated)
b8zhong commented May 13, 2026

[Screenshot 1: benchmark results (MNNVL)]

Even though it does not trigger completion at the end, the E2E speedup will be more influential (the first image is MNNVL), so it's fine. Just verified on B300.

[Screenshot 2: benchmark results]

wenscarl (Author): gist for debugging the hang.

@wenscarl wenscarl requested a review from b8zhong May 13, 2026 16:45
@b8zhong b8zhong self-assigned this May 13, 2026
b8zhong commented May 14, 2026

@wenscarl The MTP failure on DSV32 looks related. Can you help take a look?

Edit: it's not related; just tested it locally.

BBuf (Collaborator) left a comment:
One more concern: the deterministic inference path does not seem fully migrated to the new backend field.

In server_args.py, _handle_deterministic_inference() still only disables the old enable_flashinfer_allreduce_fusion flag, but it does not clear flashinfer_allreduce_fusion_backend. This matters especially for rl_on_policy_target, because deterministic inference is enabled inside _handle_deterministic_inference(), after the earlier enforce_disable_flashinfer_allreduce_fusion handling has already run.

So some deterministic inference configurations may still end up with the new FlashInfer allreduce fusion backend enabled. Should we also set flashinfer_allreduce_fusion_backend = None in this path?
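
Concretely, the suggested change could look like this (a sketch only, assuming the enable_deterministic_inference field and the PR's new backend field; the real method does more than shown):

def _handle_deterministic_inference(self):
    if self.enable_deterministic_inference:
        # Existing behavior: disable the old boolean flag.
        self.enable_flashinfer_allreduce_fusion = False
        # Suggested addition: also clear the new backend field so the
        # FlashInfer allreduce fusion path cannot remain enabled here.
        self.flashinfer_allreduce_fusion_backend = None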

BBuf commented May 15, 2026

The current CI also does not look clean yet. A few failures seem worth checking before merge:

  • B200: test/registered/quant/test_deepseek_v32_fp4_mtp_4gpu.py fails with acc_length=1.00, hitting AssertionError: 1.0 not greater than 2.7. This looks especially relevant to the allreduce/RMSNorm correctness risk.
  • H200: one test fails a speed threshold, with 177.33 < 180.
  • H100: test_gpt_oss_4gpu.py fails and produces CUDA coredumps.

b8zhong commented May 15, 2026

  • I ran the B200 test locally and it passes.
  • This feature is not enabled on H200.
  • The H100 failure is a known failure on main.

@wenscarl wenscarl requested a review from BBuf May 15, 2026 22:57

Labels: bypass-fastfail, documentation, run-ci
