# Qwen3.5-397B (G)B200 Functional Support and Optimizations
Tracking issue for Qwen3.5-397B support in SGLang on (G)B200.
## Progress Tracker

### no-MTP+Agg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
### MTP+Agg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
### no-MTP+Disagg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
### MTP+Disagg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
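The DEP (high-throughput) and TP (low-latency) rows above are two different parallelism layouts. As a rough illustration only, the sketch below maps them onto SGLang's offline `Engine` API. The argument names (`tp_size`, `dp_size`, `enable_dp_attention`, `ep_size`) follow SGLang's server arguments at the time of writing and may differ across versions, and the GPU counts are placeholders, not the exact configurations tracked here.

```python
import sglang as sgl

# Checkpoint named in the MTP PR under Weekly Progress below.
MODEL = "nvidia/Qwen3.5-397B-A17B-NVFP4"

# DEP (high-throughput): DP attention plus expert parallelism, in the spirit
# of the DP4EP4 configuration mentioned in the 2026-04-17 update.
dep_args = dict(
    model_path=MODEL,
    tp_size=4,                 # 4 GPUs total (placeholder count)
    dp_size=4,                 # data-parallel attention replicas
    enable_dp_attention=True,  # attention runs data-parallel, MoE expert-parallel
    ep_size=4,                 # shard MoE experts across the ranks
)

# TP (low-latency): plain tensor parallelism, no DP attention.
tp_args = dict(
    model_path=MODEL,
    tp_size=4,
)

if __name__ == "__main__":
    engine = sgl.Engine(**dep_args)  # swap in **tp_args for the low-latency mode
    print(engine.generate("Hello", {"max_new_tokens": 8}))
```

The same settings carry over to `python -m sglang.launch_server` for online serving, since the CLI flags mirror the `ServerArgs` fields.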
## Weekly Progress

### 2026-05-08
### 2026-04-17
- **MoE kernels:** CuteDSL MoE V2 (Add dedicated FlashInferCuteDslMoE layer for standard-path FP4 MoE #21339) fixed; needs FlashInfer v0.6.8 for the autotuner fix (Prevent MoE autotuner buffer overflow on large token buckets flashinfer-ai/flashinfer#3025). A2A + trtllm_routed MoE in review ([NVIDIA] Support flashinfer a2a with flashinfer_trtllm_routed moe #22394); A2A + CuteDSL V2 MoE in review (feat: Support flashinfer_cutedsl MoE runner with flashinfer alltoall backend #22669, perf debugging in progress).
- **SmallEP optimization:** reduce_scatterv merged (Replace all-reduce + dp_scatter with reduce_scatterv for DP attention #22642), fusing the all-reduce + dp_scatter pair into a single reduce_scatterv for DP+EP MoE: -13.6% NCCL time, +7.7% e2e throughput on Qwen3.5-397B DP4EP4. See the sketch after this list.
- **GDN prefill kernel:** merged into FlashInfer v0.6.7.post3 ([feat] Add blackwell GDN prefill kernel flashinfer-ai/flashinfer#3001); SGLang integration ([NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) #22921) in review. ~6.8% prefill-only throughput gain on Blackwell (ISL=1000, OSL=1, FP8 DEP4 on GB200: 63,603 → 67,923).
- **GDN decode MTP SM100 kernel:** SGLang integration in progress; debugging an accuracy issue.
- **Regression fixes:** Dynamo hang race condition fixed (fix(deps): pin mio after lockfile drift ai-dynamo/dynamo#8080); Disagg Wide-EP ~2x throughput regression resolved ([Performance Regression] ~2x throughput drop in disaggregated PD mode with Wide-EP (DeepSeek-R1 FP4, GB200) between SGLang v0.5.8 and latest nightly #22095; fix: dp_rank always 0 in non-KV router mode ai-dynamo/dynamo#7984); Small-EP 6x prefill regression resolved ([Performance][Small-EP] 6x prefill throughput regression with Expert Parallelism on DeepSeek R1 between v0.5.9 and dev #22379; fixed in FlashInfer v0.6.7.post3).
- **Disagg/NIXL:** GPU staging buffer merged ([Disagg] GPU staging buffer with dynamic ring allocator for heterogeneous TP KV transfer #19890); heterogeneous TP staging in draft ([Disagg][NIXL] Add staging buffer support for heterogeneous TP KV transfer #22536).
- **MNNVL AllReduce:** [Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api #12787 merged but reverted (Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" #20792) due to a GPT-OSS hang; related to piecewise CUDA graphs (PCG), tracked in Hang in oneshotAllreduceFusionKernel During Piecewise CUDA Graph Replay flashinfer-ai/flashinfer#3053.
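To make the SmallEP item above concrete, here is a minimal, self-contained sketch (not SGLang's code) of the collective substitution: an all-reduce followed by each DP rank keeping its own slice is replaced by a single reduce-scatter, so each rank receives only its shard of the bytes. SGLang's `reduce_scatterv` additionally handles variable-sized shards across DP ranks; this sketch covers the equal-size case with `torch.distributed`.

```python
import os
import torch
import torch.distributed as dist

def allreduce_then_scatter(hidden: torch.Tensor, rank: int, world: int) -> torch.Tensor:
    # Old path: every rank receives the fully reduced tensor, then keeps
    # only its own rows (the dp_scatter step).
    dist.all_reduce(hidden)                   # sum across all ranks, full tensor
    return hidden.chunk(world, dim=0)[rank]   # keep the local shard

def fused_reduce_scatter(hidden: torch.Tensor, world: int) -> torch.Tensor:
    # New path: a single reduce-scatter yields the reduced local shard
    # directly, so each rank receives 1/world of the traffic.
    out = torch.empty(hidden.shape[0] // world, hidden.shape[1],
                      dtype=hidden.dtype, device=hidden.device)
    dist.reduce_scatter_tensor(out, hidden)
    return out

if __name__ == "__main__":
    # Launch with: torchrun --nproc-per-node=<world> this_file.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank, world = dist.get_rank(), dist.get_world_size()
    x = torch.randn(8 * world, 16, device="cuda")
    old = allreduce_then_scatter(x.clone(), rank, world)
    new = fused_reduce_scatter(x, world)
    assert torch.allclose(old, new, atol=1e-5)  # same result, fewer bytes moved
```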
### 2026-04-08

### 2026-03-30

### 2026-03-17

### 2026-03-12
### 2026-03-06 (update)
- Agg functional support (no-MTP+Agg, MTP+Agg): DONE for both FP8 and NVFP4.
- MTP enabled via merged PR: [Qwen3.5] Enable MTP spec_v2 and add test for nvidia/Qwen3.5-397B-A17B-NVFP4 #19391.
- Disagg functional support: IN PROGRESS (debugging a TP size mismatch between Prefill and Decode).
- Agg cookbook, IBDB pipeline, perf analysis: all IN PROGRESS.
- Perf optimizations (Agg + Disagg): IN PROGRESS; GDN (gated delta net) kernel optimizations ongoing.
### 2026-03-06

- Created and initialized all tasks.
## Related GitHub Issues