# Qwen3.5-397B (G)B200 Functional Support and Optimizations
Tracking issue for Qwen3.5-397B support in SGLang on (G)B200.
## Progress Tracker

### no-MTP+Agg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
### MTP+Agg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
### no-MTP+Disagg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
### MTP+Disagg
| Category | Precision | Item | Status |
|---|---|---|---|
| Functional | FP8 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | FP8 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | FP8 | TP (low-latency) works functionally | ✅ DONE |
| Functional | FP8 | TP (low-latency) has good accuracy | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) works functionally | ✅ DONE |
| Functional | NVFP4 | DEP (high-throughput) has good accuracy | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) works functionally | ✅ DONE |
| Functional | NVFP4 | TP (low-latency) has good accuracy | ✅ DONE |
| Baseline Perf | FP8 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | FP8 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | DEP (high-throughput) uses correct backends/kernels | ✅ DONE |
| Baseline Perf | NVFP4 | TP (low-latency) uses correct backends/kernels | ✅ DONE |
| Cookbook | (all) | Update SGLang cookbook | 🔄 IN PROGRESS |
| Perf Analysis | FP8 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Analysis | NVFP4 | Round 1 perf analysis | 🔄 IN PROGRESS |
| Perf Optimization | FP8 | Round 1 perf optimizations | 🔄 IN PROGRESS |
| Perf Optimization | NVFP4 | Round 1 perf optimizations | 🔄 IN PROGRESS |
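The DEP (high-throughput) and TP (low-latency) rows above are two different parallelism layouts. As a rough illustration only, the sketch below maps them onto SGLang's offline `Engine` API. The argument names (`tp_size`, `dp_size`, `enable_dp_attention`, `ep_size`) follow SGLang's server arguments at the time of writing and may differ across versions, and the GPU counts are placeholders, not the exact configurations tracked here.

```python
import sglang as sgl

# Checkpoint named in the MTP PR under Weekly Progress below.
MODEL = "nvidia/Qwen3.5-397B-A17B-NVFP4"

# DEP (high-throughput): DP attention plus expert parallelism, in the spirit
# of the DP4EP4 configuration mentioned in the 2026-04-17 update.
dep_args = dict(
    model_path=MODEL,
    tp_size=4,                 # 4 GPUs total (placeholder count)
    dp_size=4,                 # data-parallel attention replicas
    enable_dp_attention=True,  # attention runs data-parallel, MoE expert-parallel
    ep_size=4,                 # shard MoE experts across the ranks
)

# TP (low-latency): plain tensor parallelism, no DP attention.
tp_args = dict(
    model_path=MODEL,
    tp_size=4,
)

if __name__ == "__main__":
    engine = sgl.Engine(**dep_args)  # swap in **tp_args for the low-latency mode
    print(engine.generate("Hello", {"max_new_tokens": 8}))
```

The same settings carry over to `python -m sglang.launch_server` for online serving, since the CLI flags mirror the `ServerArgs` fields.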
## Weekly Progress

### 2026-05-08
### 2026-04-17
- **MoE kernels:** CuteDSL MoE V2 (Add dedicated FlashInferCuteDslMoE layer for standard-path FP4 MoE #21339) fixed; needs FlashInfer v0.6.8 for the autotuner fix (Prevent MoE autotuner buffer overflow on large token buckets flashinfer-ai/flashinfer#3025). A2A + trtllm_routed MoE in review ([NVIDIA] Support flashinfer a2a with flashinfer_trtllm_routed moe #22394); A2A + CuteDSL V2 MoE in review (feat: Support flashinfer_cutedsl MoE runner with flashinfer alltoall backend #22669, perf debugging in progress).
- **SmallEP optimization:** reduce_scatterv merged (Replace all-reduce + dp_scatter with reduce_scatterv for DP attention #22642), fusing the all-reduce + dp_scatter pair into a single reduce_scatterv for DP+EP MoE: -13.6% NCCL time, +7.7% e2e throughput on Qwen3.5-397B DP4EP4. See the sketch after this list.
- **GDN prefill kernel:** merged into FlashInfer v0.6.7.post3 ([feat] Add blackwell GDN prefill kernel flashinfer-ai/flashinfer#3001); SGLang integration ([NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) #22921) in review. ~6.8% prefill-only throughput gain on Blackwell (ISL=1000, OSL=1, FP8 DEP4 on GB200: 63,603 → 67,923).
- **GDN decode MTP SM100 kernel:** SGLang integration in progress; debugging an accuracy issue.
- **Regression fixes:** Dynamo hang race condition fixed (fix(deps): pin mio after lockfile drift ai-dynamo/dynamo#8080); Disagg Wide-EP ~2x throughput regression resolved ([Performance Regression] ~2x throughput drop in disaggregated PD mode with Wide-EP (DeepSeek-R1 FP4, GB200) between SGLang v0.5.8 and latest nightly #22095; fix: dp_rank always 0 in non-KV router mode ai-dynamo/dynamo#7984); Small-EP 6x prefill regression resolved ([Performance][Small-EP] 6x prefill throughput regression with Expert Parallelism on DeepSeek R1 between v0.5.9 and dev #22379; fixed in FlashInfer v0.6.7.post3).
- **Disagg/NIXL:** GPU staging buffer merged ([Disagg] GPU staging buffer with dynamic ring allocator for heterogeneous TP KV transfer #19890); heterogeneous TP staging in draft ([Disagg][NIXL] Add staging buffer support for heterogeneous TP KV transfer #22536).
- **MNNVL AllReduce:** [Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api #12787 merged but reverted (Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" #20792) due to a GPT-OSS hang; related to piecewise CUDA graphs (PCG), tracked in Hang in oneshotAllreduceFusionKernel During Piecewise CUDA Graph Replay flashinfer-ai/flashinfer#3053.
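To make the SmallEP item above concrete, here is a minimal, self-contained sketch (not SGLang's code) of the collective substitution: an all-reduce followed by each DP rank keeping its own slice is replaced by a single reduce-scatter, so each rank receives only its shard of the bytes. SGLang's `reduce_scatterv` additionally handles variable-sized shards across DP ranks; this sketch covers the equal-size case with `torch.distributed`.

```python
import os
import torch
import torch.distributed as dist

def allreduce_then_scatter(hidden: torch.Tensor, rank: int, world: int) -> torch.Tensor:
    # Old path: every rank receives the fully reduced tensor, then keeps
    # only its own rows (the dp_scatter step).
    dist.all_reduce(hidden)                   # sum across all ranks, full tensor
    return hidden.chunk(world, dim=0)[rank]   # keep the local shard

def fused_reduce_scatter(hidden: torch.Tensor, world: int) -> torch.Tensor:
    # New path: a single reduce-scatter yields the reduced local shard
    # directly, so each rank receives 1/world of the traffic.
    out = torch.empty(hidden.shape[0] // world, hidden.shape[1],
                      dtype=hidden.dtype, device=hidden.device)
    dist.reduce_scatter_tensor(out, hidden)
    return out

if __name__ == "__main__":
    # Launch with: torchrun --nproc-per-node=<world> this_file.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank, world = dist.get_rank(), dist.get_world_size()
    x = torch.randn(8 * world, 16, device="cuda")
    old = allreduce_then_scatter(x.clone(), rank, world)
    new = fused_reduce_scatter(x, world)
    assert torch.allclose(old, new, atol=1e-5)  # same result, fewer bytes moved
```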
### 2026-04-08

### 2026-03-30

### 2026-03-17

### 2026-03-12
### 2026-03-06 (update)
- Agg functional support (no-MTP+Agg, MTP+Agg): DONE for both FP8 and NVFP4.
- MTP enabled via merged PR: [Qwen3.5] Enable MTP spec_v2 and add test for nvidia/Qwen3.5-397B-A17B-NVFP4 #19391.
- Disagg functional support: IN PROGRESS (debugging a TP size mismatch between Prefill and Decode).
- Agg cookbook, IBDB pipeline, perf analysis: all IN PROGRESS.
- Perf optimizations (Agg + Disagg): IN PROGRESS; GDN (gated delta net) kernel optimizations ongoing.
### 2026-03-06

- Created and initialized all tasks.
## Related GitHub Issues