Commit c92b80d
Default to FlashInfer GDN decode on SM100+ with bf16 mamba state
On SM100+ with mamba-ssm-dtype=bfloat16, automatically set
--linear-attn-decode-backend to flashinfer when not explicitly
specified. This gives 1-5% TPOT improvement at higher concurrencies.
The prerequisite bug (OOB from negative padding indices in bf16
decode kernel) was fixed in FlashInfer v0.6.7 via
flashinfer-ai/flashinfer#2810.
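The defaulting logic described above can be sketched as follows. This is an illustrative sketch only: the function name, parameters, and the "triton" fallback are assumptions, not the actual SGLang server-args code.

```python
def resolve_linear_attn_decode_backend(
    explicit_backend, sm_version, mamba_ssm_dtype, mtp_enabled
):
    """Pick the GDN decode backend when none is given (hypothetical helper).

    explicit_backend: value of --linear-attn-decode-backend, or None
    sm_version:       CUDA SM major*10 (e.g. 100 for SM100/Blackwell)
    mamba_ssm_dtype:  dtype string for the mamba SSM state
    mtp_enabled:      whether MTP speculative decoding is active
    """
    if explicit_backend is not None:
        # An explicitly specified backend always wins.
        return explicit_backend
    # Default to FlashInfer only on SM100+ with bf16 mamba state,
    # and not under MTP speculative decoding (not yet supported).
    if sm_version >= 100 and mamba_ssm_dtype == "bfloat16" and not mtp_enabled:
        return "flashinfer"
    return "triton"  # assumed fallback backend
```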
Verified on Qwen3.5-397B-A17B-NVFP4 (4xGB200, no_buffer +
disable-radix-cache), sa-bench ISL=1024 OSL=1024, conc 2-1024:
- GSM8K accuracy: 0.977-0.979
- Mean TPOT: -1.3% (conc=2) to -4.5% (conc=1024)
- Output throughput: +1.3% (conc=2) to +4.7% (conc=1024)

The FlashInfer default is excluded when MTP speculative decoding is active (not yet supported).
1 file changed: 19 additions & 1 deletion