Commit 866023c
Sandermage
v7.65: PN26b — Genesis-original sparse-V Triton kernel for SM86
First sparse-V tile-skip kernel deployed for NVIDIA Ampere consumer
(SM86). No upstream Ampere reference exists — TRT-LLM #9821 +
FlashInfer #2477 ship for SM90+ only.
DESIGN — synthesized from 4-agent research 2026-05-01
======================================================
Fork rather than text-patch upstream Triton kernel:
- vllm/_genesis/kernels/triton_turboquant_decode_sparse_v.py
Genesis-original Triton kernel mirroring upstream `_tq_decode_stage1`
+ opt-in SPARSE_V tile-skip + sink-token protection + skip-rate
observability. Lazy-compiled, cached per process.
- vllm/_genesis/wiring/perf_hotfix/patch_N26_sparse_v_kernel.py
Lean dispatcher wrapper around upstream
triton_turboquant_decode_attention. Bakes threshold + tuning params
at apply() time. NO per-call GPU↔CPU sync (initial v1 had .item()
per call → catastrophic regression -16% short / -22% long, REJECTED;
v2 lean fixed it).
- vllm/_genesis/tests/test_pn26_sparse_v_kernel.py
TDD test suite — 7 CPU tests pass on Mac, 3 GPU smoke tests skip
cleanly on non-CUDA. Validates: threshold logic, BLASST λ scaling,
min_ctx default, wiring contract, dispatcher registry.
KEY FEATURES (v5)
=================
1. **Lean dispatcher** — no per-call sync. Always routes to forked
kernel; Triton constexpr DCE handles SPARSE_V=0 → byte-equivalent
to upstream when threshold doesn't fire.
2. **Configurable launch params** baked at apply() time:
- GENESIS_PN26_SPARSE_V_BLOCK_KV (4/8/16, default 4)
- GENESIS_PN26_SPARSE_V_NUM_WARPS (1/2/4/8, default 4 — winner)
- GENESIS_PN26_SPARSE_V_NUM_STAGES (1/2/3, default 1)
3. **`tl.range()` pipelining hint** (P67 v7.50 pattern) — Triton
compiler overlaps cp.async with prior-iter MMA on Ampere.
4. **Cache modifier `.cg`** on K/V dequant raw loads — L2 streaming.
5. **StreamingLLM sink-token protection** (first SINK_TOKENS=4 KV
positions never skipped — preserves long-context quality).
6. **BLASST λ=a/L scaling scaffold** ready (kernel-level seq_lens
load avoids per-call sync). Default mode = fixed threshold.
7. **Skip-rate observability** (NEW): per-CTA atomic int64 counters,
constexpr-DCE'd to zero overhead when DEBUG=0. When ON, periodic
logging every N calls (default 500) reports lifetime + per-launch
skip rate. Cost ~50-100 ns per CTA at epilogue (~0.05% kernel
overhead, statistically indistinguishable from baseline noise).
EMPIRICAL SWEEP — 35B FP8 PROD (TQ k8v4 + MTP K=3, 2× A5000 SM86)
=================================================================
Apples-to-apples bench at 100-token output (matches historical PROD
reference of 171-204 TPS):
| BLOCK_KV | num_warps | mean | max | CV |
|----------|-----------|--------|--------|-------|
| OFF | (baseline)| 175.41 | 185.15 | 4.20% |
| 8 | 1 | 178.33 | 187.67 | 3.78% |
| 8 | 2 | 180.36 | 190.24 | 4.70% |
| 16 | 2 | 178.35 | 190.74 | 3.26% |
| 8 | 4 | 183.11 | 202.38 | 5.26% |
| 8 | 8 | 181.24 | 196.60 | 5.78% |
| **4** | **4** | **184.89** | 194.56 | 4.63% |
| 4 | 8 | 177.40 | 191.97 | 5.79% |
Winner: BLOCK_KV=4, num_warps=4 (baked as kernel default).
FINAL A/B — 35B PROD with full bench harness
==============================================
Comprehensive bench (warmup + tool-call + sustained 50-req + concurrent):
| Metric | Baseline | PN26b v5 | Δ |
|--------------------|----------|--------------|----------------|
| Warmup mean TPS | 175.41 | 177.60 | +1.2% |
| Tool-call (7city) | 7/7 | 7/7 | preserved |
| Sustained mean TPS | 175.41 | **182.30** | **+3.9%** |
| Sustained max TPS | 185.15 | **212.24** | **+14.7%** ⭐ |
| Sustained p50 | n/a | 181.23 | new |
| Sustained p90 | n/a | 197.01 | new |
| Sustained p99 | n/a | 210.86 | new |
| Sustained CV | 4.20% | 7.02% | +2.82pp |
| Errors / 50 reqs | 0 | 0 | match |
| VRAM delta | 0 | +142 MiB | acceptable |
The 212 TPS max EXCEEDS the historical reference ceiling (171-204).
Tool-call quality fully preserved. Concurrent load: 2.27 req/s.
CAVEAT — empirical skip rate
============================
Skip rate at threshold=0.005 on our 100-token-output workload is very
low. Most TPS gain comes from kernel restructuring (`tl.range()` +
cache hints + larger num_warps), not the skip itself. The skip-rate
observability counter ships so future operators can data-drive their
threshold tuning at long-context workloads where skip rate naturally
rises.
BUG FIXED IN THIS COMMIT
========================
Wiring file (`patch_N26_sparse_v_kernel.py`) used `os.environ.get()`
without importing `os`. Caused NameError during apply() → wrapper not
installed. Added `import os`. Verified via boot: 45 applied / 44
skipped / 0 failed. Sparse-V dispatcher correctly wraps upstream.
NOT ENABLED IN ANY LAUNCH SCRIPT BY DEFAULT
============================================
PN26b is opt-in via GENESIS_ENABLE_PN26_SPARSE_V=1. Operators on
different SMs (89/90, datacenter) or larger batch sizes may see
different cost-benefit ratios. The 35B PROD launch script enables it
empirically with BLOCK_KV=4 num_warps=4 (winner from sweep) +
threshold=0.005 + DEBUG=1 for ongoing observability.
NEXT STEPS (DEFERRED TO NEXT SESSION)
======================================
- Per-row vote design for P67 multi-query verify path (research
agent design captured; ~4-8h implementation)
- Long-context (>32K) threshold sweep with skip-rate observability
- Per-layer threshold table (BSFA-style calibration)
- Self-Indexing KVCache paper (arXiv 2603.14224) backlog item
Sources:
- vllm#41422 (TheTom) — design template, AMD MI300X validated only
- BLASST arXiv 2512.12087 — λ=a/L threshold scaling formula
- TRT-LLM PR #9821 (Skip Softmax Attention) — production reference
- SpargeAttn ICML 2025 — RTX 3090/4090/L40 Ampere validation
- tq-kv reference (onur-gokyildiz-bhi) — SM86-compatible CUDA pattern
- StreamingLLM arXiv 2309.17453 — sink token protection1 parent 09ddb96 commit 866023c
7 files changed
Lines changed: 1510 additions & 1 deletion
File tree
- scripts
- vllm/_genesis
- kernels
- patches
- tests
- wiring/perf_hotfix
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
143 | 143 | | |
144 | 144 | | |
145 | 145 | | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
| 191 | + | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
146 | 206 | | |
147 | 207 | | |
148 | 208 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
47 | 47 | | |
48 | 48 | | |
49 | 49 | | |
50 | | - | |
| 50 | + | |
51 | 51 | | |
52 | 52 | | |
53 | 53 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
613 | 613 | | |
614 | 614 | | |
615 | 615 | | |
| 616 | + | |
| 617 | + | |
| 618 | + | |
| 619 | + | |
| 620 | + | |
| 621 | + | |
| 622 | + | |
| 623 | + | |
| 624 | + | |
| 625 | + | |
| 626 | + | |
| 627 | + | |
| 628 | + | |
| 629 | + | |
| 630 | + | |
| 631 | + | |
| 632 | + | |
| 633 | + | |
| 634 | + | |
| 635 | + | |
| 636 | + | |
| 637 | + | |
| 638 | + | |
| 639 | + | |
| 640 | + | |
| 641 | + | |
| 642 | + | |
616 | 643 | | |
617 | 644 | | |
618 | 645 | | |
| |||
0 commit comments