Commit 866023c

Author: Sandermage
v7.65: PN26b — Genesis-original sparse-V Triton kernel for SM86
First sparse-V tile-skip kernel deployed for NVIDIA Ampere consumer (SM86).
No upstream Ampere reference exists: TRT-LLM #9821 + FlashInfer #2477 ship
for SM90+ only.

DESIGN - synthesized from 4-agent research 2026-05-01
=====================================================

Fork rather than text-patch upstream Triton kernel:

- vllm/_genesis/kernels/triton_turboquant_decode_sparse_v.py
  Genesis-original Triton kernel mirroring upstream `_tq_decode_stage1` +
  opt-in SPARSE_V tile-skip + sink-token protection + skip-rate
  observability. Lazy-compiled, cached per process.

- vllm/_genesis/wiring/perf_hotfix/patch_N26_sparse_v_kernel.py
  Lean dispatcher wrapper around upstream
  triton_turboquant_decode_attention. Bakes threshold + tuning params at
  apply() time. NO per-call GPU↔CPU sync (initial v1 had .item() per call
  → catastrophic regression -16% short / -22% long, REJECTED; v2 lean
  fixed it).

- vllm/_genesis/tests/test_pn26_sparse_v_kernel.py
  TDD test suite: 7 CPU tests pass on Mac, 3 GPU smoke tests skip cleanly
  on non-CUDA. Validates: threshold logic, BLASST λ scaling, min_ctx
  default, wiring contract, dispatcher registry.

KEY FEATURES (v5)
=================

1. **Lean dispatcher**: no per-call sync. Always routes to forked kernel;
   Triton constexpr DCE handles SPARSE_V=0 → byte-equivalent to upstream
   when threshold doesn't fire.
2. **Configurable launch params** baked at apply() time:
   - GENESIS_PN26_SPARSE_V_BLOCK_KV (4/8/16, default 4)
   - GENESIS_PN26_SPARSE_V_NUM_WARPS (1/2/4/8, default 4, the sweep winner)
   - GENESIS_PN26_SPARSE_V_NUM_STAGES (1/2/3, default 1)
3. **`tl.range()` pipelining hint** (P67 v7.50 pattern): Triton compiler
   overlaps cp.async with prior-iter MMA on Ampere.
4. **Cache modifier `.cg`** on K/V dequant raw loads: L2 streaming.
5. **StreamingLLM sink-token protection** (first SINK_TOKENS=4 KV
   positions never skipped; preserves long-context quality).
6. **BLASST λ=a/L scaling scaffold** ready (kernel-level seq_lens load
   avoids per-call sync). Default mode = fixed threshold.
7. **Skip-rate observability** (NEW): per-CTA atomic int64 counters,
   constexpr-DCE'd to zero overhead when DEBUG=0. When ON, periodic
   logging every N calls (default 500) reports lifetime + per-launch skip
   rate. Cost ~50-100 ns per CTA at epilogue (~0.05% kernel overhead,
   statistically indistinguishable from baseline noise).

EMPIRICAL SWEEP - 35B FP8 PROD (TQ k8v4 + MTP K=3, 2× A5000 SM86)
=================================================================

Apples-to-apples bench at 100-token output (matches historical PROD
reference of 171-204 TPS):

| BLOCK_KV | num_warps  | mean       | max    | CV    |
|----------|------------|------------|--------|-------|
| OFF      | (baseline) | 175.41     | 185.15 | 4.20% |
| 8        | 1          | 178.33     | 187.67 | 3.78% |
| 8        | 2          | 180.36     | 190.24 | 4.70% |
| 16       | 2          | 178.35     | 190.74 | 3.26% |
| 8        | 4          | 183.11     | 202.38 | 5.26% |
| 8        | 8          | 181.24     | 196.60 | 5.78% |
| **4**    | **4**      | **184.89** | 194.56 | 4.63% |
| 4        | 8          | 177.40     | 191.97 | 5.79% |

Winner: BLOCK_KV=4, num_warps=4 (baked as kernel default).

FINAL A/B - 35B PROD with full bench harness
============================================

Comprehensive bench (warmup + tool-call + sustained 50-req + concurrent):

| Metric             | Baseline | PN26b v5   | Δ             |
|--------------------|----------|------------|---------------|
| Warmup mean TPS    | 175.41   | 177.60     | +1.2%         |
| Tool-call (7city)  | 7/7      | 7/7        | preserved     |
| Sustained mean TPS | 175.41   | **182.30** | **+3.9%**     |
| Sustained max TPS  | 185.15   | **212.24** | **+14.7%** ⭐ |
| Sustained p50      | n/a      | 181.23     | new           |
| Sustained p90      | n/a      | 197.01     | new           |
| Sustained p99      | n/a      | 210.86     | new           |
| Sustained CV       | 4.20%    | 7.02%      | +2.82pp       |
| Errors / 50 reqs   | 0        | 0          | match         |
| VRAM delta         | 0        | +142 MiB   | acceptable    |

The 212 TPS max EXCEEDS the historical reference ceiling (171-204).
Tool-call quality fully preserved. Concurrent load: 2.27 req/s.

CAVEAT - empirical skip rate
============================

Skip rate at threshold=0.005 on our 100-token-output workload is very
low. Most TPS gain comes from kernel restructuring (`tl.range()` + cache
hints + larger num_warps), not the skip itself. The skip-rate
observability counter ships so future operators can data-drive their
threshold tuning at long-context workloads where skip rate naturally
rises.

BUG FIXED IN THIS COMMIT
========================

Wiring file (`patch_N26_sparse_v_kernel.py`) used `os.environ.get()`
without importing `os`. Caused NameError during apply() → wrapper not
installed. Added `import os`. Verified via boot: 45 applied / 44 skipped /
0 failed. Sparse-V dispatcher correctly wraps upstream.

NOT ENABLED IN ANY LAUNCH SCRIPT BY DEFAULT
===========================================

PN26b is opt-in via GENESIS_ENABLE_PN26_SPARSE_V=1. Operators on different
SMs (89/90, datacenter) or larger batch sizes may see different
cost-benefit ratios. The 35B PROD launch script enables it empirically
with BLOCK_KV=4 num_warps=4 (winner from sweep) + threshold=0.005 +
DEBUG=1 for ongoing observability.

NEXT STEPS (DEFERRED TO NEXT SESSION)
=====================================

- Per-row vote design for P67 multi-query verify path (research agent
  design captured; ~4-8h implementation)
- Long-context (>32K) threshold sweep with skip-rate observability
- Per-layer threshold table (BSFA-style calibration)
- Self-Indexing KVCache paper (arXiv 2603.14224) backlog item

Sources:

- vllm#41422 (TheTom): design template, AMD MI300X validated only
- BLASST arXiv 2512.12087: λ=a/L threshold scaling formula
- TRT-LLM PR #9821 (Skip Softmax Attention): production reference
- SpargeAttn ICML 2025: RTX 3090/4090/L40 Ampere validation
- tq-kv reference (onur-gokyildiz-bhi): SM86-compatible CUDA pattern
- StreamingLLM arXiv 2309.17453: sink token protection
1 parent 09ddb96 commit 866023c
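The "lean dispatcher" idea from the commit message (read the tuning knobs once at apply() time, then keep the hot path free of environment lookups and `.item()` calls) can be sketched roughly as below. This is only an illustration: the `GENESIS_PN26_SPARSE_V_*` variable names, allowed values, and defaults come from the commit message, while `read_launch_params`, `make_dispatcher`, the keyword names passed to the kernel, and the "threshold 0 means disabled" default are assumptions, not the actual wiring code.

```python
import os

# Allowed values and defaults are taken from the commit message.
# The helper names and the kernel keyword names are hypothetical.
LAUNCH_PARAMS = {
    "block_kv": ("GENESIS_PN26_SPARSE_V_BLOCK_KV", (4, 8, 16), 4),
    "num_warps": ("GENESIS_PN26_SPARSE_V_NUM_WARPS", (1, 2, 4, 8), 4),
    "num_stages": ("GENESIS_PN26_SPARSE_V_NUM_STAGES", (1, 2, 3), 1),
}


def read_launch_params(env):
    """Parse launch params once; invalid or missing values use the default."""
    params = {}
    for key, (var, allowed, default) in LAUNCH_PARAMS.items():
        try:
            value = int(env.get(var, default))
        except ValueError:
            value = default
        params[key] = value if value in allowed else default
    return params


def make_dispatcher(forked_kernel, env=os.environ):
    """Build the wrapper at apply() time; the hot path only reads the closure."""
    params = read_launch_params(env)
    # Assumption: an unset or non-positive threshold disables the skip.
    threshold = float(env.get("GENESIS_PN26_SPARSE_V_THRESHOLD", "0"))
    sparse_v = 1 if threshold > 0.0 else 0

    def dispatch(*args, **kwargs):
        # No os.environ lookups and no tensor.item() calls here, so the
        # wrapper itself never forces a GPU/CPU synchronization.
        return forked_kernel(
            *args, sparse_v=sparse_v, threshold=threshold, **params, **kwargs
        )

    return dispatch
```

Note that the module imports `os` at the top; the missing import is the bug this commit reports fixing in the real wiring file.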

7 files changed

Lines changed: 1510 additions & 1 deletion

CHANGELOG.md

Lines changed: 60 additions & 0 deletions
@@ -143,6 +143,66 @@ loud-and-clear in the per-release notes.
 (no PluggableLayer inheritance). Our pin pre-dates #35178 by 2
 days; we are NOT vulnerable. PN27 scaffold ready when we pin-bump.
 
+### PN26b sparse-V kernel - major iteration (v5, 2026-05-01)
+
+Comprehensive deep-dive on Genesis-original sparse-V Triton kernel based
+on 4-agent research synthesis (skip-rate observability + per-row vote +
+memory profiling + 14-day community scan).
+
+**v5 design** (lean dispatcher + tuning + observability):
+
+- **Lean dispatcher** (no per-call GPU↔CPU sync; v1's `.item()` per call
+  caused -16% short-ctx + -22% long-ctx regression, REJECTED).
+- **Configurable launch params** baked at apply() time: BLOCK_KV (4/8/16),
+  num_warps (1/2/4/8), num_stages.
+- **`tl.range()` pipelining hint** (P67 v7.50 pattern, Triton compiler
+  cp.async overlap with prior-iter MMA on Ampere).
+- **Cache modifier `.cg`** on K/V dequant raw loads (L2 streaming).
+- **Sink-token protection** (StreamingLLM finding: first 4 KV positions
+  never skipped).
+- **Skip-rate observability** (NEW): per-CTA atomic int64 counters,
+  constexpr-DCE'd to zero overhead when disabled, `~50-100 ns` per CTA
+  at epilogue when enabled. Periodic logging every 500 calls so
+  operator sees real skip rate without cross-process IPC.
+- **BLASST adaptive threshold scaffold** (`λ = scale_factor / ctx_len`)
+  ready in code; default OFF until skip-rate data informs which mode
+  is better.
+
+**Empirical sweep on 35B FP8 PROD (TQ k8v4 + MTP K=3, 2× A5000 SM86)**:
+
+| BLOCK_KV | num_warps | mean | max | CV |
+|---|---|---|---|---|
+| OFF (baseline) || 175.41 | 185.15 | 4.20% |
+| 8 | 1 | 178.33 | 187.67 | 3.78% |
+| 8 | 2 | 180.36 | 190.24 | 4.70% |
+| 16 | 2 | 178.35 | 190.74 | 3.26% |
+| 8 | 4 | 183.11 | 202.38 | 5.26% |
+| 8 | 8 | 181.24 | 196.60 | 5.78% |
+| **4** | **4** | **184.89** | 194.56 | 4.63% |
+| 4 | 8 | 177.40 | 191.97 | 5.79% |
+
+Winner: **BLOCK_KV=4, num_warps=4** (baked as kernel default).
+
+**Final 35B PROD A/B (apples to apples, 100t output)**:
+
+| Config | tool-call | mean | min | max | CV |
+|---------------------|-----------|--------|-------|--------|-------|
+| Baseline (OFF) | 7/7 | 175.41 | 158.71| 185.15 | 4.20% |
+| **PN26b v5** | **7/7** | **182.30** | 153.53 | **212.24** | 7.02% |
+| Δ | match | **+3.9%** | -3.3% | **+14.7%**| +2.82pp |
+
+The `212 max` exceeds the historical 35B PROD ceiling reference (171-204
+TPS quoted from earlier sessions). Tool-call quality preserved (7/7).
+Sustained 50-request load: 0 errors, p50=181, p90=197, p99=211. VRAM
+delta +142 MiB (acceptable, no leak).
+
+**Caveat**: skip rate at threshold=0.005 is empirically very low on our
+short-output workload (most TPS gain comes from kernel restructuring,
+not the skip itself). Skip-rate counter scaffold ships so future
+operators can data-drive their threshold tuning. Long-context (>16K
+input) deeper sweep deferred to next session; it needs a
+sustained-context workload to characterize properly.
+
 ### Bench results - `v7.65` PROD eligibility
 
 35B FP8 DFlash 160K (TP=2 + DFlash spec K=3 + PN22+PN23+PN24):
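For readers who want to reproduce the summary columns used in the tables above (mean, min, max, CV, p50/p90/p99) from their own per-request TPS samples, a minimal sketch follows. The bench harness itself is not part of this diff, so the function name `summarize`, the use of the sample standard deviation for CV, and the nearest-rank percentile rule are all assumptions rather than the harness's actual definitions.

```python
import math
import statistics


def summarize(tps_samples):
    """Summarize per-request TPS samples (needs at least two samples)."""
    ordered = sorted(tps_samples)
    mean = statistics.fmean(ordered)
    # Assumption: CV is the sample standard deviation over the mean, in percent.
    cv_pct = statistics.stdev(ordered) / mean * 100.0

    def percentile(p):
        # Assumption: nearest-rank percentile (1-based rank, rounded up).
        rank = max(1, math.ceil(p / 100.0 * len(ordered)))
        return ordered[rank - 1]

    return {
        "mean": mean,
        "min": ordered[0],
        "max": ordered[-1],
        "cv_pct": cv_pct,
        "p50": percentile(50),
        "p90": percentile(90),
        "p99": percentile(99),
    }
```

Comparing two runs then comes down to calling `summarize` on each sample list and differencing the fields; CV deltas are naturally reported in percentage points, as in the table.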

scripts/start_35b_fp8_PROD.sh

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ docker run -d \
   -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
   -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
   -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
-  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
+  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_PN26_SPARSE_V=1 -e GENESIS_PN26_SPARSE_V_THRESHOLD=0.005 -e GENESIS_PN26_SPARSE_V_BLOCK_KV=4 -e GENESIS_PN26_SPARSE_V_NUM_WARPS=4 -e GENESIS_PN26_SPARSE_V_DEBUG=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
   vllm/vllm-openai:nightly -c \
   "set -e; echo \"=== v775 35B baseline upstream P67 (matches v759 PROD) ===\"; \
   pip install --quiet --disable-pip-version-check pandas scipy xxhash; \
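The launch script turns on `GENESIS_PN26_SPARSE_V_DEBUG=1`, which per the commit message enables periodic skip-rate logging (every N calls, default 500, reporting lifetime and per-launch rates). The host-side bookkeeping could look something like the sketch below. The class name `SkipRateMonitor`, the logger name, and the assumption that the per-launch skipped/total tile counts arrive as plain integers are all hypothetical; how the real code reads the per-CTA atomic counters back from the GPU is not shown in this diff.

```python
import logging

logger = logging.getLogger("genesis.pn26.sparse_v")  # hypothetical logger name


class SkipRateMonitor:
    """Accumulate tile-skip counts and log a summary every `log_every` calls."""

    def __init__(self, log_every=500):
        self.log_every = log_every
        self.calls = 0
        self.lifetime_skipped = 0
        self.lifetime_total = 0

    @property
    def lifetime_rate(self):
        if self.lifetime_total == 0:
            return 0.0
        return self.lifetime_skipped / self.lifetime_total

    def record(self, skipped, total):
        """Record one launch; return True if a summary line was logged."""
        self.calls += 1
        self.lifetime_skipped += skipped
        self.lifetime_total += total
        if self.calls % self.log_every != 0:
            return False
        launch_rate = skipped / total if total else 0.0
        logger.info(
            "sparse-V skip rate after %d calls: launch=%.4f lifetime=%.4f",
            self.calls, launch_rate, self.lifetime_rate,
        )
        return True
```

Logging only every N calls keeps the reporting cost off most launches, which fits the commit's goal of avoiding per-call host work.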

vllm/_genesis/dispatcher.py

Lines changed: 27 additions & 0 deletions
@@ -613,6 +613,33 @@ class ValidationIssue:
         "conflicts_with": [],
         "requires_patches": [],
     },
+    "PN26b": {
+        "title": "Sparse-V tile-skip Genesis kernel (BLASST λ=a/L for SM86)",
+        "env_flag": "GENESIS_ENABLE_PN26_SPARSE_V",
+        "default_on": False,
+        "category": "perf_hotfix",
+        "credit": (
+            "Genesis-original Triton kernel fork - first sparse-V tile-skip "
+            "deployed for SM86 (Ampere consumer). Synthesized from 4-agent "
+            "research 2026-05-01: vllm#41422 (TheTom, AMD-only validated) "
+            "design template + BLASST arXiv 2512.12087 (Yuan et al. Dec 2025) "
+            "λ=a/L threshold formula + tq-kv reference (CUDA, SM86-compatible) "
+            "acc*re_scale skip semantics + StreamingLLM (arXiv 2309.17453) "
+            "sink token protection (first 4 KV positions never skipped). "
+            "Mechanism: when tl.max(p) < threshold for a KV tile, skip V load + "
+            "dequant + weighted sum, just decay accumulator. Online softmax "
+            "denominator/max still update so totals stay numerically exact "
+            "for non-skipped tiles. Composes with PN26 main (centroids "
+            "prebake) + P98 (workspace revert) + P67 (multi-query - separate "
+            "code path, not affected). Default OFF; opt-in via "
+            "GENESIS_ENABLE_PN26_SPARSE_V=1 + GENESIS_PN26_SPARSE_V_THRESHOLD "
+            "(fixed) OR GENESIS_PN26_SPARSE_V_SCALE_FACTOR (BLASST adaptive)."
+        ),
+        "upstream_pr": 41422,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],
+    },
     "PN27": {
         "title": "Revert MoERunnerInterface PluggableLayer (vllm#41440)",
         "env_flag": "GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE",

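The skip mechanism summarized in the PN26b `credit` string of `vllm/_genesis/dispatcher.py` (skip the V contribution of a KV tile when its largest softmax weight is below the threshold, keep updating the running max and denominator, never skip the sink tokens) can be illustrated with a plain-Python reference for a single query. This is a didactic sketch, not the Triton kernel: `sparse_v_decode`, `blasst_threshold`, and the choice to protect any tile that starts inside the first `SINK_TOKENS` positions are assumptions made for the example, and no quantization or dequantization is modelled.

```python
import math

SINK_TOKENS = 4  # from the commit: the first 4 KV positions are never skipped


def blasst_threshold(scale_factor, ctx_len):
    """Adaptive threshold λ = a / L (scale_factor over context length)."""
    return scale_factor / max(ctx_len, 1)


def sparse_v_decode(q, keys, values, block_kv=4, threshold=0.0):
    """Single-query attention over KV tiles; returns (output, skipped_tiles)."""
    scale = 1.0 / math.sqrt(len(q))
    m = -math.inf                   # running max of the scores
    denom = 0.0                     # running softmax denominator
    acc = [0.0] * len(values[0])    # running weighted sum of V rows
    skipped = 0
    for start in range(0, len(keys), block_kv):
        tile_k = keys[start:start + block_kv]
        scores = [sum(a * b for a, b in zip(k, q)) * scale for k in tile_k]
        m_new = max(m, max(scores))
        p = [math.exp(s - m_new) for s in scores]
        re_scale = math.exp(m - m_new)  # 0.0 on the first tile (m is -inf)
        # Max, denominator and accumulator decay update for every tile.
        denom = denom * re_scale + sum(p)
        acc = [a * re_scale for a in acc]
        m = m_new
        protected = start < SINK_TOKENS  # assumption: tile-level sink check
        if threshold > 0.0 and not protected and max(p) < threshold:
            skipped += 1  # skip the V read and the weighted sum for this tile
            continue
        for w, v in zip(p, values[start:start + block_kv]):
            acc = [a + w * x for a, x in zip(acc, v)]
    return [a / denom for a in acc], skipped
```

With `threshold=0.0` the loop reduces to ordinary online-softmax attention; with a positive threshold the output differs from the dense result only by the (small) weights of the skipped tiles.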