Sandermage
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 60 additions & 0 deletions
diff --git a/scripts/start_35b_fp8_PROD.sh b/scripts/start_35b_fp8_PROD.sh
Lines changed: 1 addition & 1 deletion
diff --git a/vllm/_genesis/dispatcher.py b/vllm/_genesis/dispatcher.py
Lines changed: 27 additions & 0 deletions
@@ -143,6 +143,66 @@ loud-and-clear in the per-release notes.
   (no PluggableLayer inheritance). Our pin pre-dates #35178 by 2
   days; we are NOT vulnerable. PN27 scaffold ready when we pin-bump.
 
+### PN26b sparse-V kernel — major iteration (v5, 2026-05-01)
+
+A deep-dive iteration on the Genesis-original sparse-V Triton kernel, based
+on a 4-agent research synthesis (skip-rate observability, per-row vote,
+memory profiling, and a 14-day community scan).
+
+**v5 design** (lean dispatcher + tuning + observability):
+
+- **Lean dispatcher**: no per-call GPU↔CPU sync. v1's per-call `.item()`
+  caused regressions of -16% (short-ctx) and -22% (long-ctx) and was REJECTED.
+- **Configurable launch params** baked at apply() time: BLOCK_KV (4/8/16),
+  num_warps (1/2/4/8), num_stages.
+- **`tl.range()` pipelining hint** (P67 v7.50 pattern; lets the Triton compiler
+  overlap cp.async loads with the prior iteration's MMA on Ampere).
+- **Cache modifier `.cg`** on K/V dequant raw loads (L2 streaming).
+- **Sink-token protection** (StreamingLLM finding): the first 4 KV positions
+  are never skipped.
+- **Skip-rate observability** (NEW): per-CTA atomic int64 counters,
+  constexpr-DCE'd to zero overhead when disabled, ~50-100 ns per CTA
+  at epilogue when enabled. Counters are logged every 500 calls so the
+  operator sees the real skip rate without cross-process IPC.
+- **BLASST adaptive threshold scaffold** (`λ = scale_factor / ctx_len`)
+  ready in code; default OFF until skip-rate data shows whether the fixed
+  or the adaptive threshold mode performs better.
+
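The two threshold modes and the sink-token rule in the list above can be illustrated with a minimal host-side sketch. This is not the Triton kernel: the function names and the example `scale_factor` are hypothetical, and only the formula `λ = scale_factor / ctx_len` and the "first 4 KV positions" rule come from the notes.

```python
NUM_SINK_TOKENS = 4  # from the notes: first 4 KV positions are never skipped


def blasst_threshold(scale_factor: float, ctx_len: int) -> float:
    """Adaptive threshold: lambda = scale_factor / ctx_len."""
    if ctx_len <= 0:
        raise ValueError("ctx_len must be positive")
    return scale_factor / ctx_len


def tile_may_skip(tile_start: int, p_max: float, threshold: float,
                  num_sink: int = NUM_SINK_TOKENS) -> bool:
    """Return True if the KV tile starting at tile_start may be skipped.

    p_max is the largest softmax weight in the tile (tl.max(p) in the
    kernel). A tile that overlaps the sink positions is always kept.
    """
    if tile_start < num_sink:
        return False
    return p_max < threshold


# Hypothetical scale factor: 80 / 16000 gives the fixed value 0.005
# used in the PROD script, so both modes agree at a 16K context.
lam = blasst_threshold(80.0, 16000)
print(lam, tile_may_skip(8, 0.001, lam))
```

With a fixed `scale_factor`, the adaptive threshold shrinks as the context grows, so it is stricter on long contexts than the fixed mode.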
+**Empirical sweep on 35B FP8 PROD (TQ k8v4 + MTP K=3, 2× A5000 SM86)**:
+
+| BLOCK_KV | num_warps | mean (TPS) | max (TPS) | CV |
+|---|---|---|---|---|
+| OFF (baseline) | — | 175.41 | 185.15 | 4.20% |
+| 8 | 1 | 178.33 | 187.67 | 3.78% |
+| 8 | 2 | 180.36 | 190.24 | 4.70% |
+| 16 | 2 | 178.35 | 190.74 | 3.26% |
+| 8 | 4 | 183.11 | 202.38 | 5.26% |
+| 8 | 8 | 181.24 | 196.60 | 5.78% |
+| **4** | **4** | **184.89** | 194.56 | 4.63% |
+| 4 | 8 | 177.40 | 191.97 | 5.79% |
+
+Winner: **BLOCK_KV=4, num_warps=4** (baked as kernel default).
+
+**Final 35B PROD A/B (like-for-like, 100-token output)**:
+
+| Config              | tool-call | mean (TPS) | min (TPS) | max (TPS) | CV    |
+|---------------------|-----------|--------|-------|--------|-------|
+| Baseline (OFF)      | 7/7       | 175.41 | 158.71| 185.15 | 4.20% |
+| **PN26b v5**        | **7/7**   | **182.30** | 153.53 | **212.24** | 7.02% |
+| Δ                   | match     | **+3.9%** | -3.3% | **+14.6%** ⭐ | +2.82pp |
+
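The Δ row can be recomputed from the two rows above it. The sketch below uses only the figures printed in the table; the helper name `pct_delta` is illustrative.

```python
# Values copied from the A/B table (TPS, except CV in percent).
baseline = {"mean": 175.41, "min": 158.71, "max": 185.15, "cv": 4.20}
pn26b = {"mean": 182.30, "min": 153.53, "max": 212.24, "cv": 7.02}


def pct_delta(new: float, old: float) -> float:
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


deltas = {k: pct_delta(pn26b[k], baseline[k]) for k in ("mean", "min", "max")}
cv_delta_pp = pn26b["cv"] - baseline["cv"]  # CV is compared in percentage points

for name, value in deltas.items():
    print(f"{name}: {value:+.1f}%")
print(f"CV: {cv_delta_pp:+.2f}pp")
```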
+The 212.24 TPS max exceeds the historical 35B PROD ceiling reference (171-204
+TPS quoted from earlier sessions). Tool-call quality is preserved (7/7).
+Sustained 50-request load: 0 errors, p50=181, p90=197, p99=211. VRAM
+delta: +142 MiB (acceptable, no leak).
+
+**Caveat**: the skip rate at threshold=0.005 is empirically very low on our
+short-output workload, so most of the TPS gain comes from kernel
+restructuring, not from the skip itself. The skip-rate counter scaffold ships
+so future operators can tune their threshold from measured data. A deeper
+long-context (>16K input) sweep is deferred to the next session; it needs a
+sustained-context workload to characterize properly.
+
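The skip-rate counter cadence described in these notes (accumulate counts, log every 500 calls) can be modeled on the host as follows. This is a sketch under assumptions: the class and logger names are hypothetical, and the real counters are per-CTA GPU atomics rather than Python integers.

```python
import logging
from typing import Optional

logger = logging.getLogger("pn26b.sketch")  # hypothetical logger name


class SkipRateMonitor:
    """Accumulates tile counts and reports the skip rate periodically."""

    def __init__(self, log_every: int = 500) -> None:
        self.log_every = log_every
        self.calls = 0
        self.tiles_total = 0
        self.tiles_skipped = 0

    def skip_rate(self) -> float:
        if self.tiles_total == 0:
            return 0.0
        return self.tiles_skipped / self.tiles_total

    def record(self, tiles_total: int, tiles_skipped: int) -> Optional[float]:
        """Add one call's counts; return the rate only on logging calls."""
        self.calls += 1
        self.tiles_total += tiles_total
        self.tiles_skipped += tiles_skipped
        if self.calls % self.log_every != 0:
            return None
        rate = self.skip_rate()
        logger.info("sparse-V skip rate after %d calls: %.2f%%",
                    self.calls, 100.0 * rate)
        return rate
```

Reading the counters only on every Nth call keeps the hot path free of the per-call synchronization that was rejected in v1.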
 ### Bench results — `v7.65` PROD eligibility
 
 35B FP8 DFlash 160K (TP=2 + DFlash spec K=3 + PN22+PN23+PN24):
 
@@ -47,7 +47,7 @@ docker run -d \
   -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
   -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
   -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
-  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
+  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_PN26_SPARSE_V=1 -e GENESIS_PN26_SPARSE_V_THRESHOLD=0.005 -e GENESIS_PN26_SPARSE_V_BLOCK_KV=4 -e GENESIS_PN26_SPARSE_V_NUM_WARPS=4 -e GENESIS_PN26_SPARSE_V_DEBUG=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
   vllm/vllm-openai:nightly -c \
   "set -e; echo \"=== v775 35B baseline upstream P67 (matches v759 PROD) ===\"; \
 pip install --quiet --disable-pip-version-check pandas scipy xxhash; \
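The PN26 flags added to the `docker run` line above could be read and validated on the Python side roughly as follows. The environment variable names and the allowed BLOCK_KV / num_warps values come from this change; the `SparseVConfig` type, the loader function, and the fallback defaults for the threshold and debug flag are assumptions for illustration, not the dispatcher's actual code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ALLOWED_BLOCK_KV = (4, 8, 16)      # values listed in the changelog
ALLOWED_NUM_WARPS = (1, 2, 4, 8)   # values listed in the changelog


@dataclass(frozen=True)
class SparseVConfig:  # hypothetical container
    enabled: bool
    threshold: float
    block_kv: int
    num_warps: int
    debug: bool


def load_sparse_v_config(env: Optional[Mapping[str, str]] = None) -> SparseVConfig:
    """Parse PN26 sparse-V settings, rejecting unsupported launch params."""
    env = os.environ if env is None else env
    block_kv = int(env.get("GENESIS_PN26_SPARSE_V_BLOCK_KV", "4"))
    num_warps = int(env.get("GENESIS_PN26_SPARSE_V_NUM_WARPS", "4"))
    if block_kv not in ALLOWED_BLOCK_KV:
        raise ValueError(f"BLOCK_KV must be one of {ALLOWED_BLOCK_KV}")
    if num_warps not in ALLOWED_NUM_WARPS:
        raise ValueError(f"num_warps must be one of {ALLOWED_NUM_WARPS}")
    return SparseVConfig(
        enabled=env.get("GENESIS_ENABLE_PN26_SPARSE_V", "0") == "1",
        # Assumed fallback: the value used in the PROD script.
        threshold=float(env.get("GENESIS_PN26_SPARSE_V_THRESHOLD", "0.005")),
        block_kv=block_kv,
        num_warps=num_warps,
        debug=env.get("GENESIS_PN26_SPARSE_V_DEBUG", "0") == "1",
    )


print(load_sparse_v_config({"GENESIS_ENABLE_PN26_SPARSE_V": "1"}))
```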
 
@@ -613,6 +613,33 @@ class ValidationIssue:
         "conflicts_with": [],
         "requires_patches": [],
     },
+    "PN26b": {
+        "title": "Sparse-V tile-skip Genesis kernel (BLASST λ=a/L for SM86)",
+        "env_flag": "GENESIS_ENABLE_PN26_SPARSE_V",
+        "default_on": False,
+        "category": "perf_hotfix",
+        "credit": (
+            "Genesis-original Triton kernel fork — first sparse-V tile-skip "
+            "deployed for SM86 (Ampere consumer). Synthesized from 4-agent "
+            "research 2026-05-01: vllm#41422 (TheTom, AMD-only validated) "
+            "design template + BLASST arXiv 2512.12087 (Yuan et al. Dec 2025) "
+            "λ=a/L threshold formula + tq-kv reference (CUDA, SM86-compatible) "
+            "acc*re_scale skip semantics + StreamingLLM (arXiv 2309.17453) "
+            "sink token protection (first 4 KV positions never skipped). "
+            "Mechanism: when tl.max(p) < threshold for a KV tile, skip V load + "
+            "dequant + weighted sum, just decay accumulator. Online softmax "
+            "denominator/max still update so totals stay numerically exact "
+            "for non-skipped tiles. Composes with PN26 main (centroids "
+            "prebake) + P98 (workspace revert) + P67 (multi-query — separate "
+            "code path, not affected). Default OFF; opt-in via "
+            "GENESIS_ENABLE_PN26_SPARSE_V=1 + GENESIS_PN26_SPARSE_V_THRESHOLD "
+            "(fixed) OR GENESIS_PN26_SPARSE_V_SCALE_FACTOR (BLASST adaptive)."
+        ),
+        "upstream_pr": 41422,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],
+    },
     "PN27": {
         "title": "Revert MoERunnerInterface PluggableLayer (vllm#41440)",
         "env_flag": "GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE",
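The skip mechanism described in the PN26b `credit` string (skip the V load and weighted sum when the tile's largest weight is below the threshold, while still updating the online-softmax max and denominator) can be written as a pure-Python reference for a single query. This is an illustrative sketch, not the Triton kernel; the function name and the list-based layout are assumptions.

```python
import math


def sparse_v_attention(scores, values, block_kv=4, threshold=0.005, num_sink=4):
    """Online-softmax attention for one query with sparse-V tile skipping.

    scores: pre-softmax logits, one per KV position.
    values: one vector (list of floats) per KV position.
    Returns (output vector, number of skipped tiles).
    """
    dim = len(values[0])
    m = -math.inf        # running max of the logits
    denom = 0.0          # running softmax denominator
    acc = [0.0] * dim    # running weighted sum of V
    skipped = 0
    for start in range(0, len(scores), block_kv):
        s_tile = scores[start:start + block_kv]
        m_new = max(m, max(s_tile))
        re_scale = math.exp(m - m_new)   # exp(-inf) == 0.0 on the first tile
        p = [math.exp(s - m_new) for s in s_tile]
        # Max, denominator and accumulator decay are always updated.
        denom = denom * re_scale + sum(p)
        acc = [a * re_scale for a in acc]
        m = m_new
        # Sink tiles are never skipped; others skip when all weights are tiny.
        if start >= num_sink and max(p) < threshold:
            skipped += 1
            continue
        for p_j, v_j in zip(p, values[start:start + block_kv]):
            for d in range(dim):
                acc[d] += p_j * v_j[d]
    return [a / denom for a in acc], skipped


scores = [0.0] * 4 + [-20.0] * 4 + [0.0] * 4
values = [[1.0]] * 4 + [[100.0]] * 4 + [[2.0]] * 4
sparse_out, n_skipped = sparse_v_attention(scores, values)
dense_out, _ = sparse_v_attention(scores, values, threshold=0.0)
print(sparse_out, dense_out, n_skipped)
```

Setting `threshold=0.0` disables skipping, which gives a dense baseline for checking how much the skipped tiles would have contributed.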