v7.65: PN26 unified TQ perf pack + A2 P68/P69 threshold default

Sandermage · Sandermage · commit d73fa9d1857d · 2026-05-01T11:36:04.000+03:00
PN26 - unified TurboQuant perf backport
========================================

Combines three OPEN upstream PRs from jasonkim8652 (#41418 / #41422 /
#41414) into a single Genesis-original opt-in patch that takes the
strengths and drops the weaknesses:

**Taken from #41418**: pre-baked Lloyd-Max centroid tables for the 3
(d, bits) shapes our PROD actually uses:
  - (128, 4) (k=4 turboquant_4bit_nc)
  - (128, 8) (k=8 turboquant_k8v4; most expensive solver, 4.6s on cold boot)
  - (128, 3) (k=3 turboquant_3bit_nc)

Empirical on live container after warmup:
  - (128, 8): 0.018ms vs 4583.9ms solver = 259,812x speedup
  - (128, 4): 3.7us vs 287.9ms solver = 77,600x speedup
  - (128, 3): 0.005ms vs solver = drop-in win

**Genesis defensive addition vs upstream**: at first use, runs a
self-check that asserts prebaked == solver for (128, 4).
  - On drift > 1e-3 (real algorithm change in upstream Lloyd-Max),
    auto-disables prebake and falls through to runtime solver with a
    WARNING.
  - On 1e-6 drift (round-noise from int/1e10 encoding), logs INFO and
    keeps prebake.
Threshold gates against silent staleness without false-positives on
encoding rounding.

**Taken from #41422 (scaffold-only)**: sparse V tile-skip kernel
modification. Author validated on AMD MI300X only; NVIDIA Ampere
correctness needs empirical confirmation before promoting. Ships as
OFF-by-default scaffold gated by GENESIS_ENABLE_PN26_SPARSE_V=1 sub-flag;
actual kernel wiring deferred to next iteration after correctness
baseline.

**Dropped from #41414**: head_dim power-of-2 padding. Qwen3.6
head_dim=128 is already pow-2; the patch would add a runtime branch
(`needs_padding`) that is dead code on our model. Revisit if we ever
migrate to head_dim=80 (Phi-2) or similar non-pow-2 model.

Status: opt-in via GENESIS_ENABLE_PN26_TQ_UNIFIED=1. Default OFF.
Composes with P67/P98/PN8 (orthogonal code paths). Sub-flag
GENESIS_ENABLE_PN26_SPARSE_V=1 reserved for future kernel wiring.

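The drift gate described for the PN26 self-check can be sketched roughly as below. This is an illustrative sketch only, not the code in `patch_N26_tq_unified_perf.py`: the function names, the logger name, and the `solve` callable are hypothetical, and the handling of drift values between 1e-6 and 1e-3 (logged at INFO, prebake kept) is an assumption, since the message only names the two thresholds.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("pn26_selfcheck_sketch")  # hypothetical logger name

HARD_DRIFT = 1e-3   # from the commit text: above this, prebake is disabled
NOISE_DRIFT = 1e-6  # from the commit text: scale of int/1e10 encoding round-noise


def max_abs_drift(prebaked: Sequence[float], solved: Sequence[float]) -> float:
    """Largest element-wise absolute difference; inf if the shapes disagree."""
    if len(prebaked) != len(solved):
        return float("inf")
    return max((abs(p - s) for p, s in zip(prebaked, solved)), default=0.0)


def prebake_is_trusted(
    prebaked: Sequence[float],
    solve: Callable[[], Sequence[float]],
) -> bool:
    """Return True to keep the prebaked table, False to fall back to the solver."""
    drift = max_abs_drift(prebaked, solve())
    if drift > HARD_DRIFT:
        # Treated as a real algorithm change: do not serve stale centroids.
        logger.warning("prebake drift %.3e > %.0e; using runtime solver",
                       drift, HARD_DRIFT)
        return False
    if drift > 0.0:
        # Assumption: any nonzero drift up to HARD_DRIFT is treated as noise.
        logger.info("prebake drift %.3e (noise scale ~%.0e); keeping prebake",
                    drift, NOISE_DRIFT)
    return True
```

A caller would run this once per process for the (128, 4) shape and cache the boolean, so the solver cost is paid only when the check itself runs.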
A2 - P68/P69 long-context threshold default 8000 → 50000 chars
================================================================

Issue #9 (Sander 2026-04-XX) flagged that the 8000-char default (~2K
tokens) was too aggressive: it triggered P68 force-tool-choice and P69
explicit-format-reminder on routine IDE-agent flows that aren't genuinely
long-context. New default 50000 chars (~12.5K tokens) keeps the behavior
for genuine long histories while leaving casual flows alone.

Code default already at 50000; this commit:
  - updates apply_all.py docstring (was stale "8000 chars ~= 2K tok")
  - updates 6 active launch scripts to override 8000→50000 explicitly
    (remaining _archive scripts left at 8000 for historical bench
    reproducibility)

PN9 self-retire confirmed correct
==================================

27B PROD boot showed 1 partial-apply warning (Cliff 8 hardening working
correctly): PN9 detected as obsolete via drift marker
'spec_cfg.attention_backend' present in llm_base_proposer.py.

Manually verified upstream PR #39930 is fully in our pin (7a1eb8ac2):
both the always-reset attention_backend logic AND the DFlashProposer
subclass override (`use_non_causal=True`). Upstream fix is a strict
superset of our partial backport. Self-retire is the correct action.

Bench validation
================

35B DFlash 160K boot: 44 applied, 43 skipped, 0 failed, 0 partial-apply
warnings. tool-call 5/7 (variance band).

27B TQ k8v4 + MTP K=3 boot: 54 applied, 33 skipped, 0 failed, 1
partial-apply warning (PN9 self-retire, verified correct above).
  - tool-call 7/7
  - prose 256t mean 88.39 TPS, CV 2.59%
  - code 512t mean 104.25 TPS, CV 0.20%

#41190 stress test (TP=2 + spec-decode + cudaErrorIllegalAddress)
==================================================================

Tested on 35B DFlash 160K (TP=2 + DFlash spec K=3) per noonghunna's
report: 5 concurrent + 30 sequential rapid-fire chat completions. Zero
`cudaError`, zero `illegal memory access`, zero `watchdog` events.

Our stack NOT vulnerable. Differences:
  - They used QuantTrio AWQ (online-quant), we use FP8 (offline)
  - Their pin built off PR #40898 head (WIP), our pin on main
  - Possibly P58 (async scheduler placeholder) or P60 (GDN+ngram) defends
    against the codepath

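The request pattern of the #41190 stress test (5 concurrent, then 30 sequential) could be reproduced with a small harness along these lines. This is a hedged sketch, not the script used for this commit: `run_stress` and `send` are hypothetical names, `send` stands in for whatever issues one chat completion and returns text worth scanning (response body or error string), and scanning that text for the three signatures is an assumption about how the "zero events" count was obtained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Error signatures named in the commit text.
ERROR_SIGNATURES = ("cudaError", "illegal memory access", "watchdog")


def _safe_send(send: Callable[[int], str], index: int) -> str:
    """Call send(index); turn any exception into text so it is still scanned."""
    try:
        return send(index)
    except Exception as exc:  # a crashed request is a finding, not a harness failure
        return repr(exc)


def run_stress(
    send: Callable[[int], str],
    concurrent: int = 5,
    sequential: int = 30,
) -> Dict[str, int]:
    """Fire `concurrent` parallel requests, then `sequential` back-to-back ones."""
    outputs = []
    with ThreadPoolExecutor(max_workers=concurrent) as pool:
        outputs.extend(pool.map(lambda i: _safe_send(send, i), range(concurrent)))
    for i in range(sequential):
        outputs.append(_safe_send(send, concurrent + i))

    # Count occurrences of each signature across everything that came back.
    totals = {sig: 0 for sig in ERROR_SIGNATURES}
    for text in outputs:
        for sig in ERROR_SIGNATURES:
            totals[sig] += text.count(sig)
    totals["requests"] = len(outputs)
    return totals
```

In practice `send` would wrap an HTTP call to the running server, and the server-side logs should be scanned for the same signatures, since a CUDA fault may never surface in a response body.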
diff --git a/scripts/launch/bare_metal_35b_fp8_PROD.sh b/scripts/launch/bare_metal_35b_fp8_PROD.sh
@@ -86,7 +86,7 @@ export GENESIS_P67_NUM_KV_SPLITS=32
 export GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1
 export GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1
 export GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1
-export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000
+export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000
 export GENESIS_ENABLE_P37=1
 export GENESIS_TQ_MAX_MODEL_LEN=320000
 export GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1
diff --git a/scripts/launch/bare_metal_35b_fp8_PROD_single_card.sh b/scripts/launch/bare_metal_35b_fp8_PROD_single_card.sh
@@ -117,7 +117,7 @@ export GENESIS_P67_NUM_KV_SPLITS=32
 export GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1
 export GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1
 export GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1
-export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000
+export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000
 export GENESIS_ENABLE_P37=1
 export GENESIS_TQ_MAX_MODEL_LEN=320000
 export GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1
diff --git a/scripts/launch/start_35b_fp8_PROD.sh b/scripts/launch/start_35b_fp8_PROD.sh
@@ -44,7 +44,7 @@ docker run -d \
   -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
   -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
   -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
-  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
   -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
   -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
   -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
diff --git a/scripts/launch/start_35b_fp8_PROD_single_card.sh b/scripts/launch/start_35b_fp8_PROD_single_card.sh
@@ -75,7 +75,7 @@ docker run -d \
   -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
   -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
   -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
-  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
   -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
   -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
   -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
diff --git a/scripts/start_35b_fp8_DFLASH.sh b/scripts/start_35b_fp8_DFLASH.sh
@@ -44,7 +44,7 @@ docker run -d \
   -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
   -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
   -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
-  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
   -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
   -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
   -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 -e GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
diff --git a/scripts/start_35b_fp8_PROD.sh b/scripts/start_35b_fp8_PROD.sh
@@ -44,7 +44,7 @@ docker run -d \
   -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
   -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
   -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
-  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+  -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
   -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
   -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
   -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \
diff --git a/vllm/_genesis/dispatcher.py b/vllm/_genesis/dispatcher.py
@@ -613,6 +613,31 @@ class ValidationIssue:
         "conflicts_with": [],
         "requires_patches": [],
     },
+    "PN26": {
+        "title": "TQ unified perf pack (centroids prebake + sparse V scaffold)",
+        "env_flag": "GENESIS_ENABLE_PN26_TQ_UNIFIED",
+        "default_on": False,
+        "category": "perf_hotfix",
+        "credit": (
+            "Genesis-original 2026-05-01 unification of three OPEN upstream "
+            "PRs (jasonkim8652): #41418 pre-baked Lloyd-Max centroids (drop-in "
+            "safe, eliminates 50ms-2.5s JIT solver per shape on cold boot); "
+            "#41422 sparse V tile-skip in decode kernel (scaffolded, OFF by "
+            "default until NVIDIA Ampere correctness validation — author "
+            "validated AMD MI300X only); #41414 head_dim pow-2 padding "
+            "DROPPED — Qwen3.6 head_dim=128 already pow-2, would add dead "
+            "code overhead. Genesis defensive addition: self-check at "
+            "module-init asserts prebaked centroids equal solver output; on "
+            "drift (e.g. upstream changes Lloyd-Max algo) auto-disables "
+            "prebake and falls through to runtime solver with WARNING. No "
+            "silent staleness. Composes with P67/P98/PN8 — orthogonal code "
+            "paths."
+        ),
+        "upstream_pr": 41418,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],
+    },
     "PN25": {
         "title": "SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path)",
         "env_flag": "GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE",
diff --git a/vllm/_genesis/patches/apply_all.py b/vllm/_genesis/patches/apply_all.py
@@ -1100,7 +1100,9 @@ def apply_patch_68_69_long_ctx_tool_adherence() -> PatchResult:
       P69: append explicit format reminder to last user message
 
     Both env-flag opt-in. No-op when disabled. Threshold configurable via
-    GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS (default 8000 chars ~= 2K tok).
+    GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS (default 50000 chars ~= 12.5K
+    tok; raised from 8000 in v7.65 per Issue #9 — old default was too
+    aggressive and triggered on routine tool-call flows).
 
     Status:
       - GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage P68
@@ -2015,6 +2017,50 @@ def apply_patch_N12_ffn_intermediate_pool() -> PatchResult:
     return _failed(name, reason)
 
 
+@register_patch(
+    "PN26 TQ unified perf pack (centroids prebake + sparse V scaffold)"
+)
+def apply_patch_N26_tq_unified_perf() -> PatchResult:
+    """Patch N26: unified backport of three OPEN upstream PRs touching the
+    TurboQuant code path (#41418 + #41422 + #41414).
+
+    Combines the strengths and drops the weaknesses:
+
+    - **From #41418** (centroids prebake): drop-in safe, eliminates
+      50ms-2.5s JIT solver run on the first request per (d, bits) shape.
+      Genesis defensive addition: at first use, asserts prebaked == solver
+      to catch drift if upstream Lloyd-Max algorithm changes; auto-falls
+      back to runtime solver on mismatch.
+
+    - **From #41422** (sparse V tile-skip): kernel modification to skip V
+      load + dequant on tiles where softmax probability max is below a
+      threshold. Author validated on AMD MI300X only — we ship as
+      OFF-by-default scaffold; sub-flag GENESIS_ENABLE_PN26_SPARSE_V=1
+      acknowledges operator opt-in but actual kernel wiring is deferred
+      to next iteration after NVIDIA Ampere correctness baseline.
+
+    - **DROPPED from #41414** (head_dim power-of-2 padding): Qwen3.6
+      head_dim=128 is already a power of 2; the patch would add a
+      runtime branch (`needs_padding`) that is dead code on our model.
+
+    Status: opt-in via GENESIS_ENABLE_PN26_TQ_UNIFIED=1. Default OFF.
+    Composes with P67/P98/PN8 — orthogonal code paths.
+    """
+    name = "PN26 TQ unified perf pack (centroids prebake + sparse V scaffold)"
+    if not _APPLY_MODE:
+        return _applied(name, "dry-run: text-patch ready")
+    try:
+        from vllm._genesis.wiring.perf_hotfix import patch_N26_tq_unified_perf
+    except Exception as e:
+        return _failed(name, f"wiring import failed: {e}")
+    status, reason = patch_N26_tq_unified_perf.apply()
+    if status == "applied":
+        return _applied(name, reason)
+    if status == "skipped":
+        return _skipped(name, reason)
+    return _failed(name, reason)
+
+
 @register_patch(
     "PN25 SiluAndMul.forward_native opaque-op pool "
     "(Cliff 1 mech B compile-path companion to PN12)"
diff --git a/vllm/_genesis/wiring/perf_hotfix/patch_N26_tq_unified_perf.py b/vllm/_genesis/wiring/perf_hotfix/patch_N26_tq_unified_perf.py