Commit d73fa9d

Sandermage committed
v7.65: PN26 unified TQ perf pack + A2 P68/P69 threshold default
PN26: unified TurboQuant perf backport
========================================

Combines three OPEN upstream PRs from jasonkim8652 (#41418 / #41422 /
#41414) into a single Genesis-original opt-in patch that takes the
strengths and drops the weaknesses:

**Taken from #41418**: pre-baked Lloyd-Max centroid tables for the 3
(d, bits) shapes our PROD actually uses:
  - (128, 4) (k=4 turboquant_4bit_nc)
  - (128, 8) (k=8 turboquant_k8v4; most expensive solver, 4.6s on cold boot)
  - (128, 3) (k=3 turboquant_3bit_nc)

Empirical on live container after warmup:
  - (128, 8): 0.018ms vs 4583.9ms solver = 259,812x speedup
  - (128, 4): 3.7us vs 287.9ms solver = 77,600x speedup
  - (128, 3): 0.005ms vs solver = drop-in win

**Genesis defensive addition vs upstream**: at first use, runs a
self-check that asserts prebaked == solver for (128, 4). On drift > 1e-3
(real algorithm change in upstream Lloyd-Max), auto-disables prebake and
falls through to runtime solver with a WARNING. On 1e-6 drift
(round-noise from int/1e10 encoding), logs INFO and keeps prebake. The
threshold gates against silent staleness without false positives on
encoding rounding.

**Taken from #41422 (scaffold-only)**: sparse V tile-skip kernel
modification. Author validated on AMD MI300X only; NVIDIA Ampere
correctness needs empirical confirmation before promoting. Ships as
OFF-by-default scaffold gated by the GENESIS_ENABLE_PN26_SPARSE_V=1
sub-flag; actual kernel wiring is deferred to the next iteration, after
a correctness baseline.

**Dropped from #41414**: head_dim power-of-2 padding. Qwen3.6
head_dim=128 is already pow-2; the patch would add a runtime branch
(`needs_padding`) that is dead code on our model. Revisit if we ever
migrate to head_dim=80 (Phi-2) or a similar non-pow-2 model.

Status: opt-in via GENESIS_ENABLE_PN26_TQ_UNIFIED=1. Default OFF.
Composes with P67/P98/PN8 (orthogonal code paths). Sub-flag
GENESIS_ENABLE_PN26_SPARSE_V=1 reserved for future kernel wiring.
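
The drift self-check described above can be pictured roughly as follows. This is a minimal sketch, not the code in this commit: the names `max_abs_drift` and `prebake_is_trusted` are hypothetical, the drift metric (maximum absolute difference) is an assumption, and the behavior for drift between 1e-6 and 1e-3 is not stated in the message, so the sketch simply logs INFO for any nonzero drift up to 1e-3.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("pn26_sketch")

# Threshold quoted in the commit message: drift above this is treated as a
# real algorithm change rather than encoding round-off.
DRIFT_DISABLE = 1e-3


def max_abs_drift(prebaked: Sequence[float], solved: Sequence[float]) -> float:
    """Largest element-wise absolute difference (assumed drift metric)."""
    if len(prebaked) != len(solved):
        return float("inf")  # a length mismatch counts as unbounded drift
    return max((abs(p - s) for p, s in zip(prebaked, solved)), default=0.0)


def prebake_is_trusted(
    prebaked: Sequence[float],
    solve: Callable[[], Sequence[float]],
) -> bool:
    """Return True to keep the prebaked table, False to use the solver."""
    drift = max_abs_drift(prebaked, solve())
    if drift > DRIFT_DISABLE:
        logger.warning("prebake drift %g > %g; using runtime solver",
                       drift, DRIFT_DISABLE)
        return False
    if drift > 0.0:
        # Assumption: small nonzero drift is encoding round-off.
        logger.info("prebake drift %g within tolerance; keeping prebake", drift)
    return True
```

Running the solver once for a single shape keeps the check cheap relative to solving every shape, while still catching an upstream algorithm change.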

A2: P68/P69 long-context threshold default 8000 → 50000 chars
================================================================

Issue #9 (Sander 2026-04-XX) flagged that the 8000-char default (~2K
tokens) was too aggressive: it triggered P68 force-tool-choice and P69
explicit-format-reminder on routine IDE-agent flows that aren't
genuinely long-context. The new default of 50000 chars (~12.5K tokens)
keeps the behavior for genuine long histories while leaving casual flows
alone.

The code default is already 50000; this commit:
  - updates apply_all.py docstring (was stale "8000 chars ~= 2K tok")
  - updates 6 active launch scripts to override 8000→50000 explicitly
    (remaining _archive scripts left at 8000 for historical bench
    reproducibility)

PN9 self-retire confirmed correct
==================================

27B PROD boot showed 1 partial-apply warning (Cliff 8 hardening working
correctly): PN9 detected as obsolete via drift marker
'spec_cfg.attention_backend' present in llm_base_proposer.py. Manually
verified upstream PR #39930 is fully in our pin (7a1eb8ac2): both the
always-reset attention_backend logic AND the DFlashProposer subclass
override (`use_non_causal=True`). Upstream fix is a strict superset of
our partial backport. Self-retire is the correct action.

Bench validation
================

35B DFlash 160K boot: 44 applied, 43 skipped, 0 failed, 0 partial-apply
warnings. tool-call 5/7 (variance band).

27B TQ k8v4 + MTP K=3 boot: 54 applied, 33 skipped, 0 failed, 1
partial-apply warning (PN9 self-retire, verified correct above).
  - tool-call 7/7
  - prose 256t mean 88.39 TPS, CV 2.59%
  - code 512t mean 104.25 TPS, CV 0.20%

#41190 stress test (TP=2 + spec-decode + cudaErrorIllegalAddress)
==================================================================

Tested on 35B DFlash 160K (TP=2 + DFlash spec K=3) per noonghunna's
report: 5 concurrent + 30 sequential rapid-fire chat completions. Zero
`cudaError`, zero `illegal memory access`, zero `watchdog` events.

Our stack is NOT vulnerable. Differences:
  - They used QuantTrio AWQ (online-quant), we use FP8 (offline)
  - Their pin was built off PR #40898 head (WIP), our pin is on main
  - Possibly P58 (async scheduler placeholder) or P60 (GDN+ngram)
    defends against the codepath
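
The A2 threshold change in this commit message amounts to a character-count gate. Here is a minimal sketch under stated assumptions: only the environment variable name and the 50000 default come from the commit. The helper names, the way characters are counted (sum of string message contents), the `>=` comparison, and the 4 chars/token ratio (implied by "50000 chars (~12.5K tokens)") are illustrative guesses.

```python
import os
from typing import Mapping, Optional, Sequence

DEFAULT_THRESHOLD_CHARS = 50000  # new default per this commit (was 8000)
CHARS_PER_TOKEN = 4              # rough ratio implied by the commit message


def long_ctx_threshold(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read the threshold override, falling back to the default."""
    env = os.environ if environ is None else environ
    raw = env.get("GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS", "")
    try:
        return int(raw)
    except ValueError:
        return DEFAULT_THRESHOLD_CHARS


def is_long_context(
    messages: Sequence[Mapping[str, object]],
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical gate: total characters across string message contents."""
    total = sum(
        len(m["content"]) for m in messages
        if isinstance(m.get("content"), str)
    )
    return total >= long_ctx_threshold(environ)
```

With this sketch, a 9000-character history trips the gate at the old 8000 setting but not at the new default, which matches the "routine IDE-agent flows" complaint in Issue #9.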
1 parent 434c8ce commit d73fa9d

9 files changed

Lines changed: 453 additions & 7 deletions

scripts/launch/bare_metal_35b_fp8_PROD.sh

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ export GENESIS_P67_NUM_KV_SPLITS=32
  export GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1
  export GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1
  export GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1
- export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000
+ export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000
  export GENESIS_ENABLE_P37=1
  export GENESIS_TQ_MAX_MODEL_LEN=320000
  export GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1

scripts/launch/bare_metal_35b_fp8_PROD_single_card.sh

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ export GENESIS_P67_NUM_KV_SPLITS=32
  export GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1
  export GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1
  export GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1
- export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000
+ export GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000
  export GENESIS_ENABLE_P37=1
  export GENESIS_TQ_MAX_MODEL_LEN=320000
  export GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1

scripts/launch/start_35b_fp8_PROD.sh

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ docker run -d \
  -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
  -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
  -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
- -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+ -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
  -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
  -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \

scripts/launch/start_35b_fp8_PROD_single_card.sh

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ docker run -d \
  -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
  -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
  -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
- -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+ -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
  -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
  -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \

scripts/start_35b_fp8_DFLASH.sh

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ docker run -d \
  -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
  -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
  -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
- -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+ -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
  -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
  -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 -e GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \

scripts/start_35b_fp8_PROD.sh

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ docker run -d \
  -e GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1 -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 \
  -e GENESIS_P67_USE_UPSTREAM=1 -e GENESIS_P67_NUM_KV_SPLITS=32 \
  -e GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 -e GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 \
- -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=8000 \
+ -e GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 -e GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS=50000 \
  -e GENESIS_ENABLE_P37=1 -e GENESIS_TQ_MAX_MODEL_LEN=320000 \
  -e GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 -e GENESIS_PROFILE_RUN_CAP_M=4096 \
  -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 -e GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=0 -e GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=0 -e GENESIS_ENABLE_P79D_PREEMPT_ASYNC_DISCARD=0 -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 -e GENESIS_ENABLE_P82=1 -e GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 -e GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1 -e GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT=1 -e GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 -e GENESIS_ENABLE_P103=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 -e GENESIS_PREALLOC_TOKEN_BUDGET=4096 -e GENESIS_BUFFER_MODE=shared \

vllm/_genesis/dispatcher.py

Lines changed: 25 additions & 0 deletions
@@ -613,6 +613,31 @@ class ValidationIssue:
          "conflicts_with": [],
          "requires_patches": [],
      },
+     "PN26": {
+         "title": "TQ unified perf pack (centroids prebake + sparse V scaffold)",
+         "env_flag": "GENESIS_ENABLE_PN26_TQ_UNIFIED",
+         "default_on": False,
+         "category": "perf_hotfix",
+         "credit": (
+             "Genesis-original 2026-05-01 unification of three OPEN upstream "
+             "PRs (jasonkim8652): #41418 pre-baked Lloyd-Max centroids (drop-in "
+             "safe, eliminates 50ms-2.5s JIT solver per shape on cold boot); "
+             "#41422 sparse V tile-skip in decode kernel (scaffolded, OFF by "
+             "default until NVIDIA Ampere correctness validation — author "
+             "validated AMD MI300X only); #41414 head_dim pow-2 padding "
+             "DROPPED — Qwen3.6 head_dim=128 already pow-2, would add dead "
+             "code overhead. Genesis defensive addition: self-check at "
+             "module-init asserts prebaked centroids equal solver output; on "
+             "drift (e.g. upstream changes Lloyd-Max algo) auto-disables "
+             "prebake and falls through to runtime solver with WARNING. No "
+             "silent staleness. Composes with P67/P98/PN8 — orthogonal code "
+             "paths."
+         ),
+         "upstream_pr": 41418,
+         "applies_to": {},
+         "conflicts_with": [],
+         "requires_patches": [],
+     },
      "PN25": {
          "title": "SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path)",
          "env_flag": "GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE",
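
How a registry entry like the one in this hunk could resolve to an on/off decision is sketched below. The dispatcher's real lookup code is not part of this diff, so `is_patch_enabled` is a hypothetical helper. It assumes that an unset flag falls back to `default_on` and that only the literal value `1` enables a patch, which is consistent with the `=1` / `=0` settings in the launch scripts but is not confirmed by the source.

```python
from typing import Mapping


def is_patch_enabled(entry: Mapping[str, object],
                     environ: Mapping[str, str]) -> bool:
    """Hypothetical resolution of a registry entry against the environment."""
    raw = environ.get(str(entry["env_flag"]))
    if raw is None:
        # Flag not set: fall back to the entry's declared default.
        return bool(entry.get("default_on", False))
    # Assumption: "1" turns the patch on, anything else turns it off.
    return raw.strip() == "1"


# Fields copied from the PN26 registry entry in this commit.
pn26 = {
    "env_flag": "GENESIS_ENABLE_PN26_TQ_UNIFIED",
    "default_on": False,
}
```

Under these assumptions PN26 stays off on an unmodified launch script and turns on only when the operator exports `GENESIS_ENABLE_PN26_TQ_UNIFIED=1`.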

vllm/_genesis/patches/apply_all.py

Lines changed: 47 additions & 1 deletion
@@ -1100,7 +1100,9 @@ def apply_patch_68_69_long_ctx_tool_adherence() -> PatchResult:
      P69: append explicit format reminder to last user message
 
      Both env-flag opt-in. No-op when disabled. Threshold configurable via
-     GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS (default 8000 chars ~= 2K tok).
+     GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS (default 50000 chars ~= 12.5K
+     tok; raised from 8000 in v7.65 per Issue #9 — old default was too
+     aggressive and triggered on routine tool-call flows).
 
      Status:
      - GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage P68
@@ -2015,6 +2017,50 @@ def apply_patch_N12_ffn_intermediate_pool() -> PatchResult:
      return _failed(name, reason)
 
 
+ @register_patch(
+     "PN26 TQ unified perf pack (centroids prebake + sparse V scaffold)"
+ )
+ def apply_patch_N26_tq_unified_perf() -> PatchResult:
+     """Patch N26: unified backport of three OPEN upstream PRs touching the
+     TurboQuant code path (#41418 + #41422 + #41414).
+
+     Combines the strengths and drops the weaknesses:
+
+     - **From #41418** (centroids prebake): drop-in safe, eliminates
+       50ms-2.5s JIT solver run on the first request per (d, bits) shape.
+       Genesis defensive addition: at first use, asserts prebaked == solver
+       to catch drift if upstream Lloyd-Max algorithm changes; auto-falls
+       back to runtime solver on mismatch.
+
+     - **From #41422** (sparse V tile-skip): kernel modification to skip V
+       load + dequant on tiles where softmax probability max is below a
+       threshold. Author validated on AMD MI300X only — we ship as
+       OFF-by-default scaffold; sub-flag GENESIS_ENABLE_PN26_SPARSE_V=1
+       acknowledges operator opt-in but actual kernel wiring is deferred
+       to next iteration after NVIDIA Ampere correctness baseline.
+
+     - **DROPPED from #41414** (head_dim power-of-2 padding): Qwen3.6
+       head_dim=128 is already a power of 2; the patch would add a
+       runtime branch (`needs_padding`) that is dead code on our model.
+
+     Status: opt-in via GENESIS_ENABLE_PN26_TQ_UNIFIED=1. Default OFF.
+     Composes with P67/P98/PN8 — orthogonal code paths.
+     """
+     name = "PN26 TQ unified perf pack (centroids prebake + sparse V scaffold)"
+     if not _APPLY_MODE:
+         return _applied(name, "dry-run: text-patch ready")
+     try:
+         from vllm._genesis.wiring.perf_hotfix import patch_N26_tq_unified_perf
+     except Exception as e:
+         return _failed(name, f"wiring import failed: {e}")
+     status, reason = patch_N26_tq_unified_perf.apply()
+     if status == "applied":
+         return _applied(name, reason)
+     if status == "skipped":
+         return _skipped(name, reason)
+     return _failed(name, reason)
+
+
  @register_patch(
      "PN25 SiluAndMul.forward_native opaque-op pool "
      "(Cliff 1 mech B compile-path companion to PN12)"