Commit d73fa9d
Sandermage
v7.65: PN26 unified TQ perf pack + A2 P68/P69 threshold default
PN26 — unified TurboQuant perf backport
========================================
Combines three OPEN upstream PRs from jasonkim8652 (#41418 / #41422 /
#41414) into a single Genesis-original opt-in patch that takes the
strengths and drops the weaknesses:
**Taken from #41418** — pre-baked Lloyd-Max centroid tables for the
3 (d, bits) shapes our PROD actually uses: (128, 4) (k=4 turboquant_4bit_nc),
(128, 8) (k=8 turboquant_k8v4 — most expensive solver, 4.6s on cold boot),
(128, 3) (k=3 turboquant_3bit_nc).
Empirical on live container after warmup:
- (128, 8): 0.018ms vs 4583.9ms solver = 259,812x speedup
- (128, 4): 0.0037ms vs 287.9ms solver = 77,600x speedup
- (128, 3): 0.005ms vs solver = drop-in win
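A minimal sketch of the lookup-with-fallback pattern described above, assuming the tables are stored as integers scaled by 1e10 (the encoding this message mentions). All names here (`encode_table`, `decode_table`, `stand_in_solver`, `get_centroids`) are hypothetical, and the stand-in solver returns a uniform grid rather than real Lloyd-Max centroids:

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, int]  # (d, bits)
SCALE = 10**10  # int/1e10 encoding, as mentioned in the commit message


def encode_table(centroids: List[float]) -> List[int]:
    """Store centroids as integers scaled by 1e10."""
    return [round(c * SCALE) for c in centroids]


def decode_table(encoded: List[int]) -> List[float]:
    """Recover float centroids from the scaled-integer form."""
    return [v / SCALE for v in encoded]


def stand_in_solver(d: int, bits: int) -> List[float]:
    """Placeholder for the real solver: a uniform grid on (-1, 1).

    This is NOT Lloyd-Max; it only gives the sketch something to cache.
    """
    n = 2**bits
    return [-1.0 + (2.0 * i + 1.0) / n for i in range(n)]


def get_centroids(
    shape: Shape,
    prebaked: Dict[Shape, List[int]],
    solver: Callable[[int, int], List[float]],
) -> List[float]:
    """Return the prebaked table if one exists, else run the solver."""
    encoded = prebaked.get(shape)
    if encoded is not None:
        return decode_table(encoded)
    return solver(*shape)
```

Shapes without a prebaked entry simply pay the solver cost, so adding a table for a new (d, bits) pair is purely an optimization.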
**Genesis defensive addition vs upstream**: at first use, runs a
self-check that asserts prebaked == solver for (128, 4). On drift
> 1e-3 (a real algorithm change in upstream Lloyd-Max), it auto-disables
prebake and falls through to the runtime solver with a WARNING. On
drift around 1e-6 (rounding noise from the int/1e10 encoding), it logs
INFO and keeps prebake. The 1e-3 threshold guards against silent
staleness without false positives from encoding rounding.
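The self-check logic can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: `max_drift` and `prebake_is_trusted` are hypothetical names, drift is taken to be the maximum absolute element-wise difference, and a length mismatch is treated as infinite drift.

```python
import logging
from typing import Sequence

logger = logging.getLogger("pn26_selfcheck")  # hypothetical logger name

DISABLE_THRESHOLD = 1e-3  # threshold stated in the commit message


def max_drift(prebaked: Sequence[float], solved: Sequence[float]) -> float:
    """Largest absolute element-wise difference (assumed drift metric)."""
    if len(prebaked) != len(solved):
        return float("inf")
    return max((abs(p - s) for p, s in zip(prebaked, solved)), default=0.0)


def prebake_is_trusted(prebaked: Sequence[float], solved: Sequence[float]) -> bool:
    """Return False when the prebaked table should be disabled."""
    drift = max_drift(prebaked, solved)
    if drift > DISABLE_THRESHOLD:
        logger.warning(
            "prebaked table drift %.3g > %.0e; using runtime solver",
            drift,
            DISABLE_THRESHOLD,
        )
        return False
    if drift > 0.0:
        logger.info("prebaked table drift %.3g within tolerance", drift)
    return True
```

Running the check once at first use keeps the solver cost off the hot path while still catching an upstream change to the solver.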
**Taken from #41422 (scaffold-only)** — sparse V tile-skip kernel
modification. Author validated on AMD MI300X only — NVIDIA Ampere
correctness needs empirical confirmation before promoting. Ships as
OFF-by-default scaffold gated by GENESIS_ENABLE_PN26_SPARSE_V=1
sub-flag; actual kernel wiring deferred to next iteration after
correctness baseline.
**Dropped from #41414** — head_dim power-of-2 padding. Qwen3.6
head_dim=128 is already pow-2; the patch would add a runtime branch
(`needs_padding`) that is dead code on our model. Revisit if we ever
migrate to head_dim=80 (Phi-2) or similar non-pow-2 model.
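To illustrate why the padding branch is dead code for head_dim=128, here is a small sketch of a power-of-2 check. The helper names are hypothetical and this is not the code from #41414:

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def padded_head_dim(head_dim: int) -> int:
    """Round head_dim up to the next power of two."""
    if head_dim <= 0:
        raise ValueError("head_dim must be positive")
    return 1 << (head_dim - 1).bit_length()


def needs_padding(head_dim: int) -> bool:
    """Padding is only required for non-power-of-2 head dimensions."""
    return not is_power_of_two(head_dim)
```

With head_dim=128 `needs_padding` is always false, while head_dim=80 would be padded up to 128.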
Status: opt-in via GENESIS_ENABLE_PN26_TQ_UNIFIED=1. Default OFF.
Composes with P67/P98/PN8 — orthogonal code paths. Sub-flag
GENESIS_ENABLE_PN26_SPARSE_V=1 reserved for future kernel wiring.
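A hedged sketch of how the two flags might be read. The environment variable names come from this message; `flag_enabled` and `pn26_features` are hypothetical, and the rule that the sub-flag only takes effect when the parent flag is on is an assumption:

```python
import os
from typing import Dict, Mapping


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Treat only the exact value "1" as enabled; unset means OFF."""
    return env.get(name, "0") == "1"


def pn26_features(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Resolve which PN26 pieces are active for this process."""
    unified = flag_enabled("GENESIS_ENABLE_PN26_TQ_UNIFIED", env)
    # Assumption: the sparse-V scaffold is only considered when PN26 is on.
    sparse_v = unified and flag_enabled("GENESIS_ENABLE_PN26_SPARSE_V", env)
    return {"prebake": unified, "sparse_v": sparse_v}
```

Passing `env` explicitly makes the gating testable without mutating the real process environment.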
A2 — P68/P69 long-context threshold default 8000 → 50000 chars
================================================================
Issue #9 (Sander 2026-04-XX) flagged that the 8000-char default
(~2K tokens) was too aggressive — triggered P68 force-tool-choice and
P69 explicit-format-reminder on routine IDE-agent flows that aren't
genuinely long-context. New default 50000 chars (~12.5K tokens) keeps
the behavior for genuine long histories while leaving casual flows
alone. Code default already at 50000; this commit:
- updates apply_all.py docstring (was stale "8000 chars ~= 2K tok")
- updates 6 active launch scripts to override 8000→50000 explicitly
(remaining _archive scripts left at 8000 for historical bench
reproducibility)
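The character-to-token arithmetic and the threshold gate can be sketched as below. The 4 chars/token ratio is the one implied by "8000 chars ~= 2K tok"; the function names, the string-only message contents, and the `>=` comparison are assumptions for illustration:

```python
from typing import Iterable, Mapping, Optional

CHARS_PER_TOKEN = 4  # rough ratio implied by "8000 chars ~= 2K tok"
DEFAULT_THRESHOLD_CHARS = 50_000  # new default from this commit


def approx_tokens(chars: int) -> int:
    """Rough token estimate from a character count."""
    return chars // CHARS_PER_TOKEN


def total_chars(messages: Iterable[Mapping[str, Optional[str]]]) -> int:
    """Sum the length of every message's text content (None counts as 0)."""
    return sum(len(m.get("content") or "") for m in messages)


def is_long_context(
    messages: Iterable[Mapping[str, Optional[str]]],
    threshold_chars: int = DEFAULT_THRESHOLD_CHARS,
) -> bool:
    """True when the history is long enough to trigger P68/P69."""
    return total_chars(messages) >= threshold_chars
```

A 10,000-character history would have triggered the old 8000 default but stays below the new 50000 one.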
PN9 self-retire confirmed correct
==================================
27B PROD boot showed 1 partial-apply warning (Cliff 8 hardening
working correctly): PN9 detected as obsolete via drift marker
'spec_cfg.attention_backend' present in llm_base_proposer.py.
Manually verified upstream PR #39930 is fully in our pin
(7a1eb8ac2) — both the always-reset attention_backend logic AND the
DFlashProposer subclass override (`use_non_causal=True`). Upstream
fix is a strict superset of our partial backport. Self-retire is the
correct action.
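As an illustration of the self-retire idea, a patch can look for a drift marker in the target source before applying. The function and parameter names below are hypothetical, as are the three outcome labels; only the marker-present-means-obsolete rule comes from this message:

```python
def patch_action(source_text: str, anchor: str, drift_marker: str) -> str:
    """Decide what to do with a text patch against upstream source.

    - "retire": the drift marker is present, so upstream already has the fix.
    - "apply": the marker is absent and the anchor to patch is present.
    - "fail": neither was found, so the file changed in an unknown way.
    """
    if drift_marker in source_text:
        return "retire"
    if anchor in source_text:
        return "apply"
    return "fail"
```

Checking the marker before the anchor means a patch retires itself even when the old anchor text happens to survive in the upstream file.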
Bench validation
================
35B DFlash 160K boot: 44 applied, 43 skipped, 0 failed, 0 partial-
apply warnings. tool-call 5/7 (variance band).
27B TQ k8v4 + MTP K=3 boot: 54 applied, 33 skipped, 0 failed, 1
partial-apply warning (PN9 self-retire — verified correct above).
- tool-call 7/7
- prose 256t mean 88.39 TPS, CV 2.59%
- code 512t mean 104.25 TPS, CV 0.20%
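For reference, mean TPS and CV figures like the ones above can be computed from per-run throughput samples as follows. This is a sketch: `mean_and_cv` is a hypothetical helper, and using the sample (n-1) standard deviation is an assumption about how the bench computes CV:

```python
import statistics
from typing import Sequence, Tuple


def mean_and_cv(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, coefficient of variation in percent).

    Needs at least two samples, since it uses the sample standard deviation.
    """
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    return mean, 100.0 * stdev / mean
```

A low CV means the runs were tightly clustered, so a small TPS difference between two configurations is less likely to be noise.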
#41190 stress test (TP=2 + spec-decode + cudaErrorIllegalAddress)
==================================================================
Tested on 35B DFlash 160K (TP=2 + DFlash spec K=3) per noonghunna's
report: 5 concurrent + 30 sequential rapid-fire chat completions.
Zero `cudaError`, zero `illegal memory access`, zero `watchdog`
events. Our stack did not reproduce the crash. Differences from the report:
- They used QuantTrio AWQ (online-quant), we use FP8 (offline)
- Their pin built off PR #40898 head (WIP), our pin on main
- Possibly P58 (async scheduler placeholder) or P60 (GDN+ngram)
defends against the codepath
1 parent 434c8ce commit d73fa9d
9 files changed
Lines changed: 453 additions & 7 deletions
File tree
- scripts
- launch
- vllm/_genesis
- patches
- wiring/perf_hotfix