Commit f289e07
Sandermage
Two related fixes for the Cliff 1 mech B cascade reported by noonghunna
on Genesis-vllm-patches issues #14 and #15.
ISSUE #14 (P38 silent no-op on TQ KV path) — P38B fix
======================================================
**Problem (root cause)**: P38's class-attribute rebind of
`TurboQuantAttentionImpl._continuation_prefill` doesn't survive
`aot_compile_fullgraph` capture. The compiled forward graph references
the ORIGINAL method body at runtime; the rebind updates only the live
class dict. noonghunna's instrumentation confirmed this: the log line in
the Genesis replacement never fires even though the rebind reports
"applied".
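The failure mode can be illustrated with a plain-Python analogy. This is only a sketch: `capture` is a hypothetical stand-in for a compile step that resolves the method once, it does not invoke `aot_compile_fullgraph`, and the ordering of capture and rebind is simplified.

```python
class Impl:
    def _continuation_prefill(self):
        return "original"


def capture(cls):
    # Hypothetical stand-in for a compile step: resolve the method once
    # and keep a reference to that function object.
    fn = cls._continuation_prefill

    def compiled_forward(obj):
        return fn(obj)

    return compiled_forward


compiled_forward = capture(Impl)


def replacement(self):
    return "genesis"


# Class-attribute rebind: this updates the live class dict only.
Impl._continuation_prefill = replacement

live_result = Impl()._continuation_prefill()   # goes through the class dict
captured_result = compiled_forward(Impl())     # still uses the captured body
```

Here the live call sees the replacement while the captured callable keeps running the original body, which matches the symptom of a rebind that reports "applied" but never executes.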
**Fix**: text-patch `vllm/v1/attention/backends/turboquant_attn.py` to
inject a delegate hook at the start of the `_continuation_prefill` body.
The hook calls `type(self)._genesis_p38_dispatch` (a class attribute
set by Genesis after import), which returns either the Genesis result
or None to fall through to the original body. Because the edit is at
source level, aot_compile captures the hook as part of the compiled
artifact.
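A minimal sketch of the delegate-hook shape described above. Only the attribute name `_genesis_p38_dispatch` and the "result or None" contract come from this commit message; the class name, the `query` argument, and the dispatch policy are hypothetical placeholders.

```python
class TurboQuantAttentionImplSketch:
    # Class attribute set after import; None means no dispatcher is installed.
    _genesis_p38_dispatch = None

    def _continuation_prefill(self, query):
        # --- injected hook, placed at the start of the body ---
        dispatch = type(self)._genesis_p38_dispatch
        if dispatch is not None:
            result = dispatch(self, query)
            if result is not None:
                return result
        # --- original body continues below (placeholder) ---
        return ("original", query)


def genesis_dispatch(impl, query):
    # Hypothetical policy: handle only longer inputs, else fall through.
    if len(query) >= 4:
        return ("genesis", query)
    return None


# Install the dispatcher as a class attribute, as the fix describes.
TurboQuantAttentionImplSketch._genesis_p38_dispatch = genesis_dispatch

impl = TurboQuantAttentionImplSketch()
handled = impl._continuation_prefill([1, 2, 3, 4])
fell_through = impl._continuation_prefill([1])
```

Because the lookup happens inside the method body, anything that captures the body also captures the lookup, so the dispatcher can be swapped later without touching the method again.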
**Affected**: ALL TurboQuant KV users with V0/V1 compile pipeline.
fp8 KV configs unaffected (different code path).
ISSUE #15 (FA varlen workspace cliff) — P15B fix
=================================================
**Problem (root cause)**: PN17 clamps `max_seqlen_k` on the FA2 backend
path (`flash_attn.py`), but the TurboQuant code path bypasses PN17's
coverage by calling vllm_flash_attn's vendored wrapper via
`turboquant_attn.py:_flash_attn_varlen`. On long-context continuation
prefill the wrapper over-allocates a workspace sized by ~max_seqlen_k,
causing a 50 MiB OOM at tight VRAM (24 GB consumer cards, long-vision
140K + 0.95 mem-util).
**Fix**: text-patch the `_flash_attn_varlen` body to compute the actual
max key length from `cu_seqlens_k` and clamp `max_seqlen_k` to it before
invoking the FA wrapper. batch=1 fast path: single tensor element
access. batch>1: diff().max() reduction. Adds one GPU→CPU sync per call
on the infrequent continuation-prefill path.
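The clamp logic can be sketched on plain Python lists. This is an illustration, not the patched code: `clamp_max_seqlen_k` is a hypothetical helper, and it assumes `cu_seqlens_k` holds cumulative key lengths starting at 0. On real GPU tensors, reading the value back to the host is what introduces the sync mentioned above.

```python
def clamp_max_seqlen_k(cu_seqlens_k, max_seqlen_k):
    """Clamp max_seqlen_k to the longest key sequence actually present.

    cu_seqlens_k is assumed to be cumulative: [0, l0, l0 + l1, ...].
    """
    batch = len(cu_seqlens_k) - 1
    if batch == 1:
        # Fast path: with a leading 0, the last element is the only length.
        actual_max = cu_seqlens_k[-1]
    else:
        # General path: adjacent differences, then max (diff().max()).
        actual_max = max(
            end - start for start, end in zip(cu_seqlens_k, cu_seqlens_k[1:])
        )
    return min(max_seqlen_k, actual_max)


single = clamp_max_seqlen_k([0, 50_000], 140_000)
batched = clamp_max_seqlen_k([0, 10, 40, 45], 140_000)
already_small = clamp_max_seqlen_k([0, 10], 4)
```

The `min` keeps the clamp one-directional: a caller-supplied `max_seqlen_k` that is already tighter than the observed lengths is left alone.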
NEW PATCHES
===========
- `vllm/_genesis/wiring/perf_hotfix/patch_38b_compile_safe_hook.py`
- `vllm/_genesis/wiring/perf_hotfix/patch_15B_fa_varlen_clamp.py`
- Dispatcher entries: P38B + P15B (opt-in OFF default)
- apply_all.py register entries
- 27B PROD launch script enables both: GENESIS_ENABLE_P38B_COMPILE_SAFE=1
+ GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1
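A sketch of how an opt-in, OFF-by-default gate on these flags might look. The two environment variable names are from this commit; `patch_enabled` and the rule that only the exact string "1" enables a patch are assumptions for illustration.

```python
import os


def patch_enabled(flag_name, environ=None):
    # Hypothetical gate: unset or any value other than "1" means disabled.
    env = os.environ if environ is None else environ
    return env.get(flag_name, "0") == "1"


launch_env = {
    "GENESIS_ENABLE_P38B_COMPILE_SAFE": "1",
    "GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP": "1",
}

p38b_on = patch_enabled("GENESIS_ENABLE_P38B_COMPILE_SAFE", launch_env)
p15b_on = patch_enabled("GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP", launch_env)
default_off = patch_enabled("GENESIS_ENABLE_P38B_COMPILE_SAFE", {})
```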
VALIDATION (27B PROD, TQ k8v4 + MTP K=3, 2× A5000)
====================================================
Boot: P38B + P15B both APPLY cleanly (text-patch + dispatcher install
on TurboQuantAttentionImpl). No exceptions, no boot regressions.
Boot summary: PN26b + P38B + P15B + 50+ other patches all applied.
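For context, the text-patch step that boot reports as applied could be structured roughly as below. This is a sketch under stated assumptions: the marker string, the status names, and `inject_after_anchor` are hypothetical, and the real patcher may match and report differently.

```python
MARKER = "# [genesis-sketch] hook"  # hypothetical idempotency marker


def inject_after_anchor(source, anchor, injected):
    """Insert `injected` right after a unique `anchor`; safe to re-run."""
    if MARKER in source:
        return source, "already_applied"
    if source.count(anchor) != 1:
        # Refuse to patch when the anchor is missing or ambiguous.
        return source, "anchor_mismatch"
    return source.replace(anchor, anchor + injected, 1), "applied"


original = (
    "def _continuation_prefill(self, q):\n"
    "    return slow_path(q)\n"
)
anchor = "def _continuation_prefill(self, q):\n"
injected = (
    "    " + MARKER + "\n"
    "    r = type(self)._genesis_p38_dispatch(self, q)\n"
    "    if r is not None:\n"
    "        return r\n"
)

patched, first_status = inject_after_anchor(original, anchor, injected)
repatched, second_status = inject_after_anchor(patched, anchor, injected)
```

Checking for the marker first makes a second boot a no-op, and requiring exactly one anchor match avoids silently patching the wrong place if the upstream file changes.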
Sustained 50-req bench (mean/min/max/p99 in TPS):
| Config                  | mean      | min   | max    | p99    | tool-call | errors   |
|-------------------------|-----------|-------|--------|--------|-----------|----------|
| Baseline (no PN26b)     | 97.76     | 85.31 | 108.73 | 108.45 | 7/7       | 0/50     |
| PN26b only              | 98.91     | 85.06 | 110.61 | 110.18 | 7/7       | 0/50     |
| **PN26b + P38B + P15B** | **98.57** | 84.04 | 109.56 | 109.51 | **7/7**   | **0/50** |
Net: P38B + P15B add zero observable runtime overhead vs PN26b alone.
Tool-call quality preserved (7/7). Zero errors. Variance band ±1.5 TPS.
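The "no observable overhead" reading can be checked directly from the table means. The numbers below are copied from the table; the variable names are illustrative.

```python
pn26b_only_mean = 98.91      # TPS, "PN26b only" row
all_patches_mean = 98.57     # TPS, "PN26b + P38B + P15B" row
variance_band = 1.5          # TPS, stated variance band

delta = all_patches_mean - pn26b_only_mean        # about -0.34 TPS
relative_pct = 100.0 * delta / pn26b_only_mean    # about -0.34 %
within_band = abs(delta) <= variance_band
```

The difference in means is well inside the stated ±1.5 TPS band, so the two configurations are indistinguishable at this sample size.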
Cliff repro pending: noonghunna's failure repros require a ~50K-token
single-shot prefill on long-vision 140K + 0.95 (24 GB 3090). Our
35B PROD bench at 100-token output doesn't exercise the failure path.
The P15B trade-off (one sync per call) is statistically invisible at
this output length.
INDEPENDENT CONVERGENCE WITH NOONGHUNNA
========================================
noonghunna's `patch_pn12_compile_safe_custom_op.py` uses
`torch.library.custom_op` for the same problem class on PN12. Genesis
P38B uses in-source text-patch on `_continuation_prefill`. Both
mechanisms are valid for routing around aot_compile capture; we chose
text-patch for P38 specifically because `_continuation_prefill` has
many self-attribute deps and module-level imports that complicate the
functional-input contract that custom_op needs.
For PN25 (SiluAndMul.forward_native) we used custom_op. For P38B
(TurboQuant._continuation_prefill) we used text-patch. Same problem
class, mechanism choice depends on signature complexity.
Sources:
- Issue #14 (Genesis-vllm-patches)
- Issue #15 (Genesis-vllm-patches)
- noonghunna's PN12 reference impl:
https://github.com/noonghunna/club-3090/blob/master/models/qwen3.6-27b/vllm/patches/patch_pn12_compile_safe_custom_op.py
1 parent 5fe62b4 commit f289e07
5 files changed
Lines changed: 639 additions & 0 deletions