
Commit f289e07

author
Sandermage
committed
v7.65: bug fixes #14 (P38B) + #15 (P15B) — close noonghunna's cliff cascade
Two related fixes for the Cliff 1 mech B cascade reported by noonghunna in Genesis-vllm-patches issues #14 and #15.

ISSUE #14 (P38 silent no-op on TQ KV path): P38B fix
=====================================================

**Problem (root cause)**: P38's class-attribute rebind of `TurboQuantAttentionImpl._continuation_prefill` doesn't survive `aot_compile_fullgraph` capture. The compiled forward graph references the ORIGINAL method body at runtime; the rebind updates only the live class dict. noonghunna's instrumentation confirmed this: the log line in the Genesis replacement never fires even though the rebind reports "applied".

**Fix**: text-patch `vllm/v1/attention/backends/turboquant_attn.py` to inject a delegate hook at the start of the `_continuation_prefill` body. The hook calls `type(self)._genesis_p38_dispatch` (a class attribute set by Genesis after import), which returns the Genesis result OR None to fall through. Because the edit is at source level, aot_compile captures the hook as part of the compiled artifact.

**Affected**: ALL TurboQuant KV users with the V0/V1 compile pipeline. fp8 KV configs are unaffected (different code path).

ISSUE #15 (FA varlen workspace cliff): P15B fix
===============================================

**Problem (root cause)**: PN17 clamps `max_seqlen_k` on the FA2 backend path (`flash_attn.py`), but the TurboQuant code path bypasses PN17's coverage by calling vllm_flash_attn's vendored wrapper via `turboquant_attn.py:_flash_attn_varlen`. On long-context continuation prefill the wrapper over-allocates a workspace sized by ~max_seqlen_k, causing a 50 MiB OOM at tight VRAM (24 GB consumer cards, long-vision 140K + 0.95 mem-util).

**Fix**: text-patch the `_flash_attn_varlen` body to compute the actual max from `cu_seqlens_k` and clamp `max_seqlen_k` before invoking the FA wrapper. batch=1 fast path: single tensor element access. batch>1: diff().max() reduction. Adds one GPU→CPU sync per call on the infrequent continuation-prefill path.
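
The clamp arithmetic in the P15B fix can be sketched in plain Python, with a list standing in for the `cu_seqlens_k` tensor. This is an illustration only: `clamp_max_seqlen_k` is a hypothetical helper name, the input values are made up, and the real patch operates on a GPU tensor (hence the sync it mentions).

```python
def clamp_max_seqlen_k(cu_seqlens_k, max_seqlen_k):
    """Clamp max_seqlen_k to the longest sequence actually present.

    cu_seqlens_k is a cumulative-length list: [0, len0, len0 + len1, ...].
    Hypothetical helper; a list stands in for the tensor used by the patch.
    """
    if cu_seqlens_k is None or len(cu_seqlens_k) < 2:
        return max_seqlen_k  # nothing to derive a span from
    if len(cu_seqlens_k) == 2:
        # batch=1 fast path: the last entry is the sequence length
        # (assumes the first entry is 0, as the patch does).
        actual = int(cu_seqlens_k[-1])
    else:
        # batch>1: longest per-sequence span = max of adjacent differences.
        actual = max(b - a for a, b in zip(cu_seqlens_k, cu_seqlens_k[1:]))
    if actual > 0:
        return min(max_seqlen_k, actual)
    return max_seqlen_k  # degenerate input: keep the caller's value


# Illustrative values: a bloated max_seqlen_k of 320K (327,680 tokens).
single = clamp_max_seqlen_k([0, 50_000], 327_680)
batched = clamp_max_seqlen_k([0, 1_000, 4_500, 6_000], 327_680)
print(single, batched)
```

The clamp only ever lowers `max_seqlen_k`; a value already below the real span, or an input with no positive span, passes through unchanged.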

NEW PATCHES
===========

- `vllm/_genesis/wiring/perf_hotfix/patch_38b_compile_safe_hook.py`
- `vllm/_genesis/wiring/perf_hotfix/patch_15B_fa_varlen_clamp.py`
- Dispatcher entries: P38B + P15B (opt-in OFF default)
- apply_all.py register entries
- 27B PROD launch script enables both: GENESIS_ENABLE_P38B_COMPILE_SAFE=1 + GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1

VALIDATION (27B PROD, TQ k8v4 + MTP K=3, 2× A5000)
====================================================

Boot: P38B + P15B both APPLY cleanly (text-patch + dispatcher install on TurboQuantAttentionImpl). No exceptions, no boot regressions. Boot summary: PN26b + P38B + P15B + 50+ other patches all applied.

Sustained 50-req bench:

| Config                  | mean      | min   | max    | p99    | tool-call | errors   |
|-------------------------|-----------|-------|--------|--------|-----------|----------|
| Baseline (no PN26b)     | 97.76     | 85.31 | 108.73 | 108.45 | 7/7       | 0/50     |
| PN26b only              | 98.91     | 85.06 | 110.61 | 110.18 | 7/7       | 0/50     |
| **PN26b + P38B + P15B** | **98.57** | 84.04 | 109.56 | 109.51 | **7/7**   | **0/50** |

Net: P38B + P15B add zero observable runtime overhead vs PN26b alone. Tool-call quality preserved (7/7). Zero errors. Variance band ±1.5 TPS.

Cliff repro pending: noonghunna's failure repros require ~50K-token single-shot prefill on long-vision 140K + 0.95 (24 GB 3090). Our 35B PROD bench at 100t output doesn't exercise the failure path. P15B trade-off (sync per call) is statistically invisible at this output length.

INDEPENDENT CONVERGENCE WITH NOONGHUNNA
========================================

noonghunna's `patch_pn12_compile_safe_custom_op.py` uses `torch.library.custom_op` for the same problem class on PN12. Genesis P38B uses in-source text-patch on `_continuation_prefill`. Both mechanisms are valid for routing around aot_compile capture; we chose text-patch for P38 specifically because `_continuation_prefill` has many self-attribute deps and module-level imports that complicate the functional-input contract that custom_op needs.

For PN25 (SiluAndMul.forward_native) we used custom_op. For P38B (TurboQuant._continuation_prefill) we used text-patch. Same problem class, mechanism choice depends on signature complexity.

Sources:
- Issue #14: #14
- Issue #15: #15
- noonghunna's PN12 reference impl: https://github.com/noonghunna/club-3090/blob/master/models/qwen3.6-27b/vllm/patches/patch_pn12_compile_safe_custom_op.py
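
The difference between P38's class-attribute rebind and P38B's in-source hook can be illustrated with a plain-Python analogy. This is a rough sketch, not the patch: holding a reference to the function object stands in for compile-time capture (it does not reproduce real `aot_compile` behavior), and `Impl`, `_dispatch`, `plain` and `hooked` are hypothetical names. How the real dispatcher is called (for example, whether it receives `self`) is an assumption here.

```python
class Impl:
    # Hypothetical stand-in for the _genesis_p38_dispatch class attribute.
    _dispatch = None

    def plain(self, x):
        # Method with no hook: only a rebind can change its behavior.
        return f"original:{x}"

    def hooked(self, x):
        # In-source hook: look up the dispatcher on the class at call time.
        dispatch = type(self)._dispatch
        if dispatch is not None:
            result = dispatch(self, x)
            if result is not None:
                return result  # replacement handled the call
        return f"original:{x}"  # None means fall through to the original body


# Simulate early capture: keep references to the original function objects.
captured_plain = Impl.plain
captured_hooked = Impl.hooked

# Later, the replacement is installed on the class.
Impl.plain = lambda self, x: f"genesis:{x}"
Impl._dispatch = lambda self, x: f"genesis:{x}" if x > 0 else None

obj = Impl()
rebind_seen = captured_plain(obj, 1)     # captured body ignores the rebind
hook_seen = captured_hooked(obj, 1)      # hook finds the dispatcher
fall_through = captured_hooked(obj, 0)   # dispatcher declines, original runs
print(rebind_seen, hook_seen, fall_through)
```

The captured `plain` never sees the rebind, while the captured `hooked` picks up the dispatcher because the lookup is part of its own body. That is the property the commit message relies on when it says the hook is captured "as part of the compiled artifact".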
1 parent 5fe62b4 commit f289e07

5 files changed

Lines changed: 639 additions & 0 deletions

File tree

scripts/start_27b_int4_TQ_k8v4.sh

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ docker run -d \
   -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 \
   -e GENESIS_ENABLE_P82=0 -e GENESIS_ENABLE_P98=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 \
   -e GENESIS_ENABLE_PN26_SPARSE_V=1 -e GENESIS_PN26_SPARSE_V_THRESHOLD=0.01 -e GENESIS_PN26_SPARSE_V_BLOCK_KV=8 -e GENESIS_PN26_SPARSE_V_NUM_WARPS=4 \
+  -e GENESIS_ENABLE_P38B_COMPILE_SAFE=1 -e GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 \
   -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 -e GENESIS_ENABLE_P91=1 -e GENESIS_ENABLE_P87=1 -e GENESIS_ENABLE_P85=1 -e GENESIS_ENABLE_P83=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_ENABLE_P100=1 \
   -e GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=0 \
   -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=0 \

vllm/_genesis/dispatcher.py

Lines changed: 45 additions & 0 deletions
@@ -613,6 +613,51 @@ class ValidationIssue:
         "conflicts_with": [],
         "requires_patches": [],
     },
+    "P15B": {
+        "title": "FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix)",
+        "env_flag": "GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP",
+        "default_on": False,
+        "category": "perf_hotfix",
+        "credit": (
+            "Genesis-original 2026-05-01 fix for noonghunna's Issue #15. "
+            "PN17 clamps max_seqlen_k on the FA2 backend path, but TurboQuant "
+            "code path bypasses PN17's coverage by calling vllm_flash_attn's "
+            "vendored wrapper via turboquant_attn.py:_flash_attn_varlen. P15B "
+            "extends the same clamp logic to that callsite via text-patch — "
+            "computes actual span from cu_seqlens_k and clamps max_seqlen_k "
+            "before invocation. Prevents 50 MiB workspace OOM on long-context "
+            "continuation-prefill on tight VRAM (24 GB consumer cards). "
+            "Trade-off: adds one GPU->CPU sync per call on the infrequent "
+            "continuation-prefill path."
+        ),
+        "upstream_pr": None,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],
+    },
+    "P38B": {
+        "title": "P38 compile-safe in-source hook (Issue #14 fix)",
+        "env_flag": "GENESIS_ENABLE_P38B_COMPILE_SAFE",
+        "default_on": False,
+        "category": "perf_hotfix",
+        "credit": (
+            "Genesis-original 2026-05-01 fix for noonghunna's Issue #14. "
+            "Root cause: aot_compile_fullgraph captures _continuation_prefill "
+            "original body at engine init; Python class-attribute rebind "
+            "(P38's mechanism) doesn't propagate to compiled artifact. "
+            "P38B injects an in-source delegate hook at the start of "
+            "_continuation_prefill body via text-patch. Hook calls a "
+            "dispatcher that returns Genesis result OR None (fall-through). "
+            "Source-level edit means aot_compile captures the hook itself. "
+            "Affects ALL TQ KV users with V0/V1 compile pipeline; fp8 KV "
+            "configs unaffected (different code path). Composes with P38 "
+            "(both share _genesis_continuation_prefill impl)."
+        ),
+        "upstream_pr": None,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [], # P38 install order: P38 first (provides impl), P38B second (installs hook)
+    },
     "PN26b": {
         "title": "Sparse-V tile-skip Genesis kernel (BLASST λ=a/L for SM86)",
         "env_flag": "GENESIS_ENABLE_PN26_SPARSE_V",

vllm/_genesis/patches/apply_all.py

Lines changed: 75 additions & 0 deletions
@@ -2017,6 +2017,81 @@ def apply_patch_N12_ffn_intermediate_pool() -> PatchResult:
     return _failed(name, reason)
 
 
+@register_patch(
+    "P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix)"
+)
+def apply_patch_15B_fa_varlen_clamp() -> PatchResult:
+    """Patch 15B: extend PN17-style clamp to TurboQuant FA varlen path.
+
+    Fixes Genesis Issue #15 (noonghunna 2026-05-01): PN17 doesn't reach
+    `turboquant_attn.py:_flash_attn_varlen` which calls vllm_flash_attn's
+    vendored wrapper. On long-context continuation prefill the wrapper
+    over-allocates ~max_seqlen_k-sized workspace, causing 50 MiB OOM at
+    tight VRAM (long-vision 140K + 0.95 mem-util on 24 GB 3090).
+
+    P15B inserts a clamp at the start of `_flash_attn_varlen` body that
+    computes actual max from cu_seqlens_k and reduces max_seqlen_k before
+    invocation. Adds one GPU->CPU sync per call on infrequent path.
+
+    Status: opt-in via GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1. Default OFF.
+    """
+    name = "P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix)"
+    if not _APPLY_MODE:
+        return _applied(name, "dry-run: text-patch ready")
+    try:
+        from vllm._genesis.wiring.perf_hotfix import patch_15B_fa_varlen_clamp
+    except Exception as e:
+        return _failed(name, f"wiring import failed: {e}")
+    status, reason = patch_15B_fa_varlen_clamp.apply()
+    if status == "applied":
+        return _applied(name, reason)
+    if status == "skipped":
+        return _skipped(name, reason)
+    return _failed(name, reason)
+
+
+@register_patch(
+    "P38B P38 compile-safe in-source hook (Issue #14 fix — aot_compile-safe)"
+)
+def apply_patch_38B_compile_safe_hook() -> PatchResult:
+    """Patch 38B: P38 compile-safe in-source hook.
+
+    Fixes Genesis Issue #14 (noonghunna 2026-05-01): P38's class-attribute
+    rebind of `_continuation_prefill` doesn't survive aot_compile_fullgraph
+    capture. Compiled forward graph references the ORIGINAL method body at
+    runtime. Affects ALL TQ KV users with V0/V1 compile pipeline.
+
+    P38B fix: text-patch the upstream `turboquant_attn.py` source to
+    insert an in-source delegate hook at the START of
+    `_continuation_prefill` body. The hook calls a dispatcher that returns
+    Genesis impl result OR None (fall-through to original body).
+
+    Source-level edit means aot_compile captures the hook itself, not just
+    the original body. Class attribute `_genesis_p38_dispatch` is set
+    after import, BEFORE the worker compiles forward — dispatcher is
+    available at compile time.
+
+    Composes with P38: both share `_genesis_continuation_prefill` impl.
+    P38 still rebinds for eager-mode callers; P38B handles compile-mode.
+
+    Status: opt-in via GENESIS_ENABLE_P38B_COMPILE_SAFE=1. Default OFF.
+    Recommended pairing: enable P38 + P38B + P37 together when on TQ KV.
+    """
+    name = "P38B P38 compile-safe in-source hook (Issue #14 fix)"
+    if not _APPLY_MODE:
+        return _applied(name, "dry-run: text-patch + dispatcher ready")
+    try:
+        from vllm._genesis.wiring.perf_hotfix import patch_38b_compile_safe_hook
+    except Exception as e:
+        return _failed(name, f"wiring import failed: {e}")
+    status, reason = patch_38b_compile_safe_hook.apply()
+    if status == "applied":
+        return _applied(name, reason)
+    if status == "skipped":
+        return _skipped(name, reason)
+    return _failed(name, reason)
+
+
 @register_patch(
     "PN26b sparse-V tile-skip Genesis kernel "
     "(BLASST λ=a/L for SM86, first NVIDIA Ampere implementation)"
vllm/_genesis/wiring/perf_hotfix/patch_15B_fa_varlen_clamp.py

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
+# SPDX-License-Identifier: Apache-2.0
+"""Wiring for Patch 15B — extend PN17-style clamp to TQ FA varlen call.
+
+Fixes Genesis Issue #15 (noonghunna 2026-05-01):
+https://github.com/Sandermage/genesis-vllm-patches/issues/15
+
+================================================================
+PROBLEM (root cause)
+================================================================
+
+PN17 patches `vllm/v1/attention/backends/flash_attn.py` to clamp
+`max_seqlen_k` from cudagraph-capture-bloated `max_model_len` to actual
+runtime span. This prevents `softmax_lse` over-allocation in the FA2
+backend.
+
+But PN17's coverage doesn't reach the **TurboQuant code path**. When
+`_continuation_prefill` (TQ k8v4 with chunked prefill at long context)
+calls `self._flash_attn_varlen(...)` (turboquant_attn.py:394), the
+`max_seqlen_k` passed in is `seq_len` from the metadata — which on
+cudagraph-captured runtime can also be bloated to `max_model_len`.
+
+The trace from noonghunna's repro:
+```
+turboquant_attn.py:909 _continuation_prefill
+turboquant_attn.py:394 _flash_attn_varlen
+flash_attn_interface.py:300 flash_attn_varlen_func
+torch._ops:1269 → C extension allocates ~50 MiB workspace based on max_seqlen_k
+torch.OutOfMemoryError
+```
+
+================================================================
+FIX DESIGN
+================================================================
+
+Text-patch `turboquant_attn.py:_flash_attn_varlen` to clamp
+`max_seqlen_k` to the ACTUAL maximum sequence length, computed from
+`cu_seqlens_k`:
+
+- For batch=1 (continuation prefill case): `cu_seqlens_k[-1] == seq_len`
+  is the actual max, single tensor access.
+- For batch>1: `(cu_seqlens_k[1:] - cu_seqlens_k[:-1]).max()` gives the
+  actual max across batch elements. One reduction + sync.
+
+The clamp adds ONE GPU→CPU sync per `_flash_attn_varlen` call. On the
+continuation prefill path this is tolerable: each call already triggers
+synchronous FA kernel invocation, and the path itself is infrequent
+(once per chunked prefill rollover, not per decode token).
+
+PN17's design avoided sync via CPU-resident metadata, but on this
+path we don't have CPU-resident max_seq_len. The fallback sync is the
+pragmatic choice given the alternative is silent OOM.
+
+================================================================
+SAFETY MODEL
+================================================================
+
+- Default OFF (opt-in via `GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1`).
+- Idempotent (marker-checked).
+- Drift-aware: if upstream rewrites `_flash_attn_varlen` signature or
+  body, anchor misses → SKIPPED, source stays vanilla.
+- Try/except guard: if clamp computation raises (degenerate input),
+  falls through to original `max_seqlen_k`. No crash.
+- Sync added only ONCE per call (not per layer) — already-sync codepath.
+
+Composition:
+
+- P38 + P38B (Issue #14): orthogonal — they fix `_continuation_prefill`
+  upstream alloc; P15B fixes the FA varlen wrapper alloc downstream.
+- PN17: orthogonal — different file, different code path. Together they
+  cover both FA backends.
+
+Author: Sandermage(Sander) Barzov Aleksandr, Ukraine, Odessa
+Origin: noonghunna Issue #15 — direct fix per their suggestion path 1
+"""
+from __future__ import annotations
+
+import logging
+
+from vllm._genesis.guards import resolve_vllm_file, vllm_install_root
+from vllm._genesis.wiring.text_patch import (
+    TextPatch,
+    TextPatcher,
+    TextPatchResult,
+    result_to_wiring_status,
+)
+
+log = logging.getLogger("genesis.wiring.p15B_fa_varlen_clamp")
+
+GENESIS_P15B_MARKER = "Genesis P15B FA varlen max_seqlen_k clamp (Issue #15) v7.65"
+
+
+# Anchor: the function signature. We insert clamp logic right at the
+# top of the body, before any other logic.
+P15B_ANCHOR = (
+    "    def _flash_attn_varlen(\n"
+    "        self,\n"
+    "        q: torch.Tensor,\n"
+    "        k: torch.Tensor,\n"
+    "        v: torch.Tensor,\n"
+    "        cu_seqlens_q: torch.Tensor,\n"
+    "        cu_seqlens_k: torch.Tensor,\n"
+    "        max_seqlen_q: int,\n"
+    "        max_seqlen_k: int,\n"
+    "    ) -> torch.Tensor:\n"
+    "        # fa_utils.get_flash_attn_version() returns None on backends that\n"
+)
+
+P15B_REPLACEMENT = (
+    "    def _flash_attn_varlen(\n"
+    "        self,\n"
+    "        q: torch.Tensor,\n"
+    "        k: torch.Tensor,\n"
+    "        v: torch.Tensor,\n"
+    "        cu_seqlens_q: torch.Tensor,\n"
+    "        cu_seqlens_k: torch.Tensor,\n"
+    "        max_seqlen_q: int,\n"
+    "        max_seqlen_k: int,\n"
+    "    ) -> torch.Tensor:\n"
+    "        # [Genesis P15B Issue #15 fix] Clamp max_seqlen_k to actual span.\n"
+    "        # On cudagraph-captured runtime, max_seqlen_k may equal\n"
+    "        # max_model_len (320K+) even though actual span is smaller —\n"
+    "        # FA wrapper's C extension over-allocates ~max_seqlen_k-sized\n"
+    "        # workspace. Clamp adds one GPU->CPU sync per call but the call\n"
+    "        # is on infrequent continuation-prefill path; sync cost amortizes.\n"
+    "        if cu_seqlens_k is not None and cu_seqlens_k.numel() >= 2:\n"
+    "            try:\n"
+    "                if cu_seqlens_k.shape[0] == 2:\n"
+    "                    _genesis_p15b_actual = int(cu_seqlens_k[-1].item())\n"
+    "                else:\n"
+    "                    _genesis_p15b_actual = int(\n"
+    "                        (cu_seqlens_k[1:] - cu_seqlens_k[:-1]).max().item()\n"
+    "                    )\n"
+    "                if _genesis_p15b_actual > 0:\n"
+    "                    max_seqlen_k = min(max_seqlen_k, _genesis_p15b_actual)\n"
+    "            except Exception:\n"
+    "                pass # fall through with original value\n"
+    "        # fa_utils.get_flash_attn_version() returns None on backends that\n"
+)
+
+
+def _make_patcher() -> TextPatcher | None:
+    target = resolve_vllm_file("v1/attention/backends/turboquant_attn.py")
+    if target is None:
+        return None
+    return TextPatcher(
+        patch_name=(
+            "P15B turboquant_attn.py — _flash_attn_varlen max_seqlen_k clamp "
+            "(Issue #15 fix)"
+        ),
+        target_file=str(target),
+        marker=GENESIS_P15B_MARKER,
+        sub_patches=[
+            TextPatch(
+                name="p15b_fa_varlen_clamp",
+                anchor=P15B_ANCHOR,
+                replacement=P15B_REPLACEMENT,
+                required=True,
+            ),
+        ],
+        upstream_drift_markers=[
+            "[Genesis P15B",
+            "_genesis_p15b_actual",
+        ],
+    )
+
+
+def apply() -> tuple[str, str]:
+    """Apply P15B — FA varlen max_seqlen_k clamp."""
+    from vllm._genesis.dispatcher import log_decision, should_apply
+
+    decision, reason = should_apply("P15B")
+    log_decision("P15B", decision, reason)
+    if not decision:
+        return "skipped", reason
+
+    if vllm_install_root() is None:
+        return "skipped", "vllm install root not discoverable"
+
+    patcher = _make_patcher()
+    if patcher is None:
+        return "skipped", "turboquant_attn.py not resolvable"
+
+    result, failure = patcher.apply()
+    return result_to_wiring_status(
+        result, failure,
+        applied_message=(
+            "P15B applied: _flash_attn_varlen now clamps max_seqlen_k to "
+            "actual cu_seqlens_k span. Prevents 50 MiB FA wrapper workspace "
+            "OOM on long-context continuation-prefill (Issue #15). Adds "
+            "one GPU->CPU sync per call on infrequent path."
+        ),
+        patch_name=patcher.patch_name,
+    )
+
+
+def is_applied() -> bool:
+    target = resolve_vllm_file("v1/attention/backends/turboquant_attn.py")
+    if target is None:
+        return False
+    try:
+        with open(str(target)) as f:
+            return GENESIS_P15B_MARKER in f.read()
+    except OSError:
+        return False
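
The "idempotent (marker-checked)" and "drift-aware" properties listed in the safety model can be sketched with a small stand-alone function. This is a minimal illustration under stated assumptions, not the `TextPatcher` used above, whose internals are not shown in this commit. `apply_text_patch`, the way the marker is written into the source, and the demo strings are all hypothetical.

```python
def apply_text_patch(source, anchor, replacement, marker):
    """Return (status, new_source) for a single anchor replacement.

    Hypothetical helper: the marker is prepended as a comment so that a
    second run can detect the patch and skip it.
    """
    if marker in source:
        return "skipped", source  # already patched: idempotent
    if source.count(anchor) != 1:
        # Anchor missing or ambiguous (upstream drift): leave source vanilla.
        return "skipped", source
    patched = source.replace(anchor, replacement, 1)
    return "applied", f"# {marker}\n{patched}"


# Demo inputs (made up for illustration).
source = "def f(x):\n    return x\n"
anchor = "def f(x):\n"
replacement = "def f(x):\n    x = min(x, 10)\n"
marker = "demo-marker v1"

status1, once = apply_text_patch(source, anchor, replacement, marker)
status2, twice = apply_text_patch(once, anchor, replacement, marker)
status3, drifted = apply_text_patch(
    "def g(x):\n    return x\n", anchor, replacement, marker
)
print(status1, status2, status3)
```

Requiring exactly one anchor match is a design choice of this sketch: it refuses to patch when the anchor appears zero or several times, which keeps a drifted upstream file untouched rather than partially edited.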
