v7.65: bug fixes #14 (P38B) + #15 (P15B) — close noonghunna's cliff cascade

Sandermage · Sandermage · commit f289e07c5b90 · 2026-05-01T16:50:24.000+03:00
Two related fixes for the Cliff 1 mech B cascade reported by noonghunna on Genesis-vllm-patches issues #14 and #15.

ISSUE #14 (P38 silent no-op on TQ KV path): P38B fix
======================================================

**Problem (root cause)**: P38's class-attribute rebind of `TurboQuantAttentionImpl._continuation_prefill` doesn't survive `aot_compile_fullgraph` capture. The compiled forward graph references the ORIGINAL method body at runtime; rebind updates only the live class dict. noonghunna's instrumentation confirmed: log line in Genesis replacement never fires despite rebind reporting "applied".

**Fix**: text-patch `vllm/v1/attention/backends/turboquant_attn.py` to inject a delegate hook at the start of `_continuation_prefill` body. The hook calls `type(self)._genesis_p38_dispatch` (a class attribute set by Genesis after import) which returns Genesis result OR None to fall through. Source-level edit means aot_compile captures the hook as part of the compiled artifact.

**Affected**: ALL TurboQuant KV users with V0/V1 compile pipeline. fp8 KV configs unaffected (different code path).

ISSUE #15 (FA varlen workspace cliff): P15B fix
=================================================

**Problem (root cause)**: PN17 clamps `max_seqlen_k` on the FA2 backend path (`flash_attn.py`), but TurboQuant code path bypasses PN17's coverage by calling vllm_flash_attn's vendored wrapper via `turboquant_attn.py:_flash_attn_varlen`. On long-context continuation prefill the wrapper over-allocates ~max_seqlen_k-sized workspace, causing 50 MiB OOM at tight VRAM (24 GB consumer cards, long-vision 140K + 0.95 mem-util).

**Fix**: text-patch `_flash_attn_varlen` body to compute actual max from `cu_seqlens_k` and clamp `max_seqlen_k` before invoking the FA wrapper. batch=1 fast path: single tensor element access. batch>1: diff().max() reduction. Adds one GPU→CPU sync per call on infrequent continuation-prefill path.
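The P15B clamp rule can be sketched without a GPU. This is a minimal illustration, not the patch itself: plain Python ints stand in for the `cu_seqlens_k` tensor, and the function name `clamp_max_seqlen_k` is hypothetical. The batch=1 branch assumes the first boundary is 0, as in a cumulative-length array.

```python
from typing import Optional, Sequence


def clamp_max_seqlen_k(
    cu_seqlens_k: Optional[Sequence[int]], max_seqlen_k: int
) -> int:
    """Clamp max_seqlen_k to the longest actual key span.

    cu_seqlens_k holds cumulative sequence boundaries, e.g. [0, 5, 12]
    for two sequences of lengths 5 and 7. Plain ints are used here as a
    stand-in for the GPU tensor the real patch reads.
    """
    if cu_seqlens_k is None or len(cu_seqlens_k) < 2:
        return max_seqlen_k  # nothing to measure, keep the caller's value
    if len(cu_seqlens_k) == 2:
        # batch=1 fast path: the last boundary is the only sequence's length
        actual = int(cu_seqlens_k[-1])
    else:
        # batch>1: longest difference between adjacent boundaries
        actual = max(b - a for a, b in zip(cu_seqlens_k, cu_seqlens_k[1:]))
    if actual > 0:
        return min(max_seqlen_k, actual)
    return max_seqlen_k  # degenerate span, fall through unchanged


# A 50K-token continuation prefill with a bloated upper bound.
clamped_single = clamp_max_seqlen_k([0, 50_000], 327_680)
# Two sequences of lengths 5 and 7.
clamped_batch = clamp_max_seqlen_k([0, 5, 12], 100)
```

The clamp only ever lowers `max_seqlen_k`, so a caller that already passes a tight bound sees no change.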
NEW PATCHES
===========

- `vllm/_genesis/wiring/perf_hotfix/patch_38b_compile_safe_hook.py`
- `vllm/_genesis/wiring/perf_hotfix/patch_15B_fa_varlen_clamp.py`
- Dispatcher entries: P38B + P15B (opt-in OFF default)
- apply_all.py register entries
- 27B PROD launch script enables both: GENESIS_ENABLE_P38B_COMPILE_SAFE=1 + GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1

VALIDATION (27B PROD, TQ k8v4 + MTP K=3, 2× A5000)
====================================================

Boot: P38B + P15B both APPLY cleanly (text-patch + dispatcher install on TurboQuantAttentionImpl). No exceptions, no boot regressions. Boot summary: PN26b + P38B + P15B + 50+ other patches all applied.

Sustained 50-req bench:

| Config                  | mean      | min   | max    | p99    | tool-call | errors   |
|-------------------------|-----------|-------|--------|--------|-----------|----------|
| Baseline (no PN26b)     | 97.76     | 85.31 | 108.73 | 108.45 | 7/7       | 0/50     |
| PN26b only              | 98.91     | 85.06 | 110.61 | 110.18 | 7/7       | 0/50     |
| **PN26b + P38B + P15B** | **98.57** | 84.04 | 109.56 | 109.51 | **7/7**   | **0/50** |

Net: P38B + P15B add zero observable runtime overhead vs PN26b alone. Tool-call quality preserved (7/7). Zero errors. Variance band ±1.5 TPS.

Cliff repro pending: noonghunna's failure repros require ~50K-token single-shot prefill on long-vision 140K + 0.95 (24 GB 3090). Our 35B PROD bench at 100t output doesn't exercise the failure path. P15B trade-off (sync per call) is statistically invisible at this output length.

INDEPENDENT CONVERGENCE WITH NOONGHUNNA
========================================

noonghunna's `patch_pn12_compile_safe_custom_op.py` uses `torch.library.custom_op` for the same problem class on PN12. Genesis P38B uses in-source text-patch on `_continuation_prefill`. Both mechanisms are valid for routing around aot_compile capture; we chose text-patch for P38 specifically because `_continuation_prefill` has many self-attribute deps and module-level imports that complicate the functional-input contract that custom_op needs.
For PN25 (SiluAndMul.forward_native) we used custom_op. For P38B (TurboQuant._continuation_prefill) we used text-patch. Same problem class, mechanism choice depends on signature complexity.

Sources:

- Issue #14: #14
- Issue #15: #15
- noonghunna's PN12 reference impl: https://github.com/noonghunna/club-3090/blob/master/models/qwen3.6-27b/vllm/patches/patch_pn12_compile_safe_custom_op.py
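The rebind-versus-hook distinction behind P38B can be shown with a plain-Python analogy. This is only a sketch: holding a direct reference to a function object stands in for compile-time capture, which is an assumption for illustration and not how `aot_compile_fullgraph` works internally. The names `Impl`, `_dispatch`, `plain`, `hooked`, and `genesis_impl` are hypothetical.

```python
class Impl:
    """Hypothetical stand-in for TurboQuantAttentionImpl."""

    _dispatch = None  # stand-in for the class attribute _genesis_p38_dispatch

    def plain(self, x):
        # Original body with no hook.
        return ("original", x)

    def hooked(self, x):
        # Delegate hook at the top of the body; the lookup happens per call.
        dispatch = type(self)._dispatch
        if dispatch is not None:
            result = dispatch(self, x)
            if result is not None:
                return result
        # Dispatcher absent or returned None: fall through to original body.
        return ("original", x)


# Stand-in for compile-time capture: hold the function objects directly.
captured_plain = Impl.plain
captured_hooked = Impl.hooked


def genesis_impl(self, x):
    # Handles positive inputs; returns None to request fall-through.
    return ("genesis", x) if x > 0 else None


Impl.plain = genesis_impl      # P38-style rebind: updates the class dict only
Impl._dispatch = genesis_impl  # P38B-style: install the dispatcher attribute

obj = Impl()
rebind_result = captured_plain(obj, 1)        # still runs the original body
hook_result = captured_hooked(obj, 1)         # hook routes to genesis_impl
fallthrough_result = captured_hooked(obj, 0)  # dispatcher returned None
```

In this analogy the rebind is invisible to anything that captured the old function object, while the hook lives inside the captured body and so travels with it.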
diff --git a/scripts/start_27b_int4_TQ_k8v4.sh b/scripts/start_27b_int4_TQ_k8v4.sh
@@ -46,6 +46,7 @@ docker run -d \
   -e GENESIS_ENABLE_P74_CHUNK_CLAMP=1 \
   -e GENESIS_ENABLE_P82=0 -e GENESIS_ENABLE_P98=1 -e GENESIS_ENABLE_P99=1 -e GENESIS_P82_THRESHOLD_SINGLE=0.3 \
   -e GENESIS_ENABLE_PN26_SPARSE_V=1 -e GENESIS_PN26_SPARSE_V_THRESHOLD=0.01 -e GENESIS_PN26_SPARSE_V_BLOCK_KV=8 -e GENESIS_PN26_SPARSE_V_NUM_WARPS=4 \
+  -e GENESIS_ENABLE_P38B_COMPILE_SAFE=1 -e GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 \
   -e GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 -e GENESIS_ENABLE_P91=1 -e GENESIS_ENABLE_P87=1 -e GENESIS_ENABLE_P85=1 -e GENESIS_ENABLE_P83=1 -e GENESIS_ENABLE_P101=1 -e GENESIS_ENABLE_P100=1 \
   -e GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=0 \
   -e GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=0 \
diff --git a/vllm/_genesis/dispatcher.py b/vllm/_genesis/dispatcher.py
@@ -613,6 +613,51 @@ class ValidationIssue:
         "conflicts_with": [],
         "requires_patches": [],
     },
+    "P15B": {
+        "title": "FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix)",
+        "env_flag": "GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP",
+        "default_on": False,
+        "category": "perf_hotfix",
+        "credit": (
+            "Genesis-original 2026-05-01 fix for noonghunna's Issue #15. "
+            "PN17 clamps max_seqlen_k on the FA2 backend path, but TurboQuant "
+            "code path bypasses PN17's coverage by calling vllm_flash_attn's "
+            "vendored wrapper via turboquant_attn.py:_flash_attn_varlen. P15B "
+            "extends the same clamp logic to that callsite via text-patch — "
+            "computes actual span from cu_seqlens_k and clamps max_seqlen_k "
+            "before invocation. Prevents 50 MiB workspace OOM on long-context "
+            "continuation-prefill on tight VRAM (24 GB consumer cards). "
+            "Trade-off: adds one GPU->CPU sync per call on the infrequent "
+            "continuation-prefill path."
+        ),
+        "upstream_pr": None,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],
+    },
+    "P38B": {
+        "title": "P38 compile-safe in-source hook (Issue #14 fix)",
+        "env_flag": "GENESIS_ENABLE_P38B_COMPILE_SAFE",
+        "default_on": False,
+        "category": "perf_hotfix",
+        "credit": (
+            "Genesis-original 2026-05-01 fix for noonghunna's Issue #14. "
+            "Root cause: aot_compile_fullgraph captures _continuation_prefill "
+            "original body at engine init; Python class-attribute rebind "
+            "(P38's mechanism) doesn't propagate to compiled artifact. "
+            "P38B injects an in-source delegate hook at the start of "
+            "_continuation_prefill body via text-patch. Hook calls a "
+            "dispatcher that returns Genesis result OR None (fall-through). "
+            "Source-level edit means aot_compile captures the hook itself. "
+            "Affects ALL TQ KV users with V0/V1 compile pipeline; fp8 KV "
+            "configs unaffected (different code path). Composes with P38 "
+            "(both share _genesis_continuation_prefill impl)."
+        ),
+        "upstream_pr": None,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],  # P38 install order: P38 first (provides impl), P38B second (installs hook)
+    },
     "PN26b": {
         "title": "Sparse-V tile-skip Genesis kernel (BLASST λ=a/L for SM86)",
         "env_flag": "GENESIS_ENABLE_PN26_SPARSE_V",
diff --git a/vllm/_genesis/patches/apply_all.py b/vllm/_genesis/patches/apply_all.py
@@ -2017,6 +2017,81 @@ def apply_patch_N12_ffn_intermediate_pool() -> PatchResult:
     return _failed(name, reason)
 
 
+@register_patch(
+    "P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix)"
+)
+def apply_patch_15B_fa_varlen_clamp() -> PatchResult:
+    """Patch 15B: extend PN17-style clamp to TurboQuant FA varlen path.
+
+    Fixes Genesis Issue #15 (noonghunna 2026-05-01): PN17 doesn't reach
+    `turboquant_attn.py:_flash_attn_varlen` which calls vllm_flash_attn's
+    vendored wrapper. On long-context continuation prefill the wrapper
+    over-allocates ~max_seqlen_k-sized workspace, causing 50 MiB OOM at
+    tight VRAM (long-vision 140K + 0.95 mem-util on 24 GB 3090).
+
+    P15B inserts a clamp at the start of `_flash_attn_varlen` body that
+    computes actual max from cu_seqlens_k and reduces max_seqlen_k before
+    invocation. Adds one GPU->CPU sync per call on infrequent path.
+
+    Status: opt-in via GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1. Default OFF.
+    """
+    name = "P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix)"
+    if not _APPLY_MODE:
+        return _applied(name, "dry-run: text-patch ready")
+    try:
+        from vllm._genesis.wiring.perf_hotfix import patch_15B_fa_varlen_clamp
+    except Exception as e:
+        return _failed(name, f"wiring import failed: {e}")
+    status, reason = patch_15B_fa_varlen_clamp.apply()
+    if status == "applied":
+        return _applied(name, reason)
+    if status == "skipped":
+        return _skipped(name, reason)
+    return _failed(name, reason)
+
+
+@register_patch(
+    "P38B P38 compile-safe in-source hook (Issue #14 fix — aot_compile-safe)"
+)
+def apply_patch_38B_compile_safe_hook() -> PatchResult:
+    """Patch 38B: P38 compile-safe in-source hook.
+
+    Fixes Genesis Issue #14 (noonghunna 2026-05-01): P38's class-attribute
+    rebind of `_continuation_prefill` doesn't survive aot_compile_fullgraph
+    capture. Compiled forward graph references the ORIGINAL method body at
+    runtime. Affects ALL TQ KV users with V0/V1 compile pipeline.
+
+    P38B fix: text-patch the upstream `turboquant_attn.py` source to
+    insert an in-source delegate hook at the START of
+    `_continuation_prefill` body. The hook calls a dispatcher that returns
+    Genesis impl result OR None (fall-through to original body).
+
+    Source-level edit means aot_compile captures the hook itself, not just
+    the original body. Class attribute `_genesis_p38_dispatch` is set
+    after import, BEFORE the worker compiles forward — dispatcher is
+    available at compile time.
+
+    Composes with P38: both share `_genesis_continuation_prefill` impl.
+    P38 still rebinds for eager-mode callers; P38B handles compile-mode.
+
+    Status: opt-in via GENESIS_ENABLE_P38B_COMPILE_SAFE=1. Default OFF.
+    Recommended pairing: enable P38 + P38B + P37 together when on TQ KV.
+    """
+    name = "P38B P38 compile-safe in-source hook (Issue #14 fix)"
+    if not _APPLY_MODE:
+        return _applied(name, "dry-run: text-patch + dispatcher ready")
+    try:
+        from vllm._genesis.wiring.perf_hotfix import patch_38b_compile_safe_hook
+    except Exception as e:
+        return _failed(name, f"wiring import failed: {e}")
+    status, reason = patch_38b_compile_safe_hook.apply()
+    if status == "applied":
+        return _applied(name, reason)
+    if status == "skipped":
+        return _skipped(name, reason)
+    return _failed(name, reason)
+
+
 @register_patch(
     "PN26b sparse-V tile-skip Genesis kernel "
     "(BLASST λ=a/L for SM86, first NVIDIA Ampere implementation)"
diff --git a/vllm/_genesis/wiring/perf_hotfix/patch_15B_fa_varlen_clamp.py b/vllm/_genesis/wiring/perf_hotfix/patch_15B_fa_varlen_clamp.py
@@ -0,0 +1,204 @@
+# SPDX-License-Identifier: Apache-2.0
+"""Wiring for Patch 15B — extend PN17-style clamp to TQ FA varlen call.
+
+Fixes Genesis Issue #15 (noonghunna 2026-05-01):
+https://github.com/Sandermage/genesis-vllm-patches/issues/15
+
+================================================================
+PROBLEM (root cause)
+================================================================
+
+PN17 patches `vllm/v1/attention/backends/flash_attn.py` to clamp
+`max_seqlen_k` from cudagraph-capture-bloated `max_model_len` to actual
+runtime span. This prevents `softmax_lse` over-allocation in the FA2
+backend.
+
+But PN17's coverage doesn't reach the **TurboQuant code path**. When
+`_continuation_prefill` (TQ k8v4 with chunked prefill at long context)
+calls `self._flash_attn_varlen(...)` (turboquant_attn.py:394), the
+`max_seqlen_k` passed in is `seq_len` from the metadata — which on
+cudagraph-captured runtime can also be bloated to `max_model_len`.
+
+The trace from noonghunna's repro:
+```
+turboquant_attn.py:909 _continuation_prefill
+turboquant_attn.py:394 _flash_attn_varlen
+flash_attn_interface.py:300 flash_attn_varlen_func
+torch._ops:1269 → C extension allocates ~50 MiB workspace based on max_seqlen_k
+torch.OutOfMemoryError
+```
+
+================================================================
+FIX DESIGN
+================================================================
+
+Text-patch `turboquant_attn.py:_flash_attn_varlen` to clamp
+`max_seqlen_k` to the ACTUAL maximum sequence length, computed from
+`cu_seqlens_k`:
+
+- For batch=1 (continuation prefill case): `cu_seqlens_k[-1] == seq_len`
+  is the actual max, single tensor access.
+- For batch>1: `(cu_seqlens_k[1:] - cu_seqlens_k[:-1]).max()` gives the
+  actual max across batch elements. One reduction + sync.
+
+The clamp adds ONE GPU→CPU sync per `_flash_attn_varlen` call. On the
+continuation prefill path this is tolerable: each call already triggers
+synchronous FA kernel invocation, and the path itself is infrequent
+(once per chunked prefill rollover, not per decode token).
+
+PN17's design avoided sync via CPU-resident metadata, but on this
+path we don't have CPU-resident max_seq_len. The fallback sync is the
+pragmatic choice given the alternative is silent OOM.
+
+================================================================
+SAFETY MODEL
+================================================================
+
+- Default OFF (opt-in via `GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1`).
+- Idempotent (marker-checked).
+- Drift-aware: if upstream rewrites `_flash_attn_varlen` signature or
+  body, anchor misses → SKIPPED, source stays vanilla.
+- Try/except guard: if clamp computation raises (degenerate input),
+  falls through to original `max_seqlen_k`. No crash.
+- Sync added only ONCE per call (not per layer) — already-sync codepath.
+
+Composition:
+
+- P38 + P38B (Issue #14): orthogonal — they fix `_continuation_prefill`
+  upstream alloc; P15B fixes the FA varlen wrapper alloc downstream.
+- PN17: orthogonal — different file, different code path. Together they
+  cover both FA backends.
+
+Author: Sandermage(Sander) Barzov Aleksandr, Ukraine, Odessa
+Origin: noonghunna Issue #15 — direct fix per their suggestion path 1
+"""
+from __future__ import annotations
+
+import logging
+
+from vllm._genesis.guards import resolve_vllm_file, vllm_install_root
+from vllm._genesis.wiring.text_patch import (
+    TextPatch,
+    TextPatcher,
+    TextPatchResult,
+    result_to_wiring_status,
+)
+
+log = logging.getLogger("genesis.wiring.p15B_fa_varlen_clamp")
+
+GENESIS_P15B_MARKER = "Genesis P15B FA varlen max_seqlen_k clamp (Issue #15) v7.65"
+
+
+# Anchor: the function signature. We insert clamp logic right at the
+# top of the body, before any other logic.
+P15B_ANCHOR = (
+    "    def _flash_attn_varlen(\n"
+    "        self,\n"
+    "        q: torch.Tensor,\n"
+    "        k: torch.Tensor,\n"
+    "        v: torch.Tensor,\n"
+    "        cu_seqlens_q: torch.Tensor,\n"
+    "        cu_seqlens_k: torch.Tensor,\n"
+    "        max_seqlen_q: int,\n"
+    "        max_seqlen_k: int,\n"
+    "    ) -> torch.Tensor:\n"
+    "        # fa_utils.get_flash_attn_version() returns None on backends that\n"
+)
+
+P15B_REPLACEMENT = (
+    "    def _flash_attn_varlen(\n"
+    "        self,\n"
+    "        q: torch.Tensor,\n"
+    "        k: torch.Tensor,\n"
+    "        v: torch.Tensor,\n"
+    "        cu_seqlens_q: torch.Tensor,\n"
+    "        cu_seqlens_k: torch.Tensor,\n"
+    "        max_seqlen_q: int,\n"
+    "        max_seqlen_k: int,\n"
+    "    ) -> torch.Tensor:\n"
+    "        # [Genesis P15B Issue #15 fix] Clamp max_seqlen_k to actual span.\n"
+    "        # On cudagraph-captured runtime, max_seqlen_k may equal\n"
+    "        # max_model_len (320K+) even though actual span is smaller —\n"
+    "        # FA wrapper's C extension over-allocates ~max_seqlen_k-sized\n"
+    "        # workspace. Clamp adds one GPU->CPU sync per call but the call\n"
+    "        # is on infrequent continuation-prefill path; sync cost amortizes.\n"
+    "        if cu_seqlens_k is not None and cu_seqlens_k.numel() >= 2:\n"
+    "            try:\n"
+    "                if cu_seqlens_k.shape[0] == 2:\n"
+    "                    _genesis_p15b_actual = int(cu_seqlens_k[-1].item())\n"
+    "                else:\n"
+    "                    _genesis_p15b_actual = int(\n"
+    "                        (cu_seqlens_k[1:] - cu_seqlens_k[:-1]).max().item()\n"
+    "                    )\n"
+    "                if _genesis_p15b_actual > 0:\n"
+    "                    max_seqlen_k = min(max_seqlen_k, _genesis_p15b_actual)\n"
+    "            except Exception:\n"
+    "                pass  # fall through with original value\n"
+    "        # fa_utils.get_flash_attn_version() returns None on backends that\n"
+)
+
+
+def _make_patcher() -> TextPatcher | None:
+    target = resolve_vllm_file("v1/attention/backends/turboquant_attn.py")
+    if target is None:
+        return None
+    return TextPatcher(
+        patch_name=(
+            "P15B turboquant_attn.py — _flash_attn_varlen max_seqlen_k clamp "
+            "(Issue #15 fix)"
+        ),
+        target_file=str(target),
+        marker=GENESIS_P15B_MARKER,
+        sub_patches=[
+            TextPatch(
+                name="p15b_fa_varlen_clamp",
+                anchor=P15B_ANCHOR,
+                replacement=P15B_REPLACEMENT,
+                required=True,
+            ),
+        ],
+        upstream_drift_markers=[
+            "[Genesis P15B",
+            "_genesis_p15b_actual",
+        ],
+    )
+
+
+def apply() -> tuple[str, str]:
+    """Apply P15B — FA varlen max_seqlen_k clamp."""
+    from vllm._genesis.dispatcher import log_decision, should_apply
+
+    decision, reason = should_apply("P15B")
+    log_decision("P15B", decision, reason)
+    if not decision:
+        return "skipped", reason
+
+    if vllm_install_root() is None:
+        return "skipped", "vllm install root not discoverable"
+
+    patcher = _make_patcher()
+    if patcher is None:
+        return "skipped", "turboquant_attn.py not resolvable"
+
+    result, failure = patcher.apply()
+    return result_to_wiring_status(
+        result, failure,
+        applied_message=(
+            "P15B applied: _flash_attn_varlen now clamps max_seqlen_k to "
+            "actual cu_seqlens_k span. Prevents 50 MiB FA wrapper workspace "
+            "OOM on long-context continuation-prefill (Issue #15). Adds "
+            "one GPU->CPU sync per call on infrequent path."
+        ),
+        patch_name=patcher.patch_name,
+    )
+
+
+def is_applied() -> bool:
+    target = resolve_vllm_file("v1/attention/backends/turboquant_attn.py")
+    if target is None:
+        return False
+    try:
+        with open(str(target), encoding="utf-8") as f:
+            return GENESIS_P15B_MARKER in f.read()
+    except OSError:
+        return False
diff --git a/vllm/_genesis/wiring/perf_hotfix/patch_38b_compile_safe_hook.py b/vllm/_genesis/wiring/perf_hotfix/patch_38b_compile_safe_hook.py