v7.65: Cliff 8 hardening — partial_apply_warnings counter + PN19 H100-only flag

Sandermage · Sandermage · commit 434c8ced43aa · 2026-05-01T04:14:43.000+03:00
Cliff 8 hardening (apply_all.py):
- New PatchStats.partial_apply_warnings property surfaces skipped patches whose reason indicates real anchor drift / ambiguous-anchor / required-anchor-missing, as distinct from benign skips (opt-in OFF, upstream-merged, platform mismatch, deferred, redundant).
- Boot summary line now appends "N ⚠️ partial-apply warning(s)" when the count is non-zero, plus per-warning WARNING-level lines that name each patch + reason. The silent anchor-drift skip class noonghunna flagged in club-3090 discussion #19 is now impossible to miss in the boot output.
- Validated on live 35B DFlash 160K boot: 0 false positives after BENIGN list refinement (catches "opt-in:", "redundant:", "deferred", "upstream may have absorbed", "config: neutral" etc).

CLIFFS.md PN19 H100-only flag (Cliff 1 mech A section):
- noonghunna 2026-05-01 confirmed PN19 costs ~120 MiB KV pool on a 24 GB single-3090 (vs the documented 200-500 MiB win on H100). At 218K + 0.985 mem-util, engine init fails with KV cache available 3.4 GiB / required 3.52 GiB.
- Documented as: disable PN19 on 24 GB consumer cards (3090, 4090, A5000) running long context. Same lesson as P104 L2 persistence (regressed -16.2% on 32+ layer KV >> L2 setups). Generic allocator hints don't survive GPU class boundaries.

No regressions on live 35B DFlash 160K bench:
- 44 applied / 42 skipped / 0 failed / 0 partial-apply warnings
- prose 256t mean TPS 125.07, CV 3.07%
- tool-call 7/7 then 5/7 then 6/7 (variance noise band, no real regression)
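For readers who want to see the classification rule in isolation, here is a minimal sketch of the substring filter described above. It is not the code from apply_all.py: the `PatchResult` dataclass is a hypothetical stand-in (only the `name` and `reason` attributes appear in the diff), the `BENIGN` tuple is an abbreviated subset of the real list, and the patch names and reason strings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the real PatchResult. Only .name and .reason
# are taken from the diff; the rest of the shape is an assumption.
@dataclass
class PatchResult:
    name: str
    reason: Optional[str] = None


# Abbreviated subset of the BENIGN markers listed in the diff (all lowercase).
BENIGN = (
    "opt-in",
    "upstream may have absorbed",
    "platform mismatch",
    "redundant",
    "deferred",
)


def partial_apply_warnings(skipped):
    """Return skipped results whose reason contains no benign marker."""
    flagged = []
    for r in skipped:
        reason_lower = (r.reason or "").lower()
        if not any(marker in reason_lower for marker in BENIGN):
            flagged.append(r)
    return flagged


# Invented example skips: two benign, one drift-like, one with no reason.
skipped = [
    PatchResult("P_example_a", "opt-in only (env flag not set)"),
    PatchResult("P_example_b", "Redundant: upstream PR merged"),
    PatchResult("P_example_c", "anchor drift: expected pattern not found"),
    PatchResult("P_example_d", None),
]

flagged_names = [r.name for r in partial_apply_warnings(skipped)]
print(flagged_names)  # ['P_example_c', 'P_example_d']
```

Two properties of this rule are worth keeping in mind. A skip with an empty or missing reason matches no benign marker, so it is reported as a warning. And because the test is a plain substring match, any drift reason that happens to contain a benign marker (for example the word "deferred") would be classified as benign, so the wording of skip reasons matters.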
diff --git a/docs/CLIFFS.md b/docs/CLIFFS.md
@@ -22,10 +22,16 @@ You hit OOM earlier than you should on long-context workloads. On a 24 GB card r
 
 **PN17 — FA2 lse runtime clamp.** Genesis-original, 2026-04-30, in response to noonghunna Issue #11. Patches FA2 to use the actual `seq_lens.max()` at runtime instead of `max_model_len` during capture.
 
+**PN19 — scoped max-split cudagraph init (datacenter Ampere / Hopper / Blackwell only).** Genesis-original, 2026-04-30. Frees 200-500 MiB during model load on H100/B100. **Does NOT transfer cleanly to Ampere consumer:** noonghunna 2026-05-01 confirmed PN19 costs ~120 MiB KV pool on a 24 GB single-3090 (vs the documented 200-500 MiB win). At 218K context + 0.985 mem-util, engine init fails with `KV cache memory available 3.4 GiB, estimated maximum model length is 206400`. Different allocator behavior under PyTorch 2.10+ load-time fragmentation on consumer SKUs.
+
+> **Recommendation:** disable PN19 on 24 GB consumer cards (3090, 4090, A5000) running long context. Same lesson as P104 L2 persistence (regressed -16.2% on 32+ layer KV >> L2 setups). Generic allocator hints don't survive GPU class boundaries.
+
 **Refs**
 
 - `vllm/_genesis/wiring/perf_hotfix/patch_n17_fa2_softmax_lse_clamp.py`
+- `vllm/_genesis/wiring/perf_hotfix/patch_N19_scoped_max_split.py`
 - noonghunna Issue #11 (cross-engine derivative)
+- club-3090 Discussion #19 (PN19 ≠ H100 ergonomics report, 2026-05-01)
 
 ---
 
diff --git a/vllm/_genesis/patches/apply_all.py b/vllm/_genesis/patches/apply_all.py
@@ -87,23 +87,85 @@ def skipped_count(self) -> int:
     def failed_count(self) -> int:
         return len(self.failed)
 
+    @property
+    def partial_apply_warnings(self) -> list[PatchResult]:
+        """Skipped patches whose reason signals a real problem (drift,
+        ambiguous anchor, anchor-missing — NOT opt-in-OFF, upstream-merged,
+        or platform-mismatch which are all expected).
+
+        Surfaced separately from `skipped_count` so noonghunna's "silent
+        skip class" diagnosis (club-3090 discussion #19) is impossible to
+        miss in the boot summary. Cliff 8 hardening, v7.65.
+        """
+        # Reasons that indicate a benign/expected skip
+        BENIGN = (
+            "opt-in",   # matches "opt-in only", "opt-in:", "opt-in env"
+            "default off",
+            "upstream_merged",
+            "upstream_already",
+            "upstream_already_contains",
+            "upstream may have absorbed",
+            "upstream pr",  # "redundant: upstream PR ..."
+            "platform mismatch",
+            "platform_skip",
+            "config: opt-in",
+            "config: opt-out",
+            "config: skipped",
+            "config: neutral",
+            "already applied",
+            "marker present",
+            "soft_skip",
+            "no-op",
+            "dry-run",
+            "vllm install root not discoverable",
+            "target file not resolvable",
+            "is_pn",
+            "unsupported",
+            "not applicable",
+            "auto-disabled",
+            "auto-skip",
+            "deprecated",
+            "obsolete",
+            "redundant",
+            "deferred",
+            "incompatible with",  # P7 deferred reason
+        )
+        warnings = []
+        for r in self.skipped:
+            reason_lower = (r.reason or "").lower()
+            if not any(b.lower() in reason_lower for b in BENIGN):
+                warnings.append(r)
+        return warnings
+
+    @property
+    def partial_apply_warnings_count(self) -> int:
+        return len(self.partial_apply_warnings)
+
     def summary(self) -> dict[str, Any]:
         return {
             "applied": self.applied_count,
             "skipped": self.skipped_count,
             "failed": self.failed_count,
+            "partial_apply_warnings": self.partial_apply_warnings_count,
             "details": {
                 "applied": [(r.name, r.reason) for r in self.applied],
                 "skipped": [(r.name, r.reason) for r in self.skipped],
                 "failed": [(r.name, r.reason) for r in self.failed],
+                "partial_apply_warnings": [
+                    (r.name, r.reason) for r in self.partial_apply_warnings
+                ],
             },
         }
 
     def __str__(self) -> str:
-        return (
+        base = (
             f"Results: {self.applied_count} applied, "
             f"{self.skipped_count} skipped, {self.failed_count} failed"
         )
+        warns = self.partial_apply_warnings_count
+        if warns:
+            base += f", {warns} ⚠️ partial-apply warning(s)"
+        return base
 
 
 # ═══════════════════════════════════════════════════════════════════════════
@@ -3552,6 +3614,22 @@ def run(verbose: bool = True, apply: bool = False) -> PatchStats:
 
     log.info("Genesis %s", stats)
 
+    # [Genesis v7.65 / Cliff 8 hardening] Surface partial-apply warnings.
+    # Silent anchor-drift / ambiguous-anchor / anchor-missing skips were
+    # the class noonghunna flagged in club-3090 discussion #19. Drift
+    # detection works correctly, but the user-visible summary previously
+    # buried the signal in the same `skipped` count as opt-in OFF. Now
+    # warnings are pulled out and logged individually at WARNING level.
+    if stats.partial_apply_warnings:
+        log.warning(
+            "[Genesis] %d partial-apply warning(s) — patch(es) failed to "
+            "match expected source pattern. Review below to confirm anchor "
+            "drift vs upstream change vs config issue:",
+            stats.partial_apply_warnings_count,
+        )
+        for r in stats.partial_apply_warnings:
+            log.warning("[Genesis] ⚠️  %s — %s", r.name, r.reason)
+
     # [Genesis v7.13] Emit Dispatcher v2 apply matrix as a single readable
     # block. Only matters for patches that route through dispatcher.should_apply
     # (P56-P62 currently); other patches get only the per-line INFO above.