hybrid: add PN25 — Inductor-safe silu_and_mul pool (sister-patch to PN12)

Sandermage · Sandermage · commit 2b239a88f33d · 2026-05-01T03:50:47.000+03:00
PN12 patches the eager-mode SiluAndMul.forward_cuda; under V1 default
custom_ops=["none"], dispatch routes through forward_native which
torch.compile/Inductor inlines and lowers to empty_strided_cuda(...) —
completely bypassing PN12's pool. Reported by noonghunna in club-3090#16
(VolandBerlioz Reddit + ampersandru cross-rig confirmation): RTX 3090
24 GB + Lorbus 27B + OpenCode 29K-token prefill OOMs at
inductor_cache/...py:1208 allocating (s18, 17408) fp16 = 137.6 MiB.
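
For scale, the size of that transient can be checked by hand. This is a
minimal sketch, not taken from the trace: it assumes the symbolic dim s18
equals the max_num_batched_tokens=4128 value quoted in the new module's
docstring, and 2 bytes per fp16 element. The helper name `silu_out_mib`
is made up for illustration.

```python
def silu_out_mib(num_tokens: int, intermediate_size: int,
                 bytes_per_elem: int = 2) -> float:
    """Size in MiB of a dense (num_tokens, intermediate_size) buffer."""
    return num_tokens * intermediate_size * bytes_per_elem / (1024 ** 2)


# Assumption: s18 == 4128; intermediate_size 17408 comes from the report.
size = silu_out_mib(4128, 17408)
print(f"{size:.2f} MiB")
```

Under that assumption the buffer is roughly 137 MiB, in the same range as
the reported 137.6 MiB (the exact figure depends on the runtime value of
s18) and already above the 131.75 MiB reported free.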

Genesis stack inherits the same flaw — our PN12 only patches forward_cuda.
We don't see it in PROD only because our 27B Lorbus configs short-circuit
the compile path on this kernel. Future Inductor-default configs would
expose the leak.

PN25 design (complement, not replacement of PN12):
- New kernel module vllm/_genesis/kernels/silu_and_mul_customop.py
  registers `genesis::silu_and_mul_pooled` via torch.library.custom_op
  with mutates_args=() and device_types=("cuda",). Inductor treats
  custom ops as opaque — emits a call, never inlines. Inside the body
  we acquire from FFNIntermediateCache.acquire_silu_out (same pool used
  by PN12) and dispatch to torch.ops._C.silu_and_mul.
- New wiring patch_N25_silu_inductor_safe_pool.py text-patches
  SiluAndMul.forward_native to dispatch through the opaque op. Falls
  back to vanilla F.silu * mul math when registration is unavailable
  (torch < 2.4 or CPU-only build), i.e. soft degradation.
- Dispatcher entry declares conflicts_with=[], requires_patches=[].
  PN12 + PN25 patch DIFFERENT methods (forward_cuda vs forward_native)
  so they can coexist on the same file without anchor collision.
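
The soft-degradation dispatch described in the list above can be sketched
in plain Python. This is an illustration only: the real wiring operates on
torch tensors inside SiluAndMul, and the `forward_native` signature and the
`pooled_op` parameter below are hypothetical stand-ins for the patched
method and the result of `get_op_callable()`.

```python
import math
from typing import Callable, List, Optional

Row = List[float]


def silu_and_mul_reference(row: Row) -> Row:
    """Reference math on one flat row: silu(first half) * second half."""
    d = len(row) // 2
    gate, up = row[:d], row[d:2 * d]
    # silu(g) = g * sigmoid(g) = g / (1 + exp(-g))
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


def forward_native(row: Row,
                   pooled_op: Optional[Callable[[Row], Row]] = None) -> Row:
    """Hypothetical wiring: prefer the opaque op, else fall back to math."""
    if pooled_op is not None:
        return pooled_op(row)           # opaque-op path (pool-backed)
    return silu_and_mul_reference(row)  # soft degradation, same result


# With no op registered (pooled_op=None) the plain math runs.
print(forward_native([0.0, 1.0, 2.0, 3.0]))
```

Either branch must return the same values; only where the output buffer
comes from differs.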

Bug fix in PN12 marker logic:
- Removed "FFNIntermediateCache" from upstream_drift_markers — this is
  our own internal pool class name and may legitimately appear in
  sister-patches (PN25 docstring references it). Drift markers should
  signal upstream variants, not Genesis-internal symbols. Without this
  fix, applying PN25 to a vanilla file would correctly patch
  forward_native, but a subsequent PN12 attempt on the same file would
  see "FFNIntermediateCache" via PN25's docstring and skip with
  upstream_merged — silently bypassing forward_cuda pool wiring.
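
The skip can be reproduced with a toy pre-check. This is a sketch under
stated assumptions: `decide`, the "already_applied" reason, and the sample
file text are hypothetical and do not mirror the real TextPatcher API; only
the marker strings and the "applied" / "skipped" / upstream_merged
vocabulary come from this commit.

```python
from typing import List, Tuple


def decide(file_text: str, own_marker: str,
           drift_markers: List[str]) -> Tuple[str, str]:
    """Hypothetical patcher pre-check: idempotency first, then drift."""
    if own_marker in file_text:
        return ("skipped", "already_applied")
    for marker in drift_markers:
        if marker in file_text:
            return ("skipped", "upstream_merged")
    return ("applied", "ok")


# File state after PN25 ran: its docstring mentions the pool class.
after_pn25 = '"""Genesis-PN25: output comes from FFNIntermediateCache."""\n'

old_markers = ["torch.ops._C.silu_and_mul.out", "FFNIntermediateCache"]
new_markers = ["torch.ops._C.silu_and_mul.out"]

print(decide(after_pn25, "Genesis-PN12", old_markers))  # old marker list
print(decide(after_pn25, "Genesis-PN12", new_markers))  # fixed marker list
```

With the old list PN12 is skipped as upstream_merged even though
forward_cuda is untouched; with the fixed list it applies.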

Validation (clean container test):
- register: True
- op_callable: genesis.silu_and_mul_pooled
- eager call: shape (2, 8) bfloat16, finite=True
- torch.compile call: shape (2, 8), finite=True (fake impl exercised
  for shape inference)
- live container PN12+PN25 sequential apply: both "applied" status,
  both markers in activation.py.

PN25 ships opt-in OFF (GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1) and
is NOT enabled in any launch script. Users with inductor-heavy
configs (e.g. 35B FP8 long-context, future MoE) can pair it with PN12
for full FFN intermediate pooling coverage.

Cross-reference: noonghunna's club-3090#16 work-in-progress on the
same flaw. Independent convergence on the torch.library.custom_op
approach.
diff --git a/vllm/_genesis/dispatcher.py b/vllm/_genesis/dispatcher.py
@@ -613,6 +613,30 @@ class ValidationIssue:
         "conflicts_with": [],
         "requires_patches": [],
     },
+    "PN25": {
+        "title": "SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path)",
+        "env_flag": "GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE",
+        "default_on": False,
+        "category": "memory_savings",
+        "credit": (
+            "Genesis-original 2026-05-01 in response to noonghunna's "
+            "club-3090#16 (VolandBerlioz/ampersandru cross-rig OOM trace, "
+            "RTX 3090 24 GB + Lorbus 27B + OpenCode 29K prefill). PN12 "
+            "patches eager `forward_cuda` but `custom_ops=['none']` (default "
+            "under V1 aot_compile_fullgraph) routes dispatch through "
+            "`forward_native` which Inductor inlines and lowers to "
+            "`empty_strided_cuda(...)`, bypassing PN12's pool. "
+            "Sister-patch PN25 patches `forward_native` to dispatch through "
+            "an opaque `genesis::silu_and_mul_pooled` torch.library.custom_op "
+            "(Inductor cannot inline opaque ops). Both patches share the "
+            "same FFNIntermediateCache pool. Recommended pairing for any "
+            "inductor-heavy config."
+        ),
+        "upstream_pr": None,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],  # complements PN12 but does not require it
+    },
     "PN17": {
         "title": "FA2 softmax_lse runtime clamp (Cliff 1 mechanism A, Issue #11)",
         "env_flag": "GENESIS_ENABLE_PN17_FA2_LSE_CLAMP",
diff --git a/vllm/_genesis/kernels/silu_and_mul_customop.py b/vllm/_genesis/kernels/silu_and_mul_customop.py
@@ -0,0 +1,228 @@
+# SPDX-License-Identifier: Apache-2.0
+"""PN25 — `silu_and_mul` as torch.library.custom_op (Inductor-safe pool).
+
+Problem (continuation of PN12)
+------------------------------
+PN12 text-patches `SiluAndMul.forward_cuda` to acquire its
+`[M, intermediate_size]` BF16/FP16 transient from `FFNIntermediateCache`
+instead of `torch.empty()`. That works in eager mode.
+
+Reported by noonghunna 2026-04-30 in club-3090#16, with confirmation
+from VolandBerlioz on a real OpenCode workload (29K sys+tools prefill,
+24 GB single 3090): when `custom_ops=["none"]` is the default (which is
+typical under V1 `aot_compile_fullgraph`), `SiluAndMul.__call__`
+dispatches to `forward_native`, NOT `forward_cuda`. `forward_native`
+is
+
+    @staticmethod
+    def forward_native(x):
+        d = x.shape[-1] // 2
+        return F.silu(x[..., :d]) * x[..., d:]
+
+torch.compile's Inductor traces this body into a fused kernel and
+issues its own `empty_strided_cuda((s18, intermediate_size), fp16)`
+for the multiplication output. PN12's pool never gets a chance —
+the patched `forward_cuda` method is never reached.
+
+Symptom on a 24 GB 3090 + Lorbus 27B (intermediate=17408) at
+`max_num_batched_tokens=4128`: 137.6 MiB allocation, 131.75 MiB free,
+OOM. Cliff 1 mech B fires on real workloads while our verify-stress
+25K synthetic happens to hit shapes that DO reach eager forward_cuda
+and pass.
+
+Genesis stack vulnerability
+---------------------------
+Same architectural flaw exists in our PN12 — we only patch
+`forward_cuda`. Our 27B PROD configs avoid the inductor path because
+`--cudagraph-mode=PIECEWISE` + offline-quant INT4 short-circuits the
+compile pipeline on this kernel. But long-context + chunked prefill
+under a future Inductor-default config could hit it.
+
+Fix design (this module)
+------------------------
+Register `genesis::silu_and_mul_pooled` as `torch.library.custom_op`
+with `device_types=("cuda",)`. Inductor treats custom ops as opaque
+nodes — emits a call to the op and does NOT trace through the body.
+Inside the op body we run the same eager logic as PN12's patched
+forward_cuda: acquire output from `FFNIntermediateCache.acquire_silu_out`
+when the [M, d] 2-D shape matches, fall back to `torch.empty` otherwise,
+then dispatch to the underlying CUDA `silu_and_mul` kernel.
+
+Companion patch PN25 (`patch_N25_silu_inductor_safe_pool.py`) edits
+`SiluAndMul.forward_native` to route through this op when available.
+PN12 stays as the eager-path patch on `forward_cuda`. Both can run
+simultaneously without conflict — `forward_cuda` is called when
+`custom_ops=["+silu_and_mul"]`, `forward_native` is called otherwise.
+
+Composition with PN12
+---------------------
+PN12 patches `forward_cuda` (eager dispatch).
+PN25 patches `forward_native` via opaque op (compile dispatch).
+
+Together: both paths acquire from the same `FFNIntermediateCache`
+pool. No state collision — pool is keyed by (intermediate_size, dtype,
+device), and forward is strictly sequential in vLLM's schedule.
+
+If only PN12 enabled: eager workloads work, compile path leaks.
+If only PN25 enabled: compile workloads work, eager path leaks.
+If both enabled: full coverage. Recommended for any inductor-heavy
+config (35B FP8 + future MoE; club-3090 long-text/long-vision).
+
+Compat
+------
+- Requires `torch.library.custom_op` (PyTorch ≥ 2.4, available on
+  current vLLM nightly).
+- Enabled via `GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1`. OFF by
+  default.
+- Falls back gracefully if torch < 2.4 OR `torch.ops._C.silu_and_mul`
+  is missing (CPU-only build).
+
+Author: Sandermage(Sander) Barzov Aleksandr, Ukraine, Odessa
+Cross-engine inspiration:
+  - club-3090#16 (noonghunna independent work-in-progress)
+  - Genesis P7b `gdn_dual_stream_customop.py` (custom_op template)
+"""
+from __future__ import annotations
+
+import logging
+import os
+from typing import Optional
+
+import torch
+
+log = logging.getLogger("genesis.silu_and_mul_customop")
+
+_ENV_ENABLE_PN25 = "GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE"
+
+
+def is_pn25_enabled() -> bool:
+    return os.environ.get(_ENV_ENABLE_PN25, "").strip().lower() in (
+        "1", "true", "yes", "on",
+    )
+
+
+def should_apply() -> bool:
+    """Platform gate: NVIDIA CUDA + env opt-in."""
+    if not is_pn25_enabled():
+        return False
+    from vllm._genesis.guards import is_nvidia_cuda
+    if not is_nvidia_cuda():
+        return False
+    if not torch.cuda.is_available():
+        return False
+    return True
+
+
+_OP_QUALNAME = "genesis::silu_and_mul_pooled"
+_op_registered = False
+
+
+def _silu_and_mul_native_fallback(x: torch.Tensor) -> torch.Tensor:
+    """Pure-PyTorch fallback when CUDA op `_C.silu_and_mul` is not present.
+
+    Equivalent to `forward_native`. Allocates fresh — no pool benefit,
+    but preserves correctness. Hit only on CPU-only builds or in tests.
+    """
+    import torch.nn.functional as F
+    d = x.shape[-1] // 2
+    return F.silu(x[..., :d]) * x[..., d:]
+
+
+def _register_op_once() -> bool:
+    """Register `genesis::silu_and_mul_pooled` with torch.library.
+
+    Idempotent. Returns True on success. Returns False when
+    `torch.library.custom_op` is unavailable (torch < 2.4), in which case
+    PN25 wiring falls back to the upstream `forward_native` body. Errors
+    raised by the decorator itself (e.g. duplicate op name) propagate.
+    """
+    global _op_registered
+    if _op_registered:
+        return True
+
+    try:
+        custom_op = getattr(torch.library, "custom_op", None)
+        if custom_op is None:
+            log.info(
+                "[PN25] torch.library.custom_op not available "
+                "(torch<2.4) — falling back to vanilla forward_native"
+            )
+            return False
+    except Exception as e:
+        log.info("[PN25] torch.library import failed: %s", e)
+        return False
+
+    # Probe the underlying CUDA op once. If absent (CPU-only build,
+    # rare), we register a pure-pytorch impl that still routes through
+    # the opaque op so Inductor won't inline.
+    has_cuda_op = (
+        hasattr(torch.ops, "_C") and
+        hasattr(torch.ops._C, "silu_and_mul")
+    )
+
+    @custom_op(_OP_QUALNAME, mutates_args=(), device_types=("cuda",))
+    def _silu_and_mul_pooled(x: torch.Tensor) -> torch.Tensor:
+        """Real impl — runs outside dynamo trace (opaque op).
+
+        For 2-D `(M, 2*d)` tensors, acquires output from the shared
+        `FFNIntermediateCache` pool. For 3-D `(B, S, 2*d)` we fall
+        back to `torch.empty` because the pool is keyed on
+        `(num_tokens, intermediate_size)` and 3-D shapes only appear
+        in non-prefill paths where the alloc is small enough to not
+        matter.
+        """
+        d = x.shape[-1] // 2
+
+        if has_cuda_op and x.dim() == 2:
+            try:
+                from vllm._genesis.kernels.ffn_intermediate_cache import (
+                    FFNIntermediateCache as _Cache,
+                )
+                if _Cache.is_production_eligible():
+                    out = _Cache.acquire_silu_out(
+                        num_tokens=x.shape[0],
+                        intermediate_size=d,
+                        dtype=x.dtype, device=x.device,
+                    )
+                    torch.ops._C.silu_and_mul(out, x)
+                    return out
+            except Exception as e:
+                log.debug("[PN25] pool acquire failed, fallback: %s", e)
+
+        if has_cuda_op:
+            output_shape = x.shape[:-1] + (d,)
+            out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
+            torch.ops._C.silu_and_mul(out, x)
+            return out
+
+        return _silu_and_mul_native_fallback(x)
+
+    @_silu_and_mul_pooled.register_fake
+    def _silu_and_mul_pooled_fake(x: torch.Tensor) -> torch.Tensor:
+        """Shape-inference impl for dynamo tracing.
+
+        Returns an empty tensor of the correct shape. Dynamo calls this
+        with FakeTensors instead of running the real body, so its values
+        are never observed at runtime; only shape/dtype/device propagate.
+        """
+        d = x.shape[-1] // 2
+        output_shape = x.shape[:-1] + (d,)
+        return torch.empty(output_shape, dtype=x.dtype, device=x.device)
+
+    _op_registered = True
+    log.info("[PN25] registered torch op %s (Inductor-opaque)", _OP_QUALNAME)
+    return True
+
+
+def get_op_callable():
+    """Return the registered op callable, or None if registration failed.
+
+    Used by the PN25 wiring patch to populate the replacement body.
+    Caller is responsible for graceful degradation on None.
+    """
+    if not _register_op_once():
+        return None
+    try:
+        return torch.ops.genesis.silu_and_mul_pooled
+    except (AttributeError, RuntimeError):
+        return None
diff --git a/vllm/_genesis/patches/apply_all.py b/vllm/_genesis/patches/apply_all.py
@@ -1953,6 +1953,53 @@ def apply_patch_N12_ffn_intermediate_pool() -> PatchResult:
     return _failed(name, reason)
 
 
+@register_patch(
+    "PN25 SiluAndMul.forward_native opaque-op pool "
+    "(Cliff 1 mech B compile-path companion to PN12)"
+)
+def apply_patch_N25_silu_inductor_safe_pool() -> PatchResult:
+    """Patch N25: sister-patch to PN12 covering the compile dispatch path.
+
+    PN12 patches `SiluAndMul.forward_cuda` (eager mode); PN25 patches
+    `SiluAndMul.forward_native` via a `torch.library.custom_op` so
+    torch.compile/Inductor cannot inline the FFN intermediate alloc and
+    bypass PN12's pool.
+
+    Reported by noonghunna in club-3090#16 (VolandBerlioz Reddit + ampersandru
+    confirmation): on `custom_ops=["none"]` configs (default V1
+    aot_compile_fullgraph) `__call__` dispatches to `forward_native`,
+    Inductor traces and lowers to `empty_strided_cuda(...)` at line
+    `inductor_cache/...py:1208` — completely outside PN12's hot path.
+
+    PN25 registers `genesis::silu_and_mul_pooled` (opaque to Inductor)
+    and rewrites `forward_native` to dispatch through it. Inside the
+    opaque body, the same `FFNIntermediateCache` pool used by PN12
+    serves the [M, intermediate_size] transient. Pool is shared — both
+    paths converge on one buffer.
+
+    Status: opt-in via GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1.
+    Default OFF. Composes with PN12 (recommended pairing for any
+    inductor-heavy config). Standalone use covers compile-only paths;
+    PN12-only covers eager-only paths.
+    """
+    name = (
+        "PN25 SiluAndMul.forward_native opaque-op pool "
+        "(Cliff 1 mech B compile-path)"
+    )
+    if not _APPLY_MODE:
+        return _applied(name, "dry-run: text-patch ready")
+    try:
+        from vllm._genesis.wiring.hybrid import patch_N25_silu_inductor_safe_pool
+    except Exception as e:
+        return _failed(name, f"wiring import failed: {e}")
+    status, reason = patch_N25_silu_inductor_safe_pool.apply()
+    if status == "applied":
+        return _applied(name, reason)
+    if status == "skipped":
+        return _skipped(name, reason)
+    return _failed(name, reason)
+
+
 @register_patch("PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport)")
 def apply_patch_N13_cuda_graph_lambda_arity() -> PatchResult:
     """Patch N13: backport of vllm#41235 (roikoren755, OPEN as of 2026-04-29) —
diff --git a/vllm/_genesis/wiring/hybrid/patch_N12_ffn_intermediate_pool.py b/vllm/_genesis/wiring/hybrid/patch_N12_ffn_intermediate_pool.py
@@ -188,7 +188,11 @@ def _make_patcher() -> TextPatcher | None:
             # If vllm#34207 lands the body becomes silu_and_mul.out()
             # variant — different anchor, ours auto-skips.
             "torch.ops._C.silu_and_mul.out",
-            "FFNIntermediateCache",
+            # Note: deliberately do NOT use "FFNIntermediateCache" here
+            # as a drift marker — it's our own pool class name and may
+            # appear in sister-patches (PN25) that legitimately compose
+            # with PN12 on the same file. Idempotency is handled by the
+            # Genesis-PN12 wiring marker line above.
         ],
     )
 
diff --git a/vllm/_genesis/wiring/hybrid/patch_N25_silu_inductor_safe_pool.py b/vllm/_genesis/wiring/hybrid/patch_N25_silu_inductor_safe_pool.py