Commit 2b239a8

Author: Sandermage
hybrid: add PN25 — Inductor-safe silu_and_mul pool (sister-patch to PN12)
PN12 patches the eager-mode SiluAndMul.forward_cuda; under V1 default custom_ops=["none"], dispatch routes through forward_native, which torch.compile/Inductor inlines and lowers to empty_strided_cuda(...), completely bypassing PN12's pool.

Reported by noonghunna in club-3090#16 (VolandBerlioz Reddit + ampersandru cross-rig confirmation): RTX 3090 24 GB + Lorbus 27B + OpenCode 29K-token prefill OOMs at inductor_cache/...py:1208 allocating (s18, 17408) fp16 = 137.6 MiB.

Genesis stack inherits the same flaw: our PN12 only patches forward_cuda. We don't see it in PROD only because our 27B Lorbus configs short-circuit the compile path on this kernel. Future Inductor-default configs would expose the leak.

PN25 design (complement, not replacement of PN12):

- New kernel module vllm/_genesis/kernels/silu_and_mul_customop.py registers `genesis::silu_and_mul_pooled` via torch.library.custom_op with mutates_args=() and device_types=("cuda",). Inductor treats custom ops as opaque: it emits a call and never inlines. Inside the body we acquire from FFNIntermediateCache.acquire_silu_out (same pool used by PN12) and dispatch to torch.ops._C.silu_and_mul.
- New wiring patch_N25_silu_inductor_safe_pool.py text-patches SiluAndMul.forward_native to dispatch through the opaque op. Falls back to vanilla F.silu * mul math when registration is unavailable (torch < 2.4 or CPU-only build): soft degradation.
- Dispatcher entry declares conflicts_with=[], requires_patches=[]. PN12 + PN25 patch DIFFERENT methods (forward_cuda vs forward_native) so they can coexist on the same file without anchor collision.

Bug fix in PN12 marker logic:

- Removed "FFNIntermediateCache" from upstream_drift_markers. This is our own internal pool class name and may legitimately appear in sister-patches (PN25 docstring references it). Drift markers should signal upstream variants, not Genesis-internal symbols. Without this fix, applying PN25 to a vanilla file would correctly patch forward_native, but a subsequent PN12 attempt on the same file would see "FFNIntermediateCache" via PN25's docstring and skip with upstream_merged, silently bypassing forward_cuda pool wiring.

Validation (clean container test):

- register: True
- op_callable: genesis.silu_and_mul_pooled
- eager call: shape (2, 8) bfloat16, finite=True
- torch.compile call: shape (2, 8), finite=True (fake impl validates for shape inference)
- live container PN12+PN25 sequential apply: both "applied" status, both markers in activation.py.

PN25 ships opt-in OFF (GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1) and is NOT enabled in any launch script. Users with inductor-heavy configs (e.g. 35B FP8 long-context, future MoE) can pair it with PN12 for full FFN intermediate pooling coverage.

Cross-reference: noonghunna's club-3090#16 work-in-progress on the same flaw. Independent convergence on the torch.library.custom_op approach.
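For a sense of scale, the size of the SiluAndMul output transient follows directly from the row count and the intermediate size. The sketch below assumes 2 bytes per element (fp16 or bf16); `silu_out_bytes` and `to_mib` are illustrative helpers, not part of Genesis or vLLM. The concrete value of the symbolic dimension `s18` is not given in the report, so the `max_num_batched_tokens=4128` figure quoted in the kernel docstring is used as a stand-in.

```python
def silu_out_bytes(num_tokens: int, intermediate_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed for the [num_tokens, intermediate_size] output buffer."""
    return num_tokens * intermediate_size * bytes_per_elem


def to_mib(num_bytes: int) -> float:
    """Convert bytes to MiB (2**20 bytes)."""
    return num_bytes / (1024 * 1024)


# Stand-in row count: the batch limit from the kernel docstring (assumption).
request_mib = to_mib(silu_out_bytes(num_tokens=4128, intermediate_size=17408))
print(f"{request_mib:.2f} MiB")  # 137.06 MiB
```

This lands slightly below the reported 137.6 MiB; the source does not explain the small difference. Either figure exceeds the 131.75 MiB that the kernel docstring reports as free at the time of the OOM.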
1 parent 1ac34a8 commit 2b239a8

5 files changed

Lines changed: 526 additions & 1 deletion

vllm/_genesis/dispatcher.py

Lines changed: 24 additions & 0 deletions
@@ -613,6 +613,30 @@ class ValidationIssue:
         "conflicts_with": [],
         "requires_patches": [],
     },
+    "PN25": {
+        "title": "SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path)",
+        "env_flag": "GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE",
+        "default_on": False,
+        "category": "memory_savings",
+        "credit": (
+            "Genesis-original 2026-05-01 in response to noonghunna's "
+            "club-3090#16 (VolandBerlioz/ampersandru cross-rig OOM trace, "
+            "RTX 3090 24 GB + Lorbus 27B + OpenCode 29K prefill). PN12 "
+            "patches eager `forward_cuda` but `custom_ops=['none']` (default "
+            "under V1 aot_compile_fullgraph) routes dispatch through "
+            "`forward_native` which Inductor inlines and lowers to "
+            "`empty_strided_cuda(...)`, bypassing PN12's pool. "
+            "Sister-patch PN25 patches `forward_native` to dispatch through "
+            "an opaque `genesis::silu_and_mul_pooled` torch.library.custom_op "
+            "(Inductor cannot inline opaque ops). Both patches share the "
+            "same FFNIntermediateCache pool. Recommended pairing for any "
+            "inductor-heavy config."
+        ),
+        "upstream_pr": None,
+        "applies_to": {},
+        "conflicts_with": [],
+        "requires_patches": [],  # complements PN12 but does not require it
+    },
     "PN17": {
         "title": "FA2 softmax_lse runtime clamp (Cliff 1 mechanism A, Issue #11)",
         "env_flag": "GENESIS_ENABLE_PN17_FA2_LSE_CLAMP",

vllm/_genesis/kernels/silu_and_mul_customop.py

Lines changed: 228 additions & 0 deletions
@@ -0,0 +1,228 @@
+# SPDX-License-Identifier: Apache-2.0
+"""PN25 - `silu_and_mul` as torch.library.custom_op (Inductor-safe pool).
+
+Problem (continuation of PN12)
+------------------------------
+PN12 text-patches `SiluAndMul.forward_cuda` to acquire its
+`[M, intermediate_size]` BF16/FP16 transient from `FFNIntermediateCache`
+instead of `torch.empty()`. That works in eager mode.
+
+Reported by noonghunna 2026-04-30 in club-3090#16, with confirmation
+from VolandBerlioz on a real OpenCode workload (29K sys+tools prefill,
+24 GB single 3090): when `custom_ops=["none"]` is the default (which is
+typical under V1 `aot_compile_fullgraph`), `SiluAndMul.__call__`
+dispatches to `forward_native`, NOT `forward_cuda`. `forward_native`
+is
+
+    @staticmethod
+    def forward_native(x):
+        d = x.shape[-1] // 2
+        return F.silu(x[..., :d]) * x[..., d:]
+
+torch.compile's Inductor traces this body into a fused kernel and
+issues its own `empty_strided_cuda((s18, intermediate_size), fp16)`
+for the multiplication output. PN12's pool never gets a chance:
+the patched `forward_cuda` method is never reached.
+
+Symptom on a 24 GB 3090 + Lorbus 27B (intermediate=17408) at
+`max_num_batched_tokens=4128`: 137.6 MiB allocation, 131.75 MiB free,
+OOM. Cliff 1 mech B fires on real workloads while our verify-stress
+25K synthetic happens to hit shapes that DO reach eager forward_cuda
+and pass.
+
+Genesis stack vulnerability
+---------------------------
+Same architectural flaw exists in our PN12: we only patch
+`forward_cuda`. Our 27B PROD configs avoid the inductor path because
+`--cudagraph-mode=PIECEWISE` + offline-quant INT4 short-circuits the
+compile pipeline on this kernel. But long-context + chunked prefill
+under a future Inductor-default config could hit it.
+
+Fix design (this module)
+------------------------
+Register `genesis::silu_and_mul_pooled` as `torch.library.custom_op`
+with `device_types=("cuda",)`. Inductor treats custom ops as opaque
+nodes: it emits a call to the op and does NOT trace through the body.
+Inside the op body we run the same eager logic as PN12's patched
+forward_cuda: acquire output from `FFNIntermediateCache.acquire_silu_out`
+when the [M, d] 2-D shape matches, fall back to `torch.empty` otherwise,
+then dispatch to the underlying CUDA `silu_and_mul` kernel.
+
+Companion patch PN25 (`patch_N25_silu_inductor_safe_pool.py`) edits
+`SiluAndMul.forward_native` to route through this op when available.
+PN12 stays as the eager-path patch on `forward_cuda`. Both can run
+simultaneously without conflict: `forward_cuda` is called when
+`custom_ops=["+silu_and_mul"]`, `forward_native` is called otherwise.
+
+Composition with PN12
+---------------------
+PN12 patches `forward_cuda` (eager dispatch).
+PN25 patches `forward_native` via opaque op (compile dispatch).
+
+Together: both paths acquire from the same `FFNIntermediateCache`
+pool. No state collision: pool is keyed by (intermediate_size, dtype,
+device), and forward is strictly sequential in vLLM's schedule.
+
+If only PN12 enabled: eager workloads work, compile path leaks.
+If only PN25 enabled: compile workloads work, eager path leaks.
+If both enabled: full coverage. Recommended for any inductor-heavy
+config (35B FP8 + future MoE; club-3090 long-text/long-vision).
+
+Compat
+------
+- Requires `torch.library.custom_op` (PyTorch ≥ 2.4, available on
+  current vLLM nightly).
+- Enabled via `GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1`. OFF by
+  default.
+- Falls back gracefully if torch < 2.4 OR `torch.ops._C.silu_and_mul`
+  is missing (CPU-only build).
+
+Author: Sandermage(Sander) Barzov Aleksandr, Ukraine, Odessa
+Cross-engine inspiration:
+  - club-3090#16 (noonghunna independent work-in-progress)
+  - Genesis P7b `gdn_dual_stream_customop.py` (custom_op template)
+"""
+from __future__ import annotations
+
+import logging
+import os
+from typing import Optional
+
+import torch
+
+log = logging.getLogger("genesis.silu_and_mul_customop")
+
+_ENV_ENABLE_PN25 = "GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE"
+
+
+def is_pn25_enabled() -> bool:
+    return os.environ.get(_ENV_ENABLE_PN25, "").strip().lower() in (
+        "1", "true", "yes", "on",
+    )
+
+
+def should_apply() -> bool:
+    """Platform gate: NVIDIA CUDA + env opt-in."""
+    if not is_pn25_enabled():
+        return False
+    from vllm._genesis.guards import is_nvidia_cuda
+    if not is_nvidia_cuda():
+        return False
+    if not torch.cuda.is_available():
+        return False
+    return True
+
+
+_OP_QUALNAME = "genesis::silu_and_mul_pooled"
+_op_registered = False
+
+
+def _silu_and_mul_native_fallback(x: torch.Tensor) -> torch.Tensor:
+    """Pure-PyTorch fallback when CUDA op `_C.silu_and_mul` is not present.
+
+    Equivalent to `forward_native`. Allocates fresh: no pool benefit,
+    but preserves correctness. Hit only on CPU-only builds or in tests.
+    """
+    import torch.nn.functional as F
+    d = x.shape[-1] // 2
+    return F.silu(x[..., :d]) * x[..., d:]
+
+
+def _register_op_once() -> bool:
+    """Register `genesis::silu_and_mul_pooled` with torch.library.
+
+    Idempotent. Returns True on success. On any failure (torch too old,
+    op already registered by sister module, fake-impl rejection by
+    dynamo) returns False and PN25 wiring will fall back to upstream
+    `forward_native` body.
+    """
+    global _op_registered
+    if _op_registered:
+        return True
+
+    try:
+        custom_op = getattr(torch.library, "custom_op", None)
+        if custom_op is None:
+            log.info(
+                "[PN25] torch.library.custom_op not available "
+                "(torch<2.4) - falling back to vanilla forward_native"
+            )
+            return False
+    except Exception as e:
+        log.info("[PN25] torch.library import failed: %s", e)
+        return False
+
+    # Probe the underlying CUDA op once. If absent (CPU-only build,
+    # rare), we register a pure-pytorch impl that still routes through
+    # the opaque op so Inductor won't inline.
+    has_cuda_op = (
+        hasattr(torch.ops, "_C") and
+        hasattr(torch.ops._C, "silu_and_mul")
+    )
+
+    @custom_op(_OP_QUALNAME, mutates_args=(), device_types=("cuda",))
+    def _silu_and_mul_pooled(x: torch.Tensor) -> torch.Tensor:
+        """Real impl: runs outside dynamo trace (opaque op).
+
+        For 2-D `(M, 2*d)` tensors, acquires output from the shared
+        `FFNIntermediateCache` pool. For 3-D `(B, S, 2*d)` we fall
+        back to `torch.empty` because the pool is keyed on
+        `(num_tokens, intermediate_size)` and 3-D shapes only appear
+        in non-prefill paths where the alloc is small enough to not
+        matter.
+        """
+        d = x.shape[-1] // 2
+
+        if has_cuda_op and x.dim() == 2:
+            try:
+                from vllm._genesis.kernels.ffn_intermediate_cache import (
+                    FFNIntermediateCache as _Cache,
+                )
+                if _Cache.is_production_eligible():
+                    out = _Cache.acquire_silu_out(
+                        num_tokens=x.shape[0],
+                        intermediate_size=d,
+                        dtype=x.dtype, device=x.device,
+                    )
+                    torch.ops._C.silu_and_mul(out, x)
+                    return out
+            except Exception as e:
+                log.debug("[PN25] pool acquire failed, fallback: %s", e)
+
+        if has_cuda_op:
+            output_shape = x.shape[:-1] + (d,)
+            out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
+            torch.ops._C.silu_and_mul(out, x)
+            return out
+
+        return _silu_and_mul_native_fallback(x)
+
+    @_silu_and_mul_pooled.register_fake
+    def _silu_and_mul_pooled_fake(x: torch.Tensor) -> torch.Tensor:
+        """Shape-inference impl for dynamo tracing.
+
+        Returns an empty tensor of the correct shape; dynamo never
+        executes the body so this is never observed at runtime. It is only
+        used for output shape propagation through the compiled graph.
+        """
+        d = x.shape[-1] // 2
+        output_shape = x.shape[:-1] + (d,)
+        return torch.empty(output_shape, dtype=x.dtype, device=x.device)
+
+    _op_registered = True
+    log.info("[PN25] registered torch op %s (Inductor-opaque)", _OP_QUALNAME)
+    return True
+
+
+def get_op_callable():
+    """Return the registered op callable, or None if registration failed.
+
+    Used by the PN25 wiring patch to populate the replacement body.
+    Caller is responsible for graceful degradation on None.
+    """
+    if not _register_op_once():
+        return None
+    try:
+        return torch.ops.genesis.silu_and_mul_pooled
+    except (AttributeError, RuntimeError):
+        return None
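
The module docstring says the pool is keyed by (intermediate_size, dtype, device) and relies on forward passes being strictly sequential. The internals of `FFNIntermediateCache` are not part of this commit, so the following is only a rough model of that idea using plain byte buffers instead of tensors. `TinyPool` and its `acquire` method are hypothetical names chosen for illustration.

```python
from typing import Dict, Tuple


class TinyPool:
    """Hypothetical stand-in for a keyed buffer pool; holds raw bytes, not tensors."""

    def __init__(self) -> None:
        self._buffers: Dict[Tuple[int, str, str], bytearray] = {}
        self.allocations = 0  # number of times a fresh buffer was created

    def acquire(self, num_tokens: int, intermediate_size: int,
                dtype: str = "fp16", device: str = "cuda:0") -> memoryview:
        itemsize = {"fp16": 2, "bf16": 2}[dtype]
        needed = num_tokens * intermediate_size * itemsize
        key = (intermediate_size, dtype, device)
        buf = self._buffers.get(key)
        if buf is None or len(buf) < needed:
            # Grow (or create) the buffer for this key; later calls reuse it.
            buf = bytearray(needed)
            self._buffers[key] = buf
            self.allocations += 1
        # Hand out a view over the front of the shared buffer.
        return memoryview(buf)[:needed]


pool = TinyPool()
first = pool.acquire(num_tokens=4, intermediate_size=8)
second = pool.acquire(num_tokens=2, intermediate_size=8)
first[0] = 7
print(pool.allocations)  # 1
print(second[0])         # 7
```

The second print shows why sequential execution matters in a design like this: both views alias the same storage, so a result has to be consumed before the next acquire for the same key overwrites it.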

vllm/_genesis/patches/apply_all.py

Lines changed: 47 additions & 0 deletions
@@ -1953,6 +1953,53 @@ def apply_patch_N12_ffn_intermediate_pool() -> PatchResult:
     return _failed(name, reason)


+@register_patch(
+    "PN25 SiluAndMul.forward_native opaque-op pool "
+    "(Cliff 1 mech B compile-path companion to PN12)"
+)
+def apply_patch_N25_silu_inductor_safe_pool() -> PatchResult:
+    """Patch N25: sister-patch to PN12 covering the compile dispatch path.
+
+    PN12 patches `SiluAndMul.forward_cuda` (eager mode); PN25 patches
+    `SiluAndMul.forward_native` via a `torch.library.custom_op` so
+    torch.compile/Inductor cannot inline the FFN intermediate alloc and
+    bypass PN12's pool.
+
+    Reported by noonghunna in club-3090#16 (VolandBerlioz Reddit + ampersandru
+    confirmation): on `custom_ops=["none"]` configs (default V1
+    aot_compile_fullgraph) `__call__` dispatches to `forward_native`,
+    Inductor traces and lowers to `empty_strided_cuda(...)` at line
+    `inductor_cache/...py:1208`, completely outside PN12's hot path.
+
+    PN25 registers `genesis::silu_and_mul_pooled` (opaque to Inductor)
+    and rewrites `forward_native` to dispatch through it. Inside the
+    opaque body, the same `FFNIntermediateCache` pool used by PN12
+    serves the [M, intermediate_size] transient. Pool is shared: both
+    paths converge on one buffer.
+
+    Status: opt-in via GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1.
+    Default OFF. Composes with PN12 (recommended pairing for any
+    inductor-heavy config). Standalone use covers compile-only paths;
+    PN12-only covers eager-only paths.
+    """
+    name = (
+        "PN25 SiluAndMul.forward_native opaque-op pool "
+        "(Cliff 1 mech B compile-path)"
+    )
+    if not _APPLY_MODE:
+        return _applied(name, "dry-run: text-patch ready")
+    try:
+        from vllm._genesis.wiring.hybrid import patch_N25_silu_inductor_safe_pool
+    except Exception as e:
+        return _failed(name, f"wiring import failed: {e}")
+    status, reason = patch_N25_silu_inductor_safe_pool.apply()
+    if status == "applied":
+        return _applied(name, reason)
+    if status == "skipped":
+        return _skipped(name, reason)
+    return _failed(name, reason)
+
+
 @register_patch("PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport)")
 def apply_patch_N13_cuda_graph_lambda_arity() -> PatchResult:
     """Patch N13: backport of vllm#41235 (roikoren755, OPEN as of 2026-04-29) -

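The docstrings above state the dispatch rule that makes PN12 alone insufficient: `forward_cuda` runs when `custom_ops=["+silu_and_mul"]`, and `forward_native` runs otherwise. The toy class below models only that rule; it is a simplified illustration and not vLLM's actual `CustomOp` implementation. `ToyCustomOp` and `pooled_forward_cuda` are hypothetical names.

```python
class ToyCustomOp:
    """Simplified model of the dispatch rule; not vLLM's CustomOp."""

    name = "silu_and_mul"

    def __init__(self, custom_ops):
        # "+<name>" opts into the hand-written kernel path; anything else
        # (for example "none") leaves the op on the native path.
        self._use_cuda = f"+{self.name}" in custom_ops

    def forward_cuda(self, x):
        return ("forward_cuda", x)

    def forward_native(self, x):
        return ("forward_native", x)

    def __call__(self, x):
        if self._use_cuda:
            return self.forward_cuda(x)
        return self.forward_native(x)


def pooled_forward_cuda(self, x):
    """Stands in for a PN12-style patch that only touches forward_cuda."""
    return ("forward_cuda+pool", x)


ToyCustomOp.forward_cuda = pooled_forward_cuda

compile_path = ToyCustomOp(["none"])
eager_path = ToyCustomOp(["+silu_and_mul"])
print(compile_path(1)[0])  # forward_native
print(eager_path(1)[0])    # forward_cuda+pool
```

Under the `["none"]` configuration the patched method is never called, which is the gap PN25 closes by patching `forward_native` as well.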
vllm/_genesis/wiring/hybrid/patch_N12_ffn_intermediate_pool.py

Lines changed: 5 additions & 1 deletion
@@ -188,7 +188,11 @@ def _make_patcher() -> TextPatcher | None:
             # If vllm#34207 lands the body becomes silu_and_mul.out()
             # variant - different anchor, ours auto-skips.
             "torch.ops._C.silu_and_mul.out",
-            "FFNIntermediateCache",
+            # Note: deliberately do NOT use "FFNIntermediateCache" here
+            # as a drift marker - it's our own pool class name and may
+            # appear in sister-patches (PN25) that legitimately compose
+            # with PN12 on the same file. Idempotency is handled by the
+            # Genesis-PN12 wiring marker line above.
         ],
     )
