Cliff 8 hardening (apply_all.py):
- New PatchStats.partial_apply_warnings property surfaces skipped patches
whose reason indicates real anchor drift, an ambiguous anchor, or a missing
required anchor. These are reported separately from benign skips (opt-in OFF,
upstream-merged, platform mismatch, deferred, redundant).
- Boot summary line now appends "N ⚠️ partial-apply warning(s)" when the
count is non-zero, followed by one WARNING-level line per warning that names
the patch and its reason. The silent anchor-drift skip class that noonghunna
flagged in club-3090 discussion #19 is now impossible to miss in the boot output.
- Validated on live 35B DFlash 160K boot: 0 false positives after refining
the BENIGN list, which now matches skip reasons such as "opt-in:",
"redundant:", "deferred", "upstream may have absorbed", and "config: neutral".
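
A minimal sketch of the benign-versus-warning split described above, not the actual apply_all.py code. The `PatchStats` field layout, the `is_benign` helper, the `summary` method, and the example patch names are all hypothetical; only the marker substrings and the summary wording come from this commit message.

```python
from dataclasses import dataclass, field

# Marker substrings quoted in the commit message. The real BENIGN list in
# apply_all.py may contain more entries or match differently (assumption).
BENIGN_MARKERS = (
    "opt-in:",
    "redundant:",
    "deferred",
    "upstream may have absorbed",
    "config: neutral",
)


def is_benign(reason: str) -> bool:
    """Return True when a skip reason contains any benign marker."""
    lowered = reason.lower()
    return any(marker in lowered for marker in BENIGN_MARKERS)


@dataclass
class PatchStats:
    applied: int = 0
    failed: int = 0
    # (patch name, skip reason) pairs; this layout is an assumption.
    skipped: list[tuple[str, str]] = field(default_factory=list)

    @property
    def partial_apply_warnings(self) -> list[tuple[str, str]]:
        """Skipped patches whose reason is not on the benign list."""
        return [(n, r) for n, r in self.skipped if not is_benign(r)]

    def summary(self) -> str:
        """Boot summary line; the warning suffix appears only when non-zero."""
        line = f"{self.applied} applied / {len(self.skipped)} skipped / {self.failed} failed"
        count = len(self.partial_apply_warnings)
        if count:
            line += f" / {count} ⚠️ partial-apply warning(s)"
        return line


# Hypothetical patch names and reasons, for illustration only.
stats = PatchStats(
    applied=2,
    skipped=[
        ("P_example_a", "opt-in: env flag OFF"),
        ("P_example_b", "anchor drift: required anchor missing"),
    ],
)
print(stats.summary())
for name, reason in stats.partial_apply_warnings:
    print(f"WARNING {name}: {reason}")
```

Substring matching against an allow-list keeps every unrecognized skip reason in the warning bucket by default, which is the safer failure mode for catching anchor drift.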
CLIFFS.md PN19 H100-only flag (Cliff 1 mech A section):
- noonghunna confirmed on 2026-05-01 that PN19 costs ~120 MiB of KV pool on a
single 24 GB 3090 (vs the documented 200-500 MiB win on H100). At 218K
context + 0.985 mem-util, engine init fails with KV cache available 3.4 GiB /
required 3.52 GiB.
- Documented as: disable PN19 on 24 GB consumer cards (3090, 4090, A5000)
running long context. Same lesson as P104 L2 persistence, which regressed
16.2% on setups where the 32+ layer KV footprint far exceeds L2. Generic
allocator hints don't survive GPU class boundaries.
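
As a quick sanity check on the figures above, the gap between required and available KV cache can be converted to MiB. The inputs are the rounded values from the report, so the result is approximate.

```python
GIB_TO_MIB = 1024

available_gib = 3.4   # KV cache available at engine init (reported value)
required_gib = 3.52   # KV cache required at 218K context (reported value)

shortfall_mib = (required_gib - available_gib) * GIB_TO_MIB
print(f"shortfall: {shortfall_mib:.0f} MiB")  # shortfall: 123 MiB
```

The shortfall is close to the ~120 MiB that PN19 is reported to cost on this card, which is consistent with PN19 alone pushing this configuration over the limit.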
No regressions on live 35B DFlash 160K bench:
- 44 applied / 42 skipped / 0 failed / 0 partial-apply warnings
- prose 256t mean TPS 125.07, CV 3.07%
- tool-call 7/7 then 5/7 then 6/7 (variance noise band, no real regression)
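
For readers weighing the bench numbers above, a small arithmetic sketch: the standard deviation implied by the reported mean and CV, and the pooled tool-call pass rate across the three runs. It assumes CV is defined as standard deviation divided by mean; whether the bench uses sample or population deviation is not stated.

```python
mean_tps = 125.07  # reported prose 256t mean TPS
cv = 0.0307        # reported coefficient of variation (3.07%)

# CV = stdev / mean, so the implied spread is mean * CV.
implied_stdev = mean_tps * cv
print(f"implied stdev: {implied_stdev:.2f} TPS")  # implied stdev: 3.84 TPS

# (passed, total) for the three reported tool-call runs.
tool_call_runs = [(7, 7), (5, 7), (6, 7)]
passed = sum(p for p, _ in tool_call_runs)
total = sum(t for _, t in tool_call_runs)
pooled_rate = passed / total
print(f"pooled tool-call pass rate: {passed}/{total} = {pooled_rate:.1%}")  # 18/21 = 85.7%
```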
docs/CLIFFS.md: 6 additions, 0 deletions
@@ -22,10 +22,16 @@ You hit OOM earlier than you should on long-context workloads. On a 24 GB card r
**PN17 — FA2 lse runtime clamp.** Genesis-original, 2026-04-30, in response to noonghunna Issue #11. Patches FA2 to use the actual `seq_lens.max()` at runtime instead of `max_model_len` during capture.
**PN19 — scoped max-split cudagraph init (datacenter Ampere / Hopper / Blackwell only).** Genesis-original, 2026-04-30. Frees 200-500 MiB during model load on H100/B100. **Does NOT transfer cleanly to Ampere consumer:** noonghunna 2026-05-01 confirmed PN19 costs ~120 MiB KV pool on a 24 GB single-3090 (vs the documented 200-500 MiB win). At 218K context + 0.985 mem-util, engine init fails with `KV cache memory available 3.4 GiB, estimated maximum model length is 206400`. Different allocator behavior under PyTorch 2.10+ load-time fragmentation on consumer SKUs.
> **Recommendation:** disable PN19 on 24 GB consumer cards (3090, 4090, A5000) running long context. Same lesson as P104 L2 persistence (regressed -16.2% on 32+ layer KV >> L2 setups). Generic allocator hints don't survive GPU class boundaries.