Record: SP8192 + yahya010 NN base + byte-PPM mixer — val_bpb 0.99145 …#1933
deborahnelson8788726 wants to merge 1 commit into openai:main
Conversation
3-seed mean: 0.99145 (std 0.00078, full FineWeb val 152.6 MB)

Beats current main SOTA 1.0810 by −0.08955 and OE-GOD's pending PR openai#1795 (1.01252) by −0.02107.

Composition of two unmerged contributions:
- @yahya010 PR openai#1727 NN base (1.07217, MP-SGD TTT + QK-Gain 5.25)
- @OE-GOD PR openai#1795 byte-level PPM-D mixer (strict-legal outcome-independent gate)

Source diff vs PR openai#1727: ~120 lines added in eval_val_sliding for the PPM mixer. Adds only records/track_10min_16mb/2026-04-29_PPM_SP8192_yahya_base/.

Compliance: train 596s (under 600s), artifact 15.9 MB (under 16 MB), mean eval 562s (seeds 0/1234 at 533/527s, under 600s; seed 42 at 626s due to cold cache).

Inherits the organizer-ruling-pending status of openai#1795 on byte-PPM as TTT.
Closing in light of the C2 discussion in Issue #1872 (raised by @sharpobject and acknowledged by @andrewbaggio1):
The byte-mix distribution does not normalize over the official token alphabet Σ, which makes the metric not a valid bits-per-byte under the C2 rules. Withdrawing rather than asking maintainers to spend review cycles on a submission that the C2 ruling already addresses. Thanks to @yahya010, @OE-GOD, @bigbag, @clarkkev for the upstream components, and to @andrewbaggio1 / @sharpobject for the clean C2 framing.
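The normalization failure is easy to see in a toy example. The sketch below is purely illustrative (the three-token alphabet, the probabilities, and the byte model are all made up, not from the harness): mixing a normalized token distribution with a byte-level product model yields per-token scores that no longer sum to 1 over Σ, so their negative log is not a valid code length.

```python
# Toy alphabet Sigma of three "tokens", as byte strings (hypothetical).
sigma = [b"a", b"ab", b"b"]

# A properly normalized NN distribution over Sigma.
p_tok = {b"a": 0.5, b"ab": 0.3, b"b": 0.2}

def p_bytes(t):
    """Byte-level product model: i.i.d. P(byte) = 0.5, no stop symbol.

    This sums to 1 over fixed-length byte strings, NOT over Sigma.
    """
    return 0.5 ** len(t)

lam = 0.5
mixed = {t: lam * p_tok[t] + (1 - lam) * p_bytes(t) for t in sigma}
total = sum(mixed.values())
print(total)  # -> ~1.125, not 1.0: the mixture is not a distribution over Sigma
```

Because the byte model's mass over Σ is 1.25 here (tokens are neither fixed-length nor prefix-free), the λ-mixture inherits the excess mass, which is exactly the C2 objection.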
Summary
val_bpb = 0.99145 (3-seed mean, std=0.00078, full FineWeb val 152,574,319 bytes)
Beats current main SOTA 1.0810 by −0.08955 and the strongest pending PR #1795 (1.01252) by −0.02107.
This is the composition of two complementary, already-published unmerged contributions, both inherited unchanged:
NN base = @yahya010 PR #1727 (val_bpb 1.07217, 3-seed mean) — Multi-Phase Global SGD TTT (4 phases) + QK-Gain 5.25 + Phased LoRA TTT, on the @bigbag PR #1493 (1.0810, 3-seed) / @clarkkev PR #1394 (1.08563, 5-seed) lineage. Stack and env vars unchanged.
Eval-time mixer = @OE-GOD PR #1795 (val_bpb 1.01252, 3-seed) — byte-level PPM-D order-4 with a strict-legal outcome-independent adaptive-λ gate. Function copied verbatim (`_ppm_mixture_bpb`, ~60 lines) and called from `eval_val_sliding` after the distributed all-reduce.
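For readers without access to #1795, here is a minimal sketch of the idea: an order-k byte PPM with method-D-style escapes, λ-mixed with the NN's per-byte probabilities. This is an illustrative reimplementation, not the verbatim `_ppm_mixture_bpb`; in particular, a fixed λ stands in for the adaptive outcome-independent gate, and `nn_byte_probs` is a hypothetical input.

```python
import math
from collections import defaultdict

class BytePPM:
    """Order-k byte PPM with PPM-D-style escapes (sketch, not #1795's code)."""

    def __init__(self, order=4):
        self.order = order
        # counts[k][context_bytes] -> {byte_value: count}
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(order + 1)]

    def prob(self, ctx, b):
        """P(b | ctx): back off from the longest matched context."""
        p_escape = 1.0
        for k in range(min(self.order, len(ctx)), -1, -1):
            c = self.counts[k][ctx[len(ctx) - k:]]
            n = sum(c.values())
            if n == 0:
                continue
            d = len(c)            # distinct bytes seen in this context
            cb = c.get(b, 0)
            if cb > 0:
                return p_escape * (cb - 0.5) / n   # PPM-D symbol estimate
            p_escape *= d / (2.0 * n)              # PPM-D escape mass
        return p_escape / 256.0                     # order -1: uniform bytes

    def update(self, ctx, b):
        for k in range(min(self.order, len(ctx)) + 1):
            self.counts[k][ctx[len(ctx) - k:]][b] += 1

def mixed_bpb(data, nn_byte_probs, lam=0.5, order=4):
    """Bits-per-byte of the λ-mixture of NN byte probs and the PPM.

    nn_byte_probs[i] is the NN's probability of byte data[i]; lam is a
    fixed mixing weight standing in for #1795's adaptive gate.
    """
    ppm = BytePPM(order)
    bits = 0.0
    for i, b in enumerate(data):
        ctx = data[max(0, i - order):i]
        p = lam * nn_byte_probs[i] + (1 - lam) * ppm.prob(ctx, b)
        bits -= math.log2(p)
        ppm.update(ctx, b)
    return bits / len(data)
```

On repetitive input the PPM term quickly dominates, which is why the mixer helps most on locally predictable byte runs while the λ floor keeps the NN's probability as a fallback.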
3-Seed Results
Our NN-only token-BPB (1.07646) matches @yahya010's 1.07217 within seed noise (σ_seed ≈ 0.0007). The PPM mixer Δ (−0.0744) matches @OE-GOD's reported Δ (−0.0744) on @clarkkev's base.
Why this composition
What changed vs base
Source diff vs `records/track_10min_16mb/2026-04-18_SP8192_MPSGD_QKGain525/train_gpt.py`:
Total diff: ~120 lines added, 0 lines removed from yahya010's NN logic.
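Since the NN evaluates in token space while val_bpb is byte-level, comparing the two relies on the standard nats-per-token to bits-per-byte conversion. A sketch of that identity (the actual token/byte counts come from the eval harness, e.g. the 152,574,319-byte FineWeb val split quoted above):

```python
import math

def token_loss_to_bpb(mean_nats_per_token, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) to bits-per-byte.

    total_bits = mean_nats_per_token * n_tokens / ln(2); bpb = total_bits / n_bytes.
    """
    total_bits = mean_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# Example: a loss of ln(2) nats/token over as many tokens as bytes is exactly 1 bpb.
print(token_loss_to_bpb(math.log(2), n_tokens=10, n_bytes=10))  # -> 1.0
```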
Compliance
Scope
Adds only `records/track_10min_16mb/2026-04-29_PPM_SP8192_yahya_base/`. No changes outside.
Credits
Test plan
If PPM-as-TTT is ruled invalid, this submission falls back to the inherited NN-only score (1.076 byte-BPB / 1.076 NN-token-BPB matching yahya010), which is still a valid record vs current main SOTA 1.0810.