
Record: SP8192 + yahya010 NN base + byte-PPM mixer — val_bpb 0.99145 … #1933

Closed
deborahnelson8788726 wants to merge 1 commit into openai:main from deborahnelson8788726:ppm-sp8192-yahya

Conversation

@deborahnelson8788726

Summary

val_bpb = 0.99145 (3-seed mean, std=0.00078, full FineWeb val 152,574,319 bytes)

Beats current main SOTA 1.0810 by −0.08955 and the strongest pending PR #1795 (1.01252) by −0.02107.

This is the composition of two complementary, already-published but still-unmerged contributions, both inherited unchanged:

  1. NN base = @yahya010's PR #1727 ("Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean)"): Multi-Phase Global SGD TTT (4 phases) + QK-Gain 5.25 + Phased LoRA TTT, built on the lineage of @bigbag's PR #1493 ("Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)") and @clarkkev's PR #1394 ("Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5-seed mean)"). Stack and env vars unchanged.

  2. Eval-time mixer = @OE-GOD's PR #1795 ("Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed)"): byte-level PPM-D order-4 with a strict-legal, outcome-independent adaptive-λ gate. The function is copied verbatim (`_ppm_mixture_bpb`, ~60 lines) and called from `eval_val_sliding` after the distributed all-reduce; a sketch of the mixture idea follows below.
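
For intuition, here is a minimal sketch of the mixture, not the verbatim `_ppm_mixture_bpb` from #1795: the `ppm` interface, the entropy gate, and all names are illustrative assumptions; only the λ values, the threshold, and the score-before-update discipline come from the PRs above.

```python
import math

def mix_bpb_sketch(nn_token_logps, token_bytes, ppm,
                   lam_h=0.9, lam_l=0.05, thresh=0.9):
    """Illustrative only -- NOT the verbatim _ppm_mixture_bpb from #1795.

    nn_token_logps : NN log p(token | history), natural log, per token
    token_bytes    : each token's UTF-8 bytes
    ppm            : hypothetical online byte model exposing
                       entropy()       -> next-byte entropy in bits,
                                          computed from history alone
                       score_update(b) -> log p(b | history), then
                                          fold b into the counters
    """
    total_bits, n_bytes = 0.0, 0
    for lp_nn, tb in zip(nn_token_logps, token_bytes):
        lp_ppm, confident = 0.0, True
        for b in tb:
            # Gate decided from the history alone (outcome-independent):
            # take the high-lambda path only where PPM's prediction is sharp.
            if ppm.entropy() > thresh * 8.0:  # 8 bits = uniform over bytes
                confident = False
            lp_ppm += ppm.score_update(b)     # score-before-update
        lam = lam_h if confident else lam_l
        lp_mix = math.log(lam * math.exp(lp_ppm) + (1.0 - lam) * math.exp(lp_nn))
        total_bits += -lp_mix / math.log(2)
        n_bytes += len(tb)
    return total_bits / n_bytes  # bits per byte
```

Note the mixture is evaluated only at the realized token's bytes; the closing comment below picks up exactly this point under C2.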

3-Seed Results

| Seed | NN-only token BPB | NN-only byte BPB | Mix BPB | Δ from PPM | Artifact (bytes) | Train | Eval |
|------|-------------------|------------------|---------|------------|------------------|-------|------|
| 42   | 1.07751 | 1.06694 | 0.99235 | −0.07459 | 15,906,666 | 596s | 626s |
| 0    | 1.07593 | 1.06538 | 0.99101 | −0.07437 | 15,911,323 | 596s | 533s |
| 1234 | 1.07595 | 1.06540 | 0.99099 | −0.07441 | 15,904,100 | 596s | 527s |
| mean | 1.07646 | 1.06591 | 0.99145 | −0.07446 | 15,907,363 | 596s | 562s |

Our NN-only token-BPB (1.07646) matches @yahya010's 1.07217 within seed noise (σ_seed ≈ 0.0007). The PPM mixer Δ (−0.0744) matches @OE-GOD's reported Δ (−0.0744) on @clarkkev's base.

Why this composition

The two pieces are orthogonal: the NN base is the strongest published stack, and the mixer acts only at eval time on the NN's per-token log-probs. @OE-GOD's mixer Δ (−0.0744) was measured on @clarkkev's weaker base, so applying the same mixer to the stronger base should compose additively, and the results above bear that out.

What changed vs base

Source diff vs `records/track_10min_16mb/2026-04-18_SP8192_MPSGD_QKGain525/train_gpt.py`:

  • `_ppm_mixture_bpb` function added before `_loss_bpb` (~60 lines, copied verbatim from @OE-GOD's PR #1795)
  • `eval_val_sliding`: collect `lp_chunks` and `tgt_chunks` per scored window; gather to rank 0 and call `_ppm_mixture_bpb` with `O=4 H=0.9 L=0.05 T=0.9` (OE-GOD's tuned defaults)
  • Two new env vars: `PPM_MIX_ENABLED` (default 0), and `PPM_ORDER`/`PPM_LAMBDA_H`/`PPM_LAMBDA_L`/`PPM_THRESH` (defaults match OE-GOD)
  • Runtime: `SLIDING_WINDOW_ENABLED=1`, `PHASED_TTT_ENABLED=0`

Total diff: ~120 lines added, 0 lines removed from yahya010's NN logic. A sketch of the hook wiring follows.
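
A sketch of how the hook might be wired, with hypothetical function and variable names; only the env var names, their defaults, and the `O/H/L/T` keyword arguments come from the diff described above.

```python
import os

# Env var plumbing as listed above; defaults match OE-GOD's tuned values.
PPM_MIX_ENABLED = int(os.environ.get("PPM_MIX_ENABLED", "0"))
PPM_ORDER       = int(os.environ.get("PPM_ORDER", "4"))
PPM_LAMBDA_H    = float(os.environ.get("PPM_LAMBDA_H", "0.9"))
PPM_LAMBDA_L    = float(os.environ.get("PPM_LAMBDA_L", "0.05"))
PPM_THRESH      = float(os.environ.get("PPM_THRESH", "0.9"))

def maybe_mix(lp_chunks, tgt_chunks, rank):
    """Hypothetical hook: called from eval_val_sliding after the
    distributed all-reduce, once lp_chunks/tgt_chunks from every
    scored window have been gathered to rank 0."""
    if not PPM_MIX_ENABLED or rank != 0:
        return None
    # _ppm_mixture_bpb is the ~60-line function copied verbatim from
    # PR #1795; assumed to be in scope in train_gpt.py.
    return _ppm_mixture_bpb(lp_chunks, tgt_chunks,
                            O=PPM_ORDER, H=PPM_LAMBDA_H,
                            L=PPM_LAMBDA_L, T=PPM_THRESH)
```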

Compliance

  • Train under 600s — all 3 seeds stopped at 596s wallclock cap (steps 4814–4895)
  • Artifact under 16 MB — 15.90–15.91 MB natively (int6+brotli)
  • Eval under 600s — mean 562s; seeds 0/1234 at 533s/527s; seed 42 at 626s due to cold sentencepiece cache on first run
  • No SLOT, no pre-quant TTT, no ETLB (inherited from yahya010 base)
  • ⚠️ `no_ngram_cache: false` — byte-level online PPM-D with zero precomputed state shipped. Per-byte score-before-update: every counter update uses only already-scored bytes (toy sketch after this list). Inherits the organizer-ruling-pending status of @OE-GOD's PR #1795 on this predictor class.
  • Three seeds: t = (1.0810 − 0.99145) / (0.00078/√3) ≈ 199 vs the 1.0810 SOTA, clearing the 0.005-nat bar (p ≪ 1e-15)
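
To make the `no_ngram_cache` bullet concrete, a toy order-0 counter illustrating the score-before-update discipline; the real predictor is order-4 PPM-D with escape handling, and this class is purely illustrative.

```python
import math

class ToyByteCounter:
    """Order-0 stand-in for the order-4 PPM-D in #1795 (illustrative).
    Starts from zero precomputed state, so no n-gram-like statistics
    ship in the artifact."""
    def __init__(self):
        self.counts = [0] * 256
        self.total = 0

    def score_then_update(self, b: int) -> float:
        # Score first, using only bytes that were already scored...
        p = (self.counts[b] + 1) / (self.total + 256)  # Laplace smoothing
        # ...then fold this byte into the counters for later positions.
        self.counts[b] += 1
        self.total += 1
        return math.log(p)
```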

Scope

Adds only `records/track_10min_16mb/2026-04-29_PPM_SP8192_yahya_base/`. No changes outside.

Credits

@yahya010 (#1727, NN base), @OE-GOD (#1795, PPM mixer), @bigbag (#1493), and @clarkkev (#1394) for the upstream components.

Test plan

  • submission.json validates
  • train_gpt.py runs end-to-end and reports `[ppm_mix]` + `final_int6_sliding_window` lines for each seed
  • 3 seeds land mix BPB in [0.9910, 0.9924], std 0.00078
  • all 3 artifacts under 16 MB natively
  • all 3 train times under 600s wallclock cap
  • mean eval 562s under 600s
  • NN-only token-BPB matches @yahya010's 1.07217 within seed noise

If PPM-as-TTT is ruled invalid, this submission falls back to the inherited NN-only score (1.066 byte-BPB / 1.076 token-BPB, matching yahya010), which is still a valid record vs the current main SOTA 1.0810.

Commit message (expanded)

Record: SP8192 + yahya010 NN base + byte-PPM mixer — val_bpb 0.99145 (3-seed mean)

3-seed mean: 0.99145 (std 0.00078, full FineWeb val 152.6 MB)
Beats current main SOTA 1.0810 by -0.08955; OE-GOD's pending PR openai#1795 1.01252 by -0.02107

Composition of two unmerged contributions:
- @yahya010 PR openai#1727 NN base (1.07217, MP-SGD TTT + QK-Gain 5.25)
- @OE-GOD PR openai#1795 byte-level PPM-D mixer (strict-legal outcome-independent gate)

Source diff vs PR openai#1727: ~120 lines added in eval_val_sliding for PPM mixer.
Adds only records/track_10min_16mb/2026-04-29_PPM_SP8192_yahya_base/.

Compliance: train 596s (under 600s), artifact 15.9 MB (under 16 MB),
mean eval 562s (seeds 0/1234 at 533/527s under 600s; seed 42 cold-cache 626s).

Inherits OE-GOD openai#1795 organizer-ruling-pending status on byte-PPM as TTT.
@deborahnelson8788726 (Author)

Closing in light of the C2 discussion in Issue #1872 (raised by @sharpobject and acknowledged by @andrewbaggio1):

"If you score all token ids at a given token-wise position in the document, do the probabilities for all of these token ids given by the mix of the byte-wise PPM and the token-wise NN sum to 1? (hint: no)"

The byte-mix distribution does not normalize over the official token alphabet Σ, so the metric is not a valid −log p(realized_token | history) under III(C2). The NN-only fallback in this submission is just a re-run of @yahya010's PR #1727 (token-BPB ~1.076), which does not improve on the current main SOTA (~1.061), so this PR has nothing to offer once the byte-PPM piece is removed.
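
A toy demonstration of that point, with made-up numbers and a hypothetical three-token alphabet:

```python
import math

vocab  = ["a", "ab", "b"]                    # toy token alphabet Σ
nn_p   = {"a": 0.5, "ab": 0.3, "b": 0.2}     # NN probs: sum to 1 over Σ
byte_p = {"a": 0.6, "b": 0.4}                # PPM per-byte probs
# Byte-PPM prob of each token = product over its bytes; because tokens
# overlap as byte strings, this mass is double-counted over Σ.
ppm_p = {t: math.prod(byte_p[c] for c in t) for t in vocab}
lam = 0.9
mix = {t: lam * ppm_p[t] + (1 - lam) * nn_p[t] for t in vocab}
print(sum(mix.values()))  # 1.216, not 1.0 -> not a valid p(token | history)
```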

Withdrawing rather than asking maintainers to spend review cycles on a submission that the C2 ruling already addresses. Thanks to @yahya010, @OE-GOD, @bigbag, @clarkkev for the upstream components, and to @andrewbaggio1 / @sharpobject for the clean C2 framing.
