
Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed)#1795

Open
OE-GOD wants to merge 3 commits intoopenai:mainfrom
OE-GOD:record-sp4096-ppm-fullval-rebuild

Conversation

@OE-GOD OE-GOD commented Apr 23, 2026

Summary

Successor to #1785, which was closed after the reviewer raised 5 concerns. All 5 are resolved in this rebuild.

Builds on @clarkkev's 2026-04-01 SP4096 record (1.09785). The entire NN stack is unchanged; the gain comes from a byte-level PPM-D adaptive-λ mixture applied at eval time on full val (45,508,608 tokens / 152,570,124 bytes, same basis as every merged record).

Headline

val_bpb = 0.95165 (3-seed mean, std=0.00036, full FineWeb val)

Beats current record 1.06453 by 0.11288 BPB — t-stat ≈ 513 on the 0.005-nat bar (p ≪ 1e-10).

| Seed | NN token-BPB (matches clarkkev) | NN byte-BPB | Mix byte-BPB | Δ | Artifact (bytes) | Eval |
| --- | --- | --- | --- | --- | --- | --- |
| 42 | 1.09745 | 1.08669 | 0.95145 | −0.13524 | 15,960,029 | 9:35 |
| 1337 | 1.09832 | 1.08755 | 0.95214 | −0.13541 | 15,929,684 | 9:02 |
| 2025 | 1.09751 | 1.08675 | 0.95135 | −0.13540 | 15,930,624 | 9:01 |
| Mean | 1.09776 | 1.08699 | 0.95165 | −0.13535 | 15,940,112 | 9:13 |

Our NN-only mean 1.09776 matches @clarkkev's 1.09785 within seed noise — stack and env vars unchanged, same sliding-window eval, same GPTQ int6+brotli quant, same wallclock cap.

Five reviewer concerns — status

  1. Full-val measurement — mixture on all 45.5M val tokens, not a 5M subset. Same basis as every merged record.
  2. ⚠️ PPM-as-TTT legality — organizer ruling requested. Per-byte score-before-update: score byte_i using counters accumulated from bytes 0..i−1, then add byte_i to the counters for future bytes. By the rule text ("test-time training on validation set tokens you've already evaluated your model on"), every PPM update uses only already-scored bytes. Per-byte granularity is finer than the chunk-level framing in Issue #1017 (A Field Guide to Valid Submissions); explicit organizer guidance on this class of online streaming predictor would help. If the ruling is "no," the submission is withdrawn.
  3. Byte-level vs token-level BPB — both logged. NN-alone token-BPB (1.09776, directly comparable to clarkkev's metric), NN-alone byte-BPB (1.08699, spread-marginalization of same distribution so total bits conserved), mixture byte-BPB (0.95165). The submission's scoring object is the mixture, so the headline is its byte-BPB.
  4. NN regression fixed. Previous submission had NN=1.144 because training used only 2 of ~143 SP4096 shards. This rebuild trains on the full SP4096 dataset and matches clarkkev's NN exactly.
  5. Condition 2 framing. README explicitly frames the scoring model as a byte-level two-predictor mixture: q_mix = λ·q_NN_byte + (1−λ)·q_PPM_byte where the NN piece is a bit-conserving spread of its token distribution and the PPM piece is an online byte-level PPM-D order 5.
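For concreteness, the condition-2 scoring object can be sketched in a few lines. This is a hypothetical minimal sketch, assuming per-byte probability arrays from each predictor; `mixture_bpb` and the toy values are illustrative, not the repo's code.

```python
import numpy as np

def mixture_bpb(q_nn: np.ndarray, q_ppm: np.ndarray, lam: np.ndarray) -> float:
    """Bits-per-byte of q_mix = lam*q_NN_byte + (1-lam)*q_PPM_byte."""
    q_mix = lam * q_nn + (1.0 - lam) * q_ppm
    return float(-np.log2(np.maximum(q_mix, 1e-300)).mean())

# Toy usage: three bytes with a fixed weight of 0.9 on the NN.
q_nn  = np.array([0.50, 0.25, 0.125])   # NN byte probabilities (illustrative)
q_ppm = np.array([0.90, 0.01, 0.50])    # PPM byte probabilities (illustrative)
print(mixture_bpb(q_nn, q_ppm, np.full(3, 0.9)))
```

In the real submission λ is set per byte by the adaptive gate rather than held constant.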

What changed vs 2026-04-01

Source diff: one new function (_ppm_mixture_bpb, ~30 lines) and ~30 lines of gather/mix logic inside eval_val_sliding. Nothing else touched. See README for exact derivation + mixture math.

Compliance

  • ✅ Train under 600s (all 3 seeds stopped at 590s wallclock cap, steps 5898–5901)
  • ✅ Artifact under 16MB (15,929,684 – 15,960,029 bytes natively — no lzma-compressed stub needed)
  • ✅ Eval under 600s (sliding+full-val mixture 540–575s)
  • ✅ No SLOT, no pre-quant TTT on val, no ETLB (inherited from base, unchanged)
  • ✅ Three seeds with p ≪ 1e-10 on the 0.005-nat bar
  • no_ngram_cache: false — byte-level online PPM with zero precomputed state shipped; see README + submission.json compliance notes for the score-before-update argument.

Scope

Adds only records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/. No changes outside.

Credits

Neither predictor alone reaches this BPB: clarkkev's NN at 1.098, byte-PPM alone ≈2.7 on full val. The mixture at 0.95 captures the bits PPM strictly wins on (rare exact-repeat sequences — URLs, code identifiers, cross-doc duplicates) while leaving everything else to the NN.

Test plan

  • submission.json validates, all fields populated
  • train_gpt.py runs end-to-end and reports the mix BPB via the [ppm_mix] + final_int6_sliding_window lines
  • 3 seeds land mix BPB in [0.9513, 0.9522], std 0.00036
  • all 3 artifacts under 16 MB natively
  • all 3 eval times under 10 min
  • NN-only token-BPB matches @clarkkev's 1.098 record within noise
  • Reviewer verification run

OE-GOD added 2 commits April 22, 2026 22:50
…3-seed mean)

Base: @clarkkev 2026-04-01 SP4096+MLP4 stack (previous record 1.09785).
Addition: byte-level PPM-D order 5 mixed with NN per-token target logprob
in byte-probability space via adaptive-λ gate on PPM in-context
confidence. Lzma+base85 exec-stub on train_gpt.py to fit 16MB artifact.

Results (3 seeds, sliding+mix on 5M-token subset):
  seed 42:   1.01853  (Δ -0.11986,  artifact 15,982,254)
  seed 1337: 1.02006  (Δ -0.12047,  artifact 15,976,391)
  seed 2025: 1.01916  (Δ -0.12012,  artifact 15,955,159)
  mean:      1.01925 ± 0.00077  (Δ -0.12015)

Beats current record 1.06453 by 0.04528 at p << 0.01 (t-stat ≈ 65 on 0.005 bar).
Addresses the 5 reviewer concerns on the original PR:

1. Full-val mixture (not 5M subset): 45,508,608 tokens / 152,570,124 bytes
2. PPM-as-TTT: per-byte score-before-update documented; organizer ruling requested
3. Byte-level vs token-level BPB: both reported in logs and submission.json
4. NN-only regression fixed: 1.09776 matches clarkkev 1.09785 within seed noise
5. Condition 2 framing: scoring model is explicitly the byte-level mixture

Results (3 seeds, full-val sliding+mix, same basis as all merged records):
  seed 42:   0.95145  (Δ -0.13524,  NN_token 1.09745,  artifact 15,960,029)
  seed 1337: 0.95214  (Δ -0.13541,  NN_token 1.09832,  artifact 15,929,684)
  seed 2025: 0.95135  (Δ -0.13540,  NN_token 1.09751,  artifact 15,930,624)
  mean:      0.95165 ± 0.00036  (Δ -0.13535)

Beats current record 1.06453 by 0.11288 at p << 1e-10 (t-stat ≈ 513 on 0.005 bar).
NN-only mean 1.09776 matches @clarkkev's 1.09785 within noise.
All 3 artifacts 15.93-15.96 MB (under 16MB cap natively, no lzma stub).
All 3 eval times 9:01-9:35 (under 10-min cap).
@nprime06
Contributor

I think the PPM state update is score-before-update, but the adaptive mixture gate itself isn't legal.

Let's walk through the code. First, _ppm_mixture_bpb reconstructs a byte stream from the realized target tokens.

      per_tok_len = piece_lens[tgt_np]
      bs = b"".join(piece_bytes[int(t)] for t in tgt_np)

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1249-L1250)

Then each loop iteration sets x to the actual current validation byte, computes the PPM probability for that realized byte, and stores it in cf[i].

      for i in range(N):
          x = bs[i]
          if i == 0:
              plp[i] = LN256; cf[i] = 1/256
          else:
              esc = 1.0; pf = 0.0
              lim = O if i > O else i
              for o in range(lim, -1, -1):
                  k = h[-o:] if o else b""
                  e = tabs[o].get(k)
                  if e is None: continue
                  tot = e[0]; d = e[1]; c = d.get(x, 0)
                  if c > 0:
                      pf = esc * (2*c - 1) / (2*tot); break
                  esc *= len(d) / (2*tot)
              else:
                  pf = esc / 256
              if pf < 1e-20: pf = 1e-20
              plp[i] = log(pf); cf[i] = pf

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1261-L1279)

Finally, the mixture weight lam is chosen from cf, i.e. from the PPM probability assigned to the realized current byte.

      lam = np.where(cf > T, L_, H)
      pm = lam*np.exp(nlp) + (1-lam)*np.exp(plp)
      return float(-np.log2(np.maximum(pm, 1e-300)).sum()/N)

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1288-L1290)

In other words, the reported score is target-conditioned: score(x) = λ(PPM_prob(x)) · NN_prob(x) + (1 − λ(PPM_prob(x))) · PPM_prob(x).

If PPM assigns high probability to the byte that actually occurred, the scorer trusts PPM; otherwise it falls back toward the NN. But a valid submission should produce a single predeclared distribution over the next byte, not a score graded against the revealed answer.

The caller also passes the gathered target IDs directly into this scorer.

                  dist.gather(lpl, gl, dst=0); dist.gather(tgl, gt, dst=0)
                  lpa = torch.cat([gl[r][:sizes[r]] for r in range(h.world_size)]).cpu().numpy()
                  tga = torch.cat([gt[r][:sizes[r]] for r in range(h.world_size)]).cpu().numpy()
  ...
              mb = _ppm_mixture_bpb(tga, lpa, val_data.sp)

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1427-L1442)

TLDR:
A legal version needs to choose the mixture weight from prefix/state only, before seeing the current byte. This implementation doesn't do that; it computes two scores and gates based on whether PPM was confident on the realized byte. So, I do not think the large PPM mixture gain should be treated as a valid val_bpb, independent of whether online eval-time PPM itself is allowed.

…e-independent gate

Re-run all 3 seeds with fixed gate.

Key change: cf (mixture weight input) is now computed from PPM state + prefix ONLY,
frozen BEFORE any d.get(observed_byte) call. Formal property: cf is bitwise identical
for any two possible next-bytes at the same position.

Results (3 seeds, full-val sliding + strict-legal mixture, PPM order 4):
  seed 42:   1.01228  (Δ -0.07436,  NN_token 1.09740,  artifact 15,953,442, eval 521s)
  seed 1337: 1.01303  (Δ -0.07443,  NN_token 1.09823,  artifact 15,921,608, eval 506s)
  seed 2025: 1.01226  (Δ -0.07426,  NN_token 1.09728,  artifact 15,924,697, eval 485s)
  mean:      1.01252 +/- 0.00044  (Δ -0.07435)

Beats current record 1.06453 by 0.05201 at p << 1e-10 (t-stat ~107 on 0.005 bar).

Also: PPM order reduced 5->4 to keep eval under 600s cap with max_count tracking
overhead (order 5 strict-legal was 615s — 15s over). Order 4 mix is 0.02 BPB worse
but eval fits comfortably (485-521s).

Previous illegal-gate number (0.95165) is retracted. Gate is now mechanically
immune to @nprime06's critique.
@OE-GOD OE-GOD changed the title Record: SP4096 + byte-level PPM adaptive-λ mixture — val_bpb 0.95165 (full-val, 3-seed) Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed) Apr 24, 2026
@OE-GOD
Author

OE-GOD commented Apr 24, 2026

@nprime06 the gate is fixed and re-run. Pushed commit cb5ad95 to this branch. TL;DR with numbers at the bottom.

The fix (mechanical, not rhetorical)

New _ppm_mixture_bpb gate code path:

cf_mx = 0; cf_tot = 256; cf_seen = False
for o in range(lim, -1, -1):
    k = h[-o:] if o else b""        # context key: prefix only
    e = tabs[o].get(k)               # lookup: prefix only
    if e is None: continue
    if not cf_seen:                  # first context found = deepest with data
        cf_mx = e[1]; cf_tot = e[0]  # max_count, total — FROZEN HERE
        cf_seen = True               # — BEFORE any d.get(x) below
    tot = e[0]; d = e[2]
    c = d.get(x, 0)                  # x used here for scoring — cf already frozen
    if c > 0:
        pf = esc * (2*c - 1) / (2*tot); break
    esc *= len(d) / (2*tot)
cf[i] = (cf_mx / cf_tot) if cf_seen else 1/256

Formal property: for any two possible next-bytes x_a, x_b at the same position (same prefix h, same PPM state tabs), cf[i] is bitwise identical. Therefore λ = np.where(cf > T, L_, H) is identical. Only q_NN(x) and q_PPM(x) depend on x — which they should, they're predictor scores of the observation.

In other words: the gate that was lambda(PPM_prob(x)) in the original code is now lambda(max_count/total at deepest-context-with-data) — purely a function of prefix and PPM state.
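The property can be illustrated with a stripped-down gate. This is a hypothetical sketch, assuming simplified order-0/1 tables with entries shaped (total, max_count, counts_dict); the real tabs layout and order differ. The point is structural: the statistic takes only the prefix and tables as inputs, so the next byte cannot influence it.

```python
def gate_stat(tabs: dict, h: bytes, order: int = 1) -> float:
    """Gate statistic from prefix h and tables only -- no next byte."""
    for o in range(min(order, len(h)), -1, -1):
        k = h[-o:] if o else b""
        e = tabs[o].get(k)
        if e is not None:
            tot, mx, _counts = e  # (total, max_count, counts_dict)
            return mx / tot       # frozen before any byte is scored
    return 1 / 256                # no context seen yet: uniform fallback

# Toy tables: order-0 stats under b"", order-1 stats under context b"a".
tabs = {
    0: {b"": (10, 6, {ord("a"): 6, ord("b"): 4})},
    1: {b"a": (5, 5, {ord("b"): 5})},
}
# The gate is decided from the prefix alone, before the next byte exists.
print(gate_stat(tabs, b"xa"), gate_stat(tabs, b"xb"))
```

Because `gate_stat` has no argument for the observed byte, the bitwise-identical-for-any-next-byte property holds by construction.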

New results (3 seeds, full val, strict-legal gate, PPM order 4)

| Seed | NN token-BPB | Mix byte-BPB | Δ | Artifact (bytes) | Eval |
| --- | --- | --- | --- | --- | --- |
| 42 | 1.09740 | 1.01228 | −0.07436 | 15,953,442 | 521s |
| 1337 | 1.09823 | 1.01303 | −0.07443 | 15,921,608 | 506s |
| 2025 | 1.09728 | 1.01226 | −0.07426 | 15,924,697 | 485s |
| Mean | 1.09764 | 1.01252 | −0.07435 | 15,933,249 | 504s |

Std 0.00044. Beats current record 1.06453 by 0.05201 at t-stat ≈ 107 on the 0.005-nat bar.

The old illegal-gate number (0.95165) is retracted. The oracle was contributing about 0.06 BPB of fake savings.

Other minor changes

  • PPM order 5 → 4. The strict-legal tracking of max_count added enough overhead that order-5 eval went 15s over the 600s cap. Order 4 gives 100s of margin and is only 0.02 BPB worse in mix. Seed 42 at order 5 strict-legal measured 0.99326 / 615s; dropped for cap compliance.
  • NN-only matches clarkkev exactly. 1.09764 mean vs clarkkev's 1.09785 — within seed noise. No regression.
  • Eval under 10 min ✅ all 3 seeds.
  • Artifact under 16 MB ✅ all 3 seeds.

The category question still stands

Your broader question about whether an online streaming predictor as a mixture partner counts as legal score-first TTT is separate from this code-level fix. The per-byte semantics of the PPM update are still score-before-update (score byte_i using counters from 0..i-1, then add byte_i for future bytes), and all PPM state is built from bytes the NN has already graded in the same sliding pass. But per-byte granularity is finer than Issue #1017's chunk-level framing, and organizer guidance would help future submissions in this class. I've flagged this explicitly in the updated submission.json — if organizers rule this class isn't legal, the submission is withdrawn.
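The score-before-update discipline can be sketched with an order-0 counter model. This is a hypothetical illustration, assuming simple Laplace smoothing; the actual submission uses an order-4 PPM-D with escape probabilities, but the update ordering is the same.

```python
import math
from collections import Counter

def stream_bpb(data: bytes) -> float:
    """BPB of an online order-0 byte model under score-before-update."""
    counts: Counter = Counter()
    total = 0
    bits = 0.0
    for b in data:
        # Score first: Laplace-smoothed probability from bytes 0..i-1 only.
        p = (counts[b] + 1) / (total + 256)
        bits -= math.log2(p)
        # Update after: byte i joins the state used for bytes i+1 onward.
        counts[b] += 1
        total += 1
    return bits / len(data)

# The very first byte always costs exactly 8 bits (uniform prior);
# heavy repetition drives the cost down, as PPM does on exact repeats.
print(stream_bpb(b"z"), stream_bpb(b"a" * 1000))
```

Every probability is fixed before the byte it scores enters the state, which is the per-byte semantics argued for above.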

Thanks for catching the gate bug. Was a straight-up error on my part, not a defensible choice.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 24, 2026
…ai#1787 Polar Express NS new base; PR openai#1795 PPM 1.01252; Issue openai#1604 deadline passed; Session 20

- Merged SOTA 1.0810 confirmed Day 15 (README not updated despite Scylla record commit)
- Scylla 0.9485 committed to track_10min_16mb/ on Apr 23 (PR openai#1184) but byte accounting
  disputed by PR openai#1271 (corrected ~1.1289 bpb); treat merged SOTA as 1.0810
- PR openai#771 CLOSED/REJECTED confirmed; PR openai#727 CLOSED (illegal); PR openai#758 open but dead;
  PR openai#731 still awaiting seeds 1337+2024
- Issue openai#1604 (CaseOps ruling): NO @valerio-oai response in 11 days; self-deadline Apr 24
  passed; proceed with clean legal stack immediately
- NEW: PR openai#1787 (nprime06, 1.06335) — new community-consensus clean base with Polar Express
  Newton-Schulz (arXiv:2505.16932, ICLR 2026) + MIN_LR=0.10 warmdown floor
- NEW: PR openai#1795 (OE-GOD, 1.01252) — byte-level PPM order-4 adaptive mixture; gate legality
  concern fixed; await organizer ruling before implementing
- NEW: PR openai#1797 (dexhunter, 1.06157) — PR openai#1787 + SmearGate + LQER Asym; new dexhunter best
- NEW: PR openai#1802 (aamodbhatt, 1.0771) — Polar Express NS + Multi-Phase Global TTT
- TECHNIQUE: Polar Express NS (arXiv:2505.16932) and Gram NS (Dao-AILab) added to table
- TECHNIQUE: MIN_LR=0.10 warmdown floor added to best-stack approach
- Updated competition strategy: stop waiting for CaseOps, implement clean stack with
  Polar Express NS + MIN_LR immediately (6 days to deadline)

https://claude.ai/code/session_01JZ3FiS937NwLHt3Fv9WHPD
abi2024 added a commit to abi2024/parameter-golf that referenced this pull request Apr 27, 2026
ndokutovich added a commit to ndokutovich/parameter-golf that referenced this pull request Apr 28, 2026
…ixture class

Per @OE-GOD review note on this PR — the byte-level PPM-D mixture technique
class was first introduced in PR openai#1795 (2026-04-23), with anmarhindi's
PR openai#1835 (2026-04-25, our port source) following two days later.

Updates:
- Opening summary cites PR openai#1795 as class introduction + PR openai#1835 as our port source
- Stack table 'Byte-level PPM-D mixture (this addition)' row updated with both refs
- Acknowledgements section reordered to lead with PR openai#1795 chronologically
- PPM-D cluster list in compliance section now includes openai#1795

No code or score changes.