
Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed)#1795

Open
OE-GOD wants to merge 3 commits intoopenai:mainfrom
OE-GOD:record-sp4096-ppm-fullval-rebuild

Conversation

@OE-GOD OE-GOD commented Apr 23, 2026

Summary

Successor to #1785, which was closed after the reviewer raised 5 concerns. All 5 are resolved in this rebuild.

Builds on @clarkkev's 2026-04-01 SP4096 record (1.09785). The entire NN stack is unchanged; the gain comes from a byte-level PPM-D adaptive-λ mixture applied at eval time on full val (45,508,608 tokens / 152,570,124 bytes, same basis as every merged record).

Headline

val_bpb = 0.95165 (3-seed mean, std=0.00036, full FineWeb val)

Beats current record 1.06453 by 0.11288 BPB — t-stat ≈ 513 on the 0.005-nat bar (p ≪ 1e-10).

| Seed | NN token-BPB (matches clarkkev) | NN byte-BPB | Mix byte-BPB | Δ | Artifact (bytes) | Eval |
| --- | --- | --- | --- | --- | --- | --- |
| 42 | 1.09745 | 1.08669 | 0.95145 | −0.13524 | 15,960,029 | 9:35 |
| 1337 | 1.09832 | 1.08755 | 0.95214 | −0.13541 | 15,929,684 | 9:02 |
| 2025 | 1.09751 | 1.08675 | 0.95135 | −0.13540 | 15,930,624 | 9:01 |
| Mean | 1.09776 | 1.08699 | 0.95165 | −0.13535 | 15,940,112 | 9:13 |

Our NN-only mean 1.09776 matches @clarkkev's 1.09785 within seed noise — stack and env vars unchanged, same sliding-window eval, same GPTQ int6+brotli quant, same wallclock cap.

Five reviewer concerns — status

  1. Full-val measurement — mixture on all 45.5M val tokens, not a 5M subset. Same basis as every merged record.
  2. ⚠️ PPM-as-TTT legality — organizer ruling requested. Per-byte score-before-update: score byte_i using counters accumulated from bytes 0..i−1, then add byte_i to the counters for future bytes. By the rule text ("test-time training on validation set tokens you've already evaluated your model on"), every PPM update uses only already-scored bytes. Per-byte granularity is finer than the chunk-level framing in Issue #1017 (A Field Guide to Valid Submissions); explicit organizer guidance on this class of online streaming predictor would help. If the ruling is "no," the submission is withdrawn.
  3. Byte-level vs token-level BPB — both logged. NN-alone token-BPB (1.09776, directly comparable to clarkkev's metric), NN-alone byte-BPB (1.08699, spread-marginalization of same distribution so total bits conserved), mixture byte-BPB (0.95165). The submission's scoring object is the mixture, so the headline is its byte-BPB.
  4. NN regression fixed. Previous submission had NN=1.144 because training used only 2 of ~143 SP4096 shards. This rebuild trains on the full SP4096 dataset and matches clarkkev's NN exactly.
  5. Condition 2 framing. README explicitly frames the scoring model as a byte-level two-predictor mixture: q_mix = λ·q_NN_byte + (1−λ)·q_PPM_byte where the NN piece is a bit-conserving spread of its token distribution and the PPM piece is an online byte-level PPM-D order 5.
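For concreteness, the condition-2 scoring object can be sketched in a few lines. This is a hypothetical minimal sketch, assuming per-byte probability arrays from each predictor; `mixture_bpb` and the toy values are illustrative, not the repo's code.

```python
import numpy as np

def mixture_bpb(q_nn: np.ndarray, q_ppm: np.ndarray, lam: np.ndarray) -> float:
    """Bits-per-byte of q_mix = lam*q_NN_byte + (1-lam)*q_PPM_byte."""
    q_mix = lam * q_nn + (1.0 - lam) * q_ppm
    return float(-np.log2(np.maximum(q_mix, 1e-300)).mean())

# Toy usage: three bytes with a fixed weight of 0.9 on the NN.
q_nn  = np.array([0.50, 0.25, 0.125])   # NN byte probabilities (illustrative)
q_ppm = np.array([0.90, 0.01, 0.50])    # PPM byte probabilities (illustrative)
print(mixture_bpb(q_nn, q_ppm, np.full(3, 0.9)))
```

In the real submission λ is set per byte by the adaptive gate rather than held constant.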

What changed vs 2026-04-01

Source diff: one new function (_ppm_mixture_bpb, ~30 lines) and ~30 lines of gather/mix logic inside eval_val_sliding. Nothing else touched. See README for exact derivation + mixture math.

Compliance

  • ✅ Train under 600s (all 3 seeds stopped at 590s wallclock cap, steps 5898–5901)
  • ✅ Artifact under 16MB (15,929,684 – 15,960,029 bytes natively — no lzma-compressed stub needed)
  • ✅ Eval under 600s (sliding+full-val mixture 540–575s)
  • ✅ No SLOT, no pre-quant TTT on val, no ETLB (inherited from base, unchanged)
  • ✅ Three seeds with p ≪ 1e-10 on the 0.005-nat bar
  • no_ngram_cache: false — byte-level online PPM with zero precomputed state shipped; see README + submission.json compliance notes for the score-before-update argument.

Scope

Adds only records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/. No changes outside.

Credits

Neither predictor alone reaches this BPB: clarkkev's NN at 1.098, byte-PPM alone ≈2.7 on full val. The mixture at 0.95 captures the bits PPM strictly wins on (rare exact-repeat sequences — URLs, code identifiers, cross-doc duplicates) while leaving everything else to the NN.

Test plan

  • submission.json validates, all fields populated
  • train_gpt.py runs end-to-end and reports the mix BPB via the [ppm_mix] + final_int6_sliding_window lines
  • 3 seeds land mix BPB in [0.9513, 0.9522], std 0.00036
  • all 3 artifacts under 16 MB natively
  • all 3 eval times under 10 min
  • NN-only token-BPB matches @clarkkev's 1.098 record within noise
  • Reviewer verification run

OE-GOD added 2 commits April 22, 2026 22:50
…3-seed mean)

Base: @clarkkev 2026-04-01 SP4096+MLP4 stack (previous record 1.09785).
Addition: byte-level PPM-D order 5 mixed with NN per-token target logprob
in byte-probability space via adaptive-λ gate on PPM in-context
confidence. Lzma+base85 exec-stub on train_gpt.py to fit 16MB artifact.

Results (3 seeds, sliding+mix on 5M-token subset):
  seed 42:   1.01853  (Δ -0.11986,  artifact 15,982,254)
  seed 1337: 1.02006  (Δ -0.12047,  artifact 15,976,391)
  seed 2025: 1.01916  (Δ -0.12012,  artifact 15,955,159)
  mean:      1.01925 ± 0.00077  (Δ -0.12015)

Beats current record 1.06453 by 0.04528 at p << 0.01 (t-stat ≈ 65 on 0.005 bar).
Addresses the 5 reviewer concerns on the original PR:

1. Full-val mixture (not 5M subset): 45,508,608 tokens / 152,570,124 bytes
2. PPM-as-TTT: per-byte score-before-update documented; organizer ruling requested
3. Byte-level vs token-level BPB: both reported in logs and submission.json
4. NN-only regression fixed: 1.09776 matches clarkkev 1.09785 within seed noise
5. Condition 2 framing: scoring model is explicitly the byte-level mixture

Results (3 seeds, full-val sliding+mix, same basis as all merged records):
  seed 42:   0.95145  (Δ -0.13524,  NN_token 1.09745,  artifact 15,960,029)
  seed 1337: 0.95214  (Δ -0.13541,  NN_token 1.09832,  artifact 15,929,684)
  seed 2025: 0.95135  (Δ -0.13540,  NN_token 1.09751,  artifact 15,930,624)
  mean:      0.95165 ± 0.00036  (Δ -0.13535)

Beats current record 1.06453 by 0.11288 at p << 1e-10 (t-stat ≈ 513 on 0.005 bar).
NN-only mean 1.09776 matches @clarkkev's 1.09785 within noise.
All 3 artifacts 15.93-15.96 MB (under 16MB cap natively, no lzma stub).
All 3 eval times 9:01-9:35 (under 10-min cap).
@nprime06
Contributor

I think the PPM state update is score-before-update, but the adaptive mixture gate itself isn't legal.

Let's walk through the code. First, _ppm_mixture_bpb reconstructs a byte stream from the realized target tokens.

      per_tok_len = piece_lens[tgt_np]
      bs = b"".join(piece_bytes[int(t)] for t in tgt_np)

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1249-L1250)

Then each loop iteration sets x to the actual current validation byte, computes the PPM probability for that realized byte, and stores it in cf[i].

      for i in range(N):
          x = bs[i]
          if i == 0:
              plp[i] = LN256; cf[i] = 1/256
          else:
              esc = 1.0; pf = 0.0
              lim = O if i > O else i
              for o in range(lim, -1, -1):
                  k = h[-o:] if o else b""
                  e = tabs[o].get(k)
                  if e is None: continue
                  tot = e[0]; d = e[1]; c = d.get(x, 0)
                  if c > 0:
                      pf = esc * (2*c - 1) / (2*tot); break
                  esc *= len(d) / (2*tot)
              else:
                  pf = esc / 256
              if pf < 1e-20: pf = 1e-20
              plp[i] = log(pf); cf[i] = pf

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1261-L1279)

Finally, the mixture weight lam is chosen from cf, i.e. from the PPM probability assigned to the realized current byte.

      lam = np.where(cf > T, L_, H)
      pm = lam*np.exp(nlp) + (1-lam)*np.exp(plp)
      return float(-np.log2(np.maximum(pm, 1e-300)).sum()/N)

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1288-L1290)

In other words, the reported score is target-conditioned: score(x) = λ(PPM_prob(x)) · NN_prob(x) + (1 − λ(PPM_prob(x))) · PPM_prob(x).

If PPM assigns high probability to the byte that actually occurred, the scorer trusts PPM; otherwise it falls back toward the NN. But a valid submission should produce a single predeclared distribution over the next byte, not a score graded against the revealed answer.

The caller also passes the gathered target IDs directly into this scorer.

                  dist.gather(lpl, gl, dst=0); dist.gather(tgl, gt, dst=0)
                  lpa = torch.cat([gl[r][:sizes[r]] for r in range(h.world_size)]).cpu().numpy()
                  tga = torch.cat([gt[r][:sizes[r]] for r in range(h.world_size)]).cpu().numpy()
  ...
              mb = _ppm_mixture_bpb(tga, lpa, val_data.sp)

(from records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/train_gpt.py#L1427-L1442)

TLDR:
A legal version needs to choose the mixture weight from prefix/state only, before seeing the current byte. This implementation doesn't do that; it computes two scores and gates based on whether PPM was confident on the realized byte. So, I do not think the large PPM mixture gain should be treated as a valid val_bpb, independent of whether online eval-time PPM itself is allowed.

…e-independent gate

Re-run all 3 seeds with fixed gate.

Key change: cf (mixture weight input) is now computed from PPM state + prefix ONLY,
frozen BEFORE any d.get(observed_byte) call. Formal property: cf is bitwise identical
for any two possible next-bytes at the same position.

Results (3 seeds, full-val sliding + strict-legal mixture, PPM order 4):
  seed 42:   1.01228  (Δ -0.07436,  NN_token 1.09740,  artifact 15,953,442, eval 521s)
  seed 1337: 1.01303  (Δ -0.07443,  NN_token 1.09823,  artifact 15,921,608, eval 506s)
  seed 2025: 1.01226  (Δ -0.07426,  NN_token 1.09728,  artifact 15,924,697, eval 485s)
  mean:      1.01252 +/- 0.00044  (Δ -0.07435)

Beats current record 1.06453 by 0.05201 at p << 1e-10 (t-stat ~107 on 0.005 bar).

Also: PPM order reduced 5->4 to keep eval under 600s cap with max_count tracking
overhead (order 5 strict-legal was 615s — 15s over). Order 4 mix is 0.02 BPB worse
but eval fits comfortably (485-521s).

Previous illegal-gate number (0.95165) is retracted. Gate is now mechanically
immune to @nprime06's critique.
@OE-GOD OE-GOD changed the title Record: SP4096 + byte-level PPM adaptive-λ mixture — val_bpb 0.95165 (full-val, 3-seed) Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed) Apr 24, 2026
@OE-GOD
Author

OE-GOD commented Apr 24, 2026

@nprime06 the gate is fixed and re-run. Pushed commit cb5ad95 to this branch. TL;DR with numbers at the bottom.

The fix (mechanical, not rhetorical)

New _ppm_mixture_bpb gate code path:

cf_mx = 0; cf_tot = 256; cf_seen = False
for o in range(lim, -1, -1):
    k = h[-o:] if o else b""        # context key: prefix only
    e = tabs[o].get(k)               # lookup: prefix only
    if e is None: continue
    if not cf_seen:                  # first context found = deepest with data
        cf_mx = e[1]; cf_tot = e[0]  # max_count, total — FROZEN HERE
        cf_seen = True               # — BEFORE any d.get(x) below
    tot = e[0]; d = e[2]
    c = d.get(x, 0)                  # x used here for scoring — cf already frozen
    if c > 0:
        pf = esc * (2*c - 1) / (2*tot); break
    esc *= len(d) / (2*tot)
cf[i] = (cf_mx / cf_tot) if cf_seen else 1/256

Formal property: for any two possible next-bytes x_a, x_b at the same position (same prefix h, same PPM state tabs), cf[i] is bitwise identical. Therefore λ = np.where(cf > T, L_, H) is identical. Only q_NN(x) and q_PPM(x) depend on x — which they should, they're predictor scores of the observation.

In other words: the gate that was lambda(PPM_prob(x)) in the original code is now lambda(max_count/total at deepest-context-with-data) — purely a function of prefix and PPM state.
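The property can be illustrated with a stripped-down gate. This is a hypothetical sketch, assuming simplified order-0/1 tables with entries shaped (total, max_count, counts_dict); the real tabs layout and order differ. The point is structural: the statistic takes only the prefix and tables as inputs, so the next byte cannot influence it.

```python
def gate_stat(tabs: dict, h: bytes, order: int = 1) -> float:
    """Gate statistic from prefix h and tables only -- no next byte."""
    for o in range(min(order, len(h)), -1, -1):
        k = h[-o:] if o else b""
        e = tabs[o].get(k)
        if e is not None:
            tot, mx, _counts = e  # (total, max_count, counts_dict)
            return mx / tot       # frozen before any byte is scored
    return 1 / 256                # no context seen yet: uniform fallback

# Toy tables: order-0 stats under b"", order-1 stats under context b"a".
tabs = {
    0: {b"": (10, 6, {ord("a"): 6, ord("b"): 4})},
    1: {b"a": (5, 5, {ord("b"): 5})},
}
# The gate is decided from the prefix alone, before the next byte exists.
print(gate_stat(tabs, b"xa"), gate_stat(tabs, b"xb"))
```

Because `gate_stat` has no argument for the observed byte, the bitwise-identical-for-any-next-byte property holds by construction.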

New results (3 seeds, full val, strict-legal gate, PPM order 4)

| Seed | NN token-BPB | Mix byte-BPB | Δ | Artifact (bytes) | Eval |
| --- | --- | --- | --- | --- | --- |
| 42 | 1.09740 | 1.01228 | −0.07436 | 15,953,442 | 521s |
| 1337 | 1.09823 | 1.01303 | −0.07443 | 15,921,608 | 506s |
| 2025 | 1.09728 | 1.01226 | −0.07426 | 15,924,697 | 485s |
| Mean | 1.09764 | 1.01252 | −0.07435 | 15,933,249 | 504s |

Std 0.00044. Beats current record 1.06453 by 0.05201 at t-stat ≈ 107 on the 0.005-nat bar.

The old illegal-gate number (0.95165) is retracted. The oracle was contributing about 0.06 BPB of fake savings.

Other minor changes

  • PPM order 5 → 4. The strict-legal tracking of max_count added enough overhead that order-5 eval went 15s over the 600s cap. Order 4 gives 100s of margin and is only 0.02 BPB worse in mix. Seed 42 at order 5 strict-legal measured 0.99326 / 615s; dropped for cap compliance.
  • NN-only matches clarkkev exactly. 1.09764 mean vs clarkkev's 1.09785 — within seed noise. No regression.
  • Eval under 10 min ✅ all 3 seeds.
  • Artifact under 16 MB ✅ all 3 seeds.

The category question still stands

Your broader question about whether an online streaming predictor as a mixture partner counts as legal score-first TTT is separate from this code-level fix. The per-byte semantics of the PPM update are still score-before-update (score byte_i using counters from 0..i-1, then add byte_i for future bytes), and all PPM state is built from bytes the NN has already graded in the same sliding pass. But per-byte granularity is finer than Issue #1017's chunk-level framing, and organizer guidance would help future submissions in this class. I've flagged this explicitly in the updated submission.json — if organizers rule this class isn't legal, the submission is withdrawn.
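The score-before-update discipline can be sketched with an order-0 counter model. This is a hypothetical illustration, assuming simple Laplace smoothing; the actual submission uses an order-4 PPM-D with escape probabilities, but the update ordering is the same.

```python
import math
from collections import Counter

def stream_bpb(data: bytes) -> float:
    """BPB of an online order-0 byte model under score-before-update."""
    counts: Counter = Counter()
    total = 0
    bits = 0.0
    for b in data:
        # Score first: Laplace-smoothed probability from bytes 0..i-1 only.
        p = (counts[b] + 1) / (total + 256)
        bits -= math.log2(p)
        # Update after: byte i joins the state used for bytes i+1 onward.
        counts[b] += 1
        total += 1
    return bits / len(data)

# The very first byte always costs exactly 8 bits (uniform prior);
# heavy repetition drives the cost down, as PPM does on exact repeats.
print(stream_bpb(b"z"), stream_bpb(b"a" * 1000))
```

Every probability is fixed before the byte it scores enters the state, which is the per-byte semantics argued for above.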

Thanks for catching the gate bug. Was a straight-up error on my part, not a defensible choice.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 24, 2026
…ai#1787 Polar Express NS new base; PR openai#1795 PPM 1.01252; Issue openai#1604 deadline passed; Session 20

- Merged SOTA 1.0810 confirmed Day 15 (README not updated despite Scylla record commit)
- Scylla 0.9485 committed to track_10min_16mb/ on Apr 23 (PR openai#1184) but byte accounting
  disputed by PR openai#1271 (corrected ~1.1289 bpb); treat merged SOTA as 1.0810
- PR openai#771 CLOSED/REJECTED confirmed; PR openai#727 CLOSED (illegal); PR openai#758 open but dead;
  PR openai#731 still awaiting seeds 1337+2024
- Issue openai#1604 (CaseOps ruling): NO @valerio-oai response in 11 days; self-deadline Apr 24
  passed; proceed with clean legal stack immediately
- NEW: PR openai#1787 (nprime06, 1.06335) — new community-consensus clean base with Polar Express
  Newton-Schulz (arXiv:2505.16932, ICLR 2026) + MIN_LR=0.10 warmdown floor
- NEW: PR openai#1795 (OE-GOD, 1.01252) — byte-level PPM order-4 adaptive mixture; gate legality
  concern fixed; await organizer ruling before implementing
- NEW: PR openai#1797 (dexhunter, 1.06157) — PR openai#1787 + SmearGate + LQER Asym; new dexhunter best
- NEW: PR openai#1802 (aamodbhatt, 1.0771) — Polar Express NS + Multi-Phase Global TTT
- TECHNIQUE: Polar Express NS (arXiv:2505.16932) and Gram NS (Dao-AILab) added to table
- TECHNIQUE: MIN_LR=0.10 warmdown floor added to best-stack approach
- Updated competition strategy: stop waiting for CaseOps, implement clean stack with
  Polar Express NS + MIN_LR immediately (6 days to deadline)

https://claude.ai/code/session_01JZ3FiS937NwLHt3Fv9WHPD
abi2024 added a commit to abi2024/parameter-golf that referenced this pull request Apr 27, 2026
ndokutovich added a commit to ndokutovich/parameter-golf that referenced this pull request Apr 28, 2026
…ixture class

Per @OE-GOD review note on this PR — the byte-level PPM-D mixture technique
class was first introduced in PR openai#1795 (2026-04-23), with anmarhindi's
PR openai#1835 (2026-04-25, our port source) following two days later.

Updates:
- Opening summary cites PR openai#1795 as class introduction + PR openai#1835 as our port source
- Stack table 'Byte-level PPM-D mixture (this addition)' row updated with both refs
- Acknowledgements section reordered to lead with PR openai#1795 chronologically
- PPM-D cluster list in compliance section now includes openai#1795

No code or score changes.