Record: SP4096 + byte-level PPM adaptive-λ mixture (strict-legal gate) — val_bpb 1.01252 (3-seed)#1795
OE-GOD wants to merge 3 commits into openai:main
…3-seed mean)

Base: @clarkkev 2026-04-01 SP4096+MLP4 stack (previous record 1.09785).

Addition: byte-level PPM-D order 5 mixed with the NN per-token target logprob in byte-probability space, via an adaptive-λ gate on PPM in-context confidence. Lzma+base85 exec-stub on train_gpt.py to fit the 16 MB artifact cap.

Results (3 seeds, sliding+mix on 5M-token subset):
- seed 42: 1.01853 (Δ -0.11986, artifact 15,982,254)
- seed 1337: 1.02006 (Δ -0.12047, artifact 15,976,391)
- seed 2025: 1.01916 (Δ -0.12012, artifact 15,955,159)
- mean: 1.01925 ± 0.00077 (Δ -0.12015)

Beats current record 1.06453 by 0.04528 at p << 0.01 (t-stat ≈ 65 on the 0.005 bar).
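For readers skimming the thread, here is a minimal sketch of the adaptive-λ mixture idea in byte-probability space. All names, the threshold, and the λ values are illustrative placeholders, not the PR's actual constants:

```python
import numpy as np

def adaptive_lambda_mix_bpb(nn_logp, ppm_logp, ppm_conf,
                            thresh=0.5, lam_low=0.1, lam_high=0.9):
    """Per-byte bits-per-byte of an adaptive-lambda mixture.

    nn_logp, ppm_logp: natural-log probabilities each model assigned to
    the realized byte at each position; ppm_conf: a per-byte PPM
    confidence signal used only to pick lambda (the weight on the NN).
    """
    nn_logp = np.asarray(nn_logp, dtype=np.float64)
    ppm_logp = np.asarray(ppm_logp, dtype=np.float64)
    conf = np.asarray(ppm_conf, dtype=np.float64)
    # High PPM confidence -> small lambda -> lean on PPM; else lean on NN.
    lam = np.where(conf > thresh, lam_low, lam_high)
    p_mix = lam * np.exp(nn_logp) + (1.0 - lam) * np.exp(ppm_logp)
    return float(-np.log2(np.maximum(p_mix, 1e-300)).mean())
```

Whether the confidence signal itself may depend on the realized byte is exactly the legality question debated below.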
Addresses the 5 reviewer concerns on the original PR:
1. Full-val mixture (not the 5M subset): 45,508,608 tokens / 152,570,124 bytes
2. PPM-as-TTT: per-byte score-before-update documented; organizer ruling requested
3. Byte-level vs token-level BPB: both reported in logs and submission.json
4. NN-only regression fixed: 1.09776 matches clarkkev's 1.09785 within seed noise
5. Condition 2 framing: the scoring model is explicitly the byte-level mixture

Results (3 seeds, full-val sliding+mix, same basis as all merged records):
- seed 42: 0.95145 (Δ -0.13524, NN_token 1.09745, artifact 15,960,029)
- seed 1337: 0.95214 (Δ -0.13541, NN_token 1.09832, artifact 15,929,684)
- seed 2025: 0.95135 (Δ -0.13540, NN_token 1.09751, artifact 15,930,624)
- mean: 0.95165 ± 0.00036 (Δ -0.13535)

Beats current record 1.06453 by 0.11288 at p << 1e-10 (t-stat ≈ 513 on the 0.005 bar). NN-only mean 1.09776 matches @clarkkev's 1.09785 within noise. All 3 artifacts 15.93–15.96 MB (under the 16 MB cap natively, no lzma stub). All 3 eval times 9:01–9:35 (under the 10-min cap).
I think the PPM state update is score-before-update, but the adaptive mixture gate itself isn't legal. Let's walk through the code. First:

```python
per_tok_len = piece_lens[tgt_np]
bs = b"".join(piece_bytes[int(t)] for t in tgt_np)
```

Then each loop iteration sets `x` to the actual current validation byte, computes the PPM probability for that realized byte, and stores it in `cf[i]`:

```python
for i in range(N):
    x = bs[i]
    if i == 0:
        plp[i] = LN256; cf[i] = 1/256
    else:
        esc = 1.0; pf = 0.0
        lim = O if i > O else i
        for o in range(lim, -1, -1):
            k = h[-o:] if o else b""
            e = tabs[o].get(k)
            if e is None: continue
            tot = e[0]; d = e[1]; c = d.get(x, 0)
            if c > 0:
                pf = esc * (2*c - 1) / (2*tot); break
            esc *= len(d) / (2*tot)
        else:
            pf = esc / 256
        if pf < 1e-20: pf = 1e-20
        plp[i] = log(pf); cf[i] = pf
```

Finally, the mixture weight `lam` is chosen from `cf`, i.e. from the PPM probability assigned to the realized current byte:

```python
lam = np.where(cf > T, L_, H)
pm = lam*np.exp(nlp) + (1-lam)*np.exp(plp)
return float(-np.log2(np.maximum(pm, 1e-300)).sum()/N)
```

In other words, the reported score is target-conditioned: if PPM assigns high probability to the byte that happened, the scorer trusts PPM; otherwise it falls back toward the NN. But we should expect a single predeclared distribution over the next byte, not grading based on the revealed correct answer. The caller also passes the gathered target IDs directly into this scorer:

```python
dist.gather(lpl, gl, dst=0); dist.gather(tgl, gt, dst=0)
lpa = torch.cat([gl[r][:sizes[r]] for r in range(h.world_size)]).cpu().numpy()
tga = torch.cat([gt[r][:sizes[r]] for r in range(h.world_size)]).cpu().numpy()
...
mb = _ppm_mixture_bpb(tga, lpa, val_data.sp)
```

TL;DR: the gate input `cf` depends on the realized target byte, so the mixture weight is chosen after peeking at the answer.
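To make the objection concrete, here is a toy demonstration (all numbers synthetic, not the PR's code) of why a gate that may inspect the realized byte's probability acts as an oracle: picking the per-position winner always scores at least as well as any predeclared fixed mixture, regardless of predictive skill:

```python
import numpy as np

def bpb(p):
    """Bits per byte, given each position's probability of the byte
    that actually occurred."""
    return float(-np.log2(p).mean())

rng = np.random.default_rng(0)
n = 10_000
p_nn  = rng.uniform(0.1, 0.9, n)   # synthetic NN probs of realized bytes
p_ppm = rng.uniform(0.1, 0.9, n)   # synthetic PPM probs of realized bytes

# Legal: the mixture weight is fixed before the target byte is revealed.
legal_bpb = bpb(0.5 * p_nn + 0.5 * p_ppm)

# Illegal (the critique, taken to the extreme): the weight is chosen from
# the realized byte's probability, so the gate backs whichever model "won".
lam = (p_nn >= p_ppm).astype(float)
oracle_bpb = bpb(lam * p_nn + (1.0 - lam) * p_ppm)

print(oracle_bpb, legal_bpb)   # the oracle gate is strictly lower here
```

Since the per-position maximum is never below the average, the oracle gate's bpb is guaranteed to beat the fixed mixture, which is why a target-conditioned gate cannot be treated as a valid score.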
…e-independent gate

Re-ran all 3 seeds with the fixed gate. Key change: cf (the mixture-weight input) is now computed from PPM state + prefix ONLY, frozen BEFORE any d.get(observed_byte) call. Formal property: cf is bitwise identical for any two possible next-bytes at the same position.

Results (3 seeds, full-val sliding + strict-legal mixture, PPM order 4):
- seed 42: 1.01228 (Δ -0.07436, NN_token 1.09740, artifact 15,953,442, eval 521s)
- seed 1337: 1.01303 (Δ -0.07443, NN_token 1.09823, artifact 15,921,608, eval 506s)
- seed 2025: 1.01226 (Δ -0.07426, NN_token 1.09728, artifact 15,924,697, eval 485s)
- mean: 1.01252 ± 0.00044 (Δ -0.07435)

Beats current record 1.06453 by 0.05201 at p << 1e-10 (t-stat ≈ 107 on the 0.005 bar).

Also: PPM order reduced 5→4 to keep eval under the 600s cap with max_count tracking overhead (order-5 strict-legal was 615s — 15s over). The order-4 mix is 0.02 BPB worse, but eval fits comfortably (485–521s). The previous illegal-gate number (0.95165) is retracted. The gate is now mechanically immune to @nprime06's critique.
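A minimal sketch of what "prefix-only" means mechanically: the gate value can be computed by a function that never receives the candidate byte at all. The table contents below are made up for illustration; `tabs` mirrors the order → {context: (total, max_count, counts)} layout described in this thread:

```python
def gate_cf(tabs, h, O=2):
    """Gate confidence from the deepest context with data: max_count/total.

    Takes only the PPM tables and the prefix h -- there is no candidate-byte
    argument, so cf is trivially identical for any next byte. (Scoring would
    continue down the escape chain; the gate only needs the deepest hit.)
    """
    cf_mx, cf_tot, cf_seen = 0, 256, False
    lim = min(O, len(h))
    for o in range(lim, -1, -1):
        k = h[-o:] if o else b""     # context key: prefix only
        e = tabs[o].get(k)           # lookup: prefix only
        if e is None:
            continue
        cf_mx, cf_tot, cf_seen = e[1], e[0], True
        break                        # deepest context with data decides cf
    return (cf_mx / cf_tot) if cf_seen else 1 / 256

# Hypothetical tables: order -> {context bytes: (total, max_count, counts)}
tabs = {0: {b"": (10, 4, {97: 4, 98: 6})},
        1: {b"a": (3, 2, {98: 2, 99: 1})},
        2: {}}
print(gate_cf(tabs, b"xa"))   # deepest hit is order-1 b"a": 2/3
```

The fixed code in the reply below freezes the same quantity inside the scoring loop before `d.get(x)` is ever called, which is equivalent to this prefix-only formulation.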
@nprime06 the gate is fixed and re-run; commit pushed.

The fix (mechanical, not rhetorical)

New gate computation:

```python
cf_mx = 0; cf_tot = 256; cf_seen = False
for o in range(lim, -1, -1):
    k = h[-o:] if o else b""          # context key: prefix only
    e = tabs[o].get(k)                # lookup: prefix only
    if e is None: continue
    if not cf_seen:                   # first context found = deepest with data
        cf_mx = e[1]; cf_tot = e[0]   # max_count, total — FROZEN HERE
        cf_seen = True                # — BEFORE any d.get(x) below
    tot = e[0]; d = e[2]
    c = d.get(x, 0)                   # x used here for scoring — cf already frozen
    if c > 0:
        pf = esc * (2*c - 1) / (2*tot); break
    esc *= len(d) / (2*tot)
cf[i] = (cf_mx / cf_tot) if cf_seen else 1/256
```

Formal property: for any two possible next-bytes at the same position, cf is bitwise identical. In other words: the gate that was conditioned on the realized target byte is now a function of the prefix alone.

New results (3 seeds, full val, strict-legal gate, PPM order 4)
Std 0.00044. Beats current record 1.06453 by 0.05201 at t-stat ≈ 107 on the 0.005-nat bar. The old illegal-gate number (0.95165) is retracted; the oracle was contributing about 0.06 BPB of fake savings.

Other minor changes
The category question still stands

Your broader question about whether an online streaming predictor as a mixture partner counts as legal score-first TTT is separate from this code-level fix. The per-byte semantics of the PPM update are still score-before-update (score byte i with counters built from bytes 0..i−1, then update).

Thanks for catching the gate bug. It was a straight-up error on my part, not a defensible choice.
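For the category question, the score-before-update discipline being claimed can be illustrated with a toy order-0 Laplace-smoothed byte model in place of the full PPM escape chain (the smoothing and structure here are illustrative only):

```python
import math
from collections import Counter

def score_then_update(stream):
    """Per-byte TTT discipline: byte i is scored using only counts of
    bytes 0..i-1, and is folded into the counts only afterwards."""
    counts, total, bits = Counter(), 0, 0.0
    for x in stream:
        # --- score first: probability uses only already-seen bytes ---
        p = (counts[x] + 1) / (total + 256)   # Laplace-smoothed, toy
        bits += -math.log2(p)
        # --- update after: x now becomes training data for later bytes ---
        counts[x] += 1
        total += 1
    return bits / len(stream)

print(score_then_update(b"\x00"))   # first byte always costs 8 bits here
```

The first byte is scored against a uniform prior (8 bits), and repeats get cheaper only after they have been scored once, which is the property the organizer ruling would need to bless.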
…ai#1787 Polar Express NS new base; PR openai#1795 PPM 1.01252; Issue openai#1604 deadline passed; Session 20

- Merged SOTA 1.0810 confirmed Day 15 (README not updated despite Scylla record commit)
- Scylla 0.9485 committed to track_10min_16mb/ on Apr 23 (PR openai#1184) but byte accounting disputed by PR openai#1271 (corrected ~1.1289 bpb); treat merged SOTA as 1.0810
- PR openai#771 CLOSED/REJECTED confirmed; PR openai#727 CLOSED (illegal); PR openai#758 open but dead; PR openai#731 still awaiting seeds 1337+2024
- Issue openai#1604 (CaseOps ruling): NO @valerio-oai response in 11 days; self-deadline Apr 24 passed; proceed with clean legal stack immediately
- NEW: PR openai#1787 (nprime06, 1.06335) — new community-consensus clean base with Polar Express Newton-Schulz (arXiv:2505.16932, ICLR 2026) + MIN_LR=0.10 warmdown floor
- NEW: PR openai#1795 (OE-GOD, 1.01252) — byte-level PPM order-4 adaptive mixture; gate legality concern fixed; await organizer ruling before implementing
- NEW: PR openai#1797 (dexhunter, 1.06157) — PR openai#1787 + SmearGate + LQER Asym; new dexhunter best
- NEW: PR openai#1802 (aamodbhatt, 1.0771) — Polar Express NS + Multi-Phase Global TTT
- TECHNIQUE: Polar Express NS (arXiv:2505.16932) and Gram NS (Dao-AILab) added to table
- TECHNIQUE: MIN_LR=0.10 warmdown floor added to best-stack approach
- Updated competition strategy: stop waiting for CaseOps, implement clean stack with Polar Express NS + MIN_LR immediately (6 days to deadline)

https://claude.ai/code/session_01JZ3FiS937NwLHt3Fv9WHPD
…r; v2.1 changelog
…ixture class

Per @OE-GOD's review note on this PR — the byte-level PPM-D mixture technique class was first introduced in PR openai#1795 (2026-04-23), with anmarhindi's PR openai#1835 (2026-04-25, our port source) following two days later.

Updates:
- Opening summary cites PR openai#1795 as class introduction + PR openai#1835 as our port source
- Stack table 'Byte-level PPM-D mixture (this addition)' row updated with both refs
- Acknowledgements section reordered to lead with PR openai#1795 chronologically
- PPM-D cluster list in compliance section now includes openai#1795

No code or score changes.
Summary
Successor to #1785, which was closed after the reviewer raised 5 concerns. All 5 are resolved in this rebuild.
Builds on @clarkkev's 2026-04-01 SP4096 record (1.09785). The entire NN stack is unchanged; the gain comes from a byte-level PPM-D adaptive-λ mixture applied at eval time on full val (45,508,608 tokens / 152,570,124 bytes, same basis as every merged record).
Headline
val_bpb = 0.95165 (3-seed mean, std=0.00036, full FineWeb val)
Beats current record 1.06453 by 0.11288 BPB — t-stat ≈ 513 on the 0.005-nat bar (p ≪ 1e-10).
Our NN-only mean 1.09776 matches @clarkkev's 1.09785 within seed noise — stack and env vars unchanged, same sliding-window eval, same GPTQ int6+brotli quant, same wallclock cap.
Five reviewer concerns — status
Score byte_i using counters accumulated from bytes 0..i−1, then add byte_i to the counters for future bytes. By the rule text ("test-time training on validation set tokens you've already evaluated your model on"), every PPM update uses only already-scored bytes. Per-byte granularity is finer than Issue A Field Guide to Valid Submissions #1017's chunk-level framing; explicit organizer guidance on this class of online streaming predictor would help. If the ruling is "no," the submission will be withdrawn.

The scoring model is the byte-level mixture

q_mix = λ·q_NN_byte + (1−λ)·q_PPM_byte

where the NN piece is a bit-conserving spread of its token distribution and the PPM piece is an online byte-level PPM-D order 5.

What changed vs 2026-04-01
Source diff: one new function (_ppm_mixture_bpb, ~30 lines) and ~30 lines of gather/mix logic inside eval_val_sliding. Nothing else touched. See README for the exact derivation + mixture math.

Compliance
no_ngram_cache: false — byte-level online PPM with zero precomputed state shipped; see README + submission.json compliance notes for the score-before-update argument.

Scope

Adds only records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/. No changes outside it.

Credits
Neither predictor alone reaches this BPB: clarkkev's NN at 1.098, byte-PPM alone ≈2.7 on full val. The mixture at 0.95 captures the bits PPM strictly wins on (rare exact-repeat sequences — URLs, code identifiers, cross-doc duplicates) while leaving everything else to the NN.
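A back-of-envelope illustration (all probabilities synthetic, not measured) of how a mixture can beat both components in bits: if a confidence gate computed legally from the prefix can flag the exact-repeat positions, the blend collects the cheap bits there without paying PPM's penalty everywhere else:

```python
import math

bits = lambda p: -math.log2(p)       # cost in bits of probability p

f = 0.15                             # assumed fraction of exact-repeat bytes
p_nn = 0.47                          # toy NN prob of the realized byte, everywhere
p_ppm_rep, p_ppm_oth = 0.99, 0.10    # toy PPM probs on repeats vs elsewhere

nn_bpb  = bits(p_nn)
ppm_bpb = f * bits(p_ppm_rep) + (1 - f) * bits(p_ppm_oth)

# Adaptive gate: trust PPM on repeats, NN elsewhere (weights illustrative).
lam_rep, lam_oth = 0.05, 0.95
mix_rep = lam_rep * p_nn + (1 - lam_rep) * p_ppm_rep
mix_oth = lam_oth * p_nn + (1 - lam_oth) * p_ppm_oth
mix_bpb = f * bits(mix_rep) + (1 - f) * bits(mix_oth)

print(round(nn_bpb, 3), round(ppm_bpb, 3), round(mix_bpb, 3))
```

With these made-up numbers the mixture lands below the NN alone while PPM alone is far worse, mirroring the qualitative claim above.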
Test plan

`[ppm_mix]` + `final_int6_sliding_window` lines