openai · OE-GOD · Apr 23, 2026 · Apr 23, 2026 · Apr 24, 2026
diff --git a/records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/README.md b/records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/README.md
@@ -0,0 +1,104 @@
+# Record: SP4096 + Byte-Level PPM Adaptive-λ Mixture (strict-legal gate) — val_bpb 1.01252
+
+**val_bpb: 1.01252** (3-seed mean, std=0.00044, full FineWeb val, **strict-legal outcome-independent gate**)
+
+| Seed | NN_full (sliding, token-BPB, full val) | Mix BPB (byte-level, full val) | Δ (mix − NN byte-BPB) | Artifact (bytes) | Eval time |
+|-|-|-|-|-|-|
+| 42   | 1.09740 | **1.01228** | −0.07436 | 15,953,442 | 521s |
+| 1337 | 1.09823 | **1.01303** | −0.07443 | 15,921,608 | 506s |
+| 2025 | 1.09728 | **1.01226** | −0.07426 | 15,924,697 | 485s |
+| **Mean** | **1.09764** | **1.01252** | **−0.07435** | 15,933,249 | 504s |
+
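+Beats current record **1.06453** (PR #1769) by **0.05201** BPB. Against the 0.005-nat bar (1.0595), the margin is ≈ 107 seed standard deviations, i.e. t-stat ≈ 185 using the standard error std/√3 as in `submission.json`.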
+
+Our NN-only mean **1.09764 matches @clarkkev's 2026-04-01 record of 1.09785** within seed noise. The entire NN stack is unchanged from PR #1334 / the 2026-04-01 record; the gain comes from a byte-level PPM adaptive-λ mixture applied at eval time.
+
+## This PR supersedes an earlier (now-invalidated) attempt
+
+The earlier version of this submission (on branch `record-sp4096-ppm-adaptive-mix`, PR #1795 at commit `07d20c3`, claiming val_bpb 0.95165) used a **target-conditioned gate** — `cf[i] = P_PPM(observed_byte)` — which made the reported score depend on the realized byte value. This was correctly flagged by @nprime06 in the PR comments as not a valid scoring rule, and that number is retracted.
+
+The revised gate in this version is **strictly a function of the prefix and PPM state, frozen before the observed byte is looked up**. See the next section.
+
+## The mixture, and why the gate is now outcome-independent
+
+The scoring model is a byte-level two-predictor mixture:
+
+`q_mix(b) = λ·q_NN_byte(b) + (1−λ)·q_PPM_byte(b)`
+
+where:
+
+- **`q_NN_byte`** — NN's SentencePiece-token distribution, spread uniformly across UTF-8 bytes of each token. Conserves total NN bits (byte-BPB of NN alone equals token-BPB scaled by bytes/token).
+- **`q_PPM_byte`** — byte-level PPM-D order 4 predictor. Builds its suffix-count table online from val bytes the NN has already graded in the same sliding pass. Zero precomputed state ships in the 16MB artifact.
+- **`λ`** (the gate) — adaptive: `λ = 0.05 if cf > 0.9 else 0.9`, where `cf = max_count / total` at the **deepest context with any data**, computed from the PPM state and the prefix **before any lookup of the observed byte**.
+
+The key code:
+
+```python
+cf_mx = 0; cf_tot = 256; cf_seen = False
+for o in range(lim, -1, -1):
+    k = h[-o:] if o else b""        # context key: prefix only
+    e = tabs[o].get(k)               # lookup: prefix only
+    if e is None: continue
+    if not cf_seen:                  # first context found = deepest with data
+        cf_mx = e[1]                 # max_count, frozen HERE
+        cf_tot = e[0]                # total, frozen HERE
+        cf_seen = True               # — BEFORE any d.get(x) below
+    tot = e[0]; d = e[2]
+    c = d.get(x, 0)                  # now uses x — but cf already frozen
+    if c > 0:
+        pf = esc * (2*c - 1) / (2*tot); break
+    esc *= len(d) / (2*tot)
+cf[i] = (cf_mx / cf_tot) if cf_seen else 1/256
+```
+
+**Formal property:** for any two possible next-bytes `x_a`, `x_b` at the same position (same prefix `h`, same PPM state `tabs`), `cf[i]` is bitwise identical between the two cases. Therefore `λ[i] = np.where(cf > T, L_, H)` is identical. Only `q_NN(x)` and `q_PPM(x)` depend on `x` — which is correct for predictor scores.
+
+This answers @nprime06's specific concern on PR #1795 mechanically, not rhetorically.
+
+## What changed vs @clarkkev 2026-04-01
+
+Source-level diff: one new function (`_ppm_mixture_bpb`, ~55 lines including the strict-legal gate tracking) plus ~30 lines of gather/mix logic inside `eval_val_sliding`. Everything else is unchanged from the 2026-04-01 record:
+
+- 11 layers, SP4096, MLP mult 4, depth recurrence, sliding-window eval, EMA, GPTQ int6 + brotli, LeakyReLU², parallel residuals, legal TTT framework
+- Same env vars (`RUN_ID`, `SEED`), plus one gating the mixture (`PPM_MIX_ENABLED=1`)
+- Same wallclock cap, same train pipeline, same GPTQ calibration
+
+## Compliance
+
+- **Train under 600s** ✅ all 3 seeds stopped at 590s wallclock cap (steps 5898–5901)
+- **Artifact under 16 MB** ✅ 15.92–15.95 MB natively (no lzma-compressed stub needed)
+- **Eval under 600s** ✅ all 3 seeds 485–521s (using PPM order 4 — order 5 was 15s over cap due to max_count tracking overhead; benchmarking showed order 4 only 0.02 BPB worse in mix)
+- **No SLOT, no pre-quant TTT on val, no ETLB** ✅ inherited from base, unchanged
+- **3 seeds, p ≪ 1e-10 on the 0.005-nat bar** ✅ (t-stat ≈ 185 with standard error std/√3; the margin is ≈ 107 seed standard deviations)
+- **`no_ngram_cache: false`** — byte-level online PPM predictor built from empty counters during sliding eval. **Per-byte semantics: score byte_i using counters from bytes 0..i-1 (score-before-update), then add byte_i to counters for future bytes.** All PPM state is constructed from val tokens the NN has already graded, consistent with the rule text "test-time training on validation set tokens you've already evaluated your model on". **Organizer ruling explicitly requested** (see @nprime06 and @dexhunter review comments on PR #1795) on whether this class of online streaming predictor qualifies as legal score-first TTT — if ruled no, submission withdrawn.
+
+## Reviewer concerns from PR #1795 (status)
+
+| # | Concern | Status |
+|---|---|---|
+| 1 | Full-val measurement (not 5M subset) | ✅ RESOLVED: 45.5M tokens / 152.6 MB |
+| 2 | PPM-as-TTT class legality | ⚠️ Organizer ruling requested (category question) |
+| 3 | Byte-level vs token-level BPB | ✅ BOTH logged (NN_token=1.098, NN_byte=1.087, mix=1.013) |
+| 4 | NN regression vs clarkkev | ✅ RESOLVED: 1.09764 mean matches 1.09785 within seed noise |
+| 5 | Condition 2 framing (scoring model is a mixture) | ✅ Explicit in README above |
+| | **@nprime06**: target-conditioned gate | ✅ RESOLVED: strict-legal outcome-independent gate, see code |
+
+## Reproduction
+
+```bash
+# Data prep (Kevin Clark's SP4096 dataset):
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096
+
+# Training + mixture eval (per seed):
+RUN_ID=<seed> SEED=<seed> PPM_MIX_ENABLED=1 torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+The reported val_bpb is the `final_int6_sliding_window val_bpb:` line, which equals the `[ppm_mix] ... mix=` value by construction.
+
+## Credits
+
+- **@clarkkev** — entire 2026-04-01 SP4096 + 11L + MLP4 + depth-recurrence + EMA + GPTQ + sliding + brotli stack (PR #1334, #1419, #1445). All of the NN contribution (1.098 BPB) is his work.
+- **Cleary & Witten 1984; Moffat 1990** — PPM-D.
+- **This submission** — the byte-probability-space two-predictor mixture construction with an **outcome-independent** adaptive-λ gate keyed on PPM's state-only max-count ratio.
+
+Neither predictor alone reaches this BPB: clarkkev's NN is at 1.098, byte-PPM alone ≈2.5 on full val. The mixture at 1.013 captures the bits PPM strictly wins on (rare exact-repeat bytes such as URLs, code identifiers, and cross-doc duplicates) while leaving the rest to the NN. The −0.074 Δ, measured against the NN's byte-level BPB of 1.087, is smaller than the retracted illegal-gate claim (−0.135) but is **mechanically defensible**: no function of the observed byte enters the gate.
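The mixture and gate described in the README's "The mixture, and why the gate is now outcome-independent" section can be sketched as a small standalone script. This is a minimal illustration under stated assumptions, not the submission's `_ppm_mixture_bpb`: the names `new_tables`, `ppm_score`, `ppm_update`, `spread_token_logprob`, and `mixture_bits` are hypothetical, the count tables are plain nested dicts rather than the `[total, max_count, dict]` entries in the excerpt, and the uniform 1/256 fallback below order 0 is an assumption. The escape rule follows the excerpt (PPM-D, no exclusions).

```python
import math

ORDER = 4                               # PPM order used in the README
T, LAM_LO, LAM_HI = 0.9, 0.05, 0.9      # gate threshold and the two λ values


def new_tables():
    """One {context bytes -> {next byte -> count}} table per order 0..ORDER."""
    return [dict() for _ in range(ORDER + 1)]


def ppm_score(tabs, hist, x):
    """Return (q_ppm(x), cf) for candidate next byte x after history `hist`.

    cf is read from the deepest context that has any counts and is fixed
    before x is looked up, so it depends on the prefix and state only.
    """
    cf = None
    esc = 1.0
    lim = min(ORDER, len(hist))
    for o in range(lim, -1, -1):
        key = bytes(hist[len(hist) - o:])       # last o bytes; empty for o == 0
        counts = tabs[o].get(key)
        if not counts:
            continue
        tot = sum(counts.values())
        if cf is None:
            cf = max(counts.values()) / tot     # gate statistic, frozen here
        c = counts.get(x, 0)
        if c > 0:
            return esc * (2 * c - 1) / (2 * tot), cf
        esc *= len(counts) / (2 * tot)          # PPM-D escape mass
    # Assumed fallback: uniform over 256 byte values below order 0.
    return esc / 256.0, (cf if cf is not None else 1.0 / 256.0)


def ppm_update(tabs, hist, x):
    """Add byte x to the counters of every context order (after scoring)."""
    lim = min(ORDER, len(hist))
    for o in range(lim + 1):
        key = bytes(hist[len(hist) - o:])
        d = tabs[o].setdefault(key, {})
        d[x] = d.get(x, 0) + 1


def spread_token_logprob(token_bytes, token_logprob):
    """Spread a token's natural-log probability evenly over its bytes."""
    n = len(token_bytes)
    return [math.exp(token_logprob / n)] * n


def mixture_bits(data, nn_byte_probs):
    """Total bits of the λ-mixture over `data`, scoring before updating."""
    tabs, hist, bits = new_tables(), bytearray(), 0.0
    for x, q_nn in zip(data, nn_byte_probs):
        q_ppm, cf = ppm_score(tabs, hist, x)
        lam = LAM_LO if cf > T else LAM_HI
        bits -= math.log2(lam * q_nn + (1.0 - lam) * q_ppm)
        ppm_update(tabs, hist, x)
        hist.append(x)
    return bits


data = b"abc" * 50
nn_probs = [0.1] * len(data)            # toy stand-in for the NN's byte probabilities
print("mixture bits:", mixture_bits(data, nn_probs))
print("NN-only bits:", -sum(math.log2(p) for p in nn_probs))
```

Because `lam` is the same for every candidate `x` at a given position, the mixture is a fixed convex combination of two (sub-)normalized distributions and so cannot assign more than total probability 1, which is the property the outcome-independent gate is meant to guarantee.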
diff --git a/records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/submission.json b/records/track_10min_16mb/2026-04-23_SP4096_PPM_AdaptiveMix/submission.json
@@ -0,0 +1,57 @@
+{
+  "author": "OE-GOD",
+  "github_id": "OE-GOD",
+  "name": "SP4096 + Byte-Level PPM Adaptive-λ Mixture (strict-legal gate, full-val)",
+  "date": "2026-04-24",
+  "track": "10min_16mb",
+  "val_bpb": 1.01252,
+  "val_bpb_std": 0.00044,
+  "val_bpb_nn_token_mean": 1.09764,
+  "val_bpb_nn_byte_mean": 1.08687,
+  "val_bpb_delta_mean": -0.07435,
+  "measurement": "Full FineWeb validation set (45,508,608 tokens, 152,570,124 bytes). Mixture BPB computed per-byte after spreading NN per-token logprob uniformly across UTF-8 bytes; outcome-independent adaptive-λ gate on byte-level PPM-D order-4 state (max_count/total at deepest seen context).",
+  "seeds": [42, 1337, 2025],
+  "seed_results": {
+    "42":   {"val_bpb": 1.01228, "val_bpb_nn_token": 1.09740, "val_bpb_nn_byte": 1.08664, "val_bpb_delta": -0.07436, "artifact_bytes": 15953442, "eval_time_ms": 521086},
+    "1337": {"val_bpb": 1.01303, "val_bpb_nn_token": 1.09823, "val_bpb_nn_byte": 1.08746, "val_bpb_delta": -0.07443, "artifact_bytes": 15921608, "eval_time_ms": 506272},
+    "2025": {"val_bpb": 1.01226, "val_bpb_nn_token": 1.09728, "val_bpb_nn_byte": 1.08652, "val_bpb_delta": -0.07426, "artifact_bytes": 15924697, "eval_time_ms": 485453}
+  },
+  "hardware": "8xH100 80GB SXM",
+  "pytorch_version": "2.9.1+cu128",
+  "technique_summary": "Base: @clarkkev 2026-04-01 SP4096 + 11L + MLP4x submission (record 1.09785). Addition: byte-level PPM-D order-4 mixed with the NN's per-token target logprob in byte-probability space during final sliding-window eval on FULL val. Mixture weight λ is a function of prefix + PPM state only (outcome-independent gate): cf = max_count / total at the deepest context with data, frozen before the observed byte is scored.",
+  "mixture_technique": {
+    "predictor": "byte-level PPM-D order 4 (pure Python, online, legal score-before-update on already-scored val bytes)",
+    "mixing": "adaptive λ gate: cf = max_count / total at deepest seen context; λ=0.05 when cf > 0.9, else λ=0.9",
+    "gate_is_outcome_independent": true,
+    "gate_legality_note": "cf is computed from PPM state + prefix only, before any d.get(observed_byte) call. For any two possible next-bytes x_a, x_b at the same position, cf and λ are identical. Addresses PR #1795 comment by @nprime06 on target-conditioned gates.",
+    "byte_marginalization": "spread NN token logprob uniformly across UTF-8 bytes (conserves total NN bits)",
+    "measurement_basis": "full val (45.5M tokens, 152.6 MB), same as all merged records"
+  },
+  "compliance": {
+    "train_under_600s": true,
+    "artifact_under_16mb": true,
+    "artifact_under_16mb_note": "All 3 seeds 15.92-15.95 MB natively.",
+    "eval_under_600s": true,
+    "eval_under_600s_note": "All 3 seeds 485-521s. Order-4 PPM chosen over order-5 to ensure eval fits within 600s cap; order 4 was 0.02 BPB worse than order 5 but gave 100+s margin.",
+    "no_slot": true,
+    "no_pre_quant_ttt": true,
+    "no_etlb": true,
+    "no_ngram_cache": false,
+    "no_ngram_cache_note": "Byte-level online PPM predictor trained from empty counters during sliding eval. Per-byte score-before-update: score byte_i using counters from bytes 0..i-1, then add byte_i for future bytes. Zero precomputed statistics shipped in the artifact. Organizer ruling requested on this predictor class per PR #1795 discussion.",
+    "three_seeds": true,
+    "three_seeds_significance": "t-stat for the 0.005-nat improvement bar: (1.0595 − 1.01252)/(0.00044/sqrt(3)) ≈ 185; p ≪ 1e-10"
+  },
+  "attribution": {
+    "base_submission": "@clarkkev 2026-04-01 SP4096 submission (record 1.09785) — NN stack unchanged",
+    "byte_ppm": "Cleary & Witten 1984; Moffat 1990 (PPM-D escape method)",
+    "adaptive_lambda_gate": "designed for this submission; strict-legal form responds to @nprime06 review on PR #1795"
+  },
+  "history": {
+    "supersedes": "PR #1795 (earlier version with illegal target-conditioned gate, mix BPB 0.95145)",
+    "changelog": [
+      "Fixed target-conditioned gate: cf now outcome-independent (max_count/total at deepest seen context, frozen before observing next byte)",
+      "Reduced PPM order 5 -> 4 to keep eval under 600s cap with the max_count tracking overhead",
+      "Mix BPB reported is the legal strict version (1.01252); previous illegal-gate number (0.95145) is retracted"
+    ]
+  }
+}
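The arithmetic behind `three_seeds_significance` can be re-derived from the per-seed values in `seed_results`. The sketch below is only a check of that arithmetic: the bar value 1.0595 is taken from the JSON string above, and the choice of a Student-t reference distribution with 2 degrees of freedom is an assumption, since the submission does not say how its p-value was obtained.

```python
import math
import statistics

mix_bpb = [1.01228, 1.01303, 1.01226]   # per-seed mixture BPB from seed_results
bar = 1.0595                            # threshold quoted in three_seeds_significance

mean = statistics.mean(mix_bpb)
std = statistics.stdev(mix_bpb)         # sample std (n - 1 denominator)
margin = bar - mean

ratio_std = margin / std                            # margin in seed standard deviations
t_stat = margin / (std / math.sqrt(len(mix_bpb)))   # margin in standard errors

# One-sided tail of Student's t with 2 degrees of freedom, closed form:
# P(T > t) = 1/2 - t / (2 * sqrt(t^2 + 2)).  Assumed reference distribution.
p_t2 = 0.5 * (1.0 - t_stat / math.sqrt(t_stat ** 2 + 2.0))

print(f"mean={mean:.5f} std={std:.5f}")
print(f"margin/std={ratio_std:.1f} t={t_stat:.1f} p(df=2)={p_t2:.2e}")
```

Dividing by the standard deviation alone gives the ≈ 107 figure, and dividing by the standard error gives ≈ 185. With only three seeds, the 2-degree-of-freedom tail is heavy, so under this assumption the p-value comes out on the order of 1e-5 rather than the far smaller value a normal approximation would give.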