Record: val_bpb = 1.06128 | SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3-seed mean) #1851
Conversation
…symmetric + Phased TTT

val_bpb = 1.06128 | ~15.95 MB | 8xH100 SXM

Key Change: SmearGate BOS Document Boundary Fix

Builds on the PR openai#1797 stack (PR openai#1787 base + SmearGate + LQER Asymmetric) but fixes the SmearGate cross-document leakage bug identified by @cocohearts in the PR openai#1797 audit.

The bug: SmearGate's 1-token causal lookback does not mask BOS positions, so the final token of document N smears into the BOS of document N+1.

Credits
- @nprime06 -- PR openai#1787 base stack
- @romeerp -- CaseOps transform (PR openai#1729)
- @dexhunter -- SmearGate + LQER (PR openai#1797)
- @cocohearts -- identifying the SmearGate BOS bug
- @abaybektursun -- score-first TTT (PR openai#549)
- @clarkkev -- GPTQ SDClip + SP8192 (PR openai#1394)
Need results on 3 different seeds.
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23
- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged as a BPB risk: sibling PR openai#1846 closed the same day
- PR openai#1858 (0.9946) covers only 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
I ran out of credits on RunPod. This amazing person validated the other two seeds for me! See #1868.
Forward-1-token residual mixer at embedding lane:
x_t <- x_t + lambda * sigmoid(W * x_t[:12]) * x_{t-1}
The model gets a learnable bias toward bigram features without needing
attention to discover it. Tiny (13 params total: 12-wide linear + scalar lambda).
Zero-init lambda = transparent at start.
BOS-fix prevents cross-document leakage during packed training: gate is
masked to 0 at positions where input_ids == BOS_TOKEN_ID (default 1).
Both smear_gate.weight and smear_lambda match 'smear' pattern -> route to
scalar AdamW, not Muon. Both at GPT-level (not blocks), so explicitly
appended to scalar_params in Optimizers.
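Below is a minimal PyTorch sketch of the mechanism described above: the 13-parameter previous-token mixer plus the BOS mask, using the `smear`-named parameters that the optimizer routing keys on. The module name, tensor shapes, and integration point are assumptions for illustration; the submission's train_gpt.py is the canonical implementation.

```python
import torch
import torch.nn as nn

BOS_TOKEN_ID = 1  # default per the description above

class SmearGate(nn.Module):
    """x_t <- x_t + lambda * sigmoid(W @ x_t[:12]) * x_{t-1}; 13 params total."""
    def __init__(self, window: int = 12):
        super().__init__()
        self.smear_gate = nn.Linear(window, 1, bias=False)   # W: 12 params
        self.smear_lambda = nn.Parameter(torch.zeros(()))    # zero-init: no-op at start
        self.window = window

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) embeddings; input_ids: (B, T) token ids
        gate = torch.sigmoid(self.smear_gate(x[..., : self.window]))  # (B, T, 1)
        # BOS fix: zero the gate where the current token is BOS, so the last
        # token of doc N cannot smear into the BOS of doc N+1 in packed batches.
        gate = gate * (input_ids != BOS_TOKEN_ID).unsqueeze(-1).to(gate.dtype)
        # Causal 1-token lookback: shift right, zero-pad position 0.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + self.smear_lambda * gate * prev
```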
- Adds the 2-line BOS mask in both the forward_logits and forward_ttt SmearGate paths. Before the fix, the last token of doc N smeared into the BOS of doc N+1 — a model-quality bug, not a C1 issue. Identical fix to PR openai#1851 (@aquariouseworkman), audit by @cocohearts.
- runpod/phase_g_3seed.sh: full 3-seed driver. Sets the PR openai#1797 stack env vars plus the PR openai#1855 9-hparam greedy stack delta:
  MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 WARMDOWN_FRAC=0.85 BETA2=0.99 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80 SPARSE_ATTN_GATE_SCALE=0.5 PHASED_TTT_PREFIX_DOCS=2500
  Mixers (NGRAM/TEMP) stay OFF — pure neural baseline + bug fix + hparam stack. Auto-runs a Welch t-test vs PR openai#1797 (1.06157±0.00066), as sketched below.
- TTT 4-epoch (PR openai#1812) explicitly NOT adopted: that scheme targets the PR openai#1493 SGD-on-whole-model TTT path, not the PR openai#1797 LoRA-phased per-doc-reset path we're on. No clean mapping.

Legality: all 16/16 unit tests still pass. The BOS fix preserves causality (it only zeroes a gate at positions where the current token is BOS; it never references future tokens).
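For reference, a minimal sketch of the Welch t-test the driver auto-runs against PR openai#1797's published 3-seed stats (1.06157 ± 0.00066), using SciPy's summary-statistics form. The seed values shown are the #1851/#1868 triple from this thread; the driver's own script is canonical.

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats

# 3-seed post-TTT BPB for the BOS-fix stack (#1851 seed 42 + #1868 seeds 314/1234)
new_seeds = np.array([1.06128, 1.06087, 1.06220])

# Welch's t-test (equal_var=False) against PR openai#1797's published 3-seed stats
t, p = ttest_ind_from_stats(
    mean1=new_seeds.mean(), std1=new_seeds.std(ddof=1), nobs1=len(new_seeds),
    mean2=1.06157, std2=0.00066, nobs2=3,
    equal_var=False,
)
print(f"Welch t = {t:.3f}, p = {p:.4f}")
```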
Correction to last comment (edited)
S9 stack extended with caseops dataset support and LQER (Low-rank Quantization Error Rescue). 4487 lines vs train_gpt_s9.py's 4363. This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4 in stage 1, Z0/P*/Q*/R* in stage 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855 only on significance grounds (p=0.325). Our prior 050 line built on openai#1797, which is under a validity cloud per cocohearts. Re-anchor the research baseline on openai#1855's accepted chain.

Pure port — zero modifications. Files copied verbatim from codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack @ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/. Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.) on this baseline.
5252-line training script reproducing PR openai#1851's stack with extensive inline annotations (CN comments). Mandatory FA3 import (no SDPA fallback) and direct Triton kernel use. Sibling to train_gpt_s9*.py variants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR openai#1898 (X-Abhishek-X) ran Partial SpinQuant + EMBED_BITS=6 reinvest on the same chain and reported val_bpb 1.06614 vs their base openai#1851's 1.06128 = +0.00486 REGRESSION. Their PR framed it as -0.01486 vs the 2-week-old merged SOTA openai#1493 (1.0810) instead of vs their actual parent.

Implications:
- 060G (Partial SpinQuant): empirically null/negative on this chain.
- 060H (EMBED_BITS=6, alone or with LQER reinvest): even riskier without SpinQuant's rotation protection.

Both specs are marked DEPRECATED at the top. Not deleted (kept as documentation in case conditions change later, e.g., deploy-time repair specifically targeting tok_emb precision).
…d mean)

Applies activation-aware mixed-precision GPTQ (from PR openai#1908 / romeerp) on top of the codemath3000 PR openai#1855 stack.

## Results

| Seed | val_bpb (post-TTT) | artifact bytes | steps | eval time |
|------|--------------------|----------------|-------|-----------|
| 42 | 1.06118 | 15,978,503 | 4989 | 392.8s |
| 314 | 1.06005 | 15,976,469 | 4986 | 395.8s |
| 1234 | 1.06135 | 15,976,673 | 4977 | 395.5s |
| **mean** | **1.06086** | — | — | — |

3-seed std: 0.00069. Beats codemath3000 PR openai#1855 (1.06108) by 0.00022 BPB.

## Technique

Training is identical to PR openai#1855. The only change is post-training quantization:

**AWQ-lite (activation-aware GPTQ):**
1. Collect per-input-channel activation RMS during GPTQ calibration
2. Score column groups: `saliency = act_rms * mean(abs(weight))`
3. Select the top-1 most salient 64-column group per matrix
4. Quantize that group at int8 inside the same full-tensor GPTQ solve (the rest stays int6)

Env vars: `AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64`

## Setup

1. `pip install -r requirements.txt`
2. `apt-get install -y lrzip`
3. Install FA3: `pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/`
4. Run `prepare_caseops_data.py` to build the dataset
5. `AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 torchrun --standalone --nproc_per_node=8 train_gpt.py`

## Environment

- 8xH100 80GB SXM (RunPod)
- PyTorch 2.9.1+cu128
- FlashAttention 3.0.0
- Triton 3.5.1
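A minimal sketch of the saliency scoring and group selection in steps 1-4 above, assuming `act_rms` is the per-input-channel activation RMS collected during GPTQ calibration. Function and variable names are illustrative, not the PR's code; the int8/int6 split itself happens inside the GPTQ solve.

```python
import torch

def select_salient_groups(weight: torch.Tensor, act_rms: torch.Tensor,
                          group_size: int = 64, top_k: int = 1) -> torch.Tensor:
    """Return starting column indices of the top_k most salient column groups.

    weight: (out_features, in_features); act_rms: (in_features,)
    """
    # Step 2: per-column saliency = activation scale * mean |weight|
    saliency = act_rms * weight.abs().mean(dim=0)              # (in_features,)
    n_groups = weight.shape[1] // group_size
    group_scores = saliency[: n_groups * group_size].view(n_groups, group_size).sum(dim=1)
    # Step 3: pick the top-k groups; these get int8 in the GPTQ solve (rest int6)
    top = torch.topk(group_scores, k=top_k).indices
    return top * group_size
```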
…t w/ GPTQ v2

3143-line condensed version of train_gpt_s0_pr1851_mod.py (no inline annotations, GPTQ v2 path). Same mandatory FA3 + Triton dependency as the annotated sibling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ams)

After 4 parallel research agents reviewed 30+ open PRs and compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) was flagged "empirical negative" by sunnypatneedi's 4-29 frontier scan, BUT only on the PR openai#1855 base with default WD=1.0. It was never tested on the PR openai#1908 + WD=2.0 combo, so V19's specific stack is NOT directly invalidated.
2. PR openai#1925 (simon-marcus) reports 1.06049 (3-seed verified; vs PR openai#1855 base 1.06108 = -0.00059 BPB) from just 2 hparam env vars:
   MATRIX_LR 0.026 -> 0.028
   PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   This is an axis orthogonal to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0 (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
- V19c < 0.97591 -> CLEAR WIN, run 3-seed V19c
- 0.97591-0.9755 -> borderline, ablate via V19a/V19b
- V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged as a regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per the openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per the openai#1735 precedent
- cocohearts' 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao: 48h of zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
… 1.5221
- Add full-val Path B result (151,078,222 bytes, claim_ready=true)
- Add formal mathematical description of byte-level vs token-level BPB
- Add comparison with PR openai#1905 (independent normalization-invalidity discovery)
- Add Score-First Legal TTT evidence section (PRs openai#461, openai#549, openai#1735, openai#1851)
- Archive Path A as computationally intractable
- Bundle fast_score.py and the full-val legality proof
- Fix the trie marginalization formula to reflect the continuable-mass implementation
- Update submission.json with full-val fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
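For context, the standard definition that the byte-level vs token-level distinction turns on (my paraphrase of the usual formula, not the PR's exact notation): with the next-token cross-entropy summed in nats over the eval tokens,

$$\mathrm{bpb} \;=\; \frac{\sum_t -\ln p(x_t \mid x_{<t})}{N_{\text{bytes}} \cdot \ln 2},$$

i.e. total nats converted to bits and divided by the raw byte count of the eval text. Token-level bits-per-token divides by the token count instead, which makes it tokenizer-dependent; dividing by bytes keeps scores comparable across tokenizers.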
Same 3143-line code as v2; only the Hyperparameters defaults changed, to match the PR openai#1851 stack tuning observed in stage-1/2 ablations:

SEED=42 MIN_LR=0.1 TTT_BATCH_SIZE=16 PHASED_TTT_NUM_PHASES=3 GPTQ_RESERVE_SECONDS=16
EMBED_BITS=7 EMBED_CLIP_SIGMAS=15 MLP_CLIP_SIGMAS=12
SMEAR_GATE_ENABLED=1 GATED_ATTN_QUANT_GATE=1 SPARSE_ATTN_GATE_ENABLED=1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cocohearts
left a comment
Accepted on substance for the original SmearGate BOS-fix submission, but this PR is not merge-ready in its current scope. It now contains a second, later AWQ-lite submission directory, records/track_10min_16mb/2020-04-29_AWQ_lite_mixedprecision_GPTQ, with an invalid 2020 date and a separate ML change that was not part of the accepted #1851 leaderboard row. Please split that newer AWQ-lite work into a separate PR or remove it from this one. This PR should merge only the accepted 2026-04-27 BOS-fix record package, with its 3-seed support either included directly or clearly tied to #1868.
3-seed reproduction of PR #1851 (SmearGate BOS document boundary fix). Code is byte-identical to #1851 by @aquariouseworkman.

Results (post-TTT BPB):
- Seed 42: 1.06128 (original #1851 author)
- Seed 314: 1.06087 (this submission)
- Seed 1234: 1.06220 (this submission)
- Mean: 1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lone openai#1851

Part 1: BOS-fixed SmearGate + per-head attn output gate ported onto the PR1493 wd_strong_paired baseline (15+/-6 lines in train_pr1493.py). 5 new env vars: SMEARGATE_{ENABLED,BOS_ID,INIT}, ATTN_GATE_{ENABLED,INIT}. SmearGate is causal previous-token mixing with the BOS document-boundary mask from PR openai#1851: at positions where input_ids == bos_id, the smear contribution is forced to zero so the final token of doc N cannot leak into the BOS of doc N+1. Verified by a focused unit test. The per-head attn_gate is added inside CausalSelfAttention, applied to the flash_attn output before XSA. smeargate.smear_gate is a top-level GPT parameter, so it is explicitly appended to Optimizers.scalar_params (not picked up by the blocks-only loop). CONTROL_TENSOR_NAME_PATTERNS extended; 100% optimizer coverage verified.

Real-run results (single seed s42, 8xH100):

| variant | pre | q | q_sw | q_ttt | d_qttt |
|---------|-----|---|------|-------|--------|
| baseline (wd_strong_paired) | 1.08573 | 1.09874 | 1.08194 | 1.07971 | -- |
| smear+attn_gate1d (sigmoid) | 1.08663 | 1.09887 | 1.08220 | 1.08052 | +0.00081 |
| smearonly (gate off) | 1.08601 | 1.09834 | 1.08170 | 1.07998 | +0.00027 |
| smear_gate2d (additive) | killed mid-train (~step 4000, val 1.1051) | | | | |

The 1D per-head sigmoid gate (8 params/layer) is undercapacity vs upstream PR openai#1667's 96 params/layer, and is +0.00090 worse pre-quant -- a real regression in the trained model. SmearGate alone improves q (-0.00040) and q_sw (-0.00024) but disrupts our SGD TTT lift (0.0017 vs 0.0022 baseline); net q_ttt is within seed noise. The artifact stays >16 MB (the added code costs ~7 KB; still busts like baseline). Conclusion: the port is mechanically correct; it just doesn't help on the PR1493 base without the rest of the top stack (LQER, phased TTT, CaseOps).

Part 2: Critical leaderboard analysis. PR openai#1855 and PR openai#1851 are both verified-merged by maintainer cocohearts and listed on the README. PR openai#1855 has an OPEN val_docs=10_000 vs canonical 50_000 dispute (jfc43, 2026-04-30, unresolved) that affects the entire CaseOps chain (PRs 1736/1769/1787/1851/1855/1868). If the ruling lands against, all six fall and the PR1493 family returns to the top -- so building on PR1493 is a hedged investment. Real pre/q/q_ttt comparison vs the openai#1855 seed-42 log: their pre=1.06396 vs ours 1.08573 (+0.022 BPB gap at the trained-model level), bigger than the total 0.020 gap. The leaderboard wedge is dominated by training-level wins (CaseOps + SparseAttnGate + the 9-knob hparam stack), not LQER/phased-TTT.

Part 3: Pivot decision. Clone openai#1851's train_gpt.py (152 KB, 3,574 lines) as the new base rather than porting their 2,500+ lines into our 553-line file. openai#1851 was picked over openai#1855 because: same q_ttt within noise (1.06128 vs 1.06108), no lrzip system dep, fewer disputes. Layer only our small PR1493 differentiators (paired-head Muon NS, wd_schedule, gptq_all_reduce). CaseOps shards are already published at romeerp/parameter-golf-caseops-v1 (80 train + val + val_bytes sidecar + tokenizer); saves 1-2 hr of CPU retokenization. Background download in progress at session end.

Plan for next session: reproduce openai#1851 unmodified at s42 (target q_ttt 1.06128 +/- 0.0005); if reproduced, layer paired-head Muon then wd_schedule one at a time; if not reproduced, stop and debug.
Files added:
- pr1493_smeargate_to_top_stack_session.md: full session writeup
- _top_ref/: cached openai#1851 reference files (train_gpt.py, lossless_caps.py, prepare_caseops_data.py, README.md)
- run_smear_*.sh: smear experiment runners
- run_chain_smear_experiments.sh: chain runner
- run_mom97.sh: drafted but superseded
- logs/smear_*.txt + .stdout: full run logs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layers WD_SCHEDULE_ENABLED + low/high factors onto _top_ref/train_gpt.py (PR openai#1851 SmearGate BOS Fix base). Off by default; strict no-op when WD_SCHEDULE_ENABLED=0.

Skips the paired-head Muon NS port: PR openai#1851 uses parameter banks (qo_bank/kv_bank/mlp_*_bank stacked along dim 0) instead of per-layer c_q/c_k weights, so the _head_pair_ns tagging approach from train_pr1493.py does not apply without redesigning the per-bank NS path.

Surgical diff (5 hunks), sketched below:
- 5 env-driven hyperparameters (WD_SCHEDULE_ENABLED, hold/ramp fracs, low/high factors)
- snapshot base_wd per group in Optimizers.__init__ after self.optimizers
- wd_mul(frac) helper next to lr_mul(frac), same hold/ramp shape as train_pr1493
- step_fn signature gains wd_scale=1.0; applies group["weight_decay"] = base_wd * wd_scale
- caller passes wd_mul(frac)

Run with WD_SCHEDULE_ENABLED=1 WD_SCHED_LOW_FACTOR=0.5 WD_SCHED_HIGH_FACTOR=1.75 plus the standard PR1851 env vars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
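A minimal sketch of the `wd_mul` hold/ramp helper and the `step_fn` hook from the diff above, assuming the same hold-then-linear-ramp shape as the existing `lr_mul`; the hold fraction and exact boundaries are assumptions, and train_pr1493.py's version is canonical.

```python
import os

WD_LOW = float(os.environ.get("WD_SCHED_LOW_FACTOR", "0.65"))
WD_HIGH = float(os.environ.get("WD_SCHED_HIGH_FACTOR", "1.5"))
WD_HOLD_FRAC = 0.5  # assumed hold fraction; the real default lives in the script

def wd_mul(frac: float) -> float:
    """Weight-decay multiplier at training fraction frac in [0, 1]:
    hold WD_LOW, then ramp linearly to WD_HIGH."""
    if frac < WD_HOLD_FRAC:
        return WD_LOW
    ramp = (frac - WD_HOLD_FRAC) / (1.0 - WD_HOLD_FRAC)
    return WD_LOW + ramp * (WD_HIGH - WD_LOW)

# Inside step_fn(..., wd_scale=1.0), per param group (base_wd snapshotted
# in Optimizers.__init__):
#     group["weight_decay"] = base_wd[group_idx] * wd_scale
# with the training loop calling step_fn(..., wd_scale=wd_mul(frac)).
```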
Single-seed s42 result for top_wd_strong (WD_SCHEDULE_ENABLED=1, low=0.5, high=1.75 layered onto the PR openai#1851 base): q_ttt = 1.06111. Compared to PR openai#1851's published s42 numbers (1.06128 original / 1.06083 re-run gptq8s), the delta is within half of the published 3-seed std (0.00068) — a no-op at single-seed resolution.

Stage decomposition shows the WD schedule slightly worsened pre (+0.00033 vs PR1855's pre 1.06396) and widened the LQER quant gap (+0.00116 vs PR1855), with phased-LoRA TTT recovering most of the q-stage damage. Sign-flipped from PR1493, where the same WD config gave -0.00037 pre.

Includes a critical inventory of every PR1493-stack technique cross-referenced against PR openai#1851's stack, ranking portability by pragmatic value:

1. GPTQ Hessian all-reduce: HIGH confidence, ~10-line port, expected -0.0005 to -0.0009 BPB. PR openai#1851's collect_hessians (lines 2037-2141) does NOT all-reduce across ranks — the same bug PR1493 had. With PR openai#1851's default gptq_calibration_batches=16, AR is in the regime where it helps (saturates at 128).
2. wd_schedule with default factors (low=0.65, high=1.5): env-var only, a defensive test of whether the WD-schedule mechanism carries at all.
3. Paired-head Muon NS port to the bank architecture: ~80-120 lines of careful porting around qo_bank/kv_bank reshape semantics. Bank-NS already does per-layer NS for free, so the marginal gain is expected to be smaller than PR1493's -0.00055.

Honest ceiling: even with all three layered, expected q_ttt ~1.05970 — clears PR openai#1855 by ~0.00140 BPB but does NOT clear the 0.0024-BPB acceptance bar (0.00140 < 0.0024). The best-case submission is a non-record entry at this stack without something architecture-level we don't have ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR openai#1851's collect_hessians (lines 2037-2150 of _top_ref/train_gpt.py) computes each rank's Hessian on its own data-shard subset (ShuffledSequenceLoader splits files by rank) and divides only by n_calibration_batches. Without an all-reduce, only rank 0's Hessian is effectively used, since only rank 0 writes the quantized blob; 7/8 of the calibration compute is wasted.

Fix: dist.all_reduce(SUM) each Hessian (sorted iteration to avoid deadlock if key order ever drifts), then divide by n_calibration_batches * world_size. Smoking-gun log line "gptq:all-rank Hessian averaging across N ranks (denom=...)" when on, "gptq:per-rank Hessian (no all-reduce, denom=...)" when off. Gated by the GPTQ_ALL_REDUCE env var (default 1, the bugfix behavior). The off path preserves the original upstream semantics for a clean A/B if needed.

PR1493 evidence at gptq_calibration_batches=16 (PR openai#1851's default):
16-shard no-AR: q_ttt = 1.08060
16-shard AR:    q_ttt = 1.07977 (delta -0.00083)
At 128 calibration batches the AR delta saturates to noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
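A minimal sketch of the fix, assuming `hessians` maps layer names to per-rank Hessian accumulators and that torch.distributed is already initialized; names and log strings follow the description above, while the real integration lives in collect_hessians.

```python
import os
import torch.distributed as dist

def finalize_hessians(hessians: dict, n_calibration_batches: int) -> dict:
    all_reduce = os.environ.get("GPTQ_ALL_REDUCE", "1") == "1"
    world_size = dist.get_world_size()
    if all_reduce:
        # Sorted iteration: every rank reduces keys in the same order,
        # avoiding deadlock if dict key order ever drifts across ranks.
        for name in sorted(hessians):
            dist.all_reduce(hessians[name], op=dist.ReduceOp.SUM)
        denom = n_calibration_batches * world_size
        print(f"gptq:all-rank Hessian averaging across {world_size} ranks (denom={denom})")
    else:
        # Original upstream semantics, kept for a clean A/B.
        denom = n_calibration_batches
        print(f"gptq:per-rank Hessian (no all-reduce, denom={denom})")
    for name in hessians:
        hessians[name] /= denom
    return hessians
```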
…1855.py

The previous session's choice of PR openai#1851 over PR openai#1855 was a mistake we inherited. PR openai#1855 is currently openai#1 on the upstream leaderboard at 1.06108 (3-seed mean), 0.00037 BPB ahead of PR openai#1851's 1.06145. PR openai#1855 also ships the per-group lrzip+brotli compressor (COMPRESSOR=pergroup, ~280 KB smaller artifact than brotli) that PR openai#1851 lacks. Without that compressor, even the 9-hparam stack on the PR openai#1851 base busts the 16 MB cap (Run 4 artifact = 16,140,607 B, +140 KB over).

train_top_1855.py = PR openai#1855's train_gpt.py + the same surgical patches we applied to train_top.py: wd_schedule (5 hparams + base_wd snapshot + wd_mul + step_fn injection + caller) and GPTQ_ALL_REDUCE=1 in collect_hessians. 41 lines added, 3 lines modified, syntax OK.

Run 4 evidence (PR openai#1851 + 9 hparams + wd_strong + AR, single seed s42):
pre = 1.06331 (vs Run 0's 1.06429 — best pre of session)
q = 1.07239 (q_gap 0.00908 — tightest gap of session)
artifact = 16,140,607 B (busts cap with brotli; pergroup needed)

lrzip 0.651 installed via add-apt-repository universe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 4 results, single seed s42:
- pre = 1.06331 (best pre of session, beats Run 0's 1.06429 by 0.00098)
- q = 1.07239 (q_gap 0.00908 — tightest gap of session)
- q_ttt = 1.05950 (best q_ttt of session, beats PR openai#1855's published s42 1.05989 by 0.00039)
- artifact = 16,140,607 B (BUSTS the 16 MB cap by 140,607 B with brotli; PR openai#1855's pergroup compressor saves ~280 KB, which is needed for this hparam stack to fit)

Three findings:
1. The 9 hparams transfer cleanly through to final EMA model quality. Contrast with paired-head Muon NS (Run 3): it also gave a striking mid-train signal (-0.0046 at step 4000), but that gain converged out by pre-quant time (+0.00038 vs Run 0). Run 4's mid-train gain (-0.0059) carried through to pre-quant (-0.00098). Mechanism: the 9 hparams change *what's actually being trained* (tighter clipping preserves outliers, longer warmdown reshapes convergence, tuned TTT-LoRA reshapes recovery), not just the optimizer's update direction.
2. Tightest quant gap of the session (0.00908). Tighter MLP/EMBED clipping (11.5/14.0) preserves outliers that the LQER asymmetric int4 rank-4 correction can exploit, on top of AR's narrowing.
3. The artifact busts the cap with brotli alone — confirming PR openai#1855's claim that their pergroup compressor saves ~280 KB on this stack. With brotli, even PR openai#1855 itself would land at ~16,180,000 B. They needed pergroup; we need pergroup.

This run made the case to pivot to the PR openai#1855 base for Run 5. The earlier session's choice of PR openai#1851 (yesterday's "no lrzip dispute" reasoning) is overturned by Run 4's evidence: PR openai#1855 is 0.00037 BPB ahead at 3-seed mean, ships the pergroup compressor we need to fit the cap, and the 9 hparams we manually applied transfer cleanly.

Run 5 (queued, auto-launch when Run 4 GPUs free) = PR openai#1855's full env stack + our wd_strong + AR + COMPRESSOR=pergroup. Expected q_ttt ~1.0590-1.0595 single-seed; 3-seed mean ~1.0593 ± 0.001.

Honest acceptance-bar math:
- SOTA = 1.06108 (PR openai#1855 3-seed mean)
- Bar = SOTA - 0.005 nats ≈ 1.0588
- Run 4 single = 1.05950, +0.00070 short of bar
- Run 5 predicted = 1.0590-1.0595, still 0.0002-0.0007 short

Even best-case Run 5 likely just misses the record bar by about half a sigma. The best plausible outcome is a non-record submission with documented findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_ttt

Run 6 is the pergroup-recovery run from top_run4_pergroup_recovery_runbook.md: keep Run 4's training graph (train_top.py, PR openai#1851 base) and Run 4's hparam stack (9 PR openai#1855 overrides + wd_strong + GPTQ AR), and replace the cap-busting brotli serialization with PR openai#1855's pergroup compressor ported in commit 0209a50.

Result, single seed s42:
- pre = 1.06335 (Run 4 was 1.06331; +0.00005)
- q = 1.07246 (Run 4 was 1.07239; +0.00008)
- q_ttt = 1.05957 (Run 4 was 1.05950; +0.00006)
- total = 15,901,624 B (Run 4 was 16,140,607 B brotli, INVALID +140,607 B); UNDER the 16,000,000 B cap by 98,376 B — VALID

Pergroup saves 240,863 B on the model blob and 238,983 B on total vs brotli on this exact stack. That matches PR openai#1855 README's published "~280 KB savings" claim within tolerance — different runs have different quantized-weight distributions, so brotli/pergroup deltas aren't exactly transportable, but the order of magnitude lines up.

Quality drift between Run 4 and Run 6 is <= 0.00008 BPB across pre/q/q_ttt, below typical pod-to-pod nondeterminism (Run 4 vs PR openai#1855's published s42 differed by 0.00039 even on the "same" stack). The compressor swap is functionally a no-op on quality.

Comparison summary:
- Run 6 vs Run 4 (best, but invalid): +0.00006 BPB worse, but VALID
- Run 6 vs Run 5 (PR openai#1855 base recovery): -0.00053 BPB BETTER, same compressor
- Run 6 vs PR openai#1855 published s42: -0.00033 BPB better, +4365 B
- Run 6 vs PR openai#1855 3-seed mean SOTA: -0.00152 BPB better (~1.7 sigma)
- Run 6 vs acceptance bar (~1.0588): +0.00077 BPB SHORT

So Run 6 is the strongest single-seed valid-size submission of the session. Not yet a record (single-seed, ~half a sigma short of the acceptance bar) but a strong non-record submission with documented wins:
- Validates the ported pergroup compressor end-to-end (synthetic 138-tensor roundtrip preflight + live deserialize during phased TTT eval).
- Confirms the runbook's hypothesis that "preserve Run 4 graph + only swap compressor" beats "preserve compressor + retrain on PR openai#1855 base + apply our patches" (the Run 5 path).
- Reproduces Run 4's quality bit-equivalent within pod noise.

Pod prep this session:
- apt-get install -y lrzip (lrzip 0.651, required by pergroup)
- pip install brotli python-minifier
- snapshot_download romeerp/parameter-golf-caseops-v1 (16 GB) for the canonical sp8192-caseops shards + the canonical fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model SP model; layout matches train_top.py's _default_caseops_data path exactly

Files in this commit:
- top_run6_pergroup_recovery_session.md (full Run 6 report)
- upload_run6_to_hf.py (pushes artifacts to HF)
- logs/top_pr1855_hparams_s42_pergroup.stdout (torchrun stdout/stderr)
- logs/top_pr1855_hparams_s42_pergroup.txt (per-rank training log)

Artifacts pushed to HuggingFace (shikhar007/parameter-golf-gram-ns):
- models/top_pr1855_hparams_s42_pergroup.pt (135.4 MB FP ckpt)
- models/top_pr1855_hparams_s42_pergroup.int6.ptz (15.9 MB pergroup blob)
- logs/top_pr1855_hparams_s42_pergroup.txt
- logs/top_pr1855_hparams_s42_pergroup.stdout

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ean 1.05831 BPB

Clears the record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108), p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under the 600s wallclock budget.

Per-seed:
- 42: ttt=1.05793 art=15,986,149 eval=572.6s
- 314: ttt=1.05852 art=15,987,257 eval=553.7s
- 1234: ttt=1.05849 art=15,989,895 eval=574.1s

The submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/ contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923 -> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition C1-C4 legality check. submission.json author/github_id are placeholders pending the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single 8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added a README for the non-record submission detailing the inhibitory layers on the PR openai#1851 stack, including architecture, mechanism, results, and reproduction steps.
Added README.md for non-record submission detailing inhibitory layers on PR openai#1851 stack, including mechanism, configuration, results, and limitations.
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier, 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default 10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000, docs_train=8181945; sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is the gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
…ence

After user feedback that LEAK calls relied too heavily on lineage inheritance and path heuristics, applied a stricter criterion: a LEAK verdict requires at least one of (a) explicit shell-script invocation of prepare_caseops_data.py without --val-docs=50000, (b) a README "Data setup" matching the actual train-log path, (c) audit/submission.json admission text, (d) a train-log path with `_caseops/datasets/datasets/<name>` triple-nesting OR a single `<root>/datasets/<name>` (which only local prep produces; HF always gives double-nesting). Records that previously got LEAK by lineage inheritance alone are now AMBIGUOUS unless they meet at least one of those tests.

Changes:
- openai#1945 LEAK → CLEAN (finalize_v18.sh has snapshot_download from HF; the actual run path matches the HF target; README's prepare_caseops_data.py section is stale documentation)
- openai#1953 LEAK → AMBIGUOUS (PR ships only train_gpt.py + logs; no prep evidence; path matches the HF target; parent openai#1945 confirmed CLEAN — leans CLEAN but no direct PR evidence)
- openai#2041 LEAK → AMBIGUOUS (no prep invocation; double-nested path consistent with EITHER HF or local prep)
- openai#2075 LEAK → AMBIGUOUS (ships the prep file but no explicit invocation; path matches the HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: the realistic clean SOTA is at most ~0.012 bpb below the claimed frontier openai#2118 (1.04350). Best clean-BPB candidates in order:
- openai#2019 1.05847 (HF, confirmed)
- openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
- openai#1945 1.05943 (HF, confirmed via re-audit)
- openai#2031 1.05985 (HF, confirmed)
- openai#1908 1.06081 (HF, confirmed)
- openai#1851 1.06128 (HF, MERGED SOTA)
Record: SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT
val_bpb = 1.06128 | ~15.95 MB | 8xH100 SXM
Result
Merged SOTA (PR #1493): 1.0810 BPB. Delta: -0.0197 BPB. Clears the 0.005-nat threshold.
3-seed validation (seeds 42 / 314 / 1234) is provided in PR #1868, which evaluates this exact record package. The mean across those three seeds is reported there.
Key Change: SmearGate BOS Document Boundary Fix
Builds on PR #1797 stack (PR #1787 base + SmearGate + LQER Asymmetric) but fixes the SmearGate cross-document leakage bug identified by @cocohearts in PR #1797 audit.
The bug: SmearGate 1-token causal lookback does not mask BOS positions, so the final token of document N smears into BOS of document N+1.
The fix (applied in both forward_logits and forward_ttt):
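The code block for the fix did not survive extraction; a minimal sketch of the 2-line mask, with names assumed from the description above:

```python
# BOS fix: zero the smear gate wherever the current token is BOS (default id 1),
# so doc N's final token cannot smear into doc N+1's BOS in packed batches.
gate = gate * (input_ids != BOS_TOKEN_ID).unsqueeze(-1).to(gate.dtype)
```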
Technique Stack
Architecture
11L x 512d x 8H/4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: layers 3-5 looped x2 (activated at frac=0.35). Parallel residuals from layer 8. XSA on all 11 layers. SmearGate window=12.
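The same line as an illustrative config, purely for readability; field names are assumptions (the training script's Hyperparameters class is the canonical source) and values are copied from the spec above.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layer: int = 11
    n_embd: int = 512
    n_head: int = 8                   # 8 query heads
    n_kv_head: int = 4                # over 4 KV heads (GQA)
    mlp_ratio: int = 4                # MLP 4x
    head_dim: int = 64
    rope_dims: int = 16               # partial RoPE on 16 of 64 head dims
    logit_softcap: float = 30.0
    tied_embeddings: bool = True      # input/output embeddings share weights
    loop_layers: tuple = (3, 4, 5)    # depth recurrence: layers 3-5 looped x2
    loop_activate_frac: float = 0.35  # recurrence switches on at this train frac
    parallel_residual_from: int = 8
    xsa_layers: int = 11              # XSA on all 11 layers
    smear_gate_window: int = 12
    # activation: LeakyReLU(0.5) squared; layerwise LN scale
```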
Compliance
Credits