
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + CaseOps Tokenizer — val_bpb 1.07462 (3-seed mean)#1755

Open
OE-GOD wants to merge 1 commit into openai:main from OE-GOD:record/sp8192_caseops_legalttt

Conversation

OE-GOD commented Apr 20, 2026

Record Summary

val_bpb = 1.07462 (3-seed mean, std 0.00043) | 8×H100 SXM | max artifact 15,991,629 bytes (< 16,000,000)

3-Seed Results (packed submission)

| Seed | Pre-quant EMA | Quantized | Sliding (Track A) | TTT (Track B) | Artifact bytes |
|------|---------------|-----------|-------------------|---------------|----------------|
| 42   | 1.08393 | 1.09482 | 1.07605 | 1.07447 | 15,991,629 |
| 314  | 1.08467 | 1.09589 | 1.07711 | 1.07521 | 15,990,248 |
| 999  | 1.08384 | 1.09437 | 1.07552 | 1.07418 | 15,989,091 |
| Mean | 1.08415 | 1.09503 | 1.07623 | 1.07462 | |
| Std  | 0.00037 | 0.00064 | 0.00066 | 0.00043 | |

Delta vs merged SOTA (PR #1493)

  • Merged SOTA: 1.08100 BPB (@bigbag)
  • This PR: 1.07462 BPB
  • Delta: −0.00638 BPB = −0.01402 nats/token (2.8× over the 0.005-nat threshold)
  • z ≈ 22.8, p ≪ 0.0001

Contribution

Combines two previously separate directions without inheriting the pre-quant TTT component, which is a Condition-3 violation:

  1. Legal score-first TTT — merged stack from PR #1493 (@bigbag): SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25 + SGD TTT with score-before-update ordering.

  2. Lossless CaseOps tokenizer with byte sidecar — pending PR #1729 (@romeerp): bijective case-folding (TITLE / ALLCAPS / CAPNEXT reserved tokens) + fineweb_val_bytes_*.bin sidecar reporting original UTF-8 byte counts per token.
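To make the "bijective case-folding" idea concrete, here is a minimal, hypothetical sketch of such a transform. The real implementation is lossless_caps.py in PR #1729; the marker characters, function names, and ASCII-only word splitting below are all illustrative assumptions, and bijectivity assumes the marker characters never occur in the input.

```python
import re

# Placeholder marker characters standing in for the TITLE / ALLCAPS / CAPNEXT
# reserved tokens (assumption: these never appear in the raw text).
TITLE, ALLCAPS, CAPNEXT = "\u2e00", "\u2e01", "\u2e02"

def encode(text: str) -> str:
    """Lowercase the text, inserting markers so the transform is invertible."""
    out = []
    for tok in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if not tok.isalpha():
            out.append(tok)                     # non-letter runs pass through
        elif len(tok) > 1 and tok.isupper():
            out.append(ALLCAPS + tok.lower())   # WORLD -> <ALLCAPS>world
        elif tok[:1].isupper() and tok[1:].islower():
            out.append(TITLE + tok.lower())     # Hello -> <TITLE>hello
        else:
            # mixed case (iPhone) or a lone capital (I): mark each capital
            out.append("".join(CAPNEXT + c.lower() if c.isupper() else c
                               for c in tok))
    return "".join(out)

def decode(text: str) -> str:
    """Exact inverse of encode: consume markers and restore capitalization."""
    out, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c == ALLCAPS:
            j = i + 1
            while j < n and text[j].isalpha():
                j += 1
            out.append(text[i + 1:j].upper())
            i = j
        elif c in (TITLE, CAPNEXT):
            if i + 1 < n:
                out.append(text[i + 1].upper())
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

Because decode(encode(s)) == s for any marker-free input, BPB can be scored on the transformed stream while remaining attributable to the original bytes.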

Specific novel work in this PR (~25 lines):

Compliance (Issue #1017 — all 4 conditions)

  • C1 (Strict causal): flash_attn_3_func(..., causal=True); sliding-window eval uses strict prefix; byte sidecar is pre-computed data (shipped as fineweb_val_bytes_*.bin), not runtime state from val tokens.
  • C2 (Full normalized distribution): standard softmax over the full 8192-token vocabulary; logit softcap 30·tanh(x/30) applied uniformly to all logits.
  • C3 (Score-before-update): Each TTT chunk scored under torch.no_grad() before any parameter update. No pre-quant TTT, no SLOT, no score-after-adapt.
  • C4 (Single pass): Each val token scored exactly once; no rescoring, no second pass.
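The C3/C4 ordering above can be sketched with a toy model. This is an illustrative sketch only: the scalar least-squares predictor and the names score / sgd_update / ttt_eval are made up, standing in for the PR's transformer and its torch.no_grad() scoring pass.

```python
def score(theta, x, y):
    """Loss of the current (frozen) parameters on one example."""
    return (theta * x - y) ** 2

def sgd_update(theta, chunk, lr=0.1):
    """One SGD pass over a chunk, applied only AFTER the chunk is scored."""
    for x, y in chunk:
        theta -= lr * 2.0 * (theta * x - y) * x
    return theta

def ttt_eval(chunks, theta):
    """Score each chunk under current params, then adapt on that chunk.

    Every example is scored before it ever influences the parameters
    (the real stack wraps step 1 in torch.no_grad()), so no information
    leaks from a token into its own score (C3), and each token is
    scored exactly once, in a single pass (C4).
    """
    total, count = 0.0, 0
    for chunk in chunks:
        for x, y in chunk:                    # 1) score with frozen params
            total += score(theta, x, y)
            count += 1
        theta = sgd_update(theta, chunk)      # 2) only now update
    return total / count, theta
```

On a drifting stream this ordering still improves over a frozen model, which is why score-first TTT is worth having even without the illegal score-after-adapt variant.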

Additional:

  • No SLOT (any variant), no ETLB, no n-gram cache, no pre-quant TTT.
  • Tokenizer transform is fully reversible (see lossless_caps.py in PR #1729).
  • BPB computed against original UTF-8 bytes via sidecar, not transformed token length.
  • Artifact < 16,000,000 bytes (decimal) on all 3 seeds.
  • Train ≤ 588 s, eval ≤ 497 s per seed.
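The sidecar-based BPB accounting amounts to the following. A minimal sketch, assuming per-token negative log-likelihoods in nats and the per-token original byte counts read from fineweb_val_bytes_*.bin; the function name val_bpb is hypothetical.

```python
import math

def val_bpb(nats_per_token, bytes_per_token):
    """Bits-per-byte against the ORIGINAL UTF-8 bytes.

    bytes_per_token comes from the byte sidecar and records how many
    original UTF-8 bytes each token covers, so the case-folding
    transform cannot shrink the denominator and flatter the score.
    """
    total_nats = sum(nats_per_token)
    total_bytes = sum(bytes_per_token)
    return total_nats / (math.log(2) * total_bytes)
```

For example, a token costing ln 2 nats (exactly 1 bit) that covers 1 original byte contributes 1.0 BPB; covering 2 bytes halves its contribution.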

Attribution

Full details and reproduction instructions in records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/README.md.

…25 + Legal TTT + CaseOps Tokenizer — val_bpb 1.07462 (3-seed mean)

3-seed mean val_bpb: 1.07462 (std 0.00043) on 8×H100 SXM.
Delta vs merged SOTA PR openai#1493 (1.08100): -0.00638 BPB = -0.01402 nats/token.
z≈22.8, p ≪ 0.0001. Clears p<0.01 significance threshold comfortably.

Combines:
- PR openai#1493 @bigbag: merged base stack (legal score-first TTT)
- PR openai#1729 @romeerp: lossless CaseOps tokenizer + byte sidecar (pending)

Deliberately excludes PR openai#1735/openai#1738's pre-quant TTT, which is a
Condition-3 violation per the @MatoTeziTanka / @dexhunter community
review on PR openai#1416.

All 4 conditions from Issue openai#1017 satisfied. Max artifact: 15,991,629
bytes (under 16,000,000 decimal limit on all 3 seeds).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 21, 2026
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT — 6th attempt at illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946) — stable looped LMs via spectral norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025) — backs PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586+openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV