
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + CaseOps Tokenizer — val_bpb 1.07462 (3-seed mean)#1755

Open
OE-GOD wants to merge 1 commit into openai:main from OE-GOD:record/sp8192_caseops_legalttt

Conversation

OE-GOD commented Apr 20, 2026

Record Summary

val_bpb = 1.07462 (3-seed mean, std 0.00043) | 8×H100 SXM | max artifact 15,991,629 bytes (< 16,000,000)

3-Seed Results (packed submission)

| Seed | Pre-quant EMA | Quantized | Sliding (Track A) | TTT (Track B) | Artifact bytes |
|------|---------------|-----------|-------------------|---------------|----------------|
| 42   | 1.08393 | 1.09482 | 1.07605 | 1.07447 | 15,991,629 |
| 314  | 1.08467 | 1.09589 | 1.07711 | 1.07521 | 15,990,248 |
| 999  | 1.08384 | 1.09437 | 1.07552 | 1.07418 | 15,989,091 |
| Mean | 1.08415 | 1.09503 | 1.07623 | 1.07462 | |
| Std  | 0.00037 | 0.00064 | 0.00066 | 0.00043 | |

Delta vs merged SOTA (PR #1493)

  • Merged SOTA: 1.08100 BPB (@bigbag)
  • This PR: 1.07462 BPB
  • Delta: −0.00638 BPB = −0.01402 nats/token (2.8× over the 0.005-nat threshold)
  • z ≈ 22.8, p ≪ 0.0001

Contribution

Combines two previously separate directions without inheriting the pre-quant TTT component, which is a Condition-3 violation:

  1. Legal score-first TTT — merged stack from PR #1493 (@bigbag): SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25 + SGD TTT with score-before-update ordering.

  2. Lossless CaseOps tokenizer with byte sidecar — pending PR #1729 (@romeerp): bijective case-folding (TITLE / ALLCAPS / CAPNEXT reserved tokens) + fineweb_val_bytes_*.bin sidecar reporting original UTF-8 byte counts per token.
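To make the "bijective case-folding" idea concrete, here is a minimal, hypothetical sketch of such a transform. The real implementation is lossless_caps.py in PR #1729; the marker characters, function names, and ASCII-only word splitting below are all illustrative assumptions, and bijectivity assumes the marker characters never occur in the input.

```python
import re

# Placeholder marker characters standing in for the TITLE / ALLCAPS / CAPNEXT
# reserved tokens (assumption: these never appear in the raw text).
TITLE, ALLCAPS, CAPNEXT = "\u2e00", "\u2e01", "\u2e02"

def encode(text: str) -> str:
    """Lowercase the text, inserting markers so the transform is invertible."""
    out = []
    for tok in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if not tok.isalpha():
            out.append(tok)                     # non-letter runs pass through
        elif len(tok) > 1 and tok.isupper():
            out.append(ALLCAPS + tok.lower())   # WORLD -> <ALLCAPS>world
        elif tok[:1].isupper() and tok[1:].islower():
            out.append(TITLE + tok.lower())     # Hello -> <TITLE>hello
        else:
            # mixed case (iPhone) or a lone capital (I): mark each capital
            out.append("".join(CAPNEXT + c.lower() if c.isupper() else c
                               for c in tok))
    return "".join(out)

def decode(text: str) -> str:
    """Exact inverse of encode: consume markers and restore capitalization."""
    out, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c == ALLCAPS:
            j = i + 1
            while j < n and text[j].isalpha():
                j += 1
            out.append(text[i + 1:j].upper())
            i = j
        elif c in (TITLE, CAPNEXT):
            if i + 1 < n:
                out.append(text[i + 1].upper())
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

Because decode(encode(s)) == s for any marker-free input, BPB can be scored on the transformed stream while remaining attributable to the original bytes.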

Specific novel work in this PR (~25 lines):

Compliance (Issue #1017 — all 4 conditions)

  • C1 (Strict causal): flash_attn_3_func(..., causal=True); sliding-window eval uses strict prefix; byte sidecar is pre-computed data (shipped as fineweb_val_bytes_*.bin), not runtime state from val tokens.
  • C2 (Full normalized distribution): standard softmax over the full 8192-token vocabulary; logit softcap 30·tanh(x/30) applied uniformly to all logits.
  • C3 (Score-before-update): Each TTT chunk scored under torch.no_grad() before any parameter update. No pre-quant TTT, no SLOT, no score-after-adapt.
  • C4 (Single pass): Each val token scored exactly once; no rescoring, no second pass.
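The C3/C4 ordering above can be sketched with a toy model. This is an illustrative sketch only: the scalar least-squares predictor and the names score / sgd_update / ttt_eval are made up, standing in for the PR's transformer and its torch.no_grad() scoring pass.

```python
def score(theta, x, y):
    """Loss of the current (frozen) parameters on one example."""
    return (theta * x - y) ** 2

def sgd_update(theta, chunk, lr=0.1):
    """One SGD pass over a chunk, applied only AFTER the chunk is scored."""
    for x, y in chunk:
        theta -= lr * 2.0 * (theta * x - y) * x
    return theta

def ttt_eval(chunks, theta):
    """Score each chunk under current params, then adapt on that chunk.

    Every example is scored before it ever influences the parameters
    (the real stack wraps step 1 in torch.no_grad()), so no information
    leaks from a token into its own score (C3), and each token is
    scored exactly once, in a single pass (C4).
    """
    total, count = 0.0, 0
    for chunk in chunks:
        for x, y in chunk:                    # 1) score with frozen params
            total += score(theta, x, y)
            count += 1
        theta = sgd_update(theta, chunk)      # 2) only now update
    return total / count, theta
```

On a drifting stream this ordering still improves over a frozen model, which is why score-first TTT is worth having even without the illegal score-after-adapt variant.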

Additional:

  • No SLOT (any variant), no ETLB, no n-gram cache, no pre-quant TTT.
  • Tokenizer transform is fully reversible (see lossless_caps.py in PR #1729).
  • BPB computed against original UTF-8 bytes via sidecar, not transformed token length.
  • Artifact < 16,000,000 bytes (decimal) on all 3 seeds.
  • Train ≤ 588 s, eval ≤ 497 s per seed.
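The sidecar-based BPB accounting amounts to the following. A minimal sketch, assuming per-token negative log-likelihoods in nats and the per-token original byte counts read from fineweb_val_bytes_*.bin; the function name val_bpb is hypothetical.

```python
import math

def val_bpb(nats_per_token, bytes_per_token):
    """Bits-per-byte against the ORIGINAL UTF-8 bytes.

    bytes_per_token comes from the byte sidecar and records how many
    original UTF-8 bytes each token covers, so the case-folding
    transform cannot shrink the denominator and flatter the score.
    """
    total_nats = sum(nats_per_token)
    total_bytes = sum(bytes_per_token)
    return total_nats / (math.log(2) * total_bytes)
```

For example, a token costing ln 2 nats (exactly 1 bit) that covers 1 original byte contributes 1.0 BPB; covering 2 bytes halves its contribution.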

Attribution

Full details and reproduction instructions in records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/README.md.

…25 + Legal TTT + CaseOps Tokenizer — val_bpb 1.07462 (3-seed mean)

3-seed mean val_bpb: 1.07462 (std 0.00043) on 8×H100 SXM.
Delta vs merged SOTA PR openai#1493 (1.08100): -0.00638 BPB = -0.01402 nats/token.
z≈22.8, p ≪ 0.0001. Clears p<0.01 significance threshold comfortably.

Combines:
- PR openai#1493 @bigbag: merged base stack (legal score-first TTT)
- PR openai#1729 @romeerp: lossless CaseOps tokenizer + byte sidecar (pending)

Deliberately excludes PR openai#1735/openai#1738's pre-quant TTT, which is a
Condition-3 violation per the @MatoTeziTanka / @dexhunter community
review on PR openai#1416.

All 4 conditions from Issue openai#1017 satisfied. Max artifact: 15,991,629
bytes (under 16,000,000 decimal limit on all 3 seeds).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 21, 2026
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT — 6th attempt at illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946) — stable looped LMs via spectral norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025) — backs PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586+openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV