
Record: Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT — val_bpb 1.05733 (3-seed mean)#1693

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:dexhunter/casefold-v4-attn-outgate-phased-ttt

Conversation

@dexhunter
Contributor

Summary

Key Innovation

Attention Output Gate (from PR #1667 @MarioPaerle)

Per-head multiplicative gate on the attention output, weight-initialized to zero so all heads pass through at scale 1.0 at init. Implemented as an inline-safe op with .contiguous() barriers so it composes with fullgraph=True torch.compile:

def _apply_attn_out_gate_inline(y, x_orig, gate_w):
    # Gate input: the first 12 channels of the pre-attention residual x_orig.
    gate_in = x_orig[:, :, :12].contiguous()
    # Per-head gate in (0, 2); zero-initialized gate_w gives 2 * sigmoid(0) = 1.0,
    # so every head passes through at exactly scale 1.0 at init.
    gate = (2.0 * torch.sigmoid(F.linear(gate_in, gate_w.to(gate_in.dtype)))).contiguous()
    # Broadcast the per-head gate over each head's output channels.
    return y * gate.unsqueeze(-1)
  • 12 (gate width) x 8 heads x 11 layers = 1,056 new parameters
  • Applied in all three attention paths (standard, parallel-residual, depth-recurrent)
  • Negligible throughput cost (<2%)
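The stated parameter count follows directly from the gate shape (a quick arithmetic check using the dimensions above):

```python
# Per layer, the gate is one (num_heads x gate_width) linear: 8 x 12 = 96 weights.
gate_width, num_heads, num_layers = 12, 8, 11
gate_params = gate_width * num_heads * num_layers
assert gate_params == 1056  # matches the 1,056 new parameters claimed above
```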

SmearGate

Input-dependent per-channel residual mixer (13 parameters). Mixes the current token with the previous token (strictly backward-looking, so causal). Zero-initialized lambda means it starts as the identity on the residual stream.
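A minimal sketch of such a mixer, under the assumption that the 13 parameters are 12 gate weights over the leading channels plus one zero-initialized scalar lambda (names and shapes here are hypothetical; the PR's actual per-channel formulation may differ):

```python
import torch
import torch.nn.functional as F

def smear_gate(x, gate_w, smear_lambda):
    """Mix each token with its predecessor, gated per token.

    x: (B, T, D) residual stream; gate_w: (12,); smear_lambda: scalar.
    Strictly backward-looking (token t only sees token t-1), so causal.
    Zero-initialized smear_lambda makes this the identity at init.
    """
    # Previous-token view: shift right by one position, zero-pad position 0.
    prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
    # Input-dependent gate from the first 12 channels of the current token.
    gate = torch.sigmoid(x[:, :, :12] @ gate_w)  # (B, T)
    return x + smear_lambda * gate.unsqueeze(-1) * prev
```

With smear_lambda at its zero init, the residual stream passes through untouched, matching the identity-at-init claim above.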

Casefold V4 + Multi-Phase Global SGD TTT (from PR #1670)

Same tokenizer and TTT protocol as PR #1670. Casefold legality is pending organizer review at Issue #1604.

Results

| Seed | Pre-TTT BPB | Post-TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|--------------|-----------------|------------------|
| 42   | 1.06633     | 1.05693      | 3.04604         | 15,936,269       |
| 0    | 1.06674     | 1.05730      | 3.04712         | 15,937,514       |
| 1234 | 1.06714     | 1.05777      | 3.04846         | 15,938,772       |
| Mean | 1.06674     | 1.05733      | 3.04721         | 15,937,518       |
| Std  |             | 0.00035      |                 |                  |

Record Comparison

| Submission | val_bpb | Delta BPB vs prior | Delta nats vs prior |
|------------|---------|--------------------|---------------------|
| Merged SOTA (PR #1493) | 1.08100 | - | - |
| PR #1530 @samacqua | 1.07336 | - | - |
| PR #1585 @codemath3000 (casefold leader) | 1.06390 | - | - |
| PR #1670 @dexhunter (casefold v4 + phased TTT) | 1.05970 | -0.00420 vs #1585 | - |
| This PR | 1.05733 | -0.00657 vs #1585 / -0.00237 vs #1670 | -0.01697 vs #1585 / -0.00680 vs #1670 |

Compliance (Issue #1017 Track B)

  • Condition 1 (Causality): Sliding-window eval is strictly causal. AttnOutGate is a position-local sigmoid of the current token's channels. SmearGate mixes with the previous token only (backward-looking).
  • Condition 2 (Normalized distribution): Standard softmax over full vocab. Gates modulate hidden states, not logits.
  • Condition 3 (Score before update): Each phase fully scored under torch.no_grad() before any SGD update.
  • Condition 4 (Single pass): Each token scored exactly once per phase.
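Conditions 3 and 4 together pin down the loop ordering. A hypothetical sketch (assuming `model(inputs, targets)` returns the mean loss; the real harness and the phase boundaries [666, 1333, 2000] are more involved):

```python
import torch

def phased_ttt(model, optimizer, phases):
    """Score-before-update over sequential phases of the eval stream.

    phases: list of iterables of (inputs, targets) batches. Each phase is
    fully scored under no_grad (Condition 3), each token exactly once per
    phase (Condition 4), before any SGD step touches the weights.
    """
    total_loss, total_tokens = 0.0, 0
    for phase in phases:
        # 1) Score this entire phase with frozen weights.
        with torch.no_grad():
            for inputs, targets in phase:
                loss = model(inputs, targets)
                total_loss += loss.item() * targets.numel()
                total_tokens += targets.numel()
        # 2) Only then adapt via SGD on the already-scored phase.
        for inputs, targets in phase:
            optimizer.zero_grad()
            model(inputs, targets).backward()
            optimizer.step()
    return total_loss / total_tokens  # mean nats per token
```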

AttnOutGate and SmearGate are pure training-time architectural additions with trained weights; no eval-time effect beyond the trained weights. Analogous gating constructs have precedent in skip gates (PR #549 family), parallel-lane gating (PR #1204 family), and SmearGate (modded-nanogpt).

Casefold tokenizer: pending organizer review at Issue #1604. The tokenizer is retrained from scratch on casefolded FineWeb — it is not a modification of the standard SP8192 tokenizer. Byte-level BPB is computed correctly via the sentencepiece piece table on the original (non-casefolded) validation bytes.
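The normalization described above reduces to dividing total nats by the original byte count (a sketch of the arithmetic only, not the eval harness itself):

```python
import math

def bits_per_byte(total_nats, total_original_bytes):
    """Summed token-level NLL (nats) -> bits per ORIGINAL byte.

    The denominator is the byte length of the non-casefolded validation
    text, so casefolding the tokenizer cannot shrink the normalizer.
    """
    return total_nats / (total_original_bytes * math.log(2))
```

Consistent with the results table: 3.04721 nats/token at 1.05733 BPB implies roughly 3.04721 / (1.05733 · ln 2) ≈ 4.16 bytes per token on the validation set.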

Lineage

PR #1530 (@samacqua) -> PR #1626 (@dexhunter, multi-phase SGD TTT) -> PR #1670 (@dexhunter, casefold v4 + phased TTT) -> this PR (+ AttnOutGate from PR #1667 @MarioPaerle + SmearGate)

Credits

Test plan

  • 3-seed training on 8xH100 SXM (seeds 42, 0, 1234) — all under 600s train, all under 600s eval
  • All artifacts under 16 MB (max 15,938,772 B / 16,000,000 B)
  • Code-size and hyperparameter consistency verified across all 3 seed logs
  • Score-before-update ordering held across all 3 phases (phase boundaries [666, 1333, 2000])
  • Clears the 0.005-nat bar vs merged SOTA and vs casefold leader
  • Judges verify reproducibility
  • Judges confirm casefold legality (Issue #1604: "Clarify which text normalizations are allowed for custom tokenizers")

…TTT — val_bpb 1.05733 (3-seed mean)

Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate
on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base.
Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42:   val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0:    val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality pending organizer review at Issue openai#1604.
AttnOutGate and SmearGate are pure architectural additions and comply with
all Issue openai#1017 conditions (causality, normalized distribution, score-before-
update, single pass).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 17, 2026
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 18, 2026
…verlay

Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds
provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and
run_all.sh/README alignment; new pin reflects the pipeline-patch commit.

Also records the live-guidance absolute-BPB overlay and 04b deprecation
driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request Apr 19, 2026
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192.
Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without
any test-time adaptation. Single seed 1337; compute-constrained
non-record submission — VM went down before the run log could be pushed
so it is not attached. Metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop
injection, Gemma-style global/local attention, Gram Newton-Schulz) +
PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel +
AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept,
@MarioPaerle reintroduction) + new layered local sliding windows
(512 on early/loop layers, 1024 on post-loop layers, split at index 6).

KV-tying on globals dropped vs PR openai#1674. TTT scaffolding (phased
global-SGD + per-doc LoRA, from PR openai#1693 lineage) remains in the file
for experiments but is disabled by default for this submission.
Meirzhan05 added a commit to Meirzhan05/parameter-golf that referenced this pull request Apr 27, 2026
Per PR openai#1667/openai#1693: a tiny linear (gate_width x num_heads, default 12x8 = 96
weights per layer) projects the first 12 dims of the input into per-head gate
values. Scaled to (0, 2) via 2*sigmoid for symmetric pass-through at zero-init.

Total: 1056 extra params (8 heads x 12 width x 11 layers) — ~1KB at fp16.
Zero-init = identity at start (transparent). Lets each head dynamically
suppress noise per-token. Compatible with depth recurrence, parallel residuals,
XSA, and GPTQ (gate weights pass through as fp16, numel < 65536).
