
Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287 #1801

Open

leon2k2k2k wants to merge 1 commit into openai:main from leon2k2k2k:submission/036-sparse-updated-carry-clean2

Conversation


leon2k2k2k commented Apr 24, 2026

Summary

Results (8×H100 80GB SXM, phased LoRA-TTT, 10-min train / 10-min eval)

| Seed | Steps | Post-EMA (pre-quant) | Quantized | Post-TTT | Artifact (bytes) |
|------|-------|----------------------|-----------|----------|------------------|
| 42   | 4989  | 1.06749              | 1.07678   | 1.06366  | 15,909,254       |
| 0    | 4974  | 1.06685              | 1.07608   | 1.06311  | 15,904,209       |
| 1234 | 4973  | 1.06578              | 1.07509   | 1.06183  | 15,909,401       |
| Mean | 4979  | 1.06671              | 1.07598   | 1.06287  | 15,907,621       |

Frozen Recurrent Carry

The recurrent α/β carry coefficients (first introduced in #1779) were learned end-to-end on a full training run with no validation set involvement, then quantized to 2 decimal places before this promotion run:

  • β = [1.56, 1.85, 2.13]
  • α = [[0.23, 0.04, 0.03], [0.13, −0.34, 0.01], [0.06, 0.19, −0.02]]

Full-precision learned values: β = [1.5610, 1.8531, 2.1320], α = [[0.2314, 0.0388, 0.0347], [0.1260, −0.3438, 0.0145], [0.0557, 0.1934, −0.0172]].

The legality of offline-learned frozen scalars was discussed in #1779 — the data-size budget provides a natural bound on this class of technique.
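
For concreteness, a minimal sketch of how a frozen α/β carry of this shape could wire into a K=3 looped block. Only the constants come from this PR; the mixing rule, the carry seeding, and the `block` callable are assumptions, not the submission's actual code:

```python
import torch

# Frozen carry constants from this PR (quantized to 2 decimal places).
BETA = torch.tensor([1.56, 1.85, 2.13])
ALPHA = torch.tensor([
    [0.23,  0.04,  0.03],
    [0.13, -0.34,  0.01],
    [0.06,  0.19, -0.02],
])

def looped_forward(block, x: torch.Tensor) -> torch.Tensor:
    """Run a shared `block` for K=3 passes with a frozen linear carry.

    Assumed rule (not confirmed by the PR): pass i emits
        beta[i] * block(x) + sum_j alpha[i][j] * carry[j],
    where carry[j] holds the j-th pass's output, seeded with the input
    so that every alpha entry participates from pass 0.
    """
    carry = [x, x, x]
    for i in range(3):
        h = block(x)
        x = BETA[i] * h + sum(ALPHA[i][j] * carry[j] for j in range(3))
        carry[i] = h
    return x
```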

What this adds over #1779

From #1787 (nprime06):

  • Polar Express Newton-Schulz coefficients
  • MIN_LR=0.10 warmdown floor
  • Fused softcapped CE
  • GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0

New in this PR:

  • Sparse attention-output gate — replaces the dense GatedAttn with a narrow-input sparse gate (see the sketch after this list)
  • Updated frozen recurrent carry — α/β re-learned on the sparse-gate stack and frozen to 2 decimal places (values above)
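
A minimal sketch of what the narrow-input sparse gate could look like. The gate width, the sigmoid nonlinearity, and the choice to read the first `narrow` channels are all assumptions; the only claim taken from this PR is that a narrow-input gate replaces the dense GatedAttn:

```python
import torch
import torch.nn as nn

class SparseGate(nn.Module):
    """Gate the attention output from a narrow slice of the block input.

    Replaces an assumed dense d_model -> d_model gate (GatedAttn) with a
    narrow -> d_model projection.
    """
    def __init__(self, d_model: int, narrow: int = 32):
        super().__init__()
        self.narrow = narrow
        self.proj = nn.Linear(narrow, d_model)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # The gate reads only the first `narrow` channels of the input.
        gate = torch.sigmoid(self.proj(x[..., : self.narrow]))
        return gate * attn_out
```

The point of such a design is parameter count: narrow × d_model gate weights instead of d_model² for the dense version.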

Rule Compliance

Test Plan

  • A reviewer can reproduce any single seed with the provided train_gpt.py and env vars
  • Verify artifact size < 16,000,000 bytes in each seed log (a helper sketch follows this list)
  • Verify score-first TTT ordering in the code
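
For the size check, a hypothetical standalone helper (not part of the PR; locating the artifact path is up to the reviewer):

```python
import os

MAX_ARTIFACT_BYTES = 16_000_000  # budget cited in the test plan

def check_artifact(path: str) -> None:
    """Assert the submitted artifact is under the 16,000,000-byte budget."""
    size = os.path.getsize(path)
    assert size < MAX_ARTIFACT_BYTES, f"{path}: {size:,} bytes exceeds budget"
    print(f"{path}: {size:,} bytes OK")
```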

leon2k2k2k marked this pull request as ready for review April 24, 2026 02:10
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
Add a small per-pass LoRA delta on mlp_up and mlp_down for every (pass, layer)
inside the loop band. Tests whether per-pass FFN freedom matters at K=3 on our
50M-param stack.

Lit basis: ALBERT (1909.11942) Table 4 — FFN-tying costs ~1.4-2.8 points of Avg,
while attention-tying is approximately free. Relaxed Recursive Transformers
(2410.20672) and MoLoRA (2512.12880) recover the FFN-tying gap by adding a
per-iteration LoRA. Our existing frozen α/β work (openai#1779/openai#1801) is
the rank-0 scalar version of this; r≥1 is the natural matrix-valued lift.

Code: exp/047C-per-pass-lora-ffn @ 5cf60f9 (forks from
exp/045-loop-layer-improvements @ ece7b76).

Screen rung: 4×H100 mini, 1 seed, training-endpoint val_bpb post-EMA, no
quant. Baseline = 045 AC-fix-rerun @ 1.06479. Wait-and-see on kill threshold
per user direction.
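
A minimal sketch of the per-pass FFN LoRA the commit describes. The mlp_up/mlp_down targets and K=3 come from the commit message; the rank, the zero-init of B, and the wiring are assumptions:

```python
import torch
import torch.nn as nn

class PerPassLoRA(nn.Module):
    """Shared (tied) linear plus a separate rank-r LoRA delta per loop pass."""
    def __init__(self, base: nn.Linear, n_passes: int = 3, r: int = 1):
        super().__init__()
        self.base = base  # tied weights, shared across all passes
        out_f, in_f = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(n_passes, r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_passes, out_f, r))  # zero-init: delta starts at 0

    def forward(self, x: torch.Tensor, pass_idx: int) -> torch.Tensor:
        # Rank-r update for this pass: x @ A^T @ B^T added to the tied output.
        delta = (x @ self.A[pass_idx].T) @ self.B[pass_idx].T
        return self.base(x) + delta
```

Wrapping the tied mlp_up and mlp_down this way adds roughly r·(d_in + d_out) parameters per wrapped linear per pass, which is the freedom the experiment is pricing.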
