
Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287 #1801

Open

leon2k2k2k wants to merge 1 commit into openai:main from leon2k2k2k:submission/036-sparse-updated-carry-clean2

Conversation


leon2k2k2k commented Apr 24, 2026

Summary

Results (8×H100 80GB SXM, phased LoRA-TTT, 10-min train / 10-min eval)

| Seed | Steps | Post-EMA (pre-quant) | Quantized | Post-TTT | Artifact (bytes) |
|------|-------|----------------------|-----------|----------|------------------|
| 42   | 4989  | 1.06749              | 1.07678   | 1.06366  | 15,909,254       |
| 0    | 4974  | 1.06685              | 1.07608   | 1.06311  | 15,904,209       |
| 1234 | 4973  | 1.06578              | 1.07509   | 1.06183  | 15,909,401       |
| Mean | 4979  | 1.06671              | 1.07598   | 1.06287  | 15,907,621       |

Frozen Recurrent Carry

The recurrent α/β carry coefficients (first introduced in #1779) were learned end-to-end on a full training run with no validation set involvement, then quantized to 2 decimal places before this promotion run:

  • β = [1.56, 1.85, 2.13]
  • α = [[0.23, 0.04, 0.03], [0.13, −0.34, 0.01], [0.06, 0.19, −0.02]]

Full-precision learned values: β = [1.5610, 1.8531, 2.1320], α = [[0.2314, 0.0388, 0.0347], [0.1260, −0.3438, 0.0145], [0.0557, 0.1934, −0.0172]].

The legality of offline-learned frozen scalars was discussed in #1779 — the data-size budget provides a natural bound on this class of technique.
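
For concreteness, a minimal sketch of how a frozen α/β carry of this shape could wire into a K=3 looped block. Only the constants come from this PR; the mixing rule, the carry seeding, and the `block` callable are assumptions, not the submission's actual code:

```python
import torch

# Frozen carry constants from this PR (quantized to 2 decimal places).
BETA = torch.tensor([1.56, 1.85, 2.13])
ALPHA = torch.tensor([
    [0.23,  0.04,  0.03],
    [0.13, -0.34,  0.01],
    [0.06,  0.19, -0.02],
])

def looped_forward(block, x: torch.Tensor) -> torch.Tensor:
    """Run a shared `block` for K=3 passes with a frozen linear carry.

    Assumed rule (not confirmed by the PR): pass i emits
        beta[i] * block(x) + sum_j alpha[i][j] * carry[j],
    where carry[j] holds the j-th pass's output, seeded with the input
    so that every alpha entry participates from pass 0.
    """
    carry = [x, x, x]
    for i in range(3):
        h = block(x)
        x = BETA[i] * h + sum(ALPHA[i][j] * carry[j] for j in range(3))
        carry[i] = h
    return x
```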

What this adds over #1779

From #1787 (nprime06):

  • Polar Express Newton-Schulz coefficients
  • MIN_LR=0.10 warmdown floor
  • Fused softcapped CE
  • GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0

New in this PR:

  • Sparse attention-output gate — replaces the dense GatedAttn with a narrow-input sparse gate (see the sketch after this list)
  • Updated frozen recurrent carry — α/β re-learned on the sparse-gate stack and frozen to 2 decimal places (values above)
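
A minimal sketch of what the narrow-input sparse gate could look like. The gate width, the sigmoid nonlinearity, and the choice to read the first `narrow` channels are all assumptions; the only claim taken from this PR is that a narrow-input gate replaces the dense GatedAttn:

```python
import torch
import torch.nn as nn

class SparseGate(nn.Module):
    """Gate the attention output from a narrow slice of the block input.

    Replaces an assumed dense d_model -> d_model gate (GatedAttn) with a
    narrow -> d_model projection.
    """
    def __init__(self, d_model: int, narrow: int = 32):
        super().__init__()
        self.narrow = narrow
        self.proj = nn.Linear(narrow, d_model)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # The gate reads only the first `narrow` channels of the input.
        gate = torch.sigmoid(self.proj(x[..., : self.narrow]))
        return gate * attn_out
```

The point of such a design is parameter count: narrow × d_model gate weights instead of d_model² for the dense version.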

Rule Compliance

Test Plan

  • A reviewer can reproduce any single seed with the provided train_gpt.py and env vars
  • Verify artifact size < 16,000,000 bytes in each seed log (a helper sketch follows this list)
  • Verify score-first TTT ordering in the code
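
For the size check, a hypothetical standalone helper (not part of the PR; locating the artifact path is up to the reviewer):

```python
import os

MAX_ARTIFACT_BYTES = 16_000_000  # budget cited in the test plan

def check_artifact(path: str) -> None:
    """Assert the submitted artifact is under the 16,000,000-byte budget."""
    size = os.path.getsize(path)
    assert size < MAX_ARTIFACT_BYTES, f"{path}: {size:,} bytes exceeds budget"
    print(f"{path}: {size:,} bytes OK")
```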

leon2k2k2k marked this pull request as ready for review April 24, 2026 02:10
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
Add a small per-pass LoRA delta on mlp_up and mlp_down for every (pass, layer)
inside the loop band. Tests whether per-pass FFN freedom matters at K=3 on our
50M-param stack.

Lit basis: ALBERT (1909.11942) Table 4 — FFN-tying costs ~1.4-2.8 points of Avg,
while attention-tying is approximately free. Relaxed Recursive Transformers
(2410.20672) and MoLoRA (2512.12880) recover the FFN-tying gap by adding a
per-iteration LoRA. Our existing frozen α/β work (openai#1779/openai#1801) is
the rank-0 scalar version of this; r≥1 is the natural matrix-valued lift.

Code: exp/047C-per-pass-lora-ffn @ 5cf60f9 (forks from
exp/045-loop-layer-improvements @ ece7b76).

Screen rung: 4×H100 mini, 1 seed, training-endpoint val_bpb post-EMA, no
quant. Baseline = 045 AC-fix-rerun @ 1.06479. Wait-and-see on kill threshold
per user direction.
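
A minimal sketch of the per-pass FFN LoRA the commit describes. The mlp_up/mlp_down targets and K=3 come from the commit message; the rank, the zero-init of B, and the wiring are assumptions:

```python
import torch
import torch.nn as nn

class PerPassLoRA(nn.Module):
    """Shared (tied) linear plus a separate rank-r LoRA delta per loop pass."""
    def __init__(self, base: nn.Linear, n_passes: int = 3, r: int = 1):
        super().__init__()
        self.base = base  # tied weights, shared across all passes
        out_f, in_f = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(n_passes, r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_passes, out_f, r))  # zero-init: delta starts at 0

    def forward(self, x: torch.Tensor, pass_idx: int) -> torch.Tensor:
        # Rank-r update for this pass: x @ A^T @ B^T added to the tied output.
        delta = (x @ self.A[pass_idx].T) @ self.B[pass_idx].T
        return self.base(x) + delta
```

Wrapping the tied mlp_up and mlp_down this way adds roughly r·(d_in + d_out) parameters per wrapped linear per pass, which is the freedom the experiment is pricing.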
