
Record: Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean) #1767

Open

renqianluo wants to merge 1 commit into openai:main from renqianluo:record/alpha-144-wd1-1.07209

Conversation

@renqianluo

Summary

Four composable small-LOC changes to BatchedLinearLoRA on top of @dexhunter's 1.07193 phased-TTT code. Everything outside the LoRA module (VarLen attention, Fused MLP, multi-phase global SGD, trimmed GPTQ, triple depth recurrence) is unchanged.

  1. Alpha/rank output scaling: the LoRA output becomes forward(x) * (alpha/rank); see the sketch after this list. Without this, raising the rank diverges on some seeds.
  2. Warm-start A across batches — only B resets between batches, A accumulates feature directions over the ~780 phased-TTT batches.
  3. Raised TTT weight decay 0.5 → 1.0 — counteracts the across-batch A overfit enabled by (2).
  4. Alpha lifted 96 → 144 — scale=1.125 on rank 128 gives LoRA more adaptation strength; (3) keeps it stable.
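
A minimal sketch of how changes (1), (3) and (4) fit together, assuming a plain PyTorch LoRA branch: the class and attribute names are illustrative stand-ins for the repo's BatchedLinearLoRA, and the warm-start reset of change (2) is sketched further down beside the commit that packages it behind an env var.

```python
import torch
import torch.nn as nn

class LoRABranch(nn.Module):
    """Illustrative stand-in for the LoRA part of BatchedLinearLoRA."""
    def __init__(self, d_in: int, d_out: int, rank: int = 128, alpha: float = 144.0):
        super().__init__()
        self.rank, self.alpha = rank, alpha
        self.A = nn.Parameter(torch.randn(d_in, rank) / d_in ** 0.5)
        self.B = nn.Parameter(torch.zeros(rank, d_out))  # delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Changes (1) and (4): scale the LoRA delta by alpha/rank
        # (144 / 128 = 1.125), so raising rank or alpha moves the update
        # magnitude predictably instead of letting it grow with the rank.
        return (x @ self.A @ self.B) * (self.alpha / self.rank)

# Change (3): TTT weight decay raised from 0.5 to 1.0 on the LoRA parameters
# (optimizer type and learning rate here are illustrative, not the repo's).
branch = LoRABranch(d_in=768, d_out=768)
opt = torch.optim.AdamW(branch.parameters(), lr=1e-3, weight_decay=1.0)
```

With the alpha/rank factor in place, adaptation strength is controlled by alpha alone, which is what makes the later 96 → 144 lift a single-knob change.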

Results

| Seed | rank-96 baseline | + alpha 96 | + warm-start A, WD=1 | + alpha 144 |
|------|------------------|------------|----------------------|-------------|
| 1337 | 1.07423 | 1.07379 | 1.07298 | 1.07189 |
| 42   | 1.07341 | 1.07320 | 1.07298 | 1.07248 |
| 314  | 1.07214 | 1.07200 | 1.07203 | 1.07189 |
| Mean | 1.07326 | 1.07300 | 1.07266 | 1.07209 |

The 3-seed mean improves monotonically across each change, and every seed finishes below its rank-96 baseline (seed 314 ticks up by 0.00003 at the warm-start step before recovering at alpha 144).

Compliance

All train ≤596s, eval 455.7–456.7s, artifacts ≤15.94MB. Issue #1017 conditions 1–4 verified.

Attribution

Commit: …eed mean)

Four composable novel changes on top of dexhunter's phased-TTT code:
1. Alpha/rank LoRA scaling enables stable higher rank (128 vs 96)
2. Warm-start LoRA A across batches lets feature directions accumulate
3. Raised TTT weight decay (0.5 -> 1.0) prevents warm-A overfit
4. Alpha lifted 96 -> 144 gives LoRA more adaptation strength; WD keeps it stable

3-seed mean 1.07209 BPB (seeds 1337, 42, 314). The mean improves monotonically
across each of the four changes and every seed ends below its baseline.
Closely approaches dexhunter's 1.07193 despite a different seed set.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 22, 2026
TTT_LORA_ALPHA env var (default 96, spec uses 144). Only zero B on reset;
A accumulates feature directions across batches. Output scaled by alpha/rank.
Validated by renqianluo (openai#1767) and bigbag (openai#1771).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
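
A minimal sketch of the reset rule that commit describes, assuming a hook that runs between phased-TTT batches; only the TTT_LORA_ALPHA env var and the zero-B-only behaviour come from the message above, and the function name is hypothetical.

```python
import os
import torch
import torch.nn as nn

# Default stays at 96 so downstream forks keep their old behaviour;
# this record's spec sets TTT_LORA_ALPHA=144.
TTT_LORA_ALPHA = float(os.environ.get("TTT_LORA_ALPHA", "96"))

@torch.no_grad()
def reset_lora_between_batches(A: nn.Parameter, B: nn.Parameter) -> None:
    # Warm start: only B is zeroed, so the LoRA delta (x @ A @ B) restarts at
    # zero while A keeps the feature directions accumulated over earlier
    # batches. Previously both A and B were re-initialized at this point.
    B.zero_()
```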
renqianluo changed the title from "Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean)" to "Record: Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean)" on Apr 22, 2026
nprime06 added a commit to nprime06/parameter-golf that referenced this pull request Apr 23, 2026
Ports PR openai#1767's TTT-only improvements on top of our training-time wins:
- TTT_LORA_ALPHA=144 (rank-scaled LoRA output, was implicit=96)
- TTT_WARM_START_A=1 (keep A warm across doc resets, was re-init)
- TTT_WEIGHT_DECAY=1.0 (was 0.5)

These are eval-time-only changes: zero training or artifact impact.
Validated via TTT_EVAL_ONLY mode on the same 3 quantized artifacts
from the original training runs (no retraining, no re-quantization).

3-seed post-TTT results (PR openai#1767 TTT on draft-7 artifacts):
  seed 42:   1.06400 (was 1.06444, -0.44 mBPP)
  seed 0:    1.06308 (was 1.06353, -0.45 mBPP)
  seed 1234: 1.06297 (was 1.06336, -0.39 mBPP)
  mean:      1.06335 (was 1.06378, -0.43 mBPP)

train_gpt.py defaults updated to PR openai#1767 values so a fresh
end-to-end torchrun produces the reported 1.06335 directly.
TTT-only logs included in ttt_pr1767/ subdirectory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
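
The env-var names in that port are concrete enough to sketch. Below is a minimal, runnable reading of the eval-only knobs, assuming they are parsed roughly like this; the dataclass is an illustration, not train_gpt.py's actual config object.

```python
import os
from dataclasses import dataclass

@dataclass
class TTTEvalConfig:
    eval_only: bool       # reuse existing quantized artifacts, skip training
    lora_alpha: float     # 144 in this port; upstream forks default to 96
    warm_start_A: bool    # keep LoRA A warm across doc resets
    weight_decay: float   # TTT optimizer weight decay (0.5 -> 1.0)

def config_from_env() -> TTTEvalConfig:
    return TTTEvalConfig(
        eval_only=os.environ.get("TTT_EVAL_ONLY", "0") == "1",
        lora_alpha=float(os.environ.get("TTT_LORA_ALPHA", "144")),
        warm_start_A=os.environ.get("TTT_WARM_START_A", "1") == "1",
        weight_decay=float(os.environ.get("TTT_WEIGHT_DECAY", "1.0")),
    )

if __name__ == "__main__":
    print(config_from_env())  # e.g. TTT_EVAL_ONLY=1 torchrun ... would read these
```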
cocohearts pushed a commit that referenced this pull request Apr 29, 2026
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR #1736 (1.06549), -0.00043 vs PR #1779 (1.06421).

Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on
seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR #1736 counterpart (deltas -1.20 to
-2.27 mBPP).

Changes are fully orthogonal to PR #1779's frozen recurrent α/β and
PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
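
The first bullet in that commit is easiest to picture against the standard zeropower_via_newtonschulz5 loop. The sketch below shows the shape of the per-iteration-coefficient change; the five tuples are placeholders, not the actual Polar Express minimax values from PR #1344.

```python
import torch

# Placeholder coefficients: the old fixed tuple repeated five times. PR #1344
# bakes in five distinct minimax tuples, one per iteration, in this slot.
NS_COEFFS = [
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
]

def zeropower_via_newtonschulz5(G: torch.Tensor) -> torch.Tensor:
    # Approximate orthogonalization of G via Newton-Schulz iterations.
    X = G.bfloat16()
    transpose = G.size(-2) > G.size(-1)
    if transpose:
        X = X.mT
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for a, b, c in NS_COEFFS:  # per-iteration (a, b, c) instead of one fixed tuple
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.mT if transpose else X
```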
cocohearts added a commit that referenced this pull request Apr 29, 2026
… Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335

Merge accepted Parameter Golf record submission #1787.
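
For reference, the "Fused CE" in that record is the Triton fused softcapped cross-entropy described in the commit above. An eager PyTorch version of the computation it fuses might look like the following; the tanh cap form and the cap value are assumptions here, not the repo's constants.

```python
import torch
import torch.nn.functional as F

def softcapped_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                             cap: float = 15.0) -> torch.Tensor:
    # Squash logits smoothly into (-cap, cap) before the loss; a fused kernel
    # folds this into the log-softmax/NLL so the capped logits never need to
    # be materialized separately on the training forward.
    capped = cap * torch.tanh(logits / cap)
    return F.cross_entropy(capped, targets)

# Tiny usage example (shapes illustrative): (tokens, vocab) logits, int targets.
logits = torch.randn(8, 50304)
targets = torch.randint(0, 50304, (8,))
loss = softcapped_cross_entropy(logits, targets)
```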
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… Attn Gate + Fused CE — val_bpb 1.06378

hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026