
Record: Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean) #1767

Open

renqianluo wants to merge 1 commit into openai:main from renqianluo:record/alpha-144-wd1-1.07209

Conversation

@renqianluo

Summary

Four composable small-LOC changes to BatchedLinearLoRA on top of @dexhunter's 1.07193 phased-TTT code. Everything outside the LoRA module (VarLen attention, Fused MLP, multi-phase global SGD, trimmed GPTQ, triple depth recurrence) is unchanged.

  1. Alpha/rank output scaling: the LoRA output becomes forward(x) * (alpha/rank); see the sketch after this list. Without this, raising the rank diverges on some seeds.
  2. Warm-start A across batches — only B resets between batches, A accumulates feature directions over the ~780 phased-TTT batches.
  3. Raised TTT weight decay 0.5 → 1.0 — counteracts the across-batch A overfit enabled by (2).
  4. Alpha lifted 96 → 144 — scale=1.125 on rank 128 gives LoRA more adaptation strength; (3) keeps it stable.
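
A minimal sketch of how changes (1), (3) and (4) fit together, assuming a plain PyTorch LoRA branch: the class and attribute names are illustrative stand-ins for the repo's BatchedLinearLoRA, and the warm-start reset of change (2) is sketched further down beside the commit that packages it behind an env var.

```python
import torch
import torch.nn as nn

class LoRABranch(nn.Module):
    """Illustrative stand-in for the LoRA part of BatchedLinearLoRA."""
    def __init__(self, d_in: int, d_out: int, rank: int = 128, alpha: float = 144.0):
        super().__init__()
        self.rank, self.alpha = rank, alpha
        self.A = nn.Parameter(torch.randn(d_in, rank) / d_in ** 0.5)
        self.B = nn.Parameter(torch.zeros(rank, d_out))  # delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Changes (1) and (4): scale the LoRA delta by alpha/rank
        # (144 / 128 = 1.125), so raising rank or alpha moves the update
        # magnitude predictably instead of letting it grow with the rank.
        return (x @ self.A @ self.B) * (self.alpha / self.rank)

# Change (3): TTT weight decay raised from 0.5 to 1.0 on the LoRA parameters
# (optimizer type and learning rate here are illustrative, not the repo's).
branch = LoRABranch(d_in=768, d_out=768)
opt = torch.optim.AdamW(branch.parameters(), lr=1e-3, weight_decay=1.0)
```

With the alpha/rank factor in place, adaptation strength is controlled by alpha alone, which is what makes the later 96 → 144 lift a single-knob change.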

Results

| Seed | rank-96 baseline | + alpha 96 | + warm-start A, WD=1 | + alpha 144 |
|------|------------------|------------|----------------------|-------------|
| 1337 | 1.07423 | 1.07379 | 1.07298 | 1.07189 |
| 42   | 1.07341 | 1.07320 | 1.07298 | 1.07248 |
| 314  | 1.07214 | 1.07200 | 1.07203 | 1.07189 |
| Mean | 1.07326 | 1.07300 | 1.07266 | 1.07209 |

The 3-seed mean improves monotonically across each change, and every seed finishes below its rank-96 baseline (seed 314 ticks up by 0.00003 at the warm-start step before recovering at alpha 144).

Compliance

All train ≤596s, eval 455.7–456.7s, artifacts ≤15.94MB. Issue #1017 conditions 1–4 verified.

Attribution

Commit: …eed mean)

Four composable novel changes on top of dexhunter's phased-TTT code:
1. Alpha/rank LoRA scaling enables stable higher rank (128 vs 96)
2. Warm-start LoRA A across batches lets feature directions accumulate
3. Raised TTT weight decay (0.5 -> 1.0) prevents warm-A overfit
4. Alpha lifted 96 -> 144 gives LoRA more adaptation strength; WD keeps it stable

3-seed mean 1.07209 BPB (seeds 1337, 42, 314). The mean improves monotonically
across each of the four changes and every seed ends below its baseline.
Closely approaches dexhunter's 1.07193 despite a different seed set.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 22, 2026
TTT_LORA_ALPHA env var (default 96, spec uses 144). Only zero B on reset;
A accumulates feature directions across batches. Output scaled by alpha/rank.
Validated by renqianluo (openai#1767) and bigbag (openai#1771).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
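
A minimal sketch of the reset rule that commit describes, assuming a hook that runs between phased-TTT batches; only the TTT_LORA_ALPHA env var and the zero-B-only behaviour come from the message above, and the function name is hypothetical.

```python
import os
import torch
import torch.nn as nn

# Default stays at 96 so downstream forks keep their old behaviour;
# this record's spec sets TTT_LORA_ALPHA=144.
TTT_LORA_ALPHA = float(os.environ.get("TTT_LORA_ALPHA", "96"))

@torch.no_grad()
def reset_lora_between_batches(A: nn.Parameter, B: nn.Parameter) -> None:
    # Warm start: only B is zeroed, so the LoRA delta (x @ A @ B) restarts at
    # zero while A keeps the feature directions accumulated over earlier
    # batches. Previously both A and B were re-initialized at this point.
    B.zero_()
```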
renqianluo changed the title from "Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean)" to "Record: Alpha=144 LoRA + Warm-start A + WD 1.0 — val_bpb 1.07209 (3-seed mean)" on Apr 22, 2026
nprime06 added a commit to nprime06/parameter-golf that referenced this pull request Apr 23, 2026
Ports PR openai#1767's TTT-only improvements on top of our training-time wins:
- TTT_LORA_ALPHA=144 (rank-scaled LoRA output, was implicit=96)
- TTT_WARM_START_A=1 (keep A warm across doc resets, was re-init)
- TTT_WEIGHT_DECAY=1.0 (was 0.5)

These are eval-time-only changes: zero training or artifact impact.
Validated via TTT_EVAL_ONLY mode on the same 3 quantized artifacts
from the original training runs (no retraining, no re-quantization).

3-seed post-TTT results (PR openai#1767 TTT on draft-7 artifacts):
  seed 42:   1.06400 (was 1.06444, -0.44 mBPP)
  seed 0:    1.06308 (was 1.06353, -0.45 mBPP)
  seed 1234: 1.06297 (was 1.06336, -0.39 mBPP)
  mean:      1.06335 (was 1.06378, -0.43 mBPP)

train_gpt.py defaults updated to PR openai#1767 values so a fresh
end-to-end torchrun produces the reported 1.06335 directly.
TTT-only logs included in ttt_pr1767/ subdirectory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
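
The env-var names in that port are concrete enough to sketch. Below is a minimal, runnable reading of the eval-only knobs, assuming they are parsed roughly like this; the dataclass is an illustration, not train_gpt.py's actual config object.

```python
import os
from dataclasses import dataclass

@dataclass
class TTTEvalConfig:
    eval_only: bool       # reuse existing quantized artifacts, skip training
    lora_alpha: float     # 144 in this port; upstream forks default to 96
    warm_start_A: bool    # keep LoRA A warm across doc resets
    weight_decay: float   # TTT optimizer weight decay (0.5 -> 1.0)

def config_from_env() -> TTTEvalConfig:
    return TTTEvalConfig(
        eval_only=os.environ.get("TTT_EVAL_ONLY", "0") == "1",
        lora_alpha=float(os.environ.get("TTT_LORA_ALPHA", "144")),
        warm_start_A=os.environ.get("TTT_WARM_START_A", "1") == "1",
        weight_decay=float(os.environ.get("TTT_WEIGHT_DECAY", "1.0")),
    )

if __name__ == "__main__":
    print(config_from_env())  # e.g. TTT_EVAL_ONLY=1 torchrun ... would read these
```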
cocohearts pushed a commit that referenced this pull request Apr 29, 2026
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR #1736 (1.06549), -0.00043 vs PR #1779 (1.06421).

Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on
seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR #1736 counterpart (deltas -1.20 to
-2.27 mBPP).

Changes are fully orthogonal to PR #1779's frozen recurrent α/β and
PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
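
The first bullet in that commit is easiest to picture against the standard zeropower_via_newtonschulz5 loop. The sketch below shows the shape of the per-iteration-coefficient change; the five tuples are placeholders, not the actual Polar Express minimax values from PR #1344.

```python
import torch

# Placeholder coefficients: the old fixed tuple repeated five times. PR #1344
# bakes in five distinct minimax tuples, one per iteration, in this slot.
NS_COEFFS = [
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
    (3.4445, -4.7750, 2.0315),
]

def zeropower_via_newtonschulz5(G: torch.Tensor) -> torch.Tensor:
    # Approximate orthogonalization of G via Newton-Schulz iterations.
    X = G.bfloat16()
    transpose = G.size(-2) > G.size(-1)
    if transpose:
        X = X.mT
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for a, b, c in NS_COEFFS:  # per-iteration (a, b, c) instead of one fixed tuple
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.mT if transpose else X
```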
cocohearts added a commit that referenced this pull request Apr 29, 2026
… Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335

Merge accepted Parameter Golf record submission #1787.
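
For reference, the "Fused CE" in that record is the Triton fused softcapped cross-entropy described in the commit above. An eager PyTorch version of the computation it fuses might look like the following; the tanh cap form and the cap value are assumptions here, not the repo's constants.

```python
import torch
import torch.nn.functional as F

def softcapped_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                             cap: float = 15.0) -> torch.Tensor:
    # Squash logits smoothly into (-cap, cap) before the loss; a fused kernel
    # folds this into the log-softmax/NLL so the capped logits never need to
    # be materialized separately on the training forward.
    capped = cap * torch.tanh(logits / cap)
    return F.cross_entropy(capped, targets)

# Tiny usage example (shapes illustrative): (tokens, vocab) logits, int targets.
logits = torch.randn(8, 50304)
targets = torch.randint(0, 50304, (8,))
loss = softcapped_cross_entropy(logits, targets)
```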
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… Attn Gate + Fused CE — val_bpb 1.06378

hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026