
Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090#1125

Open
jainpranjal97 wants to merge 1 commit into openai:main from jainpranjal97:submission/xsa-all-qkgain4-lnscale

Conversation

@jainpranjal97

Summary

Non-record submission: 1.1946 BPB on 1×RTX 5090 (60-min, 3699 steps). 45 systematic experiments exploring hyperparameter space and novel architectures.

Key findings for the community:

  • XSA on ALL layers beats XSA on last 4 (-0.0018 BPB). Every top entry uses XSA only on the deepest 3-4 layers, yet in these runs all-layer XSA was consistently better.
  • qk_gain_init = 4.0 (-0.0039 BPB vs default 1.5). Sharper attention patterns help small models significantly. Swept 1.5 → 2.0 → 3.0 → 4.0 with monotonic gains.
  • Warmdown calibration for wallclock-capped training (-0.0078 BPB). Default warmdown_iters=1200 means the LR never reaches full strength when wallclock-capped at 10 min.
  • Pre-quant vs post-quant divergence: XSA Gating (learned per-head gate) achieved 1.1932 pre-quant (better than best) but 1.1961 post int8+zlib (worse). Architectural choices that improve FP loss can degrade quantized loss.
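The warmdown-calibration finding above can be illustrated with a minimal schedule sketch (pure Python; the function name and the constant-then-linear-warmdown shape are assumptions for illustration, not code from train_gpt.py). The point is that the warmdown window must end at the step count the wallclock budget actually allows, otherwise training stops mid-plateau and the decay phase never completes:

```python
def lr_at(step, total_steps, warmdown_iters, base_lr=0.04):
    """Hypothetical schedule: constant LR, then linear warmdown to zero
    over the last `warmdown_iters` steps. Illustrative only."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    # linear decay from base_lr to 0 across the warmdown window
    frac = (total_steps - step) / warmdown_iters
    return base_lr * frac

# Calibrated: warmdown ends exactly at the wallclock-estimated step count,
# so the schedule fully decays instead of being cut off early.
estimated_steps = 3699  # step count from the 60-min run reported above
lr_mid = lr_at(1000, estimated_steps, warmdown_iters=1200)   # still at base LR
lr_last = lr_at(3699, estimated_steps, warmdown_iters=1200)  # fully decayed
```

If `total_steps` is instead set optimistically high, the run is cut off before `warmdown_start` and the LR stays at `base_lr` for the whole run, which is the miscalibration the bullet describes.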

Novel approaches tested (all documented with negative results):

| Approach | Result (ΔBPB) | Why it failed |
| --- | --- | --- |
| Progressive Layer Growing (5→11L at 60%) | +0.0057 | 5L capacity ceiling |
| Depth Recurrence 4×3 + LoRA16 | +0.0753 | torch.compile bypass + optimization conflicts |
| XSA Gating (learned per-head gate) | +0.0015 | Quantizes worse despite better FP loss |
| Cosine warmdown | +0.0039 | Linear warmdown already optimal |

Stack

11L, MLP 3×, Partial RoPE 16/64, LN Scale 1/√(layer+1), XSA all layers, LeakyReLU(0.5)², Muon WD 0.06, seq 2048, grad_clip 0.3, qk_gain 4.0, logit_softcap 20.
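As a rough intuition for the qk_gain 4.0 entry in the stack (an illustrative toy, not an excerpt from the repo's attention code): scaling query-key logits by a larger gain before the softmax concentrates attention mass on the highest-scoring key, i.e. "sharper attention patterns":

```python
import math

def softmax(xs):
    """Numerically stable softmax over a plain list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy pre-gain attention logits for one query over four keys.
logits = [0.2, 0.5, 1.0, 0.3]
attn_default = softmax([1.5 * x for x in logits])  # default gain 1.5
attn_sharp = softmax([4.0 * x for x in logits])    # swept-to gain 4.0
```

With gain 4.0 the top key's weight rises from roughly 0.47 to roughly 0.81 in this toy, which is the qualitative effect the sweep exploited.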

Full experiment log with 45 runs in the README.
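Two scalar pieces of the stack can be sketched standalone (illustrative; the softcap form below is a common scaled-tanh convention and is an assumption, not verified against train_gpt.py):

```python
import math

def ln_scale_factor(layer_idx):
    """Per-layer residual damping from the stack: 1/sqrt(layer_idx + 1),
    so deeper layers contribute progressively less."""
    return 1.0 / math.sqrt(layer_idx + 1)

def logit_softcap(z, cap=20.0):
    """Assumed softcap form: squash logits into (-cap, cap) via scaled tanh."""
    return cap * math.tanh(z / cap)

factors = [ln_scale_factor(i) for i in range(11)]  # 11 layers, as in the stack
capped = logit_softcap(100.0)  # a large logit stays just below the cap of 20
```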

Test plan

  • Verified val_bpb 1.1946 on 1×RTX 5090 (60-min run)
  • All 45 experiments logged with reproducible configurations
  • train_gpt.py included and runnable

🤖 Generated with Claude Code

45 systematic experiments on consumer GPU. Key findings:
- XSA on ALL layers beats XSA on last 4 (-0.0018 BPB)
- qk_gain_init=4.0 significantly better than default 1.5 (-0.0039)
- Warmdown calibration critical for wallclock-capped training (-0.0078)
- 4 novel approaches tested and documented (PLG, depth recurrence, XSA gating, cosine warmdown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 1, 2026
Architectural innovations from PR openai#1204 (1.1063 BPB record):
- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
  - Attn reads lane0, MLP reads lane1, learned cross-lane writes
  - parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
  - Delayed activation at step 3000 (avoids disrupting early training)
  - Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 1, 2026
3-seed mean 0.9300 BPB (std 0.0006), beats merged SOTA 1.1194 by 0.189.

Novel mechanisms: scored-position SLOT mask, per-sample delta [bsz,1,dim],
logit bias [bsz,1,vocab], training-data GPTQ calibration, cosine LR schedule.

Base: PR openai#1019. SLOT based on arXiv:2505.12392v2.
Adapted sigmoid-gated skips and Brotli from PR openai#1172, QK-Gain from PR openai#1125.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 3, 2026
Integrates four proven post-March-25 techniques:
- QK-Gain 4.0 (PR openai#1125 sweep)
- XSA all 11 layers (PR openai#1176)
- SLOT per-sample delta + logit bias with scored-position masking (PR openai#1229)
- forward_hidden/compute_logits refactor for SLOT compatibility
PiyushDatta pushed a commit to PiyushDatta/parameter-golf-fork that referenced this pull request Apr 5, 2026
…ents

Key changes:
- Turbo-Muon optimizer (AOL preconditioning, Polar Express coefficients, 4 NS steps)
- Soft-round QAT with sigmoid alpha ramp (1→16), starting at 40% wallclock
- SWA bug fix (was gated by EMA), start_frac=0.7, every=5 steps
- Higher LRs matching baseline: matrix_lr=0.04, scalar_lr=0.04, tied_embed_lr=0.05
- QK_GAIN_INIT=4.0 (PR openai#1125), embed_beta1=0.7, head_beta1=0.7
- Sqrt cooldown schedule, lr_floor=0.05, warmdown_iters=600 for 4xA100
- Int6 quantization (QUANT_BITS=6) with Full Hessian GPTQ
- Best result: exp132 val_bpb=1.2296 (GATED_ATTENTION=0, 1222 steps)
@MatoTeziTanka

Community Review — Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1125 (records/track_non_record_16mb/2026-03-30_XSA-All_QKGain4_LNScale_1x5090/train_gpt.py) is a pure neural submission. Head SHA: 3eff27ce3986cb83e4b50ffc5b474fe45307254a.

Checks

  • N-gram family bug (ILLEGAL: target XOR'd into hash key, NOT BigramHash): NOT PRESENT. No BigramHash, no n-gram table, no hash key, no XOR operation anywhere in the file. The only tgt_ids usage is at lines 270-272 inside eval_val() for byte-counting with base_bytes_lut — this is the standard BPB metric bookkeeping, not a training signal.
  • Pre-Quant TTT — multi-epoch on val_tokens without score-first (ILLEGAL): NOT PRESENT. The eval_val() function runs under torch.inference_mode() (line 256) and never calls .backward(). val_tokens is used only for read-only evaluation. The only .backward() calls are at lines 992 and 1061, both operating on train_loader batches (x, y from train_files). model.train() at line 283 just restores training mode after evaluation — no gradient flows through val data.
  • Score-first TTT / is_last_chunk guard (LEGAL, PR #1413 pattern): NOT PRESENT. No TTT of any kind exists in this submission. No is_last_chunk guard, no score-first logic.
  • Scored-region SLOT (HOLD): NOT PRESENT. No scored-region slot indicators found.

Novel technique

The PR introduces XSA (Cross-Self Attention) via the _xsa_efficient method (lines 594-602), enabled on all attention layers (lines 725-727). It subtracts the self-value projection from the attention output — y = y - (y·v̂)v̂ — in a GQA-aware manner. This is a zero-parameter architectural modification applied during training, not a post-hoc or inference-only trick, and it is pure-neural and architecturally clean. Also includes: LNScale (per-layer ln_scale_factor = 1/sqrt(layer_idx+1) at line 671), QKGain4 (q_gain_init=4.0), and...
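The XSA subtraction described above, y = y - (y·v̂)v̂, can be sketched on plain vectors (illustrative only; the real _xsa_efficient is batched and GQA-aware):

```python
import math

def xsa_subtract(y, v):
    """Remove the component of attention output y along the unit-normalized
    self-value direction v-hat: y <- y - (y · v-hat) v-hat. Zero parameters."""
    norm = math.sqrt(sum(c * c for c in v))
    vhat = [c / norm for c in v]
    dot = sum(a * b for a, b in zip(y, vhat))
    return [a - dot * b for a, b in zip(y, vhat)]

y = [1.0, 2.0, 3.0]       # toy attention output
v = [0.0, 1.0, 0.0]       # toy self-value direction
y_xsa = xsa_subtract(y, v)
```

After the subtraction the output is orthogonal to the self-value direction, which is what makes the result a projection rather than a learned transform.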

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

AjAnubolu added a commit to AjAnubolu/parameter-golf that referenced this pull request Apr 14, 2026
Embedding-space delta optimized with 8 AdamW steps per chunk.
Worse than both sliding window (1.1246) and naive eval (1.1479).

Lesson: SLOT needs L-BFGS in logit space (see exp_075), not AdamW in
embedding space. 8 steps underfits, and the embedding-space loss
surface is non-convex.

Also bumped QK-Gain 1.5 -> 4.0 (free -0.006 BPB from PR openai#1125).