
Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090#1125

Open
jainpranjal97 wants to merge 1 commit into openai:main from jainpranjal97:submission/xsa-all-qkgain4-lnscale

Conversation

@jainpranjal97

Summary

Non-record submission: 1.1946 BPB on 1×RTX 5090 (60-min, 3699 steps). 45 systematic experiments exploring hyperparameter space and novel architectures.

Key findings for the community:

  • XSA on ALL layers beats XSA on last 4 (-0.0018 BPB). Every top entry uses XSA only on the deepest 3-4 layers, yet in these runs all-layer XSA was consistently better.
  • qk_gain_init = 4.0 (-0.0039 BPB vs default 1.5). Sharper attention patterns help small models significantly. Swept 1.5 → 2.0 → 3.0 → 4.0 with monotonic gains.
  • Warmdown calibration for wallclock-capped training (-0.0078 BPB). Default warmdown_iters=1200 means the LR never reaches full strength when wallclock-capped at 10 min.
  • Pre-quant vs post-quant divergence: XSA Gating (learned per-head gate) achieved 1.1932 pre-quant (better than best) but 1.1961 post int8+zlib (worse). Architectural choices that improve FP loss can degrade quantized loss.
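The warmdown-calibration finding above can be illustrated with a minimal schedule sketch (pure Python; the function name and the constant-then-linear-warmdown shape are assumptions for illustration, not code from train_gpt.py). The point is that the warmdown window must end at the step count the wallclock budget actually allows, otherwise training stops mid-plateau and the decay phase never completes:

```python
def lr_at(step, total_steps, warmdown_iters, base_lr=0.04):
    """Hypothetical schedule: constant LR, then linear warmdown to zero
    over the last `warmdown_iters` steps. Illustrative only."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return base_lr
    # linear decay from base_lr to 0 across the warmdown window
    frac = (total_steps - step) / warmdown_iters
    return base_lr * frac

# Calibrated: warmdown ends exactly at the wallclock-estimated step count,
# so the schedule fully decays instead of being cut off early.
estimated_steps = 3699  # step count from the 60-min run reported above
lr_mid = lr_at(1000, estimated_steps, warmdown_iters=1200)   # still at base LR
lr_last = lr_at(3699, estimated_steps, warmdown_iters=1200)  # fully decayed
```

If `total_steps` is instead set optimistically high, the run is cut off before `warmdown_start` and the LR stays at `base_lr` for the whole run, which is the miscalibration the bullet describes.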

Novel approaches tested (all documented with negative results):

| Approach | Result (ΔBPB) | Why it failed |
| --- | --- | --- |
| Progressive Layer Growing (5→11L at 60%) | +0.0057 | 5L capacity ceiling |
| Depth Recurrence 4×3 + LoRA16 | +0.0753 | torch.compile bypass + optimization conflicts |
| XSA Gating (learned per-head gate) | +0.0015 | Quantizes worse despite better FP loss |
| Cosine warmdown | +0.0039 | Linear warmdown already optimal |

Stack

11L, MLP 3×, Partial RoPE 16/64, LN Scale 1/√(layer+1), XSA all layers, LeakyReLU(0.5)², Muon WD 0.06, seq 2048, grad_clip 0.3, qk_gain 4.0, logit_softcap 20.
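As a rough intuition for the qk_gain 4.0 entry in the stack (an illustrative toy, not an excerpt from the repo's attention code): scaling query-key logits by a larger gain before the softmax concentrates attention mass on the highest-scoring key, i.e. "sharper attention patterns":

```python
import math

def softmax(xs):
    """Numerically stable softmax over a plain list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Toy pre-gain attention logits for one query over four keys.
logits = [0.2, 0.5, 1.0, 0.3]
attn_default = softmax([1.5 * x for x in logits])  # default gain 1.5
attn_sharp = softmax([4.0 * x for x in logits])    # swept-to gain 4.0
```

With gain 4.0 the top key's weight rises from roughly 0.47 to roughly 0.81 in this toy, which is the qualitative effect the sweep exploited.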

Full experiment log with 45 runs in the README.
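Two scalar pieces of the stack can be sketched standalone (illustrative; the softcap form below is a common scaled-tanh convention and is an assumption, not verified against train_gpt.py):

```python
import math

def ln_scale_factor(layer_idx):
    """Per-layer residual damping from the stack: 1/sqrt(layer_idx + 1),
    so deeper layers contribute progressively less."""
    return 1.0 / math.sqrt(layer_idx + 1)

def logit_softcap(z, cap=20.0):
    """Assumed softcap form: squash logits into (-cap, cap) via scaled tanh."""
    return cap * math.tanh(z / cap)

factors = [ln_scale_factor(i) for i in range(11)]  # 11 layers, as in the stack
capped = logit_softcap(100.0)  # a large logit stays just below the cap of 20
```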

Test plan

  • Verified val_bpb 1.1946 on 1×RTX 5090 (60-min run)
  • All 45 experiments logged with reproducible configurations
  • train_gpt.py included and runnable

🤖 Generated with Claude Code

45 systematic experiments on consumer GPU. Key findings:
- XSA on ALL layers beats XSA on last 4 (-0.0018 BPB)
- qk_gain_init=4.0 significantly better than default 1.5 (-0.0039)
- Warmdown calibration critical for wallclock-capped training (-0.0078)
- 4 novel approaches tested and documented (PLG, depth recurrence, XSA gating, cosine warmdown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 1, 2026
Architectural innovations from PR openai#1204 (1.1063 BPB record):
- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
  - Attn reads lane0, MLP reads lane1, learned cross-lane writes
  - parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
  - Delayed activation at step 3000 (avoids disrupting early training)
  - Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 1, 2026
3-seed mean 0.9300 BPB (std 0.0006), beats merged SOTA 1.1194 by 0.189.

Novel mechanisms: scored-position SLOT mask, per-sample delta [bsz,1,dim],
logit bias [bsz,1,vocab], training-data GPTQ calibration, cosine LR schedule.

Base: PR openai#1019. SLOT based on arXiv:2505.12392v2.
Adapted sigmoid-gated skips and Brotli from PR openai#1172, QK-Gain from PR openai#1125.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 3, 2026
Integrates four proven post-March-25 techniques:
- QK-Gain 4.0 (PR openai#1125 sweep)
- XSA all 11 layers (PR openai#1176)
- SLOT per-sample delta + logit bias with scored-position masking (PR openai#1229)
- forward_hidden/compute_logits refactor for SLOT compatibility
PiyushDatta pushed a commit to PiyushDatta/parameter-golf-fork that referenced this pull request Apr 5, 2026
…ents

Key changes:
- Turbo-Muon optimizer (AOL preconditioning, Polar Express coefficients, 4 NS steps)
- Soft-round QAT with sigmoid alpha ramp (1→16), starting at 40% wallclock
- SWA bug fix (was gated by EMA), start_frac=0.7, every=5 steps
- Higher LRs matching baseline: matrix_lr=0.04, scalar_lr=0.04, tied_embed_lr=0.05
- QK_GAIN_INIT=4.0 (PR openai#1125), embed_beta1=0.7, head_beta1=0.7
- Sqrt cooldown schedule, lr_floor=0.05, warmdown_iters=600 for 4xA100
- Int6 quantization (QUANT_BITS=6) with Full Hessian GPTQ
- Best result: exp132 val_bpb=1.2296 (GATED_ATTENTION=0, 1222 steps)
@MatoTeziTanka

Community Review — Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1125 (records/track_non_record_16mb/2026-03-30_XSA-All_QKGain4_LNScale_1x5090/train_gpt.py) is a pure neural submission. Head SHA: 3eff27ce3986cb83e4b50ffc5b474fe45307254a.

Checks

  • N-gram family bug (ILLEGAL: target XOR'd into hash key, NOT BigramHash): NOT PRESENT. No BigramHash, no n-gram table, no hash key, no XOR operation anywhere in the file. The only tgt_ids usage is at lines 270-272 inside eval_val() for byte-counting with base_bytes_lut — this is the standard BPB metric bookkeeping, not a training signal.
  • Pre-Quant TTT — multi-epoch on val_tokens without score-first (ILLEGAL): NOT PRESENT. The eval_val() function runs under torch.inference_mode() (line 256) and never calls .backward(). val_tokens is used only for read-only evaluation. The only .backward() calls are at lines 992 and 1061, both operating on train_loader batches (x, y from train_files). model.train() at line 283 just restores training mode after evaluation — no gradient flows through val data.
  • Score-first TTT / is_last_chunk guard (LEGAL, PR #1413 pattern): NOT PRESENT. No TTT of any kind exists in this submission. No is_last_chunk guard, no score-first logic.
  • Scored-region SLOT (HOLD): NOT PRESENT. No scored-region slot indicators found.

Novel technique

The PR introduces XSA (Cross-Self Attention) via the _xsa_efficient method (lines 594-602), enabled on all attention layers (lines 725-727). It subtracts the self-value projection from the attention output — y = y - (y·v̂)v̂ — in a GQA-aware manner. This is a zero-parameter architectural modification applied during training, not a post-hoc or inference-only trick, and it is pure-neural and architecturally clean. Also includes: LNScale (per-layer ln_scale_factor = 1/sqrt(layer_idx+1) at line 671), QKGain4 (q_gain_init=4.0), and...
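The XSA subtraction described above, y = y - (y·v̂)v̂, can be sketched on plain vectors (illustrative only; the real _xsa_efficient is batched and GQA-aware):

```python
import math

def xsa_subtract(y, v):
    """Remove the component of attention output y along the unit-normalized
    self-value direction v-hat: y <- y - (y · v-hat) v-hat. Zero parameters."""
    norm = math.sqrt(sum(c * c for c in v))
    vhat = [c / norm for c in v]
    dot = sum(a * b for a, b in zip(y, vhat))
    return [a - dot * b for a, b in zip(y, vhat)]

y = [1.0, 2.0, 3.0]       # toy attention output
v = [0.0, 1.0, 0.0]       # toy self-value direction
y_xsa = xsa_subtract(y, v)
```

After the subtraction the output is orthogonal to the self-value direction, which is what makes the result a projection rather than a learned transform.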

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

AjAnubolu added a commit to AjAnubolu/parameter-golf that referenced this pull request Apr 14, 2026
Embedding-space delta optimized with 8 AdamW steps per chunk.
Worse than both sliding window (1.1246) and naive eval (1.1479).

Lesson: SLOT needs L-BFGS in logit space (see exp_075), not AdamW in
embedding space. 8 steps underfits, and the embedding-space loss
surface is non-convex.

Also bumped QK-Gain 1.5 -> 4.0 (free -0.006 BPB from PR openai#1125).