Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090#1125
Open
jainpranjal97 wants to merge 1 commit into openai:main from
Conversation
45 systematic experiments on consumer GPU. Key findings:

- XSA on ALL layers beats XSA on last 4 (-0.0018 BPB)
- qk_gain_init=4.0 significantly better than default 1.5 (-0.0039)
- Warmdown calibration critical for wallclock-capped training (-0.0078)
- 4 novel approaches tested and documented (PLG, depth recurrence, XSA gating, cosine warmdown)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
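The qk_gain_init finding can be illustrated with a minimal numpy sketch. This assumes the gain multiplies unit-normalized queries before the dot product and acts as the attention temperature; the PR's actual placement of the gain may differ, and `qk_gain_attention` is a hypothetical name.

```python
import numpy as np

def qk_gain_attention(q, k, v, qk_gain=4.0):
    """Attention with unit-normalized q/k scaled by a learnable gain.

    Hypothetical sketch: with q and k normalized, the gain replaces
    the usual 1/sqrt(d) scaling as the softmax temperature. The sweep
    initializes it at 4.0 instead of the default 1.5.
    """
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    scores = qk_gain * (q @ k.swapaxes(-1, -2))
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

A larger initial gain sharpens the attention distribution from the start of training, since cosine-similarity logits are bounded in [-1, 1] before scaling.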
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request on Apr 1, 2026
Architectural innovations from PR openai#1204 (1.1063 BPB record):

- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
- Attn reads lane0, MLP reads lane1, learned cross-lane writes
- parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
- Delayed activation at step 3000 (avoids disrupting early training)
- Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel
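The dual-lane parallel residual described above can be sketched as follows. This is a minimal numpy illustration under assumptions: attention reads lane 0, the MLP reads lane 1, and a learned 2x2 matrix of lambdas mixes their writes back into both lanes. `dual_lane_block` is a hypothetical name; the actual lambda shapes in the commit are per-layer ([N,2,2] and [N,2]).

```python
import numpy as np

def dual_lane_block(lane0, lane1, attn, mlp, post_lambdas, resid_lambdas):
    """One dual-lane residual step (hypothetical sketch).

    post_lambdas:  (2, 2) learned cross-lane write weights
    resid_lambdas: (2,)   learned per-lane residual weights
    """
    a = attn(lane0)   # attention output, read from lane 0
    m = mlp(lane1)    # MLP output, read from lane 1
    new0 = resid_lambdas[0] * lane0 + post_lambdas[0, 0] * a + post_lambdas[0, 1] * m
    new1 = resid_lambdas[1] * lane1 + post_lambdas[1, 0] * a + post_lambdas[1, 1] * m
    return new0, new1
```

Initializing `post_lambdas` near the identity recovers a standard single-lane residual stream, so the cross-lane writes can be learned gradually.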
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 1, 2026
3-seed mean 0.9300 BPB (std 0.0006), beats merged SOTA 1.1194 by 0.189.

Novel mechanisms: scored-position SLOT mask, per-sample delta [bsz,1,dim], logit bias [bsz,1,vocab], training-data GPTQ calibration, cosine LR schedule.

Base: PR openai#1019. SLOT based on arXiv:2505.12392v2. Adapted sigmoid-gated skips and Brotli from PR openai#1172, QK-Gain from PR openai#1125.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
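The per-sample delta and logit bias with a scored-position mask can be sketched as a single adjustment step. This is a hypothetical numpy illustration of the shapes listed in the commit message; `slot_adjust_logits` and its argument names are assumptions, and the actual SLOT optimization of `delta` and `logit_bias` at test time is not shown.

```python
import numpy as np

def slot_adjust_logits(hidden, delta, logit_bias, unembed, pos_mask):
    """Hypothetical SLOT-style test-time adjustment.

    hidden:     (bsz, seq, dim)   frozen-model hidden states
    delta:      (bsz, 1, dim)     per-sample shift, applied only at
                                  scored positions via pos_mask
    logit_bias: (bsz, 1, vocab)   per-sample additive logit bias
    unembed:    (dim, vocab)      output projection
    pos_mask:   (bsz, seq, 1)     1 at scored positions, 0 elsewhere
    """
    h = hidden + delta * pos_mask
    return h @ unembed + logit_bias
```

Masking to scored positions keeps the adjustment from perturbing context tokens that never contribute to the loss.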
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Apr 3, 2026
Integrates four proven post-March-25 techniques:

- QK-Gain 4.0 (PR openai#1125 sweep)
- XSA all 11 layers (PR openai#1176)
- SLOT per-sample delta + logit bias with scored-position masking (PR openai#1229)
- forward_hidden/compute_logits refactor for SLOT compatibility
PiyushDatta pushed a commit to PiyushDatta/parameter-golf-fork that referenced this pull request on Apr 5, 2026
…ents

Key changes:

- Turbo-Muon optimizer (AOL preconditioning, Polar Express coefficients, 4 NS steps)
- Soft-round QAT with sigmoid alpha ramp (1→16), starting at 40% wallclock
- SWA bug fix (was gated by EMA), start_frac=0.7, every=5 steps
- Higher LRs matching baseline: matrix_lr=0.04, scalar_lr=0.04, tied_embed_lr=0.05
- QK_GAIN_INIT=4.0 (PR openai#1125), embed_beta1=0.7, head_beta1=0.7
- Sqrt cooldown schedule, lr_floor=0.05, warmdown_iters=600 for 4xA100
- Int6 quantization (QUANT_BITS=6) with Full Hessian GPTQ
- Best result: exp132 val_bpb=1.2296 (GATED_ATTENTION=0, 1222 steps)
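The sqrt cooldown with an LR floor mentioned above can be sketched as a schedule multiplier. This is a hypothetical reading of the listed hyperparameters (lr_floor=0.05, warmdown_iters=600): full LR until the final warmdown window, then decay as the square root of the remaining fraction, clamped at the floor. The function name and the interpretation of lr_floor as a fraction of peak LR are assumptions.

```python
import math

def sqrt_cooldown(step, total_steps, warmdown_iters=600, lr_floor=0.05):
    """LR multiplier: 1.0 until the last warmdown_iters steps, then
    sqrt decay of the remaining fraction, never below lr_floor."""
    start = total_steps - warmdown_iters
    if step < start:
        return 1.0
    frac = (total_steps - step) / warmdown_iters  # 1.0 -> 0.0 over the window
    return max(lr_floor, math.sqrt(frac))
```

Relative to a linear warmdown, the sqrt shape keeps the LR higher for most of the window and drops steeply only at the very end.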
Community Review — Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary: PR #1125 (
AjAnubolu added a commit to AjAnubolu/parameter-golf that referenced this pull request on Apr 14, 2026
Embedding-space delta optimized with 8 AdamW steps per chunk. Worse than both sliding window (1.1246) and naive eval (1.1479).

Lesson: SLOT needs L-BFGS in logit space (see exp_075), not AdamW in embedding space. 8 steps underfits, and the embedding-space loss surface is non-convex.

Also bumped QK-Gain 1.5 -> 4.0 (free -0.006 BPB from PR openai#1125).
PiyushDatta pushed a commit to PiyushDatta/parameter-golf-fork that referenced this pull request on May 1, 2026
…ents

Key changes:

- Turbo-Muon optimizer (AOL preconditioning, Polar Express coefficients, 4 NS steps)
- Soft-round QAT with sigmoid alpha ramp (1→16), starting at 40% wallclock
- SWA bug fix (was gated by EMA), start_frac=0.7, every=5 steps
- Higher LRs matching baseline: matrix_lr=0.04, scalar_lr=0.04, tied_embed_lr=0.05
- QK_GAIN_INIT=4.0 (PR openai#1125), embed_beta1=0.7, head_beta1=0.7
- Sqrt cooldown schedule, lr_floor=0.05, warmdown_iters=600 for 4xA100
- Int6 quantization (QUANT_BITS=6) with Full Hessian GPTQ
- Best result: exp132 val_bpb=1.2296 (GATED_ATTENTION=0, 1222 steps)
Summary
Non-record submission: 1.1946 BPB on 1×RTX 5090 (60-min, 3699 steps). 45 systematic experiments exploring hyperparameter space and novel architectures.
Key findings for the community:
Novel approaches tested (all documented with negative results):
Stack
11L, MLP 3×, Partial RoPE 16/64, LN Scale 1/√(layer+1), XSA all layers, LeakyReLU(0.5)², Muon WD 0.06, seq 2048, grad_clip 0.3, qk_gain 4.0, logit_softcap 20.
Full experiment log with 45 runs in the README.
Test plan
🤖 Generated with Claude Code