ArjunAutoResearch: MLP 3x + STE int6 QAT + seq4096 + sliding window. val_bpb 1.1632 #66
arjun-krishna1 wants to merge 20 commits into openai:main
Conversation
Stack four proven techniques identified via systematic PR analysis:
- TRAIN_SEQ_LEN=4096 for richer per-step training signal
- Optimizer tuning: Muon momentum 0.99, LRs halved, warmdown 3000
- fp16 tied embedding export (MLP_HIDDEN=992 to stay under 16MB)
- Sliding window eval at stride=64 with 4096-token context windows
Beats naive baseline (1.2244) by 0.041 BPB and all public PRs. Training: 9919 steps at 60ms/step on 8xH100 SXM. Eval: 278s sliding window (within separate 10-min eval budget).
Made-with: Cursor
- seq_len=4096 (4x context, biggest single BPB win)
- Muon momentum 0.99, lower LRs (0.02/0.02/0.03)
- Batch 393K (more steps/min), warmdown 3000
- fp16 tied embedding export (halves quant penalty)
- Defaults to SP-1024 (data exists on HuggingFace, no tokenizer training)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three seeds all clear the 1.2194 threshold (SOTA - 0.005):
- SEED=1337: val_bpb=1.18335372
- SEED=1338: val_bpb=1.18437368
- SEED=1339: val_bpb=1.18481782
Mean=1.18418174, std=0.00075068, t=81.26 (df=2), p<<0.001.
Made-with: Cursor
- Run command now references the full records folder path so it runs correctly from the repo root, as reviewers expect
- Root train_gpt.py reverted to openai/parameter-golf main so the PR only adds the records folder, as required by challenge rules
Made-with: Cursor
Skill now lives at parameter-golf-autoresearch/SKILL.md inside the records submission folder, following the agentskills.io standard (folder name matches the name field, proper frontmatter with metadata). Removed the .cursor/skills/ copy so the PR only touches the records folder.
Made-with: Cursor
…nfig Made-with: Cursor
…_bpb 1.1652
Stack five techniques from systematic PR analysis:
- MLP_MULT=3.0 (hidden=1536) for wider model capacity (from PR openai#70)
- int6 per-row quant on MLP+attn, fp16 tied embed passthrough (from PR openai#70)
- zstd-22 compression (from PR openai#70)
- TRAIN_SEQ_LEN=4096 for richer per-step training signal (from PR openai#65)
- Sliding window eval at stride=64 with compiled forward_logits
Mean val_bpb=1.16520 (std=0.00102, t=92.15, p<<0.001). Three seeds: 1.16615, 1.16532, 1.16412. Artifact: 15.6MB (under 16,000,000 byte cap). Training: 9370 steps at 64ms/step on 8xH100 SXM.
Made-with: Cursor
Time to sleep 😭
Every submission scoring <1.18 BPB uses these EXACT settings. We were running defaults — now matching the winners:
MUON_MOMENTUM: 0.95 → 0.99 (stronger smoothing)
MATRIX_LR: 0.04 → 0.02 (halved, reduces quant gap)
SCALAR_LR: 0.04 → 0.02 (halved)
TIED_EMBED_LR: 0.05 → 0.03 (halved)
WARMDOWN_ITERS: 1200 → 3000 (longer warmdown)
MUON_WARMUP_START: 0.85 → 0.92 (higher start)
MUON_WARMUP_STEPS: 500 → 1500 (3x longer warmup)
These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652), openai#70 (1.1659), openai#65 (1.1808) — all top submissions. Applied to both v5 and v6. Both compile, 1498 lines each.
openai#77, openai#78) Analyzed techniques, ablations, and individual BPB contributions. Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029) are the dominant validated techniques. Several promising combinations remain untested across submissions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Awesome!
thanks man! @chonchiog feel free to fork and build off of it
for sure! @arjun-krishna1 I'm waiting for the runpod credits :) |
Major improvements based on competition intelligence (day 2 PRs):
1. Sliding window eval (stride=256): overlapping windows give each token more context. Free ~0.03 bpb improvement, zero artifact cost. Based on PRs openai#70, openai#77, openai#65.
2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8, allowing bigger models. Based on PRs openai#78, openai#70.
3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives ~0.019 bpb improvement. Based on PRs openai#70, openai#66.
4. Default dim=512 with LR=0.03 (best config from experiments).
5. forward_logits() helper for sliding window (avoids model.forward which returns loss, not logits).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Straight-Through Estimator fake int6 quantization to CastedLinear during training. Forward pass uses quantized weights (int6 per-row); backward passes gradients through the originals. Teaches weight distributions that survive post-training int6 quantization. Composes with existing: seq4096, MLP 3x, fp16 tok_emb, int6+zstd, stride=64.
Three seeds:
- SEED=1337: val_bpb=1.16356083
- SEED=1338: val_bpb=1.16275343
- SEED=1339: val_bpb=1.16337225
Mean=1.16323, std=0.00042, t=230.34 (df=2), p<<0.001. Artifact: 15.3MB (under 16,000,000 byte cap).
Made-with: Cursor
Based on PR openai#66 (ArjunAutoResearch) composition of top techniques:
- Int6 per-row quantization + zstd-22 (~4MB savings vs int8+zlib)
- MLP 3x expansion (hidden=1536) enabled by int6 budget savings
- STE fake int6 QAT in CastedLinear (trains weights to survive quantization)
- Sliding window eval (stride=64, seq_len=4096)
- Tuned optimizer: matrix_lr=0.02, muon_momentum=0.99, warmdown=3000
- fp16 tied embedding passthrough (no embedding quant penalty)
- Seq len 4096, batch tokens 393K
Expected: ~1.163 BPB on 8xH100 (vs baseline 1.2244)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — ArjunAutoResearch: MLP 3x + STE int6 QAT + seq4096 + sliding window. val_bpb 1.1632
BPB: 1.1632 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA …): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=57201 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based …
Hey there! This was super fun, thanks. I took the approach of building an AutoResearch agent harness that could work towards solving this autonomously. Added as an agent skill in my submission for others to build off of!
Built an auto-research pipeline:
ArjunAutoResearch ended up with a final val_bpb of 1.16323 +/- 0.00042 (mean across 3 seeds, p << 0.001).
The artifact size is: 15,265,243 bytes (under 16,000,000).
With more compute (I will apply for more), I would scale this AutoResearch agent by composing approaches from the Medium and Low buckets, having it come up with strategies of its own, and having it research approaches from the internet that people haven't made pull requests for yet.
The approach ArjunAutoResearch came up with composed the following techniques from these pull requests:
Wider MLP (MLP_MULT=3.0, hidden=1536) (from "Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659" #70)
3x MLP expansion enabled by int6 quantization saving ~4MB. 1536 is 64-aligned for optimal H100 matmul tile utilization.
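For concreteness, here is a minimal sketch of the wider MLP block at dim=512 with MLP_MULT=3.0; the module structure and the activation are assumptions, not the actual train_gpt.py code.

```python
import torch.nn as nn

DIM, MLP_MULT = 512, 3.0
HIDDEN = int(DIM * MLP_MULT)   # 1536
assert HIDDEN % 64 == 0        # 64-aligned for H100 matmul tiles

# Placeholder activation; the real block may use a different nonlinearity.
mlp = nn.Sequential(
    nn.Linear(DIM, HIDDEN, bias=False),
    nn.GELU(),
    nn.Linear(HIDDEN, DIM, bias=False),
)
```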
Longer training context (TRAIN_SEQ_LEN=4096) (from "Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556" #65)
4x more context per sequence than the baseline's 1024, significantly improving convergence quality per step.
Optimizer tuning (from "New SOTA attempt (val_bpb=1.2014)" #52)
MUON_MOMENTUM=0.99, learning rates halved, batch 393K, warmdown 3000 steps.
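The concrete values, taken from the tuning commit earlier in this PR's history, are listed below; the constant names follow that commit message and may not match the script's identifiers exactly.

```python
# Tuned settings (old -> new), per the "matching the winners" commit in this PR.
MUON_MOMENTUM     = 0.99   # was 0.95 (stronger smoothing)
MATRIX_LR         = 0.02   # was 0.04 (halved, reduces quant gap)
SCALAR_LR         = 0.02   # was 0.04
TIED_EMBED_LR     = 0.03   # was 0.05
WARMDOWN_ITERS    = 3000   # was 1200
MUON_WARMUP_START = 0.92   # was 0.85
MUON_WARMUP_STEPS = 1500   # was 500
TRAIN_SEQ_LEN     = 4096   # 4x the baseline's 1024
```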
STE fake int6 quantization-aware training (from "Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556" #65)
During training, all CastedLinear weights get fake int6 quantization via a Straight-Through Estimator: the forward pass uses quantized weights, the backward pass routes gradients through the originals. This teaches weight distributions that survive int6 post-training quantization.
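A minimal PyTorch sketch of the STE trick, assuming a symmetric per-row int6 grid; the layer and function names here are illustrative, not the submission's actual CastedLinear implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int6_per_row(w: torch.Tensor) -> torch.Tensor:
    """Snap each row of w to a symmetric int6 grid ([-31, 31]) and dequantize."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    wq = torch.clamp(torch.round(w / scale), -31, 31) * scale
    # Straight-Through Estimator: forward value is wq, but gradients flow to w.
    return w + (wq - w).detach()

class FakeQuantLinear(nn.Linear):
    """Linear layer that trains against fake-int6 weights (quantization-aware)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quant_int6_per_row(self.weight), self.bias)
```

The line w + (wq - w).detach() is the whole trick: the forward pass sees the quantized weights, while autograd treats the expression as the identity in w, so the full-precision weights still receive gradients.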
int6 per-row quantization on MLP+attention (from "Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659" #70)
Mixed precision: int6 on 2D block weights, fp16 passthrough on the tied embedding, zstd-22 compression.
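A hedged sketch of what the export path could look like: per-row int6 codes with fp16 scales for 2D block weights, fp16 passthrough for the tied embedding (assumed here to be named tok_emb.weight), and zstd level-22 compression via the zstandard package. The key names, packing layout, and file format are assumptions, not the submission's exact code.

```python
import torch
import zstandard as zstd

def quantize_int6_per_row(w: torch.Tensor):
    """Return int6 codes (stored in int8 for simplicity) and per-row fp16 scales."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    codes = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
    return codes, scale.squeeze(1).to(torch.float16)

def export_artifact(model: torch.nn.Module, path: str) -> None:
    blobs = []
    for name, p in model.state_dict().items():
        p = p.detach().float().cpu()
        if name.endswith("tok_emb.weight"):      # tied embedding: fp16 passthrough
            blobs.append(p.to(torch.float16).numpy().tobytes())
        elif p.ndim == 2:                        # MLP/attention block weights: int6
            codes, scales = quantize_int6_per_row(p)
            blobs.append(codes.numpy().tobytes() + scales.numpy().tobytes())
        else:                                    # everything else stays fp16
            blobs.append(p.to(torch.float16).numpy().tobytes())
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor(level=22).compress(b"".join(blobs)))
```

A real exporter would also bit-pack the int6 codes (6 bits per value rather than a whole int8 byte) and record shapes/offsets so the loader can rebuild the state dict; those details are omitted here.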
fp16 tied embedding passthrough (from "fp16 tied embedding + warmdown/LR tuning (val_bpb 1.2197)" #42)
The tied embedding doubles as the output head. Keeping it in fp16 eliminates the embedding quantization penalty entirely.
Sliding window evaluation (EVAL_STRIDE=64, TRAIN_SEQ_LEN=4096) (from "Record: Sliding Window Eval (stride=64), val_bpb=1.1925" #50, extended to seq_len=4096)
Each token is scored with up to 4032 tokens of context. Compiled forward_logits for fast eval.
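A rough sketch of the stride-64 sliding-window scoring loop, assuming a forward_logits(tokens) helper that maps a [1, T] tensor of token ids to [1, T, vocab] next-token logits; the helper name comes from the PR, while the loop structure and the bytes accounting are assumptions.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(forward_logits, tokens, window=4096, stride=64, total_bytes=None):
    """tokens: 1D LongTensor of validation token ids. Returns bits per byte."""
    nll_nats, prev_end = 0.0, 1                   # position 0 has no context, start at 1
    for begin in range(0, tokens.numel(), stride):
        end = min(begin + window, tokens.numel())
        logits = forward_logits(tokens[begin:end].unsqueeze(0))    # [1, T, vocab]
        targets = tokens[prev_end:end]                             # only new positions
        preds = logits[0, prev_end - begin - 1 : end - begin - 1]  # logits predicting them
        nll_nats += F.cross_entropy(preds, targets, reduction="sum").item()
        prev_end = end
        if end == tokens.numel():
            break
    total_bytes = total_bytes or (tokens.numel() - 1)   # placeholder: one byte per token
    return nll_nats / math.log(2) / total_bytes
```

The stride-64 overlap is what gives each scored token the long context described above, at the cost of re-running the model on heavily overlapping windows, hence the compiled forward_logits.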
Results
Mean: 1.16323, std: 0.00042. One-sample t-test against threshold 1.2194: t = 230.34 (df=2), p << 0.001.
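The quoted statistics can be reproduced from the three per-seed results listed in the commit history with a one-sample t-test; a short sketch using scipy (assumed available):

```python
# Reproduce the reported mean/std/t from the three seed results (seeds 1337-1339)
# against the 1.2194 threshold. scipy reports a two-sided p; p << 0.001 either way.
from statistics import mean, stdev
from scipy import stats

seed_bpbs = [1.16356083, 1.16275343, 1.16337225]
print(mean(seed_bpbs), stdev(seed_bpbs))          # ~1.16323, ~0.00042
t, p = stats.ttest_1samp(seed_bpbs, popmean=1.2194)
print(abs(t), p)                                  # |t| ~ 230.3 (df=2), p << 0.001
```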
Key numbers (seed 1337)