ArjunAutoResearch: MLP 3x + STE int6 QAT + seq4096 + sliding window. val_bpb 1.1632 #66
arjun-krishna1 wants to merge 20 commits into openai:main
Conversation
Stack four proven techniques identified via systematic PR analysis:
- TRAIN_SEQ_LEN=4096 for richer per-step training signal
- Optimizer tuning: Muon momentum 0.99, LRs halved, warmdown 3000
- fp16 tied embedding export (MLP_HIDDEN=992 to stay under 16MB)
- Sliding window eval at stride=64 with 4096-token context windows
Beats naive baseline (1.2244) by 0.041 BPB and all public PRs. Training: 9919 steps at 60ms/step on 8xH100 SXM. Eval: 278s sliding window (within separate 10-min eval budget).
Made-with: Cursor
- seq_len=4096 (4x context, biggest single BPB win)
- Muon momentum 0.99, lower LRs (0.02/0.02/0.03)
- Batch 393K (more steps/min), warmdown 3000
- fp16 tied embedding export (halves quant penalty)
- Defaults to SP-1024 (data exists on HuggingFace, no tokenizer training)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three seeds all clear the 1.2194 threshold (SOTA - 0.005):
- SEED=1337: val_bpb=1.18335372
- SEED=1338: val_bpb=1.18437368
- SEED=1339: val_bpb=1.18481782
Mean=1.18418174, std=0.00075068, t=81.26 (df=2), p<<0.001.
Made-with: Cursor
- Run command now references the full records folder path so it runs correctly from the repo root, as reviewers expect
- Root train_gpt.py reverted to openai/parameter-golf main so the PR only adds the records folder, as required by challenge rules
Made-with: Cursor
Skill now lives at parameter-golf-autoresearch/SKILL.md inside the records submission folder, following the agentskills.io standard (folder name matches the name field, proper frontmatter with metadata). Removed the .cursor/skills/ copy so the PR only touches the records folder.
Made-with: Cursor
…nfig Made-with: Cursor
…_bpb 1.1652
Stack five techniques from systematic PR analysis:
- MLP_MULT=3.0 (hidden=1536) for wider model capacity (from PR openai#70)
- int6 per-row quant on MLP+attn, fp16 tied embed passthrough (from PR openai#70)
- zstd-22 compression (from PR openai#70)
- TRAIN_SEQ_LEN=4096 for richer per-step training signal (from PR openai#65)
- Sliding window eval at stride=64 with compiled forward_logits
Mean val_bpb=1.16520 (std=0.00102, t=92.15, p<<0.001). Three seeds: 1.16615, 1.16532, 1.16412. Artifact: 15.6MB (under 16,000,000 byte cap). Training: 9370 steps at 64ms/step on 8xH100 SXM.
Made-with: Cursor
Time to sleep 😭
Every submission scoring <1.18 BPB uses these EXACT settings. We were running defaults — now matching the winners:
MUON_MOMENTUM: 0.95 → 0.99 (stronger smoothing)
MATRIX_LR: 0.04 → 0.02 (halved, reduces quant gap)
SCALAR_LR: 0.04 → 0.02 (halved)
TIED_EMBED_LR: 0.05 → 0.03 (halved)
WARMDOWN_ITERS: 1200 → 3000 (longer warmdown)
MUON_WARMUP_START: 0.85 → 0.92 (higher start)
MUON_WARMUP_STEPS: 500 → 1500 (3x longer warmup)
These settings are proven by PR openai#64 (1.0149), openai#66 (1.1652), openai#70 (1.1659), openai#65 (1.1808) — all top submissions. Applied to both v5 and v6. Both compile, 1498 lines each.
openai#77, openai#78) Analyzed techniques, ablations, and individual BPB contributions. Key finding: sliding window eval (~0.034) and int6+wider MLP (~0.029) are the dominant validated techniques. Several promising combinations remain untested across submissions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Awesome!
thanks man! @chonchiog feel free to fork and build off of it
for sure! @arjun-krishna1 I'm waiting for the runpod credits :) |
Major improvements based on competition intelligence (day 2 PRs):
1. Sliding window eval (stride=256): overlapping windows give each token more context. Free ~0.03 bpb improvement, zero artifact cost. Based on PRs openai#70, openai#77, openai#65.
2. Int6 quantization: configurable WEIGHT_QUANT_BITS (default 6) and EMBED_QUANT_BITS (default 8). Saves ~25% artifact space vs int8, allowing bigger models. Based on PRs openai#78, openai#70.
3. MLP 3x expansion: MLP_MULT_NUM=3 (up from 8/3). Wider MLP gives ~0.019 bpb improvement. Based on PRs openai#70, openai#66.
4. Default dim=512 with LR=0.03 (best config from experiments).
5. forward_logits() helper for sliding window (avoids model.forward which returns loss, not logits).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Straight-Through Estimator fake int6 quantization to CastedLinear during training. Forward pass uses quantized weights (int6 per-row); backward passes gradients through the originals. Teaches weight distributions that survive post-training int6 quantization. Composes with existing: seq4096, MLP 3x, fp16 tok_emb, int6+zstd, stride=64.
Three seeds:
- SEED=1337: val_bpb=1.16356083
- SEED=1338: val_bpb=1.16275343
- SEED=1339: val_bpb=1.16337225
Mean=1.16323, std=0.00042, t=230.34 (df=2), p<<0.001. Artifact: 15.3MB (under 16,000,000 byte cap).
Made-with: Cursor
Based on PR openai#66 (ArjunAutoResearch) composition of top techniques:
- Int6 per-row quantization + zstd-22 (~4MB savings vs int8+zlib)
- MLP 3x expansion (hidden=1536) enabled by int6 budget savings
- STE fake int6 QAT in CastedLinear (trains weights to survive quantization)
- Sliding window eval (stride=64, seq_len=4096)
- Tuned optimizer: matrix_lr=0.02, muon_momentum=0.99, warmdown=3000
- fp16 tied embedding passthrough (no embedding quant penalty)
- Seq len 4096, batch tokens 393K
Expected: ~1.163 BPB on 8xH100 (vs baseline 1.2244)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — ArjunAutoResearch: MLP 3x + STE int6 QAT + seq4096 + sliding window. val_bpb 1.1632
BPB: 1.1632 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA …): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=9, vocab=1024, code=57201 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based …
Hey there! This was super fun, thanks. I took the approach of building an AutoResearch agent harness that could work towards solving this autonomously. Added as an agent skill in my submission for others to build off of!
Built an auto-research pipeline:
ArjunAutoResearch ended up with a final val_bpb of 1.16323 +/- 0.00042 (mean across 3 seeds, p << 0.001).
The artifact size is: 15,265,243 bytes (under 16,000,000).
With more compute (I will apply for more), I would scale this AutoResearch agent by composing approaches from the Medium and Low buckets, having it come up with strategies of its own, and having it research approaches from the internet that people haven't made pull requests for yet.
The approach ArjunAutoResearch came up with composed the following techniques from these pull requests:
Wider MLP (MLP_MULT=3.0, hidden=1536) (from "Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659" #70)
3x MLP expansion enabled by int6 quantization saving ~4MB. 1536 is 64-aligned for optimal H100 matmul tile utilization.
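For concreteness, here is a minimal sketch of the wider MLP block at dim=512 with MLP_MULT=3.0; the module structure and the activation are assumptions, not the actual train_gpt.py code.

```python
import torch.nn as nn

DIM, MLP_MULT = 512, 3.0
HIDDEN = int(DIM * MLP_MULT)   # 1536
assert HIDDEN % 64 == 0        # 64-aligned for H100 matmul tiles

# Placeholder activation; the real block may use a different nonlinearity.
mlp = nn.Sequential(
    nn.Linear(DIM, HIDDEN, bias=False),
    nn.GELU(),
    nn.Linear(HIDDEN, DIM, bias=False),
)
```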
Longer training context (TRAIN_SEQ_LEN=4096) (from "Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556" #65)
4x more context per sequence than the baseline's 1024, significantly improving convergence quality per step.
Optimizer tuning (from "New SOTA attempt (val_bpb=1.2014)" #52)
MUON_MOMENTUM=0.99, learning rates halved, batch 393K, warmdown 3000 steps.
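The concrete values, taken from the tuning commit earlier in this PR's history, are listed below; the constant names follow that commit message and may not match the script's identifiers exactly.

```python
# Tuned settings (old -> new), per the "matching the winners" commit in this PR.
MUON_MOMENTUM     = 0.99   # was 0.95 (stronger smoothing)
MATRIX_LR         = 0.02   # was 0.04 (halved, reduces quant gap)
SCALAR_LR         = 0.02   # was 0.04
TIED_EMBED_LR     = 0.03   # was 0.05
WARMDOWN_ITERS    = 3000   # was 1200
MUON_WARMUP_START = 0.92   # was 0.85
MUON_WARMUP_STEPS = 1500   # was 500
TRAIN_SEQ_LEN     = 4096   # 4x the baseline's 1024
```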
STE fake int6 quantization-aware training (from "Record: Mixed Quant Int6/FP16 + SmearGate + OrthoInit + MLP 3x + Sliding Window, val_bpb=1.1556" #65)
During training, all CastedLinear weights get fake int6 quantization via a Straight-Through Estimator: the forward pass uses quantized weights, the backward pass routes gradients through the originals. This teaches weight distributions that survive int6 post-training quantization.
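A minimal PyTorch sketch of the STE trick, assuming a symmetric per-row int6 grid; the layer and function names here are illustrative, not the submission's actual CastedLinear implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int6_per_row(w: torch.Tensor) -> torch.Tensor:
    """Snap each row of w to a symmetric int6 grid ([-31, 31]) and dequantize."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    wq = torch.clamp(torch.round(w / scale), -31, 31) * scale
    # Straight-Through Estimator: forward value is wq, but gradients flow to w.
    return w + (wq - w).detach()

class FakeQuantLinear(nn.Linear):
    """Linear layer that trains against fake-int6 weights (quantization-aware)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quant_int6_per_row(self.weight), self.bias)
```

The line w + (wq - w).detach() is the whole trick: the forward pass sees the quantized weights, while autograd treats the expression as the identity in w, so the full-precision weights still receive gradients.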
int6 per-row quantization on MLP+attention (from "Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659" #70)
Mixed precision: int6 on 2D block weights, fp16 passthrough on the tied embedding, zstd-22 compression.
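A hedged sketch of what the export path could look like: per-row int6 codes with fp16 scales for 2D block weights, fp16 passthrough for the tied embedding (assumed here to be named tok_emb.weight), and zstd level-22 compression via the zstandard package. The key names, packing layout, and file format are assumptions, not the submission's exact code.

```python
import torch
import zstandard as zstd

def quantize_int6_per_row(w: torch.Tensor):
    """Return int6 codes (stored in int8 for simplicity) and per-row fp16 scales."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    codes = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
    return codes, scale.squeeze(1).to(torch.float16)

def export_artifact(model: torch.nn.Module, path: str) -> None:
    blobs = []
    for name, p in model.state_dict().items():
        p = p.detach().float().cpu()
        if name.endswith("tok_emb.weight"):      # tied embedding: fp16 passthrough
            blobs.append(p.to(torch.float16).numpy().tobytes())
        elif p.ndim == 2:                        # MLP/attention block weights: int6
            codes, scales = quantize_int6_per_row(p)
            blobs.append(codes.numpy().tobytes() + scales.numpy().tobytes())
        else:                                    # everything else stays fp16
            blobs.append(p.to(torch.float16).numpy().tobytes())
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor(level=22).compress(b"".join(blobs)))
```

A real exporter would also bit-pack the int6 codes (6 bits per value rather than a whole int8 byte) and record shapes/offsets so the loader can rebuild the state dict; those details are omitted here.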
fp16 tied embedding passthrough (from "fp16 tied embedding + warmdown/LR tuning (val_bpb 1.2197)" #42)
The tied embedding doubles as the output head. Keeping it in fp16 eliminates the embedding quantization penalty entirely.
Sliding window evaluation (EVAL_STRIDE=64, TRAIN_SEQ_LEN=4096) (from "Record: Sliding Window Eval (stride=64), val_bpb=1.1925" #50, extended to seq_len=4096)
Each token is scored with up to 4032 tokens of context. Compiled forward_logits for fast eval.
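A rough sketch of the stride-64 sliding-window scoring loop, assuming a forward_logits(tokens) helper that maps a [1, T] tensor of token ids to [1, T, vocab] next-token logits; the helper name comes from the PR, while the loop structure and the bytes accounting are assumptions.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(forward_logits, tokens, window=4096, stride=64, total_bytes=None):
    """tokens: 1D LongTensor of validation token ids. Returns bits per byte."""
    nll_nats, prev_end = 0.0, 1                   # position 0 has no context, start at 1
    for begin in range(0, tokens.numel(), stride):
        end = min(begin + window, tokens.numel())
        logits = forward_logits(tokens[begin:end].unsqueeze(0))    # [1, T, vocab]
        targets = tokens[prev_end:end]                             # only new positions
        preds = logits[0, prev_end - begin - 1 : end - begin - 1]  # logits predicting them
        nll_nats += F.cross_entropy(preds, targets, reduction="sum").item()
        prev_end = end
        if end == tokens.numel():
            break
    total_bytes = total_bytes or (tokens.numel() - 1)   # placeholder: one byte per token
    return nll_nats / math.log(2) / total_bytes
```

The stride-64 overlap is what gives each scored token the long context described above, at the cost of re-running the model on heavily overlapping windows, hence the compiled forward_logits.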
Results
Mean: 1.16323, std: 0.00042. One-sample t-test against threshold 1.2194: t = 230.34 (df=2), p << 0.001.
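The quoted statistics can be reproduced from the three per-seed results listed in the commit history with a one-sample t-test; a short sketch using scipy (assumed available):

```python
# Reproduce the reported mean/std/t from the three seed results (seeds 1337-1339)
# against the 1.2194 threshold. scipy reports a two-sided p; p << 0.001 either way.
from statistics import mean, stdev
from scipy import stats

seed_bpbs = [1.16356083, 1.16275343, 1.16337225]
print(mean(seed_bpbs), stdev(seed_bpbs))          # ~1.16323, ~0.00042
t, p = stats.ttest_1samp(seed_bpbs, popmean=1.2194)
print(abs(t), p)                                  # |t| ~ 230.3 (df=2), p << 0.001
```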
Key numbers (seed 1337)