Claude/busy thompson 9c94f9 #1

Merged

GodlyDonuts merged 47 commits into main from claude/busy-thompson-9c94f9 on Apr 27, 2026

Conversation

@GodlyDonuts (Owner)

No description provided.

dexhunter and others added 30 commits March 31, 2026 11:15
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
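
For readers outside the thread: the WD-quantization synergy above follows from the fact that higher weight decay shrinks weight magnitudes, which shrinks the step size of a symmetric quantizer and hence the per-weight rounding error. A minimal sketch, assuming plain symmetric per-tensor int6 quantization (the record's actual GPTQ pipeline is more involved):

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor int6: levels in [-31, 31] (illustrative only)."""
    scale = (w.abs().max() / 31.0).item()
    q = torch.clamp(torch.round(w / scale), -31, 31)
    return q, scale

# Higher weight decay -> smaller |w| -> smaller scale -> smaller absolute
# rounding error, which is the "headroom" the message describes.
w = torch.randn(4096, 512)
for shrink in (1.00, 0.95):          # 0.95 mimics a higher-WD checkpoint
    q, s = quantize_int6(w * shrink)
    err = (w * shrink - q * s).abs().mean().item()
    print(f"shrink={shrink:.2f}  scale={s:.5f}  mean|err|={err:.6f}")
```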
…al_bpb 1.0897 (3-seed mean)

Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation.
SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0.
3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting under 16MB with 7-11K margins.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
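
For context, a minimal sketch of the score-first discipline the message invokes: every chunk contributes its NLL under inference_mode() before the parameters see any gradient from it. `model` and `chunks` are hypothetical stand-ins, and the lr/epochs mirror the message; the record's actual loop (per the PR openai#549 precedent) differs in detail:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=0.005, epochs=3):
    """Score each chunk BEFORE any update it could influence (legal TTT).
    Illustrative stub, not the record's train_gpt.py implementation."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for x, y in chunks:                      # (input_ids, target_ids) pairs
        with torch.inference_mode():         # score first, gradient-free
            nll = F.cross_entropy(model(x).flatten(0, 1), y.flatten(),
                                  reduction="sum")
        total_nll += nll.item()
        total_tokens += y.numel()
        for _ in range(epochs):              # only now adapt on this chunk
            opt.zero_grad(set_to_none=True)
            F.cross_entropy(model(x).flatten(0, 1), y.flatten()).backward()
            opt.step()
    return total_nll / total_tokens          # nats/token; bpb needs byte counts
```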
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)

3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…60-gptq-brotli-1.1105

Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)
cocohearts and others added 17 commits April 9, 2026 14:10
…mult4-wd085

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)
Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR openai#1179, -0.0143 vs merged SOTA
…0-allint6

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
…-slot-v4

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…mb-sdclip-loop45x2

Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5-seed mean)
…-ttt-1.08279

Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)
…rallel-ttt

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
…duals-hessian-sdclip

Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean)
…oard-readme

Update README leaderboard for April records
…179 (3-seed mean) (openai#1148)

Two novel TTT innovations: (1) Muon-style Newton-Schulz orthogonalized updates
replace SGD in the TTT loop; (2) entropy-adaptive 2/3/4 epochs per chunk based
on globally-synced chunk NLL. 3-seed mean 1.1179, std 0.0002. All under 16MB/600s.

Co-authored-by: aamodbhatt <bhat.aamod@gmail.com>
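
The Newton-Schulz orthogonalization named in the message above is the quintic iteration popularized by Muon and modded-nanogpt. A minimal sketch with the coefficients used there; the entropy-adaptive epoch schedule and per-parameter plumbing are elided:

```python
import torch

def newton_schulz_orth(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D gradient (drive G toward the U V^T
    of its SVD). Coefficients follow the Muon/modded-nanogpt quintic."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = X.size(0) > X.size(1)
    if transpose:
        X = X.T                               # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

# Inside the TTT loop, the SGD direction for a 2-D parameter p would be
# replaced by its orthogonalized counterpart (hedged sketch):
#   p.data.add_(newton_schulz_orth(p.grad), alpha=-ttt_lr)
```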
…eed mean) (openai#1060)

* Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all

3-seed mean val_bpb: 1.1123 (std 0.0005)
All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style; see the sketch below)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
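
An illustrative reconstruction of the coprime-stride idea flagged in the list above (not the record's exact pipeline): stepping through shards with a stride coprime to the shard count visits every shard exactly once per pass while interleaving them, with no shuffle buffer:

```python
import math

def coprime_stride_order(n_shards: int, seed: int = 0) -> list[int]:
    """Pick the first stride >= 2+seed coprime to n_shards; (i*stride) mod n
    then enumerates all shards exactly once per pass, mixed together."""
    stride = next(s for s in range(2 + seed, 2 + seed + n_shards)
                  if math.gcd(s, n_shards) == 1)
    return [(i * stride) % n_shards for i in range(n_shards)]

print(coprime_stride_order(10))   # [0, 3, 6, 9, 2, 5, 8, 1, 4, 7]
```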

* fix: add run command, requirements.txt for reproducibility

* chore: strip dead code from train_gpt.py (111KB→96KB, +14KB artifact headroom)

* fix: re-verify 3 seeds with stripped train_gpt.py for full consistency

Seed logs now generated with the same 96,398-byte train_gpt.py that ships
in this record. Previous logs were from the pre-strip 111,130-byte version.

Updated results:
  Seed 1337: 1.1118 BPB, 15,973,962 bytes
  Seed 42:   1.1127 BPB, 15,980,438 bytes
  Seed 2025: 1.1121 BPB, 15,983,626 bytes
  Mean: 1.1122 ± 0.0004

* docs(record): clean stripped submission logs

Fixes openai#1060
…ean) (openai#1184)

Co-authored-by: icryo <icryo@users.noreply.github.com>
… merges (openai#1806)

* Update leaderboard with recent record submissions

* Keep only valid recent leaderboard rows

* Remove invalid Scylla record

* Remove non-record Muon TTT submission
Opus is the working directory for the leaderboard run targeting the
PR openai#1493 SOTA (val_bpb 1.0810). Documents the 3-day execution plan,
the angle of attack (selective-param TTT on the non-quantized control
tensors), a budget breakdown ($500), and a full decode of the SOTA
architecture pulled from the LZMA-compressed train_gpt.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
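
Decoding a source file shipped LZMA-compressed, as the message describes, is a stdlib one-liner; a sketch assuming a hypothetical artifact path:

```python
import lzma

# Hypothetical path; the record's actual artifact layout may differ.
with open("records/sota/train_gpt.py.xz", "rb") as f:
    source = lzma.decompress(f.read()).decode("utf-8")
print(source[:400])   # inspect the decoded architecture definitions
```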
@GodlyDonuts GodlyDonuts merged commit 82931a7 into main Apr 27, 2026
GodlyDonuts added a commit that referenced this pull request Apr 28, 2026
Synthesis of (a) deep records-folder pass, (b) modded-nanogpt record #80
gold standard, (c) FP8 / CUDA Graphs / distillation literature.

Key findings:
1. Leaderboard converged on gradient-quality + quantization tricks while
   leaving raw throughput largely unexplored. Modded-nanogpt has absorbed
   multiple compute-maxing techniques that haven't crossed into PG.
2. NEVER-TRIED on the leaderboard (open territory):
   - CUDA Graphs (record #80 of modded-nanogpt uses them heavily; see the sketch after this list)
   - Multiple parallel training rounds in unused VRAM
   - Multiple EMAs / Polyak averaging
   - Distillation initialization
   - Larger GPTQ calibration set (>64 batches)
   - Sequence-length warmup
3. Top-8 ranked actionable items (CUDA Graphs #1, batch-size sweep #2,
   FP8 head #3, multi-EMA #4). Cost estimates and confidence per item.
4. Modded-nanogpt techniques NOT in our SOTA: FP8 head + asymmetric
   rescale, fused softcapped CE, Cautious Weight Decay, "Adam every other
   step", paired-head Q/K orthogonalization, attention window warmup, MTP.
5. TRIED-AND-DROPPED on PG (don't waste compute): seq_len=4096, parallel
   residual MLP-skip, 3-loop mini-recurrence, ternary, YaRN, NeoMuon,
   hash embeddings, etc. Verbatim quotes from records folder for each.
6. FP8 honest analysis: 1.6x typical training speedup (not 3x), with
   documented loss-spike instability. FP8 only on lm_head + tok_emb is
   the right initial bet (small surface, well-conditioned matmuls).
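
For item 2's top pick, the standard PyTorch CUDA Graphs capture/replay pattern is sketched below with a stand-in model; it is the generic recipe from the torch.cuda docs, not a claim about how it would be wired into train_gpt.py:

```python
import torch
import torch.nn.functional as F

# Stand-in model and fixed-shape static buffers (shapes must not change).
model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
static_x = torch.randn(64, 1024, device="cuda")
static_y = torch.randn(64, 1024, device="cuda")

# 1) Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        F.mse_loss(model(static_x), static_y).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# 2) Capture one full optimizer step into a graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)   # grads reallocate inside the graph's pool
with torch.cuda.graph(g):
    static_loss = F.mse_loss(model(static_x), static_y)
    static_loss.backward()
    opt.step()

# 3) Per training step: refill the captured buffers in place, then replay.
#    static_x.copy_(next_x); static_y.copy_(next_y); g.replay()
```

The payoff is that replay launches the entire step from a single CPU call, which is exactly the raw-throughput lever finding 1 says the leaderboard has left unexplored.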

Decision rules tied to Phase 3 outcome:
- Phase 2 mean > 1.0780: prioritize throughput stack (CUDA Graphs +
  batch sweep + FP8 head) plus Newton-Muon as gradient-quality lever.
- Phase 2 mean 1.0760-1.0780: just CUDA Graphs + LR follow-on +
  Newton-Muon.
- Phase 2 mean below 1.0760: ship; none of this matters this cycle.

Still-research items: torch.compile(mode='reduce-overhead'), MTP
re-test, qTTT paper body, Cautious WD diff from modded-nanogpt.
None spend GPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
