Claude/busy thompson 9c94f9#1
Merged
GodlyDonuts merged 47 commits into main on Apr 27, 2026
Conversation
…b 1.1105 (3-seed mean)
…pb 1.09785 (3-seed mean)
….0912 (3-seed mean) WD-quantization synergy: the higher weight decay (0.090 vs 0.085) compresses 5% better, creating headroom for all 66 layers at int6 precision. The extra quantization quality more than recovers the BPB cost of the higher WD. 3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337). All seeds under 16MB with 32K+ margins. No TTT, no SLOT, no eval-time adaptation. Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
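The record above quantizes all 66 layers to int6 via GPTQ (Hessian-aware, error-compensating). As a much simpler illustration of what the int6 format itself buys, here is a hedged round-to-nearest sketch — not the record's GPTQ code — using a symmetric per-tensor scale over the 6-bit signed range:

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest int6 quantization: values map to integers
    in [-31, 31] (6-bit signed range, reserving -32), via a per-tensor scale.
    Illustrative only -- the record uses GPTQ, which additionally compensates
    rounding error using second-order (Hessian) information."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    return [qi * scale for qi in q]

weights = [0.12, -0.5, 0.31, 0.0, -0.07]
q, scale = quantize_int6(weights)
recon = dequantize_int6(q, scale)
```

Round-to-nearest bounds the per-weight reconstruction error by half a quantization step (`scale / 2`); the commit's point is that a stronger weight decay shrinks `max_abs`, shrinking `scale` and therefore that error bound.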
…al_bpb 1.0897 (3-seed mean) Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation. SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0. 3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
…(3-seed mean) On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all fitting 16MB with 7-11K margin.
Per-seed (post-TTT):
- seed 0 : 1.08210 (val_loss 2.79517)
- seed 42 : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean : 1.08279 (2.79697 nats per token)
Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token, clearing the 0.005 nats record threshold by 0.00231 nats per seed. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT matches PR openai#549 precedent: every chunk scored under inference_mode() before any parameter update.
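The "score-first" legality constraint (every chunk scored before any parameter update) can be illustrated with a toy scalar model. This is not the train_gpt.py implementation — just a sketch of the ordering it enforces, with plain SGD on a squared-error loss standing in for the real optimizer and NLL:

```python
def score_first_ttt(chunks, w=0.0, lr=0.005, epochs=3):
    """Score-first test-time training on a toy scalar model y_hat = w * x:
    each chunk is SCORED with the current weights BEFORE any update on it,
    so no chunk's reported score benefits from having been trained on.
    Later chunks do benefit from adaptation on earlier ones -- that is the
    legal part of TTT."""
    losses = []
    for x, y in chunks:
        # 1) score under frozen weights (the inference-only pass)
        losses.append((w * x - y) ** 2)
        # 2) only afterwards adapt on this chunk (plain SGD, `epochs` steps)
        for _ in range(epochs):
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return losses, w

chunks = [(1.0, 1.0), (1.0, 1.0)]
losses, w = score_first_ttt(chunks)
```

With two identical chunks, the second chunk's score improves only because of the update performed after the first chunk was scored — the same effect the record exploits at eval time.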
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999.
All artifacts under 16MB, training under 600s, eval under 600s.
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…60-gptq-brotli-1.1105 Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)
…mult4-wd085 Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)
Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR openai#1179, -0.0143 vs merged SOTA
…0-allint6 Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
…-slot-v4 Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…mb-sdclip-loop45x2 Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean)
…-ttt-1.08279 Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)
…rallel-ttt Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
…duals-hessian-sdclip Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean)
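Several of the records above include a BigramHash feature (2816x160 in the Split-LR record, 2816×112 in the coprime-stride record). The exact hash function is not shown in this thread; the sketch below illustrates the general technique — mapping each (previous, current) token pair to a row of a fixed-size embedding table — with mixing constants that are purely illustrative:

```python
TABLE_ROWS, EMB_DIM = 2816, 160  # sizes from the record title; table contents omitted

def bigram_slot(prev_tok: int, cur_tok: int) -> int:
    """Map a token bigram to a row index of a hashed-embedding table.
    The multiplicative mixing constants (a large prime and Knuth's 2^32
    golden-ratio constant) are illustrative, not taken from the record.
    Collisions are expected and tolerated: the table is far smaller than
    vocab_size**2, which is what keeps it cheap in parameters."""
    h = (prev_tok * 1000003 + cur_tok) * 2654435761 % (2**32)
    return h % TABLE_ROWS
```

At train and eval time the row `bigram_slot(prev, cur)` would be added (or concatenated) to the usual token embedding, giving the model a cheap direct view of local bigram statistics.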
…oard-readme Update README leaderboard for April records
…179 (3-seed mean) (openai#1148) Two novel TTT innovations: (1) Muon-style Newton-Schulz orthogonalized updates replace SGD in the TTT loop; (2) entropy-adaptive 2/3/4 epochs per chunk based on globally-synced chunk NLL. 3-seed mean 1.1179, std 0.0002. All under 16MB/600s. Co-authored-by: aamodbhatt <bhat.aamod@gmail.com>
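The Muon-style Newton-Schulz step in innovation (1) replaces a raw gradient matrix with an approximation of its nearest orthogonal matrix (the U V^T factor of its SVD). Muon itself uses a tuned quintic iteration on GPU tensors; the classic cubic variant below is a dependency-free sketch of the same idea on plain nested lists:

```python
def matmul(A, B):
    """Naive dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz_orthogonalize(G, steps=15):
    """Approximate the orthogonal factor of G with the cubic Newton-Schulz
    iteration X <- 1.5*X - 0.5*X(X^T X). Convergence needs the starting
    spectral norm below sqrt(3); normalizing by the Frobenius norm (an
    upper bound on the spectral norm) guarantees that."""
    fro = sum(v * v for row in G for v in row) ** 0.5 or 1.0
    X = [[v / fro for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(X, matmul(transpose(X), X))
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```

The effect on an optimizer update is to equalize the singular values of the step, so no single direction in weight space dominates — which is the gradient-quality argument behind using it inside a TTT loop.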
…eed mean) (openai#1060)
* Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all. 3-seed mean val_bpb: 1.1123 (std 0.0005). All artifacts under 16MB, all eval under 600s. Key changes from PR openai#549:
  - Coprime-stride multi-shard data pipeline (PR openai#726 style)
  - Full Hessian GPTQ with Cholesky error compensation
  - XSA on all 11 layers
  - BigramHash(2816×112)
  - No TTT (sliding-only outperforms on this stack)
  Built on PR openai#549 by @abaybektursun.
* fix: add run command, requirements.txt for reproducibility
* chore: strip dead code from train_gpt.py (111KB→96KB, +14KB artifact headroom)
* fix: re-verify 3 seeds with stripped train_gpt.py for full consistency. Seed logs now generated with the same 96,398-byte train_gpt.py that ships in this record; previous logs were from the pre-strip 111,130-byte version. Updated results:
  - Seed 1337: 1.1118 BPB, 15,973,962 bytes
  - Seed 42: 1.1127 BPB, 15,980,438 bytes
  - Seed 2025: 1.1121 BPB, 15,983,626 bytes
  - Mean: 1.1122 ± 0.0004
* docs(record): clean stripped submission logs
Fixes openai#1060
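The PR openai#726-style coprime-stride pipeline is not reproduced in this excerpt. Its core idea can be sketched as follows, assuming the goal is to visit every shard (or chunk) index exactly once per pass while decorrelating neighboring reads; the stride seed constant is illustrative:

```python
from math import gcd

def coprime_stride_order(n_items: int, stride_seed: int = 7) -> list[int]:
    """Visit all n_items indices exactly once by stepping with a stride
    coprime to n_items: (i * stride) mod n_items is then a permutation
    of 0..n_items-1, so no shard is skipped or repeated, while consecutive
    reads land `stride` apart instead of adjacent. stride_seed=7 is an
    illustrative default, not a value from the record."""
    stride = stride_seed
    while gcd(stride, n_items) != 1:
        stride += 1  # bump until coprime with n_items
    return [(i * stride) % n_items for i in range(n_items)]

order = coprime_stride_order(10)
```

Because the permutation property follows purely from coprimality, the same trick works for any shard count without precomputing and storing a shuffled index list.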
…ean) (openai#1184) Co-authored-by: icryo <icryo@users.noreply.github.com>
… merges (openai#1806)
* Update leaderboard with recent record submissions
* Keep only valid recent leaderboard rows
* Remove invalid Scylla record
* Remove non-record Muon TTT submission
Opus is the working directory for the leaderboard run targeting the PR openai#1493 SOTA (val_bpb 1.0810). It documents the 3-day execution plan, the angle of attack (selective-param TTT on the non-quantized control tensors), a budget breakdown ($500), and a full decode of the SOTA architecture pulled from the LZMA-compressed train_gpt.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GodlyDonuts added a commit that referenced this pull request on Apr 28, 2026
Synthesis of (a) a deep records-folder pass, (b) modded-nanogpt record #80 as the gold standard, and (c) the FP8 / CUDA Graphs / distillation literature. Key findings:
1. The leaderboard converged on gradient-quality + quantization tricks while leaving raw throughput largely unexplored. Modded-nanogpt has absorbed multiple compute-maxing techniques that haven't crossed into PG.
2. NEVER-TRIED on the leaderboard (open territory):
   - CUDA Graphs (record #80 of modded-nanogpt uses them heavily)
   - Multiple parallel training rounds in unused VRAM
   - Multiple EMAs / Polyak averaging
   - Distillation initialization
   - Larger GPTQ calibration set (>64 batches)
   - Sequence-length warmup
3. Top-8 ranked actionable items (CUDA Graphs #1, batch-size sweep #2, FP8 head #3, multi-EMA #4), with cost estimates and confidence per item.
4. Modded-nanogpt techniques NOT in our SOTA: FP8 head + asymmetric rescale, fused softcapped CE, Cautious Weight Decay, "Adam every other step", paired-head Q/K orthogonalization, attention window warmup, MTP.
5. TRIED-AND-DROPPED on PG (don't waste compute): seq_len=4096, parallel residual MLP-skip, 3-loop mini-recurrence, ternary, YaRN, NeoMuon, hash embeddings, etc. Verbatim quotes from the records folder for each.
6. FP8 honest analysis: 1.6x typical training speedup (not 3x), with documented loss-spike instability. FP8 only on lm_head + tok_emb is the right initial bet (small surface, well-conditioned matmuls).
Decision rules tied to Phase 3 outcome:
- Phase 2 mean > 1.0780: prioritize the throughput stack (CUDA Graphs + batch sweep + FP8 head) plus Newton-Muon as the gradient-quality lever.
- Phase 2 mean 1.0760-1.0780: just CUDA Graphs + LR follow-on + Newton-Muon.
- Phase 2 mean clears 1.0760: ship; none of this matters this cycle.
Still-research items: torch.compile(mode='reduce-overhead'), MTP re-test, qTTT paper body, Cautious WD diff from modded-nanogpt. None spend GPU.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
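Among the never-tried items, multiple EMAs / Polyak averaging is the simplest to state precisely: keep several exponentially-averaged copies of the weights at different decay rates and pick (or ensemble) the best at eval. A minimal dependency-free sketch, with decay values and the flat list-of-floats parameter representation chosen purely for illustration:

```python
def update_emas(emas, params, decays):
    """Maintain one EMA copy of the parameters per decay rate.
    After each optimizer step, fold the fresh params into every copy:
        ema <- decay * ema + (1 - decay) * params
    Slower decays (e.g. 0.999) track a long average; faster ones (0.9)
    stay closer to the live weights. `emas` is mutated in place."""
    for ema, decay in zip(emas, decays):
        for i, p in enumerate(params):
            ema[i] = decay * ema[i] + (1.0 - decay) * p
    return emas

decays = [0.9, 0.99]
emas = [[0.0], [0.0]]                 # one copy per decay, same shape as params
emas = update_emas(emas, [1.0], decays)
```

The appeal for a leaderboard run is that the extra copies cost only memory and a cheap elementwise update, yet give several "free" candidate checkpoints per training run.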
No description provided.