Claude/busy thompson 9c94f9 #1

Merged

GodlyDonuts merged 47 commits into main from claude/busy-thompson-9c94f9 on Apr 27, 2026

Conversation

@GodlyDonuts (Owner)

No description provided.

dexhunter and others added 30 commits March 31, 2026 11:15
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
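
For readers outside the thread: the WD-quantization synergy above follows from the fact that higher weight decay shrinks weight magnitudes, which shrinks the step size of a symmetric quantizer and hence the per-weight rounding error. A minimal sketch, assuming plain symmetric per-tensor int6 quantization (the record's actual GPTQ pipeline is more involved):

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor int6: levels in [-31, 31] (illustrative only)."""
    scale = (w.abs().max() / 31.0).item()
    q = torch.clamp(torch.round(w / scale), -31, 31)
    return q, scale

# Higher weight decay -> smaller |w| -> smaller scale -> smaller absolute
# rounding error, which is the "headroom" the message describes.
w = torch.randn(4096, 512)
for shrink in (1.00, 0.95):          # 0.95 mimics a higher-WD checkpoint
    q, s = quantize_int6(w * shrink)
    err = (w * shrink - q * s).abs().mean().item()
    print(f"shrink={shrink:.2f}  scale={s:.5f}  mean|err|={err:.6f}")
```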
…al_bpb 1.0897 (3-seed mean)

Track A (fixed predictor): no TTT, no SLOT, no eval-time adaptation.
SP4096 + MLP 4x + WD 0.090 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0.
3-seed mean: 1.0897 BPB, delta -0.0250 vs merged SOTA.
…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting under 16MB with 7-11K margins.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
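
For context, a minimal sketch of the score-first discipline the message invokes: every chunk contributes its NLL under inference_mode() before the parameters see any gradient from it. `model` and `chunks` are hypothetical stand-ins, and the lr/epochs mirror the message; the record's actual loop (per the PR openai#549 precedent) differs in detail:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=0.005, epochs=3):
    """Score each chunk BEFORE any update it could influence (legal TTT).
    Illustrative stub, not the record's train_gpt.py implementation."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for x, y in chunks:                      # (input_ids, target_ids) pairs
        with torch.inference_mode():         # score first, gradient-free
            nll = F.cross_entropy(model(x).flatten(0, 1), y.flatten(),
                                  reduction="sum")
        total_nll += nll.item()
        total_tokens += y.numel()
        for _ in range(epochs):              # only now adapt on this chunk
            opt.zero_grad(set_to_none=True)
            F.cross_entropy(model(x).flatten(0, 1), y.flatten()).backward()
            opt.step()
    return total_nll / total_tokens          # nats/token; bpb needs byte counts
```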
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)

3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999
All artifacts under 16MB, training under 600s, eval under 600s
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…60-gptq-brotli-1.1105

Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bpb 1.1105 (3-seed mean)
cocohearts and others added 17 commits April 9, 2026 14:10
…mult4-wd085

Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)
Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR openai#1179, -0.0143 vs merged SOTA
…0-allint6

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)
…-slot-v4

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
…mb-sdclip-loop45x2

Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5-seed mean)
…-ttt-1.08279

Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)
…rallel-ttt

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
…duals-hessian-sdclip

Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean)
…oard-readme

Update README leaderboard for April records
…179 (3-seed mean) (openai#1148)

Two novel TTT innovations: (1) Muon-style Newton-Schulz orthogonalized updates
replace SGD in the TTT loop; (2) entropy-adaptive 2/3/4 epochs per chunk based
on globally-synced chunk NLL. 3-seed mean 1.1179, std 0.0002. All under 16MB/600s.

Co-authored-by: aamodbhatt <bhat.aamod@gmail.com>
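
The Newton-Schulz orthogonalization named in the message above is the quintic iteration popularized by Muon and modded-nanogpt. A minimal sketch with the coefficients used there; the entropy-adaptive epoch schedule and per-parameter plumbing are elided:

```python
import torch

def newton_schulz_orth(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D gradient (drive G toward the U V^T
    of its SVD). Coefficients follow the Muon/modded-nanogpt quintic."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = X.size(0) > X.size(1)
    if transpose:
        X = X.T                               # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

# Inside the TTT loop, the SGD direction for a 2-D parameter p would be
# replaced by its orthogonalized counterpart (hedged sketch):
#   p.data.add_(newton_schulz_orth(p.grad), alpha=-ttt_lr)
```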
…eed mean) (openai#1060)

* Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all

3-seed mean val_bpb: 1.1123 (std 0.0005)
All artifacts under 16MB, all eval under 600s.

Key changes from PR openai#549:
- Coprime-stride multi-shard data pipeline (PR openai#726 style; see the sketch below)
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- BigramHash(2816×112)
- No TTT (sliding-only outperforms on this stack)

Built on PR openai#549 by @abaybektursun.
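
An illustrative reconstruction of the coprime-stride idea flagged in the list above (not the record's exact pipeline): stepping through shards with a stride coprime to the shard count visits every shard exactly once per pass while interleaving them, with no shuffle buffer:

```python
import math

def coprime_stride_order(n_shards: int, seed: int = 0) -> list[int]:
    """Pick the first stride >= 2+seed coprime to n_shards; (i*stride) mod n
    then enumerates all shards exactly once per pass, mixed together."""
    stride = next(s for s in range(2 + seed, 2 + seed + n_shards)
                  if math.gcd(s, n_shards) == 1)
    return [(i * stride) % n_shards for i in range(n_shards)]

print(coprime_stride_order(10))   # [0, 3, 6, 9, 2, 5, 8, 1, 4, 7]
```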

* fix: add run command, requirements.txt for reproducibility

* chore: strip dead code from train_gpt.py (111KB→96KB, +14KB artifact headroom)

* fix: re-verify 3 seeds with stripped train_gpt.py for full consistency

Seed logs now generated with the same 96,398-byte train_gpt.py that ships
in this record. Previous logs were from the pre-strip 111,130-byte version.

Updated results:
  Seed 1337: 1.1118 BPB, 15,973,962 bytes
  Seed 42:   1.1127 BPB, 15,980,438 bytes
  Seed 2025: 1.1121 BPB, 15,983,626 bytes
  Mean: 1.1122 ± 0.0004

* docs(record): clean stripped submission logs

Fixes openai#1060
…ean) (openai#1184)

Co-authored-by: icryo <icryo@users.noreply.github.com>
… merges (openai#1806)

* Update leaderboard with recent record submissions

* Keep only valid recent leaderboard rows

* Remove invalid Scylla record

* Remove non-record Muon TTT submission
Opus is the working directory for the leaderboard run targeting the
PR openai#1493 SOTA (val_bpb 1.0810). Documents the 3-day execution plan,
the angle of attack (selective-param TTT on the non-quantized control
tensors), a budget breakdown ($500), and a full decode of the SOTA
architecture pulled from the LZMA-compressed train_gpt.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
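
Decoding a source file shipped LZMA-compressed, as the message describes, is a stdlib one-liner; a sketch assuming a hypothetical artifact path:

```python
import lzma

# Hypothetical path; the record's actual artifact layout may differ.
with open("records/sota/train_gpt.py.xz", "rb") as f:
    source = lzma.decompress(f.read()).decode("utf-8")
print(source[:400])   # inspect the decoded architecture definitions
```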
@GodlyDonuts GodlyDonuts merged commit 82931a7 into main Apr 27, 2026
GodlyDonuts added a commit that referenced this pull request Apr 28, 2026
Synthesis of (a) deep records-folder pass, (b) modded-nanogpt record #80
gold standard, (c) FP8 / CUDA Graphs / distillation literature.

Key findings:
1. Leaderboard converged on gradient-quality + quantization tricks while
   leaving raw throughput largely unexplored. Modded-nanogpt has absorbed
   multiple compute-maxing techniques that haven't crossed into PG.
2. NEVER-TRIED on the leaderboard (open territory):
   - CUDA Graphs (record #80 of modded-nanogpt uses them heavily; see the sketch after this list)
   - Multiple parallel training rounds in unused VRAM
   - Multiple EMAs / Polyak averaging
   - Distillation initialization
   - Larger GPTQ calibration set (>64 batches)
   - Sequence-length warmup
3. Top-8 ranked actionable items (CUDA Graphs #1, batch-size sweep #2,
   FP8 head #3, multi-EMA #4). Cost estimates and confidence per item.
4. Modded-nanogpt techniques NOT in our SOTA: FP8 head + asymmetric
   rescale, fused softcapped CE, Cautious Weight Decay, "Adam every other
   step", paired-head Q/K orthogonalization, attention window warmup, MTP.
5. TRIED-AND-DROPPED on PG (don't waste compute): seq_len=4096, parallel
   residual MLP-skip, 3-loop mini-recurrence, ternary, YaRN, NeoMuon,
   hash embeddings, etc. Verbatim quotes from records folder for each.
6. FP8 honest analysis: 1.6x typical training speedup (not 3x), with
   documented loss-spike instability. FP8 only on lm_head + tok_emb is
   the right initial bet (small surface, well-conditioned matmuls).
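
For item 2's top pick, the standard PyTorch CUDA Graphs capture/replay pattern is sketched below with a stand-in model; it is the generic recipe from the torch.cuda docs, not a claim about how it would be wired into train_gpt.py:

```python
import torch
import torch.nn.functional as F

# Stand-in model and fixed-shape static buffers (shapes must not change).
model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
static_x = torch.randn(64, 1024, device="cuda")
static_y = torch.randn(64, 1024, device="cuda")

# 1) Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        F.mse_loss(model(static_x), static_y).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# 2) Capture one full optimizer step into a graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)   # grads reallocate inside the graph's pool
with torch.cuda.graph(g):
    static_loss = F.mse_loss(model(static_x), static_y)
    static_loss.backward()
    opt.step()

# 3) Per training step: refill the captured buffers in place, then replay.
#    static_x.copy_(next_x); static_y.copy_(next_y); g.replay()
```

The payoff is that replay launches the entire step from a single CPU call, which is exactly the raw-throughput lever finding 1 says the leaderboard has left unexplored.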

Decision rules tied to Phase 3 outcome:
- Phase 2 mean > 1.0780: prioritize throughput stack (CUDA Graphs +
  batch sweep + FP8 head) plus Newton-Muon as gradient-quality lever.
- Phase 2 mean 1.0760-1.0780: just CUDA Graphs + LR follow-on +
  Newton-Muon.
- Phase 2 mean below 1.0760: ship; none of this matters this cycle.

Still-research items: torch.compile(mode='reduce-overhead'), MTP
re-test, qTTT paper body, Cautious WD diff from modded-nanogpt.
None spend GPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
