non-record - mockingbird - sota copy/10kvocab #2120

Open

newjordan wants to merge 2 commits into openai:main from newjordan:submission/mockingbird

Conversation

newjordan commented May 1, 2026

mockingbird

non-record — mockingbird — SOTA copy / 10k vocab

Non-record submission. The SP10240 CaseOps sister of PR #1855 — same body, same compression / phased-TTT machinery, vocab swapped 8192 → 10240 with the body shrunk to MLP3.75 to stay under the 16 MB cap.

Filed for comparison, not as a record claim.
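For intuition on the size tradeoff, here is rough byte arithmetic. This is my own estimate, not the PR's ledger: it assumes the baseline body used mlp_mult=4.0, two matrices per MLP, and straight 7- and 6-bit packing.

```python
# Rough byte-budget arithmetic for the 8192 -> 10240 vocab swap.
# Assumptions (mine, not the PR's): baseline mlp_mult was 4.0, each MLP
# is two dim x (mlp_mult*dim) matrices, and int7/int6 weights pack at
# 7/8 and 6/8 bytes per parameter.
dim, layers = 512, 11

extra_rows = 10240 - 8192
embed_cost = extra_rows * dim * 7 / 8                        # int7 embedding rows
print(f"extra embedding bytes: {embed_cost:,.0f}")           # ~917,504 B

mlp_saving = (4.0 - 3.75) * dim * dim * 2 * layers * 6 / 8   # int6 matrices
print(f"MLP 4.0 -> 3.75 saves: {mlp_saving:,.0f}")           # ~1,081,344 B
```

Under those assumptions the two numbers roughly cancel, which is consistent with the shrink-to-MLP3.75 framing above.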

Results

3-seed mean (seeds 42 / 0 / 1):

Seed   val_bpb (quantized_ttt_phased)   Steps   Total submission size
42     1.06204667                       5,264   15,816,988 B
 0     1.06226648                       5,231   15,818,783 B
 1     1.06299064                       5,221   15,810,544 B
mean   1.06243460                           -   15,818,783 B (max)

For reference: PR #1855 (SP8192) reported 3-seed mean 1.06107587 post-phased-TTT. Mockingbird is +0.00136 worse — the cost of the 10k vocab swap on otherwise-identical machinery.

Hardware: 8×H100 SXM · 600 s wallclock · bytes_code 163,036 (uncompressed) / 41,220 (compressed)
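For readers new to the track: val_bpb charges the model's token-level loss against the raw byte count of the eval text, which is what makes scores comparable across the 8192- and 10240-token vocabs. A minimal sketch of that accounting, with made-up numbers (the record harness may differ in detail):

```python
import math

# Byte-level bits-per-byte: total next-token NLL (in nats) divided by
# the raw UTF-8 byte count of the eval text, converted to bits.
def bits_per_byte(total_nll_nats: float, total_bytes: float) -> float:
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers, not from the seed logs:
print(bits_per_byte(total_nll_nats=7.36e6, total_bytes=1e7))
```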

Architecture

11L · dim 512 · mlp_mult=3.75 · loop_start=3, loop_end=5, enable_looping_at=0.45

  • Tokenizer: SP10240 CaseOps lossless-caps (10,240 tokens), FineWeb 10B with byte-level loss accounting
  • Quant: per-group, embed int7, matrix int6, LQER asymmetric rank-4 (sketched after this list)
  • Eval: PR1855 phased LoRA TTT — prefix_docs=2500, phases=3, chunk=48 (second sketch after this list)
  • Compression: pergroup
  • Train budget: 600 s wallclock, hard 16 MB artifact cap
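A minimal sketch of per-group quantization plus an LQER-style low-rank error correction. It uses symmetric scales and float rank-4 factors for brevity; the submission's actual scheme is asymmetric, and its real rank-4 factors live in train_gpt.py.

```python
import torch

# Per-group quantization: quantize W in fixed-size groups, then fit a
# rank-r factorization of the residual E = W - W_q (the LQER idea).
def quantize_pergroup(W: torch.Tensor, bits: int = 6, group: int = 64):
    qmax = 2 ** (bits - 1) - 1
    Wg = W.reshape(-1, group)                       # [n_groups, group]
    scale = Wg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = (Wg / scale).round().clamp(-qmax - 1, qmax)
    return (q * scale).reshape(W.shape)             # dequantized view

def lqer_correction(W: torch.Tensor, W_q: torch.Tensor, rank: int = 4):
    U, S, Vh = torch.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * S[:rank]                      # [out, rank]
    B = Vh[:rank]                                   # [rank, in]
    return A, B                                     # W ≈ W_q + A @ B

W = torch.randn(512, 512)
W_q = quantize_pergroup(W, bits=6, group=64)
A, B = lqer_correction(W, W_q, rank=4)
print((W - (W_q + A @ B)).abs().mean())             # residual after correction
```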
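And a schematic of phased LoRA TTT as I read the knobs above: LoRA adapters on the frozen quantized model are trained on the first prefix_docs of the eval stream, in fixed-size chunks, across several phases. This is a reconstruction, not the PR #1855 harness; `model(x, targets=y)` returning a scalar loss is an assumed interface, and the per-phase lr halving is a stand-in for whatever the real phase schedule does.

```python
import torch

def phased_lora_ttt(model, lora_params, prefix_tokens,
                    phases=3, chunk=48, lr=1e-3):
    opt = torch.optim.Adam(lora_params, lr=lr)      # only LoRA params adapt
    chunks = prefix_tokens.split(chunk)             # fixed-size TTT chunks
    per_phase = len(chunks) // phases
    for phase in range(phases):
        for c in chunks[phase * per_phase:(phase + 1) * per_phase]:
            loss = model(c[:-1], targets=c[1:])     # next-token loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        # a phase boundary is where schedules (lr, which layers adapt)
        # would change; details are in the PR #1855 harness
        for g in opt.param_groups:
            g["lr"] *= 0.5
```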

Seeds note

The three runs use byte-identical training code, differing only in Hyperparameters.seed = N (line 479) and four cosmetic TEST_ID/TEST_DATE/RUN_KIND/blurb fields. The committed train_gpt.py is the seed-42 run.

Reproduce

SKIP_GPTQ=1 SEED=42 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/train_gpt.py

For seeds 0 / 1, change line 479 (Hyperparameters.seed = 42) to the desired seed.

Lineage

(lineage chart not reproduced here)

Couldn't beat this bad boy in a day, nice job everyone. My last gambit was to push the vocab and wind down better, because that's where everything went. I spent the last two weeks trying to push the neural net with kernels and vocabs, and got really stuck on some 12L options for a while... Just never really cracked past 1.079. Re-looked at evals too late in the game (spent a lot of early time on them) to matter! Cheers everyone. Thanks for the ride. I'm a better/smarter person than I was when I started this. May the Schwartz be with you all.

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠚⠓⠤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⡾⣅⠀⠀⠀⠀⣨⢷⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⣀⡤⡦⣄⡀⠀⡇⠀⠙⢢⡔⠋⠀⢸⠀⢀⣠⢴⢤⣀⠀⠀
⡴⠊⠁⠀⢷⠀⠉⠲⣇⠀⠀⢸⡇⠀⠀⣸⠖⠉⠀⡼⠀⠈⠑⢦
⣧⠀⠀⢀⣸⡀⠀⠀⢿⠙⠲⢼⣧⠖⠋⡿⠀⠀⢀⣇⡀⠀⠀⣸
⢹⡴⠚⠉⠀⠈⠑⠦⣼⠀⠀⢸⡇⠀⠀⣧⡴⠊⠁⠀⠉⠓⢦⡏
⠀⠈⠓⢤⣀⡤⠖⠋⠁⠙⠲⣼⣧⠖⠋⠈⠙⠲⢤⣀⡤⠚⠁⠀
⢀⡠⠖⠉⠀⠉⠓⠦⣄⠴⠚⢹⡏⠓⠦⣠⡴⠚⠉⠀⠉⠲⢄⡀
⣼⠙⠲⢤⣀⡠⠔⠋⢹⠀⠀⣸⣇⠀⠀⣏⠙⠲⢄⣀⡤⠖⠋⣧
⡏⠀⠀⠀⢸⠀⠀⠀⣿⠴⠚⢹⡏⠓⠦⣿⠀⠀⠀⡇⠀⠀⠀⢸
⠙⠢⣄⡀⡟⣀⡤⠚⡇⠀⠀⢸⡇⠀⠀⢸⠓⠤⣀⢹⢀⣠⠔⠋
⠀⠀⠀⠉⠋⠁⠀⠀⡇⣀⠴⠊⠑⠦⣀⢸⠀⠀⠈⠙⠉⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠻⣅⠀⠀⠀⠀⣨⠟⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠲⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀

Octavian and others added 2 commits May 1, 2026 07:41
Non-record submission filing the 10k-vocab (SP10240 lossless-caps CaseOps)
sister of PR openai#1855. The architecture is held fixed; only the tokenizer /
vocab dimension changes (8192 -> 10240), with the body shrunk to MLP3.75
to stay under the 16 MB artifact cap.

Results (3-seed mean over [42, 0, 1]):
  val_bpb_exact = 1.06243460   (max bytes_total = 15,818,783)

  seed 42: 1.06204667 BPB  ·  15,816,988 B
  seed  0: 1.06226648 BPB  ·  15,818,783 B
  seed  1: 1.06299064 BPB  ·  15,810,544 B

Architecture: 11L · dim 512 · mlp_mult=3.75 · loop_start=3, loop_end=5,
enable_looping_at=0.45. Quant: per-group, embed int7, matrix int6, LQER
asym rank-4. Eval: PR1855 phased LoRA TTT (prefix_docs=2500, phases=3,
chunk=48). 8xH100 SXM, 600s wallclock, FineWeb 10B SP10240 CaseOps.

Filed for comparison with the SP8192 lane in PR openai#1855 (3-seed mean
1.06107587). Mockingbird does not beat PR openai#1855; it documents the cost
of the vocab swap on otherwise-identical machinery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add appendix subdir tokenization_10kvocab/ inside the records folder, holding
everything needed to inspect or reproduce the SP10240 CaseOps tokenization
stack the seed runs trained on:

  tokenizer/   the actual lossless_caps_caseops_v1_reserved tokenizer
               (.model + .vocab) used by all three seeds, plus the standard
               SP10240 BPE for diff reference and the BPE training spec
  build/       the one-command rebuild driver, HF upload helper, and the
               actual SentencePiece trainer log from the build
  caseops/     the CaseOps codec (lossless_caps.py — 4 reserved operators),
               the end-to-end prep_sp10240_caseops_data.py, build/upload
               drivers, and HF download helpers (first80 and full124)
  notes/       the HF-lane derivation note and the byte-fit plan that
               explains why the body was held at MLP3.75

Full preprocessed dataset (~5 GB) is published at
huggingface.co/datasets/Frosty40/10k_golfer; download scripts in caseops/
pull it with the standard HF CLI.

This is appendix material — the canonical submission remains train_gpt.py
+ submission.json + the three seed logs in the parent directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
newjordan (Author) commented

Pushed 1521d9f adding tokenization_10kvocab/ inside the records folder for full reviewer access to the SP10240 CaseOps stack:

The original train_gpt.py + seed logs in the parent directory remain the canonical submission; this is appendix material for verifying the tokenization stack.
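For anyone skimming without pulling the appendix, the lossless-caps idea is roughly: lowercase the text before BPE so the 10,240 merges aren't spent on case variants, and emit reserved operator tokens so casing round-trips exactly. A toy sketch with hypothetical markers; the real 4 reserved operators are defined in caseops/lossless_caps.py and will differ, and the real codec reserves actual tokenizer IDs so markers can never collide with raw text.

```python
import re

# Hypothetical reserved strings standing in for two of the operators:
# CAP = "capitalize next word", UPPER = "uppercase next word".
CAP, UPPER = "\u241fC", "\u241fU"

def encode(text: str) -> str:
    def fix(m):
        w = m.group(0)
        if w.isupper() and len(w) > 1:
            return UPPER + w.lower()
        if w[:1].isupper() and w[1:].islower():
            return CAP + w.lower()
        return w                        # mixed case: leave untouched
    return re.sub(r"[A-Za-z]+", fix, text)

def decode(text: str) -> str:
    text = re.sub(re.escape(UPPER) + r"([a-z]+)",
                  lambda m: m.group(1).upper(), text)
    text = re.sub(re.escape(CAP) + r"([a-z]+)",
                  lambda m: m.group(1).capitalize(), text)
    return text

s = "The NASA team met McArthur in May."
assert decode(encode(s)) == s           # casing round-trips losslessly
```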

newjordan (Author) commented May 1, 2026

GPTQ 32, a little custom int6-int7 re-smoothing to push to 16 MB, and my smoothing/chunk-gated TTT brought it down more, but not enough to clutter the leaderboards. Final score for me was a flat 1.060 on the 10k vocab. I do think the best place to spend time on this model is the loop/U-Net relationship (sketched below), and if I had more time this is where I would spend it - the two have a symbiotic relationship that can be pushed further.

If it matters, or the organizers want to see it (I do not think it does) - I have a mountain of daily tests from the last 44 days: 8-10 hours of work daily, testing across multiple GPUs all the time. That data is useful to me now, and I can always pull techniques from it to optimize projects. Eventually I will try to get it organized for public consumption. Cheers and thanks for the ride.
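For context on the loop/U-Net point, here is a schematic guess reconstructed from the loop_start=3, loop_end=5, enable_looping_at=0.45 knobs in the PR body: mid-stack blocks appear to get a second, weight-tied pass once 45% of training has elapsed. This is not the submission's wiring (that is in train_gpt.py), and the exact indexing convention for the looped span is unknown.

```python
import torch.nn as nn

class LoopedStack(nn.Module):
    def __init__(self, blocks, loop_start=3, loop_end=5):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loop_start, self.loop_end = loop_start, loop_end

    def forward(self, x, progress: float, enable_looping_at=0.45):
        for i, block in enumerate(self.blocks):
            x = block(x)
            # once looping switches on, re-run the looped span
            # (weight-tied recurrence over blocks loop_start..loop_end)
            if i == self.loop_end - 1 and progress >= enable_looping_at:
                for j in range(self.loop_start, self.loop_end):
                    x = self.blocks[j](x)
        return x
```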
