non-record - mockingbird - sota copy/10kvocab #2120

Open

newjordan wants to merge 2 commits into openai:main from newjordan:submission/mockingbird

Conversation

newjordan commented May 1, 2026

mockingbird

non-record — mockingbird — SOTA copy / 10k vocab

Non-record submission. The SP10240 CaseOps sister of PR #1855 — same body, same compression / phased-TTT machinery, vocab swapped 8192 → 10240 with the body shrunk to MLP3.75 to stay under the 16 MB cap.

Filed for comparison, not as a record claim.
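For intuition on the size tradeoff, here is rough byte arithmetic. This is my own estimate, not the PR's ledger: it assumes the baseline body used mlp_mult=4.0, two matrices per MLP, and straight 7- and 6-bit packing.

```python
# Rough byte-budget arithmetic for the 8192 -> 10240 vocab swap.
# Assumptions (mine, not the PR's): baseline mlp_mult was 4.0, each MLP
# is two dim x (mlp_mult*dim) matrices, and int7/int6 weights pack at
# 7/8 and 6/8 bytes per parameter.
dim, layers = 512, 11

extra_rows = 10240 - 8192
embed_cost = extra_rows * dim * 7 / 8                        # int7 embedding rows
print(f"extra embedding bytes: {embed_cost:,.0f}")           # ~917,504 B

mlp_saving = (4.0 - 3.75) * dim * dim * 2 * layers * 6 / 8   # int6 matrices
print(f"MLP 4.0 -> 3.75 saves: {mlp_saving:,.0f}")           # ~1,081,344 B
```

Under those assumptions the two numbers roughly cancel, which is consistent with the shrink-to-MLP3.75 framing above.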

Results

3-seed mean (seeds 42 / 0 / 1):

Seed   val_bpb (quantized_ttt_phased)   Steps   Total submission size
42     1.06204667                       5,264   15,816,988 B
 0     1.06226648                       5,231   15,818,783 B
 1     1.06299064                       5,221   15,810,544 B
mean   1.06243460                           -   15,818,783 B (max)

For reference: PR #1855 (SP8192) reported 3-seed mean 1.06107587 post-phased-TTT. Mockingbird is +0.00136 worse — the cost of the 10k vocab swap on otherwise-identical machinery.

Hardware: 8×H100 SXM · 600 s wallclock · bytes_code 163,036 (uncompressed) / 41,220 (compressed)
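For readers new to the track: val_bpb charges the model's token-level loss against the raw byte count of the eval text, which is what makes scores comparable across the 8192- and 10240-token vocabs. A minimal sketch of that accounting, with made-up numbers (the record harness may differ in detail):

```python
import math

# Byte-level bits-per-byte: total next-token NLL (in nats) divided by
# the raw UTF-8 byte count of the eval text, converted to bits.
def bits_per_byte(total_nll_nats: float, total_bytes: float) -> float:
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers, not from the seed logs:
print(bits_per_byte(total_nll_nats=7.36e6, total_bytes=1e7))
```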

Architecture

11L · dim 512 · mlp_mult=3.75 · loop_start=3, loop_end=5, enable_looping_at=0.45

  • Tokenizer: SP10240 CaseOps lossless-caps (10,240 tokens), FineWeb 10B with byte-level loss accounting
  • Quant: per-group, embed int7, matrix int6, LQER asymmetric rank-4 (sketched after this list)
  • Eval: PR1855 phased LoRA TTT — prefix_docs=2500, phases=3, chunk=48 (second sketch after this list)
  • Compression: pergroup
  • Train budget: 600 s wallclock, hard 16 MB artifact cap
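A minimal sketch of per-group quantization plus an LQER-style low-rank error correction. It uses symmetric scales and float rank-4 factors for brevity; the submission's actual scheme is asymmetric, and its real rank-4 factors live in train_gpt.py.

```python
import torch

# Per-group quantization: quantize W in fixed-size groups, then fit a
# rank-r factorization of the residual E = W - W_q (the LQER idea).
def quantize_pergroup(W: torch.Tensor, bits: int = 6, group: int = 64):
    qmax = 2 ** (bits - 1) - 1
    Wg = W.reshape(-1, group)                       # [n_groups, group]
    scale = Wg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = (Wg / scale).round().clamp(-qmax - 1, qmax)
    return (q * scale).reshape(W.shape)             # dequantized view

def lqer_correction(W: torch.Tensor, W_q: torch.Tensor, rank: int = 4):
    U, S, Vh = torch.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * S[:rank]                      # [out, rank]
    B = Vh[:rank]                                   # [rank, in]
    return A, B                                     # W ≈ W_q + A @ B

W = torch.randn(512, 512)
W_q = quantize_pergroup(W, bits=6, group=64)
A, B = lqer_correction(W, W_q, rank=4)
print((W - (W_q + A @ B)).abs().mean())             # residual after correction
```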
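And a schematic of phased LoRA TTT as I read the knobs above: LoRA adapters on the frozen quantized model are trained on the first prefix_docs of the eval stream, in fixed-size chunks, across several phases. This is a reconstruction, not the PR #1855 harness; `model(x, targets=y)` returning a scalar loss is an assumed interface, and the per-phase lr halving is a stand-in for whatever the real phase schedule does.

```python
import torch

def phased_lora_ttt(model, lora_params, prefix_tokens,
                    phases=3, chunk=48, lr=1e-3):
    opt = torch.optim.Adam(lora_params, lr=lr)      # only LoRA params adapt
    chunks = prefix_tokens.split(chunk)             # fixed-size TTT chunks
    per_phase = len(chunks) // phases
    for phase in range(phases):
        for c in chunks[phase * per_phase:(phase + 1) * per_phase]:
            loss = model(c[:-1], targets=c[1:])     # next-token loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        # a phase boundary is where schedules (lr, which layers adapt)
        # would change; details are in the PR #1855 harness
        for g in opt.param_groups:
            g["lr"] *= 0.5
```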

Seeds note

The three runs use byte-identical training code, differing only in Hyperparameters.seed = N (line 479) and four cosmetic TEST_ID/TEST_DATE/RUN_KIND/blurb fields. The committed train_gpt.py is the seed-42 run.

Reproduce

SKIP_GPTQ=1 SEED=42 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-05-01_Mockingbird_8xH100/train_gpt.py

For seeds 0 / 1, change line 479 (Hyperparameters.seed = 42) to the desired seed.

Lineage

(lineage chart not reproduced here)

Couldn't beat this bad boy in a day, nice job everyone. My last gambit was to push the vocab and wind down better, because that's where everything went. I spent the last two weeks trying to push the neural net with kernels and vocabs, and got really stuck on some 12L options for a while... Just never really cracked past 1.079. Re-looked at evals too late in the game (spent a lot of early time on them) to matter! Cheers everyone. Thanks for the ride. I'm a better/smarter person than I was when I started this. May the Schwartz be with you all.

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠚⠓⠤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⡾⣅⠀⠀⠀⠀⣨⢷⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⣀⡤⡦⣄⡀⠀⡇⠀⠙⢢⡔⠋⠀⢸⠀⢀⣠⢴⢤⣀⠀⠀
⡴⠊⠁⠀⢷⠀⠉⠲⣇⠀⠀⢸⡇⠀⠀⣸⠖⠉⠀⡼⠀⠈⠑⢦
⣧⠀⠀⢀⣸⡀⠀⠀⢿⠙⠲⢼⣧⠖⠋⡿⠀⠀⢀⣇⡀⠀⠀⣸
⢹⡴⠚⠉⠀⠈⠑⠦⣼⠀⠀⢸⡇⠀⠀⣧⡴⠊⠁⠀⠉⠓⢦⡏
⠀⠈⠓⢤⣀⡤⠖⠋⠁⠙⠲⣼⣧⠖⠋⠈⠙⠲⢤⣀⡤⠚⠁⠀
⢀⡠⠖⠉⠀⠉⠓⠦⣄⠴⠚⢹⡏⠓⠦⣠⡴⠚⠉⠀⠉⠲⢄⡀
⣼⠙⠲⢤⣀⡠⠔⠋⢹⠀⠀⣸⣇⠀⠀⣏⠙⠲⢄⣀⡤⠖⠋⣧
⡏⠀⠀⠀⢸⠀⠀⠀⣿⠴⠚⢹⡏⠓⠦⣿⠀⠀⠀⡇⠀⠀⠀⢸
⠙⠢⣄⡀⡟⣀⡤⠚⡇⠀⠀⢸⡇⠀⠀⢸⠓⠤⣀⢹⢀⣠⠔⠋
⠀⠀⠀⠉⠋⠁⠀⠀⡇⣀⠴⠊⠑⠦⣀⢸⠀⠀⠈⠙⠉⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠻⣅⠀⠀⠀⠀⣨⠟⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠲⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀

Octavian and others added 2 commits May 1, 2026 07:41
Non-record submission filing the 10k-vocab (SP10240 lossless-caps CaseOps)
sister of PR openai#1855. The architecture is held fixed; only the tokenizer /
vocab dimension changes (8192 -> 10240), with the body shrunk to MLP3.75
to stay under the 16 MB artifact cap.

Results (3-seed mean over [42, 0, 1]):
  val_bpb_exact = 1.06243460   (max bytes_total = 15,818,783)

  seed 42: 1.06204667 BPB  ·  15,816,988 B
  seed  0: 1.06226648 BPB  ·  15,818,783 B
  seed  1: 1.06299064 BPB  ·  15,810,544 B

Architecture: 11L · dim 512 · mlp_mult=3.75 · loop_start=3, loop_end=5,
enable_looping_at=0.45. Quant: per-group, embed int7, matrix int6, LQER
asym rank-4. Eval: PR1855 phased LoRA TTT (prefix_docs=2500, phases=3,
chunk=48). 8xH100 SXM, 600s wallclock, FineWeb 10B SP10240 CaseOps.

Filed for comparison with the SP8192 lane in PR openai#1855 (3-seed mean
1.06107587). Mockingbird does not beat PR openai#1855; it documents the cost
of the vocab swap on otherwise-identical machinery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add appendix subdir tokenization_10kvocab/ inside the records folder, holding
everything needed to inspect or reproduce the SP10240 CaseOps tokenization
stack the seed runs trained on:

  tokenizer/   the actual lossless_caps_caseops_v1_reserved tokenizer
               (.model + .vocab) used by all three seeds, plus the standard
               SP10240 BPE for diff reference and the BPE training spec
  build/       the one-command rebuild driver, HF upload helper, and the
               actual SentencePiece trainer log from the build
  caseops/     the CaseOps codec (lossless_caps.py — 4 reserved operators),
               the end-to-end prep_sp10240_caseops_data.py, build/upload
               drivers, and HF download helpers (first80 and full124)
  notes/       the HF-lane derivation note and the byte-fit plan that
               explains why the body was held at MLP3.75

Full preprocessed dataset (~5 GB) is published at
huggingface.co/datasets/Frosty40/10k_golfer; download scripts in caseops/
pull it with the standard HF CLI.

This is appendix material — the canonical submission remains train_gpt.py
+ submission.json + the three seed logs in the parent directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
newjordan (Author) commented

Pushed 1521d9f adding tokenization_10kvocab/ inside the records folder for full reviewer access to the SP10240 CaseOps stack:

The original train_gpt.py + seed logs in the parent directory remain the canonical submission; this is appendix material for verifying the tokenization stack.
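For anyone skimming without pulling the appendix, the lossless-caps idea is roughly: lowercase the text before BPE so the 10,240 merges aren't spent on case variants, and emit reserved operator tokens so casing round-trips exactly. A toy sketch with hypothetical markers; the real 4 reserved operators are defined in caseops/lossless_caps.py and will differ, and the real codec reserves actual tokenizer IDs so markers can never collide with raw text.

```python
import re

# Hypothetical reserved strings standing in for two of the operators:
# CAP = "capitalize next word", UPPER = "uppercase next word".
CAP, UPPER = "\u241fC", "\u241fU"

def encode(text: str) -> str:
    def fix(m):
        w = m.group(0)
        if w.isupper() and len(w) > 1:
            return UPPER + w.lower()
        if w[:1].isupper() and w[1:].islower():
            return CAP + w.lower()
        return w                        # mixed case: leave untouched
    return re.sub(r"[A-Za-z]+", fix, text)

def decode(text: str) -> str:
    text = re.sub(re.escape(UPPER) + r"([a-z]+)",
                  lambda m: m.group(1).upper(), text)
    text = re.sub(re.escape(CAP) + r"([a-z]+)",
                  lambda m: m.group(1).capitalize(), text)
    return text

s = "The NASA team met McArthur in May."
assert decode(encode(s)) == s           # casing round-trips losslessly
```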

newjordan (Author) commented May 1, 2026

GPTQ 32, a little custom int6-int7 re-smoothing to push to 16 MB, and my smoothing/chunk-gated TTT brought it down more, but not enough to clutter the leaderboards. Final score for me was a flat 1.060 on the 10k vocab. I do think the best place to spend time on this model is the loop/U-Net relationship (sketched below), and if I had more time this is where I would spend it - the two have a symbiotic relationship that can be pushed further.

If it matters, or the organizers want to see it (I do not think it does) - I have a mountain of daily tests from the last 44 days: 8-10 hours of work daily, testing across multiple GPUs all the time. That data is useful to me now, and I can always pull techniques from it to optimize projects. Eventually I will try to get it organized for public consumption. Cheers and thanks for the ride.
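For context on the loop/U-Net point, here is a schematic guess reconstructed from the loop_start=3, loop_end=5, enable_looping_at=0.45 knobs in the PR body: mid-stack blocks appear to get a second, weight-tied pass once 45% of training has elapsed. This is not the submission's wiring (that is in train_gpt.py), and the exact indexing convention for the looped span is unknown.

```python
import torch.nn as nn

class LoopedStack(nn.Module):
    def __init__(self, blocks, loop_start=3, loop_end=5):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loop_start, self.loop_end = loop_start, loop_end

    def forward(self, x, progress: float, enable_looping_at=0.45):
        for i, block in enumerate(self.blocks):
            x = block(x)
            # once looping switches on, re-run the looped span
            # (weight-tied recurrence over blocks loop_start..loop_end)
            if i == self.loop_end - 1 and progress >= enable_looping_at:
                for j in range(self.loop_start, self.loop_end):
                    x = self.blocks[j](x)
        return x
```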
