non-record - mockingbird - sota copy/10kvocab #2120

newjordan wants to merge 2 commits into openai:main

Conversation
Non-record submission filing the 10k-vocab (SP10240 lossless-caps CaseOps) sister of PR openai#1855. The architecture is held fixed; only the tokenizer / vocab dimension changes (8192 -> 10240), with the body shrunk to MLP3.75 to stay under the 16 MB artifact cap.

Results (3-seed mean over [42, 0, 1]): val_bpb_exact = 1.06243460 (max bytes_total = 15,818,783)

- seed 42: 1.06204667 BPB · 15,816,988 B
- seed 0: 1.06226648 BPB · 15,818,783 B
- seed 1: 1.06299064 BPB · 15,810,544 B

Architecture: 11L · dim 512 · mlp_mult=3.75 · loop_start=3, loop_end=5, enable_looping_at=0.45.
Quant: per-group, embed int7, matrix int6, LQER asym rank-4.
Eval: PR openai#1855 phased LoRA TTT (prefix_docs=2500, phases=3, chunk=48).
Hardware: 8xH100 SXM, 600s wallclock, FineWeb 10B SP10240 CaseOps.

Filed for comparison with the SP8192 lane in PR openai#1855 (3-seed mean 1.06107587). Mockingbird does not beat PR openai#1855; it documents the cost of the vocab swap on otherwise-identical machinery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
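For readers unfamiliar with the quant line above, here is a minimal sketch of per-group asymmetric quantization plus an LQER-style rank-4 SVD correction of the quantization error. The group size and every name below are illustrative assumptions, not the submission's actual code.

```python
# Sketch: per-group asymmetric int6 quantization with a rank-4
# low-rank correction of the quantization error (LQER-style).
# Group size (32) and all names are illustrative assumptions.
import torch

def quantize_per_group(w: torch.Tensor, bits: int = 6, group: int = 32):
    """Asymmetric per-group quantization along the last dim."""
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group, group)
    lo = wg.amin(dim=-1, keepdim=True)
    hi = wg.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / (2**bits - 1)
    q = ((wg - lo) / scale).round().clamp(0, 2**bits - 1)
    deq = (q * scale + lo).reshape(rows, cols)
    return q, scale, lo, deq

def lqer_correction(w: torch.Tensor, deq: torch.Tensor, rank: int = 4):
    """Rank-r SVD approximation of the quantization error W - Q(W)."""
    err = (w - deq).float()
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (rows, rank), stored with the int weights
    b = vh[:rank, :]             # (rank, cols)
    return a, b

w = torch.randn(512, 512)
q, scale, lo, deq = quantize_per_group(w, bits=6, group=32)
a, b = lqer_correction(w, deq, rank=4)
w_hat = deq + a @ b              # dequantized weights + low-rank correction
print((w - deq).norm().item(), (w - w_hat).norm().item())  # error shrinks
```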
Appendix: subdir `tokenization_10kvocab/` inside the records folder, holding everything needed to inspect or reproduce the SP10240 CaseOps tokenization stack the seed runs trained on:

- `tokenizer/`: the actual `lossless_caps_caseops_v1_reserved` tokenizer (`.model` + `.vocab`) used by all three seeds, plus the standard SP10240 BPE for diff reference and the BPE training spec
- `build/`: the one-command rebuild driver, HF upload helper, and the actual SentencePiece trainer log from the build
- `caseops/`: the CaseOps codec (`lossless_caps.py`, 4 reserved operators; a minimal codec sketch follows this list), the end-to-end `prep_sp10240_caseops_data.py`, build/upload drivers, and HF download helpers (first80 and full124)
- `notes/`: the HF-lane derivation note and the byte-fit plan that explains why the body was held at MLP3.75
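For orientation, a minimal sketch of the general shape of a lossless-caps codec with reserved operators. The four operator characters and the round-trip rules below are assumptions for illustration, not the actual contents of `lossless_caps.py`.

```python
# Illustrative lossless-caps codec: lowercase the text and mark case with
# reserved operator tokens so the original string is exactly recoverable.
# The four operator characters are assumptions, not the real reserved set.
CAP, UPPER, END, LIT = "\u2407", "\u2408", "\u2409", "\u240A"

def encode(text: str) -> str:
    out = []
    for word in text.split(" "):
        if word.isupper() and len(word) > 1:
            out.append(UPPER + word.lower())       # ALLCAPS word
        elif word[:1].isupper() and word[1:].islower():
            out.append(CAP + word.lower())         # Capitalized word
        elif word == word.lower():
            out.append(word)                       # already lowercase
        else:
            out.append(LIT + word + END)           # mixed case: pass through
    return " ".join(out)

def decode(text: str) -> str:
    out = []
    for word in text.split(" "):
        if word.startswith(UPPER):
            out.append(word[1:].upper())
        elif word.startswith(CAP):
            out.append(word[1:].capitalize())
        elif word.startswith(LIT):
            out.append(word[1:].rstrip(END))
        else:
            out.append(word)
    return " ".join(out)

s = "NASA launched Artemis with iPhone telemetry"
assert decode(encode(s)) == s  # lossless round trip
```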
The full preprocessed dataset (~5 GB) is published at
huggingface.co/datasets/Frosty40/10k_golfer; the download scripts in caseops/
pull it with the standard HF CLI.
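If you would rather not use the bundled scripts, the same snapshot can be pulled directly with the `huggingface_hub` Python API; the snippet below is a plain equivalent, not the scripts themselves.

```python
# Pull the published dataset snapshot with huggingface_hub
# (the same thing the bundled caseops/ download scripts automate).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Frosty40/10k_golfer",
    repo_type="dataset",
)
print("dataset downloaded to", local_dir)
```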
This is appendix material — the canonical submission remains train_gpt.py
+ submission.json + the three seed logs in the parent directory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GPTQ 32, a lil custom int6-int7 re-smoothing to push under 16 MB, and my smoothing/chunk-gated TTT brought it down more, but not enough to clutter the leaderboards. Final score for me was a flat 1.060 on the 10k vocab. I do think the best place to spend time on this model is the loop/U-Net relationship, and if I had more time this is where I would spend it - the two have a symbiotic relationship that can be pushed further. If it matters, or the organizers want to see it (I do not think it does), I have a mountain of daily tests from the last 44 days: 8-10 hours of work daily, testing across multiple GPUs all the time. That data is useful to me now, and I can always pull techniques from it to optimize projects. Eventually I will try to get it organized for public consumption. Cheers and thanks for the ride.
non-record — mockingbird — SOTA copy / 10k vocab
Non-record submission. The SP10240 CaseOps sister of PR #1855 — same body, same compression / phased-TTT machinery, vocab swapped 8192 → 10240 with the body shrunk to MLP3.75 to stay under the 16 MB cap.
Filed for comparison, not as a record claim.
Results
3-seed mean (seeds 42 / 0 / 1): val_bpb_exact = 1.06243460 · max bytes_total = 15,818,783

| seed | val_bpb_exact | bytes_total |
| --- | --- | --- |
| 42 | 1.06204667 | 15,816,988 B |
| 0 | 1.06226648 | 15,818,783 B |
| 1 | 1.06299064 | 15,810,544 B |
For reference: PR #1855 (SP8192) reported 3-seed mean 1.06107587 post-phased-TTT. Mockingbird is +0.00136 worse — the cost of the 10k vocab swap on otherwise-identical machinery.
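Both lanes are scored after the same phased LoRA test-time-training pass. For orientation, here is a rough sketch of the shape of such a loop, adapting low-rank adapters on prefix documents split into phases of fixed-size chunks before scoring; all names, the optimizer, and the learning rate are assumptions, not PR #1855's actual procedure.

```python
# Rough shape of a phased, chunked LoRA test-time-training eval.
# Everything here is an illustrative assumption, not PR #1855's code.
import torch
import torch.nn.functional as F

def phased_lora_ttt(model, lora_params, prefix_docs, phases=3, chunk=48):
    """prefix_docs: list of 1-D LongTensors of token ids (e.g. 2500 docs)."""
    opt = torch.optim.Adam(lora_params, lr=1e-4)
    per_phase = max(1, len(prefix_docs) // phases)
    for phase in range(phases):
        for doc in prefix_docs[phase * per_phase:(phase + 1) * per_phase]:
            # Slide over the doc in fixed-size chunks of `chunk` tokens.
            for start in range(0, doc.numel() - 1, chunk):
                window = doc[start:start + chunk + 1]
                inputs, targets = window[:-1], window[1:]
                logits = model(inputs.unsqueeze(0))      # (1, T, vocab)
                loss = F.cross_entropy(logits.squeeze(0), targets)
                opt.zero_grad()
                loss.backward()
                opt.step()   # only the LoRA adapters are updated
```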
Hardware: 8×H100 SXM · 600 s wallclock · FineWeb 10B SP10240 CaseOps
bytes_code: 163,036 (uncompressed) / 41,220 (compressed)

Architecture

11L · dim 512 · `mlp_mult=3.75` · `loop_start=3`, `loop_end=5`, `enable_looping_at=0.45`

Eval: phased LoRA TTT (`prefix_docs=2500`, `phases=3`, `chunk=48`)
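One plausible reading of the loop flags, for orientation only: blocks in `[loop_start, loop_end)` get a second, weight-tied pass once training progress crosses `enable_looping_at`. The sketch below is an assumption about those semantics, not `train_gpt.py`'s forward pass.

```python
# Illustrative looped-block schedule for loop_start=3, loop_end=5,
# enable_looping_at=0.45: blocks 3-4 run twice (weight-tied) once
# training is at least 45% complete. An assumed reading of the flags.
import torch
import torch.nn as nn

class LoopedStack(nn.Module):
    def __init__(self, n_layers=11, dim=512, loop_start=3, loop_end=5,
                 enable_looping_at=0.45):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.loop_start, self.loop_end = loop_start, loop_end
        self.enable_looping_at = enable_looping_at

    def forward(self, x, progress: float):
        for i, block in enumerate(self.blocks):
            x = block(x)
            # Second, weight-tied pass over the looped span once enabled.
            if (self.loop_start <= i < self.loop_end
                    and progress >= self.enable_looping_at):
                x = block(x)
        return x

stack = LoopedStack()
x = torch.randn(2, 16, 512)
y_early = stack(x, progress=0.10)  # looping still disabled
y_late = stack(x, progress=0.60)   # blocks 3-4 applied twice
```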
Seeds note

The three runs use byte-identical training code, differing only in `Hyperparameters.seed = N` (line 479) and four cosmetic `TEST_ID`/`TEST_DATE`/`RUN_KIND`/blurb fields. The committed `train_gpt.py` is the seed-42 run.

Reproduce

For seeds 0 / 1, change line 479 (`Hyperparameters.seed = 42`) to the desired seed.
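A minimal way to script that one-line change, assuming line 479 literally reads `seed = 42` inside the `Hyperparameters` class (the regex below is an assumption about that line's exact form):

```python
# Flip the committed seed-42 run to seed 0 before launching.
# Assumes the line reads "seed = 42"; adjust the pattern if annotated.
import pathlib
import re

path = pathlib.Path("train_gpt.py")
src = path.read_text()
patched = re.sub(r"^(\s*seed\s*=\s*)42\b", r"\g<1>0", src,
                 count=1, flags=re.MULTILINE)
path.write_text(patched)
```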
Lineage

- 510d03e0fc355406c9fd06f92d23b8c5aedea7fb
- upstream/main at fdde8dc (PR #1868: Record: SmearGate BOS Fix 3-Seed Compliance Re-run — val_bpb 1.06141 (3-seed mean))

Couldn't beat this bad boy in a day, nice job everyone. My last gambit was to push the vocab and wind down better, because that's where everything went. I spent the last two weeks trying to push the network with kernels and vocabs, and got really stuck on some 12L options for a while... Just never really cracked past 1.079. Re-looked at evals too late in the game (spent a lot of early time on them) to matter! Cheers everyone. Thanks for the ride. I'm a better/smarter person than I was when I started this. May the Schwartz be with you all.