Record candidate: PR #1855 + TTT_LORA_RANK=56 — val_bpb 1.05997 (s42) #1935
Closed
vimeto wants to merge 1 commit into openai:main from
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 30, 2026:
Re-evaluates each spec 250 seed's final_model.pt with TTT_LORA_RANK=56 (PR openai#1935's lever) at eval time only. Single env var override, no retrain, no code change. Hedge against deadline-day risk. Promote criterion: 3-seed mean paired Δ ≤ −0.0005 vs spec 250 same-seed. Cost: ~$9-15 (3 seeds × ~$3 eval-only).
Author
Closing this PR — the multi-seed verification work it promised in the test plan has been consolidated into PR #2157 with H100-side reference logs. PR #2157 continues the b180-tlr56 lineage but adds AWQ-lite top_k=3 + LQER 60k on top, since those levers became available in the meantime (PR #1908 lineage). The single-seed headline there (SEED=0, val_bpb 1.06043) is a continuation of the same direction this PR was exploring — the SEED=0 + SEED=1234 logs originally promised here are now in PR #2157 with the slightly evolved recipe. Closing here for queue cleanliness; please review PR #2157 instead.
Summary
Builds on #1855 (SP8192 + LQER + SparseAttnGate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack). Two hparam changes: QK_GAIN_INIT=6.0 (was 5.0) and TTT_LORA_RANK=56 (was 80).
Single seed (SEED=42): val_bpb 1.05997. Beats #1855's 3-seed mean (1.06108) by 0.00111 BPB. Sits between #1855's 3-seed mean and its single-seed best (1.05989, also at SEED=42).
I'm running additional seeds at this exact configuration on RunPod 8×H100 right now and will append SEED=0 and SEED=1234 numbers as soon as they finish. The single-seed result is supported by the rank ablation below: every rank in {48, 56, 64, 80, 96} at SEED=42 lands within ±0.0030 BPB of #1855's 3-seed mean, so the regime looks well-behaved.
TTT_LORA_RANK ablation (SEED=42, QK_GAIN=6.0, all else = #1855)
Clear inverted-U with the optimum at rank 56. PR #1855's greedy-keep stopped at 80 without probing smaller ranks. Lowering the rank tightens the LoRA TTT regularizer and recovers some over-parameterization slack on this stack.
QK_GAIN_INIT (motivation)
Per-head learnable Q-K gain, initialised at 5.0 in #1855. I tested {5.0, 5.25, 6.0} at SEED=42 in our lineage and 6.0 was the local optimum. The rank ablation above is at QK=6.0; if you want exact #1855 reproducibility, revert this single line. I expect the rank-56 win to transfer to QK=5.0 with similar magnitude (untested). The value 6.0 is what I've used throughout the b180 series.
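For reviewers who haven't seen this lever: a minimal sketch of what a per-head learnable Q-K gain can look like, reading QK_GAIN_INIT from the environment. This is illustrative, not the module in train_gpt.py; in particular, applying the gain directly to the attention logits is an assumption here.

```python
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

QK_GAIN_INIT = float(os.environ.get("QK_GAIN_INIT", "6.0"))

class PerHeadQKGain(nn.Module):
    """Per-head learnable gain on the Q·K attention logits (illustrative sketch)."""

    def __init__(self, n_head: int, head_dim: int):
        super().__init__()
        # One learnable scalar per head, initialised from QK_GAIN_INIT (5.0 in #1855, 6.0 here).
        self.gain = nn.Parameter(torch.full((n_head,), QK_GAIN_INIT))
        self.head_dim = head_dim

    def forward(self, q, k, v):
        # q, k, v: (batch, n_head, seq, head_dim)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        logits = logits * self.gain.view(1, -1, 1, 1)  # per-head scaling of the logits
        return F.softmax(logits, dim=-1) @ v
```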
Embedding-side ablations: sparse-FP16, factorization, mixed-precision, vocab sweep
This is the most novel arc of the sprint and the part I want logged for posterity even though it's negative on this artifact budget. Total embedding/quantization run count over batches b106–b180 is roughly 1,200 individual training/eval runs.
Trained-in unstructured sparse FP16 embeddings
The goal: fit a wider vocab (SP12k or SP16k CaseOps) under the 16 MB cap without dropping the embed below FP16. INT5/INT4 GPTQ on the embed is catastrophic; see the vocab sweep below.
How it works: a per-row top-K mask on tok_emb, applied during training and then re-applied post-EMA so the sparse_fp16 serializer (bitmap + FP16 non-zeros) actually engages at quantize time. Brotli compresses the bitmap-sparse FP16 representation extremely well.
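The core of the mask is tiny. A minimal sketch of the per-row top-K idea (the in-repo _apply_sparse_embed also handles the per-step scheduling and LR scaling listed in the recipe below, which this omits):

```python
import torch

@torch.no_grad()
def topk_mask_rows(tok_emb: torch.Tensor, keep_frac: float) -> torch.Tensor:
    """Zero all but the top-K (by magnitude) entries of each embedding row.

    Keep the returned mask so it can be re-applied after the EMA update; otherwise the
    serialized weights are merely small, not sparse, and the bitmap+FP16 path never engages.
    """
    V, d = tok_emb.shape
    k = max(1, int(keep_frac * d))
    idx = tok_emb.abs().topk(k, dim=1).indices          # top-k indices per row
    mask = torch.zeros_like(tok_emb, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    tok_emb.mul_(mask.to(tok_emb.dtype))                # apply in place during training
    return mask

# After EMA: model_ema.tok_emb.weight.mul_(mask.to(dtype))  # re-apply so sparse_fp16 engages
```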
Compression study (Phase A: what sparsity is required to fit each vocab)
Training-time recipe (final, what worked)
- _apply_sparse_embed(base_model, h, step, train_frac, lr_scale) updates the per-row top-K mask each step.
- Without the post-EMA re-mask, the sparse_fp16 path dies (sparsity drops from 90 % to under 50 %).
- EMBED_LR=0.3 (vs default 0.6). A lower embed LR is empirically necessary for sparse-embed convergence; without it BPB regresses 0.05–0.10.
- Post-quant eval runs separately (TTT_EVAL_ONLY=1 + DISABLE_EVAL_COMPILE=1), because the in-train post-quant eval segfaults on LUMI ROCm.
Sparsity / vocab grid (Phase B: partial-train, ~830 steps)
Findings:
- The sparse_fp16 path works: top-K + EMA-fix gives brotli 14.9–15.4 MB across SP12k and SP16k.
- EMBED_LR=0.3 is necessary: r3_run7 (low LR) clearly beats r3_run0 (default LR) at the same sparsity.
Full-train + TTT (Phase C, multi-seed at the best sparse config)
Multi-seed range across 4 SP12k+CO seeds: 0.0078 BPB (1.0931–1.1009). Tight, no lottery effect.
Why sparse-FP16 lost (mechanism)
Decomposing the gap from the SP8k baseline:
The sparse mask hurts the embed harder than the larger vocab helps. Bitmap+FP16 storage is byte-efficient, but the model treats the masked weights as zero. CaseOps marker tokens and rare vocab pieces lose representation capacity faster than the wider vocab amortizes case info.
Sparse-FP16 + PR #1855 recipe: does not compose
I then bolted the sparse_fp16 embed onto PR #1855's full hparam stack (BETA2=0.99, MLP_CLIP=11.5, EMBED_CLIP=14.0, WARMDOWN=0.85, TTT_BETA2=0.99, TTT_WD=0.5, TTT_LORA_RANK=80, PHASED_TTT_PREFIX_DOCS=2500, BOS-fix SmearGate). The hypothesis was that #1855's recipe plus a larger vocab via the sparse embed would win.
Pre-quant looks normal, but TTT eval plateaus at rb≈1.32 instead of dropping to ~1.10 the way our default recipe does on the same sparse-embed model. That's +0.260 BPB worse than the shipped non-sparse rank=56 recipe.
Likely interactions (not isolated): TTT_LORA_RANK=80 × sparse-embed gradient dynamics, or WARMDOWN_FRAC=0.85 × sparse-EMA. Recipe stacks don't compose blindly; each one is a small attractor in hparam space. Shelved.
Mixed-precision embedding attempts (failed)
Factorized embedding (ALBERT V × r + r × d)
tok_emb ≈ E_v · E_r with rank r ≪ d. The rank you need at d=512 is too high to fit the budget; ALBERT was designed for BERT (d=768, vocab 30k) and doesn't transfer to small d.
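For scale intuition, a sketch of the factorization being tested (names illustrative):

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """ALBERT-style V×r + r×d factorization of the token embedding (illustrative sketch)."""

    def __init__(self, vocab_size: int, d_model: int, rank: int):
        super().__init__()
        self.e_v = nn.Embedding(vocab_size, rank)        # V × r lookup
        self.e_r = nn.Linear(rank, d_model, bias=False)  # r × d projection

    def forward(self, idx):
        return self.e_r(self.e_v(idx))

# Parameter count is V*r + r*d vs V*d dense, so bytes only drop when r < V*d/(V+d),
# which is already ≈ d for V ≫ d; at d=512 any rank that saves real bytes also costs
# a large share of embed capacity.
```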
Aggressive embed quantization to fit a larger vocab
INT ≤ 5 on the embed degrades TTT much more than the vocab gain at SP32k recovers. The quantization Pareto floor for the embed at our scale is INT7.
Frequency-aware mixed-bit embed (per-row bits by token frequency)
The idea: high-frequency tokens get more bits, tail rows get fewer. Staged in specs/batch125/run_atlas.py, using the BPB damage atlas to allocate bits per row.
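A rough sketch of the allocation shape (the real run_atlas.py allocation is atlas-driven rather than raw-frequency-driven; the bucketing and fake-quant below are illustrative):

```python
import numpy as np

def bits_per_row(token_counts: np.ndarray, hi_bits: int = 8, lo_bits: int = 4,
                 hi_frac: float = 0.1) -> np.ndarray:
    """Give the most frequent hi_frac of rows hi_bits, everything else lo_bits."""
    order = np.argsort(-token_counts)                    # most frequent first
    bits = np.full(len(token_counts), lo_bits, dtype=np.int64)
    bits[order[: int(hi_frac * len(token_counts))]] = hi_bits
    return bits

def fake_quant_rows(emb: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Symmetric per-row fake-quantization at each row's assigned bit width."""
    out = np.empty_like(emb)
    for i, b in enumerate(bits):
        scale = max(float(np.abs(emb[i]).max()), 1e-12) / (2 ** (int(b) - 1) - 1)
        out[i] = np.round(emb[i] / scale) * scale
    return out
```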
Sparse + low-rank residual (LSR) embeddings
tok_emb = sparse_topk + low_rank_residual: the sparse path captures the structured info, the residual covers the rest with a low-rank approximation. Implementation in specs/batch125/run_lsr.py.
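Sketch of the decomposition (the fit shown here, per-row top-K followed by a truncated SVD of the remainder, is illustrative and not necessarily how run_lsr.py fits it):

```python
import torch

def lsr_decompose(tok_emb: torch.Tensor, keep_frac: float, rank: int):
    """Split an embedding into a sparse top-K part plus a low-rank residual (illustrative)."""
    V, d = tok_emb.shape
    k = max(1, int(keep_frac * d))
    idx = tok_emb.abs().topk(k, dim=1).indices
    sparse = torch.zeros_like(tok_emb).scatter_(1, idx, tok_emb.gather(1, idx))
    # Low-rank fit of whatever the sparse part misses.
    U, S, Vh = torch.linalg.svd(tok_emb - sparse, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # V × rank
    B = Vh[:rank]                  # rank × d
    return sparse, A, B            # reconstruct as sparse + A @ B
```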
STE-trained mixed-bit embed (planned but blocked)
The conceptually right answer: train with an STE forward pass that rounds each row to its bit width, so the model adapts to the exact bit scheme. Implemented, but training was unstable: the STE forward produces gradient-mismatched updates that diverge under Muon. Marked pending and not retried before the deadline.
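The STE pattern in question, sketched minimally (per-row bit widths as in the mixed-bit idea above; the Muon interaction that caused the divergence is not modelled here):

```python
import torch

class RoundToRowBits(torch.autograd.Function):
    """STE: forward rounds each row to its assigned bit grid, backward passes the
    gradient through unchanged (illustrative sketch of the planned scheme)."""

    @staticmethod
    def forward(ctx, w, row_bits):                 # w: (V, d) float, row_bits: (V,) int
        qmax = (2 ** (row_bits.float() - 1) - 1).unsqueeze(1)
        scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / qmax
        return torch.round(w / scale) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                      # straight-through w.r.t. w only

# The forward pass would then use F.embedding(idx, RoundToRowBits.apply(tok_emb, row_bits)).
```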
Vocab sweep (SP8k → SP32k, with corresponding INT-bits)
For reference (full-train, full TTT, single-seed):
Doubling the vocab gives roughly −0.008 to −0.01 BPB at TTT, and it's not yet saturating at SP32k. But compression eats the vocab gain at this artifact budget: the "if compression were free" floor is the SP32k INT7 number, 1.05076, which I never figured out how to ship.
CaseOps tokenizer ("lossless caps"): biggest single delta of the sprint
fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model reserves operator tokens (U+E001–U+E004) for case transformations (TitleCase / UPPER / lowercase / single-cap-letter), so vocab pieces only need to encode case-insensitive surface forms. About 9.9 % of val tokens are case markers; CaseOps amortizes case info into a shared marker, freeing vocab budget for content patterns. All subsequent SP8k results in this PR are on the CaseOps base, including the shipped recipe.
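To make the marker idea concrete, a toy word-level round trip. This is not the actual lossless_caps.py scheme (marker assignment, granularity and edge-case handling there are more involved); it only shows why the vocab can stay case-insensitive:

```python
# Toy illustration: lowercase the text and emit a private-use marker that records how
# to restore casing on decode. Marker-to-transform assignment here is assumed.
TITLE, UPPER = "\uE001", "\uE002"

def caseops_encode(text: str) -> str:
    out = []
    for w in text.split(" "):
        if w.isupper() and len(w) > 1:
            out.append(UPPER + w.lower())
        elif w[:1].isupper() and w[1:].islower():
            out.append(TITLE + w.lower())
        else:
            out.append(w)                 # unchanged words need no marker
    return " ".join(out)

def caseops_decode(text: str) -> str:
    out = []
    for w in text.split(" "):
        if w.startswith(UPPER):
            out.append(w[1:].upper())
        elif w.startswith(TITLE):
            out.append(w[1:].capitalize())
        else:
            out.append(w)
    return " ".join(out)

assert caseops_decode(caseops_encode("NASA Launches rockets")) == "NASA Launches rockets"
```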
Honest byte counting fix: PUE markers (U+E001–U+E004) get base_bytes_lut = 0, because CaseOps post-decode strips these markers. Without the fix the byte denominator inflates by ~9 % and BPB is undercounted by ~0.08. ZERO_PUE_MARKERS=1 ships in train_gpt.py.
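The denominator side of the fix in one formula (how train_gpt.py wires base_bytes_lut is not reproduced here; the point is only that marker tokens add loss to the numerator but zero bytes to the denominator):

```python
import math
import torch

def val_bpb(nll_nats: torch.Tensor, token_ids: torch.Tensor,
            base_bytes_lut: torch.Tensor) -> float:
    """Bits-per-byte with an honest denominator: rows of base_bytes_lut for the PUE
    markers are 0, so case markers never inflate the byte count."""
    total_bits = nll_nats.sum().item() / math.log(2)
    total_bytes = base_bytes_lut[token_ids].sum().item()
    return total_bits / total_bytes
```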
LQER capacity sweep (low-rank quant error recovery)
The best post-hoc compensation method I found for INT6 GPTQ residual error. Sweep at b115 base + DISABLE_EVAL_COMPILE + TTT_FIXED_SEQ_COMPILE:
top_k=4 rank=6 (asym, group=64) is the sweet spot. More tensors (top_k 3→4) helps as long as rank stays moderate; increasing rank past 6 overfits the LQER residual to calibration. Bytes overhead: +40–50 KB. This is what this PR ships.
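Mechanically, LQER is a truncated SVD of the GPTQ residual stored as two skinny FP16 factors per compensated tensor; top_k is how many tensors get a correction. A sketch under those assumptions (the calibration-driven tensor selection from the sweep is not shown):

```python
import torch

def lqer_factors(w_fp: torch.Tensor, w_deq: torch.Tensor, rank: int):
    """Low-rank quant-error recovery: factor the residual so inference can use
    w_deq + A @ B instead of w_deq alone (illustrative sketch)."""
    err = w_fp - w_deq                                   # what INT6 GPTQ got wrong
    U, S, Vh = torch.linalg.svd(err, full_matrices=False)
    A = (U[:, :rank] * S[:rank]).half()                  # out_features × rank
    B = Vh[:rank].half()                                 # rank × in_features
    return A, B

# The PR quotes +40–50 KB total artifact overhead for the top_k=4, rank=6 factors.
```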
After a lot of sweeping in specs/batch124/run_atlas.py, the post-quant pre-TTT BPB floor at the cap is ~1.11450 (b172 SP8k+CO baseline). None of the quant-side levers swept there moved the shippable Pareto floor by more than ~0.0005 BPB.
Real progress requires training-side levers, not quant tricks: STE-based mixed-bit embed, quant-grid-aware Muon, vocab/tokenizer changes (CaseOps wins ~10×).
QAT noise scale (training-side quant friendliness)
Quantization-aware training during warmdown, by adding noise to the weights.
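A minimal sketch of the injection, assuming the noise is scaled to each tensor's quantization step (the exact scaling, schedule, and which tensors get noised are assumptions here, not read from the sweep code):

```python
import torch

@torch.no_grad()
def add_quant_noise(w: torch.Tensor, bits: int = 6, noise_scale: float = 0.20) -> None:
    """During warmdown, perturb weights by uniform noise on the order of the INT-bits
    quantization step so the optimizer settles in quantization-friendly basins."""
    step = w.abs().max().clamp_min(1e-12) / (2 ** (bits - 1) - 1)   # per-tensor quant step
    w.add_((torch.rand_like(w) - 0.5) * step * noise_scale)
```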
Sweet spot at 0.20. Didn't ship in this PR because it doesn't compose with #1855's WARMDOWN_FRAC=0.85 recipe at our wallclock budget.
GPTQ block size
Default 128 is correct. Don't change.
Per-group lrzip artifact
Composition follows #1855 exactly: per-group bucketing + L1 sim-sort on hot 2D groups + lrzip ZPAQ on each group blob.
train_gpt.py wrapper
Roundtrip verified lossless: 275 quant tensors decompress byte-exact via Docker amd64 lrzip 0.651 (scripts/pergroup_lrzip_recompress.py --roundtrip).
Eval host needs apt install lrzip (same as #1855).
Test plan
Files changed
- records/submission_2026_04_29_b180_tlr56_SUB106/ (new)
- final_model.int6.ptz: per-group lrzip quant blob (15,920,473 B)
- train_gpt.py: recipe. Behavioural diff vs #1855 is QK_GAIN_INIT=6.0 and TTT_LORA_RANK=56. The source-level diff is bigger because this is our ROCm-ported lineage with FA fallback / inductor shims, but the H100 path is structurally equivalent.
- submission.json: metadata (val_bpb 1.05997, ablation tables, artifact byte breakdown)
- lossless_caps.py, prepare_caseops_data.py: CaseOps preprocessing (unchanged from the #1855 lineage)
- fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model: tokenizer
- lumi_train_ttt_17928157.log: combined MI250X training + TTT eval log (rank=56 SEED=42, single SLURM job, ~37 KB)