Record: MHA Path + 1855 9-hparam Stack + PR #1948 + PR #1855 (val_bpb = 1.06184, 3-seed) #1987
TimS-ml wants to merge 3 commits into openai:main
Conversation
(val_bpb = 1.06184, 3-seed)
- 3-seed mean 1.06184 (σ 0.000379) across seeds {1334, 999, 42}
- Per-seed: 1.06222 (s1334) / 1.06183 (s999) / 1.06146 (s42)
- Artifact (max): 15,844,523 bytes (155 KB headroom under the 16 MB cap)
- All seeds compliant on training (584.1 s + 16 s GPTQ reserve = 600.1 s) and eval (576.5–591.0 s)
- MHA path: KV=4 GQA → KV=8 MHA, MLP_MULT 4.0 → 3.5 (param-matched at ~36M)
- 1855 9-hparam tuning stack (BETA2/TTT_BETA2=0.99, TTT_LORA_RANK=80, WARMDOWN_FRAC=0.85, etc.)
- Carries PR openai#1948 wins: ReLU² slope=0.3 + GPTQ reverse-Cholesky Hinv path
- Compressor: lrzip pergroup (PR openai#1586 via openai#1667/openai#1729)
- Includes per-seed training logs (train_seed{1334,999,42}.log)
SOTA is 1.0611: https://github.com/openai/parameter-golf#leaderboard (PR #1855)
@cocohearts Please see our blog for more detailed write-ups on our thought process and experiments: Ten Minutes, 16 Megabytes (blog).
@cocohearts Following up on Billy's note: the writeup is partly a synthesis of our own several hundred experiments and partly an attempt to trace how the community's technical focus shifted across the 30 days: data → tokenization → architecture → optimizer → quantization → test-time compute. We stopped score-chasing on the last day to put the time into the synthesis instead, since it felt more useful long-term than another tick of BPB. Also glad to see PRs 1987 and 1948 picked up along the way. Notes: https://www.junchengbillyli.com/llm-notes.html Genuinely curious whether the trajectory we traced lines up with what you saw from the inside.
Record: MHA Path + 1855 9-hparam Stack + PR #1948 + PR #1855 (val_bpb = 1.06184, 3-seed)
val_bpb (3-seed mean) = 1.06184 | σ ≈ 0.000379 | ~15.84 MB max (~15.84 MB mean) | 8×H100 SXM | 600 s training + 600 s eval
A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16), with thanks to Prof. Lin Hao (Fordham University) for sponsoring the 8×H100 SXM and 4×RTX 4090 compute used in this submission, Xingyuan Ding for additional experiments, Bill (Yiyuan) Li for meaningful discussions on tokenizers, Lijun Yu (@Lijun-Yu) for his invaluable insights, and Hang Zhou (@greyjoeyzhou) for project discussions.
TL;DR
Extends PR #1948 (Tim Shen & Billy Li's leaky ReLU slope + GPTQ reverse-Cholesky speedup, val_bpb = 1.06242) with the MHA conversion and the 9-hparam tuning stack from PR #1855.
Two algorithmically free wins:
- Leaky ReLU² slope = 0.3: a −0.00073 BPB free win; size-neutral, wallclock-neutral. (A 4-point sweep confirms 0.3 is the minimum; see Key Change 1.)
- GPTQ Hinv path: the `chol → cholesky_inverse → chol(upper)` chain is replaced with a single reverse-Cholesky pass; mathematically equivalent within fp32 ULP, 2.07–2.24× faster in an RTX 4090 cuSOLVER microbench at the GPTQ workload range. (Key Change 2.)

Both are hardcoded inside `train_gpt.py` (the variant from PR #1867), which also ships this PR's compliance-tuned defaults on top of PR #1938: `LQER_TOP_K=3`, `GATED_ATTN_QUANT_GATE=1`, `TTT_BATCH_SIZE=16`, `PHASED_TTT_NUM_PHASES=3`, `GPTQ_RESERVE_SECONDS=16`.
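For orientation only, a minimal sketch of what env-overridable hardcoded defaults like these usually look like; the actual mechanism inside `train_gpt.py` may differ, and only the values are taken from this PR:

```python
import os

# Illustrative only: the PR hardcodes these compliance-tuned values as defaults,
# so no env exports are required to reproduce the run.
LQER_TOP_K = int(os.environ.get("LQER_TOP_K", "3"))
GATED_ATTN_QUANT_GATE = int(os.environ.get("GATED_ATTN_QUANT_GATE", "1"))
TTT_BATCH_SIZE = int(os.environ.get("TTT_BATCH_SIZE", "16"))
PHASED_TTT_NUM_PHASES = int(os.environ.get("PHASED_TTT_NUM_PHASES", "3"))
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "16"))
```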
Result

GPTQ reserve-time accounting
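The original accounting table is not reproduced here. As a hedged sketch of the budget split described in the summary (584.1 s training plus a 16 s GPTQ reserve inside the 600 s cap), the reserve is enforced by stopping the training loop early so quantization always fits; the function and callback names below are illustrative, not the PR's code:

```python
import time

TRAIN_BUDGET_SECONDS = 600.0   # competition training cap (per the summary above)
GPTQ_RESERVE_SECONDS = 16.0    # slice held back for GPTQ + artifact packing

def train_with_gptq_reserve(train_step, run_gptq):
    """Illustrative deadline loop: train until only the GPTQ reserve remains."""
    start = time.monotonic()
    deadline = start + TRAIN_BUDGET_SECONDS - GPTQ_RESERVE_SECONDS
    while time.monotonic() < deadline:
        train_step()           # one optimizer step (hypothetical callback)
    run_gptq()                 # quantize + write the artifact inside the reserve
    return time.monotonic() - start
```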
Key Change 1 in PR #1948: Leaky ReLU² slope = 0.3
4-point sweep at fixed seed=42 / 1.0× batch / 600 s wallclock:
Shallow V-shaped minimum at 0.3, size-neutral, no wallclock cost. Hardcoded in `train_gpt.py` lines 694-695 (Triton kernel) and line 910 (eager fallback).
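The kernel itself is not reproduced here. As a sketch of the activation being tuned, one plausible eager-mode reading of `LeakyReLU(0.3)^2` is a leaky ReLU followed by an elementwise square; the Triton kernel in `train_gpt.py` may differ in detail:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.3) -> torch.Tensor:
    # One plausible reading of LeakyReLU(slope)^2: apply the leaky ReLU, then
    # square elementwise. slope=0.0 recovers the plain ReLU^2 MLP activation;
    # the 4-point sweep above found slope=0.3 to be the minimum.
    return F.leaky_relu(x, negative_slope=slope).square()

# Quick sanity check of the two branches.
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu_squared(x))  # negative inputs come back as (slope * x)^2
```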
Key Change 2 in PR #1948: GPTQ reverse-Cholesky Hinv path

Replaces the three-step `chol → cholesky_inverse → chol(upper)` chain with a mathematically equivalent single-pass reverse-Cholesky construction of the `Hinv` factor.
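The actual replacement code is not shown above. The sketch below is one way to realize an equivalent single-pass factorization (a flip-based reverse Cholesky plus one triangular solve), checked against the three-step baseline; it illustrates the identity and is not necessarily the exact formulation in `train_gpt.py`:

```python
import torch

def hinv_factor_baseline(H: torch.Tensor) -> torch.Tensor:
    # Three-step chain being replaced: chol -> cholesky_inverse -> chol(upper).
    L = torch.linalg.cholesky(H)                    # H = L L^T
    Hinv = torch.cholesky_inverse(L)                # explicit H^{-1}
    return torch.linalg.cholesky(Hinv, upper=True)  # H^{-1} = U^T U

def hinv_factor_reverse(H: torch.Tensor) -> torch.Tensor:
    # Single Cholesky on the row/column-reversed matrix, then one triangular
    # solve. With J the exchange matrix and J H J = G G^T, U_r = J G J is an
    # upper-triangular factor of H (H = U_r U_r^T); U_r^{-1} is upper with a
    # positive diagonal, so by uniqueness U_r^{-1} = chol(H^{-1}, upper).
    G = torch.linalg.cholesky(H.flip((-2, -1)))     # chol(J H J), lower
    U_r = G.flip((-2, -1))                          # upper: H = U_r U_r^T
    eye = torch.eye(H.shape[-1], dtype=H.dtype, device=H.device)
    return torch.linalg.solve_triangular(U_r, eye, upper=True)  # U_r^{-1}

# Equivalence check on a random SPD matrix (fp64 to isolate the identity).
A = torch.randn(512, 512, dtype=torch.float64)
H = A @ A.T + 512 * torch.eye(512, dtype=torch.float64)
print((hinv_factor_baseline(H) - hinv_factor_reverse(H)).abs().max())
```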
(The proof uses `chol(H^{-1}, upper)` uniqueness under the positive-diagonal constraint; full derivation in the authors' Stage 7 ablation note.)

RTX 4090 cuSOLVER fp32 microbench:
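The benchmark table itself is not reproduced here. A minimal CUDA-event harness along these lines is how such a comparison is typically measured; the sizes are illustrative and the two functions are the ones from the sketch above, not the PR's internal code paths:

```python
import torch

def bench(fn, H, iters=50):
    # Wall-clock a GPU callable with CUDA events, after one warm-up call.
    fn(H)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(H)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

for n in (256, 512, 1024, 2048):  # illustrative sizes near the GPTQ workload range
    A = torch.randn(n, n, device="cuda", dtype=torch.float32)
    H = A @ A.T + n * torch.eye(n, device="cuda")
    t_base = bench(hinv_factor_baseline, H)
    t_rev = bench(hinv_factor_reverse, H)
    print(f"n={n}: baseline {t_base:.3f} ms, reverse {t_rev:.3f} ms, "
          f"speedup {t_base / t_rev:.2f}x")
```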
Numerics: max relative error ≤ 5.3e-7 across `n = 64..2048`; artifact bytes byte-equivalent within brotli noise. Hardcoded in `train_gpt.py` lines 1870-1874.

Compliance-tuned defaults (this PR vs PR #1948)
This PR's headline change is the MHA conversion + 1855 9-hparam stack: switch from KV=4 GQA to KV=8 MHA, drop MLP_MULT 4.0 → 3.5 to stay cap-legal, then layer on the 9-hparam tuning stack ported from PR #1855.
MHA path (new in this PR)
- `NUM_KV_HEADS`: 4 → 8
- `MLP_MULT`: 4.0 → 3.5
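As a rough illustration of why the MLP multiplier has to drop when GQA becomes MHA: widening KV from 4 to 8 heads adds KV-projection parameters in every layer, and the 4.0 → 3.5 MLP shrink gives those parameters back. The counting code below uses generic, illustrative dimensions (the submission's actual width and head count are not spelled out in this PR) rather than reproducing the 35,945,671-param config:

```python
def attn_mlp_params(d_model: int, n_heads: int, n_kv_heads: int, mlp_mult: float) -> int:
    """Per-layer parameter count for a GQA/MHA attention block plus a 2-matrix MLP."""
    head_dim = d_model // n_heads
    q = d_model * d_model                       # query projection
    kv = 2 * d_model * (n_kv_heads * head_dim)  # key + value projections
    out = d_model * d_model                     # output projection
    d_ff = int(mlp_mult * d_model)
    mlp = 2 * d_model * d_ff                    # fc + proj
    return q + kv + out + mlp

# Illustrative dimensions only (not the submission's actual width/heads):
d_model, n_heads, n_layers = 512, 8, 11
gqa = n_layers * attn_mlp_params(d_model, n_heads, n_kv_heads=4, mlp_mult=4.0)
mha = n_layers * attn_mlp_params(d_model, n_heads, n_kv_heads=8, mlp_mult=3.5)
print(gqa, mha)  # the MLP shrink offsets the extra KV-head parameters
```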
1855 9-hparam stack (new in this PR)

- `WARMDOWN_FRAC` = 0.85
- `BETA2` = 0.99
- `TTT_BETA2` = 0.99
- `TTT_WEIGHT_DECAY` = 0.5
- `TTT_LORA_RANK` = 80
- `SPARSE_ATTN_GATE_SCALE` = 0.5
- `PHASED_TTT_PREFIX_DOCS` = 2500
- `EMBED_CLIP_SIGMAS` = 14
- `MLP_CLIP_SIGMAS` = 11.5

Existing (carried from PR #1948)
- `LQER_TOP_K` = 3
- `GATED_ATTN_QUANT_GATE` = 1
- `TTT_BATCH_SIZE` = 16
- `PHASED_TTT_NUM_PHASES` = 3
- `GPTQ_RESERVE_SECONDS` = 16
- `CASEOPS_ENABLED`
- `SPARSE_ATTN_GATE_ENABLED`
- `COMPRESSOR`: `brotli` → `pergroup` (lrzip; ≈ −270 KB byte savings)
- `EMBED_BITS`
- `MIN_LR`

Architecture
SP8192 CaseOps + 11L MHA(KV=8)/XSA11 + L3-5 depth recurrence x2 + L8+ parallel residual lanes + LeakyReLU(0.3)^2 MLP (mult=3.5) + ln-scale + tied embeddings + SmearGate BOS-safe + SparseAttnGate int8 (gate_scale=0.5) + GPTQ int6 Reverse-Cholesky/SDClip (mlp_clip=11.5) + embed int7 (clip=14) + LQER-asym rank4 top3 + lrzip pergroup + phased TTT LoRA r80 bs16 3ph (prefix_docs=2500, β₂=0.99, wd=0.5) + Adam β₂=0.99 + WARMDOWN_FRAC=0.85
Model size: 35,945,671 params (raw); ≈ 15.84 MB compressed (lrzip pergroup).
Quantization
Full-Hessian GPTQ + SDClip, on the reverse-Cholesky Hinv path:
- GPTQ int6 on attention (`c_q`, `c_k`, `c_v`, `proj`) and MLP (`fc`, `proj`) weights
- LQER-asym rank-4 correction on `tok_emb.weight` only (`LQER_TOP_K=3`)
- Quantized attention gate `attn_gate_w` (`GATED_ATTN_QUANT_GATE=1`)
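As a hedged sketch of the clipping-plus-uniform-quantization step implied by the SDClip/`mlp_clip=11.5` and int6 settings above: the snippet below shows only a sigma-clipped symmetric round-to-nearest quantizer, not the Hessian-aware GPTQ inner loop in `train_gpt.py`, and the function and parameter names are illustrative:

```python
import torch

def sdclip_uniform_quant(w: torch.Tensor, bits: int = 6, clip_sigmas: float = 11.5):
    """Clip a weight tensor at +/- clip_sigmas * std, then symmetric uniform-quantize."""
    clip = float(clip_sigmas * w.std())
    w_clipped = w.clamp(-clip, clip)
    qmax = 2 ** (bits - 1) - 1                   # e.g. 31 for int6
    scale = clip / qmax
    q = torch.round(w_clipped / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale                              # dequantize via q.float() * scale

# Example: an int6 pass over a random MLP-shaped weight matrix.
w = torch.randn(1024, 256)
q, scale = sdclip_uniform_quant(w, bits=6, clip_sigmas=11.5)
print((w - q.float() * scale).abs().mean())      # mean quantization error (illustrative)
```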
Compliance (3-seed)

`train + GPTQ`: 584.1 s training + 16 s GPTQ reserve; `total_eval_time`: 576.5–591.0 s across seeds {1334, 999, 42}. Per-seed figures are in the attached training logs (train_seed{1334,999,42}.log).

Dataset
This submission uses the pre-built case-op augmented FineWeb-10B tokenization from `romeerp/parameter-golf-caseops-v1` (pre-built shards), the same dataset that PR #1729 / PR #1736 / PR #1851 use.

The bijective case-op tokenizer (`fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model`, shipped in `tokenizers/`) and the build scripts (`prepare_caseops_data.py` + `lossless_caps.py`) are included for byte-exact rebuild, but using the pre-built shards from `romeerp/parameter-golf-caseops-v1` is the recommended path.
Reproducing
Hyperparameters: defaults already encode this PR's compliance-tuned envelope (this PR + b-series, on top of PR #1938); no other env exports are needed.

Builds On
Acknowledgments
A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16).
With thanks to:
Additional credits (technique stack):