Record: LongCtx No-QV QK5.25 + AsymLogit + LQER g32/top4 + TTT-local 0.80 — 1.05792 BPB 3-seed mean #2060
Open
S0urC10ud wants to merge 2 commits into openai:main from
Conversation
3-seed mean val_bpb 1.05769 (std 0.00041) on 8xH100 80GB SXM. Forks PR openai#2007 (1.0590): an env-only retune of MATRIX_LR=0.028, LQER_RANK=2, LQER_TOP_K=4, LQER_ASYM_GROUP=32, and TTT_LOCAL_LR_MULT=0.80; train_gpt.py is byte-identical to the parent. Improves on the merged SOTA openai#1493 (1.0810) by ~0.023 BPB / ~0.051 nats. A paired test vs openai#2007 yields p=0.002, but the magnitude is only 0.00293 nats (below the 0.005-nat bar), so this is not a record vs openai#2007.
Corrected seed-42 Final TTT BPB: 1.05781454 (was 1.05711454). 3-seed mean: 1.05792053, std: 0.00007528. Directory renamed 1.0577 -> 1.0579 to match corrected mean.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request on May 1, 2026
Recent sweep logs (named):
- S55: token-only n-gram tilt baseline = 1.05814 (legal per PR openai#1514)
- S56: + 3 openai#2060 levers = 1.05790 (-0.00024)
- S57: + AsymLogit only = 1.05759 (-0.00055)
- S58: full stack = 1.05694 single seed (-0.00120, super-additive +0.00041 synergy)
- S59: S58 + EVAL_SEQ_LEN=3072 + NUM_PHASES=1 + WD=1.0 = 1.05657 single seed, eval 567s
- S60 OOM: S59 + EMA_DECAY=0.9 + batch=64 = OOM
- S60 retry: S58 + EMA_DECAY=0.9 + batch=32 = 1.05795 / 832s NON-COMPLIANT
- S61: S59 + TOKEN_BOOST=3.0 = 1.05678 single seed, eval 501s
- S62: S58 + NUM_PHASES=2 + WD=2.0 + eval=2816 = 1.05755

Earlier sweep logs (UUID-named): ~83 files covering S15-S54 sprint history.

Key findings:
- AsymLogit Rescale: 2 trainable scalars (softcap_pos, softcap_neg) give -0.00055 via global TTT polish
- Token-only n-gram tilt confirmed legal per PR openai#1514 (within_tau=99, word_tau=99, agree=0)
- 3 openai#2060 env-var levers (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5) stack super-additively
- EMA_DECAY=0.9 didn't transfer to our base
- NUM_PHASES=2 revert costs more pre-quant than it gains in TTT recovery
- Discovered val_tokens=47852544 vs canonical 47853343; need EVAL_INCLUDE_TAIL=1 for a clean comparison

Added .gitignore for final_model.pt (130 MB, over the GitHub limit), .so binaries, and pid files.
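The super-additivity claim for S58 can be checked with simple arithmetic, using the deltas reported in the sweep log above:

```python
# Deltas vs the S55 baseline (1.05814 BPB), taken from the sweep log.
baseline = 1.05814
levers_delta = -0.00024      # S56: three #2060 env-var levers alone
asym_delta = -0.00055        # S57: AsymLogit rescale alone
full_stack = 1.05694         # S58: both applied together

additive_prediction = baseline + levers_delta + asym_delta
synergy = additive_prediction - full_stack   # positive => super-additive
print(f"additive prediction: {additive_prediction:.5f}")
print(f"measured full stack: {full_stack:.5f}")
print(f"synergy: {synergy:+.5f}")
```

The additive prediction is 1.05735, so the measured 1.05694 implies the +0.00041 synergy quoted in the log.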
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request on May 1, 2026
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB. Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB. Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:
- Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE the eval timer, per PR openai#1514)

Compliance:
- All seeds eval ≤ 533.1 s (cap 600 s, 67-80 s margin)
- All artifacts ≤ 15.95 MB (cap 16 MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
Summary
This PR adds a 10min/16MB record based on a five-knob hyperparameter retune of PR #2007's LongCtx No-QV QK5.25 + AsymLogit configuration.
The submission keeps the parent architecture, optimizer, dataset, tokenizer, TTT/eval pipeline, quantizer, and compressor byte-for-byte unchanged —
train_gpt.py is byte-identical to #2007 (md5 2a7e36e29aa5b5811abb6170059aa8d1). Only five env-var scalars are retuned.

Record folder:
Results
All seeds satisfy the 10-minute / 16 MB rules:
- train_wallclock_s ≤ 600 ✓ (595.8 - 596.1 s)
- eval_time_s ≤ 600 ✓ (395.4 - 397.6 s)

What changed vs parent #2007
Five env-var deltas only:
- MATRIX_LR
- LQER_RANK
- LQER_ASYM_GROUP
- LQER_TOP_K
- TTT_LOCAL_LR_MULT

Comparison vs parent #2007 (paired, same 3 seeds)
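For concreteness, the retuned values reported in this PR's discussion can be exported before launching a run. A minimal sketch; the launch command itself is not shown in the PR, so only the environment prefix is built here:

```python
import os
import shlex

# Retuned values from this PR's discussion; the parent #2007 defaults differ.
overrides = {
    "MATRIX_LR": "0.028",
    "LQER_RANK": "2",
    "LQER_TOP_K": "4",
    "LQER_ASYM_GROUP": "32",
    "TTT_LOCAL_LR_MULT": "0.80",
}
os.environ.update(overrides)

# Equivalent shell prefix to prepend to the training command:
prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in overrides.items())
print(prefix)
```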
Paired one-sided t-test: mean Δ_loss = −0.00234 nats, t = −6.73, p ≈ 0.011.
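The paired statistic can be recomputed directly from per-seed losses. The per-seed values below are placeholders chosen to reproduce the reported mean Δ and t, not the actual logged numbers:

```python
import math
import statistics

# Hypothetical per-seed val losses (nats); substitute the real logged values.
parent = [0.73340, 0.73360, 0.73310]   # PR #2007, seeds 42 / 0 / 1234
child  = [0.73166, 0.73126, 0.73016]   # this PR, same seeds

diffs = [c - p for c, p in zip(child, parent)]   # paired deltas
n = len(diffs)
mean_d = statistics.mean(diffs)
sd = statistics.stdev(diffs)                     # sample std (ddof=1)
t = mean_d / (sd / math.sqrt(n))                 # paired t, df = n - 1
print(f"mean delta = {mean_d:.5f} nats, t = {t:.2f}")
# The one-sided p-value comes from the t CDF with df = n - 1
# (e.g. scipy.stats.t.cdf(t, n - 1)).
```

With only three seeds (df = 2), the t distribution has heavy tails, which is why even a large |t| yields a p-value of only ~0.01.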
Comparison vs currently-merged SOTA #1493 (1.0810)
Δ_BPB ≈ −0.0231, Δ_nats ≈ −0.051. Every individual seed improves by ≥ 0.022 BPB, far above the 0.005-nat record threshold.
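The BPB-to-nats conversion above can be reproduced under one assumption not stated in the PR: that a token covers roughly 3.2 bytes on this validation set (a ratio inferred from the reported pair 0.0231 BPB / 0.051 nats):

```python
import math

delta_bpb = 0.0231        # improvement vs merged SOTA #1493, bits per byte
bytes_per_token = 3.2     # assumed ratio; inferred from 0.051 / (0.0231 * ln 2)

# bits/byte -> nats/byte -> nats/token
delta_nats = delta_bpb * math.log(2) * bytes_per_token
print(f"delta ~= {delta_nats:.3f} nats/token")
```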
Method
The frozen parent recipe (unchanged here):
- TTT_LORA_RANK=80, PHASED_TTT_PREFIX_DOCS=3000
- QK_GAIN_INIT=5.25, WARMDOWN_FRAC=0.85, MIN_LR=0.1
- lrzip -L 9 compression

The five-knob retune was chosen by an MN5 single-node sweep on top of the #2007 parent stack.
Reproduction
Prepare the CaseOps dataset once:
Run a seed from this folder:
The script sets the full environment and runs:
Repeat with SEED=0 and SEED=1234 for the matched 3-seed validation.

Logs
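The three-seed repetition can be scripted. The runner filename below is a hypothetical stand-in, since the PR's actual per-seed command is not shown here:

```python
import os

SEEDS = (42, 0, 1234)  # the matched 3-seed protocol used in this PR

def seed_commands(script="run_seed.sh"):
    """One (env-override, argv) pair per seed; "run_seed.sh" is hypothetical."""
    return [({"SEED": str(s)}, ["bash", script]) for s in SEEDS]

for env_override, argv in seed_commands():
    print(env_override, argv)
    # Real run would be:
    # subprocess.run(argv, env={**os.environ, **env_override}, check=True)
```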
- train_seed42.log — final BPB 1.05781454
- train_seed0.log — final BPB 1.05798212
- train_seed1234.log — final BPB 1.05796494

Hardware / software