Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.0570) #2140
simon-marcus wants to merge 2 commits into openai:main
Conversation
Flagging what looks like a Condition 1 compliance issue. This PR uses the n-gram tilt's within-word and word-start channels.
The merged precedent for this exact n-gram code (PR #1514, merged 2026-04-29) explicitly excludes both channels.
The same target-dependent gating is still present in this PR's code.
The README's compliance section addresses future-token leakage ("does not inspect future tokens") but not the target-token-at-position-t issue that PR #1420's review identified. PR #2018 is cited as additional precedent but is currently OPEN/unmerged; only PR #1514 is binding precedent on this code, and that PR disabled these channels. Could the author and maintainers take a look? Happy to be corrected if I've misread something.
Update after applying the grace policy: I am treating the corrected token-only state as technically acceptable. The correction disables the unintended within-word / word-start / agreement channels, and the corrected score worsened to 1.05701907. It still does not add a leaderboard row in #2146 because the earlier PR #2135 is now included and scores lower (1.05650768), so #2140 is not a chronological frontier.
Thanks @cocohearts and @codemath3000. I agree with the C1 concern for the originally submitted #2140 state. This was my mistake: when porting my #2018 in-timer n-gram path onto the #2014 stack, I accidentally restored the within-word / word-start / agreement channels. The intended posture was the same as in #2018: token-only n-gram tilt, with the target-token-gated channels disabled. I've corrected this by setting WITHIN_BOOST=0.0, WORD_BOOST=0.0, and AGREE_ADD_BOOST=0.0,
and by hard-disabling those code paths when the boosts are zero. The corrected logs now report:
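```
ngram_tilt: hints total=47853343 gated=628156 token_gate=628156 within_gate=0 word_gate=0 agree2plus=0
```

(This is the same counter line quoted in the compliance notes below: within_gate, word_gate, and agree2plus are all zero in the corrected runs.)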
The corrected 3-seed results are summarized in the updated description below (corrected 3-seed mean val_bpb 1.05701907).
I'll push a corrective commit updating the code, README, and logs. On timing/eligibility, I'll defer to maintainers. My main goal here is to make the technical record unambiguous and reproducible.
Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.05702)
Corrected 3-seed mean: val_bpb 1.05701907 | max 15,989,637 bytes | 8xH100 SXM | 600s train + in-timer eval
This commit corrects the originally submitted #2140 state. The initial #2140 logs accidentally restored the within-word, word-start, and agreement n-gram channels. The corrected run uses the intended PR #2018 posture: token-only n-gram tilt, with the target-token-gated channels disabled.
Results
Compared with the last merged leaderboard record (#1855, 1.06107587 BPB), this corrected 3-seed mean improves val_bpb by 0.00405680.
Summary
This corrected submission starts from the PR #2014 strict-compliance stack and adds two changes: the LeakyReLU 0.3 activation change and a token-only in-timer n-gram TTT path.

The n-gram hints are built inside the measured eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0) and applied as a scoring-time posterior adjustment to per-token NLL; a sketch of this adjustment follows below. The within-word, word-start, and agreement channels are disabled (WITHIN_BOOST=0.0, WORD_BOOST=0.0, AGREE_ADD_BOOST=0.0). The n-gram path does not add model parameters and has no artifact-size cost beyond source files. The run keeps the PR #2014 global prefix phase (PHASED_TTT_PREFIX_DOCS=2500) and uses larger TTT chunks to fit hint construction and scoring inside the 600s eval budget.
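As a rough illustration of what a token-only, scoring-time posterior adjustment to per-token NLL can look like, here is a minimal sketch. The function name apply_token_only_tilt, the tilt_weight parameter, and the log-space mixture form are illustrative assumptions, not the actual interface of online_ngram_tilt.py or train_gpt.py.

```python
import math
import torch

def apply_token_only_tilt(nll: torch.Tensor, ngram_logp: torch.Tensor,
                          tilt_weight: float = 0.1) -> torch.Tensor:
    """Token-only sketch of a scoring-time posterior adjustment to per-token NLL.

    nll:        (T,) per-token negative log-likelihood from the model
    ngram_logp: (T,) log-probability of the same target tokens under an online,
                causal n-gram model built only from already-scored context
    Returns the NLL of a (1 - w) * model + w * n-gram mixture at each position.
    No within-word / word-start / agreement gating is applied
    (WITHIN_BOOST=0.0, WORD_BOOST=0.0, AGREE_ADD_BOOST=0.0).
    """
    model_logp = -nll
    # Log-space mixture of the model and the causal n-gram predictor.
    mixed_logp = torch.logaddexp(model_logp + math.log(1.0 - tilt_weight),
                                 ngram_logp + math.log(tilt_weight))
    return -mixed_logp
```

Under this posture the adjustment is a fixed mixture with a causal n-gram predictor and does not branch on properties of the target token, which is the distinction the compliance discussion above turns on.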
What changed vs PR #2014

NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (n-gram hint construction runs inside the measured eval timer rather than being precomputed outside it)

Compliance notes
Per-seed training wall-clock stays under the 600s budget (596.069s, 596.182s, 596.099s), and per-seed in-timer eval wall-clock is 506.254s, 553.458s, 473.217s. All runs use NGRAM_HINT_PRECOMPUTE_OUTSIDE=0, so n-gram hint generation is inside the measured eval timer. Total submission size (quantized+pergroup) is 15,989,637 bytes, under 16 MB. The corrected gating counters are: ngram_tilt: hints total=47853343 gated=628156 token_gate=628156 within_gate=0 word_gate=0 agree2plus=0.

Key settings
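At minimum, the environment settings named elsewhere in this description are the following (a partial sketch; the original key-settings list may contain additional entries):

```
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0
PHASED_TTT_PREFIX_DOCS=2500
WITHIN_BOOST=0.0
WORD_BOOST=0.0
AGREE_ADD_BOOST=0.0
```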
Files
train_gpt.py — full script for the candidate.
online_ngram_tilt.py, online_ngram_state.c — online causal n-gram hint builder and scoring-time tilt helper, from the PR #2018 (Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT, val_bpb 1.047) in-timer n-gram tilt work.
train_eval_seed42_corrected_token_only.log — corrected seed-42 token-only in-timer TTT eval log, using the corrected seed-42 training artifact.
train_eval_seed0_corrected_token_only.log — corrected seed-0 training, quantization, and token-only in-timer TTT eval log.
train_eval_seed314_corrected_token_only.log — corrected seed-314 training, quantization, and token-only in-timer TTT eval log.
train_seed42.log, eval_seed42_ngram_p0_c64.log, eval_seed42_ngram_p2500_c64.log, train_eval_seed314.log, train_eval_seed0.log — superseded initial #2140 logs retained for transparency; these used the accidental within-word / word-start / agreement n-gram channels and are not the corrected token-only result.
prepare_caseops_data.py, lossless_caps.py, tokenizers/...model — CaseOps data/tokenizer helpers from the merged PR #1855 (Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack, val_bpb 1.06108 3-seed mean) lineage.
submission.json — structured 3-seed metadata.

Reproducing
After preparing the CaseOps data and tokenizer, run with the environment above:
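A sketch of the launch, assuming the lineage's usual 8-GPU torchrun entry point for train_gpt.py (the torchrun invocation itself is an assumption; only the environment variables are stated under Key settings above):

```bash
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 PHASED_TTT_PREFIX_DOCS=2500 \
WITHIN_BOOST=0.0 WORD_BOOST=0.0 AGREE_ADD_BOOST=0.0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```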
For the eval-only sweep used here, load the saved quantized artifact and run with:
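A hypothetical shape for that eval-only invocation; the EVAL_ONLY and QUANTIZED_CKPT names are invented placeholders, not flags of the actual script:

```bash
# Placeholder names; the real eval-only switch and artifact-path variable may differ.
EVAL_ONLY=1 QUANTIZED_CKPT=/path/to/quantized_artifact \
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 WITHIN_BOOST=0.0 WORD_BOOST=0.0 AGREE_ADD_BOOST=0.0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```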
Credits
This is a stack on top of the recent strict-compliance CaseOps line, most directly the PR #2014 strict-compliance stack, the PR #2018 in-timer n-gram tilt work, and the merged PR #1855 CaseOps lineage.