Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.0570) #2140
simon-marcus wants to merge 2 commits into openai:main
Conversation
Flagging what looks like a Condition 1 compliance issue. This PR uses the n-gram tilt's within-word and word-start channels.
The merged precedent for this exact n-gram code (PR #1514, merged 2026-04-29) explicitly excludes both channels.
The same target-dependent gating is still present in this PR's code.
The README's compliance section addresses future-token leakage ("does not inspect future tokens") but not the target-token-at-position-t issue that PR #1420's review identified. PR #2018 is cited as additional precedent but is currently OPEN/unmerged; only PR #1514 is binding precedent on this code, and that PR disabled these channels. Could the author and maintainers take a look? Happy to be corrected if I've misread something.
Update after applying the grace policy: I am treating the corrected token-only state as technically acceptable. The correction disables the unintended within-word / word-start / agreement channels, and the corrected score worsened to 1.05701907. It still does not add a leaderboard row in #2146 because the earlier PR #2135 is now included and scores lower (1.05650768), so #2140 is not a chronological frontier.
Thanks @cocohearts and @codemath3000. I agree with the C1 concern for the originally submitted #2140 state. This was my mistake: when porting my #2018 in-timer n-gram path onto the #2014 stack, I accidentally restored the within-word / word-start / agreement channels. The intended posture was the same as in #2018: token-only n-gram tilt, with the target-token-gated channels disabled. I've corrected this by setting WITHIN_BOOST=0.0, WORD_BOOST=0.0, and AGREE_ADD_BOOST=0.0,
and by hard-disabling those code paths when the boosts are zero. The corrected logs now report:
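```
ngram_tilt: hints total=47853343 gated=628156 token_gate=628156 within_gate=0 word_gate=0 agree2plus=0
```

(This is the same counter line quoted in the compliance notes below: within_gate, word_gate, and agree2plus are all zero in the corrected runs.)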
The corrected 3-seed results are summarized in the updated description below (corrected 3-seed mean val_bpb 1.05701907).
I'll push a corrective commit updating the code, README, and logs. On timing/eligibility, I'll defer to maintainers. My main goal here is to make the technical record unambiguous and reproducible.
Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.05702)
Corrected 3-seed mean: val_bpb 1.05701907 | max 15,989,637 bytes | 8xH100 SXM | 600s train + in-timer eval
This commit corrects the originally submitted #2140 state. The initial #2140 logs accidentally restored the within-word, word-start, and agreement n-gram channels. The corrected run uses the intended PR #2018 posture: token-only n-gram tilt, with the target-token-gated channels disabled.
Results
Compared with the last merged leaderboard record (#1855, 1.06107587 BPB), this corrected 3-seed mean improves val_bpb by 0.00405680.
Summary
This corrected submission starts from the PR #2014 strict-compliance stack and adds two changes: the LeakyReLU 0.3 activation change and a token-only in-timer n-gram TTT path.

The n-gram hints are built inside the measured eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0) and applied as a scoring-time posterior adjustment to per-token NLL; a sketch of this adjustment follows below. The within-word, word-start, and agreement channels are disabled (WITHIN_BOOST=0.0, WORD_BOOST=0.0, AGREE_ADD_BOOST=0.0). The n-gram path does not add model parameters and has no artifact-size cost beyond source files. The run keeps the PR #2014 global prefix phase (PHASED_TTT_PREFIX_DOCS=2500) and uses larger TTT chunks to fit hint construction and scoring inside the 600s eval budget.
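As a rough illustration of what a token-only, scoring-time posterior adjustment to per-token NLL can look like, here is a minimal sketch. The function name apply_token_only_tilt, the tilt_weight parameter, and the log-space mixture form are illustrative assumptions, not the actual interface of online_ngram_tilt.py or train_gpt.py.

```python
import math
import torch

def apply_token_only_tilt(nll: torch.Tensor, ngram_logp: torch.Tensor,
                          tilt_weight: float = 0.1) -> torch.Tensor:
    """Token-only sketch of a scoring-time posterior adjustment to per-token NLL.

    nll:        (T,) per-token negative log-likelihood from the model
    ngram_logp: (T,) log-probability of the same target tokens under an online,
                causal n-gram model built only from already-scored context
    Returns the NLL of a (1 - w) * model + w * n-gram mixture at each position.
    No within-word / word-start / agreement gating is applied
    (WITHIN_BOOST=0.0, WORD_BOOST=0.0, AGREE_ADD_BOOST=0.0).
    """
    model_logp = -nll
    # Log-space mixture of the model and the causal n-gram predictor.
    mixed_logp = torch.logaddexp(model_logp + math.log(1.0 - tilt_weight),
                                 ngram_logp + math.log(tilt_weight))
    return -mixed_logp
```

Under this posture the adjustment is a fixed mixture with a causal n-gram predictor and does not branch on properties of the target token, which is the distinction the compliance discussion above turns on.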
What changed vs PR #2014

NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (n-gram hint construction runs inside the measured eval timer rather than being precomputed outside it)

Compliance notes
Per-seed training wall-clock stays under the 600s budget (596.069s, 596.182s, 596.099s), and per-seed in-timer eval wall-clock is 506.254s, 553.458s, 473.217s. All runs use NGRAM_HINT_PRECOMPUTE_OUTSIDE=0, so n-gram hint generation is inside the measured eval timer. Total submission size (quantized+pergroup) is 15,989,637 bytes, under 16 MB. The corrected gating counters are: ngram_tilt: hints total=47853343 gated=628156 token_gate=628156 within_gate=0 word_gate=0 agree2plus=0.

Key settings
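At minimum, the environment settings named elsewhere in this description are the following (a partial sketch; the original key-settings list may contain additional entries):

```
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0
PHASED_TTT_PREFIX_DOCS=2500
WITHIN_BOOST=0.0
WORD_BOOST=0.0
AGREE_ADD_BOOST=0.0
```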
Files
train_gpt.py — full script for the candidate.
online_ngram_tilt.py, online_ngram_state.c — online causal n-gram hint builder and scoring-time tilt helper, from the PR #2018 (Record: Gated XSA + LQER top-1 + strict token-only n-gram TTT, val_bpb 1.047) in-timer n-gram tilt work.
train_eval_seed42_corrected_token_only.log — corrected seed-42 token-only in-timer TTT eval log, using the corrected seed-42 training artifact.
train_eval_seed0_corrected_token_only.log — corrected seed-0 training, quantization, and token-only in-timer TTT eval log.
train_eval_seed314_corrected_token_only.log — corrected seed-314 training, quantization, and token-only in-timer TTT eval log.
train_seed42.log, eval_seed42_ngram_p0_c64.log, eval_seed42_ngram_p2500_c64.log, train_eval_seed314.log, train_eval_seed0.log — superseded initial #2140 logs retained for transparency; these used the accidental within-word / word-start / agreement n-gram channels and are not the corrected token-only result.
prepare_caseops_data.py, lossless_caps.py, tokenizers/...model — CaseOps data/tokenizer helpers from the merged PR #1855 (Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack, val_bpb 1.06108 3-seed mean) lineage.
submission.json — structured 3-seed metadata.

Reproducing
After preparing the CaseOps data and tokenizer, run with the environment above:
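A sketch of the launch, assuming the lineage's usual 8-GPU torchrun entry point for train_gpt.py (the torchrun invocation itself is an assumption; only the environment variables are stated under Key settings above):

```bash
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 PHASED_TTT_PREFIX_DOCS=2500 \
WITHIN_BOOST=0.0 WORD_BOOST=0.0 AGREE_ADD_BOOST=0.0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```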
For the eval-only sweep used here, load the saved quantized artifact and run with:
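A hypothetical shape for that eval-only invocation; the EVAL_ONLY and QUANTIZED_CKPT names are invented placeholders, not flags of the actual script:

```bash
# Placeholder names; the real eval-only switch and artifact-path variable may differ.
EVAL_ONLY=1 QUANTIZED_CKPT=/path/to/quantized_artifact \
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 WITHIN_BOOST=0.0 WORD_BOOST=0.0 AGREE_ADD_BOOST=0.0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```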
Credits
This is a stack on top of the recent strict-compliance CaseOps line, most directly the PR #2014 strict-compliance stack, the PR #2018 in-timer n-gram tilt work, and the merged PR #1855 CaseOps lineage.