S0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713) #1938

Open

lijuncheng16 wants to merge 15 commits into openai:main from lijuncheng16:0428
Conversation

@lijuncheng16

S0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713)

A joint effort by Billy Li and Tim Shen, with thanks to Xingyuan Ding for additional experiments and Bill (Yiyuan) Li for meaningful discussions on tokenizers.

I started looking at this challenge around 4/20. The merged leaderboard hadn't changed much by then, but the volume of PRs and improvements was absolutely overwhelming. I cleaned up my thoughts and followed a systematic procedure — tackling the problem piece by piece: data → tokenization → model architecture → optimizer → quantization → test-time compute.

A more detailed write-up is at: https://www.junchengbillyli.com/llm-notes.html


Results

Best result: quant+TTT val_bpb = 1.0713, artifact ≈ 16.09 MB (seed 1337, 8xH100 SXM, 10min wallclock).

| Seed | Steps | EMA BPB | Quant BPB | Quant+TTT BPB | Artifact |
| ---- | ----- | ------- | --------- | ------------- | -------- |
| 1337 | 4733  | 1.0746  | 1.0832    | 1.0713        | 16.09 MB |
| 42   | 4741  | 1.0752  | 1.0834    | 1.0718        | 16.09 MB |
| 999  | 4775  | 1.0740  | 1.0845    | (running)     | 16.09 MB |

Script: final_s0_pr1851_mod_gptq_v2.py (3143 lines, 31 KB compressed).


Data & Tokenization

There was a clear trend across PRs that smaller vocabs have a low ceiling — 8192 seems to be the sweet spot for all the later successful submissions. But relying on the default SentencePiece tokenizer is not the best idea.

What we tried:

  • Vocabulary pruning: Thought tokenizing full words could be wasteful given the time/compute limits. Tried pruning long words that could be covered by combinations of shorter subword tokens. This did not help (+0.001 BPB).
  • Case folding (lowercasing + capital token): Lowercasing everything and treating the leading capital letter as a special token — this helped. This is the "cap tokenizer" (SP8192, effective vocab 7972 after folding); a minimal sketch of the idea follows this list.
  • Data normalization: Getting rid of long URLs and anything rare/difficult in the FineWeb dataset.
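A minimal sketch of the case-folding idea, assuming a single reserved marker token emitted before words whose first letter was uppercase; the marker character and function names here are illustrative, not the PR's actual implementation:

```python
import re

CAP = "\u0001"  # placeholder for the reserved "capital" marker token


def fold_case(text: str) -> str:
    """Lowercase leading-capital words, prefixing them with the CAP marker."""
    def fold_word(m: re.Match) -> str:
        w = m.group(0)
        if w[0].isupper() and w[1:].islower():
            return CAP + w.lower()
        return w  # all-caps or mixed-case words are left untouched
    return re.sub(r"[A-Za-z]+", fold_word, text)


def unfold_case(text: str) -> str:
    """Invert fold_case: re-capitalize the letter following each CAP marker."""
    return re.sub(re.escape(CAP) + r"([a-z])", lambda m: m.group(1).upper(), text)


assert unfold_case(fold_case("The Model trains Fast")) == "The Model trains Fast"
```

The SentencePiece model is then trained on the folded text, so "The" and "the" share a single subword and capitalization costs one marker token.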

Key insight: the fact that 1024-token vocabs plateau quickly suggests the network stalls when per-token prediction is too easy. The tokenization needs to keep the task hard enough for the model to keep learning.


Model Architecture

When I first saw the 9-layer implementation, I thought it was pretty standard. Depth recurrence had already proven effective within the community. From there:

  • GQA → MHA: Considered reverting GQA (group size 2) back to MHA, trading a few more parameters for better performance.
  • Local attention heads: Implemented fancy local attention — failed horrendously, since the implementation is inherently inefficient and could never utilize the Flash Attention 3 ecosystem.
  • DeepSeek Engrams, value embeddings, embedding factorizations: None worked within the 10-minute wall clock. None of these are as fast as a vanilla attention + MLP combo.
  • Wider MLPs: The only change that reliably helped was making the MLPs wider; none of the other architectural tweaks showed a worthwhile return on investment.

Final architecture: 11L x 512d, 8 heads / 4 KV heads, MLP 4x, tied embeddings (vocab 7972), logit softcap 30.0, partial RoPE (16/64 dims), layer looping (layers 3–5, 2 loops enabled at 35% of training), parallel residuals from layer 8+, skip gates (U-Net connections).
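For reference, the hyperparameters above collected into an illustrative dataclass (field names are mine, not the script's):

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Illustrative summary of the final architecture described above.
    n_layer: int = 11
    n_embd: int = 512
    n_head: int = 8
    n_kv_head: int = 4                  # GQA, group size 2
    mlp_ratio: int = 4                  # MLP hidden dim = 4 * n_embd
    vocab_size: int = 7972              # SP8192 after case folding
    tied_embeddings: bool = True
    logit_softcap: float = 30.0
    rope_dims: int = 16                 # partial RoPE: 16 of 64 head dims rotated
    loop_layers: tuple = (3, 5)         # layers 3-5 looped
    loop_count: int = 2                 # 2 loops, enabled at 35% of training
    parallel_residual_from: int = 8     # parallel residuals from layer 8 onward
    skip_gates: bool = True             # U-Net style skip connections
```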


Optimizer

We first ablated Muon vs. AdamW. I thought AdamW wouldn't lag Muon much on a relatively small dataset — this turned out not to be true: AdamW consistently lagged Muon in our experiments.

We then looked into Muon itself to see what could be improved. The all_reduce communication overhead was the main thing I aimed to reduce, but even with the 0427 trick I was only able to squeeze out a ~0.0005 BPB gain.

Final config: Muon (Polar-Express Newton-Schulz, 5 backend steps) for matrix params (lr=0.026, momentum=0.97, wd=0.095), AdamW for embeddings (lr=0.6, wd=0.085) and scalars (lr=0.02). Gradient clipping 0.3, warmdown 75%.
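A rough sketch of how that split could look in code, assuming a Muon implementation with Newton-Schulz orthogonalization is available; the parameter-selection heuristic and the Muon constructor arguments are assumptions, not the script's exact code:

```python
import torch


def build_optimizers(model: torch.nn.Module):
    # Split parameters: weight matrices go to Muon, embeddings and scalars to AdamW.
    matrix_params = [p for n, p in model.named_parameters()
                     if p.ndim >= 2 and "embed" not in n]
    embed_params = [p for n, p in model.named_parameters() if "embed" in n]
    scalar_params = [p for p in model.parameters() if p.ndim < 2]

    muon = Muon(matrix_params, lr=0.026, momentum=0.97, weight_decay=0.095,
                ns_steps=5)  # Polar-Express Newton-Schulz, 5 backend steps (assumed signature)
    adamw = torch.optim.AdamW([
        {"params": embed_params, "lr": 0.6, "weight_decay": 0.085},
        {"params": scalar_params, "lr": 0.02, "weight_decay": 0.0},
    ])
    return muon, adamw


# Per step: clip gradient norm to 0.3 before stepping (torch.nn.utils.clip_grad_norm_),
# and decay the LR over the final 75% of training ("warmdown"), e.g. via a LambdaLR schedule.
```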


Quantization

Quantization was a bit of a black box, though we've done it before. My intuition was that group quantization should produce a more stable estimate of the parameters and be better suited for GPTQ. However, the per-group statistics also take additional space, which pushes the submission file over the size limit — the gain does not justify the cost.

Intuitively, QAT should work even better, but I never got a successful QAT run.

Final config:

  • GPTQ int6 for all attention + MLP weight matrices (16 calibration batches)
  • GPTQ int8 for tied embeddings
  • LQER error correction: rank 4, int4 factors, asymmetric (group 64), applied to the top-3 highest-error layers (see the sketch after this list)
  • Brotli compression
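A minimal sketch of the LQER idea: approximate each layer's quantization error with a low-rank factorization, so the dequantized weight becomes Q(W) + U V. Function names are illustrative, and the int4 quantization of the factors (asymmetric, group 64) is omitted:

```python
import torch


def lqer_factors(w: torch.Tensor, w_quant: torch.Tensor, rank: int = 4):
    """Return rank-`rank` factors (U, V) approximating the error w - w_quant."""
    err = (w - w_quant).float()
    U, S, Vh = torch.linalg.svd(err, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    V_r = Vh[:rank, :]
    return U_r, V_r                # at load time, use w_quant + U_r @ V_r


# Applied only to the layers with the largest quantization error, since every
# pair of factors adds bytes to the submission artifact.
```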

Test-Time Compute

This is absolutely the backdoor lottery ticket. The main theme is to align the trained distribution with the test-time distribution.

Final TTT config:

  • Phased global TTT: 1 phase, 2000 prefix docs, cosine LR (peak 0.001)
  • 215 gradient chunks over 48K suffix docs (32K tokens/chunk)
  • LoRA rank 96 on K, O, and MLP projections (Adam, beta1=0, beta2=0.999, wd=1.0); a sketch follows this list
  • TTT consistently drops BPB by ~0.01 from the quantized baseline
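A rough sketch of the LoRA-style test-time training step, assuming a simple wrapper around the frozen projections; the wrapper class, the model call, and the data handling are illustrative, not the script's actual implementation:

```python
import torch


class LoRALinear(torch.nn.Module):
    """Frozen base linear plus a trainable rank-r update: y = W x + B (A x)."""

    def __init__(self, base: torch.nn.Linear, rank: int = 96):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T


def ttt_step(model, tokens, targets, opt):
    """One gradient chunk over suffix docs; only LoRA params receive updates."""
    loss = model(tokens, targets)  # assumed model API returning the CE loss
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)


# Optimizer over the LoRA params only, matching the reported config:
# torch.optim.Adam(lora_params, lr=1e-3, betas=(0.0, 0.999), weight_decay=1.0)
```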

Reproduction

pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

SEED=1337 torchrun --standalone --nproc_per_node=8 final_s0_pr1851_mod_gptq_v2.py

Files

| File | Description |
| ---- | ----------- |
| final_s0_pr1851_mod_gptq_v2.py | Final training script |
| logs/*20260429*.log | Training logs for all 04/29 runs |
| train_gpt_s0_pr1851_mod.py | Earlier annotated PR #1851 exploration |
| train_gpt_s9.py | Prior S9 stack (bank-mode + Polar-Express Muon) |
| train_gpt_s9_caseops_lqer.py | Prior cap tokenizer variant |

Additional per-file notes:

  • Lines 306 and 603 used double-quoted strings inside an f-string, which the parser rejects before PEP 701 (Python 3.12).
  • Wraps the FA3 import in try/except. The fallback transposes between FA's (B,T,H,D) layout and SDPA's (B,H,T,D) and expands K/V for GQA so older torch versions without native GQA still work. Slower than FA3 — only for unblocking dev when FA3 isn't built; a minimal sketch follows.
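A minimal sketch of that fallback pattern, under the assumption that FA3 exposes flash_attn_func from flash_attn_interface (the exact module and return signature vary between builds); this is not the patch itself:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3; module name may differ by build
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False


def attention(q, k, v, causal=True):
    """q, k, v in FA layout (B, T, H, D); k/v may carry fewer heads (GQA)."""
    if HAS_FA3:
        out = flash_attn_func(q, k, v, causal=causal)
        return out[0] if isinstance(out, tuple) else out  # some builds also return the LSE
    # SDPA expects (B, H, T, D); expand K/V heads so GQA works on older torch.
    q_, k_, v_ = (t.transpose(1, 2) for t in (q, k, v))
    rep = q_.shape[1] // k_.shape[1]
    k_ = k_.repeat_interleave(rep, dim=1)
    v_ = v_.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q_, k_, v_, is_causal=causal)
    return out.transpose(1, 2)  # back to (B, T, H, D)
```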
  • Spins up 3 tmux sessions, each running train_gpt_0427.py on its own GPU with a different seed and a unique RUN_ID. Defaults: GPUs 0,1,2, seeds 1337-1339, MAX_WALLCLOCK_SECONDS=4800 (8x the 600s 8xH100 budget, to roughly step-match on 1xH100). Includes pre-flight checks for venv, dataset shards, and tokenizer; uses python -u + PYTHONUNBUFFERED=1 so log output flushes through tee in real time. Configurable via env: VENV, REPO, SCRIPT, SEEDS_OVERRIDE, GPUS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV.
  • S9 stack variant alongside 0427: bank-mode weight storage (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank), Polar-Express Newton-Schulz coefficients for Muon, fused Triton softcapped CE, phased LoRA TTT, and global SGD post-quant repair. Has its own 3-tier flash-attn fallback (FA3 -> FA2 -> SDPA), so no hand-patch is needed.
  • Sibling of run_3seeds.sh; defaults to train_gpt_s9.py and uses session prefix "s9_" + run-id prefix "s9" so it can run alongside the 0427 launcher without colliding (different tmux session names, different log filenames). Same configurable env vars (GPUS_OVERRIDE, SEEDS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV, etc.).
  • Reproduces the 2026-04-09 record (SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT, val_bpb=1.0810 3-seed mean). Points at the LZMA-compressed code wrapper inside the record folder, defaults to seeds 42/314/999 (matching the record), and sets the record's documented env overrides (QK_GAIN_INIT=5.25, TTT_ENABLED=1, TTT_LR=0.005, TTT_EPOCHS=3). Session prefix r0409_ so it can run alongside the 0427 and S9 launchers.
  • Three scripts for preparing the lossless-caps caseops dataset, used by the train_gpt_s9_caseops_lqer.py training variant:
    - lossless_caps.py — case encoding/decoding logic
    - prepare_caseops_data.py — dataset preparation pipeline
    - retokenize_corpus.py — re-tokenization helper
  • S9 stack extended with caseops dataset support and LQER (Low-rank Quantization Error Rescue): 4487 lines vs train_gpt_s9.py's 4363. This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4 in stage 1, Z0/P*/Q*/R* in stage 2).
  • 5252-line training script reproducing PR openai#1851's stack with extensive inline annotations (CN comments). Mandatory FA3 import (no SDPA fallback) and direct Triton kernel use. Sibling of the train_gpt_s9*.py variants.
  • Single-seed (42) result on 8xH100 SXM. Key additions over the S9 base: SmearGate, sparse attention gating, LQER rank-4 asymmetric quantization, and embed int7. Artifact size 15.92 MB. Phased TTT eval in ~554s. Best results (10min 8GPU, TTT val_bpb): cap_lrelu03 seed1337 1.0713; final_cap seed42 1.0718; final_cap seed999 1.0735.