S0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713) #1938
Open
lijuncheng16 wants to merge 15 commits into openai:main from
Conversation
Lines 306 and 603 used double-quoted strings inside an f-string, which the parser rejects on Python versions before PEP 701 (Python 3.12).
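For context, a minimal illustration of the failure mode and the fix (the variable names are hypothetical, not the actual lines 306/603):

```python
cfg = {"vocab_size": 8192}
# msg = f"vocab={cfg["vocab_size"]}"   # SyntaxError before Python 3.12 (PEP 701)
msg = f"vocab={cfg['vocab_size']}"     # switch the inner quotes instead
print(msg)
```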
Wrap the FA3 import in try/except. The fallback transposes between FA's (B,T,H,D) layout and SDPA's (B,H,T,D) and expands K/V for GQA so older torch versions without native GQA still work. Slower than FA3 — only for unblocking dev when FA3 isn't built.
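A minimal sketch of that fallback path, assuming q/k/v arrive in FlashAttention's (B, T, H, D) layout; the module and function names are assumptions, not necessarily the patch's exact code:

```python
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3 build, if present
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def sdpa_attention(q, k, v):
    """Used only when HAS_FA3 is False; slower, for unblocking dev."""
    # SDPA expects (B, H, T, D), so transpose in and out of FA's layout.
    q_, k_, v_ = (t.transpose(1, 2) for t in (q, k, v))
    # Expand K/V heads for GQA on torch builds without native GQA support.
    if k_.size(1) != q_.size(1):
        rep = q_.size(1) // k_.size(1)
        k_ = k_.repeat_interleave(rep, dim=1)
        v_ = v_.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q_, k_, v_, is_causal=True)
    return out.transpose(1, 2)
```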
Spins up 3 tmux sessions, each running train_gpt_0427.py on its own GPU with a different seed and a unique RUN_ID. Defaults: GPUs 0,1,2, seeds 1337-1339, MAX_WALLCLOCK_SECONDS=4800 (8x the 600s 8xH100 budget, to roughly step-match on 1xH100). Includes pre-flight checks for venv, dataset shards, tokenizer; uses python -u + PYTHONUNBUFFERED=1 so log output flushes through tee in real time. Configurable via env: VENV, REPO, SCRIPT, SEEDS_OVERRIDE, GPUS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV.
Stack S9 variant alongside 0427: bank-mode weight storage (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank), Polar-Express Newton-Schulz coefficients for Muon, fused Triton softcapped CE, Phased LoRA TTT, global SGD post-quant repair. Has its own 3-tier flash-attn fallback (FA3 -> FA2 -> SDPA) so no hand-patch is needed.
Sibling of run_3seeds.sh, defaults to train_gpt_s9.py and uses session prefix "s9_" + run-id prefix "s9" so it can run alongside the 0427 launcher without colliding (different tmux session names, different log filenames). Same configurable env vars (GPUS_OVERRIDE, SEEDS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV, etc.).
Reproduces the 2026-04-09 record (SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT, val_bpb=1.0810 3-seed mean). Points at the LZMA-compressed code wrapper inside the record folder, defaults to seeds 42/314/999 (matching the record), and sets the record's documented env overrides (QK_GAIN_INIT=5.25, TTT_ENABLED=1, TTT_LR=0.005, TTT_EPOCHS=3). Session prefix r0409_ so it can run alongside the 0427 and S9 launchers.
Three scripts for preparing the lossless-caps caseops dataset:
- lossless_caps.py — case encoding/decoding logic
- prepare_caseops_data.py — dataset preparation pipeline
- retokenize_corpus.py — re-tokenization helper

Used by the train_gpt_s9_caseops_lqer.py training variant.
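As context for the lossless-caps idea, a hedged sketch of one way such a case encoding can work; the sentinel characters and word-level rules below are assumptions, not lossless_caps.py's actual scheme:

```python
# Lowercase the text and mark capitalization with sentinels so decoding is exact.
CAP = "\u2191"            # assumed sentinel: leading-uppercase word
ALLCAP = "\u2191\u2191"   # assumed sentinel: all-uppercase word

def encode_caps(text: str) -> str:
    out = []
    for w in text.split(" "):
        if w.isupper() and len(w) > 1:
            out.append(ALLCAP + w.lower())
        elif w[:1].isupper():
            out.append(CAP + w.lower())
        else:
            out.append(w)  # mixed-case words would need extra rules to stay lossless
    return " ".join(out)

def decode_caps(text: str) -> str:
    out = []
    for w in text.split(" "):
        if w.startswith(ALLCAP):
            out.append(w[len(ALLCAP):].upper())
        elif w.startswith(CAP):
            rest = w[len(CAP):]
            out.append(rest[:1].upper() + rest[1:])
        else:
            out.append(w)
    return " ".join(out)

assert decode_caps(encode_caps("The NASA report")) == "The NASA report"
```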
S9 stack extended with caseops dataset support and LQER (Low-rank Quantization Error Rescue). 4487 lines vs train_gpt_s9.py's 4363. This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4 in stage 1, Z0/P*/Q*/R* in stage 2).
5252-line training script reproducing PR openai#1851's stack with extensive inline annotations (CN comments). Mandatory FA3 import (no SDPA fallback) and direct Triton kernel use. Sibling to train_gpt_s9*.py variants.
Single-seed (42) result on 8xH100 SXM. Key additions over S9 base: SmearGate, sparse attention gating, LQER rank-4 asymmetric quantization, and embed int7. Artifact size 15.92 MB. Phased TTT eval in ~554s.
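For orientation, a hedged sketch of the LQER idea (storing a low-rank reconstruction of the quantization error alongside the quantized weights); the symmetric quantizer below is a simplification of the rank-4 asymmetric variant used here:

```python
import torch

def lqer_compress(W: torch.Tensor, n_bits: int = 4, rank: int = 4):
    # Symmetric uniform quantization for brevity (the PR's variant is asymmetric).
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().max() / qmax
    Wq = (W / scale).round().clamp(-qmax - 1, qmax)
    # Low-rank reconstruction of the quantization error via truncated SVD.
    E = W - Wq * scale
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # (out, rank), kept in higher precision
    B = Vh[:rank, :]               # (rank, in)
    return Wq.to(torch.int8), scale, A, B

def lqer_linear(x, Wq, scale, A, B):
    # y = x @ (W_quant + A B)^T: dequantized matmul plus a cheap low-rank repair.
    return x @ (Wq.to(x.dtype) * scale).T + (x @ B.T) @ A.T
```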
Best results (10min 8GPU, TTT val_bpb):
- cap_lrelu03 seed1337: 1.0713
- final_cap seed42: 1.0718
- final_cap seed999: 1.0735
S0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713)
A joint effort by Billy Li and Tim Shen, with thanks to Xingyuan Ding for additional experiments and Bill (Yiyuan) Li for meaningful discussions on tokenizers.
I started looking at this challenge around 4/20. The merged leaderboard hadn't changed much by then, but the volume of PRs and improvements was absolutely overwhelming. I cleaned up my thoughts and followed a systematic procedure — tackling the problem piece by piece: data → tokenization → model architecture → optimizer → quantization → test-time compute.
A more detailed write-up is at: https://www.junchengbillyli.com/llm-notes.html
Results
Best result: quant+TTT val_bpb = 1.0713, artifact ≈ 16.09 MB (seed 1337, 8xH100 SXM, 10min wallclock).
Script: final_s0_pr1851_mod_gptq_v2.py (3143 lines, 31 KB compressed).

Data & Tokenization
There was a clear trend across PRs that smaller vocabs have a low ceiling — 8192 seems to be the sweet spot for all the later successful submissions. But relying on the default SentencePiece tokenizer is not the best idea.
What we tried:
Key insight: The fact that 1024-token vocabs plateau quickly tells us the network tends to stall if tokenization is too easy. The tokenization needs to make the task hard enough for the model to keep learning.
Model Architecture
When I first saw the 9-layer implementation, I thought it was pretty standard. Depth recurrence was clearly proven effective within the community. From there:
Final architecture: 11L x 512d, 8 heads / 4 KV heads, MLP 4x, tied embeddings (vocab 7972), logit softcap 30.0, partial RoPE (16/64 dims), layer looping (layers 3–5, 2 loops enabled at 35% of training), parallel residuals from layer 8+, skip gates (U-Net connections).
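Restated as a config sketch for quick reference; the field names are illustrative, not the script's actual variables:

```python
from dataclasses import dataclass

@dataclass
class FinalArchConfig:
    # Values taken from the write-up above.
    n_layer: int = 11
    n_embd: int = 512
    n_head: int = 8
    n_kv_head: int = 4               # GQA: 4 KV heads
    head_dim: int = 64
    mlp_ratio: int = 4
    vocab_size: int = 7972
    tie_embeddings: bool = True
    logit_softcap: float = 30.0
    rope_dims: int = 16              # partial RoPE: 16 of 64 head dims
    loop_layers: tuple = (3, 4, 5)   # layer looping over layers 3-5
    loop_count: int = 2              # 2 loops, enabled at 35% of training
    loop_enable_frac: float = 0.35
    parallel_residual_from: int = 8  # parallel residuals from layer 8+
    skip_gates: bool = True          # U-Net style skip connections
```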
Optimizer
We first ablated Muon vs. AdamW. I expected AdamW not to lag Muon by much on a relatively small dataset, but that turned out to be wrong: AdamW consistently lagged Muon in our experiments.
We then looked into Muon itself to see what could be improved. I aimed to reduce the all_reduce communication overhead, but even with the 0427 trick I was only able to squeeze out a ~0.0005 BPB gain.
Final config: Muon (Polar-Express Newton-Schulz, 5 backend steps) for matrix params (lr=0.026, momentum=0.97, wd=0.095), AdamW for embeddings (lr=0.6, wd=0.085) and scalars (lr=0.02). Gradient clipping 0.3, warmdown 75%.
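A hedged sketch of the parameter split this implies; the Muon import path and constructor signature are assumptions about the repo's own class, and the grouping heuristic is illustrative:

```python
import torch
# Muon is defined inside the training script itself; this import path is assumed.
from train_gpt_s0_pr1851_mod import Muon

def build_optimizers(model):
    embed, scalars, matrices = [], [], []
    for name, p in model.named_parameters():
        if "embed" in name or "lm_head" in name:
            embed.append(p)       # tied embeddings -> AdamW
        elif p.ndim < 2:
            scalars.append(p)     # gains, gates, biases -> AdamW
        else:
            matrices.append(p)    # 2D weight matrices -> Muon
    muon = Muon(matrices, lr=0.026, momentum=0.97, weight_decay=0.095)
    adamw = torch.optim.AdamW([
        {"params": embed,   "lr": 0.6,  "weight_decay": 0.085},
        {"params": scalars, "lr": 0.02, "weight_decay": 0.0},  # wd unspecified in the write-up
    ])
    # Gradient clipping (0.3) and the 75% warmdown schedule are applied in the training loop.
    return muon, adamw
```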
Quantization
Quantization was a bit of a black box, even though we had done it before. My intuition was that group quantization should produce a more stable estimate of the parameters and be better suited to GPTQ. However, GPTQ's group statistics also take extra space, which pushes the submission file over the size limit, so the gain does not justify the cost.
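As a rough illustration of that overhead (all numbers below are assumptions, not the PR's settings):

```python
# Per-group statistics add a measurable share on top of the quantized payload.
n_params   = 6_000_000      # hypothetical matrix-parameter count
w_bits     = 4              # quantized weight width
group_size = 128            # weights sharing one scale/zero-point
stat_bits  = 16 + 16        # fp16 scale + fp16 zero-point per group

payload_mb  = n_params * w_bits / 8 / 1e6
overhead_mb = (n_params / group_size) * stat_bits / 8 / 1e6
print(f"weights: {payload_mb:.2f} MB, group stats: {overhead_mb:.2f} MB "
      f"(+{100 * overhead_mb / payload_mb:.1f}%)")
# -> weights: 3.00 MB, group stats: 0.19 MB (+6.2%)
```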
Intuitively, QAT should have worked better, but I never got a successful QAT run.
Final config:
Test-Time Compute
This is absolutely the backdoor lottery ticket. The main theme is to align the trained distribution with the test-time distribution.
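A hedged sketch of the basic loop; the forward signature, data source, and hyperparameters are assumptions, and the actual script layers phased LoRA and a global post-quant SGD repair on top of this:

```python
import torch

def test_time_train(model, eval_tokens, lr=5e-3, epochs=3, block=1024):
    # Briefly fine-tune on the evaluation text itself before scoring it,
    # pulling the trained distribution toward the test-time distribution.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for i in range(0, eval_tokens.numel() - block - 1, block):
            x = eval_tokens[i : i + block].unsqueeze(0)          # (1, T)
            y = eval_tokens[i + 1 : i + block + 1].unsqueeze(0)  # shifted targets
            loss = model(x, targets=y)  # assumes a forward() that returns the LM loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
```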
Final TTT config:
Reproduction
Files
- final_s0_pr1851_mod_gptq_v2.py
- logs/*20260429*.log
- train_gpt_s0_pr1851_mod.py
- train_gpt_s9.py
- train_gpt_s9_caseops_lqer.py