S0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713) #1938

Open

lijuncheng16 wants to merge 15 commits into openai:main from lijuncheng16:0428
Conversation

@lijuncheng16

S0/PR1851 + Cap Tokenizer + LQER + Global TTT (val_bpb = 1.0713)

A joint effort by Billy Li and Tim Shen, with thanks to Xingyuan Ding for additional experiments and Bill (Yiyuan) Li for meaningful discussions on tokenizers.

I started looking at this challenge around 4/20. The merged leaderboard hadn't changed much by then, but the volume of PRs and improvements was absolutely overwhelming. I cleaned up my thoughts and followed a systematic procedure — tackling the problem piece by piece: data → tokenization → model architecture → optimizer → quantization → test-time compute.

A more detailed write-up is at: https://www.junchengbillyli.com/llm-notes.html


Results

Best result: quant+TTT val_bpb = 1.0713, artifact ≈ 16.09 MB (seed 1337, 8xH100 SXM, 10min wallclock).

| Seed | Steps | EMA BPB | Quant BPB | Quant+TTT BPB | Artifact |
| ---- | ----- | ------- | --------- | ------------- | -------- |
| 1337 | 4733  | 1.0746  | 1.0832    | 1.0713        | 16.09 MB |
| 42   | 4741  | 1.0752  | 1.0834    | 1.0718        | 16.09 MB |
| 999  | 4775  | 1.0740  | 1.0845    | (running)     | 16.09 MB |

Script: final_s0_pr1851_mod_gptq_v2.py (3143 lines, 31 KB compressed).


Data & Tokenization

There was a clear trend across PRs that smaller vocabs have a low ceiling — 8192 seems to be the sweet spot for all the later successful submissions. But relying on the default SentencePiece tokenizer is not the best idea.

What we tried:

  • Vocabulary pruning: Thought tokenizing full words could be wasteful given the time/compute limits. Tried pruning long words that could be covered by combinations of shorter subword tokens. This did not help (+0.001 BPB).
  • Case folding (lowercasing + capital token): Lowercasing everything and treating the leading capital letter as a special token — this helped. This is the "cap tokenizer" (SP8192, effective vocab 7972 after folding); a minimal sketch of the idea follows this list.
  • Data normalization: Getting rid of long URLs and anything rare/difficult in the FineWeb dataset.
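A minimal sketch of the case-folding idea, assuming a single reserved marker token emitted before words whose first letter was uppercase; the marker character and function names here are illustrative, not the PR's actual implementation:

```python
import re

CAP = "\u0001"  # placeholder for the reserved "capital" marker token


def fold_case(text: str) -> str:
    """Lowercase leading-capital words, prefixing them with the CAP marker."""
    def fold_word(m: re.Match) -> str:
        w = m.group(0)
        if w[0].isupper() and w[1:].islower():
            return CAP + w.lower()
        return w  # all-caps or mixed-case words are left untouched
    return re.sub(r"[A-Za-z]+", fold_word, text)


def unfold_case(text: str) -> str:
    """Invert fold_case: re-capitalize the letter following each CAP marker."""
    return re.sub(re.escape(CAP) + r"([a-z])", lambda m: m.group(1).upper(), text)


assert unfold_case(fold_case("The Model trains Fast")) == "The Model trains Fast"
```

The SentencePiece model is then trained on the folded text, so "The" and "the" share a single subword and capitalization costs one marker token.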

Key insight: the fact that 1024-token vocabs plateau quickly suggests the network stalls when per-token prediction is too easy. The tokenization needs to keep the task hard enough for the model to keep learning.


Model Architecture

When I first saw the 9-layer implementation, I thought it was pretty standard. Depth recurrence had already proven effective within the community. From there:

  • GQA → MHA: Considered reverting GQA (group size 2) back to MHA, trading a few more parameters for better performance.
  • Local attention heads: Implemented fancy local attention — failed horrendously, since the implementation is inherently inefficient and could never utilize the Flash Attention 3 ecosystem.
  • DeepSeek Engrams, value embeddings, embedding factorizations: None worked within the 10-minute wall clock. None of these are as fast as a vanilla attention + MLP combo.
  • Wider MLPs: The only change that reliably helped was making the MLPs wider; none of the other architectural tweaks showed a worthwhile return on investment.

Final architecture: 11L x 512d, 8 heads / 4 KV heads, MLP 4x, tied embeddings (vocab 7972), logit softcap 30.0, partial RoPE (16/64 dims), layer looping (layers 3–5, 2 loops enabled at 35% of training), parallel residuals from layer 8+, skip gates (U-Net connections).
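For reference, the hyperparameters above collected into an illustrative dataclass (field names are mine, not the script's):

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Illustrative summary of the final architecture described above.
    n_layer: int = 11
    n_embd: int = 512
    n_head: int = 8
    n_kv_head: int = 4                  # GQA, group size 2
    mlp_ratio: int = 4                  # MLP hidden dim = 4 * n_embd
    vocab_size: int = 7972              # SP8192 after case folding
    tied_embeddings: bool = True
    logit_softcap: float = 30.0
    rope_dims: int = 16                 # partial RoPE: 16 of 64 head dims rotated
    loop_layers: tuple = (3, 5)         # layers 3-5 looped
    loop_count: int = 2                 # 2 loops, enabled at 35% of training
    parallel_residual_from: int = 8     # parallel residuals from layer 8 onward
    skip_gates: bool = True             # U-Net style skip connections
```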


Optimizer

We first ablated Muon vs. AdamW. I thought AdamW wouldn't lag Muon much on a relatively small dataset — this turned out not to be true: AdamW consistently lagged Muon in our experiments.

We then looked into Muon itself to see what could be improved. The all_reduce communication overhead was the main thing I aimed to reduce, but even with the 0427 trick I was only able to squeeze out a ~0.0005 BPB gain.

Final config: Muon (Polar-Express Newton-Schulz, 5 backend steps) for matrix params (lr=0.026, momentum=0.97, wd=0.095), AdamW for embeddings (lr=0.6, wd=0.085) and scalars (lr=0.02). Gradient clipping 0.3, warmdown 75%.
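A rough sketch of how that split could look in code, assuming a Muon implementation with Newton-Schulz orthogonalization is available; the parameter-selection heuristic and the Muon constructor arguments are assumptions, not the script's exact code:

```python
import torch


def build_optimizers(model: torch.nn.Module):
    # Split parameters: weight matrices go to Muon, embeddings and scalars to AdamW.
    matrix_params = [p for n, p in model.named_parameters()
                     if p.ndim >= 2 and "embed" not in n]
    embed_params = [p for n, p in model.named_parameters() if "embed" in n]
    scalar_params = [p for p in model.parameters() if p.ndim < 2]

    muon = Muon(matrix_params, lr=0.026, momentum=0.97, weight_decay=0.095,
                ns_steps=5)  # Polar-Express Newton-Schulz, 5 backend steps (assumed signature)
    adamw = torch.optim.AdamW([
        {"params": embed_params, "lr": 0.6, "weight_decay": 0.085},
        {"params": scalar_params, "lr": 0.02, "weight_decay": 0.0},
    ])
    return muon, adamw


# Per step: clip gradient norm to 0.3 before stepping (torch.nn.utils.clip_grad_norm_),
# and decay the LR over the final 75% of training ("warmdown"), e.g. via a LambdaLR schedule.
```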


Quantization

Quantization was a bit of a black box, though we've done it before. My intuition was that group quantization should produce a more stable estimate of the parameters and be better suited for GPTQ. However, the per-group statistics also take additional space, which pushes the submission file over the size limit — the gain does not justify the cost.

Intuitively, QAT should work even better, but I never got a successful QAT run.

Final config:

  • GPTQ int6 for all attention + MLP weight matrices (16 calibration batches)
  • GPTQ int8 for tied embeddings
  • LQER error correction: rank 4, int4 factors, asymmetric (group 64), applied to the top-3 highest-error layers (see the sketch after this list)
  • Brotli compression
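A minimal sketch of the LQER idea: approximate each layer's quantization error with a low-rank factorization, so the dequantized weight becomes Q(W) + U V. Function names are illustrative, and the int4 quantization of the factors (asymmetric, group 64) is omitted:

```python
import torch


def lqer_factors(w: torch.Tensor, w_quant: torch.Tensor, rank: int = 4):
    """Return rank-`rank` factors (U, V) approximating the error w - w_quant."""
    err = (w - w_quant).float()
    U, S, Vh = torch.linalg.svd(err, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    V_r = Vh[:rank, :]
    return U_r, V_r                # at load time, use w_quant + U_r @ V_r


# Applied only to the layers with the largest quantization error, since every
# pair of factors adds bytes to the submission artifact.
```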

Test-Time Compute

This is absolutely the backdoor lottery ticket. The main theme is to align the trained distribution with the test-time distribution.

Final TTT config:

  • Phased global TTT: 1 phase, 2000 prefix docs, cosine LR (peak 0.001)
  • 215 gradient chunks over 48K suffix docs (32K tokens/chunk)
  • LoRA rank 96 on K, O, and MLP projections (Adam, beta1=0, beta2=0.999, wd=1.0); a sketch follows this list
  • TTT consistently drops BPB by ~0.01 from the quantized baseline
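A rough sketch of the LoRA-style test-time training step, assuming a simple wrapper around the frozen projections; the wrapper class, the model call, and the data handling are illustrative, not the script's actual implementation:

```python
import torch


class LoRALinear(torch.nn.Module):
    """Frozen base linear plus a trainable rank-r update: y = W x + B (A x)."""

    def __init__(self, base: torch.nn.Linear, rank: int = 96):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T


def ttt_step(model, tokens, targets, opt):
    """One gradient chunk over suffix docs; only LoRA params receive updates."""
    loss = model(tokens, targets)  # assumed model API returning the CE loss
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)


# Optimizer over the LoRA params only, matching the reported config:
# torch.optim.Adam(lora_params, lr=1e-3, betas=(0.0, 0.999), weight_decay=1.0)
```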

Reproduction

pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

SEED=1337 torchrun --standalone --nproc_per_node=8 final_s0_pr1851_mod_gptq_v2.py

Files

| File | Description |
| ---- | ----------- |
| final_s0_pr1851_mod_gptq_v2.py | Final training script |
| logs/*20260429*.log | Training logs for all 04/29 runs |
| train_gpt_s0_pr1851_mod.py | Earlier annotated PR #1851 exploration |
| train_gpt_s9.py | Prior S9 stack (bank-mode + Polar-Express Muon) |
| train_gpt_s9_caseops_lqer.py | Prior cap tokenizer variant |

Additional per-file notes:

  • Lines 306 and 603 used double-quoted strings inside an f-string, which the parser rejects before PEP 701 (Python 3.12).
  • Wraps the FA3 import in try/except. The fallback transposes between FA's (B,T,H,D) layout and SDPA's (B,H,T,D) and expands K/V for GQA so older torch versions without native GQA still work. Slower than FA3 — only for unblocking dev when FA3 isn't built; a minimal sketch follows.
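A minimal sketch of that fallback pattern, under the assumption that FA3 exposes flash_attn_func from flash_attn_interface (the exact module and return signature vary between builds); this is not the patch itself:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3; module name may differ by build
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False


def attention(q, k, v, causal=True):
    """q, k, v in FA layout (B, T, H, D); k/v may carry fewer heads (GQA)."""
    if HAS_FA3:
        out = flash_attn_func(q, k, v, causal=causal)
        return out[0] if isinstance(out, tuple) else out  # some builds also return the LSE
    # SDPA expects (B, H, T, D); expand K/V heads so GQA works on older torch.
    q_, k_, v_ = (t.transpose(1, 2) for t in (q, k, v))
    rep = q_.shape[1] // k_.shape[1]
    k_ = k_.repeat_interleave(rep, dim=1)
    v_ = v_.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q_, k_, v_, is_causal=causal)
    return out.transpose(1, 2)  # back to (B, T, H, D)
```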
  • Spins up 3 tmux sessions, each running train_gpt_0427.py on its own GPU with a different seed and a unique RUN_ID. Defaults: GPUs 0,1,2, seeds 1337-1339, MAX_WALLCLOCK_SECONDS=4800 (8x the 600s 8xH100 budget, to roughly step-match on 1xH100). Includes pre-flight checks for venv, dataset shards, and tokenizer; uses python -u + PYTHONUNBUFFERED=1 so log output flushes through tee in real time. Configurable via env: VENV, REPO, SCRIPT, SEEDS_OVERRIDE, GPUS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV.
  • S9 stack variant alongside 0427: bank-mode weight storage (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank), Polar-Express Newton-Schulz coefficients for Muon, fused Triton softcapped CE, phased LoRA TTT, and global SGD post-quant repair. Has its own 3-tier flash-attn fallback (FA3 -> FA2 -> SDPA), so no hand-patch is needed.
  • Sibling of run_3seeds.sh; defaults to train_gpt_s9.py and uses session prefix "s9_" + run-id prefix "s9" so it can run alongside the 0427 launcher without colliding (different tmux session names, different log filenames). Same configurable env vars (GPUS_OVERRIDE, SEEDS_OVERRIDE, MAX_WALLCLOCK_SECONDS, VOCAB_SIZE, EXTRA_ENV, etc.).
  • Reproduces the 2026-04-09 record (SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT, val_bpb=1.0810 3-seed mean). Points at the LZMA-compressed code wrapper inside the record folder, defaults to seeds 42/314/999 (matching the record), and sets the record's documented env overrides (QK_GAIN_INIT=5.25, TTT_ENABLED=1, TTT_LR=0.005, TTT_EPOCHS=3). Session prefix r0409_ so it can run alongside the 0427 and S9 launchers.
  • Three scripts for preparing the lossless-caps caseops dataset, used by the train_gpt_s9_caseops_lqer.py training variant:
    - lossless_caps.py — case encoding/decoding logic
    - prepare_caseops_data.py — dataset preparation pipeline
    - retokenize_corpus.py — re-tokenization helper
  • S9 stack extended with caseops dataset support and LQER (Low-rank Quantization Error Rescue): 4487 lines vs train_gpt_s9.py's 4363. This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4 in stage 1, Z0/P*/Q*/R* in stage 2).
  • 5252-line training script reproducing PR openai#1851's stack with extensive inline annotations (CN comments). Mandatory FA3 import (no SDPA fallback) and direct Triton kernel use. Sibling of the train_gpt_s9*.py variants.
  • Single-seed (42) result on 8xH100 SXM. Key additions over the S9 base: SmearGate, sparse attention gating, LQER rank-4 asymmetric quantization, and embed int7. Artifact size 15.92 MB. Phased TTT eval in ~554s. Best results (10min 8GPU, TTT val_bpb): cap_lrelu03 seed1337 1.0713; final_cap seed42 1.0718; final_cap seed999 1.0735.