Record: 10L Int5-MLP + SmearGate + BigramHash + Late QAT (val_bpb=1.1628)#286
chris-buckley wants to merge 1 commit into openai:main
Conversation
Community Review — Record: 10L Int5-MLP + SmearGate + BigramHash + Late QAT (val_bpb=1.1628)

BPB: 1.1628 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA): static review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.12 s, dim=512, layers=10, vocab=1024, code=56721 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16 MB artifact cap, ≤600 s train + ≤600 s eval on 8×H100 SXM). No compliance flags from the classification pass; this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that it missed because it is factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
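For readers unfamiliar with the eval pattern the review names, here is a minimal sketch of a sliding-window stride-64 bits-per-byte eval, assuming a causal LM scored over a flat token stream. The names (eval_bpb, ctx_len) and the 1-token-per-byte simplification are illustrative assumptions, not the submission's actual eval code.

import math
import torch

@torch.no_grad()
def eval_bpb(model, tokens, ctx_len=1024, stride=64, device="cuda"):
    # Slide a ctx_len window over the stream in steps of `stride`,
    # scoring only the last `stride` tokens of each window so every
    # scored token sees (nearly) full left context.
    model.eval()
    nll_sum, n_scored = 0.0, 0
    for start in range(0, len(tokens) - ctx_len, stride):
        window = torch.tensor(tokens[start : start + ctx_len + 1], device=device)
        x, y = window[:-1].unsqueeze(0), window[1:].unsqueeze(0)
        logits = model(x)  # (1, ctx_len, vocab)
        logp = torch.log_softmax(logits[0, -stride:], dim=-1)
        nll_sum += -logp.gather(1, y[0, -stride:, None]).sum().item()
        n_scored += stride
    # bits-per-byte = nats per scored token / ln 2, assuming ~1 byte per
    # token for illustration; real BPB divides by the decoded byte count.
    return nll_sum / n_scored / math.log(2)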
Summary
Mixed-precision int5/int6 export trades per-weight precision for an extra transformer layer: MLP weights go int5 while attention stays int6, buying enough artifact budget for a 10-layer ReLU² model under the 16 MB cap. SmearGate and BigramHash inject cheap token-pair context without learned parameters, and late QAT (kicking in at 85% wallclock) avoids the training instability of always-on STE while still closing most of the quantization gap.
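As a concrete illustration of the quantization half of this stack, here is a minimal sketch of symmetric per-tensor fake quantization with a straight-through estimator, switched on only after 85% of the wallclock budget, int5 for MLP weights and int6 for attention. The names (FakeQuant, quantized_weight, is_mlp) and the gating plumbing are assumptions, not the submission's actual code.

import time
import torch

class FakeQuant(torch.autograd.Function):
    # Symmetric per-tensor fake quantization: forward rounds weights onto the
    # int grid, backward is the straight-through estimator (identity gradient).
    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1                  # 15 for int5, 31 for int6
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                       # STE: pass gradient through

def quantized_weight(module, w, active):
    # int5 for MLP weights, int6 for attention; full precision before QAT starts.
    if not active:
        return w
    return FakeQuant.apply(w, 5 if getattr(module, "is_mlp", False) else 6)

# Late-QAT gate: fake-quant only switches on after 85% of the wallclock budget,
# sidestepping the early-training instability of always-on STE.
train_start, budget_s = time.time(), 600.0          # 600 s train cap (track rules)
def qat_on():
    return (time.time() - train_start) / budget_s >= 0.85

In the forward pass one would call quantized_weight(mod, mod.weight, qat_on()) in place of the raw weight, so the exported artifact matches what the model actually trained against in the final 15% of the run.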
Technique Stack
- Mixed-precision export: int5 MLP weights, int6 attention weights
- 10-layer ReLU² transformer (dim=512, vocab=1024)
- SmearGate: parameter-free gated smearing of previous-token context (see the sketch after this list)
- BigramHash: hashed token-pair features with no learned parameters (see the sketch after this list)
- Late QAT: straight-through fake-quant enabled at 85% of wallclock
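The PR doesn't spell out the SmearGate/BigramHash mechanics, so the following is one plausible reading offered as a sketch, not the submission's implementation: SmearGate blends each position's embedding with its predecessor through a data-dependent gate computed from the embeddings themselves (no learned weights), and BigramHash hashes the (previous, current) token-id pair into rows of the existing embedding table, adding token-pair context with zero new parameters. Every name and constant here is an assumption.

import torch

def smear_gate(x):
    # x: (B, T, D) token embeddings. The gate is derived from the embeddings
    # themselves, so no parameters are added.
    prev = torch.roll(x, shifts=1, dims=1)
    prev[:, 0] = 0.0                                  # position 0 has no predecessor
    gate = torch.sigmoid((x * prev).mean(dim=-1, keepdim=True))
    return x + gate * prev

def bigram_hash(idx, embed, prime=2654435761):
    # idx: (B, T) token ids; embed: the model's existing nn.Embedding.
    # Hashing (prev, cur) pairs into the existing table reuses its rows,
    # so the artifact size is unchanged.
    prev = torch.roll(idx, shifts=1, dims=1)
    prev[:, 0] = 0
    h = ((prev * prime) ^ idx) % embed.num_embeddings
    return embed(h)

Under this reading, bigram_hash(idx, wte) and smear_gate(...) would be added into the embedding path before the first block; since both reuse existing tensors, neither costs artifact budget.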
Metrics
- val_bpb: 1.1628 (single seed, 1337)
- Current best MLP3x submission: 1.1598 (this run does not beat it)
- Model: dim=512, layers=10, vocab=1024; code size 56721 B
Reproduction
pip install zstandard && \
RUN_ID=10l_int5mlp_smearbigram_lateqat_seed1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 \
  ./records/track_10min_16mb/2026-03-20_10L_Int5MLP_SmearBigram_LateQAT/train_gpt.py

Three-seed sweep:
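The sweep command itself isn't included in the PR; a plausible completion, assuming the same entry point and the three seeds named in the Status section below (1337, 42, 7):

for SEED in 1337 42 7; do
  RUN_ID=10l_int5mlp_smearbigram_lateqat_seed${SEED} \
  DATA_PATH=./data/datasets/fineweb10B_sp1024 \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 \
  SEED=${SEED} \
  torchrun --standalone --nproc_per_node=8 \
    ./records/track_10min_16mb/2026-03-20_10L_Int5MLP_SmearBigram_LateQAT/train_gpt.py
done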
Status
This is a single-seed result (seed 1337). It does not beat the current best MLP3x submission (val_bpb=1.1598). The technique stack is complete and the run is reproducible, but seeds 42 and 7 still need to be run for statistical significance before this qualifies as a proper record claim.
Posting this as a record contribution to document the mixed int5/int6 + late QAT approach. If the multi-seed results hold up or improve, I'll update this PR.