
New SOTA: 1.12676 BPB - 11L XSA-all(11) + GPTQ-lite + EMA + Late QAT#478

Open
gowtham0992 wants to merge 1 commit into openai:main from gowtham0992:submission/11L-XSA-all-GPTQ-lite-EMA-LateQAT

Conversation

@gowtham0992

New SOTA Record: val_bpb 1.12676 (3-seed mean)

Beats the current SOTA (1.14276) by 0.016 BPB.

3-Seed Results (8xH100 SXM, 600s)

Seed   BPB       Size
42     1.12713   15.64 MB
1337   1.12648   15.62 MB
2024   1.12667   15.88 MB
Mean   1.12676   ~15.7 MB

Key Techniques

  • XSA on ALL 11 layers
  • GPTQ-lite optimal clip percentile search
  • EMA(0.997) + Tight SWA
  • Late QAT int6-all at LR scale < 0.15
  • Raw binary + zstd22 serialization

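One of the techniques above, EMA(0.997), can be sketched as a shadow-weights update. This is a minimal illustration under my own assumptions (plain dicts of scalars; the function name and structure are hypothetical, not taken from this submission's `train_gpt.py`):

```python
def ema_update(ema, params, decay=0.997):
    """Exponential moving average of model weights.

    After each optimizer step, blend the shadow (EMA) copy toward the
    live weights; the shadow copy is what gets evaluated and saved.
    """
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema
```

In practice the shadow copy would hold tensors rather than scalars, and SWA would additionally average several late checkpoints of the EMA weights.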
Dependencies

  • zstandard, flash_attn_3 (see requirements.txt)

Verified on RunPod 8xH100 SXM (official template): 1.12753 BPB

See README.md for full details.

@mohosy

mohosy commented Mar 23, 2026

XSA on all 11 layers is bold; most people only do the last 3 or 4. Does it actually help on the early layers too, or is it just not hurting? Also, the GPTQ-lite clip search is a nice touch; I haven't seen anyone else do that yet.

@gowtham0992
Author

XSA on all 11 layers is bold; most people only do the last 3 or 4. Does it actually help on the early layers too, or is it just not hurting? Also, the GPTQ-lite clip search is a nice touch; I haven't seen anyone else do that yet.

Thanks! Yeah, XSA on all layers actually helps; it's not just "not hurting". The ablation:

XSA-all(11):       1.12676 BPB, 6764 steps, 88.7 ms/step
XSA(4) last only:  1.13266 BPB, 6998 steps, 85.7 ms/step

So that's a ~0.006 BPB win even though it's 3 ms/step slower and we lose ~230 steps. Early layers tend to repeat self-value patterns; XSA forces them to actually encode new information. At 11L 512d, every layer counts.

GPTQ-lite is basically free: try 5 clip percentiles per row and pick the one with minimum MSE. It adds about 2 seconds to the save step and recovers ~0.0006 BPB of the quantization gap.
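The per-row clip search described above can be sketched like this. This is a hedged illustration only: the function name, the specific percentile grid, and symmetric rounding are my assumptions, not the PR's actual code.

```python
def best_clip_quantize(row, bits=6,
                       percentiles=(0.99, 0.995, 0.999, 0.9999, 1.0)):
    """Per-row clip search: for each candidate clip percentile, quantize
    the row symmetrically to `bits` bits and keep the candidate with the
    lowest reconstruction MSE. Returns (mse, int codes, scale)."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6
    mags = sorted(abs(w) for w in row)  # magnitudes for percentile lookup
    best = None
    for p in percentiles:
        clip = mags[min(len(mags) - 1, int(p * (len(mags) - 1)))]
        if clip == 0:
            continue
        scale = clip / qmax
        q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
        mse = sum((qi * scale - w) ** 2 for qi, w in zip(q, row)) / len(row)
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    return best
```

A real implementation would vectorize this over all rows of a weight matrix at once; the search itself is embarrassingly parallel, which is consistent with the ~2-second cost quoted above.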

@MatoTeziTanka

Community Review — New SOTA: 1.12676 BPB - 11L XSA-all(11) + GPTQ-lite + EMA + Late QAT

BPB: 1.12676 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 07884e6ce1c4, file records/track_10min_16mb/2026-03-22_11L_XSA-all_GPTQ-lite_EMA_LateQAT_1.1271/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
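For reference, the sliding-window stride-64 eval pattern mentioned above can be sketched as follows. Everything here is an assumption for illustration: `nll_fn` is a placeholder for the model's scoring call, the window size of 1024 and the one-token-per-byte conversion are mine, not verified against the submission's eval code.

```python
import math

def sliding_window_bpb(nll_fn, tokens, window=1024, stride=64):
    """Sliding-window eval: advance `stride` tokens at a time, giving
    each scored token up to `window` tokens of left context, and count
    only the newly uncovered positions. `nll_fn(ctx, start)` returns the
    summed negative log-likelihood (nats) of ctx[start:] given ctx[:start]."""
    total_nll, counted, pos = 0.0, 0, 0
    while pos < len(tokens):
        lo = max(0, pos + stride - window)   # window start
        hi = min(len(tokens), pos + stride)  # window end
        total_nll += nll_fn(tokens[lo:hi], pos - lo)
        counted += hi - pos                  # only new tokens are scored
        pos = hi
    # convert nats per token to bits per byte (one token per byte assumed)
    return total_nll / math.log(2) / counted
```

With a model that assigns exactly ln 2 nats per token, this returns 1.0 BPB, which is a quick sanity check on the bookkeeping.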

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.08s, dim=512, layers=11, vocab=1024, code=53472 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

