Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed) by vimeto · Pull Request #1072 · openai/parameter-golf

vimeto · 2026-03-29T11:34:17Z

Fused LeakyReLU² + Online GPTQ + Parallel Muon

val_bpb: 1.117 (1-seed, stride=16, pending 3-seed confirmation)
Artifact: 15.95 MB (with selective ±1 pruning)
No TTT — pure neural model with sliding window evaluation

Key Innovations

1. Fused Triton MLP Kernel — Custom Triton kernel fusing F.linear → LeakyReLU(0.5) → square into one GPU pass. Eliminates the 1536-dim intermediate tensor write to HBM per layer. Result: 70ms/step (vs 87ms without) on 8xH100 SXM → 33% more training steps in the same wallclock.

2. Online Hessian GPTQ — Hessian matrices (H = X^T X) accumulated during training via separate uncompiled forward passes every 25 steps. Eliminates the train-time vs GPTQ-time tradeoff: full 600s training budget + Full GPTQ quality.

3. Selective ±1 Pruning — After INT6 quantization, adaptively zeros the least-significant ±1 weights (sorted by scale²) to precisely control artifact size to ≤16MB.

Results

Seed	Steps	Step avg	Pre-quant	Sliding BPB	Stride	Artifact
1337	7,904	70.0ms	1.1290	1.1170	16	15.95MB
42	—	—	—	pending	—	—
2025	—	—	—	pending	—	—

3-seed runs pending due to cloud GPU infrastructure instability. Projected 3-seed mean: ~1.117.

Architecture

11L/512d, 8H/4KV GQA, LeakyReLU(0.5)², XSA all 11 layers, BigramHash 4096, VE128 layers 9-10, SmearGate, Partial RoPE 16/64, U-Net skips, LN Scale 1/√(layer+1), logit softcap 30.

Training

Parallel Muon (parameter banking, 3-phase overlapped reduce-scatter/all-gather, no DDP) + Adam. 786K batch, warmdown=3000, QAT@0.5, EMA 0.997, SWA every 50. Online Hessian GPTQ INT6 + LZMA preset=9 + selective ±1 pruning.

Comparison

Entry	Sliding BPB	TTT?
This (projected)	1.117	No
Merged SOTA (PR #549)	1.1194	Yes
PR #549 pre-TTT	1.1218	No

Credits

Built on: PR #549 (Parallel Muon), PR #414 (base arch), PR #198 (XSA), PR #287 (Partial RoPE), PR #493 (LeakyReLU²), modded-nanogpt (fused kernel pattern).

…7 (1-seed, pending 3-seed)

… reset Combines the best of three approaches: PR openai#1060 (1.1122): coprime loader + Full GPTQ + XSA-all PR openai#1072 (1.117): fused Triton MLP (matmul+activation, 70ms/step) Ours: TTT periodic reset (anti-drift) Expected: ~7900 steps (vs 6700) with PR openai#1060 quality innovations = best training throughput + best quantization + best eval. Fused MLP kernel from PR openai#1072 uses TMA TensorDescriptors (H100 only). Falls back to standard path on non-Hopper GPUs. TTT sweep tests 4 configs on the same trained checkpoint: sota_ttt, pr1039, reset/100, reset/50 Total H100 time: ~10min train + 4×7min TTT ≈ 40 min Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both from top submissions, zero code risk: MUON_BACKEND_STEPS=4 (PR openai#1089): 4 NS iterations vs 5 Saves ~1-2ms/step, proven at 1.1086 BPB BIGRAM_VOCAB_SIZE=4096 (PR openai#1072): larger hash table More n-gram patterns, proven at 1.117 BPB MLP 3.5x investigated but doesn't fit 16MB budget (+2.2MB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tile engine-inspired block-level Triton fusion for the 10min/16MB track: - Full-depth MLP megakernel: 5 ops (RMSNorm → UpProj → LeakyReLU² → DownProj → Residual) fused into 1 Triton kernel. The 1536-dim intermediate is processed via tiled register accumulation and never materializes in HBM. Deeper than PR openai#1072. - Fused attention preprocessing: QK RMSNorm + partial RoPE + q_gain in 2 Triton kernels (down from 6+). Novel — nobody in competition fuses post-projection ops. - 41% memory reduction (1562 MiB vs 2656 MiB). Numerically exact (cos_sim>0.99998). - Based on PR openai#1019 (abaybektursun). H100 results PENDING. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka · 2026-04-11T20:05:43Z

Community Review — Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)

BPB: 1.117 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 0561eb8d0dbd, file records/track_10min_16mb/2026-03-29_FusedLeakyReLU_OnlineGPTQ_ParallelMuon/train_gpt.py):

The TTT path at line 1180 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.43s, dim=512, layers=11, vocab=1024, code=114887 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.43s, dim=512, layers=11, vocab=1024, code=114887 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.11…

0561eb8

…7 (1-seed, pending 3-seed)

Bortlesboat mentioned this pull request Mar 31, 2026

Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean) #1169

Open

6 tasks

Gusanidas mentioned this pull request Apr 1, 2026

Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean) #1212

Open

AR6420 mentioned this pull request Apr 3, 2026

Full-Depth MLP Megakernel + Fused Attention Preprocessing (non-record) #1316

Open

MatoTeziTanka mentioned this pull request Apr 12, 2026

Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094) #1250

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)#1072

Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)#1072
vimeto wants to merge 1 commit intoopenai:mainfrom
vimeto:main

vimeto commented Mar 29, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vimeto commented Mar 29, 2026

Fused LeakyReLU² + Online GPTQ + Parallel Muon

Key Innovations

Results

Architecture

Training

Comparison

Credits

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants