Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)#1072
Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)#1072vimeto wants to merge 1 commit intoopenai:mainfrom
Conversation
…7 (1-seed, pending 3-seed)
… reset Combines the best of three approaches: PR openai#1060 (1.1122): coprime loader + Full GPTQ + XSA-all PR openai#1072 (1.117): fused Triton MLP (matmul+activation, 70ms/step) Ours: TTT periodic reset (anti-drift) Expected: ~7900 steps (vs 6700) with PR openai#1060 quality innovations = best training throughput + best quantization + best eval. Fused MLP kernel from PR openai#1072 uses TMA TensorDescriptors (H100 only). Falls back to standard path on non-Hopper GPUs. TTT sweep tests 4 configs on the same trained checkpoint: sota_ttt, pr1039, reset/100, reset/50 Total H100 time: ~10min train + 4×7min TTT ≈ 40 min Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both from top submissions, zero code risk: MUON_BACKEND_STEPS=4 (PR openai#1089): 4 NS iterations vs 5 Saves ~1-2ms/step, proven at 1.1086 BPB BIGRAM_VOCAB_SIZE=4096 (PR openai#1072): larger hash table More n-gram patterns, proven at 1.117 BPB MLP 3.5x investigated but doesn't fit 16MB budget (+2.2MB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tile engine-inspired block-level Triton fusion for the 10min/16MB track: - Full-depth MLP megakernel: 5 ops (RMSNorm → UpProj → LeakyReLU² → DownProj → Residual) fused into 1 Triton kernel. The 1536-dim intermediate is processed via tiled register accumulation and never materializes in HBM. Deeper than PR openai#1072. - Fused attention preprocessing: QK RMSNorm + partial RoPE + q_gain in 2 Triton kernels (down from 6+). Novel — nobody in competition fuses post-projection ops. - 41% memory reduction (1562 MiB vs 2656 MiB). Numerically exact (cos_sim>0.99998). - Based on PR openai#1019 (abaybektursun). H100 results PENDING. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Record: Fused LeakyReLU² + Online GPTQ + Parallel Muon — val_bpb 1.117 (1-seed)BPB: 1.117 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern) What I found in the code (head SHA The TTT path at line 1180 implements the score-first-per-chunk pattern: each chunk is scored under Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.43s, dim=512, layers=11, vocab=1024, code=114887 B, SMOKE_TEST_PASS Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass. Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually. Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.43s, dim=512, layers=11, vocab=1024, code=114887 B, SMOKE_TEST_PASS. Classification via deterministic AST-based |
Fused LeakyReLU² + Online GPTQ + Parallel Muon
val_bpb: 1.117 (1-seed, stride=16, pending 3-seed confirmation)
Artifact: 15.95 MB (with selective ±1 pruning)
No TTT — pure neural model with sliding window evaluation
Key Innovations
1. Fused Triton MLP Kernel — Custom Triton kernel fusing
F.linear → LeakyReLU(0.5) → squareinto one GPU pass. Eliminates the 1536-dim intermediate tensor write to HBM per layer. Result: 70ms/step (vs 87ms without) on 8xH100 SXM → 33% more training steps in the same wallclock.2. Online Hessian GPTQ — Hessian matrices (H = X^T X) accumulated during training via separate uncompiled forward passes every 25 steps. Eliminates the train-time vs GPTQ-time tradeoff: full 600s training budget + Full GPTQ quality.
3. Selective ±1 Pruning — After INT6 quantization, adaptively zeros the least-significant ±1 weights (sorted by scale²) to precisely control artifact size to ≤16MB.
Results
3-seed runs pending due to cloud GPU infrastructure instability. Projected 3-seed mean: ~1.117.
Architecture
11L/512d, 8H/4KV GQA, LeakyReLU(0.5)², XSA all 11 layers, BigramHash 4096, VE128 layers 9-10, SmearGate, Partial RoPE 16/64, U-Net skips, LN Scale 1/√(layer+1), logit softcap 30.
Training
Parallel Muon (parameter banking, 3-phase overlapped reduce-scatter/all-gather, no DDP) + Adam. 786K batch, warmdown=3000, QAT@0.5, EMA 0.997, SWA every 50. Online Hessian GPTQ INT6 + LZMA preset=9 + selective ±1 pruning.
Comparison
Credits
Built on: PR #549 (Parallel Muon), PR #414 (base arch), PR #198 (XSA), PR #287 (Partial RoPE), PR #493 (LeakyReLU²), modded-nanogpt (fused kernel pattern).