11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds)#333
mahsumaktas wants to merge 2 commits into openai:main
Conversation
…l_bpb=1.1754)

Combines 10 orthogonal improvements over the naive baseline:
- Per-row INT6 quantization + zstd-22 compression (13.98 MB artifact)
- FP16 tied embedding export (near-zero quantization gap)
- MLP 2.5x expansion
- SmearGate + BigramHash bigram-aware modules
- OrthoInit + muP scaling + phase-transition residual mixing
- Muon weight decay (0.02)
- Stochastic Weight Averaging (4 checkpoints)
- Sliding-window evaluation (stride=64)
- Tuned hyperparameters (grad_clip=0.3, warmdown=3000)

8xH100 SXM, 9919 steps in 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
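The per-row INT6 + compression step above can be sketched roughly as follows. This is a minimal illustration under assumptions: stdlib `zlib` stands in for zstd-22, and all function names here are invented, not taken from the submission.

```python
import zlib
import numpy as np

def quantize_rows_int6(w: np.ndarray):
    """Quantize each row to 6-bit signed ints with one fp16 scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0  # symmetric int6 range [-31, 31]
    scale[scale == 0] = 1.0                              # guard all-zero rows
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_rows_int6(w)
blob = zlib.compress(q.tobytes(), level=9)  # the actual artifact uses zstd level 22
err = np.abs(w - dequantize(q, s)).max()    # bounded by ~half a quantization step
```

Per-row scales keep the quantization step proportional to each row's magnitude, which is why the fp16-scale + int6 combination loses so little accuracy.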
Major upgrade from V1 to V2 with 23 GPU runs on 8xH100:
- 11 layers (was 9) + XSA on last 4 layers
- MLP 2.75x (was 2.5x), the sweet spot for a 16MB artifact at 11 layers
- RoPE base 50K, LR 0.025, SmearGate + BigramHash(2048)
- SWA/50 with fp32 accumulation (fixes catastrophic bf16 rounding)
- OrthoInit + Overtone SVD + phase-transition residual mixing
- INT6 + zstd-22 + FP16 tied embed + Late-K FP16

3-seed validation: 1.1538 / 1.1565 / 1.1593 (mean 1.1565, std 0.0028)
Artifact: 15.87-15.99 MB (all under 16MB)

23 runs tested: EMA, depth recurrence, seq curriculum, LR sweep, WD sweep, QAT, MLP 3x; all documented in the README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
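The "SWA/50 with fp32 accumulation" line refers to keeping the running checkpoint average in fp32: an incremental mean accumulated in bf16 loses low-order bits on every update, which is the "bf16 catastrophic" failure mentioned above. A minimal sketch (numpy, invented names; not the PR's own implementation):

```python
import numpy as np

def swa_update(avg: np.ndarray, ckpt: np.ndarray, n: int) -> np.ndarray:
    """Fold the n-th checkpoint (1-indexed) into a running mean, entirely in fp32."""
    return avg + (ckpt.astype(np.float32) - avg) / n

# Average 4 checkpoints, as the V1 commit message describes.
rng = np.random.default_rng(0)
ckpts = [rng.standard_normal(16).astype(np.float32) for _ in range(4)]
avg = np.zeros(16, dtype=np.float32)
for n, c in enumerate(ckpts, 1):
    avg = swa_update(avg, c, n)
# avg now equals the plain mean of the 4 checkpoints, up to fp32 rounding
```

The incremental form `avg + (x - avg)/n` avoids storing all checkpoints at once, which matters when each checkpoint is a full model state dict.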
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods; validated M5 (val-CE) as the most reliable (rho=0.984). But all 80 shards have nearly identical bigram statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard variance is 535x larger than between-shard. Selected the top 12% by bigram CE and by a 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model perplexity. Mean delta: -0.0006, one seed regressed. The 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades diversity for match quality, and diversity wins. Corroborated by PRs openai#737, openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
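The chunk-level bigram-CE scorer described above can be sketched like this. The add-alpha smoothing, vocabulary size, and all names are illustrative assumptions; the PR's actual scorer may differ.

```python
import math
from collections import Counter

def bigram_ce(chunk, bigram_counts, unigram_counts, vocab_size, alpha=0.5):
    """Mean cross-entropy (bits/token) of a chunk under an add-alpha bigram model."""
    ce = 0.0
    for a, b in zip(chunk, chunk[1:]):
        p = (bigram_counts[(a, b)] + alpha) / (unigram_counts[a] + alpha * vocab_size)
        ce -= math.log2(p)
    return ce / max(len(chunk) - 1, 1)

# Toy corpus statistics (token ids); real counts come from the full training set.
corpus = [1, 2, 3, 1, 2, 3, 1, 2]
bc = Counter(zip(corpus, corpus[1:]))
uc = Counter(corpus[:-1])

in_dist = bigram_ce([1, 2, 3, 1], bc, uc, vocab_size=4)   # chunk matching corpus bigrams
out_dist = bigram_ce([3, 3, 3, 3], bc, uc, vocab_size=4)  # chunk of unseen bigrams
# in_dist < out_dist: familiar chunks score a lower cross-entropy
```

Selecting only low-CE ("easy") or only high-CE ("hard") chunks by this score is exactly the intervention that made val_bpb worse, supporting the diversity-over-match-quality conclusion.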
Community Review: 11L XSA4 + SmearGate + BigramHash + SWA + RoPE50K (mean val_bpb=1.1565, 3 seeds)

BPB: 1.1565 | Compliance: LOOKS CLEAN (pure-neural submission, no TTT/SLOT/n-gram-cache)

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=10, vocab=1024, code=66074 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that it missed because the mechanism is factored into a helper file or hidden behind a non-standard function name, please flag it and I will re-run the audit manually.

Reviewed by @MatoTeziTanka (The Agora).
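The "standard sliding-window stride-64 pattern" the review refers to scores each token with near-full left context while advancing the window only 64 tokens at a time; only the last 64 tokens of each window contribute to the loss, so no token is counted twice. A schematic index computation (illustrative names and parameters, not the submission's eval code):

```python
def sliding_window_spans(n_tokens: int, ctx: int = 1024, stride: int = 64):
    """Yield (window_start, window_end, score_start) triples: the window covers
    [window_start, window_end) but only tokens [score_start, window_end) are scored."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - ctx)   # extend left context up to ctx tokens
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans

# Small example: 200 tokens, 128-token context, stride 64.
spans = sliding_window_spans(200, ctx=128, stride=64)
# Every token is scored exactly once, and every window fits in the context.
```

The tradeoff is compute: stride 64 with a 1024-token context re-runs most of the sequence many times, in exchange for a lower (and more honest) per-token loss than chunked evaluation.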
Summary
Mean val_bpb: 1.1565 (3 seeds) | Best: 1.1538 (seed 1337) | Artifact: ~15.9 MB
23 GPU runs on 8xH100 SXM5. Systematic exploration of XSA, EMA vs SWA, depth recurrence, seq curriculum, LR/WD sweep, and MLP scaling.
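The BigramHash(2048) module named in the title hashes each (previous, current) token pair into a small embedding table whose output is added to the hidden state, giving the model cheap bigram features. The hash function, table initialization, and class below are assumptions for illustration; only the 2048-bucket size comes from the PR.

```python
import numpy as np

class BigramHash:
    """Hash the (previous, current) token pair into a small embedding table."""
    def __init__(self, n_buckets: int = 2048, dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.n_buckets = n_buckets
        self.table = (0.02 * rng.standard_normal((n_buckets, dim))).astype(np.float32)

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        prev = np.roll(tokens, 1, axis=-1)
        prev[..., 0] = 0                                     # no predecessor at position 0
        idx = (prev * 1_000_003 + tokens) % self.n_buckets   # cheap multiplicative pair hash
        return self.table[idx]  # (batch, seq, dim); added to the residual stream in the model

bh = BigramHash()
toks = np.array([[5, 9, 9, 5, 1]])
out = bh(toks)  # one dim-64 embedding per position
```

Because the table has only 2048 x dim parameters, hash collisions are frequent, but the module still captures the most common bigram statistics at a tiny parameter cost, which is what matters under a 16MB artifact cap.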
Techniques
Results (3 seeds)
Key Findings from 23 Runs
Run command
Test plan
Built with Claude Code