Non-record: Turbo-Muon + EngramLite(10240) + VE(8,9,10) — val_bpb 1.1431 (#1205)
Based on the PR openai#1089 stack with hyperparameter tuning (see the sketch below):
- Higher LR (0.030 vs 0.025) for faster convergence
- Wider EngramLite (10240x48 vs 8192x32)
- VE on layers 8, 9, 10 (vs 9, 10)
- Warmdown 4500 steps (vs 3500)
- Muon momentum warmup 1000 steps (vs 1500)

3-seed mean: 1.1431 (std 0.0007)
Seeds: 1337 = 1.1425, 42 = 1.1438, 2024 = 1.1431
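For orientation, a minimal sketch comparing the tuned settings against the PR openai#1089 baseline. The variable names here are illustrative assumptions, not the actual flags defined in train_gpt.py.

```python
# Illustrative comparison of the tuned hyperparameters against the
# PR openai#1089 baseline. Names are hypothetical; the real settings
# live in train_gpt.py under whatever flags that script defines.
BASELINE = dict(lr=0.025, engram_shape=(8192, 32), ve_layers=(9, 10),
                warmdown_steps=3500, muon_momentum_warmup_steps=1500)
TUNED = dict(lr=0.030, engram_shape=(10240, 48), ve_layers=(8, 9, 10),
             warmdown_steps=4500, muon_momentum_warmup_steps=1000)

if __name__ == "__main__":
    # Print only the settings that changed between the two runs.
    for key, new in TUNED.items():
        if new != BASELINE[key]:
            print(f"{key}: {BASELINE[key]} -> {new}")
```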
Force-pushed from 2d2f0d7 to 974948e.
Community Review — Non-record: Turbo-Muon + EngramLite(10240) + VE(8,9,10) — val_bpb 1.1431

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #1205 Audit — Two Submissions
Head SHA: 974948e

---

## Submission 1: 2026-03-21_MixedQuant_BigramHash_SWA (val_bpb: 1.2421)

BigramHash implementation (lines 525–527):
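The quoted lines 525–527 are not reproduced here, so the following is only a generic sketch of what a bigram-hash feature typically looks like (hashing each consecutive token pair into a fixed-size embedding table). It is an assumption for illustration, not the submission's actual code.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (previous, current) token pair into a small embedding table."""

    def __init__(self, n_buckets: int, n_embd: int):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, n_embd)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids; pair each position with its predecessor.
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at position 0
        # Cheap multiplicative hash of the pair into a bucket id.
        buckets = (prev * 1000003 + idx) % self.n_buckets
        # Returned (B, T, n_embd) features would be added to the token embedding upstream.
        return self.table(buckets)
```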
Personal case study of my participation in the OpenAI Model Craft Challenge, plus the April Turbo-Muon submission brought to main so internal links resolve.

Contents:
- README.md: personal narrative and results tables
- docs/METHODS.md: technical breakdown of each technique used
- docs/EXPERIMENTS.md: verified runs and post-mortem of 020_ultimate
- docs/UPSTREAM_README.md: original OpenAI README preserved for context
- scripts/plot_curves.py: builds training curves from train_*.log (see the sketch below)
- assets/loss_curves.png: training dynamics of both submissions
- Rewritten README for the 2026-03-21 submission
- Full 2026-04-01 Turbo-Muon submission ported from the PR branch: README, submission.json, train_gpt.py, three seed logs

Results on main:
- 2026-03-21 Mixed Quantization + BigramHash + SWA: val_bpb 1.2421
- 2026-04-01 Turbo-Muon + EngramLite (3 seeds, std 0.0007): val_bpb 1.1431

Upstream PRs:
- openai#370
- openai#1205
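A minimal sketch of what scripts/plot_curves.py could look like, assuming log lines that contain a step number and a val_bpb value. The regex, log line format, and output path are assumptions, not the repository's actual conventions.

```python
# Parse "step ... val_bpb ..." pairs from train_*.log files and plot them.
import glob
import os
import re

import matplotlib.pyplot as plt

PATTERN = re.compile(r"step[:\s]+(\d+).*?val_bpb[:\s]+([\d.]+)")

def parse_log(path):
    steps, bpbs = [], []
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                steps.append(int(m.group(1)))
                bpbs.append(float(m.group(2)))
    return steps, bpbs

def main():
    for path in sorted(glob.glob("train_*.log")):
        steps, bpbs = parse_log(path)
        if steps:
            plt.plot(steps, bpbs, label=os.path.basename(path))
    plt.xlabel("step")
    plt.ylabel("val_bpb")
    plt.legend()
    os.makedirs("assets", exist_ok=True)
    plt.savefig("assets/loss_curves.png", dpi=150)

if __name__ == "__main__":
    main()
```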
Summary
Non-record submission based on the PR #1089 Turbo-Muon + EngramLite stack with hyperparameter tuning.
val_bpb: 1.1431 (3-seed mean, std 0.0007)
Changes from PR #1089
- Learning rate: 0.030 (vs 0.025)
- EngramLite: 10240x48 (vs 8192x32)
- VE on layers 8, 9, 10 (vs 9, 10)
- Warmdown: 4500 steps (vs 3500)
- Muon momentum warmup: 1000 steps (vs 1500)
Key Finding
The increased model size (~31.6M vs 30.7M params) pushed the artifact to 16.36MB pre-compression, forcing all 66 weight groups into int5 with 0 promotions to int6/int7 and 20.5% selective pruning. This aggressive quantization likely offset the architectural gains. The 16MB budget is extremely tight — even small parameter increases can cascade into significant quality loss through the quantization pipeline.
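A back-of-the-envelope check of that budget pressure, counting only the raw weight payload (group metadata, pruning masks, and post-compression savings are ignored; the parameter count and pruning fraction are the ones quoted above):

```python
# Rough payload size for ~31.6M parameters after 20.5% pruning at each
# candidate bit-width. Only raw weights are counted.
def payload_mb(n_params: float, prune_frac: float, bits: int) -> float:
    kept = n_params * (1.0 - prune_frac)
    return kept * bits / 8 / 1e6

for bits in (5, 6, 7):
    print(f"int{bits}: {payload_mb(31.6e6, 0.205, bits):.2f} MB")
# int5 ~15.7 MB, int6 ~18.8 MB, int7 ~22.0 MB: even before metadata,
# moving all groups above int5 would overshoot the 16 MB budget, which is
# why the pipeline left every group at int5.
```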
Hardware
8xH100 80GB SXM, 600s training, ~5550 steps at 106ms/step.