`records/track_10min_16mb/2026-03-24_11L_SOTA_MLP35x/README.md`

# Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330)

**3-seed mean val_bpb: 1.1330** (std=0.0007)

| Seed | val_bpb | val_loss | Steps |
|------|---------|----------|-------|
| 1337 | 1.1334 | 1.9136 | 3842 |
| 42 | 1.1322 | 1.9116 | 3885 |
| 2024 | 1.1334 | 1.9136 | 3857 |

## Architecture (31.4M parameters)
- 11 transformer layers, dim=512, 8 query heads / 4 KV heads (GQA)
- MLP with 3.5x expansion (hidden=1792) and a **LeakyReLU(0.5)^2** activation (see the sketch after this list)
- **SmearGate** + **BigramHash(10240, dim=128)** + **TrigramHash(4096, dim=128)**
- **Value Residual (ResFormer)** — V from layer 0 is cached and blended into later layers via a learned lambda (sketched below)
- **Gated Attention** — per-head sigmoid gate (nn.Linear, bias initialized to 4.0; sketched below)
- **XSA (exclusive self-attention) on all 11 layers**
- **Partial RoPE** — rotary embeddings applied to 16 of 64 head dimensions
- Tied FP16 embeddings, U-Net skip connections, orthogonal initialization
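
A minimal PyTorch sketch of the MLP block, assuming the activation squares the LeakyReLU output elementwise (the record does not spell out the composition); module and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakySquaredMLP(nn.Module):
    """Hypothetical 3.5x-expansion MLP (512 -> 1792 -> 512) with a
    LeakyReLU(0.5)^2 activation: square the LeakyReLU output elementwise."""

    def __init__(self, dim: int = 512, hidden: int = 1792):
        super().__init__()
        self.fc_in = nn.Linear(dim, hidden, bias=False)
        self.fc_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.leaky_relu(self.fc_in(x), negative_slope=0.5)
        return self.fc_out(h * h)  # the "(0.5)^2": negative inputs map to (0.5x)^2
```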
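
Similarly, hedged sketches of the value-residual blend and the per-head attention gate. Parameter names, tensor layouts, and the sigmoid parameterization of lambda are assumptions; only the layer-0 V cache, the learned blend, and the bias init of 4.0 come from the record:

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """ResFormer-style value residual: blend this layer's V with the cached
    V from layer 0 via a learned scalar lambda (assumed sigmoid-squashed)."""

    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.zeros(()))  # sigmoid(0) = 0.5 blend at init

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.lam)
        return lam * v + (1.0 - lam) * v0

class HeadGate(nn.Module):
    """Per-head sigmoid gate on the attention output. Bias init 4.0 gives
    sigmoid(4) ~ 0.98, so gates start nearly open."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, n_heads)
        nn.init.constant_(self.proj.bias, 4.0)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); attn_out: (B, T, n_heads, head_dim) -- layout assumed
        gate = torch.sigmoid(self.proj(x))   # (B, T, n_heads)
        return attn_out * gate.unsqueeze(-1)  # gate each head independently
```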

## Training
- Muon optimizer: lr=0.03, momentum ramped 0.92→0.99 over the first 1500 steps, weight decay 0.04 (momentum schedule and EMA sketched after this list)
- Adam for embeddings (lr=0.035) and scalars (lr=0.03)
- Batch size 786,432 tokens, seq_len 2048
- EMA of weights (decay=0.997); learning-rate warmdown over 3500 iterations
- Late QAT (quantization-aware training) via a straight-through estimator, enabled for the final 15% of wallclock
- Gradient clipping at 0.3
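
Hedged sketches of the momentum ramp and the EMA update. The linear ramp shape is an assumption (the record only gives the endpoints and step count); function names are illustrative:

```python
import torch

def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  ramp_steps: int = 1500) -> float:
    """Momentum schedule 0.92 -> 0.99 over the first 1500 steps, then held
    constant. A linear ramp is assumed."""
    frac = min(step / ramp_steps, 1.0)
    return start + frac * (end - start)

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.997) -> None:
    """One EMA step: ema <- decay * ema + (1 - decay) * online weights."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - decay)
```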

## Quantization
- Int6 uniform per-row quantization with GPTQ-lite: a clip search over 5 percentile candidates per row (sketched below)
- FP16 passthrough for tied embeddings
- zstd-22 compression
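
A sketch of the per-row int6 quantizer under stated assumptions: symmetric codes in [-32, 31], one fp16 scale per row, and a 5-candidate percentile grid for the clip search (the record gives the candidate count but not the exact percentiles):

```python
import torch

def quantize_int6_per_row(w: torch.Tensor):
    """Per-row uniform int6 quantization with a small clip search: for each
    row, try 5 clip thresholds (percentiles of |w|, grid assumed) and keep
    the one with the lowest reconstruction MSE."""
    qmax = 31.0                                    # int6 symmetric: [-32, 31]
    rows = w.shape[0]
    best_err = torch.full((rows,), float("inf"))
    best_q = torch.zeros_like(w, dtype=torch.int8)
    best_scale = torch.ones(rows, 1)
    for p in (0.995, 0.9975, 0.999, 0.9995, 1.0):  # 5 candidates (assumed grid)
        clip = torch.quantile(w.abs().float(), p, dim=1, keepdim=True).clamp_min(1e-8)
        scale = clip / qmax
        q = torch.clamp(torch.round(w / scale), -32, 31)
        err = ((q * scale - w) ** 2).sum(dim=1)
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_q[better] = q[better].to(torch.int8)
        best_scale[better] = scale[better]
    return best_q, best_scale.half()  # dequant: best_q.float() * best_scale.float()
```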

## Evaluation
- Sliding-window evaluation with stride=64 (sketched below)
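
A minimal sketch of stride-64 sliding-window evaluation, assuming a model that returns per-token logits: only the last 64 tokens of each window are scored, so each scored token sees near-full left context. Converting bits-per-token to bits-per-byte would additionally require the tokens-to-bytes ratio, omitted here:

```python
import math
import torch

@torch.no_grad()
def sliding_window_bits_per_token(model, tokens: torch.Tensor,
                                  window: int = 2048, stride: int = 64) -> float:
    """Score a long token stream with overlapping windows; only the final
    `stride` positions of each window contribute to the loss."""
    nll, count = 0.0, 0
    for start in range(0, tokens.numel() - window, stride):
        chunk = tokens[start : start + window + 1]
        logits = model(chunk[:-1].unsqueeze(0))      # (1, window, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        tgt = chunk[1:].unsqueeze(0).unsqueeze(-1)   # (1, window, 1)
        tok_nll = -logp.gather(-1, tgt).squeeze(-1)  # (1, window)
        nll += tok_nll[0, -stride:].sum().item()     # score the fresh tail only
        count += stride
    return (nll / count) / math.log(2)               # nats -> bits
```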

## Development Process
Developed with a 30-experiment autoresearch loop on 1xH100 (~8 hours), then validated across 3 seeds on 8xH100 SXM.

### Feature ablation (measured on 1xH100)

| Feature | Δ val_bpb (negative = better) |
|---------|-------------------------------|
| Value Residual | -0.017 |
| SmearGate | -0.010 |
| XSA all 11 layers | -0.005 |
| Gated Attention | -0.004 |
| Partial RoPE (16/64) | -0.004 |
| TrigramHash | -0.002 |
| Late QAT | -0.002 |

## Record metadata
```json
{
  "author": "Aryan Bhosale",
  "github_id": "aryanbhosale",
  "name": "11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330)",
  "blurb": "11-layer 512d transformer with MLP 3.5x LeakyReLU(0.5)^2, SmearGate, BigramHash(10240), TrigramHash(4096), Value Residual, Gated Attention, XSA-all-11, Partial RoPE(16/64). Muon lr=0.03 WD=0.04, EMA(0.997), Late QAT, int6+GPTQ-lite+zstd-22. 3-seed mean 1.1330 (std=0.0007) on 8xH100 SXM.",
  "date": "2026-03-24T12:00:00Z",
  "val_loss": 1.9129,
  "val_bpb": 1.1330,
  "bytes_total": 10500000,
  "bytes_code": 70872,
  "seeds": {
    "1337": {"val_bpb": 1.1334, "val_loss": 1.9136, "steps": 3842},
    "42": {"val_bpb": 1.1322, "val_loss": 1.9116, "steps": 3885},
    "2024": {"val_bpb": 1.1334, "val_loss": 1.9136, "steps": 3857}
  }
}
```