- Competition: OpenAI Parameter Golf
- Objective: Minimize validation loss (bits-per-byte, bpb) under a 16MB artifact constraint within 10-minute training on 8×H100
- Total experiments: 119+
- Date range: Early 2026 — 2026-04-13
- Best result: 1.0744 legal_ttt bpb (ImprovedParallelResiduals, community PR #1523, 8×H100)
- Global Leaderboard
- Phase 3a: Baseline Experiments (exp00–exp18)
- Phase 3b-Part1: Systematic Ablations (exp27b–exp33b)
- Phase 3b-Part2: LR Fix Era (exp34b–exp48b)
- Phase 3b-Part3: Simplification + XSA (exp53b–clean_54b)
- Phase 3.5: 8×H100 Simulation (exp60–exp80)
- Phase 3.6: Diagnostic-Driven Era (exp83–exp87)
- Phase 3b-Muon: Parallel Muon Optimizer (exp70_parallel_muon–exp91)
- Phase 3c: Architecture Rewrite + Meta-TTT (exp92–exp109)
- Phase 3c-Community: Community SOTA (SP8192+)
- Phase 3c-Frontier: Pushing Past Community (exp110–exp119)
- Misc: Co-occurrence QK Init
- Known Issues
- Key Learnings by Phase
- TLDR: Top 20 Learnings
- Appendix: Full Lineage Trees
| Rank | Experiment | Date | legal_ttt | val_bpb | int6_bpb | Artifact | Hardware | Source |
|---|---|---|---|---|---|---|---|---|
| 1 | ImprovedParallelResiduals | 2026-04-11 | 1.0744 | — | — | 15.96 MB | 8×H100 | Community PR #1523 |
| 2 | WiderEmb_TapInV6_TTT | 2026-04-10 | 1.0788 | 1.0813 | 1.0980 | ~16 MB | 8×H100 | Community |
| 3 | SP8192_3LayerRecur | 2026-04-09 | 1.0808 | 1.0873 | 1.0997 | ~16 MB | 8×H100 | Community |
| 4 | exp101 | 2026-04 | 1.11588 | 1.1352 | 1.13930 | ~16 MB | 8×H100 | Our work |
| 5 | exp95 | 2026-03 | 1.1169 | 1.1363 | — | ~16 MB | 8×H100 | Our work |
| 6 | exp74 | 2026-03 | — | 1.1539 | 1.1685 | 15.86 MB | 1×H100 (sim) | Our work |
| 7 | exp54b | 2026-03 | — | 1.2642 | 1.2708 | 15.54 MB | 1×H100 | Our work |
| Date | Best legal_ttt | Experiment | Key Innovation |
|---|---|---|---|
| Early | 1.3389 (quant) | exp00 baseline | Starting point |
| Early | 1.3145 (quant) | exp09 | Step count + loss masking |
| Phase 3b | 1.2708 (quant) | exp54b | LR fix + simplification |
| 2026-03 | 1.1456 (sliding) | exp74 | Partial RoPE + diagnostics |
| 2026-03 | 1.1169 | exp95 | Meta-TTT + size optimization |
| 2026-04-04 | 1.11588 | exp101 | Position-conditional bigram |
| 2026-04-09 | 1.0808 | SP8192_3LayerRecur (community) | SP8192 + depth recurrence |
| 2026-04-10 | 1.0788 | WiderEmb_TapInV6 (community) | Wider loop + Tap-In V6 |
| 2026-04-11 | 1.0744 | ImprovedParallelResiduals (community, PR #1523) | Cross-lane parallel residuals |
- Hardware: 1×H100 (or A100), 600s wallclock
- Base: exp27 (modded-nanogpt reference)
- Best result: exp09/exp13 — quant bpb 1.3145, artifact 14.5 MB
| Parameter | Value |
|---|---|
| Model dim | 512 |
| Num layers | 11 (10 unique, layer sharing) |
| Attention | GQA, 8 Q-heads, 4 KV-heads, head_dim=64 |
| MLP | LeakyReLU², mlp_mult=3.0 (hidden=1536) |
| Vocab size | 1024 (SentencePiece BPE) |
| Seq length | 2048 |
| Softcap | 30 |
| Optimizer | Muon (matrices) + Adam (scalars) |
| Momentum | Cyclic 0.85–0.95, period=50 |
| Grad accum | 2 |
| SWA | Start at 20% training, every 100 steps |
| AWQ alpha | 0.6 |
| Quantization | int6 + zstd |
| Wallclock | 600s (10 min) |
| Total params | ~25.5M |
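For reference, a minimal sketch of the int6 + zstd artifact path. This is per-row symmetric quantization only; the real pipeline also applies AWQ scaling, SWA weights, and bit-packing. Helper names and the `zstandard` package are assumptions:

```python
import torch
import zstandard  # assumes the `zstandard` pip package

def quantize_int6(w: torch.Tensor):
    """Per-row symmetric int6: codes in [-31, 31] plus one fp scale per
    row (codes stored one-per-byte here; the real artifact bit-packs)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    q = torch.clamp(torch.round(w / scale), -31, 31).to(torch.int8)
    return q, scale

def artifact_bytes(matrices):
    """Quantize every weight matrix, concatenate codes, zstd-compress."""
    blob = b"".join(quantize_int6(w)[0].cpu().numpy().tobytes()
                    for w in matrices)
    return len(zstandard.ZstdCompressor(level=19).compress(blob))
```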
| Rank | Exp | Name | Quant BPB | Raw BPB | Artifact | Under 16MB? |
|---|---|---|---|---|---|---|
| 1 | 54b | xsa-zstd-ckfix | 1.2708 | 1.2642 | 15.54 MB | Yes |
| 2 | 53b | lean-combo (v5) | 1.2720 | 1.2640 | 15.19 MB | Yes |
| 3 | community | SOTA leaderboard (1xH100) | 1.2825 | 1.2501 | 13.06 MB | Yes |
| 4 | 48b | 10blocks-depth | 1.2930 | 1.2870 | 14.59 MB | Yes |
| 5 | 42b | revive-block9 | 1.2969 | 1.2867 | 14.01 MB | Yes |
| 6 | 34b | lr-schedule-fix | 1.2990 | 1.2891 | 15.13 MB | Yes |
| 7 | 39b | swa-tuning | 1.2942 | 1.2875 | 14.55 MB | Yes |
| 8 | 30b | combo | 1.3156 | 1.2983 | 15.05 MB | Yes |
| 9 | 29b | lossweight-typemb | 1.3176 | 1.3007 | 15.75 MB | Yes |
| 10 | 27b | resid-norm | 1.3197 | 1.3000 | 15.30 MB | Yes |
| 11 | 09 | padignore-wordboost | 1.3145 | 1.2974 | 14.5 MB | Yes |
| 11 | 13 | multihead-gate-bigram | 1.3145 | 1.2974 | 14.5 MB | Yes |
| 13 | 10 | trigram-unigram | 1.3151 | 1.2956 | 15.6 MB | Yes |
| 14 | 06 | swa-awq-accum2 | 1.3161 | 1.2982 | 15.7 MB | Yes |
| 15 | 07 | tighter-swa-awq | 1.3164 | 1.2978 | 15.5 MB | Yes |
| 16 | 05 | grad-accum4 | 1.3181 | 1.3001 | 15.8 MB | Yes |
| 17 | 12 | trigram64-awq06 | 1.3222 | 1.2969 | 15.1 MB | Yes |
| 18 | 08 | ctx-freq-bias | 1.3225 | 1.3014 | 15.0 MB | Yes |
| 19 | 18 | separate-trigram64 | 1.3247 | 1.2995 | 15.0 MB | Yes |
| 20 | 11 | trigram-slim-awq07 | 1.3259 | 1.2994 | 14.6 MB | Yes |
| 21 | 15 | engram-3order | 1.3260 | 1.2995 | 14.6 MB | Yes |
| 22 | 14 | engram-multiorder | 1.3338 | 1.3056 | 15.0 MB | Yes |
| 23 | 00 | baseline-rerun | 1.3389 | 1.3166 | 14.7 MB | Yes |
| 24 | 02 | speed-bigramfp16-awq | 1.3429 | 1.3200 | 17.3 MB | No |
| 25 | 04 | no-cyclic-momentum | 1.3489 | 1.3230 | 15.8 MB | Yes |
| 26 | 16 | jepa-aux | 1.3526 | 1.3194 | 14.5 MB | Yes |
| 27 | 17 | byte-engram | 1.3527 | 1.3201 | 14.6 MB | Yes |
| 28 | 01d | xsa-only | 1.3568 | 1.3294 | 14.8 MB | Yes |
| 29 | 03 | qat-ste | 1.3684 | 1.3365 | 15.7 MB | Yes |
| — | 01b | ln-scale-only | — | — | — | Not run |
| — | 01c | ema-only | — | — | — | Not run |
| Field | Value |
|---|---|
| Folder | exp00_baseline-rerun_exp27/ |
| Based on | exp27 (modded-nanogpt reference) |
| Status | Done |
| Steps | 1,084 |
| Raw BPB | 1.3166 |
| Quant BPB | 1.3389 |
| Artifact | 14.7 MB |
Details: Control experiment. Runs the exp27 reference config on A100 with 600s wallclock, grad_accum=8, GQA 8Q/4KV heads, 25.5M params, int6+zstd quantization.
Observations: ~553ms/step. Quant gap = 1.7%. Bigram.proj has worst quantization error. Still improving at end — more steps would help.
| Field | Value |
|---|---|
| Folder | exp01b_ln-scale-only_from-exp27/ |
| Based on | exp27 |
| Status | Not run (empty log) |
Details: Isolated test of 1/√(layer+1) layer-norm damping without partial RoPE.
| Field | Value |
|---|---|
| Folder | exp01c_ema-only_from-exp27/ |
| Based on | exp27 |
| Status | Not run (empty log) |
Details: Isolated test of EMA weight averaging without other changes.
| Field | Value |
|---|---|
| Folder | exp01d_xsa-only_from-exp27/ |
| Based on | exp27 |
| Status | Done |
| Steps | 1,017 |
| Raw BPB | 1.3294 |
| Quant BPB | 1.3568 |
| Artifact | 14.8 MB |
Details: Cross-sequence attention on last 4 layers, keeping SWA and full RoPE. Slower per step (~590ms vs 553ms baseline), fewer steps completed.
Observations: XSA alone hurts — slower steps + no benefit = regression. The speed cost outweighs any representational gain.
| Field | Value |
|---|---|
| Folder | exp02_speed-bigramfp16-awq_from-exp00/ |
| Based on | exp00 |
| Status | Done |
| Steps | 1,080 |
| Raw BPB | 1.3200 |
| Quant BPB | 1.3429 |
| Artifact | 17.3 MB |
Details: Three changes combined: (1) muon_backend_steps=4 (was 5), val_loss_every=200 for speed; (2) bigram.proj kept in FP16 instead of quantized; (3) per-category AWQ alphas (bigram=0.75, attn=0.6, mlp=0.5) with 16 calibration batches.
Observations: FP16 bigram blows artifact to 17.3MB — over budget. Per-category AWQ alpha interesting but gains washed out.
| Field | Value |
|---|---|
| Folder | exp03_qat-ste_from-exp00/ |
| Based on | exp00 |
| Status | Done |
| Steps | 1,035 |
| Raw BPB | 1.3365 |
| Quant BPB | 1.3684 |
| Artifact | 15.7 MB |
Details: Fake-quantize weights during warmdown phase using Straight-Through Estimator.
Observations: Worst result of the batch. QAT-STE adds overhead (fewer steps) and destabilizes training. SWA+AWQ already handles quant gap reasonably.
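For context, a minimal sketch of what exp03 tested: fake-quantize in the forward pass, identity gradient in the backward pass. The int6 range and names are assumptions:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Straight-through estimator: quantize-dequantize weights in
    forward; let gradients flow through as if it were the identity."""
    @staticmethod
    def forward(ctx, w, scale):
        return torch.clamp(torch.round(w / scale), -31, 31) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # identity gradient w.r.t. w

# applied only during warmdown: w_q = FakeQuantSTE.apply(w, scale)
```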
| Field | Value |
|---|---|
| Folder | exp04_no-cyclic-momentum_from-exp00/ |
| Based on | exp00 |
| Status | Done |
| Steps | 1,084 |
| Raw BPB | 1.3230 |
| Quant BPB | 1.3489 |
| Artifact | 15.8 MB |
Details: Fixed momentum=0.95 instead of cycling 0.85-0.95.
Observations: Marginal degradation. Cyclic momentum is slightly helpful — oscillation may act as implicit regularization.
| Field | Value |
|---|---|
| Folder | exp05_grad-accum4_from-exp00/ |
| Based on | exp00 |
| Status | Done |
| Steps | 1,206 |
| Raw BPB | 1.3001 |
| Quant BPB | 1.3181 |
| Artifact | 15.8 MB |
Details: Reduced grad_accum 8→4 so more optimizer updates fit within the wallclock. Effective batch halved from 524K→262K tokens.
Observations: Major breakthrough — first sub-1.32 quant bpb. +122 more steps than baseline.
| Field | Value |
|---|---|
| Folder | exp06_swa-awq-accum2_from-exp05/ |
| Based on | exp05 |
| Status | Done |
| Steps | 1,219 |
| Raw BPB | 1.2982 |
| Quant BPB | 1.3161 |
| Artifact | 15.7 MB |
Details: Pushed accum 4→2. SWA_EVERY=100, AWQ_ALPHA=0.6, WARMUP_STEPS=80.
Observations: Continued improvement. Raw bpb breaks below 1.30 for first time.
| Field | Value |
|---|---|
| Folder | exp07_tighter-swa-awq_from-exp06/ |
| Based on | exp06 |
| Status | Done |
| Steps | 1,220 |
| Raw BPB | 1.2978 |
| Quant BPB | 1.3164 |
| Artifact | 15.5 MB |
Details: SWA_EVERY=150, AWQ_ALPHA=0.7.
Observations: Reduced artifact ~200KB vs exp06. Quant bpb nearly identical. Sweet spot is SWA_EVERY=100.
| Field | Value |
|---|---|
| Folder | exp08_ctx-freq-bias_from-exp05/ |
| Based on | exp05 |
| Status | Done |
| Steps | 1,196 |
| Raw BPB | 1.3014 |
| Quant BPB | 1.3225 |
| Artifact | 15.0 MB |
Details: Learned scalar bias ctx_freq_bias * log(1 + count_in_context) on logits. Exploits 77.7% token burstiness. +1 parameter.
Observations: Small improvement over exp05 but redundant with what attention learns. Smallest artifact at 15.0MB.
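A sketch of the bias, assuming it is added to each token id's logit from its running count in the context (loop form for clarity; `ctx_freq_bias` is the single learned scalar):

```python
import torch

def add_context_frequency_bias(logits, tokens, ctx_freq_bias):
    """logits: (T, V); tokens: (T,). At position t, token v's logit
    gains ctx_freq_bias * log(1 + count of v in tokens[:t])."""
    T, V = logits.shape
    counts = torch.zeros(V, device=tokens.device)
    for t in range(T):
        logits[t] = logits[t] + ctx_freq_bias * torch.log1p(counts)
        counts[tokens[t]] += 1
    return logits
```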
| Field | Value |
|---|---|
| Folder | exp09_padignore-wordboost_from-exp06/ |
| Based on | exp06 |
| Status | Done |
| Steps | 1,203 |
| Raw BPB | 1.2974 |
| Quant BPB | 1.3145 |
| Artifact | 14.5 MB |
Details: (1) ignore_index=0 — skip pad tokens that waste gradient; (2) learned word_start_boost scalar that scales bigram at word boundaries (▁ tokens). +1 parameter.
Observations: Best result. Pad-ignore removes ~5-10% wasted compute. Word-start boost helps bigram focus on hardest prediction (word-initial at 5.1 bpb vs 2.9 bpb for repeats).
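Both changes in sketch form (pad id 0 per the config; how the word-start mask is built and where the boost multiplies are assumptions from the log):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0

def masked_loss(logits, targets):
    # (1) ignore_index drops pad positions from the loss entirely,
    # so no gradient is spent on padding
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1), ignore_index=PAD_ID)

def boosted_bigram(bigram_logits, is_word_start, word_start_boost):
    # (2) one learned scalar rescales the bigram contribution at
    # word-boundary (▁) positions; is_word_start: (B, T) bool
    scale = torch.where(is_word_start, word_start_boost,
                        torch.ones_like(word_start_boost))
    return bigram_logits * scale.unsqueeze(-1)
```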
| Field | Value |
|---|---|
| Folder | exp10_trigram-unigram_from-exp09/ |
| Based on | exp09 |
| Status | Done |
| Steps | 1,199 |
| Raw BPB | 1.2956 |
| Quant BPB | 1.3151 |
| Artifact | 15.6 MB |
Details: Trigram hash table (10240×128) + learned unigram_bias * log(freq). ~1.3M extra params.
Observations: Best raw bpb (1.2956) but quant bpb slightly worse than exp09. Extra params compress less efficiently.
| Field | Value |
|---|---|
| Folder | exp11_trigram-slim-awq07_from-exp10/ |
| Based on | exp10 |
| Status | Done |
| Steps | 1,198 |
| Raw BPB | 1.2994 |
| Quant BPB | 1.3259 |
| Artifact | 14.6 MB |
Details: Trigram embed dim 128→48 with tri_proj (48→512). AWQ alpha 0.6→0.7.
Observations: Smaller artifact but significantly worse quant bpb. dim=48 too small, AWQ 0.7 too aggressive. Double regression.
| Field | Value |
|---|---|
| Folder | exp12_trigram64-awq06_from-exp10/ |
| Based on | exp10 |
| Status | Done |
| Steps | 1,195 |
| Raw BPB | 1.2969 |
| Quant BPB | 1.3222 |
| Artifact | 15.1 MB |
Details: Middle-ground trigram dim=64, AWQ alpha=0.6.
Observations: Better than exp11 but worse than exp09. Hash collisions in 10240-entry table too frequent for trigrams (1024³ combinations).
| Field | Value |
|---|---|
| Folder | exp13_multihead-gate-bigram_from-exp09/ |
| Based on | exp09 |
| Status | Done |
| Steps | 1,201 |
| Raw BPB | 1.2974 |
| Quant BPB | 1.3145 |
| Artifact | 14.5 MB |
Details: K=2 independent hash functions (averaged, reduces collisions). Context gate: sigmoid(gate_proj(tok_emb) + gate_bias). +513 extra params.
Observations: Tied with exp09 for best quant bpb. Multi-head reduces collision 10%→1%, but impact neutralized. Base for exp14-26.
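A sketch of the exp13 lookup; the hash multipliers are hypothetical, but the gate (512→1 linear with bias, +513 params) matches the log:

```python
import torch
import torch.nn as nn

TABLE, DIM = 10240, 128
table = nn.Embedding(TABLE, DIM)
gate_proj = nn.Linear(512, 1)              # 512 weights + 1 bias = 513
HASHES = [(1000003, 8191), (999983, 131)]  # hypothetical multipliers

def bigram_lookup(prev_ids, cur_ids, tok_emb):
    """Average K=2 independent hash lookups (collisions ~10% -> ~1%),
    then gate by the current token's embedding."""
    emb = sum(table((prev_ids * a + cur_ids * b) % TABLE)
              for a, b in HASHES) / len(HASHES)
    gate = torch.sigmoid(gate_proj(tok_emb))   # (B, T, 1)
    return gate * emb
```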
| Field | Value |
|---|---|
| Folder | exp14_engram-multiorder_from-exp13/ |
| Based on | exp13 |
| Status | Done |
| Steps | 1,196 |
| Raw BPB | 1.3056 |
| Quant BPB | 1.3338 |
| Artifact | 15.0 MB |
Details: 1-5gram with 2 hash heads each = 10 lookups/position. Shared 10240×128 table. 0 new params.
Observations: Significant regression. Shared embeddings across n-gram orders = destructive interference.
| Field | Value |
|---|---|
| Folder | exp15_engram-3order_from-exp14/ |
| Based on | exp14 |
| Status | Done |
| Steps | 1,196 |
| Raw BPB | 1.2995 |
| Quant BPB | 1.3260 |
| Artifact | 14.6 MB |
Details: 1-3gram × 2 heads = 6 lookups. Orthogonal subspace: unigram [0:42], bigram [42:84], trigram [84:128].
Observations: Better than exp14 but worse than exp09. Each subspace too small (~42 dims).
| Field | Value |
|---|---|
| Folder | exp16_jepa-aux_from-exp15/ |
| Based on | exp15 |
| Status | Done |
| Steps | 1,146 |
| Raw BPB | 1.3194 |
| Quant BPB | 1.3526 |
| Artifact | 14.5 MB |
Details: Predictor MLP (512→128) predicts next position's engram embedding. MSE loss λ=0.1. ~65K extra training params.
Observations: Major regression. Not true JEPA — uses fixed hash targets. MSE against hash embeddings provides adversarial gradient. ~524ms/step (slower).
| Field | Value |
|---|---|
| Folder | exp17_byte-engram_from-exp16/ |
| Based on | exp16 |
| Status | Done |
| Steps | 1,146 |
| Raw BPB | 1.3201 |
| Quant BPB | 1.3527 |
| Artifact | 14.6 MB |
Details: ByteBoundaryEmbedding — cross-token byte bigram/trigram from 4-byte window. ~49K extra params.
Observations: No improvement over exp16. Base (exp15+JEPA) too weak.
| Field | Value |
|---|---|
| Folder | exp18_separate-trigram64_from-exp13/ |
| Based on | exp13 |
| Status | Done |
| Steps | 1,194 |
| Raw BPB | 1.2995 |
| Quant BPB | 1.3247 |
| Artifact | 15.0 MB |
Details: Separate 64-dim trigram table (10240×64) + 2-head hashing + projection, on top of exp13's 128-dim bigram. ~688K extra params.
Observations: Marginal raw improvement but worse quant bpb (1.3247 vs 1.3145). Extra params don't compress well.
exp27 (reference)
├── exp00 (baseline rerun) ──────────────────────── quant 1.3389
│ ├── exp01b (LN scale ablation) not run
│ ├── exp01c (EMA ablation) not run
│ ├── exp01d (XSA ablation) quant 1.3568 ✗ worse
│ ├── exp02 (speed+FP16 bigram) quant 1.3429 ✗ over 16MB
│ ├── exp03 (QAT-STE) quant 1.3684 ✗ worst
│ ├── exp04 (no cyclic momentum) quant 1.3489 ✗ worse
│ ├── exp05 (grad_accum=4) ────────────────────── quant 1.3181 ✓ breakthrough
│ │ ├── exp08 (context freq bias) quant 1.3225
│ │ └── exp06 (SWA+AWQ+accum=2) ───────────── quant 1.3161 ✓ improved
│ │ ├── exp07 (tighter SWA/AWQ) quant 1.3164
│ │ └── exp09 (pad ignore+word boost) ──── quant 1.3145 ⭐ BEST
│ │ ├── exp10 (trigram+unigram) ───── quant 1.3151
│ │ │ ├── exp11 (trigram slim) quant 1.3259 ✗
│ │ │ └── exp12 (trigram 64d) quant 1.3222
│ │ └── exp13 (multihead gate bigram) quant 1.3145 ⭐ TIED BEST
│ │ ├── exp14 (engram 1-5gram) quant 1.3338 ✗
│ │ │ └── exp15 (engram 3order) quant 1.3260
│ │ │ └── exp16 (JEPA aux) quant 1.3526 ✗✗
│ │ │ └── exp17 (byte engram) quant 1.3527 ✗✗
│ │ ├── exp18 (separate trigram) quant 1.3247
│ │ └── exp19-26 (phase 3b) ──── in progress
- Step count is king: Reducing grad_accum (2× more optimizer updates) > any architectural change.
- Don't fight the quantizer: QAT-STE and FP16 bigram tried to address quantization directly — both failed. Better convergence naturally produces better-quantizing weights.
- N-gram tables have diminishing returns: Bigram valuable (+0.02 bpb). Trigram marginal. Higher-order actively hurt.
- Hash collision reduction matters less than expected: Multi-head hashing (exp13) reduces collisions 10%→1%, but quant bpb unchanged from exp09.
- Auxiliary losses are dangerous: JEPA (exp16) caused biggest single regression.
- Simple targeted fixes beat complex architectures: Pad-ignore (0 params) + word-start boost (1 param) > 688K trigram params.
- Quant gap is the real metric: exp10 had best raw bpb but exp09 had best quant bpb.
What worked:
- Cutting `grad_accum` 8→4→2 = 2× more steps = biggest single win (exp05→06)
- `ignore_index=0` to skip pad tokens in loss = free improvement (exp09)
- `word_start_boost` scalar for bigram at `▁` boundaries = +1 param, measurable gain (exp09)
- SWA every 100 steps + AWQ alpha=0.6 = good quant compression without hurting quality
What didn't work:
- QAT-STE, FP16 bigram, XSA, fixed momentum (exp01d–04) = all worse than baseline
- Trigram/n-gram tables (exp10–12, 14–15, 18) = raw bpb improves but quant bpb regresses
- JEPA auxiliary loss with fixed hash targets (exp16–17) = worst regression of all
One-liner: More optimizer steps + smarter loss masking > fancy architecture, every time.
- Hardware: 1×H100, 600s wallclock
- Base: exp09 (chosen over exp13 — multi-head bigram added complexity without improving quant bpb)
- Best result: exp30b — quant bpb 1.3156, artifact 15.05 MB
| Rank | Exp | Name | Base | Quant BPB | Raw BPB | Size | Under 16MB? |
|---|---|---|---|---|---|---|---|
| 1 | 30b | combo (resid-norm + loss-wt + type-emb) | exp09 | 1.3156 | 1.2983 | 15.05 MB | Yes |
| 2 | 33b | alternating RoPE + NTK | exp30b | 1.3145 | 1.2971 | 14.94 MB | Yes |
| 3 | 29b | loss-weight + token-type-emb | exp09 | 1.3176 | 1.3005 | 15.75 MB | Yes |
| 4 | 27b | resid-norm | exp09 | 1.3197 | ~1.300 | ~15.3 MB | Yes |
| 5 | 31b | RoPE base 50k | exp30b | 1.3206 | 1.2953 | 15.01 MB | Yes |
| 6 | 09 | padignore-wordboost (baseline) | exp06 | 1.3282 | 1.2974 | 14.5 MB | Yes |
| 7 | 32b | aux word-boundary loss | exp30b | 1.3424 | 1.3153 | 15.68 MB | Yes |
| — | 28b | perlayer quant | exp09 | N/A (analysis only) | — | 16.24 MB | No |
| Field | Value |
|---|---|
| Folder | exp27b_resid-norm_from-exp09/ |
| Based on | exp09 |
| Steps | ~1200 |
| Raw BPB | ~1.300 |
| Quant BPB | 1.3197 |
| Artifact | ~15.3 MB |
| Extra params | 0 |
Change: Parameterless F.rms_norm after each decoder skip-connection.
Observations: Residual norm growth (19.7→89.5) was root cause of poor quantization in later layers. RMSNorm keeps norms bounded → flatter weight distributions → lower quant error.
Verdict: ✅ Validated.
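The change is one line per skip connection; a sketch assuming PyTorch ≥ 2.4 for `F.rms_norm` (parameterless when no weight is given):

```python
import torch.nn.functional as F

def block_forward(x, attn, mlp):
    # normalize after each skip connection so residual norms stay
    # bounded (the 19.7 -> 89.5 growth was the quantization root cause)
    x = F.rms_norm(x + attn(x), normalized_shape=(x.size(-1),))
    x = F.rms_norm(x + mlp(x), normalized_shape=(x.size(-1),))
    return x
```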
| Field | Value |
|---|---|
| Folder | exp28b_perlayer-quant_from-exp09/ |
| Based on | exp09 |
| Status | Analysis only |
| MSE improvement | -16.13% |
| Artifact | 16.24 MB |
Change: Higher bitwidths for boundary layers (0,1,8,9): int7 attn, int6 MLP.
Observations: Cuts boundary quant error in half but blows 16MB budget.
Verdict: ❌ Not viable. Size kills it.
| Field | Value |
|---|---|
| Folder | exp29b_lossweight-typemb_from-exp09/ |
| Based on | exp09 |
| Steps | 1207 |
| Raw BPB | 1.3005 |
| Quant BPB | 1.3176 |
| Artifact | 15.75 MB |
| Extra params | +8,305 |
Change: (1) Per-token loss weighting: 1.5x word-start, 0.8x easy suffixes. (2) 7-category token-type embedding: 7×16 + 16×512 proj + learned scale. Zero-initialized.
Observations: Strong win. Loss weighting redistributes gradient to high-opportunity tokens. Token-type gives explicit structural signal.
Verdict: ✅ Validated.
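Both pieces in sketch form; the masks are assumed to come from the tokenizer, and the type-embedding shapes reproduce the +8,305 parameter count:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_ce(logits, targets, is_word_start, is_easy_suffix):
    """Per-token weights: 1.5x word-start, 0.8x easy suffixes;
    pads (id 0) are still ignored."""
    per_tok = F.cross_entropy(logits.view(-1, logits.size(-1)),
                              targets.view(-1), ignore_index=0,
                              reduction="none")
    w = torch.ones_like(per_tok)
    w[is_word_start.view(-1)] = 1.5
    w[is_easy_suffix.view(-1)] = 0.8
    valid = targets.view(-1) != 0
    return (per_tok * w)[valid].mean()

class TokenTypeEmbed(nn.Module):
    """7x16 table + 16->512 projection + learned scale
    = 112 + 8,192 + 1 = 8,305 params; scale starts at zero."""
    def __init__(self, dim=512):
        super().__init__()
        self.table = nn.Embedding(7, 16)
        self.proj = nn.Linear(16, dim, bias=False)
        self.scale = nn.Parameter(torch.zeros(()))

    def forward(self, type_ids):
        return self.scale * self.proj(self.table(type_ids))
```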
| Field | Value |
|---|---|
| Folder | exp30b_combo_from-exp09/ |
| Based on | exp09 |
| Steps | 1200 |
| Raw BPB | 1.2983 |
| Quant BPB | 1.3156 |
| Artifact | 15.05 MB |
Change: All three validated improvements combined, each togglable via env vars.
Observations: Best quant bpb. Gains sub-additive (expected -0.019, got -0.0126). Quant gap reduced 0.023→0.017.
Verdict: ✅ Phase 3b-Part1 SOTA. Base for all subsequent experiments.
| Field | Value |
|---|---|
| Folder | exp31b_rope-headspec_from-exp30b/ |
| Based on | exp30b |
| Steps | 1197 |
| Raw BPB | 1.2953 |
| Quant BPB | 1.3206 |
| Artifact | 15.01 MB |
Change: RoPE base 10000→50000.
Observations: Best raw bpb ever but quant gap widens to 0.0253 vs 0.0173 for exp30b. Net NEGATIVE after quantization.
Verdict: ❌ Raw gain eaten by quant degradation.
| Field | Value |
|---|---|
| Folder | exp32b_aux-boundary_from-exp30b/ |
| Based on | exp30b |
| Steps | 1196 |
| Raw BPB | 1.3153 |
| Quant BPB | 1.3424 |
| Artifact | 15.68 MB |
Change: Auxiliary head (512→1, sigmoid, binary CE, λ=0.15) predicting word-start.
Observations: Significant regression (+0.0268). Aux loss diverts gradient. Token-type already provides structural signal.
Verdict: ❌ Auxiliary losses counterproductive.
| Field | Value |
|---|---|
| Folder | exp33b_swa-attn-ntkrope_from-exp30b/ |
| Based on | exp30b |
| Steps | 1200 |
| Raw BPB | 1.2971 |
| Quant BPB | 1.3145 |
| Artifact | 14.94 MB |
Change: Even blocks rope_base=50000, odd blocks rope_base=1000. NTK scaling.
Observations: Marginal improvement (-0.0011). Positional loss curve STILL flat after 256 tokens. Improvement is likely noise.
Verdict: ⚠️ Marginal; improvement likely noise.
- Word-start tokens: 27.7% of tokens but 44.2% of total loss. Mean loss 3.72 vs 1.28 for "other".
- Softcap not saturating: Max |logit| = 28.2 (94% of cap=30).
- Word-start boost learned DOWN to 0.16: Model suppresses bigram at word boundaries — hash collisions may be confusing.
- Token-type embedding actively used: Scale grew 0.05→0.53. Punctuation/whitespace highest norms.
- Top-1 accuracy: 46.6%, Top-5: 68.9%
- First 16 tokens: Loss 2.96 vs 2.27 for rest (+0.70 penalty). Cold-start tokens disproportionately hard.
- Positional loss flat after 256: 2.53→2.33→flat ~2.2-2.4.
exp09 (pad ignore + word-start boost) ──── quant 1.3282
├── exp27b (resid-norm) ──────────────── quant 1.3197 ✓
├── exp28b (perlayer quant) ──────────── N/A (over 16MB) ✗
├── exp29b (loss-weight + type-emb) ──── quant 1.3176 ✓
└── exp30b (combo: 27b+29b) ─────────── quant 1.3156 ⭐ BEST
├── exp31b (RoPE 50k) ──────────── quant 1.3206 ✗ (quant gap)
├── exp32b (aux boundary loss) ──── quant 1.3424 ✗✗ (gradient waste)
└── exp33b (alternating RoPE+NTK) ─ quant 1.3145 ⚠️ (marginal, noisy)
- Residual norm control is high leverage: RMSNorm after skip connections attacks root cause (norm growth) not symptom.
- Stacking orthogonal improvements works: exp30b combined 3 independent improvements for sub-additive but substantial gain.
- Auxiliary losses fatal in compute-starved regimes: ~1200 steps = every gradient must reduce CE.
- RoPE base changes hurt quantization: Higher base → harder-to-compress weight distributions.
- Long context is a dead end: Positional loss flat after 256 tokens regardless of RoPE.
- Size budget matters: Per-layer quant delivers 16% MSE reduction but can't fit in 16MB.
- Hardware: 1×H100, 600s wallclock
- Base: exp30b with LR schedule fix
- Critical discovery: ITERATIONS=20000 meant warmdown NEVER fired; fixing it to 1300 gave -0.0166 bpb.
- Best result: exp48b — quant bpb 1.2930
All experiments include the LR schedule fix (ITERATIONS=1300, WARMDOWN_ITERS=400).
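The schedule in sketch form (the warmup value is taken from the exp06 config; the bug was that `iterations` was 20000 while the wallclock allowed ~1200 steps, so the decay branch was unreachable):

```python
def lr_scale(step, iterations=1300, warmup=80, warmdown=400):
    """Trapezoidal LR multiplier (ITERATIONS=1300, WARMDOWN_ITERS=400)."""
    if step < warmup:
        return step / warmup                             # linear warmup
    if step > iterations - warmdown:
        return max(0.0, (iterations - step) / warmdown)  # linear decay
    return 1.0                                           # full-LR plateau
```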
| Rank | Exp | Name | Base | Quant BPB | Raw BPB | Size | Steps | Verdict |
|---|---|---|---|---|---|---|---|---|
| 1 | 48b | 10blocks-depth | 42b | 1.2930 | 1.2824 | 15.22 MB | 1198 | ✅ Best |
| 2 | 45b | awq-alpha07 | 42b | 1.2897* | — | 14.01 MB | — | ✅ Post-train only |
| 3 | 42b | revive-block9 | 34b | 1.2969 | 1.2867 | 14.01 MB | 1196 | ✅ Layer sharing |
| 4 | 39b | swa-tuning | 34b | 1.2969 | 1.2867 | — | ~1196 | ≈ Tie with 42b |
| 5 | 34b | lr-schedule-fix | 30b | 1.2990 | 1.2891 | 15.13 MB | 1186 | ✅ LR fix breakthrough |
| 6 | 46b | full-mha | 42b | 1.2979 | 1.2896 | 15.59 MB | 1164 | ❌ 8KV not worth it |
| 7 | 43b | boundary-boost | 42b | 1.2993 | — | 14.01 MB | ~1196 | ❌ Too sparse |
| 8 | 47b | warmdown200 | 42b | 1.3081 | 1.2983 | 14.07 MB | 1200 | ❌ Too late warmdown |
| 9 | 37b | fused-cap | 34b | 1.3159 | 1.3014 | 15.17 MB | 1197 | ❌ Cap hurts |
| 10 | 35b | focal-loss | 30b | 1.3201 | — | 15.20 MB | 1196 | ❌ γ=2 too aggressive |
| 11 | 36b | cappedact-labelsmooth | 30b | 1.3248 | — | 14.50 MB | ~1180 | ❌ Both hurt |
| 12 | 44b | seqlen-curriculum | 42b | ~1.32 | — | — | ~1000 | ❌ Failed |
| 13 | 38b | speed-opt | 34b | — | — | — | — | ❌ Failed |
*exp45b is post-training AWQ alpha tuning, not a retrain.
- LR schedule was completely broken (exp34b): ITERATIONS=20000 with 600s wallclock meant warmdown NEVER fired. Fixing to ITERATIONS=1300 gave -0.0166 quant bpb — single biggest improvement.
- Layer sharing revives dead blocks (exp42b): Block 9 dead at 6.1% effective rank. Sharing block 3 at position 9 revived it to 10.3%.
- More depth beats more width (exp48b vs exp46b): 10th unique block (-0.0039 bpb) > 8 KV heads (+0.001 bpb).
- Auxiliary losses still fatal (exp35b, exp43b): Focal loss (γ=2), boundary boost — both hurt.
- Activation capping hurts training (exp36b, exp37b): Warmdown already produces smooth-enough weights.
- AWQ alpha is undertested (exp45b): Sweeping alpha 0.6→0.7 gave -0.007 bpb for free.
- Warmdown=400 is optimal (exp47b): warmdown=200 too late — only 2 SWA checkpoints.
- Calibration improved dramatically: With warmdown, near-perfect calibration (all bins within ±0.003 gap).
- Word-start tokens: 25.5% of tokens but 42.5% of total loss. Mean loss 3.61 vs 1.20. Top-1 accuracy only 24.7%.
- Top confusion pairs: `▁and→,` (299), `▁the→▁a` (263), `▁and→.` (200). Function-word disambiguation is the core problem.
- Positional loss flat after 256: 2.47→2.06→flat ~2.1-2.2.
- MLP activation outliers: Block 1 max=1314, 10-15% of activations >4.0. Root cause of 6.86% MLP quant error.
- Calibration near-perfect: All bins within ±0.003.
- Block 3 (shared) under stress: Effective rank 57.9% — lowest of all blocks.
- 728 KB headroom: Tighter than exp42b's 2.09 MB.
- Sentence boundary gap: 0.78 bpb (after-boundary 2.93 vs normal 2.15).
- Hardware: 1×H100, 600s wallclock
- Key insight: Removing features improved results
- Best result: exp54b — quant bpb 1.2708 (1×H100 SOTA)
| Field | Value |
|---|---|
| Quant BPB | 1.2720 |
| Raw BPB | 1.2640 |
| Steps | 1218 |
| ms/step | 492 |
| Changes | Stripped token-type + loss weighting, kept resid-norm. max-autotune-no-cudagraphs. VAL_LOSS_EVERY=800. WARMUP=20. |
| Field | Value |
|---|---|
| Quant BPB | 1.2708 |
| Raw BPB | 1.2642 |
| Steps | 1235 |
| ms/step | 486 |
| Changes | XSA on last 2 decoder layers. Fixed block 0 c_k outlier (fp16 keep). Reverted to zstd. |
| Field | Value |
|---|---|
| Quant BPB | 1.2717 (marginal regression) |
| Raw BPB | 1.2648 |
| Steps | 1183 |
| ms/step | 507 |
| Changes | Learned xsa_alpha = sigmoid(param) per layer on ALL 10 blocks. |
Finding: Model learned alpha 0.75-0.99 on ALL layers — XSA universally wanted. But 507ms vs 486ms = 52 fewer steps, erasing benefit on 1×H100.
| Field | Value |
|---|---|
| Status | Killed early — no speed improvement |
| ms/step | 508 (same as exp55b) |
Finding: GQA head expansion (repeat_interleave 4→8) is the bottleneck, not the XSA math.
| Field | Value |
|---|---|
| Quant BPB | 1.2825 |
| Raw BPB | 1.2501 |
| Steps | 1479 |
| ms/step | 406 |
| Key difference | Faster per-step (partial RoPE, no resid-norm) but 5× worse quant gap (0.032 vs 0.007). |
| Config | Steps | ms/step | Quant val_bpb |
|---|---|---|---|
| exp54b (no norm) | 1235 | 486 | 1.2708 |
| exp58b (post-addition norm) | 1216 | 493 | 1.2741 |
| exp59b (pre-skip norm) | ~1216 | ~493 | ~1.274 |
Conclusion: With warmdown active, resid-norm is redundant. The 7ms/step overhead costs more than the quant gap benefit.
| Field | Value |
|---|---|
| Quant BPB | 1.2723 |
| Changes | exp54b + vanilla TTT + named checkpoint save + logging fixes. No architectural changes. |
| Exp | What | Why it failed |
|---|---|---|
| 35b | Focal loss (gamma=2) | Too aggressive — suppressed easy token gradients |
| 36b | Capped act + label smooth | Both hurt independently and together |
| 37b | Fused cap (no label smooth) | Cap=4.0 hurt raw quality more than helped quant |
| 43b | Boundary loss boost | Too sparse (2.5% of positions) |
| 44b | Seq-len curriculum | Speed regression |
| 46b | Full MHA (8 KV heads) | Extra params but slower, no bpb improvement |
| 55b | Scaled XSA all layers | 20ms/step overhead costs 52 steps |
| 56b | Fast cosine XSA | GQA head expansion is the bottleneck |
| 58b | Resid-norm re-enabled | 7ms/step → 19 fewer steps, redundant with warmdown |
| 59b | Pre-norm skip | Same overhead, no quality difference |
- LR warmdown is critical — Biggest single improvement (0.017 bpb).
- Simpler is better — Stripping token-type and loss weighting HELPED.
- Steps > features — Every ms/step matters. Features that add compute must justify their cost.
- Resid-norm is redundant with warmdown — Weights already smooth.
- XSA last 2 is the sweet spot — Model wants XSA everywhere but overhead makes all-layer too expensive on 1×H100.
- zstd > LZMA — Better for structured quantized weights (15.19 MB vs 15.37 MB).
- torch.compile mode matters — `max-autotune-no-cudagraphs` gives kernel autotuning without tensor overwrite issues.
TTT requires matching the torch.compile context used during training for correct inference results.
exp09 (pad-ignore + word-start boost, quant 1.3282) ← PHASE 3a BEST
│
├── exp27b [✅ POSITIVE] resid-norm (quant 1.3197, Δ-0.0085)
├── exp28b [❌ NEGATIVE] perlayer quant (over 16MB budget)
├── exp29b [✅ POSITIVE] loss-weight + token-type (quant 1.3176, Δ-0.0106)
│
└── exp30b [✅ POSITIVE] combo: resid-norm + loss-weight + token-type (quant 1.3156, Δ-0.0126)
│
├── exp31b [❌ NEGATIVE] RoPE 50k (better raw 1.2953 but worse quant 1.3206)
├── exp32b [❌ NEGATIVE] aux boundary loss (quant 1.3424)
├── exp33b [⚪ NEUTRAL] alternating RoPE + NTK (quant 1.3145, marginal)
│
└── exp34b [✅✅ MAJOR] LR schedule fix (quant 1.2990, Δ-0.0166 — biggest single win)
│
├── exp35b [❌ NEGATIVE] focal loss γ=2 (quant 1.3201)
├── exp36b [❌ NEGATIVE] capped act + label smooth (quant 1.3472)
├── exp37b [❌ NEGATIVE] fused cap only (quant 1.3159)
├── exp38b [⚪ NEUTRAL] speed opt (quant 1.3002)
├── exp39b [✅ POSITIVE] SWA tuning (quant 1.2985)
│
└── exp42b [✅ POSITIVE] layer sharing block 3→pos 9 (quant 1.2969)
│
├── exp43b [⚪ NEUTRAL] boundary loss boost (quant 1.3003)
├── exp44b [❌ NEGATIVE] seq-len curriculum (failed)
├── exp45b [⚪ NEUTRAL] AWQ alpha=0.7 (quant 1.3033)
├── exp46b [⚪ NEUTRAL] full MHA 8 KV heads (quant 1.2979)
├── exp47b [❌ NEGATIVE] warmdown=200 (quant 1.3081)
│
└── exp48b [✅ POSITIVE] 10th unique block (quant 1.2930)
│
├── exp49b [⚪ NOT RUN] diffusion GPT
├── exp50b [⚪ NOT RUN] byte-level JEPA
│
└── exp53b [✅ POSITIVE] strip overhead (quant 1.2720, Δ-0.0210)
│
└── exp54b [✅ POSITIVE] XSA last 2 + c_k fix (quant 1.2708) ← 1xH100 BEST
│
├── exp55b [⚪ NEUTRAL] scaled XSA all layers (quant 1.2717)
├── exp56b [❌ NEGATIVE] fast cosine XSA (no speed gain)
├── exp57b [❌ NEGATIVE] LoRA TTT (failed)
├── exp58b [❌ NEGATIVE] resid-norm ON (quant 1.2741)
├── exp59b [❌ NEGATIVE] pre-norm skip (quant 1.2743)
│
└── clean_54b [✅ POSITIVE] named save + TTT (quant 1.2723)
└── clean_54b_v2 [❌ NEGATIVE] bf16 roundtrip (destroyed quality)
Community B model (fair 1xH100): quant 1.2825 ← WE BEAT BY 0.012 bpb
| Rank | Exp | Quant BPB | Raw BPB | Steps | ms/step | Tag | Key Change |
|---|---|---|---|---|---|---|---|
| 1 | exp54b | 1.2708 | 1.264 | 1235 | 486 | ✅ | XSA last 2 + c_k fix |
| 2 | exp53b | 1.2720 | 1.264 | 1218 | 492 | ✅ | Strip overhead |
| 3 | clean_54b | 1.2723 | 1.264 | 1205 | 498 | ✅ | Named save |
| 4 | community | 1.2825 | 1.250 | 1479 | 406 | — | Their full arch |
| 5 | exp48b | 1.2930 | 1.282 | 1198 | 501 | ✅ | 10th unique block |
| 6 | exp42b | 1.2969 | 1.287 | 1201 | 502 | ✅ | Layer sharing |
| 7 | exp39b | 1.2985 | — | 1196 | 502 | ✅ | SWA tuning |
| 8 | exp34b | 1.2990 | 1.289 | 1186 | 506 | ✅✅ | LR schedule fix |
| 9 | exp38b | 1.3002 | 1.290 | 1196 | 502 | ⚪ | Speed opt |
| 10 | exp43b | 1.3003 | 1.290 | 1198 | 501 | ⚪ | Boundary boost |
| 11 | exp45b | 1.3033 | — | 1196 | 502 | ⚪ | AWQ α=0.7 |
| 12 | exp47b | 1.3081 | 1.298 | 1200 | 500 | ❌ | Warmdown=200 |
| 13 | exp33b | 1.3145 | — | — | — | ⚪ | Alt RoPE + NTK |
| 14 | exp30b | 1.3156 | 1.298 | 1200 | 500 | ✅ | Combo |
| 15 | exp37b | 1.3159 | 1.301 | 1197 | 501 | ❌ | Fused cap |
| 16 | exp29b | 1.3176 | 1.301 | 1202 | 499 | ✅ | Loss-wt + type-emb |
| 17 | exp27b | 1.3197 | ~1.300 | ~1200 | ~500 | ✅ | Resid-norm |
| 18 | exp35b | 1.3201 | — | 1196 | — | ❌ | Focal loss γ=2 |
| 19 | exp31b | 1.3206 | 1.295 | 1197 | 502 | ❌ | RoPE 50k |
| 20 | exp32b | 1.3424 | 1.315 | 1196 | 502 | ❌ | Aux boundary loss |
| 21 | exp36b | 1.3472 | 1.332 | 1135 | 529 | ❌ | Cap + label smooth |
- AWQ alpha sweep on exp54b (test alpha=0.3-0.8)
- Pruning threshold sweep (0%, 1%, 2%, 5%)
- Seed sweep (43, 44, 45). Variance: 0.003-0.005 bpb.
- Weight decay tuning (0.04→0.06)
- LR tuning (MATRIX_LR 0.025→0.030)
- EMA with decay=0.995 (replace SWA)
- Diffusion GPT (exp49b) — Hybrid masked diffusion + AR
- Byte-level JEPA (exp50b) — Raw byte model
- XSA on all layers (alpha=0.75-0.99 everywhere)
- Partial RoPE (16 dims) — community speed trick
- Late QAT — quant noise in last 15%
- Predicted quant val_bpb ~1.12-1.14
- Hardware: 1×H100 simulating 8×H100 (6000s wallclock, grad_accum=8, 786K tokens/batch)
- Base: exp54b (clean baseline)
- Best result: exp74 — sliding bpb 1.1456, artifact 15.86 MB
| Exp | Description | Pre-quant BPB | Post-quant BPB | Sliding BPB | Artifact | Steps | Status |
|---|---|---|---|---|---|---|---|
| exp74 | pRoPE+qgain+wbigram+LLR | 1.1539 | 1.1685 | 1.1456 | 15.86 MB | 6169 | Best |
| exp70 | Speed-optimized from exp69 | ~1.14 | ~1.17 | ~1.15 | ~16 MB | ~7500 | Baseline |
| exp78 | WS loss curriculum | — | — | — | — | — | Better embeddings |
| exp75 | Word pool from exp74 | — | — | — | — | — | Failed (scale→0.002) |
| exp61b | XSA all + warmdown | 1.1504 | 1.1781 | — | ~16.5 MB | ~7000 | Over budget |
| exp63 | Cascade VR + adaptive WD | 1.1377 | 1.1730 | — | 16.45 MB | ~7000 | Over budget |
exp54b (clean baseline, 1.2708 bpb)
└── exp60 (EMA, flash_attn3, 8×H100 sim)
└── exp61b (XSA all blocks, cosine warmdown → 1.1504 pre-quant)
└── exp63 (cascading V-residual, adaptive warmdown → 1.1377 pre-quant)
├── exp64 (MLP int6 quant — never ran)
├── exp65 (quant overhaul — never ran)
│ └── exp66 (MiLe loss + NoPE — failed)
│ ├── exp67 (word-start semantic attention — failed)
│ └── exp68 (next-word-start MTP — not run)
└── exp69 (better quant: mlp_proj int6, attn int5, lzma, prune 5%)
└── exp70 (speed: batched NS5, EMA/10, set_to_none → 1.15 bpb)
├── exp71 (output bias + label smooth — not run)
├── exp72 (JEPA concept loss — failed)
├── exp73 (warmdown focal + TTT WS — not run)
├── exp74 (pRoPE 16/64 + q_gain + word bigram + LLR → **1.1456**)
│ ├── exp75 (word pool injection — failed: scale→0)
│ └── exp76 (dual word attention — failed)
├── exp77 (progressive batch + seq_len curriculum)
├── exp78 (WS loss curriculum — improved embeddings)
├── exp79 (position ramp + late WS boost)
└── exp80 (best stack: pRoPE + bigram fix + pos ramp + outlier clamp)
- Partial RoPE 16/64 (exp74): 41% less quant error, better word-start attention. Frees 75% of head dims for semantic matching (see the sketch below).
- Diverse q_gain init (exp74): Heads specialized faster — sharp (>2.5) for syntax, soft (<1.5) for semantics.
- Cascading value residual (exp63→all): Shallow layers independent (α≈0), deep layers form value highway (α≈0.9).
- Better quantization (exp69): MLP proj→int6 (3.4× less error), attn→int5 (size-neutral), magnitude pruning 5%, lzma.
- Speed optimizations (exp70): Batched NS5 via bmm, EMA every 10 steps, set_to_none=True, deferred .item().
- MiLe Loss (exp66): Downweighted easy tokens before consolidation.
- JEPA concept loss (exp72): Added memory/overhead, not enough steps.
- Word pool injection (exp75): Model drove scale to 0.002 — redundant.
- Output bias (exp71): Needs ~500 steps to build momentum — too slow.
- Focal loss during training (exp35b, exp73): Always hurts easy token accuracy.
- CUDAGraphs + tied embeddings (exp66): Incompatible, caused failure.
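A sketch of partial RoPE as used in exp74: rotate only the first 16 of 64 head dims and leave the rest position-free (the half-split rotation layout and precomputed `cos`/`sin` tables are assumptions):

```python
import torch

def apply_partial_rope(q, k, cos, sin, rot_dims=16):
    """q, k: (B, H, T, 64); cos, sin: (T, rot_dims // 2).
    Only the first rot_dims of each head are rotated; the remaining
    48 dims carry position-free semantic content."""
    def rotate(x):
        r, passthrough = x[..., :rot_dims], x[..., rot_dims:]
        r1, r2 = r.chunk(2, dim=-1)
        rotated = torch.cat((r1 * cos - r2 * sin,
                             r1 * sin + r2 * cos), dim=-1)
        return torch.cat((rotated, passthrough), dim=-1)
    return rotate(q), rotate(k)
```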
- Row 78 is a universal outlier: 9/10 mlp.proj blocks have dim 78 as worst outlier (10-22× ratio). Per-row clamping ±3σ addresses this.
- Embedding uses 17% of capacity: Effective rank 87/512. Word-start tokens (325) in 44 effective dims.
- Word-start norms 12% lower: Tied embeddings structurally bias toward continuations (larger norm → higher logit).
- Deep layers form value highway: VR alphas 7-10 → 0.9+ (strong V inheritance). Layers 2-5 independent.
- Block 0 attention barely used: attn_scale=0.10. MLP-dominant layer.
- Block 4 c_q condition number 49,644: Most quantization-sensitive matrix.
- Word-start tokens: 40.8% of tokens, 5.05 bpb → 66% of total loss
- Continuation tokens: 48.3%, 1.56 bpb → 24% of total loss
- Root causes: (1) full RoPE starves semantic attention, (2) tied embeddings bias continuations, (3) uniform gradient allocation, (4) bigram token-level not word-level
- Best fix: partial RoPE (architectural, not loss manipulation)
| Feature | Community | Ours (exp74) | Gap |
|---|---|---|---|
| Partial RoPE | 16/64 ✓ | 16/64 ✓ | Closed |
| GPTQ quantization | Full Hessian ✓ | Per-row uniform | Open |
| Bigram | 3072×112 | 10240×128 (larger) | Ours bigger |
| Warmdown | 4000 iters | 1500 (premature trigger) | Open |
| Compression | LZMA-9 ✓ | LZMA ✓ | Closed |
| TTT | Dropped (negative) | Enabled | Different |
| Selective pruning | ±1 reconstruction | 5% magnitude | Different |
- Hardware: 1×H100 (simulating 8×H100)
- Philosophy shift: Understand the model first, then act
- Best result: exp85 — pre-quant 1.1517, post-quant 1.1697, artifact 15.32 MB
| Exp | Description | Pre-quant BPB | Post-quant BPB | Artifact | Status |
|---|---|---|---|---|---|
| exp85 | Community-derived stack | 1.1517 | 1.1697 | 15.32 MB | Best pre-quant |
| exp74 | pRoPE+qgain+wbigram+LLR | 1.1539 | 1.1685 | 15.86 MB | Best post-quant |
| exp87 | Fast convergence (failed) | ~1.17 | — | — | Failed |
| exp84 | Diagnostic-tuned (failed) | ~1.17 | — | — | Failed |
| exp83 | Diagnostics baseline | ~1.15 | 1.1717 | ~16 MB | Diagnostic reference |
exp70 (speed-optimized baseline)
├── exp83 (diagnostics: grad norms, VR health, bigram, block0 attention)
│ → Key finding: warmdown triggers at step 2200 (premature)
│ → Key finding: embed/matrix ratio 3.6→7.3× (misleading for Muon)
│ → Key finding: VR highway at layers 8-10, dead at 2-5
├── exp84 (diagnostic-tuned: VR_init=0.3, embed_lr=0.015)
│ → FAILED: VR alphas went negative, embed_lr change made ratio worse
│ → Lesson: VR_INIT must be 0.5, embed_lr ratio is misleading
├── exp85 (community-derived: pRoPE + x0-to-V + LN scale + clip search)
│ → **1.1517 pre-quant** (best), 15.32 MB artifact
│ → VE scale learned: block 8=0.88 (wants identity), block 9=0.08
│ → VR exploded to 3.26 at layer 6 (LN scale instability)
│ → Row 78 outlier: 4.5 (3× improved from exp70's 14.6)
├── exp86 (deep-opt: fused QKV + int8 critical + TF32)
│ → Not yet run
└── exp87 (fast convergence: embed preinit + prog unfreeze + block9 AdamW)
→ FAILED: embed preinit worse than random, prog unfreeze hurt co-adaptation
→ Lesson: don't fight orthogonal init + Muon
- Partial RoPE 16/64: Consistent across exp74 and exp85. Row 78 outlier 3× reduced.
- x0-to-V injection: Block 8 grew ve_scale 0.3→0.88 — model WANTS token identity in deep-layer values.
- Clip search quantization: Percentile-based clip per row. 25% quant error reduction. Zero training cost (see the sketch below).
- Smaller bigram 5120×64: 0.97 MB savings, artifact at 15.32 MB.
- Late warmdown min_steps=3000: Delayed trigger from 2200 to 3100.
- CASCADE_VR_INIT < 0.5: Both 0.1 and 0.3 caused negative VR alphas.
- Lowering TIED_EMBED_LR: 0.035→0.015 made ratio worse (10.4×). Muon normalizes direction differently.
- Embedding pre-init from SVD: val_loss=12.21 at step 0 (vs 6.93 random). Incompatible with orthogonal weights.
- Progressive layer unfreezing: Prevented deep-shallow co-adaptation. VR highway didn't form.
- Block 9 QKV → AdamW: Duplicate parameter issue, inconclusive.
- LN Scale 1/√(layer+1): VR alpha explosion at layers 6-7 (3.26×).
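A sketch of the per-row clip search (the percentile grid is illustrative; the log specifies only percentile-based clipping per row):

```python
import torch

def clip_search_per_row(w, bits=6,
                        percentiles=(0.95, 0.97, 0.99, 0.995, 1.0)):
    """For each row, pick the clip threshold that minimizes int-quant
    reconstruction MSE. Post-training only: zero training cost."""
    qmax = 2 ** (bits - 1) - 1
    best_err = best_clip = None
    for p in percentiles:
        clip = torch.quantile(w.abs(), p, dim=1, keepdim=True).clamp(min=1e-8)
        scale = clip / qmax
        deq = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        err = ((deq - w) ** 2).mean(dim=1, keepdim=True)
        if best_err is None:
            best_err, best_clip = err, clip
        else:
            better = err < best_err
            best_err = torch.where(better, err, best_err)
            best_clip = torch.where(better, clip, best_clip)
    return best_clip
```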
- Block 0 attention dies by step 2000 (structural, not fixable)
- Block 1 x0_mix amplifies to 1.95× (compensates for dead block 0)
- Bigram scale decays 0.26→0.10 (attention supersedes local patterns)
- Grad clip never fires (threshold 0.3, actual norms 0.05-0.17)
- Loss oscillates ±0.07 during warmdown with 1500 iters (need 3500)
| Feature | Status | Estimated Impact |
|---|---|---|
| Partial RoPE | ✅ Matched | — |
| x0-to-V (vs community VE) | ✅ Novel alternative | Similar |
| Warmdown 3500 | ✅ Matched | — |
| Clip search | ✅ Adopted | -25% quant error |
| Full Hessian GPTQ | ❌ Not implemented | ~0.010 bpb |
| VR alpha clamping | ❌ Needed | Fix VR explosion |
| LN Scale fix | ❌ Needs investigation | TBD |
| Smaller bigram | ✅ Done | -0.97 MB |
exp54b (clean baseline, quant bpb 1.2708)
│
├── exp60 (EMA, flash_attn3, 8×H100 sim) 🟡
│ └── exp61b (XSA all blocks) 🟢 Pre-quant 1.1504
│ └── exp63 (cascading V-residual) 🟢 Pre-quant 1.1377
│ │
│ ├── exp64 (MLP int6 quant) 🟡 Never ran
│ ├── exp65 (quant overhaul) 🟡 Never ran
│ │ └── exp66 (MiLe loss + NoPE) 🔴 MiLe hurt convergence
│ │ ├── exp67 (word-start semantic attention) 🔴 failed
│ │ └── exp68 (next-word-start MTP) 🟡 Never ran
│ │
│ └── exp69 (better quant) 🟢 Closed gap 0.035→0.015
│ └── exp70 (speed-optimized) 🟢 BASELINE
│ ├── exp71 (output bias) 🟡 Never ran
│ ├── exp72 (JEPA concept) 🔴 overhead, no improvement
│ ├── exp73 (warmdown focal) 🟡 Never ran
│ ├── exp74 (pRoPE + q_gain + word bigram) 🟢 BEST sliding 1.1456
│ │ ├── exp75 (word pool) 🔴 scale→0.002
│ │ └── exp76 (dual attention) 🔴 failed
│ ├── exp77old (late warmdown) 🟡
│ ├── exp77 (progressive batch) 🟡 Never ran
│ ├── exp78 (WS loss curriculum) 🟢 Best embedding quality
│ │ └── exp81 (pRoPE + WS curriculum) 🟡 failed
│ │ └── exp82 (drop layer 10) 🟡 Never ran
│ ├── exp79 (position ramp) 🔴 premise wrong
│ ├── exp80 (best stack) 🔴 bigram-after-norm backfired
│ ├── exp83 (diagnostics) 🟢 7 actionable insights
│ ├── exp84 (diagnostic-tuned) 🔴 VR negative, embed_lr worse
│ └── exp85 (community-derived) 🟢 BEST pre-quant 1.1517
│ └── exp86 (deep-opt) 🟡 Not yet run
│ └── exp87 (fast convergence) 🔴 All 3 changes hurt
| Outcome | Count | Examples |
|---|---|---|
| 🟢 Positive | 8 | exp61b, exp63, exp69, exp70, exp74, exp78, exp83, exp85 |
| 🟡 Neutral | 9 | exp60, exp64, exp68, exp71, exp73, exp77old, exp77, exp82, exp86 |
| 🔴 Negative | 10 | exp66, exp67, exp72, exp75, exp76, exp79, exp80, exp84, exp87, exp65→66 |
Success rate: 30% positive, 33% neutral, 37% negative (8/9/10 of 27)
- Base: exp70_speed-opt_from_exp69
- Goal: Faster training via the Parallel Muon optimizer
- Best result: val_bpb 1.1440 (exp70_faster_version_parallel_muon, step 7317, 1×H100)
exp70_speed-opt_from_exp69 (original, DDP, 750ms/step)
├── exp70_faster_version_parallel_muon [🟢 POSITIVE: 12% speed, same final bpb]
│ ├── exp70_faster_vram_optimized [🔴 NEGATIVE: data loading issue]
│ ├── exp70_cuda_graphs_fused [🔴 NEGATIVE: no improvement]
│ ├── exp90_copy_head [🟡 NEUTRAL: concept validated, 40ms overhead]
│ └── reverted_exp70 [🟢 POSITIVE: clean base with all fixes]
│ └── exp91_smooth_v0residual [🟡 NEUTRAL: pending validation]
| Exp | Name | step_avg | Final BPB | Quant BPB | Size | Tag |
|---|---|---|---|---|---|---|
| exp70_parallel_muon | Parallel Muon + Banks | 658ms | 1.1440 | 1.1715 | 16.3MB | 🟢 |
| exp70_vram_opt | Double-buffer loader | 636ms | — | — | — | 🔴 |
| exp70_cuda_fused | CUDA Graphs + Triton | 662ms | — (higher loss) | — | — | 🔴 |
| exp90_copy | TopicCopyHead (hybrid freq+attn) | 698ms | — (partial) | — | — | 🟡 |
| reverted_exp70 | Clean parallel muon base | 656ms | 1.1440 | 1.1715 | 16.3MB | 🟢 |
| exp91_smooth | V0 residual + label smooth | — | — (pending) | — | — | 🟡 |
- Parallel Muon gives 12% speed via reduce-scatter/all-gather overlap and bank-native batching (see the batched Newton-Schulz sketch after this list)
- Per-step convergence ~0.002-0.004 bpb worse — different torch.compile graphs, init RNG ordering
- CUDA Graphs incompatible with FA3 — not usable together
- GPTQ requires Late QAT — without QAT-adapted weights, Cholesky error cascades
- Adaptive warmdown is fragile — v1 triggers on noise, v3 never triggers on oscillating loss. Pure time-based is robust.
- Copy mechanism validated: 1.19 bpb copy advantage for repeated tokens, 1.77 bpb for word-start.
- Model self-analysis: word_start_boost=0.017 (dead), cascading VR layers 1-8 ≈ 0 (dead), K_1 kurtosis=33.8 (outlier-heavy), byte tokens 96% cosine similar (confused)
- Double-buffering needs N >= grad_accum_steps buffers — insufficient buffers cause issues
- Custom Triton kernels for elementwise ops rarely help — torch.compile already fuses them; precision differences compound
- AWQ with weight-magnitude proxy is catastrophic — must use real activation statistics from forward hooks
- Selective ±1 pruning (Code 2) > blind magnitude pruning — targets least-impactful quantized values
- Init order matters for reproducibility — nn.init.orthogonal_ consumes RNG; bank vs module ordering creates different trajectories
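For reference, the quintic Newton-Schulz orthogonalization at the heart of Muon (coefficients as published in modded-nanogpt). Bank-native batching stacks same-shaped weight matrices into one (banks, m, n) tensor so every line below runs as a batched matmul:

```python
import torch

@torch.no_grad()
def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G; works on (m, n) or (banks, m, n)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    if X.size(-2) > X.size(-1):
        X = X.mT                     # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if G.size(-2) > G.size(-1):
        X = X.mT
    return X
```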
- Hardware: 8×H100
- Base: exp70_speed-opt → exp92 (major rewrite)
- Key finding: Meta-TTT has an architecture-limited ceiling
- Best result: exp101 — legal_ttt 1.11588
exp70_speed-opt (1.153 bpb)
└── exp92_banks-asyncmuon-partrope-qat-ve [🟢 1.131 bpb — major rewrite]
└── exp93_meta-ttt-inner-outer [🟢 1.120 legal_ttt]
└── exp95_size-ttt-opt-metattt2x [🟢 1.1169 legal_ttt — SOTA at time]
├── exp96_warmdown-fix-trigram-sgdttt [🟡 ~1.135]
│ ├── exp98_metattt-randomsplit-momentum [🟡 ~1.135]
│ │ └── exp99_tripleloop-parallelres [🟡 not run]
│ └── exp97_fp8-pipeline [not run]
├── exp101_poscond-bigram-trigram [🟢 1.11588 legal_ttt — new baseline]
│ ├── exp105a_no-metattt [🟡 ablation: meta-TTT = noise]
│ ├── exp106_metasgd-crosschunk [🟡 ceiling confirmed]
│ │ ├── exp107_sam-inner [🔴 hurts]
│ │ └── exp108_sp8192-brotli [🟡 no results]
│ └── exp109_shared-blocks-softgate [🔴 decoder dead]
└── exp100_half-metattt [not tracked here]
| Exp | Name | val_bpb | int6_bpb | legal_ttt | Tag |
|---|---|---|---|---|---|
| exp92 | Banks + Async Muon + Partial RoPE + QAT + VE | ~1.131 | — | — | 🟢 |
| exp93 | Meta-TTT inner/outer FOMAML | 1.136 | — | ~1.116 | 🟢 |
| exp95 | Size-opt + meta-TTT 2× | 1.1363 | — | 1.1169 | 🟢 |
| exp96 | Warmdown fix + trigram | ~1.135 | — | — | 🟡 |
| exp98 | Random-split FOMAML + momentum LR match | ~1.135 | — | — | 🟡 |
| exp99 | Triple loop + parallel residuals | — | — | — | 🟡 |
| exp101 | Position-conditional bigram hash | 1.1352 | 1.13930 | 1.11588 | 🟢 |
| exp105a | No meta-TTT (ablation) | 1.1353 | 1.13956 | 1.11624 | 🟡 |
| exp106 | MetaSGD + cross-chunk FOMAML | 1.1377 | 1.14160 | ~1.118 | 🟡 |
| exp107 | SAM inner loop | 1.1384 | 1.1424 | 1.11898 | 🔴 |
| exp108 | SP8192 + Brotli | — | — | — | 🟡 |
| exp109 | Block sharing K=8 + SP8192 | 1.1500 | 1.1897 | — | 🔴 |
- Meta-TTT ceiling is architecture-limited: 4 experiments (exp101, 105a, 106, 107) show identical TTT delta ~0.023 bpb regardless of optimizer (SGD, MetaSGD, SAM, none). Ceiling set by bank architecture (rank × dim).
- Position-conditional bigram hashing (exp101): Zero-parameter trick — split hash space by token class (word-start vs within-word). +0.001 bpb (see the sketch after this list).
- Block sharing fails across encoder/decoder boundary (exp109): Shared decoder positions → near-zero scales. Soft gates diagnose but can't fix.
- SP8192 quant degradation 10× worse than SP1024 (exp109): Large embedding table (8192×512) poorly compressed.
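A sketch of the split-hash idea from exp101 (the multiplier is illustrative, not the exp101 constant):

```python
def pos_cond_bigram_hash(prev_ids, cur_ids, is_word_start,
                         table_size=10240):
    """Word-start bigrams hash into the upper half of the table,
    within-word bigrams into the lower half, so the two token classes
    never collide with each other. Zero extra parameters."""
    half = table_size // 2
    h = (prev_ids * 1000003 + cur_ids) % half
    return h + half * is_word_start.long()
```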
- Source: Community contributions on the parameter-golf repository
- Impact: Paradigm shift from our 1.1169 to 1.0744 bpb
| Metric | Value |
|---|---|
| val_bpb | 1.0873 |
| int6_bpb | 1.0997 |
| legal_ttt | 1.0808 |
| Hardware | 8×H100 |
Key innovations: SP8192 tokenizer + 3-layer depth recurrence (blocks 3-5, 2 extra passes) + parallel residuals + QK_GAIN_INIT=5.25. 17 virtual layers from 11 physical.
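The recurrence in sketch form (re-entry point and weight sharing per the description above; the exact control flow is an assumption):

```python
import torch.nn as nn

def forward_virtual_depth(x, blocks: nn.ModuleList,
                          loop_start=3, loop_end=5, extra_passes=2):
    """Blocks [loop_start, loop_end] run once in the main sweep plus
    `extra_passes` extra sweeps: 11 physical blocks -> 11 + 2*3 = 17
    virtual layers, with no new parameters."""
    for i, block in enumerate(blocks):
        x = block(x)
        if i == loop_end:
            for _ in range(extra_passes):
                for j in range(loop_start, loop_end + 1):
                    x = blocks[j](x)
    return x
```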
| Metric | Value |
|---|---|
| val_bpb | 1.0813 |
| int6_bpb | 1.0980 |
| legal_ttt | 1.0788 |
| 3-seed mean | 1.078825 |
Key innovations: Wider loop (3×3) + per-pass loop embeddings (3×512, zero-init) + Tap-In V6 cross-window n-gram C++ matcher + legal score-first TTT.
| Metric | Value |
|---|---|
| legal_ttt val_bpb | 1.07438 (3-seed mean) |
| val_bpb_std | 0.00034 |
| Artifact | 15,959,005 bytes (71 bytes headroom) |
| Hardware | 8×H100 80GB SXM |
| step_avg_ms | 124.68 |
Key innovations: Richer parallel residual routing — attn/MLP outputs written into both lanes at block end, decoder skips on lane0 only. CUTLASS EVT fusion for reproducible throughput.
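A rough sketch of the routing, inferred from the PR summary rather than its code:

```python
def parallel_residual_block(lane0, lane1, attn, mlp):
    """Attn and MLP read the combined lanes; their outputs are written
    into BOTH lanes at block end. Decoder skip connections (not shown)
    ride lane0 only."""
    x = lane0 + lane1
    a, m = attn(x), mlp(x)
    return lane0 + a + m, lane1 + a + m
```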
Seed results:
| Seed | val_bpb | post_ema_val_bpb | artifact_bytes | steps | ms/step |
|---|---|---|---|---|---|
| 1337 | 1.07485 | 1.08286 | 15,958,373 | 4685 | 125.53 |
| 2024 | 1.07428 | 1.08242 | 15,956,287 | 4734 | 124.25 |
| 42 | 1.07403 | 1.08212 | 15,959,005 | 4733 | 124.26 |
| Exp | Date | Description | Tag |
|---|---|---|---|
| 2026-04-10_RecurStepFiLM_PooledRetrieval | 2026-04-10 | FiLM conditioning + pooled retrieval | 🟡 |
| 2026-04-10_10L_RecurStepFiLM_PooledRetrieval | 2026-04-10 | 10L variant | 🟡 |
| 2026-04-11_ImprovedParallelResiduals copy | 2026-04-11 | Copy/variant | 🟡 |
| 2026-04-11_newSota | 2026-04-11 | Community SOTA integration | 🟢 |
| 2026-04-11_11L_RecurStep3_loopedonly | 2026-04-11 | 11L recurrence step 3, looped-only | 🟡 |
| 2026-04-11_11L_RecurStep3_loops3 | 2026-04-11 | 11L with 3 loops | 🟡 |
| 2026-04-11_11L_RecurStep_StochDepth_ProgLoop | 2026-04-11 | Stochastic depth + progressive loop | 🟡 |
| 2026-04-11_11L_RecurStep_StochDepth_ProgLoop_KVCache | 2026-04-11 | + KV cache for recurrence | 🟡 |
| 2026-04-11_11L_Block10MLPHalf_RecurStepFiLM_PooledRetrieval | 2026-04-11 | Block 10 MLP halved + FiLM | 🟡 |
| loop_in_SP8192_3LayerRecur | 2026-04-13 | Loop detection (timestep embed, re-injection, per-loop RMSNorm) | 🟡 not trained |
- Base: ImprovedParallelResiduals (1.0744 legal_ttt)
- Theme: Tied embedding bottleneck
- Result: No improvement over the community baseline
| Exp | Name | val_bpb | int6_bpb | legal_ttt | Size | Tag |
|---|---|---|---|---|---|---|
| exp110 | Per-layer quant + trigram + PARALLEL_START=7 | — | — | — | — | 🟡 |
| exp111 | LoRA TTT (rank=8) + shrunk block 10 MLP + per-layer int5 | — | — | — | — | 🟡 |
| exp112 | Gradient rescaling on weak blocks | — | — | — | — | 🔴 |
| exp113 | Drop L0 MLP + batch schedule + MTP | — | — | — | — | 🟡 |
| exp114 | embed_dim=384 decouple | 1.0950 | — | — | fits | 🔴 |
| exp115 | embed_dim=384 + drop boundary MLPs | — | — | — | — | 🟡 |
| exp116 | embed_dim=384 + no x0 pathway | — | — | — | — | 🔴 |
| exp117 | embed_dim=448 tuned | 1.0877 | 1.0982 | 1.0814 (sliding) | 16.28MB | 🔴 |
| exp118 | embed_dim=416 + parallel_start=7 + clip tuned | 1.0915 | 1.1013 | 1.0850 | 16.44MB | 🔴 |
| exp119 | Residual low-rank proj (rank=32) | — | — | — | — | 🟡 |
The dominant theme: the model uses the same weight matrix for input embeddings and output projection. With SP8192, this (8192×512) matrix dominates the parameter budget and forces boundary blocks (0 and 10) to specialize for embedding space rather than general computation.
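The tie in sketch form, to make the budget math concrete (8192 × 512 ≈ 4.2M of the ~25M parameters):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """One (vocab x dim) matrix serves as both the input embedding and
    the output projection, so it must work well in both roles."""
    def __init__(self, vocab=8192, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        self.lm_head.weight = self.embed.weight  # tied: same storage
```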
Attempted fixes:
- embed_dim=448 (exp117): Good BPB (1.0877), activates boundary blocks (+50% effective contribution). But 16.28MB — over budget.
- embed_dim=416 (exp118): Similar story at 16.44MB.
- embed_dim=384 (exp114): Fits budget but loses 655K params → BPB regression.
- Residual low-rank projection (exp119): rank-32, zero param loss — theoretically correct fix. Not run to completion.
Verdict: The bottleneck is real. embed_dim≠model_dim activates boundary blocks but any dimension-change approach costs either params (regression) or fp16 passthrough overhead (budget overrun).
- Date: 2026-03-24
- Hardware: 1×H100
- Separate exploration from the main competition track
| Metric | Value |
|---|---|
| val_bpb | 1.3525 |
| Pre-quant val_bpb | 1.3245 |
| Artifact | 15.55 MB |
| Seeds | 1 (seed 42) |
| Steps | 1099 |
| Wallclock | 600.138s |
| Base PR | #623 |
Approach: Initialize W_Q and W_K in layer 0 from bigram co-occurrence statistics via SVD:
- Build 1024×1024 co-occurrence matrix from 2M training tokens (<3s)
- Project into model_dim via random projection
- Factorize C_proj = USV^T → Q/K weights where Q·K^T ≈ co-occurrence at step 0
Combined with LeakyReLU(0.5)², cyclic momentum (0.85–0.95), SWA over warmdown.
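A sketch of the init under the recipe above (the log1p count damping and the even S^(1/2) split between Q and K are assumptions):

```python
import torch

def cooc_qk_init(tokens: torch.Tensor, vocab=1024, dim=512, head_dim=64):
    """Returns (head_dim, dim) weights for layer-0 W_Q and W_K such
    that Q @ K^T approximates the projected co-occurrence at step 0."""
    C = torch.zeros(vocab, vocab)
    C.index_put_((tokens[:-1], tokens[1:]),
                 torch.ones(tokens.numel() - 1), accumulate=True)
    C = torch.log1p(C)                        # damp raw bigram counts
    R = torch.randn(vocab, dim) / dim ** 0.5  # shared random projection
    C_proj = R.T @ C @ R                      # (dim, dim)
    U, S, Vh = torch.linalg.svd(C_proj)
    w_q = (U[:, :head_dim] * S[:head_dim].sqrt()).T
    w_k = (Vh[:head_dim, :].T * S[:head_dim].sqrt()).T
    return w_q, w_k
```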
Note: exp87 later tried SVD-based embedding pre-initialization and it regressed. The difference: co-occurrence QK init changes attention patterns, while embedding SVD changes representation space (conflicts with Muon's orthogonal constraint).
- TTT requires compile-matched inference: Standalone model loading needs the same torch.compile context as training for correct numerical results.
- SP8192 quantization sensitivity: Large embedding table (8192×512) needs GPTQ with SDClip — naive quantization degrades 10× worse than SP1024.
- CUDA Graphs limited: Incompatible with FA3 and tied embeddings in `reduce-overhead` mode.
- Step count is king: Reducing grad_accum = biggest single win.
- Don't fight the quantizer: Better convergence naturally produces better-quantizing weights.
- N-gram tables have diminishing returns: Bigram valuable, trigram+ marginal.
- Hash collision reduction matters less than expected: Model routes around collisions.
- Auxiliary losses are dangerous: JEPA caused biggest regression.
- Simple targeted fixes beat complex architectures: 0-param + 1-param > 688K params.
- Quant gap is the real metric: Optimize for post-quantization, not raw.
- Residual norm control is high leverage: RMSNorm after skip connections.
- Stacking orthogonal improvements works: Sub-additive but substantial.
- Auxiliary losses fatal in compute-starved regimes: Every gradient must reduce CE.
- RoPE base changes hurt quantization: Different landscape = harder-to-compress weights.
- Long context is a dead end: Loss flat after 256 tokens.
- Size budget matters: Check BEFORE celebrating.
- LR schedule was completely broken: Biggest single improvement (-0.0166 bpb).
- Layer sharing revives dead blocks: Block 9 dead → shared block 3 revived it.
- More depth beats more width: 10th block > 8 KV heads.
- Activation capping hurts: Warmdown already smooths weights.
- AWQ alpha undertested: Re-sweep for each new best model.
- Warmdown=400 optimal: 4 SWA checkpoints, proper decay.
- Simpler is better: Stripping features HELPED convergence.
- Resid-norm redundant with warmdown: 7ms/step overhead not worth it.
- XSA last 2 is sweet spot: Model wants everywhere but overhead too high on 1×H100.
- zstd > LZMA: Better for structured quantized weights.
- torch.compile mode matters: `max-autotune-no-cudagraphs` gives the best tradeoff.
- Partial RoPE 16/64 universally good: 41% less quant error, head specialization.
- Cascading VR creates value highway: Natural deep-layer pattern.
- Diagnostics are invaluable: exp83 discovered 7 insights informing 4 experiments.
- The model tells you what it wants: Listen to learned parameters.
- Don't fight Muon's orthogonal constraint: VR_INIT must be 0.5, embed pre-init fails.
- MiLe/focal/JEPA all fail: Loss reweighting doesn't work in limited steps.
- Architectural changes DO work: Partial RoPE, cascading VR, x0-to-V — all positive.
- Quantization improvements are free: int6 for MLP proj, clip search — zero training cost.
- Parallel Muon gives 12% speed but per-step convergence slightly worse.
- Double-buffering needs sufficient buffers for grad accumulation steps.
- Custom Triton kernels rarely help — torch.compile already fuses elementwise ops.
- AWQ needs real activation statistics — weight-magnitude proxy doesn't work.
- Init order matters for reproducibility.
- Meta-TTT ceiling is architecture-limited: TTT delta invariant at ~0.023 regardless of optimizer.
- Block sharing fails at encoder/decoder boundary: Decoder positions → dead.
- Position-conditional bigram hashing: Zero-parameter +0.001 bpb trick.
- Tied embedding bottleneck is real but hard to fix: embed_dim changes bust budget.
- Loss reweighting doesn't work in 7K steps (MiLe, focal, JEPA, position ramp — all failed)
- Architectural changes DO work (partial RoPE, cascading VR, x0-to-V, XSA-all — all positive)
- Quantization improvements are free bpb (int6 for MLP proj, clip search — zero training cost)
- Don't fight the optimizer (Muon's orthogonal constraint is a feature; VR_INIT and embed_lr must respect it)
- Diagnostics are invaluable (exp83 discovered 7 insights that informed 4 subsequent experiments)
- The model knows what it wants (block 0 attention dies = structural, VE scale at block 8 grows to 0.88 = model wants identity there)
-
Steps > everything else. Cutting grad_accum from 8→2 doubled optimizer updates in the same wallclock — biggest single win in Phase 3a. Every ms/step matters when you only get 600 seconds.
-
Fix your LR schedule before anything else. ITERATIONS=20000 with 600s wallclock meant warmdown never fired. Fixing to ITERATIONS=1300 gave -0.017 bpb for free (exp34b). The model was training at max LR for 100% of training.
-
Depth recurrence is the best parameter-efficiency trick (community). 3-layer recurrence (blocks 3-5, 2 extra passes) from the community SP8192 baseline gives 17 virtual layers from 11 physical — the single biggest architectural win. Only works within the encoder, NOT across encoder/decoder boundary.
-
SP8192 tokenizer is transformative (community). Community's jump from SP1024 to SP8192 unlocked ~0.04 bpb improvement. But the larger embedding table (8192×512) needs GPTQ with SDClip — naive int8+brotli gives 10× worse quant degradation.
-
Parallel residuals improve quantization for free (community). GPT-J-style two-lane routing (attn/MLP read same input) from the community baseline collapses the quant gap vs single-lane. Cross-lane accumulation (community ImprovedParallelResiduals, PR #1523) pushed this further to 1.0744.
- Meta-TTT has an architecture-limited ceiling. 4 experiments (exp101, 105a, 106, 107) show an identical TTT delta of ~0.023 bpb regardless of inner-loop optimizer (SGD, MetaSGD, SAM, none). The ceiling is set by the bank architecture, not training (sketch below).
- Auxiliary losses are fatal in compute-starved regimes. JEPA, focal loss, boundary boost, MTP — every auxiliary objective tested hurt. With 1200-4700 steps, every gradient must directly reduce CE loss.
- Don't fight the optimizer. Muon's orthogonal constraint is a feature. VR_INIT must be 0.5 (lower → negative alphas). The embed LR ratio is misleading because Muon normalizes gradient direction (sketch below). Progressive unfreezing prevents co-adaptation.
- Quantization improvements are free BPB. Per-row clip search (-25% quant error), int6 for MLP proj (3.4× less error), GPTQ with SDClip — all zero training cost. Always sweep AWQ alpha for each new best model (clip-search sketch below).
- Simpler is better. Stripping token-type embedding and loss weighting from exp53b actually HELPED. Fewer competing objectives = better convergence in limited steps.
- QK_GAIN_INIT=5.25 is a free win (community). Monotonic improvement from 4.0→5.25 observed in the community SP8192 baseline. Per-head query gain initialization helps attention patterns specialize faster (sketch below).
- Partial RoPE 16/64 is universally good. Frees 75% of head dims for semantic matching, reduces quantization outliers 3×, and improves word-start attention. Consistent across every experiment it was tested in (sketch below).
- Word-start tokens dominate total loss. 25-40% of tokens but 42-66% of total loss. Mean loss 3.6-5.1 vs 1.2-1.6 for continuations. The best fix is architectural (partial RoPE), not loss manipulation (focal, weighting).
- Layer sharing revives dead blocks. Block 9 was dead at 6.1% effective rank. Sharing block 3 at position 9 revived it to 10.3%. Fewer unique blocks = smaller artifact = more headroom for params.
- Resid-norm is redundant with warmdown. Adding RMSNorm after skip connections improves quant but costs ~7ms/step (19 fewer training steps). With proper LR warmdown, weights are already smooth enough.
- Block sharing fails across the encoder/decoder boundary. Shared blocks at decoder positions converge to near-zero scales — effectively dead. Soft gates correctly diagnose the problem but can't override it (exp109).
- The model tells you what it wants. Block 0 attention dies (structural, MLP-dominant). Block 8 ve_scale grows to 0.88 (wants identity in deep-layer values). Bigram scale decays 0.26→0.10 (attention supersedes local patterns). Listen to the learned parameters.
- Co-occurrence QK initialization works. Initializing W_Q/W_K from bigram SVD gives meaningful step-0 attention patterns instead of random noise. Validated at 1.3525 bpb on 1×H100 (sketch below).
- Warmdown timing is critical. warmdown=400 steps (start at step 900) gives 4 SWA checkpoints and proper LR decay. Too late (warmdown=200) → only 2 checkpoints. Community uses 3500-4000 iters on longer runs.
- Size budget is a hard constraint — check BEFORE celebrating. embed_dim=448 achieved great BPB (1.0877) but at 16.28MB — over the 16MB limit; embed_dim=416 was a similar story at 16.44MB. Multiple experiments were wasted on approaches that couldn't fit (pre-flight check sketched below).
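Sketch of the LR-schedule failure mode fixed in exp34b. The function below is illustrative (the real script's schedule and constants may differ); the point is that warmdown occupies the last steps of ITERATIONS, so an ITERATIONS far beyond the wallclock budget means decay never begins:

```python
def lr_at(step, iterations, warmup=100, warmdown=400, max_lr=1.0):
    # Trapezoid schedule: linear warmup, flat top, linear warmdown over the
    # LAST `warmdown` steps of `iterations`.
    if step < warmup:
        return max_lr * step / warmup
    if step > iterations - warmdown:
        return max_lr * (iterations - step) / warmdown
    return max_lr

# ITERATIONS=20000 while only ~1300 steps fit in 600s: decay never starts.
print(lr_at(1300, iterations=20000))   # 1.0 -- still at max LR when time runs out
print(lr_at(1300, iterations=1300))    # 0.0 -- fully decayed at the final step
```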
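Sketch of depth recurrence as described for the community SP8192 baseline. Class and argument names are ours; it assumes each block maps hidden states to hidden states:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    def __init__(self, blocks, recur_lo=3, recur_hi=5, extra_passes=2):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)     # 11 physical blocks
        self.recur_lo, self.recur_hi = recur_lo, recur_hi
        self.extra_passes = extra_passes

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.recur_hi:
                # Re-enter the recurrent span: same weights, extra depth.
                for _ in range(self.extra_passes):
                    for j in range(self.recur_lo, self.recur_hi + 1):
                        x = self.blocks[j](x)
        return x

# Blocks 0-2 and 6-10 run once; blocks 3-5 run 3x each:
# 8 + 3*3 = 17 virtual layers from 11 physical.
```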
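Back-of-envelope arithmetic for why the SP8192 embedding table stresses the budget (illustrative numbers, before entropy coding):

```python
dim = 512
for vocab in (1024, 8192):
    mb = vocab * dim * 6 / 8 / 2**20      # 6-bit weights, pre-compression
    print(f"vocab={vocab}: embedding table ~{mb:.2f} MB at int6")
# vocab=1024: ~0.38 MB; vocab=8192: ~3.00 MB. A 3 MB line item in a 16 MB
# budget is why this table gets the strongest quantizer (GPTQ + SDClip).
```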
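Sketch of the two-lane parallel residual block, GPT-J style. `attn` and `mlp` are stand-ins for the real submodules; the cross-lane accumulation of PR #1523 is not reproduced here:

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    # Both lanes read the SAME normalized input, so quantization noise in
    # one lane is not amplified by the other within a block.
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.norm = nn.RMSNorm(dim)        # PyTorch >= 2.4
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        h = self.norm(x)
        # Sequential form would be: x = x + attn(norm(x)); x = x + mlp(norm(x))
        return x + self.attn(h) + self.mlp(h)
```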
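Sketch of a first-order (FOMAML-style) meta-TTT loss. The `model(tokens, bank=...)` call and its `.loss` attribute are assumed interfaces, not the exp93+ code:

```python
import torch

def fomaml_ttt_loss(model, bank, tokens, inner_lr=0.1, inner_steps=2):
    # Fast weights: clones keep the autograd path back to the bank's
    # initialization, so the outer loss trains where the bank STARTS.
    fast = {name: p.clone() for name, p in bank.items()}
    support, query = tokens.chunk(2, dim=1)   # adapt on one half, eval on the other
    for _ in range(inner_steps):
        inner_loss = model(support, bank=fast).loss
        grads = torch.autograd.grad(inner_loss, list(fast.values()))
        # First-order: detach the grads so no second-order terms are tracked.
        fast = {name: p - inner_lr * g.detach()
                for (name, p), g in zip(fast.items(), grads)}
    return model(query, bank=fast).loss       # backprop trains model + bank init
```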
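Sketch of the Muon-style Newton-Schulz orthogonalization (the widely used quintic with coefficients 3.4445, -4.7750, 2.0315; not necessarily the exact repo kernel). It shows why the raw embed-LR ratio is misleading: the update's scale is normalized away:

```python
import torch

@torch.no_grad()
def ns5_orthogonalize(g, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: drives the singular values of g toward 1,
    # so only the gradient's DIRECTION survives into the update.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)          # bound spectral norm via Frobenius norm
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T                       # iterate on the smaller Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```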
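Sketch of a per-row clip search; the grid and bit width are illustrative:

```python
import torch

def per_row_clip_scales(w, bits=6, grid=(0.6, 0.7, 0.8, 0.9, 1.0)):
    # For each weight row, pick the clip fraction whose symmetric integer
    # quantization minimizes round-trip MSE. frac=1.0 is the no-clip baseline.
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6
    row_absmax = w.abs().amax(dim=1, keepdim=True)
    best_err = torch.full((w.size(0),), float("inf"), device=w.device)
    best_scale = (row_absmax / qmax).squeeze(1)
    for frac in grid:
        clip = frac * row_absmax
        scale = clip / qmax
        q = (w.clamp(-clip, clip) / scale).round().clamp(-qmax, qmax)
        err = ((q * scale - w) ** 2).mean(dim=1)
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_scale = torch.where(better, scale.squeeze(1), best_scale)
    return best_scale                                  # per-row dequant scales
```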
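Sketch of a per-head query gain; the exact placement in the community model is an assumption:

```python
import torch
import torch.nn as nn

QK_GAIN_INIT = 5.25     # the community-swept optimum

class QueryGain(nn.Module):
    def __init__(self, n_heads=8):
        super().__init__()
        self.gain = nn.Parameter(torch.full((n_heads, 1, 1), QK_GAIN_INIT))

    def forward(self, q):               # q: (batch, heads, seq, head_dim)
        return q * self.gain            # scales attention logits per head
```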
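Sketch of partial RoPE with 16 of 64 dims rotated; `cos`/`sin` are the usual precomputed rotation tables:

```python
import torch

ROT = 16   # rotated dims per head; the other 64 - 16 = 48 stay position-free

def partial_rope(x, cos, sin):
    # x: (..., seq, 64); cos/sin: (seq, ROT // 2)
    x_rot, x_pass = x[..., :ROT], x[..., ROT:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)
```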
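Sketch of co-occurrence QK initialization (the Misc entry below). Mapping the vocab-space factors into model space through the tied embedding is omitted:

```python
import torch

def cooccurrence_qk_factors(token_ids, vocab_size, head_dim=64):
    # token_ids: 1-D LongTensor over a corpus sample.
    counts = torch.zeros(vocab_size, vocab_size)
    ones = torch.ones(token_ids.numel() - 1)
    counts.index_put_((token_ids[:-1], token_ids[1:]), ones, accumulate=True)
    u, s, vh = torch.linalg.svd(torch.log1p(counts))
    root_s = s[:head_dim].sqrt()
    q_factor = u[:, :head_dim] * root_s     # (vocab, head_dim)
    k_factor = vh[:head_dim].T * root_s     # (vocab, head_dim)
    return q_factor, k_factor               # q_factor @ k_factor.T ~ log1p(counts)
```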
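Sketch of a pre-flight artifact check (the real pipeline quantizes to int6 and compresses with zstd; zlib keeps this sketch stdlib-only):

```python
import zlib
import torch

LIMIT_MB = 16.0

def artifact_size_mb(quantized_tensors):
    # Measure the COMPRESSED artifact before reading any bpb number.
    payload = b"".join(t.contiguous().numpy().tobytes() for t in quantized_tensors)
    return len(zlib.compress(payload, level=9)) / 2**20

tensors = [torch.randint(-31, 32, (8192, 512), dtype=torch.int8)]
size = artifact_size_mb(tensors)
print(f"{size:.2f} MB -- {'OK' if size <= LIMIT_MB else 'OVER BUDGET'}")
```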
Phase 3a: 19 experiments. Best: exp09/exp13 at quant bpb 1.3145.
Phase 3b: ~30 experiments. Best: exp54b at quant bpb 1.2708.
| Outcome | Count |
|---|---|
| 🟢 Positive | 8 |
| 🟡 Neutral | 10 |
| 🔴 Negative | 10 |
Success rate: 29% positive, 36% neutral, 36% negative
Phase 3b-Muon: 6 experiments. 2 positive, 2 neutral, 2 negative.
| Outcome | Count | Examples |
|---|---|---|
| 🟢 Positive | 8 | exp92, exp93, exp95, exp101, SP8192_3LayerRecur, WiderEmb, ImprovedParallelResiduals, CooccurrenceQKInit |
| 🟡 Neutral | 14 | exp96, exp98, exp99, exp105a, exp106, exp108, exp110, exp111, exp113, exp115, exp119, FiLM variants, recurrence variants |
| 🔴 Negative | 7 | exp107, exp109, exp112, exp114, exp116, exp117, exp118 |
Success rate: 28% positive, 48% neutral, 24% negative
119+ experiments across all phases; overall positive rate ~28-29%.
Every experiment across all phases in one table.
| # | Experiment | Base | Motivation | Result | Learning |
|---|---|---|---|---|---|
| Phase 3a (exp00–exp18) | |||||
| 1 | exp00 (baseline-rerun) | exp27 | Establish baseline on A100 | Baseline | Quant bpb 1.3389; bigram.proj has worst quant error |
| 2 | exp01b (ln-scale-only) | exp27 | Test layer-norm damping | Not run | — |
| 3 | exp01c (ema-only) | exp27 | Test EMA weight averaging | Not run | — |
| 4 | exp01d (xsa-only) | exp27 | Test cross-sequence attention | Negative | XSA slows steps without quality gain |
| 5 | exp02 (speed-bigramfp16-awq) | exp00 | FP16 bigram + per-category AWQ | Negative | FP16 bigram blows artifact to 17.3MB |
| 6 | exp03 (qat-ste) | exp00 | Quantization-aware training via STE | Negative | QAT-STE destabilizes training; worst result |
| 7 | exp04 (no-cyclic-momentum) | exp00 | Test fixed momentum=0.95 | Negative | Cyclic momentum is slightly helpful as regularization |
| 8 | exp05 (grad-accum4) | exp00 | Double step count via accum 8->4 | Positive | Major breakthrough: 2x more steps = first sub-1.32 quant bpb |
| 9 | exp06 (swa-awq-accum2) | exp05 | Push accum to 2; SWA+AWQ tuning | Positive | Raw bpb breaks below 1.30 for first time |
| 10 | exp07 (tighter-swa-awq) | exp06 | SWA_EVERY=150, AWQ=0.7 | Neutral | Smaller artifact, quant bpb identical; sweet spot is SWA_EVERY=100 |
| 11 | exp08 (ctx-freq-bias) | exp05 | Learned token burstiness bias (+1 param) | Neutral | Redundant with attention; smallest artifact at 15.0MB |
| 12 | exp09 (padignore-wordboost) | exp06 | Skip pad tokens + word-start boost | Positive | Best quant bpb (1.3145); 0+1 params beat 688K trigram params |
| 13 | exp10 (trigram-unigram) | exp09 | Trigram hash table + unigram bias | Neutral | Best raw bpb but quant bpb regresses — extra params compress poorly |
| 14 | exp11 (trigram-slim-awq07) | exp10 | Slim trigram dim=48, AWQ=0.7 | Negative | dim=48 too small, AWQ too aggressive; double regression |
| 15 | exp12 (trigram64-awq06) | exp10 | Middle-ground trigram dim=64 | Neutral | Better than exp11 but worse than exp09; hash collisions too frequent |
| 16 | exp13 (multihead-gate-bigram) | exp09 | K=2 hash heads + context gate | Positive | Tied best quant bpb; collision reduction real but impact negligible |
| 17 | exp14 (engram-multiorder) | exp13 | 1-5gram, 10 lookups/position | Negative | Shared n-gram embeddings cause destructive interference |
| 18 | exp15 (engram-3order) | exp14 | 1-3gram with orthogonal subspaces | Neutral | Better isolation but each subspace too small (~42 dims) |
| 19 | exp16 (jepa-aux) | exp15 | JEPA predictor MLP, MSE loss | Negative | Biggest regression; fixed hash targets provide adversarial gradient |
| 20 | exp17 (byte-engram) | exp16 | Byte boundary features | Negative | No gain; base too weak to evaluate |
| 21 | exp18 (separate-trigram64) | exp13 | Separate 64-dim trigram + projection | Neutral | 688K extra params don't survive quantization |
| Phase 3b-Part1 (exp27b–exp33b) | |||||
| 22 | exp27b (resid-norm) | exp09 | RMSNorm after skip connections | Positive | High-leverage: attacks root cause of quant error (norm growth 19.7->89.5) |
| 23 | exp28b (perlayer-quant) | exp09 | Variable bitwidth per layer | Negative | 16% MSE reduction but over 16MB budget |
| 24 | exp29b (lossweight-typemb) | exp09 | 1.5x word-start loss + token-type embed | Positive | Gradient redistribution + structural signal both help |
| 25 | exp30b (combo) | exp09 | Stack all 3 validated improvements | Positive | Phase 3b-Part1 SOTA (1.3156); sub-additive but substantial |
| 26 | exp31b (rope-50k) | exp30b | RoPE base 10k->50k | Negative | Best raw bpb but quant gap widens; net negative after quantization |
| 27 | exp32b (aux-boundary) | exp30b | Auxiliary word-boundary classifier | Negative | Gradient waste; token-type already provides structural signal |
| 28 | exp33b (alt-rope-ntk) | exp30b | Alternating RoPE bases + NTK | Neutral | Marginal; positional loss still flat after 256 tokens |
| Phase 3b-Part2 (exp34b–exp48b) | |||||
| 29 | exp34b (lr-schedule-fix) | exp30b | Fix ITERATIONS 20000->1300 so warmdown fires | Positive | Single biggest improvement (-0.0166 bpb); warmdown was never firing |
| 30 | exp35b (focal-loss) | exp30b | Focal loss gamma=2 | Negative | Too aggressive; suppresses easy token gradients |
| 31 | exp36b (cappedact-labelsmooth) | exp30b | Activation cap + label smoothing | Negative | Both changes hurt independently and together |
| 32 | exp37b (fused-cap) | exp34b | Activation cap only | Negative | Cap hurts raw quality more than it helps quant |
| 33 | exp38b (speed-opt) | exp34b | Speed optimization | Neutral | Failed (OOM) |
| 34 | exp39b (swa-tuning) | exp34b | SWA parameter sweep | Positive | SWA_EVERY=100 confirmed optimal |
| 35 | exp42b (revive-block9) | exp34b | Share block 3 at position 9 | Positive | Dead block 9 (6.1% rank) revived to 10.3% |
| 36 | exp43b (boundary-boost) | exp42b | Boundary loss boost | Neutral | Too sparse (2.5% of positions) to matter in 1200 steps |
| 37 | exp44b (seqlen-curriculum) | exp42b | Sequence length curriculum | Negative | Speed regression |
| 38 | exp45b (awq-alpha07) | exp42b | AWQ alpha sweep (post-train) | Neutral | Alpha=0.7 gave -0.007 bpb free on exp42b |
| 39 | exp46b (full-mha) | exp42b | 8 KV heads (double from 4) | Neutral | Extra params but slower; depth > width |
| 40 | exp47b (warmdown200) | exp42b | Shorter warmdown=200 | Negative | Too late; only 2 SWA checkpoints vs 4 with warmdown=400 |
| 41 | exp48b (10blocks-depth) | exp42b | Add 10th unique block | Positive | Depth > width confirmed; quant bpb 1.2930 |
| Phase 3b-Part3 (exp53b–clean_54b) | |||||
| 42 | exp53b (lean-combo) | exp48b | Strip token-type + loss weighting | Positive | Removing features HELPED; quant bpb 1.2720 (-0.021!) |
| 43 | exp54b (xsa-zstd-ckfix) | exp53b | XSA last 2 layers + c_k fix + zstd | Positive | 1xH100 SOTA: quant bpb 1.2708 |
| 44 | exp55b (scaled-xsa-all) | exp54b | Learned XSA alpha on all layers | Neutral | Model wants XSA everywhere (alpha=0.75-0.99) but 20ms overhead |
| 45 | exp56b (fast-cosine-xsa) | exp55b | Cosine-scale XSA approximation | Negative | GQA head expansion is bottleneck, not XSA math |
| 46 | exp57b (lora-ttt) | exp54b | LoRA-based TTT | Negative | Failed |
| 47 | exp58b (resid-norm-on) | exp54b | Re-enable resid-norm | Negative | Redundant with warmdown; 7ms/step overhead not worth it |
| 48 | exp59b (pre-norm-skip) | exp54b | Pre-skip normalization | Negative | Same overhead as full resid-norm, no quality difference |
| 49 | clean_54b (final-arch) | exp54b | Clean submission version + TTT | Positive | Quant bpb 1.2723; clean baseline |
| 50 | clean_54b_v2 (bf16-roundtrip) | clean_54b | BF16 roundtrip test | Negative | Destroyed quality |
| Phase 3.5 (exp60–exp80) | |||||
| 51 | exp60 (8xh100-sim) | exp54b | EMA + flash_attn3 + 8xH100 simulation | Neutral | Infrastructure for scaling; not a bpb experiment |
| 52 | exp61b (xsa-all-warmdown) | exp60 | XSA all blocks + cosine warmdown | Positive | Pre-quant 1.1504; XSA-all works at scale |
| 53 | exp63 (cascade-vr) | exp61b | Cascading value residual + adaptive warmdown | Positive | Pre-quant 1.1377; discovered deep-layer value highway |
| 54 | exp64 (mlp-int6) | exp63 | MLP int6 quantization | Not run | Superseded by exp69 |
| 55 | exp65 (quant-overhaul) | exp63 | Full quantization overhaul | Not run | Ideas flowed into exp69 |
| 56 | exp66 (mile-nope) | exp65 | MiLe loss + partial NoPE | Negative | MiLe hurts early convergence |
| 57 | exp67 (ws-semantic-attn) | exp66 | Word-start semantic attention | Negative | Failed |
| 58 | exp68 (ws-mtp) | exp66 | Next-word-start MTP head | Not run | TTT data leakage concern |
| 59 | exp69 (better-quant) | exp63 | MLP proj->int6, attn->int5, LZMA, prune 5% | Positive | Closed quant gap 0.035->0.015; free improvements |
| 60 | exp70 (speed-opt) | exp69 | Batched NS5, EMA/10, set_to_none, deferred .item() | Positive | ~1.15 bpb; speed-optimized foundation for all subsequent |
| 61 | exp71 (output-bias) | exp70 | Output bias + label smooth + Z-loss | Not run | Needs too many steps to build momentum |
| 62 | exp72 (jepa-concept) | exp70 | JEPA concept loss | Negative | Added overhead, not enough steps even at 7K |
| 63 | exp73 (warmdown-focal) | exp70 | Warmdown focal + TTT weight | Not run | Safe late-training intervention (designed) |
| 64 | exp74 (prope-qgain-wbigram) | exp70 | Partial RoPE 16/64 + diverse q_gain + word bigram | Positive | Sliding bpb 1.1456; heads specialized (sharp+soft) |
| 65 | exp75 (word-pool) | exp74 | Inject previous word-start embedding | Negative | Model suppressed it (scale 0.1->0.002); redundant with attention |
| 66 | exp76 (dual-word-attn) | exp74 | Dual token + word attention | Negative | Failed |
| 67 | exp77old (late-warmdown) | exp70 | Late warmdown only | Neutral | Superseded by exp77 |
| 68 | exp77 (progressive-batch) | exp70 | Progressive batch + seq_len curriculum | Not run | Theoretically sound but non-standard |
| 69 | exp78 (ws-loss-curriculum) | exp70 | Word-start loss curriculum 0.1->1.0 | Positive | Best embedding quality; WS rank improved |
| 70 | exp79 (position-ramp) | exp70 | Position ramp 1.0->1.2 + late WS boost | Negative | Premise wrong: late positions are EASIER (90% repeats) |
| 71 | exp80 (best-stack) | exp70 | Combine pRoPE + bigram-after-norm + pos ramp + clamp | Negative | Bigram-after-norm destabilized attention |
| Phase 3.6 (exp81–exp87) | |||||
| 72 | exp81 (prope-ws-curriculum) | exp78 | Partial RoPE + WS curriculum | Neutral | Failed |
| 73 | exp82 (drop-layer10) | exp81 | Drop layer 10 + diverse q_gain | Not run | Designed only |
| 74 | exp83 (diagnostics) | exp70 | Full diagnostic run: grad norms, VR health, block analysis | Positive | 7 actionable insights; premature warmdown, dead blocks identified |
| 75 | exp84 (diagnostic-tuned) | exp83 | Apply diagnostics: VR_init=0.3, embed_lr=0.015 | Negative | VR went negative; embed_lr ratio misleading with Muon |
| 76 | exp85 (community-derived) | exp83 | pRoPE + x0-to-V + LN scale + clip search + small bigram | Positive | Best pre-quant (1.1517); ve_scale revealed model preferences |
| 77 | exp86 (deep-opt) | exp85 | Fused QKV + int8 critical + TF32 | Not run | Designed |
| 78 | exp87 (fast-convergence) | exp85 | Embed preinit SVD + progressive unfreeze + block9 AdamW | Negative | All 3 hurt; don't fight Muon's orthogonal constraint |
| Phase 3b-Muon (parallel optimizer) | |||||
| 79 | exp70_parallel_muon | exp70 | Parallel Muon via reduce-scatter/all-gather overlap | Positive | 12% speed (658ms vs 750ms); same final bpb |
| 80 | exp70_vram_opt | exp70_parallel_muon | Double-buffer data loader | Negative | Insufficient buffers for grad_accum |
| 81 | exp70_cuda_fused | exp70_parallel_muon | CUDA Graphs + Triton fusion | Negative | No improvement |
| 82 | exp90 (copy-head) | exp70_parallel_muon | TopicCopyHead (hybrid freq+attn) | Neutral | Concept validated; 40ms overhead |
| 83 | reverted_exp70 | exp70_parallel_muon | Clean base with all fixes | Positive | Clean foundation; 656ms/step |
| 84 | exp91 (smooth-v0residual) | reverted_exp70 | V0 residual + label smoothing | Neutral | Pending validation |
| Phase 3c (exp92–exp109) | |||||
| 85 | exp92 (banks-asyncmuon) | exp70 | Major rewrite: bank tensors + async Muon + partial RoPE + QAT + VE | Positive | ~1.131 bpb; paradigm shift in architecture |
| 86 | exp93 (meta-ttt) | exp92 | Meta-TTT inner/outer FOMAML | Positive | Legal_ttt ~1.116; first meta-TTT integration |
| 87 | exp95 (size-opt-metattt2x) | exp93 | Size optimization + meta-TTT 2x | Positive | Legal_ttt 1.1169; SOTA at the time |
| 88 | exp96 (warmdown-trigram) | exp95 | Warmdown fix + trigram hash | Neutral | ~1.135 bpb; marginal |
| 89 | exp97 (fp8-pipeline) | exp96 | FP8 pipeline + compile | Not run | Designed |
| 90 | exp98 (metattt-randomsplit) | exp96 | Random-split FOMAML + momentum LR match | Neutral | ~1.135 bpb; no improvement |
| 91 | exp99 (tripleloop) | exp98 | Triple loop + parallel residuals | Not run | Community merged first |
| 92 | exp100 (half-metattt) | exp95 | Half meta-TTT variant | Neutral | Not tracked in detail |
| 93 | exp101 (poscond-bigram) | exp95 | Position-conditional bigram hash by token class | Positive | Legal_ttt 1.11588; zero-param trick splitting hash by word-start |
| 94 | exp105a (no-metattt ablation) | exp101 | Remove meta-TTT to measure its contribution | Neutral | Meta-TTT = +0.00036 bpb (noise); ceiling is architectural |
| 95 | exp106 (metasgd-crosschunk) | exp101 | MetaSGD + cross-chunk FOMAML | Neutral | TTT delta invariant at ~0.023; ceiling confirmed |
| 96 | exp107 (sam-inner) | exp106 | SAM inner loop for TTT | Negative | SAM hurts; TTT delta still ~0.023 regardless of optimizer |
| 97 | exp108 (sp8192-brotli) | exp106 | SP8192 tokenizer + Brotli compression | Neutral | No stored results |
| 98 | exp109 (shared-blocks-softgate) | exp101 | Block sharing K=8 + soft gates + SP8192 | Negative | Decoder positions dead (near-zero scales); 10x worse quant |
| Community SOTA (SP8192+) | |||||
| 99 | SP8192_3LayerRecur (community) | Community | SP8192 + 3-layer recurrence (blocks 3-5) + parallel residuals + QK_GAIN=5.25 | Positive | Legal_ttt 1.0808; paradigm shift — 17 virtual layers from 11 physical |
| 100 | WiderEmb_TapInV6_TTT (community) | Community | Wider loop (3x3) + per-pass embeddings + Tap-In V6 + legal TTT | Positive | Legal_ttt 1.0788 (3-seed mean 1.078825) |
| 101 | ImprovedParallelResiduals (community PR #1523) | Community | Cross-lane attn/MLP accumulation + CUTLASS EVT fusion | Positive | Legal_ttt 1.0744 — CURRENT BEST; 71 bytes headroom |
| 102 | RecurStepFiLM_PooledRetrieval (community) | Community | FiLM conditioning + pooled retrieval | Neutral | No improvement over base |
| 103 | 10L_RecurStepFiLM_PooledRetrieval (community) | Community | 10L variant of FiLM+retrieval | Neutral | No improvement |
| 104 | newSota (community) | Community | Community SOTA integration | Positive | Integration checkpoint |
| 105 | 11L_RecurStep3_loopedonly | Community | 11L, recurrence step 3, looped-only | Neutral | No improvement over ImprovedParallelResiduals |
| 106 | 11L_RecurStep3_loops3 | Community | 11L with 3 loops | Neutral | No improvement |
| 107 | 11L_RecurStep_StochDepth_ProgLoop | Community | Stochastic depth + progressive loop | Neutral | No improvement |
| 108 | 11L_RecurStep_StochDepth_ProgLoop_KVCache | Community | + KV cache for recurrence | Neutral | No improvement |
| 109 | 11L_Block10MLPHalf_RecurStepFiLM | Community | Block 10 MLP halved + FiLM + retrieval | Neutral | No improvement |
| 110 | loop_in_SP8192_3LayerRecur | Community | Loop detection: timestep embed + re-injection + per-loop RMSNorm | Neutral | Not yet trained |
| Frontier (exp110–exp119) | |||||
| 111 | exp110 (perlayer-quant-trigram) | ImprovedParallelResiduals | Per-layer quant + trigram + PARALLEL_START=7 | Neutral | No improvement |
| 112 | exp111 (lora-ttt-shrunk) | ImprovedParallelResiduals | LoRA TTT rank=8 + shrunk block 10 MLP | Neutral | No improvement |
| 113 | exp112 (grad-rescaling) | ImprovedParallelResiduals | Gradient rescaling on weak blocks | Negative | Doesn't fix structural tied-embedding bottleneck |
| 114 | exp113 (drop-l0-mtp) | ImprovedParallelResiduals | Drop L0 MLP + batch schedule + MTP | Neutral | Truncated logs |
| 115 | exp114 (embed384-decouple) | ImprovedParallelResiduals | embed_dim=384 to decouple boundary blocks | Negative | 655K param loss -> BPB regression (1.0950) |
| 116 | exp115 (embed384-asymmetric) | ImprovedParallelResiduals | embed_dim=384 + drop boundary MLPs | Neutral | Truncated |
| 117 | exp116 (embed384-no-x0) | ImprovedParallelResiduals | embed_dim=384 + remove x0 pathway | Negative | No stored results |
| 118 | exp117 (embed448-tuned) | ImprovedParallelResiduals | embed_dim=448 to activate boundary blocks | Negative | Good BPB (1.0877) but 16.28MB — over budget |
| 119 | exp118 (embed416-parstart7) | ImprovedParallelResiduals | embed_dim=416 + parallel_start=7 + tighter clip | Negative | Good BPB (1.0915) but 16.44MB — over budget |
| 120 | exp119 (residual-lowrank-proj) | ImprovedParallelResiduals | Residual low-rank projection (rank=32) | Neutral | Theoretically correct fix; not run to completion |
| Misc | |||||
| 121 | CooccurrenceQKInit | PR #623 | Init W_Q/W_K from bigram co-occurrence SVD | Positive | Val_bpb 1.3525 on 1xH100; meaningful step-0 attention patterns |
Last updated: 2026-04-13