Changes from all commits (77 commits)
- `65261c9` Added my understanding of train_gpt.py (Itssshikhar, Mar 21, 2026)
- `e1b518e` restored some original changes as well in mu_understanding.md file. (Itssshikhar, Mar 21, 2026)
- `85451b0` Record: 11L + Efficient Partial XSA (val_bpb: 1.1307) — NEW SOTA (unnir, Mar 20, 2026)
- `14d9389` Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271) (jfprincz, Mar 20, 2026)
- `6937a42` Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) (jfprincz, Mar 21, 2026)
- `336c711` Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233… (signalrush, Mar 22, 2026)
- `58f1a98` Update README leaderboard with merged records (cocohearts, Mar 23, 2026)
- `843091a` Use GitHub usernames in new leaderboard rows (cocohearts, Mar 23, 2026)
- `4a43355` Describe leaderboard entries by base-run diff (cocohearts, Mar 23, 2026)
- `7563398` Record: LeakyReLU² + Legal TTT + Parallel Muon — val_bpb 1.1194 (3-se… (abaybektursun, Mar 23, 2026)
- `b6722b8` Fix pre-TTT BPB, TTT gains, and steps to match logs exactly (abaybektursun, Mar 23, 2026)
- `d8bd62f` Fix author attributions: PR #493 @parinzee, PR #461 @Christopher-Lee-… (abaybektursun, Mar 23, 2026)
- `79ab878` Update README.md (valerio-oai, Mar 24, 2026)
- `ef7070e` Update README.md (valerio-oai, Mar 24, 2026)
- `fe045a9` Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary Asymmetric … (CiprianFlorin-Ifrim, Mar 25, 2026)
- `068bf80` Update README.md (0hq, Mar 25, 2026)
- `f4e09d4` Record Submission: 1.1570 BPB - 73.7M Ternary U-Net (10L 768d 8192BPE… (CiprianFlorin-Ifrim, Mar 25, 2026)
- `3623ba2` Update README.md (0hq, Mar 25, 2026)
- `2f9b80c` Update README.md (0hq, Mar 25, 2026)
- `cf56bcb` Update README.md (0hq, Mar 25, 2026)
- `b6c3e52` Update README.md (0hq, Mar 25, 2026)
- `65269a2` Update README.md (0hq, Mar 25, 2026)
- `5372bea` Non-record: Depth Recurrence in Parameter-Constrained Transformers — … (evangelinehelsinki, Mar 25, 2026)
- `6b079fb` Add Gram Newton-Schulz integration on top of #1 submission (Itssshikhar, Mar 31, 2026)
- `c4b9b46` Fix Gram Newton-Schulz to match Dao-AILab reference implementation (Itssshikhar, Apr 2, 2026)
- `557cde4` Fix 8xH100 runtime bugs + document full run results (val_bpb 1.1228) (Itssshikhar, Apr 2, 2026)
- `ed7f292` Use actual Dao-AILab gram-newton-schulz kernels instead of pure PyTorch (Itssshikhar, Apr 2, 2026)
- `99a8196` Fix contiguity bug + document kernel run results (99.02ms, SLOWER tha… (Itssshikhar, Apr 2, 2026)
- `4f2879e` Rewrite Gram NS: inline, same coefficients, same bf16, no library (Itssshikhar, Apr 3, 2026)
- `f58e089` Add sliding window attention experiment + full debugging journey (Itssshikhar, Apr 4, 2026)
- `025f348` seq4096 breakthrough: 1.1130 BPB sliding eval (beats #1's 1.1147) (Itssshikhar, Apr 6, 2026)
- `be1a166` Add run instructions for all configs (direct + Modal + 4090 smoke test) (Itssshikhar, Apr 6, 2026)
- `68eafdf` 3-seed reproduction: 1.1165 mean BPB — 1.1130 claim does NOT hold (Itssshikhar, Apr 7, 2026)
- `9d5a37a` Document perf investigation: 89ms local vs 82ms Modal is container-li… (Itssshikhar, Apr 7, 2026)
- `5b2cd04` Document NanoGPT technique stacking: QK gain 2.5 + sigmoid rescale + PKO (Itssshikhar, Apr 8, 2026)
- `070fff5` Document sparse attention gate results + quantization wall analysis (Itssshikhar, Apr 9, 2026)
- `3cca46c` Bank QAT on all F.linear weights: 3-seed mean 1.1117 BPB (beats #1 by… (Itssshikhar, Apr 9, 2026)
- `f5d137b` Day 1 stack: SP8192 + SDClip + Brotli + WD 0.085 + MLP 4x as new defa… (Itssshikhar, Apr 25, 2026)
- `9465963` added sota approaches in txt file (Itssshikhar, Apr 25, 2026)
- `bae4f61` Scylla TM998 integration + raw retokenization + S2 training run (Itssshikhar, Apr 25, 2026)
- `d654799` Add Scylla tokenizer files, 3-seed runner, and data manifest (Itssshikhar, Apr 25, 2026)
- `7508362` v7 recurrence quantization fixes: INT8 recurred layers + skip-recur-e… (Itssshikhar, Apr 27, 2026)
- `fa8678d` Add v1-v3 recurrence logs, SP8192 tokenizer script, and tokenizer spe… (Itssshikhar, Apr 27, 2026)
- `d7327d5` trying out better hessian for recurrence layers. (Itssshikhar, Apr 27, 2026)
- `432f3b8` v12 bankless rewrite: per-layer CastedLinear, quant gap 0.084→0.005 (Itssshikhar, Apr 27, 2026)
- `cc619e1` Save PR #1394 reference code and hessian fix doc (Itssshikhar, Apr 27, 2026)
- `7100db1` v13 plan: staged strip-down + capacity recovery (openhands-agent, Apr 27, 2026)
- `0202216` v13 experiments: VE strip baseline + TTT catastrophic failure (Itssshikhar, Apr 27, 2026)
- `e15c752` v13 analysis: lock v13a, fix TARGET_MB bug, retry TTT before declarin… (openhands-agent, Apr 28, 2026)
- `b303cad` v13 complete: 9 experiments, v13a remains best at 1.0955 BPB (Itssshikhar, Apr 28, 2026)
- `2332d0b` v14 experiments: QAT, PKO, mixed-precision sensitivity scan (Itssshikhar, Apr 28, 2026)
- `2a97eff` Add all experiment scripts, logs, and training history (Itssshikhar, Apr 28, 2026)
- `5f35bb9` Add PR1493 priority experiment harness (openhands-agent, Apr 28, 2026)
- `8ba15a5` Add PR1493 priority experiments live results log (Itssshikhar, Apr 28, 2026)
- `190134d` Record docshuffle result: q_ttt=1.08279 (regression) (Itssshikhar, Apr 29, 2026)
- `6b60ed9` Record wd result: q_ttt=1.08029 (small real win) (Itssshikhar, Apr 29, 2026)
- `8a82456` Record iha failure: harness bug in GPTQ Hessian collection (Itssshikhar, Apr 29, 2026)
- `10ca871` Fix iha stop_step typo: 4527 -> 4524 (Itssshikhar, Apr 29, 2026)
- `ded7e22` Record mtp result: q_ttt=1.09023 (clear regression) (Itssshikhar, Apr 29, 2026)
- `74dc702` Add PR1493 stacking experiments (openhands-agent, Apr 29, 2026)
- `49d1068` Record real wd_paired result: q_ttt=1.07974 + add safe_launch guard (Itssshikhar, Apr 29, 2026)
- `2a8fbca` Record wd_strong_paired result: q_ttt=1.07971 (no stack vs wd_paired) (Itssshikhar, Apr 29, 2026)
- `2285bc6` Record wd_paired_iha kill: pre=1.08666 worse than wd_paired (1.08610) (Itssshikhar, Apr 29, 2026)
- `468be92` GPTQ Hessian all-reduce + damp/block sweep (Itssshikhar, Apr 30, 2026)
- `cd87935` SmearGate+attn_gate port (regression at single seed) + pivot doc to c… (Itssshikhar, Apr 30, 2026)
- `ec48ff1` Port wd_schedule onto PR1851 base as train_top.py (Itssshikhar, Apr 30, 2026)
- `3f661b3` Document PR1851 wd_strong port + portability audit of PR1493 stack (Itssshikhar, Apr 30, 2026)
- `6c53583` GPTQ Hessian all-reduce on PR1851 base (Itssshikhar, Apr 30, 2026)
- `97fc8a5` Port paired-head Muon NS to PR1851 bank architecture (Itssshikhar, Apr 30, 2026)
- `01347da` Document Run 1 (AR alone) + reversal of wd_strong verdict (Itssshikhar, Apr 30, 2026)
- `d3aa1bc` Document Runs 1-3 + final audit of PR1493-stack portability to PR1851 (Itssshikhar, Apr 30, 2026)
- `611b598` Switch to PR #1855 base + port wd_schedule + AR onto train_top_1855.py (Itssshikhar, Apr 30, 2026)
- `4a73033` Document Run 4 (PR1851 + 9 hparams + wd_strong + AR) — best q_ttt yet (Itssshikhar, Apr 30, 2026)
- `dba2da7` Document PR1855 wd_strong AR run (openhands-agent, Apr 30, 2026)
- `d0eb37c` Document Run 4 pergroup recovery plan (openhands-agent, Apr 30, 2026)
- `0209a50` Port PR #1855 pergroup lrzip compressor into train_top.py (Itssshikhar, Apr 30, 2026)
- `074817a` Run 6: PR1851 + 9hp + wd_strong + AR + ported pergroup — best valid q… (Itssshikhar, Apr 30, 2026)
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
*.pt
*.ptz
*.log
255 changes: 255 additions & 0 deletions 3seed_s1337.log

Large diffs are not rendered by default.

252 changes: 252 additions & 0 deletions 3seed_s42.log

Large diffs are not rendered by default.

263 changes: 263 additions & 0 deletions 3seed_s7.log

Large diffs are not rendered by default.

123 changes: 123 additions & 0 deletions EXPERIMENTS.md
@@ -0,0 +1,123 @@
# Experimental Findings — Parameter Golf (April 2026)

## Objective
Beat PR #1493's SOTA of **1.0810 BPB** (3-seed mean) in the parameter-golf competition.
Constraints: 16,000,000 bytes (decimal) artifact cap, 600s train+eval wallclock, 8×H100 GPUs.

## Baseline: PR #1493
- **Architecture**: 11L×512d×8H/4KV, 3-layer recurrence (L3-5 at frac=0.35), parallel residuals (L7+), tied embeddings, logit softcap=30, XSA on all 11 layers, U-Net skip connections
- **Quantization**: int6 matrices (clip_sigmas=12.85, clip_range=31), int8 embeddings (clip_sigmas=20.0, clip_range=127), brotli compression, GPTQ
- **Results** (SEED=42, QK_GAIN_INIT=5.25):
- Pre-quant non-sliding: **1.08757**
- Post-quant non-sliding: **1.10014** (quant gap = 0.01257)
- Post-quant sliding: **1.08329**
- Post-quant TTT: **1.08103**
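
The exact quantizer lives in the submission code; a minimal sketch of the `clip_sigmas`/`clip_range` scheme described above (function name hypothetical) might look like:

```python
import torch

def fake_int6_quant(w: torch.Tensor, clip_sigmas: float = 12.85, clip_range: int = 31) -> torch.Tensor:
    # Symmetric clip-and-round: clip at clip_sigmas standard deviations,
    # then map onto the 2*clip_range + 1 = 63 integer levels of int6.
    scale = clip_sigmas * w.std() / clip_range
    q = torch.round(w / scale).clamp_(-clip_range, clip_range)
    return q * scale  # dequantized view; the artifact stores q plus the scale

w = torch.randn(512, 512)
wq = fake_int6_quant(w)
```

After brotli, the integer grid `q` is what actually gets compressed, which is why the clip parameters trade reconstruction error against compressed size.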

---

## Experiment 1: Quantization-Aware Training (QAT)

**Hypothesis**: Fake-quantize STE during warmdown lets the model adapt to quantization noise, reducing the quant gap.

**Implementation** (`train_qat.py`):
- Forward: `w + (quantize(w) - w).detach()` — uses quantized weights, backward passes gradients as identity
- Applied to all CastedLinear (numel > 65536) + embedding when `_QAT_ACTIVE=True`
- Toggled on at configurable training fraction
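
The STE forward line above can be exercised in isolation (a sketch; the real code applies it inside `CastedLinear`):

```python
import torch

def ste_fake_quant(w: torch.Tensor, scale: float = 0.1, clip_range: int = 31) -> torch.Tensor:
    # Forward computes with quantized weights; .detach() makes the backward
    # pass treat the quantizer as the identity (straight-through estimator).
    wq = torch.round(w / scale).clamp(-clip_range, clip_range) * scale
    return w + (wq - w).detach()

w = torch.randn(8, 8, requires_grad=True)
ste_fake_quant(w).sum().backward()  # gradients flow as if no quantizer were present
```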

**Runs**:
| Run | Start Frac | Strategy | Post-Quant Non-Sliding | Notes |
|-----|-----------|----------|----------------------|-------|
| run1 | 0.28 | Standard | CRASHED | `FailOnRecompileLimitHit` (fixed with `cache_size_limit=32`) |
| run2 | 0.28 | Standard | 1.1118 | +0.012 worse than baseline |
| run3 | 0.55 | Standard | 1.1111 | Similar degradation |
| run4 | 0.85 | Standard | 1.1092 | Slightly better but still worse |
| run5 | 0.28 | EMA-then-finetune | 1.1194 | Worst — optimizer state mismatch |

**Root Cause**: EMA contamination. With decay=0.9965, the last ~300 steps dominate the EMA average. QAT noise in those final steps contaminates the EMA irreversibly. The EMA model (which gets saved) never fully benefits from QAT adaptation.
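
The "last ~300 steps dominate" claim follows from the decay arithmetic alone:

```python
import math

decay = 0.9965
# Fraction of the EMA's total weight contributed by the most recent 300 updates,
# and the number of steps over which a contribution's weight halves.
mass_last_300 = 1 - decay ** 300
half_life = math.log(2) / -math.log(decay)

print(round(mass_last_300, 2))  # 0.65 -> the last 300 steps carry ~65% of the average
print(round(half_life))         # ~198-step half-life
```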

**Conclusion**: QAT is incompatible with this training setup. The EMA mechanism, which is critical for final model quality, cannot coexist with QAT noise injection.

---

## Experiment 2: Partial Key Offset (PKO)

**Hypothesis**: Shifting non-RoPE key dimensions by 1 position enables implicit bigram/position awareness.

**Implementation** (`train_pko.py`, `eval_pko.py`):
```python
rd = self.rope_dims  # 16 RoPE dims; dims rd: carry no positional encoding
if seqlen > 1:
    k = k.clone()
    # shift the non-RoPE key dims back by one position
    k[:, 1:, :, rd:] = k[:, :-1, :, rd:]
```

**Results**:

### Eval-only PKO (applied to pre-trained model without retraining):
- Baseline: **1.10014**
- PKO all layers: **~1.68** (catastrophic failure)
- PKO encoder-only: **~1.68** (catastrophic failure)

### Training with PKO:
- Pre-quant: **1.08829** (vs baseline 1.08757, +0.0007 worse)
- Post-quant sliding: **1.08420** (vs baseline 1.08329, +0.0009 worse)
- **Post-quant TTT: 1.10474** (catastrophic — TTT makes it WORSE by 0.02 BPB)

**Root Cause**: PKO shifts create non-standard key representations that break TTT's gradient-based weight adaptation. TTT assumes standard attention semantics; the shifted keys create an optimization landscape that gradient updates can't navigate.

**Conclusion**: PKO is incompatible with TTT. Since TTT provides the critical final ~0.002 BPB gain, PKO is a net negative.

---

## Experiment 3: Mixed-Precision Sensitivity Scan

**Hypothesis**: With code packing (~16,600 bytes for submission code), we have freed budget to promote high-sensitivity layers from int6 to int7.

**Implementation** (`sensitivity_scan.py`):
- For each quantizable matrix: test int7 (clip_range=63) while keeping others at int6
- Measure extra compressed bytes and BPB improvement
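
The "over budget" verdict is forced by simple arithmetic (taking the smallest layer as 2^17 = 131,072 params, an assumption consistent with the ~131K figure below):

```python
# int6 -> int7 adds exactly one bit per parameter, and the extra low-order
# bits are near-random, so brotli recovers almost none of them.
params_smallest = 131_072            # smallest quantizable matrix (~131K params)
extra_bytes = params_smallest // 8   # +1 bit/param = 16,384 raw bytes
budget_bytes = 6_447                 # free margin after model + packed code

print(extra_bytes, extra_bytes > budget_bytes)  # 16384 True
```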

**Results**:
- Estimated budget: 6,447 bytes (16M - baseline_model - 16,600 code)
- Minimum cost to promote any matrix: ~16,500 bytes (even the smallest 131K-param layers)
- **ALL matrices marked OVER budget**

**Conclusion**: The byte budget is far too tight for mixed precision. Even the smallest matrix promotion exceeds available margin after compression.

---

## Strategic Analysis: What's Left on the Table

### High-EV opportunities (not yet tested):
1. **MTP (Multi-Token Prediction)** — auxiliary loss predicting t+2, t+3 tokens during training. Heads discarded at inference → zero byte cost. v13 codebase already has an implementation. Estimated potential: 0.002-0.005 BPB pre-quant improvement.
2. **More recurrence at eval-only** — using num_loops=3 during eval (trained with 2). May give ~0.001 BPB gain at inference time cost within 600s budget.
3. **Hyperparameter tuning** — warmdown_frac, learning rate schedule, MuonEQR parameters.
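
A minimal sketch of what an MTP auxiliary loss could look like (toy shapes and offsets for illustration; not the v13 implementation):

```python
import torch
import torch.nn.functional as F

B, T, D, V = 2, 16, 32, 64                # toy batch/seq/hidden/vocab sizes
hidden = torch.randn(B, T, D)             # trunk activations
tokens = torch.randint(V, (B, T))
mtp_heads = torch.nn.ModuleList(torch.nn.Linear(D, V) for _ in range(2))

aux_loss = torch.tensor(0.0)
for off, head in zip((2, 3), mtp_heads):  # extra heads predict tokens t+2, t+3
    logits = head(hidden[:, :-off])       # only positions that have a t+off target
    aux_loss = aux_loss + F.cross_entropy(
        logits.reshape(-1, V), tokens[:, off:].reshape(-1)
    )
# mtp_heads are dropped before serialization: training-time signal, zero artifact bytes
```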

### Dead ends (definitively ruled out):
- QAT with EMA-based training
- PKO (incompatible with TTT)
- Mixed precision promotion (budget too tight)
- Eval-only architectural modifications on pre-trained models

### Key insight:
The quant gap (0.0126 BPB) is the biggest single loss. Reducing it requires either:
- Better GPTQ (different Hessian estimation, more calibration data)
- Smaller model that quantizes better (but then pre-quant suffers)
- Training changes that make weights more quantization-friendly WITHOUT explicit QAT noise

---

## File Index

| File | Purpose |
|------|---------|
| `train_qat.py` | QAT-modified training script |
| `train_pko.py` | PKO-modified training script |
| `eval_pko.py` | Eval-only PKO test |
| `sensitivity_scan.py` | Mixed-precision sensitivity scanner |
| `run_qat.sh` | QAT run configuration |
| `logs/qat_run*.log` | QAT training logs (5 runs) |
| `logs/pko_run1.log` | PKO training log |
| `logs/eval_pko.log` | PKO eval-only log |
| `logs/sensitivity_scan.log` | Sensitivity scan results |
| `logs/baseline_restore.log` | Baseline restoration log |
25 changes: 24 additions & 1 deletion README.md
@@ -30,10 +30,16 @@ Happy training!

| Run | Score | Author | Summary | Date | Info |
|-----|------:|--------|---------|------|------|
| LeakyReLU² + Legal Score-First TTT + Parallel Muon | 1.1194 | abaybektursun | On PR #549: LeakyReLU(0.5)^2 + TTT + Parallel Muon on the PR #414 stack | 2026-03-23 | [info](records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/README.md) |
| 11L EMA + GPTQ-lite + warmdown3500 | 1.1228 | signalrush | On PR #374: GPTQ-lite clip search + EMA, plus warmdown3500 and QAT@0.15 | 2026-03-22 | [info](records/track_10min_16mb/2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233/README.md) |
| 11L Partial RoPE + LN Scale + EMA + XSA4 | 1.1248 | jfprincz | On PR #287: Partial RoPE (16/64) + layerwise LN scale | 2026-03-21 | [info](records/track_10min_16mb/2026-03-21_11L_XSA4_EMA_PartialRoPE_LateQAT_1.1248/README.md) |
| 11L XSA4 + EMA + Int6 MLP3x | 1.1271 | jfprincz | On PR #198: XSA on the last 4 layers + EMA replacing SWA | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_11L_XSA4_EMA_Int6_MLP3x_WD04_1.1271/README.md) |
| 11L Efficient Partial XSA | 1.1307 | unnir | On PR #198: Efficient Partial XSA on the deepest 3 layers | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_11L_EfficientPartialXSA_FA3_SWA120/README.md) |
| 10L Int5-MLP + BigramHash(10240) | 1.1428 | thwu1 | 10 layers, mixed int5/int6 quantization, BigramHash(10240), SWA(0.4), WD=0.04 | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/README.md) |
| Int6 MLP3x + SmearGate + BigramHash | 1.1458 | Raahil Shah | 3x MLP + SmearGate + BigramHash + OrthoInit + Muon WD + SWA | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA/README.md) |
| 11L MLP3x + Int6 QAT | 1.1502 | aruniyer | 11 layers, 3x MLP, int6 QAT, zstd-22, WD=0.04, sliding eval | 2026-03-20 | [info](records/track_10min_16mb/2026-03-19_MLP3x_QAT_Int6_SlidingWindow/README.md) |
| SmearGate + OrthoInit + Muon WD | 1.1556 | aquariouseworkman | SmearGate + BigramHash + 3x MLP + int6 STE QAT + sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_smeargate_orthoinit_muonwd/README.md) |
| Ternary Quantization | 1.1570 | Ciprian-Florin Ifrim | 73.7M params quantized to 1 0 -1 + misc arch changes | 2026-03-24 | [info](records/track_10min_16mb/2026-03-24_74M_Ternary_UNet_FP8_10L_8192BPE_YaRN_NeoMuon/README.md) |
| 10L Int6 QAT + Zstd MLP2.6x | 1.1586 | yahya010 | 10 layers, int6 QAT + zstd-22, MLP 1344, Muon 0.99, sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_Seq2048_FP16Emb_TunedLR/README.md) |
| Mixed Quant + Sliding Window Eval | 1.1630 | aquariouseworkman | Int6 block weights + int8 embeddings + 3x MLP + sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_MixedQuant_Int6Int8_SlidingWindow/README.md) |
| Muon WD + 10 layer | 1.1748 | notapplica | Includes prev. wins + Spectral embed init + resid mix | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/README.md) |
@@ -45,12 +45,29 @@ Happy training!
| fp16 Embed | 1.2197 | Renier Velazco | FP16 Tied Embedding + LR/Warmdown Tuning | 2026-03-18 | [info](records/track_10min_16mb/2026-03-18_FP16Embed_WD3600/README.md) |
| Naive Baseline | 1.2244 | Baseline | 9layer 512dim 1024vocab TiedEmbeddings 4 KV heads | 2026-03-18 | [info](records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md) |

#### Notable Non-Record Runs
#### Unlimited Compute Leaderboard & Non-record Submissions

| Run | Score | Author | Summary | Date | Info |
|-----|------:|--------|---------|------|------|
| 1 Bit Quantization | 1.1239 | Ciprian-Florin Ifrim | 106M params quantized to 1 bit + misc arch changes + 2hr training | 2026-03-24 | [info](records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md) |
| 4-Hour Baseline | 1.2074 | Will DePue | Testing unlimited compute, 4 hours on 8xH100 | 2026-03-18 | [info](records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md) |

#### Requests for PRs

Breakthrough ideas are rarely immediately state-of-the-art; instead, they develop slowly, first demonstrating signs of life, then being iterated on, and only ultimately optimized on the systems side. Don't get discouraged if a new algorithm doesn't instantly beat the best leaderboard run or even the naive baseline. If you have an idea you believe in, consider ignoring step times early on: once you prove you can beat the baseline in the same number of steps, you can then start focusing on how to also make it fast.

We'd love to see weird & creative ideas in the challenge, since you never know what may work in the end. Most likely, these will be a good fit in our unlimited compute leaderboard as non-record submissions. We have some requests for what we'd love to see people implement:

- [x] 1-bit quantization - [implementation](records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md)
- [x] Ternary quantization - [implementation](records/track_10min_16mb/2026-03-24_74M_Ternary_UNet_FP8_10L_8192BPE_YaRN_NeoMuon/README.md)
- [ ] JEPA
- [ ] Text diffusion
- [ ] H-net tokenization
- [ ] Universal transformer - We have lots of depth recurrence submissions, but I'd love to see one 4 hour
- [ ] Megakernels
- [ ] State-space models, E2E TTT, super long context for evaluation or training
- [ ] Learning adapters on random linear maps

## Getting Started

### Training Your First Model (Mac with Apple Silicon)
125 changes: 125 additions & 0 deletions _top_ref/README.md
@@ -0,0 +1,125 @@
# Record: 3-Seed Compliance Reproduction — Support for PR #1851

**val_bpb = 1.06145** (3-seed mean ± 0.00068) | **~15.95 MB** | 8×H100 SXM 80GB

## Summary

This is a **3-seed compliance reproduction and support package** for [PR #1851](https://github.com/openai/parameter-golf/pull/1851) by @aquariouseworkman (SmearGate BOS Fix + PR #1787 Base + LQER Asymmetric + Phased TTT).

The purpose of this package is to:
1. Provide statistical significance evidence (3 seeds) for the PR #1851 result.
2. Confirm that results are reproducible across seeds by an independent party.
3. Document a compliance re-run demonstrating GPTQ fits within the 600s training budget.

**No new ML technique is introduced.** This package reproduces the exact code and configuration from PR #1851.

## 3-Seed Results (Original Runs)

These are the originally committed results. Seed 42 is from @aquariouseworkman's PR #1851 submission; seeds 314 and 1234 were run by @Christopher-Lee-McClendon as independent reproductions using the same code and environment variables.

| Seed | Post-TTT BPB | Artifact (bytes) | Eval Time | Source |
|------|-------------|------------------|-----------|--------|
| 42 | **1.06128183** | 15,952,086 | 519.5s | PR #1851 (@aquariouseworkman) |
| 314 | **1.06086831** | 15,952,419 | 525.6s | Reproduction (@Christopher-Lee-McClendon) |
| 1234 | **1.06220261** | 15,952,690 | 479.6s | Reproduction (@Christopher-Lee-McClendon) |
| **Mean ± Std** | **1.06145 ± 0.00068** | | | |

All artifacts < 16,000,000 bytes ✓
All eval times < 600s ✓
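
The quoted mean and spread can be checked directly from the per-seed numbers:

```python
from statistics import mean, stdev

bpb = [1.06128183, 1.06086831, 1.06220261]        # seeds 42, 314, 1234
print(round(mean(bpb), 5), round(stdev(bpb), 5))  # 1.06145 0.00068 (sample std)
```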

### Log Files (Original)

- `train_seed42_pr1851_original.log` — Seed 42 from PR #1851 by @aquariouseworkman
- `train_seed314_original.log` — Seed 314 reproduction by @Christopher-Lee-McClendon
- `train_seed1234_original.log` — Seed 1234 reproduction by @Christopher-Lee-McClendon

## Compliance Re-run Evidence (GPTQ Within 600s)

The original runs used `GPTQ_RESERVE_SECONDS=0.5`, which let the training loop run until ~599.6s. GPTQ Hessian collection (which accesses training data) adds ~3.5s, potentially extending past the 600s budget.

To confirm compliance, all 3 seeds were re-run with `GPTQ_RESERVE_SECONDS=8.0`, ensuring the training loop ends at ~592s and GPTQ Hessians complete by ~595.5s (well within 600s). The only code change is the timing margin — no ML change.

| Seed | Post-TTT BPB (re-run) | Train Time | GPTQ Ends By | Artifact (bytes) |
|------|----------------------|------------|--------------|------------------|
| 42 | 1.06083288 | 592.1s | ~595.5s ✅ | 15,949,701 |
| 314 | 1.06090748 | 592.0s | ~595.5s ✅ | 15,951,777 |
| 1234 | 1.06248776 | 592.1s | ~595.5s ✅ | 15,951,968 |
| **Mean ± Std** | **1.06141 ± 0.00093** | | | |

**No statistically significant difference:** Original mean 1.06145 vs re-run mean 1.06141 (delta = −0.00004, well within 1-sigma noise). This confirms the GPTQ reserve setting has negligible impact on model quality.

### Re-run Log Files

- `train_seed42_rerun_gptq8s.log`
- `train_seed314_rerun_gptq8s.log`
- `train_seed1234_rerun_gptq8s.log`

### What Changed in Re-run

1. **`GPTQ_RESERVE_SECONDS` 0.5 → 8.0** — Training loop ends ~8s early for GPTQ headroom.
2. **Serialize-before-diagnostic reordering** — Artifact written immediately after GPTQ, before pre-quant diagnostic eval.
3. **Timing instrumentation** — `serialize_wallclock` and `artifact_production_wallclock` logged for transparency.

### GPTQ Timing Breakdown (Re-run)

| Phase | Time | Accesses Training Data? |
|-------|------|------------------------|
| Training loop (with 8s reserve) | ~592s | ✅ Yes |
| Hessian collection | ~3.5s | ✅ Yes |
| **Total training-data-access time** | **~595.5s** | **< 600s ✅** |
| Quantization | ~10.1s | ❌ No (uses cached Hessians) |
| Brotli compression | ~65-67s | ❌ No (pure I/O) |

## Technique Stack

All techniques inherited from PR #1851 and its lineage. No new techniques introduced.

| Technique | Source |
|-----------|--------|
| Base architecture (11L, MLP 4×, MuonEq-R) | PR #1787 (@nprime06) |
| SmearGate attention + BOS fix | PR #1797 (@dexhunter) + PR #1851 (@aquariouseworkman) |
| LQER Asymmetric quantization | PR #1797 (@dexhunter) |
| CaseOps SP8192 | PR #1729 (@romeerp) |
| GPTQ + SP8192 | PR #1394 (@clarkkev) |
| Score-first TTT (3 phases) | PR #549 (@abaybektursun) |
| BOS bug identification | @cocohearts |

## Architecture

11L × 512d × 8H/4KV, MLP 4×, LeakyReLU(0.5)², Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: layers 3–5 looped ×2 (activated at frac=0.35). Parallel residuals from layer 8. XSA on all 11 layers. SmearGate window=12.
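
Logit softcapping at 30.0 is commonly implemented as a scaled tanh; assuming that form here (the repo's exact variant may differ):

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Near-identity for |logit| << cap, smoothly saturating toward +/- cap.
    return cap * math.tanh(logit / cap)

print(round(softcap(5.0), 3))  # ~4.954: small logits pass through almost unchanged
print(softcap(1000.0))         # saturates at the cap of 30.0
```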

## Reproduction

```bash
# Install dependencies
pip install brotli python-minifier

# Prepare CaseOps SP8192 data
python3 prepare_caseops_data.py # downloads from romeerp/parameter-golf-caseops-v1

# Run training (replace SEED with 42, 314, or 1234)
SEED=42 \
CASEOPS_ENABLED=1 \
EMBED_BITS=7 \
SMEAR_GATE_ENABLED=1 \
SPARSE_ATTN_GATE_ENABLED=1 \
MIN_LR=0.1 \
EMBED_CLIP_SIGMAS=15.0 \
MLP_CLIP_SIGMAS=12.0 \
GPTQ_RESERVE_SECONDS=8.0 \
PHASED_TTT_NUM_PHASES=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

**Hardware:** 8×H100 SXM 80GB (RunPod)

## Credits

- **@aquariouseworkman** — PR #1851 author (SmearGate BOS fix, original seed 42 result)
- **@nprime06** — PR #1787 (base architecture)
- **@romeerp** — PR #1729 (CaseOps)
- **@dexhunter** — PR #1797 (SmearGate + LQER asymmetric quantization)
- **@cocohearts** — BOS document boundary bug identification
- **@abaybektursun** — PR #549 (score-first TTT)
- **@clarkkev** — PR #1394 (GPTQ + SP8192)
- **@Christopher-Lee-McClendon** — Seeds 314/1234 reproduction and compliance re-run