Changes from all commits (77 commits)
- `65261c9` Added my understanding of train_gpt.py (Itssshikhar, Mar 21, 2026)
- `e1b518e` restored some original changes as well in mu_understanding.md file. (Itssshikhar, Mar 21, 2026)
- `85451b0` Record: 11L + Efficient Partial XSA (val_bpb: 1.1307) — NEW SOTA (unnir, Mar 20, 2026)
- `14d9389` Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271) (jfprincz, Mar 20, 2026)
- `6937a42` Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248) (jfprincz, Mar 21, 2026)
- `336c711` Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233… (signalrush, Mar 22, 2026)
- `58f1a98` Update README leaderboard with merged records (cocohearts, Mar 23, 2026)
- `843091a` Use GitHub usernames in new leaderboard rows (cocohearts, Mar 23, 2026)
- `4a43355` Describe leaderboard entries by base-run diff (cocohearts, Mar 23, 2026)
- `7563398` Record: LeakyReLU² + Legal TTT + Parallel Muon — val_bpb 1.1194 (3-se… (abaybektursun, Mar 23, 2026)
- `b6722b8` Fix pre-TTT BPB, TTT gains, and steps to match logs exactly (abaybektursun, Mar 23, 2026)
- `d8bd62f` Fix author attributions: PR #493 @parinzee, PR #461 @Christopher-Lee-… (abaybektursun, Mar 23, 2026)
- `79ab878` Update README.md (valerio-oai, Mar 24, 2026)
- `ef7070e` Update README.md (valerio-oai, Mar 24, 2026)
- `fe045a9` Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary Asymmetric … (CiprianFlorin-Ifrim, Mar 25, 2026)
- `068bf80` Update README.md (0hq, Mar 25, 2026)
- `f4e09d4` Record Submission: 1.1570 BPB - 73.7M Ternary U-Net (10L 768d 8192BPE… (CiprianFlorin-Ifrim, Mar 25, 2026)
- `3623ba2` Update README.md (0hq, Mar 25, 2026)
- `2f9b80c` Update README.md (0hq, Mar 25, 2026)
- `cf56bcb` Update README.md (0hq, Mar 25, 2026)
- `b6c3e52` Update README.md (0hq, Mar 25, 2026)
- `65269a2` Update README.md (0hq, Mar 25, 2026)
- `5372bea` Non-record: Depth Recurrence in Parameter-Constrained Transformers — … (evangelinehelsinki, Mar 25, 2026)
- `6b079fb` Add Gram Newton-Schulz integration on top of #1 submission (Itssshikhar, Mar 31, 2026)
- `c4b9b46` Fix Gram Newton-Schulz to match Dao-AILab reference implementation (Itssshikhar, Apr 2, 2026)
- `557cde4` Fix 8xH100 runtime bugs + document full run results (val_bpb 1.1228) (Itssshikhar, Apr 2, 2026)
- `ed7f292` Use actual Dao-AILab gram-newton-schulz kernels instead of pure PyTorch (Itssshikhar, Apr 2, 2026)
- `99a8196` Fix contiguity bug + document kernel run results (99.02ms, SLOWER tha… (Itssshikhar, Apr 2, 2026)
- `4f2879e` Rewrite Gram NS: inline, same coefficients, same bf16, no library (Itssshikhar, Apr 3, 2026)
- `f58e089` Add sliding window attention experiment + full debugging journey (Itssshikhar, Apr 4, 2026)
- `025f348` seq4096 breakthrough: 1.1130 BPB sliding eval (beats #1's 1.1147) (Itssshikhar, Apr 6, 2026)
- `be1a166` Add run instructions for all configs (direct + Modal + 4090 smoke test) (Itssshikhar, Apr 6, 2026)
- `68eafdf` 3-seed reproduction: 1.1165 mean BPB — 1.1130 claim does NOT hold (Itssshikhar, Apr 7, 2026)
- `9d5a37a` Document perf investigation: 89ms local vs 82ms Modal is container-li… (Itssshikhar, Apr 7, 2026)
- `5b2cd04` Document NanoGPT technique stacking: QK gain 2.5 + sigmoid rescale + PKO (Itssshikhar, Apr 8, 2026)
- `070fff5` Document sparse attention gate results + quantization wall analysis (Itssshikhar, Apr 9, 2026)
- `3cca46c` Bank QAT on all F.linear weights: 3-seed mean 1.1117 BPB (beats #1 by… (Itssshikhar, Apr 9, 2026)
- `f5d137b` Day 1 stack: SP8192 + SDClip + Brotli + WD 0.085 + MLP 4x as new defa… (Itssshikhar, Apr 25, 2026)
- `9465963` added sota approaches in txt file (Itssshikhar, Apr 25, 2026)
- `bae4f61` Scylla TM998 integration + raw retokenization + S2 training run (Itssshikhar, Apr 25, 2026)
- `d654799` Add Scylla tokenizer files, 3-seed runner, and data manifest (Itssshikhar, Apr 25, 2026)
- `7508362` v7 recurrence quantization fixes: INT8 recurred layers + skip-recur-e… (Itssshikhar, Apr 27, 2026)
- `fa8678d` Add v1-v3 recurrence logs, SP8192 tokenizer script, and tokenizer spe… (Itssshikhar, Apr 27, 2026)
- `d7327d5` trying out better hessian for recurrence layers. (Itssshikhar, Apr 27, 2026)
- `432f3b8` v12 bankless rewrite: per-layer CastedLinear, quant gap 0.084→0.005 (Itssshikhar, Apr 27, 2026)
- `cc619e1` Save PR #1394 reference code and hessian fix doc (Itssshikhar, Apr 27, 2026)
- `7100db1` v13 plan: staged strip-down + capacity recovery (openhands-agent, Apr 27, 2026)
- `0202216` v13 experiments: VE strip baseline + TTT catastrophic failure (Itssshikhar, Apr 27, 2026)
- `e15c752` v13 analysis: lock v13a, fix TARGET_MB bug, retry TTT before declarin… (openhands-agent, Apr 28, 2026)
- `b303cad` v13 complete: 9 experiments, v13a remains best at 1.0955 BPB (Itssshikhar, Apr 28, 2026)
- `2332d0b` v14 experiments: QAT, PKO, mixed-precision sensitivity scan (Itssshikhar, Apr 28, 2026)
- `2a97eff` Add all experiment scripts, logs, and training history (Itssshikhar, Apr 28, 2026)
- `5f35bb9` Add PR1493 priority experiment harness (openhands-agent, Apr 28, 2026)
- `8ba15a5` Add PR1493 priority experiments live results log (Itssshikhar, Apr 28, 2026)
- `190134d` Record docshuffle result: q_ttt=1.08279 (regression) (Itssshikhar, Apr 29, 2026)
- `6b60ed9` Record wd result: q_ttt=1.08029 (small real win) (Itssshikhar, Apr 29, 2026)
- `8a82456` Record iha failure: harness bug in GPTQ Hessian collection (Itssshikhar, Apr 29, 2026)
- `10ca871` Fix iha stop_step typo: 4527 -> 4524 (Itssshikhar, Apr 29, 2026)
- `ded7e22` Record mtp result: q_ttt=1.09023 (clear regression) (Itssshikhar, Apr 29, 2026)
- `74dc702` Add PR1493 stacking experiments (openhands-agent, Apr 29, 2026)
- `49d1068` Record real wd_paired result: q_ttt=1.07974 + add safe_launch guard (Itssshikhar, Apr 29, 2026)
- `2a8fbca` Record wd_strong_paired result: q_ttt=1.07971 (no stack vs wd_paired) (Itssshikhar, Apr 29, 2026)
- `2285bc6` Record wd_paired_iha kill: pre=1.08666 worse than wd_paired (1.08610) (Itssshikhar, Apr 29, 2026)
- `468be92` GPTQ Hessian all-reduce + damp/block sweep (Itssshikhar, Apr 30, 2026)
- `cd87935` SmearGate+attn_gate port (regression at single seed) + pivot doc to c… (Itssshikhar, Apr 30, 2026)
- `ec48ff1` Port wd_schedule onto PR1851 base as train_top.py (Itssshikhar, Apr 30, 2026)
- `3f661b3` Document PR1851 wd_strong port + portability audit of PR1493 stack (Itssshikhar, Apr 30, 2026)
- `6c53583` GPTQ Hessian all-reduce on PR1851 base (Itssshikhar, Apr 30, 2026)
- `97fc8a5` Port paired-head Muon NS to PR1851 bank architecture (Itssshikhar, Apr 30, 2026)
- `01347da` Document Run 1 (AR alone) + reversal of wd_strong verdict (Itssshikhar, Apr 30, 2026)
- `d3aa1bc` Document Runs 1-3 + final audit of PR1493-stack portability to PR1851 (Itssshikhar, Apr 30, 2026)
- `611b598` Switch to PR #1855 base + port wd_schedule + AR onto train_top_1855.py (Itssshikhar, Apr 30, 2026)
- `4a73033` Document Run 4 (PR1851 + 9 hparams + wd_strong + AR) — best q_ttt yet (Itssshikhar, Apr 30, 2026)
- `dba2da7` Document PR1855 wd_strong AR run (openhands-agent, Apr 30, 2026)
- `d0eb37c` Document Run 4 pergroup recovery plan (openhands-agent, Apr 30, 2026)
- `0209a50` Port PR #1855 pergroup lrzip compressor into train_top.py (Itssshikhar, Apr 30, 2026)
- `074817a` Run 6: PR1851 + 9hp + wd_strong + AR + ported pergroup — best valid q… (Itssshikhar, Apr 30, 2026)
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,4 +8,6 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
*.pt
*.ptz
*.log
255 changes: 255 additions & 0 deletions 3seed_s1337.log

Large diffs are not rendered by default.

252 changes: 252 additions & 0 deletions 3seed_s42.log

Large diffs are not rendered by default.

263 changes: 263 additions & 0 deletions 3seed_s7.log

Large diffs are not rendered by default.

123 changes: 123 additions & 0 deletions EXPERIMENTS.md
@@ -0,0 +1,123 @@
# Experimental Findings — Parameter Golf (April 2026)

## Objective
Beat PR #1493's SOTA of **1.0810 BPB** (3-seed mean) in the parameter-golf competition.
Constraints: 16,000,000 bytes (decimal) artifact cap, 600s train+eval wallclock, 8×H100 GPUs.

## Baseline: PR #1493
- **Architecture**: 11L×512d×8H/4KV, 3-layer recurrence (L3-5 at frac=0.35), parallel residuals (L7+), tied embeddings, logit softcap=30, XSA on all 11 layers, U-Net skip connections
- **Quantization**: int6 matrices (clip_sigmas=12.85, clip_range=31), int8 embeddings (clip_sigmas=20.0, clip_range=127), brotli compression, GPTQ
- **Results** (SEED=42, QK_GAIN_INIT=5.25):
- Pre-quant non-sliding: **1.08757**
- Post-quant non-sliding: **1.10014** (quant gap = 0.01257)
- Post-quant sliding: **1.08329**
- Post-quant TTT: **1.08103**
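
The exact quantizer lives in the submission code; a minimal sketch of the `clip_sigmas`/`clip_range` scheme described above (function name hypothetical) might look like:

```python
import torch

def fake_int6_quant(w: torch.Tensor, clip_sigmas: float = 12.85, clip_range: int = 31) -> torch.Tensor:
    # Symmetric clip-and-round: clip at clip_sigmas standard deviations,
    # then map onto the 2*clip_range + 1 = 63 integer levels of int6.
    scale = clip_sigmas * w.std() / clip_range
    q = torch.round(w / scale).clamp_(-clip_range, clip_range)
    return q * scale  # dequantized view; the artifact stores q plus the scale

w = torch.randn(512, 512)
wq = fake_int6_quant(w)
```

After brotli, the integer grid `q` is what actually gets compressed, which is why the clip parameters trade reconstruction error against compressed size.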

---

## Experiment 1: Quantization-Aware Training (QAT)

**Hypothesis**: Fake-quantize STE during warmdown lets the model adapt to quantization noise, reducing the quant gap.

**Implementation** (`train_qat.py`):
- Forward: `w + (quantize(w) - w).detach()` — uses quantized weights, backward passes gradients as identity
- Applied to all CastedLinear (numel > 65536) + embedding when `_QAT_ACTIVE=True`
- Toggled on at configurable training fraction
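
The STE forward line above can be exercised in isolation (a sketch; the real code applies it inside `CastedLinear`):

```python
import torch

def ste_fake_quant(w: torch.Tensor, scale: float = 0.1, clip_range: int = 31) -> torch.Tensor:
    # Forward computes with quantized weights; .detach() makes the backward
    # pass treat the quantizer as the identity (straight-through estimator).
    wq = torch.round(w / scale).clamp(-clip_range, clip_range) * scale
    return w + (wq - w).detach()

w = torch.randn(8, 8, requires_grad=True)
ste_fake_quant(w).sum().backward()  # gradients flow as if no quantizer were present
```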

**Runs**:
| Run | Start Frac | Strategy | Post-Quant Non-Sliding | Notes |
|-----|-----------|----------|----------------------|-------|
| run1 | 0.28 | Standard | CRASHED | `FailOnRecompileLimitHit` (fixed with `cache_size_limit=32`) |
| run2 | 0.28 | Standard | 1.1118 | +0.012 worse than baseline |
| run3 | 0.55 | Standard | 1.1111 | Similar degradation |
| run4 | 0.85 | Standard | 1.1092 | Slightly better but still worse |
| run5 | 0.28 | EMA-then-finetune | 1.1194 | Worst — optimizer state mismatch |

**Root Cause**: EMA contamination. With decay=0.9965, the last ~300 steps dominate the EMA average. QAT noise in those final steps contaminates the EMA irreversibly. The EMA model (which gets saved) never fully benefits from QAT adaptation.
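
The "last ~300 steps dominate" claim follows from the decay arithmetic alone:

```python
import math

decay = 0.9965
# Fraction of the EMA's total weight contributed by the most recent 300 updates,
# and the number of steps over which a contribution's weight halves.
mass_last_300 = 1 - decay ** 300
half_life = math.log(2) / -math.log(decay)

print(round(mass_last_300, 2))  # 0.65 -> the last 300 steps carry ~65% of the average
print(round(half_life))         # ~198-step half-life
```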

**Conclusion**: QAT is incompatible with this training setup. The EMA mechanism, which is critical for final model quality, cannot coexist with QAT noise injection.

---

## Experiment 2: Partial Key Offset (PKO)

**Hypothesis**: Shifting non-RoPE key dimensions by 1 position enables implicit bigram/position awareness.

**Implementation** (`train_pko.py`, `eval_pko.py`):
```python
rd = self.rope_dims  # 16 RoPE dims; dims rd: carry no positional encoding
if seqlen > 1:
    k = k.clone()
    # shift the non-RoPE key dims back by one position
    k[:, 1:, :, rd:] = k[:, :-1, :, rd:]
```

**Results**:

### Eval-only PKO (applied to pre-trained model without retraining):
- Baseline: **1.10014**
- PKO all layers: **~1.68** (catastrophic failure)
- PKO encoder-only: **~1.68** (catastrophic failure)

### Training with PKO:
- Pre-quant: **1.08829** (vs baseline 1.08757, +0.0007 worse)
- Post-quant sliding: **1.08420** (vs baseline 1.08329, +0.0009 worse)
- **Post-quant TTT: 1.10474** (catastrophic — TTT makes it WORSE by 0.02 BPB)

**Root Cause**: PKO shifts create non-standard key representations that break TTT's gradient-based weight adaptation. TTT assumes standard attention semantics; the shifted keys create an optimization landscape that gradient updates can't navigate.

**Conclusion**: PKO is incompatible with TTT. Since TTT provides the critical final ~0.002 BPB gain, PKO is a net negative.

---

## Experiment 3: Mixed-Precision Sensitivity Scan

**Hypothesis**: With code packing (~16,600 bytes for submission code), we have freed budget to promote high-sensitivity layers from int6 to int7.

**Implementation** (`sensitivity_scan.py`):
- For each quantizable matrix: test int7 (clip_range=63) while keeping others at int6
- Measure extra compressed bytes and BPB improvement
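
The "over budget" verdict is forced by simple arithmetic (taking the smallest layer as 2^17 = 131,072 params, an assumption consistent with the ~131K figure below):

```python
# int6 -> int7 adds exactly one bit per parameter, and the extra low-order
# bits are near-random, so brotli recovers almost none of them.
params_smallest = 131_072            # smallest quantizable matrix (~131K params)
extra_bytes = params_smallest // 8   # +1 bit/param = 16,384 raw bytes
budget_bytes = 6_447                 # free margin after model + packed code

print(extra_bytes, extra_bytes > budget_bytes)  # 16384 True
```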

**Results**:
- Estimated budget: 6,447 bytes (16M - baseline_model - 16,600 code)
- Minimum cost to promote any matrix: ~16,500 bytes (even the smallest 131K-param layers)
- **ALL matrices marked OVER budget**

**Conclusion**: The byte budget is far too tight for mixed precision. Even the smallest matrix promotion exceeds available margin after compression.

---

## Strategic Analysis: What's Left on the Table

### High-EV opportunities (not yet tested):
1. **MTP (Multi-Token Prediction)** — auxiliary loss predicting t+2, t+3 tokens during training. Heads discarded at inference → zero byte cost. v13 codebase already has an implementation. Estimated potential: 0.002-0.005 BPB pre-quant improvement.
2. **More recurrence at eval-only** — using num_loops=3 during eval (trained with 2). May give ~0.001 BPB gain at inference time cost within 600s budget.
3. **Hyperparameter tuning** — warmdown_frac, learning rate schedule, MuonEQR parameters.
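
A minimal sketch of what an MTP auxiliary loss could look like (toy shapes and offsets for illustration; not the v13 implementation):

```python
import torch
import torch.nn.functional as F

B, T, D, V = 2, 16, 32, 64                # toy batch/seq/hidden/vocab sizes
hidden = torch.randn(B, T, D)             # trunk activations
tokens = torch.randint(V, (B, T))
mtp_heads = torch.nn.ModuleList(torch.nn.Linear(D, V) for _ in range(2))

aux_loss = torch.tensor(0.0)
for off, head in zip((2, 3), mtp_heads):  # extra heads predict tokens t+2, t+3
    logits = head(hidden[:, :-off])       # only positions that have a t+off target
    aux_loss = aux_loss + F.cross_entropy(
        logits.reshape(-1, V), tokens[:, off:].reshape(-1)
    )
# mtp_heads are dropped before serialization: training-time signal, zero artifact bytes
```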

### Dead ends (definitively ruled out):
- QAT with EMA-based training
- PKO (incompatible with TTT)
- Mixed precision promotion (budget too tight)
- Eval-only architectural modifications on pre-trained models

### Key insight:
The quant gap (0.0126 BPB) is the biggest single loss. Reducing it requires either:
- Better GPTQ (different Hessian estimation, more calibration data)
- Smaller model that quantizes better (but then pre-quant suffers)
- Training changes that make weights more quantization-friendly WITHOUT explicit QAT noise

---

## File Index

| File | Purpose |
|------|---------|
| `train_qat.py` | QAT-modified training script |
| `train_pko.py` | PKO-modified training script |
| `eval_pko.py` | Eval-only PKO test |
| `sensitivity_scan.py` | Mixed-precision sensitivity scanner |
| `run_qat.sh` | QAT run configuration |
| `logs/qat_run*.log` | QAT training logs (5 runs) |
| `logs/pko_run1.log` | PKO training log |
| `logs/eval_pko.log` | PKO eval-only log |
| `logs/sensitivity_scan.log` | Sensitivity scan results |
| `logs/baseline_restore.log` | Baseline restoration log |
25 changes: 24 additions & 1 deletion README.md
@@ -30,10 +30,16 @@ Happy training!

| Run | Score | Author | Summary | Date | Info |
|-----|------:|--------|---------|------|------|
| LeakyReLU² + Legal Score-First TTT + Parallel Muon | 1.1194 | abaybektursun | On PR #549: LeakyReLU(0.5)^2 + TTT + Parallel Muon on the PR #414 stack | 2026-03-23 | [info](records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/README.md) |
| 11L EMA + GPTQ-lite + warmdown3500 | 1.1228 | signalrush | On PR #374: GPTQ-lite clip search + EMA, plus warmdown3500 and QAT@0.15 | 2026-03-22 | [info](records/track_10min_16mb/2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233/README.md) |
| 11L Partial RoPE + LN Scale + EMA + XSA4 | 1.1248 | jfprincz | On PR #287: Partial RoPE (16/64) + layerwise LN scale | 2026-03-21 | [info](records/track_10min_16mb/2026-03-21_11L_XSA4_EMA_PartialRoPE_LateQAT_1.1248/README.md) |
| 11L XSA4 + EMA + Int6 MLP3x | 1.1271 | jfprincz | On PR #198: XSA on the last 4 layers + EMA replacing SWA | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_11L_XSA4_EMA_Int6_MLP3x_WD04_1.1271/README.md) |
| 11L Efficient Partial XSA | 1.1307 | unnir | On PR #198: Efficient Partial XSA on the deepest 3 layers | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_11L_EfficientPartialXSA_FA3_SWA120/README.md) |
| 10L Int5-MLP + BigramHash(10240) | 1.1428 | thwu1 | 10 layers, mixed int5/int6 quantization, BigramHash(10240), SWA(0.4), WD=0.04 | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50/README.md) |
| Int6 MLP3x + SmearGate + BigramHash | 1.1458 | Raahil Shah | 3x MLP + SmearGate + BigramHash + OrthoInit + Muon WD + SWA | 2026-03-20 | [info](records/track_10min_16mb/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA/README.md) |
| 11L MLP3x + Int6 QAT | 1.1502 | aruniyer | 11 layers, 3x MLP, int6 QAT, zstd-22, WD=0.04, sliding eval | 2026-03-20 | [info](records/track_10min_16mb/2026-03-19_MLP3x_QAT_Int6_SlidingWindow/README.md) |
| SmearGate + OrthoInit + Muon WD | 1.1556 | aquariouseworkman | SmearGate + BigramHash + 3x MLP + int6 STE QAT + sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_smeargate_orthoinit_muonwd/README.md) |
| Ternary Quantization | 1.1570 | Ciprian-Florin Ifrim | 73.7M params quantized to 1 0 -1 + misc arch changes | 2026-03-24 | [info](records/track_10min_16mb/2026-03-24_74M_Ternary_UNet_FP8_10L_8192BPE_YaRN_NeoMuon/README.md) |
| 10L Int6 QAT + Zstd MLP2.6x | 1.1586 | yahya010 | 10 layers, int6 QAT + zstd-22, MLP 1344, Muon 0.99, sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_Seq2048_FP16Emb_TunedLR/README.md) |
| Mixed Quant + Sliding Window Eval | 1.1630 | aquariouseworkman | Int6 block weights + int8 embeddings + 3x MLP + sliding eval | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_MixedQuant_Int6Int8_SlidingWindow/README.md) |
| Muon WD + 10 layer | 1.1748 | notapplica | Includes prev. wins + Spectral embed init + resid mix | 2026-03-19 | [info](records/track_10min_16mb/2026-03-19_SlidingWindow_FP16Emb_10L_MuonWD_OvertoneInit/README.md) |
@@ -45,12 +45,29 @@ Happy training!
| fp16 Embed | 1.2197 | Renier Velazco | FP16 Tied Embedding + LR/Warmdown Tuning | 2026-03-18 | [info](records/track_10min_16mb/2026-03-18_FP16Embed_WD3600/README.md) |
| Naive Baseline | 1.2244 | Baseline | 9layer 512dim 1024vocab TiedEmbeddings 4 KV heads | 2026-03-18 | [info](records/track_10min_16mb/2026-03-17_NaiveBaseline/README.md) |

#### Notable Non-Record Runs
#### Unlimited Compute Leaderboard & Non-record Submissions

| Run | Score | Author | Summary | Date | Info |
|-----|------:|--------|---------|------|------|
| 1 Bit Quantization | 1.1239 | Ciprian-Florin Ifrim | 106M params quantized to 1 bit + misc arch changes + 2hr training | 2026-03-24 | [info](records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md) |
| 4-Hour Baseline | 1.2074 | Will DePue | Testing unlimited compute, 4 hours on 8xH100 | 2026-03-18 | [info](records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md) |

#### Requests for PRs

Breakthrough ideas are rarely immediately state-of-the-art; instead, they develop slowly, first demonstrating signs of life, then being iterated on, and only ultimately optimized on the systems side. Don't get discouraged if a new algorithm doesn't instantly beat the best leaderboard run or even the naive baseline. If you have an idea you believe in, consider ignoring step times early on: once you prove you can beat the baseline in the same number of steps, you can then start focusing on how to also make it fast.

We'd love to see weird & creative ideas in the challenge, since you never know what may work in the end. Most likely, these will be a good fit in our unlimited compute leaderboard as non-record submissions. We have some requests for what we'd love to see people implement:

- [x] 1-bit quantization - [implementation](records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md)
- [x] Ternary quantization - [implementation](records/track_10min_16mb/2026-03-24_74M_Ternary_UNet_FP8_10L_8192BPE_YaRN_NeoMuon/README.md)
- [ ] JEPA
- [ ] Text diffusion
- [ ] H-net tokenization
- [ ] Universal transformer - We have lots of depth recurrence submissions, but I'd love to see one 4 hour
- [ ] Megakernels
- [ ] State-space models, E2E TTT, super long context for evaluation or training
- [ ] Learning adapters on random linear maps

## Getting Started

### Training Your First Model (Mac with Apple Silicon)
125 changes: 125 additions & 0 deletions _top_ref/README.md
@@ -0,0 +1,125 @@
# Record: 3-Seed Compliance Reproduction — Support for PR #1851

**val_bpb = 1.06145** (3-seed mean ± 0.00068) | **~15.95 MB** | 8×H100 SXM 80GB

## Summary

This is a **3-seed compliance reproduction and support package** for [PR #1851](https://github.com/openai/parameter-golf/pull/1851) by @aquariouseworkman (SmearGate BOS Fix + PR #1787 Base + LQER Asymmetric + Phased TTT).

The purpose of this package is to:
1. Provide statistical significance evidence (3 seeds) for the PR #1851 result.
2. Confirm that results are reproducible across seeds by an independent party.
3. Document a compliance re-run demonstrating GPTQ fits within the 600s training budget.

**No new ML technique is introduced.** This package reproduces the exact code and configuration from PR #1851.

## 3-Seed Results (Original Runs)

These are the originally committed results. Seed 42 is from @aquariouseworkman's PR #1851 submission; seeds 314 and 1234 were run by @Christopher-Lee-McClendon as independent reproductions using the same code and environment variables.

| Seed | Post-TTT BPB | Artifact (bytes) | Eval Time | Source |
|------|-------------|------------------|-----------|--------|
| 42 | **1.06128183** | 15,952,086 | 519.5s | PR #1851 (@aquariouseworkman) |
| 314 | **1.06086831** | 15,952,419 | 525.6s | Reproduction (@Christopher-Lee-McClendon) |
| 1234 | **1.06220261** | 15,952,690 | 479.6s | Reproduction (@Christopher-Lee-McClendon) |
| **Mean ± Std** | **1.06145 ± 0.00068** | | | |

All artifacts < 16,000,000 bytes ✓
All eval times < 600s ✓
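
The quoted mean and spread can be checked directly from the per-seed numbers:

```python
from statistics import mean, stdev

bpb = [1.06128183, 1.06086831, 1.06220261]        # seeds 42, 314, 1234
print(round(mean(bpb), 5), round(stdev(bpb), 5))  # 1.06145 0.00068 (sample std)
```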

### Log Files (Original)

- `train_seed42_pr1851_original.log` — Seed 42 from PR #1851 by @aquariouseworkman
- `train_seed314_original.log` — Seed 314 reproduction by @Christopher-Lee-McClendon
- `train_seed1234_original.log` — Seed 1234 reproduction by @Christopher-Lee-McClendon

## Compliance Re-run Evidence (GPTQ Within 600s)

The original runs used `GPTQ_RESERVE_SECONDS=0.5`, which let the training loop run until ~599.6s. GPTQ Hessian collection (which accesses training data) adds ~3.5s, potentially extending past the 600s budget.

To confirm compliance, all 3 seeds were re-run with `GPTQ_RESERVE_SECONDS=8.0`, ensuring the training loop ends at ~592s and GPTQ Hessians complete by ~595.5s (well within 600s). The only code change is the timing margin — no ML change.

| Seed | Post-TTT BPB (re-run) | Train Time | GPTQ Ends By | Artifact (bytes) |
|------|----------------------|------------|--------------|------------------|
| 42 | 1.06083288 | 592.1s | ~595.5s ✅ | 15,949,701 |
| 314 | 1.06090748 | 592.0s | ~595.5s ✅ | 15,951,777 |
| 1234 | 1.06248776 | 592.1s | ~595.5s ✅ | 15,951,968 |
| **Mean ± Std** | **1.06141 ± 0.00093** | | | |

**No statistically significant difference:** Original mean 1.06145 vs re-run mean 1.06141 (delta = −0.00004, well within 1-sigma noise). This confirms the GPTQ reserve setting has negligible impact on model quality.

### Re-run Log Files

- `train_seed42_rerun_gptq8s.log`
- `train_seed314_rerun_gptq8s.log`
- `train_seed1234_rerun_gptq8s.log`

### What Changed in Re-run

1. **`GPTQ_RESERVE_SECONDS` 0.5 → 8.0** — Training loop ends ~8s early for GPTQ headroom.
2. **Serialize-before-diagnostic reordering** — Artifact written immediately after GPTQ, before pre-quant diagnostic eval.
3. **Timing instrumentation** — `serialize_wallclock` and `artifact_production_wallclock` logged for transparency.

### GPTQ Timing Breakdown (Re-run)

| Phase | Time | Accesses Training Data? |
|-------|------|------------------------|
| Training loop (with 8s reserve) | ~592s | ✅ Yes |
| Hessian collection | ~3.5s | ✅ Yes |
| **Total training-data-access time** | **~595.5s** | **< 600s ✅** |
| Quantization | ~10.1s | ❌ No (uses cached Hessians) |
| Brotli compression | ~65-67s | ❌ No (pure I/O) |

## Technique Stack

All techniques inherited from PR #1851 and its lineage. No new techniques introduced.

| Technique | Source |
|-----------|--------|
| Base architecture (11L, MLP 4×, MuonEq-R) | PR #1787 (@nprime06) |
| SmearGate attention + BOS fix | PR #1797 (@dexhunter) + PR #1851 (@aquariouseworkman) |
| LQER Asymmetric quantization | PR #1797 (@dexhunter) |
| CaseOps SP8192 | PR #1729 (@romeerp) |
| GPTQ + SP8192 | PR #1394 (@clarkkev) |
| Score-first TTT (3 phases) | PR #549 (@abaybektursun) |
| BOS bug identification | @cocohearts |

## Architecture

11L × 512d × 8H/4KV, MLP 4×, LeakyReLU(0.5)², Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: layers 3–5 looped ×2 (activated at frac=0.35). Parallel residuals from layer 8. XSA on all 11 layers. SmearGate window=12.
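
Logit softcapping at 30.0 is commonly implemented as a scaled tanh; assuming that form here (the repo's exact variant may differ):

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    # Near-identity for |logit| << cap, smoothly saturating toward +/- cap.
    return cap * math.tanh(logit / cap)

print(round(softcap(5.0), 3))  # ~4.954: small logits pass through almost unchanged
print(softcap(1000.0))         # saturates at the cap of 30.0
```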

## Reproduction

```bash
# Install dependencies
pip install brotli python-minifier

# Prepare CaseOps SP8192 data
python3 prepare_caseops_data.py # downloads from romeerp/parameter-golf-caseops-v1

# Run training (replace SEED with 42, 314, or 1234)
SEED=42 \
CASEOPS_ENABLED=1 \
EMBED_BITS=7 \
SMEAR_GATE_ENABLED=1 \
SPARSE_ATTN_GATE_ENABLED=1 \
MIN_LR=0.1 \
EMBED_CLIP_SIGMAS=15.0 \
MLP_CLIP_SIGMAS=12.0 \
GPTQ_RESERVE_SECONDS=8.0 \
PHASED_TTT_NUM_PHASES=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

**Hardware:** 8×H100 SXM 80GB (RunPod)

## Credits

- **@aquariouseworkman** — PR #1851 author (SmearGate BOS fix, original seed 42 result)
- **@nprime06** — PR #1787 (base architecture)
- **@romeerp** — PR #1729 (CaseOps)
- **@dexhunter** — PR #1797 (SmearGate + LQER asymmetric quantization)
- **@cocohearts** — BOS document boundary bug identification
- **@abaybektursun** — PR #549 (score-first TTT)
- **@clarkkev** — PR #1394 (GPTQ + SP8192)
- **@Christopher-Lee-McClendon** — Seeds 314/1234 reproduction and compliance re-run