@@ -0,0 +1,77 @@
# Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT

**val_bpb = 1.07785** (3-seed mean, std 0.00047) | **~15.99 MB** | 8xH100 SXM

## 3-Seed Results

| Seed | Sliding bpb | **TTT bpb** | Artifact (bytes) |
|------|-------------|-------------|-------------------|
| 42 | 1.07880 | **1.07718** | 15,990,780 |
| 314 | 1.07959 | **1.07810** | 15,987,449 |
| 999 | 1.07963 | **1.07826** | 15,987,550 |
| **Mean** | **1.07934** | **1.07785** | **15,988,593** |
| **Std** | **0.00039** | **0.00047** | |

Merged SOTA (PR #1493, our previous record): **1.0810 bpb**. Delta: **-0.0032 bpb**.

## Key Techniques

1. **Improved Parallel Residuals** (from PR #1529 @msisovic) -- cross-lane routing where attention and MLP outputs route to BOTH lanes via learned scalars. 66 new scalar params (`par_post[11,2,2]` + `par_resid[11,2]`). Final output = MLP lane (lane1). Starts at layer 7.

2. **Muon Momentum 0.97** (from PR #1514 @dexhunter) -- reduced from 0.99. The shorter momentum memory horizon (≈ 1/(1 - 0.97) ≈ 33 steps) better tracks the rapidly changing loss surface during warmdown.

3. **MATRIX_LR = 0.03** -- re-tuned for momentum 0.97 (a higher LR pairs with the lower momentum). Sweep (val bpb): 0.022 → 1.0797, 0.03 → 1.0795, 0.04 → 1.0811.

4. **3-Layer Depth Recurrence** (L3-5, activated at frac=0.35) -- 17 virtual layers from 11 physical; see the schedule sketch after this list.

5. **QK-Gain 5.25** -- raising the init from 4.0 to 5.25 improved val bpb monotonically.

6. **Legal Score-First TTT** -- SGD (lr=0.005, mom=0.9), 3 epochs per 32K-token chunk, cosine LR decay.

7. **SP8192 + GPTQ SDClip** -- int6 matrices (clip at k=12.85 sigmas), int8 embeddings (clip at k=20.0 sigmas), Brotli-11 compression.

8. **Tuned Hyperparameters** -- WD=0.095, EMA=0.9965, warmdown=0.72.
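
A minimal sketch of the virtual-layer schedule for technique 4, assuming the loop parameters from the training log (`loop_start=3`, `loop_end=5`, `num_loops=2`); the function and argument names are illustrative, not the actual `train_gpt.py` API:

```python
def depth_recurrence_schedule(num_layers=11, loop_start=3, loop_end=5, num_loops=2):
    """Physical block indices visited in one forward pass once looping
    activates (frac >= 0.35): 17 virtual layers from 11 physical blocks."""
    loop = list(range(loop_start, loop_end + 1))          # [3, 4, 5]
    return (list(range(loop_start))                       # [0, 1, 2]
            + loop * (1 + num_loops)                      # [3, 4, 5] repeated 3x
            + list(range(loop_end + 1, num_layers)))      # [6, ..., 10]

schedule = depth_recurrence_schedule()
assert len(schedule) == 17
# Equals the logged encoder [0,1,2,3,4,5,3,4] followed by decoder [5,3,4,5,6,7,8,9,10].

def run_blocks(x, blocks):
    # blocks: the 11 physical transformer blocks (e.g. an nn.ModuleList)
    for idx in depth_recurrence_schedule():
        x = blocks[idx](x)
    return x
```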

## Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10]. Improved parallel residuals from layer 7: attention reads from lane0, MLP reads from lane1, both outputs route to both lanes via learned `par_post` and `par_resid` scalars. Skip gates (sigmoid-gated U-Net connections).
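
A minimal sketch of the cross-lane routing described above, assuming a per-layer scalar layout of `[branch, lane]` and an identity initialization; class and argument names are illustrative rather than the actual `train_gpt.py` modules:

```python
import torch
import torch.nn as nn

class ImprovedParallelResidualBlock(nn.Module):
    """One layer with two residual lanes (used from layer 7 onward).
    Attention reads lane 0, the MLP reads lane 1, and each branch output is
    routed to BOTH lanes through learned scalars."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, dim: int):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm0, self.norm1 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # par_resid[lane]: scale on the carried residual of each lane
        self.par_resid = nn.Parameter(torch.ones(2))
        # par_post[branch, lane]: routing from branch (0 = attn, 1 = MLP) to lane
        # (identity init is an assumption: start as plain parallel residuals)
        self.par_post = nn.Parameter(torch.eye(2))

    def forward(self, lane0: torch.Tensor, lane1: torch.Tensor):
        a = self.attn(self.norm0(lane0))   # attention branch reads lane 0
        m = self.mlp(self.norm1(lane1))    # MLP branch reads lane 1
        new0 = self.par_resid[0] * lane0 + self.par_post[0, 0] * a + self.par_post[1, 0] * m
        new1 = self.par_resid[1] * lane1 + self.par_post[0, 1] * a + self.par_post[1, 1] * m
        return new0, new1                  # the final model output is taken from lane 1
```

Six scalars per layer (2 residual + 4 routing) stacked over 11 layers gives the 66 parameters counted above (`par_post[11,2,2]` + `par_resid[11,2]`).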

## Compliance (Track B)

Per Issue #1017:
- **Condition 1 (Causality):** Sliding-window eval, prefix only
- **Condition 2 (Normalized):** Standard softmax, no n-gram/logit bias
- **Condition 3 (Score before update):** Each chunk scored under `torch.no_grad()` BEFORE SGD
- **Condition 4 (Single pass):** Each token scored once, no rescoring

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.
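
A minimal sketch of the score-first loop enforcing Conditions 3 and 4, assuming `chunks` yields the 32K-token eval chunks in order and the model returns per-token losses; the harness names and the global cosine schedule are assumptions, not the exact eval code:

```python
import math
import torch

def score_first_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3):
    """Score each chunk with frozen weights BEFORE adapting on it (single pass)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tokens = 0.0, 0
    total_updates = len(chunks) * epochs
    for i, (inputs, targets) in enumerate(chunks):
        # Conditions 3/4: score first, under no_grad, each token scored exactly once.
        with torch.no_grad():
            total_nll += model(inputs, targets).sum().item()
        total_tokens += targets.numel()
        # Only afterwards: 3 SGD epochs on the just-scored chunk, cosine LR decay.
        for e in range(epochs):
            step = i * epochs + e
            for g in opt.param_groups:
                g["lr"] = lr * 0.5 * (1.0 + math.cos(math.pi * step / total_updates))
            opt.zero_grad(set_to_none=True)
            model(inputs, targets).mean().backward()
            opt.step()
    return total_nll / total_tokens  # mean NLL per token; the harness converts this to bpb
```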

## Reproduction

```bash
SEED=42 QK_GAIN_INIT=5.25 MUON_MOMENTUM=0.97 MATRIX_LR=0.03 \
TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
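
The two remaining seeds reuse the same command with `SEED=314` and `SEED=999`; only the seed differs between the three runs.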

## Credits

- **@msisovic** -- Improved parallel residuals (PR #1529, #1204)
- **@clarkkev** -- SP8192 + GPTQ + SDClip + MuonEq-R (PR #1394)
- **@dexhunter** -- Muon 0.97 (PR #1514), depth recurrence (PR #1331, #1437), TTT on SP8192 (PR #1413)
- **@abaybektursun** -- Score-first TTT framework (PR #549)
- **@X-Abhishek-X** -- Hyperparameter tuning (PR #1445, #1471)
- **@Robby955** -- Parallel residuals on SP8192 (PR #1412)

## Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod).

## Included Files

- `README.md` (this file)
- `submission.json`
- `train_gpt.py`
- `train_seed42.log`
- `train_seed314.log`
- `train_seed999.log`
@@ -0,0 +1,37 @@
{
  "author": "bigbag",
  "github_id": "bigbag",
  "name": "SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal Score-First TTT",
  "date": "2026-04-11",
  "track": "10min_16mb",
  "val_bpb": 1.07785,
  "val_bpb_std": 0.00047,
  "seeds": [42, 314, 999],
  "seed_results": {
    "42": {"val_bpb": 1.07718, "artifact_bytes": 15990780},
    "314": {"val_bpb": 1.07810, "artifact_bytes": 15987449},
    "999": {"val_bpb": 1.07826, "artifact_bytes": 15987550}
  },
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.9.1+cu128",
  "technique_summary": "SP8192 + Improved Parallel Residuals (cross-lane routing L7+) + 3-Layer Depth Recurrence (L3-5) + Muon 0.97 + LR 0.03 + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Score-First TTT (SGD 3ep) + GPTQ SDClip + Brotli",
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": true,
    "eval_under_600s": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "score_first_ttt": true,
    "three_seeds": true
  },
  "attribution": {
    "sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
    "improved_parallel_residuals": "@msisovic (PR #1529, #1204)",
    "legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)",
    "muon_097": "@dexhunter (PR #1514)",
    "hyperparameter_tuning": "@X-Abhishek-X (PR #1445)"
  }
}

Large diffs are not rendered by default.

@@ -0,0 +1,149 @@
W0411 02:14:48.169000 47104 torch/distributed/run.py:803]
W0411 02:14:48.169000 47104 torch/distributed/run.py:803] *****************************************
W0411 02:14:48.169000 47104 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0411 02:14:48.169000 47104 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/3f0916f2-8576-4b2d-95dd-fae9c621b1a2.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.03
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.97
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
qk_gain_init: 5.25
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 3f0916f2-8576-4b2d-95dd-fae9c621b1a2
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 3
ttt_hash_buckets: 16384
ttt_hash_embed: True
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 128
val_tokens: 40540160
model_params:35944602
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0096 val_bpb: 3.4879
1/20000 train_loss: 9.0112 train_time: 0.0m tok/s: 7915600
2/20000 train_loss: 12.5748 train_time: 0.0m tok/s: 7809948
3/20000 train_loss: 11.4700 train_time: 0.0m tok/s: 7718096
4/20000 train_loss: 9.7598 train_time: 0.0m tok/s: 7668274
5/20000 train_loss: 8.5575 train_time: 0.0m tok/s: 7639140
500/20000 train_loss: 3.3436 train_time: 0.9m tok/s: 7404018
1000/20000 train_loss: 3.2122 train_time: 1.8m tok/s: 7388854
1500/20000 train_loss: 3.1225 train_time: 2.7m tok/s: 7389355
layer_loop:enabled step:1935 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2000/20000 train_loss: 3.0884 train_time: 3.6m tok/s: 7284216
2500/20000 train_loss: 3.0880 train_time: 4.9m tok/s: 6705687
3000/20000 train_loss: 2.9604 train_time: 6.2m tok/s: 6371157
3500/20000 train_loss: 2.9751 train_time: 7.5m tok/s: 6152445
4000/20000 train_loss: 2.9034 train_time: 8.7m tok/s: 5998426
4000/20000 val_loss: 2.8647 val_bpb: 1.1090
4414/20000 val_loss: 2.8071 val_bpb: 1.0867
stopping_early: wallclock_cap train_time: 588150ms step: 4414/20000
peak memory allocated: 39718 MiB reserved: 39742 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80693411 val_bpb:1.08665167 eval_time:6466ms
Serialized model: 135431741 bytes
Code size: 17184 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 13.2s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, par_post, par_resid, skip_gates, skip_weights
Serialized model quantized+brotli: 15970265 bytes
Total submission size quantized+brotli: 15987449 bytes
quantized val_loss:2.83210781 val_bpb:1.09639719 eval_time:9232ms
quantized_sliding_window val_loss:2.78869620 val_bpb:1.07959120 eval_time:93165ms
ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3
quantized_ttt val_loss:2.78483468 val_bpb:1.07809629 eval_time:334646ms