openai · msisovic · Apr 10, 2026 · Apr 11, 2026 · Apr 11, 2026 · Apr 11, 2026
diff --git a/...rds/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/README.md b/...rds/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/README.md
@@ -0,0 +1,40 @@
+# Record: Improved Parallel Residuals
+
+**val_bpb: 1.07578747** (3-seed mean, std 0.0007) | **2.77887078 nats** | **~15.98 MB** | 8xH100 SXM, 600s | Legal TTT
+
+This submission starts from [PR #1523](https://github.com/openai/parameter-golf/pull/1523). Most of the newer submissions moved away from my fuller parallel-residual formulation and settled on a simpler GPT-J-style split-lane decoder. This version keeps the strong parts of that newer baseline and reintroduces the useful parts of my parallel residual implementation.
+
+The key architectural change relative to PR #1523 is in the decoder after the split point. Attention and MLP read from different lanes, but neither sublayer writes back immediately. Instead, both outputs are accumulated into the two lanes together at the end of the block:
+
+```python
+next_lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out
+next_lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out
+```
+
+That keeps the GPT-J-style parallel-in-time update, while restoring the richer learned routing between the attention and MLP lanes. The other important part is that decoder U-Net skips are still written only into `lane0`, which preserves the cheaper and more stable skip path from the newer baseline. Attention reads the mixed `lane0/x0` path, while MLP reads raw `lane1`. Final output uses the mean of the two lanes.
+
+Apart from moving `PARALLEL_RESIDUAL_START` from the baseline's `7` to `8`, that is essentially the only modeling change versus PR #1523. I ablated the start-layer change separately on top of the plain PR #1523 baseline, without the fuller parallel residual routing, and on its own it gave a mild regression. Recovering the full throughput also required the CUTLASS EVT path. In this iteration the CUDA/C++ source is inlined into the training script itself and built against a standard `/opt/cutlass` checkout rather than shipping a separate prebuilt `.so`.
+
+## Results (8xH100 80GB SXM, 600s)
+
+| Seed | Steps | ms/step | Post-EMA BPB | Legal TTT BPB | val_loss (nats) | Artifact |
+|------|-------|---------|--------------|----------------|-----------------|----------|
+| 1337 | 4,655 | 126.13 | 1.0830 | **1.0751** | 2.7770 | 15,983,095 |
+| 2024 | 4,689 | 125.20 | 1.0843 | **1.0765** | 2.7806 | 15,987,382 |
+| 42 | 4,696 | 125.04 | 1.0837 | **1.0759** | 2.7790 | 15,982,563 |
+| **Mean** | **4,680** | **125.46** | **1.0837** | **1.07578747** | **2.77887078** | **15,984,347** |
+
+## Reproducibility
+
+```bash
+pip install brotli sentencepiece
+git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass
+cd /opt/cutlass
+git checkout 08185b9c3e90510ee2b656662ed0d53b06d28157
+cd -
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
+for SEED in 1337 2024 42; do
+    SEED=$SEED TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 PARALLEL_RESIDUAL_START=8 GPTQ_RESERVE_SECONDS=13 \
+    torchrun --standalone --nproc_per_node=8 train_gpt.py
+done
+```
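The block-end lane update described in the README can be sketched in plain Python as follows. This is a minimal illustration, not the code in `train_gpt.py`: it assumes the mixing coefficients are scalars (they may well be per-channel in the real model), uses plain lists in place of tensors, and takes hypothetical `attn_fn` / `mlp_fn` callables as stand-ins for the sublayers. The `x0` mixing on the attention input and the `lane0`-only U-Net skips are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class LaneMix:
    """Mixing coefficients for one block; names follow the README snippet."""
    attn_resid: float
    mlp_resid: float
    attn_post: Tuple[float, float]  # attention output weight for (lane0, lane1)
    mlp_post: Tuple[float, float]   # MLP output weight for (lane0, lane1)


def block_update(
    lane0: Vector,
    lane1: Vector,
    attn_fn: Callable[[Vector], Vector],  # hypothetical stand-in for attention
    mlp_fn: Callable[[Vector], Vector],   # hypothetical stand-in for the MLP
    mix: LaneMix,
) -> Tuple[Vector, Vector]:
    # Both sublayers read the lanes as they were at the start of the block,
    # so neither sees the other's output (the parallel-in-time part).
    attn_out = attn_fn(lane0)
    mlp_out = mlp_fn(lane1)

    # Both outputs are then written into both lanes together at block end.
    next_lane0 = [
        mix.attn_resid * l0 + mix.attn_post[0] * a + mix.mlp_post[0] * m
        for l0, a, m in zip(lane0, attn_out, mlp_out)
    ]
    next_lane1 = [
        mix.mlp_resid * l1 + mix.attn_post[1] * a + mix.mlp_post[1] * m
        for l1, a, m in zip(lane1, attn_out, mlp_out)
    ]
    return next_lane0, next_lane1


def final_output(lane0: Vector, lane1: Vector) -> Vector:
    # The README states that the final output is the mean of the two lanes.
    return [(a + b) / 2 for a, b in zip(lane0, lane1)]


# Toy usage with made-up sublayers and coefficients.
mix = LaneMix(attn_resid=1.0, mlp_resid=1.0, attn_post=(1.0, 0.5), mlp_post=(0.5, 1.0))
n0, n1 = block_update(
    [1.0, 2.0],
    [3.0, 4.0],
    attn_fn=lambda x: [2 * v for v in x],
    mlp_fn=lambda x: [v + 1 for v in x],
    mix=mix,
)
out = final_output(n0, n1)
```

With `attn_post[1]` and `mlp_post[0]` set to zero the two lanes stop exchanging information, so these cross terms are where the learned routing between the attention and MLP lanes would live in this sketch.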
diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/requirements.txt b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/requirements.txt
@@ -0,0 +1,10 @@
+numpy
+tqdm
+huggingface-hub
+kernels
+setuptools
+typing-extensions==4.15.0
+datasets
+tiktoken
+sentencepiece
+brotli
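Before launching the three seeds, the environment assumptions from the README and this requirements list can be checked with a short script. The sketch below is an assumption about how one might do that, not part of the submission: the helper names `cutlass_commit` and `missing_modules` are hypothetical, and only the `/opt/cutlass` path, the pinned commit hash, and the package names come from the diff.

```python
import importlib.util
import subprocess
from pathlib import Path
from typing import List, Optional

# Commit pinned in the README's reproducibility steps.
PINNED_CUTLASS_COMMIT = "08185b9c3e90510ee2b656662ed0d53b06d28157"


def cutlass_commit(path: str = "/opt/cutlass") -> Optional[str]:
    """Return the checked-out commit hash at `path`, or None if unavailable."""
    if not Path(path).is_dir():
        return None
    try:
        result = subprocess.run(
            ["git", "-C", path, "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the directory is not a git checkout.
        return None
    return result.stdout.strip()


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    commit = cutlass_commit()
    if commit != PINNED_CUTLASS_COMMIT:
        print(f"CUTLASS checkout mismatch: found {commit!r}")
    # Import names for a few of the listed packages (brotli, sentencepiece, numpy).
    absent = missing_modules(["brotli", "sentencepiece", "numpy"])
    if absent:
        print(f"Missing packages: {absent}")
```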
diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/submission.json b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/submission.json
@@ -0,0 +1,56 @@
+{
+  "author": "Marko Sisovic",
+  "github_id": "msisovic",
+  "name": "Parallel Residuals",
+  "blurb": "Built from PR #1523. Restores fuller parallel residual routing on top of the newer GPT-J-style split-lane baseline by writing attention and MLP outputs into both lanes together at block end, while keeping decoder skips on lane0 only. Includes inline CUTLASS EVT fusion for reproducible throughput. Exact 3-seed legal-TTT mean: 1.07578747 BPB / 2.77887078 nats.",
+  "date": "2026-04-11",
+  "track": "10min_16mb",
+  "val_loss": 2.77887078,
+  "val_bpb": 1.07578747,
+  "val_loss_std": 0.00180154,
+  "val_bpb_std": 0.00069743,
+  "seeds": [
+    1337,
+    2024,
+    42
+  ],
+  "seed_results": {
+    "1337": {
+      "val_loss": 2.77699288,
+      "val_bpb": 1.07506048,
+      "post_ema_val_loss": 2.7975248,
+      "post_ema_val_bpb": 1.08300903,
+      "artifact_bytes": 15983095,
+      "steps": 4655,
+      "step_avg_ms": 126.13
+    },
+    "2024": {
+      "val_loss": 2.78058475,
+      "val_bpb": 1.07645101,
+      "post_ema_val_loss": 2.80083877,
+      "post_ema_val_bpb": 1.08429197,
+      "artifact_bytes": 15987382,
+      "steps": 4689,
+      "step_avg_ms": 125.2
+    },
+    "42": {
+      "val_loss": 2.7790347,
+      "val_bpb": 1.07585093,
+      "post_ema_val_loss": 2.79919043,
+      "post_ema_val_bpb": 1.08365385,
+      "artifact_bytes": 15982563,
+      "steps": 4696,
+      "step_avg_ms": 125.04
+    }
+  },
+  "baseline_pr": 1523,
+  "artifact_bytes_mean": 15984346.67,
+  "artifact_bytes_max": 15987382,
+  "bytes_total": 15987382,
+  "code_bytes": 26056,
+  "train_steps_mean": 4680,
+  "step_avg_ms_mean": 125.46,
+  "hardware": "8xH100 80GB SXM",
+  "evaluation": "legal_ttt_exact",
+  "technique_summary": "Parallel residual routing + GPT-J-style parallel-in-time lane update + lane0-only decoder skips + inline CUTLASS EVT fusion + legal TTT"
+}
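The aggregate fields in `submission.json` can be recomputed from the per-seed entries. The sketch below copies the per-seed `val_loss` and `val_bpb` values from the JSON above and uses Python's `statistics` module. That the reported `*_std` fields are sample standard deviations (n - 1 denominator) is an inference from the numbers, not something the submission states.

```python
import statistics

# Per-seed legal-TTT results copied from submission.json.
seed_results = {
    "1337": {"val_loss": 2.77699288, "val_bpb": 1.07506048},
    "2024": {"val_loss": 2.78058475, "val_bpb": 1.07645101},
    "42": {"val_loss": 2.7790347, "val_bpb": 1.07585093},
}


def summarize(key: str):
    """Return (mean, sample standard deviation) of `key` across seeds."""
    values = [result[key] for result in seed_results.values()]
    return statistics.mean(values), statistics.stdev(values)


bpb_mean, bpb_std = summarize("val_bpb")
loss_mean, loss_std = summarize("val_loss")

print(f"val_bpb:  mean={bpb_mean:.8f}  std={bpb_std:.8f}")
print(f"val_loss: mean={loss_mean:.8f}  std={loss_std:.8f}")
```

Using `statistics.pstdev` (population standard deviation) instead would give smaller values that do not match the reported `val_bpb_std` and `val_loss_std`.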
diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_gpt.py b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_gpt.py