# Improved Baseline Sweep Matrix

This matrix is ordered by expected value under limited compute. Start at the top and stop once one family is clearly winning.

## Baseline

Use the current branch defaults as the anchor:

```bash
RUN_ID=ibl_base \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

## Tier 1: Fast Ablations

These answer "what is actually carrying the gains?" A launch example follows the table.

| Name | Env overrides | Why |
|---|---|---|
| `abl_no_lora` | `LORA_RANK=0` | Tests whether loop-specific adapters are paying for themselves |
| `abl_no_mtp` | `MTP_HEADS=0 MTP_WEIGHT=0.0` | Tests whether auxiliary prediction is helping |
| `abl_no_ema` | `USE_EMA=0` | Tests whether EMA is helping final eval |
| `abl_no_rec_scale` | `USE_RECURRENCE_SCALES=0` | Tests whether recurrence scales are redundant with LoRA |
| `abl_relu2` | `USE_SWIGLU=0` | Tests whether SwiGLU is actually beating the cheaper ReLU-squared path |
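
Each name above is a profile in `launch_sweep.sh` (included in this record), so a single ablation launches as, for example:

```bash
# Launch one Tier 1 ablation; the profile name selects the env overrides above.
# Inside launch_sweep.sh, RUN_ID defaults to the profile name.
NPROC_PER_NODE=8 ./launch_sweep.sh abl_no_lora
```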

## Tier 2: High-Leverage Sweeps

Run these around the best Tier 1 setting; a sweep-loop sketch follows the table.

| Name | Env overrides |
|---|---|
| `rank8` | `LORA_RANK=8` |
| `rank16` | `LORA_RANK=16` |
| `rank32` | `LORA_RANK=32` |
| `mtp_w005` | `MTP_WEIGHT=0.05` |
| `mtp_w010` | `MTP_WEIGHT=0.10` |
| `mtp_w015` | `MTP_WEIGHT=0.15` |
| `mtp_w025` | `MTP_WEIGHT=0.25` |
| `mtp1` | `MTP_HEADS=1` |
| `mtp2` | `MTP_HEADS=2` |
| `mtp4` | `MTP_HEADS=4` |
| `ema_099_0` | `EMA_DECAY=0.99 EMA_START_STEP=0` |
| `ema_0995_50` | `EMA_DECAY=0.995 EMA_START_STEP=50` |
| `ema_0998_100` | `EMA_DECAY=0.998 EMA_START_STEP=100` |
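
`launch_sweep.sh` takes one profile per invocation, so a sweep is just a loop; a minimal sketch:

```bash
# Run the Tier 2 profiles back to back; each run is tagged with its profile name.
for p in rank8 rank16 rank32 mtp_w005 mtp_w010 mtp_w015 mtp_w025; do
  NPROC_PER_NODE=8 ./launch_sweep.sh "$p"
done
```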

## Tier 3: Capacity / Layout Sweeps

Run these only after one variant is already clearly beating the baseline; a rough width pre-flight check follows the table.

| Name | Env overrides | Why |
|---|---|---|
| `dim768` | `MODEL_DIM=768` | Spend more of the artifact budget |
| `dim832` | `MODEL_DIM=832` | Aggressive width push if compressed size stays safe |
| `layout_4x3` | `NUM_UNIQUE_LAYERS=4 NUM_RECURRENCE=3` | More effective depth with stronger sharing |
| `layout_3x4` | `NUM_UNIQUE_LAYERS=3 NUM_RECURRENCE=4` | Push recurrence harder |
| `layout_3x5` | `NUM_UNIQUE_LAYERS=3 NUM_RECURRENCE=5` | PR #11-style deeper recurrence regime |
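
Before the width pushes, a rough pre-flight size check can help. The sketch below assumes parameter count grows roughly quadratically with `MODEL_DIM` from the ~18.1M-parameter 704-dim record; this is coarse (matrix weights scale with dim², the small 1024-entry embedding only with dim), so treat it as a sanity check, not a size guarantee:

```bash
# Coarse width-scaling estimate from the 704-dim / ~18.1M-param record.
for d in 768 832; do
  awk -v d="$d" 'BEGIN { printf "dim %d -> ~%.1fM params\n", d, 18.1 * (d / 704) ^ 2 }'
done
```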

## Recommended Run Order

1. `base`
2. `abl_no_lora`
3. `abl_no_mtp`
4. `abl_no_rec_scale`
5. `abl_no_ema`
6. Best of the above + `rank8`, `rank32`
7. Best LoRA setting + `mtp_w010`, `mtp_w025`
8. Best above + `ema_099_0`, `ema_0998_100`
9. Best above + `dim768`
10. Best above + `layout_4x3`

## What To Log

For each run, keep:

- `final_int8_zlib_roundtrip_exact val_bpb`
- `Total submission size int8+zlib`
- wallclock at stop
- whether it finished cleanly under the cap
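
A sketch for pulling these out of a run log; the field names match the log format in this record, and `logs/<run_id>.txt` is a placeholder for the actual log path:

```bash
LOG=logs/<run_id>.txt                            # substitute the real run log
grep 'final_int8_zlib_roundtrip_exact' "$LOG"    # exact post-roundtrip val_loss / val_bpb
grep 'Total submission size int8+zlib' "$LOG"    # compressed artifact size in bytes
grep 'stopping_early' "$LOG"                     # wallclock at stop, if the cap fired
```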
41 changes: 41 additions & 0 deletions records/track_10min_16mb/2026-03-18_ImprovedBaseline/README.md
This record captures an improved baseline for the Parameter Golf challenge.

Key architectural changes from the naive baseline:
- **Depth recurrence**: 5 unique transformer blocks looped 2x = 10 effective layers (vs baseline's 9 unique layers). Weight sharing reduces stored parameter count while maintaining effective model depth.
- **Wider model**: MODEL_DIM=704 (vs baseline 512), enabled by the parameter savings from weight sharing. Gives 88-dim attention heads vs 64.
- **Per-recurrence learned scales**: Each effective layer gets a learned scale vector so shared blocks can distinguish which recurrence pass they're on, allowing the same block to behave differently in encoder vs decoder roles.
- **Encoder-decoder skip connections**: Maintained from baseline, adapted for recurrent depth.

Configuration:
- Layout: `VOCAB_SIZE=1024 NUM_UNIQUE_LAYERS=5 NUM_RECURRENCE=2 MODEL_DIM=704 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
- Effective depth: 10 layers (5 unique x 2 recurrence passes)
- Tied output/input embeddings: `TIE_EMBEDDINGS=1`
- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024`

Command (track-relevant params):
```bash
RUN_ID=improved_baseline \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_LOG_EVERY=50 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Design rationale (a step-budget sketch follows this list):
- Depth recurrence with 5 unique blocks x 2 passes keeps per-step compute at ~2.1x baseline (est ~91ms/step vs ~44ms), allowing ~6500 steps / ~3.4B tokens in 10 minutes.
- The wider dimension (704 vs 512) improves per-head representational capacity.
- Learned per-recurrence scales (10 x 704 params) let the model route information differently on each pass through the shared weights — critical for encoder/decoder role differentiation.
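
The step and token budget follows from simple arithmetic; a sketch using the ~91 ms/step estimate and the 600 s cap:

```bash
# Back-of-envelope budget: steps under the wallclock cap, then tokens seen.
echo "steps:  $(( 600000 / 91 ))"       # ~6593 steps at 91 ms/step in 600 s
echo "tokens: $(( 6500 * 524288 ))"     # ~3.4B tokens at 524288 tokens/step
```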

Estimated model size (arithmetic sketch after the list):
- ~18.1M parameters (unique)
- Estimated int8+zlib: ~8.9 MB
- Estimated code: ~44.3 KB
- Estimated total: ~9.0 MB (under 16 MB cap)
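
The arithmetic behind the size estimate, as a sketch (int8 stores about one byte per parameter; the ~2x zlib factor is this record's assumption, not a measurement):

```bash
# ~18.1M params -> ~18.1 MB raw int8; the ~8.9 MB figure implies roughly 2x zlib.
awk 'BEGIN { raw = 18.1; printf "raw int8: ~%.1f MB, with ~2x zlib: ~%.1f MB\n", raw, raw / 2 }'
```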

Included files:
- `train_gpt.py` (code snapshot for the run)
- `submission.json` (leaderboard metadata)
- `README.md` (this file)
100 changes: 100 additions & 0 deletions records/track_10min_16mb/2026-03-18_ImprovedBaseline/launch_sweep.sh
#!/usr/bin/env bash
set -euo pipefail
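
# Select a sweep profile by name, apply its env overrides, and exec training.
# Usage: [NPROC_PER_NODE=N] ./launch_sweep.sh <profile> [extra train_gpt.py args]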

profile="${1:-base}"
shift || true

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

export DATA_PATH="${DATA_PATH:-./data/datasets/fineweb10B_sp1024}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-./data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"
export RUN_ID="${RUN_ID:-$profile}"

case "$profile" in
base)
;;
abl_no_lora)
export LORA_RANK=0
;;
abl_no_mtp)
export MTP_HEADS=0
export MTP_WEIGHT=0.0
;;
abl_no_ema)
export USE_EMA=0
;;
abl_no_rec_scale)
export USE_RECURRENCE_SCALES=0
;;
abl_relu2)
export USE_SWIGLU=0
;;
rank8)
export LORA_RANK=8
;;
rank16)
export LORA_RANK=16
;;
rank32)
export LORA_RANK=32
;;
mtp_w005)
export MTP_WEIGHT=0.05
;;
mtp_w010)
export MTP_WEIGHT=0.10
;;
mtp_w015)
export MTP_WEIGHT=0.15
;;
mtp_w025)
export MTP_WEIGHT=0.25
;;
mtp1)
export MTP_HEADS=1
;;
mtp2)
export MTP_HEADS=2
;;
mtp4)
export MTP_HEADS=4
;;
ema_099_0)
export EMA_DECAY=0.99
export EMA_START_STEP=0
;;
ema_0995_50)
export EMA_DECAY=0.995
export EMA_START_STEP=50
;;
ema_0998_100)
export EMA_DECAY=0.998
export EMA_START_STEP=100
;;
dim768)
export MODEL_DIM=768
;;
dim832)
export MODEL_DIM=832
;;
layout_4x3)
export NUM_UNIQUE_LAYERS=4
export NUM_RECURRENCE=3
;;
layout_3x4)
export NUM_UNIQUE_LAYERS=3
export NUM_RECURRENCE=4
;;
layout_3x5)
export NUM_UNIQUE_LAYERS=3
export NUM_RECURRENCE=5
;;
*)
echo "Unknown profile: $profile" >&2
exit 1
;;
esac

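# Run from the record directory so the relative data/tokenizer defaults resolve;
# any extra CLI args fall through to train_gpt.py.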
cd "$ROOT_DIR"
exec torchrun --standalone --nproc_per_node="${NPROC_PER_NODE:-1}" train_gpt.py "$@"
logs/68f891e0-fa95-49a2-88cd-f338efb12663.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:17671712 (effective_layers:12 unique_layers:4 recurrence:3)
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:524288 train_seq_len:1024 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9622 val_bpb:4.1234 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:8.0069 train_time:1214ms step_avg:1213.96ms
step:2/20000 train_loss:21.2216 train_time:1311ms step_avg:655.50ms
step:3/20000 train_loss:14.6102 train_time:1441ms step_avg:480.44ms
step:4/20000 train_loss:8.2296 train_time:1572ms step_avg:393.07ms
step:5/20000 train_loss:7.6861 train_time:1703ms step_avg:340.60ms
step:6/20000 train_loss:8.6377 train_time:1834ms step_avg:305.61ms
step:7/20000 train_loss:7.7169 train_time:1964ms step_avg:280.63ms
step:8/20000 train_loss:7.6308 train_time:2095ms step_avg:261.89ms
step:9/20000 train_loss:7.4202 train_time:2226ms step_avg:247.37ms
step:10/20000 train_loss:7.2083 train_time:2357ms step_avg:235.70ms
step:200/20000 train_loss:3.7648 train_time:27369ms step_avg:136.84ms
step:400/20000 train_loss:3.1536 train_time:53804ms step_avg:134.51ms
step:600/20000 train_loss:3.3251 train_time:80295ms step_avg:133.83ms
step:800/20000 train_loss:3.0375 train_time:106798ms step_avg:133.50ms
step:1000/20000 train_loss:3.1249 train_time:133309ms step_avg:133.31ms
step:1000/20000 val_loss:2.3808 val_bpb:1.4101 train_time:133341ms step_avg:133.34ms
step:1200/20000 train_loss:3.1493 train_time:159835ms step_avg:133.20ms
step:1400/20000 train_loss:3.1933 train_time:186356ms step_avg:133.11ms
step:1600/20000 train_loss:2.7988 train_time:212878ms step_avg:133.05ms
step:1800/20000 train_loss:2.9320 train_time:239419ms step_avg:133.01ms
step:2000/20000 train_loss:2.9923 train_time:266050ms step_avg:133.03ms
step:2000/20000 val_loss:2.2877 val_bpb:1.3549 train_time:266084ms step_avg:133.04ms
step:2200/20000 train_loss:2.7845 train_time:292593ms step_avg:133.00ms
step:2400/20000 train_loss:2.9360 train_time:319133ms step_avg:132.97ms
step:2600/20000 train_loss:3.1746 train_time:345656ms step_avg:132.94ms
step:2800/20000 train_loss:2.9731 train_time:372176ms step_avg:132.92ms
step:3000/20000 train_loss:2.9641 train_time:398700ms step_avg:132.90ms
step:3000/20000 val_loss:2.2486 val_bpb:1.3318 train_time:398733ms step_avg:132.91ms
step:3200/20000 train_loss:2.9128 train_time:425229ms step_avg:132.88ms
step:3400/20000 train_loss:2.8763 train_time:451758ms step_avg:132.87ms
step:3600/20000 train_loss:2.8249 train_time:478283ms step_avg:132.86ms
step:3800/20000 train_loss:2.9308 train_time:504814ms step_avg:132.85ms
step:4000/20000 train_loss:2.8545 train_time:531351ms step_avg:132.84ms
step:4000/20000 val_loss:2.1955 val_bpb:1.3003 train_time:531385ms step_avg:132.85ms
step:4200/20000 train_loss:2.8580 train_time:557942ms step_avg:132.84ms
step:4400/20000 train_loss:2.7688 train_time:584473ms step_avg:132.83ms
step:4518/20000 val_loss:2.1687 val_bpb:1.2844 train_time:600133ms step_avg:132.83ms
stopping_early: wallclock_cap train_time:600133ms step:4518/20000
peak memory allocated: 28955 MiB reserved: 30950 MiB
Loaded EMA weights for evaluation and serialization
Serialized model: 69118949 bytes
Code size: 52440 bytes
Total submission size: 69171389 bytes
Serialized model int8+zlib: 16507603 bytes (payload:18118784 raw_torch:18146818 payload_ratio:3.81x)
Total submission size int8+zlib: 16560043 bytes
final_int8_zlib_roundtrip val_loss:3.9270 val_bpb:2.3258 eval_time:3712ms
final_int8_zlib_roundtrip_exact val_loss:3.92703062 val_bpb:2.32580873
{
"author": "Monroe Stephenson",
"github_id": "monroestephenson",
"name": "Improved Baseline",
"blurb": "SP-1024 5x2=10eff 704dim KV4 depth-recurrence with learned per-recurrence scales; targets sub-1.22 BPB under 16MB cap.",
"date": "2026-03-18T00:00:00Z",
"val_loss": null,
"val_bpb": null,
"bytes_total": null,
"bytes_code": 44327
}