# Improved Baseline Sweep Matrix

This matrix is ordered by expected value under limited compute. Start at the top and stop once one family is clearly winning.

## Baseline

Use the current branch defaults as the anchor:

```bash
RUN_ID=ibl_base \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

## Tier 1: Fast Ablations

These answer "what is actually carrying the gains?" A launch example follows the table.

| Name | Env overrides | Why |
|---|---|---|
| `abl_no_lora` | `LORA_RANK=0` | Tests whether loop-specific adapters are paying for themselves |
| `abl_no_mtp` | `MTP_HEADS=0 MTP_WEIGHT=0.0` | Tests whether auxiliary prediction is helping |
| `abl_no_ema` | `USE_EMA=0` | Tests whether EMA is helping final eval |
| `abl_no_rec_scale` | `USE_RECURRENCE_SCALES=0` | Tests whether recurrence scales are redundant with LoRA |
| `abl_relu2` | `USE_SWIGLU=0` | Tests whether SwiGLU is actually beating the cheaper ReLU-squared path |
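
Each name above is a profile in `launch_sweep.sh` (included in this record), so a single ablation launches as, for example:

```bash
# Launch one Tier 1 ablation; the profile name selects the env overrides above.
# Inside launch_sweep.sh, RUN_ID defaults to the profile name.
NPROC_PER_NODE=8 ./launch_sweep.sh abl_no_lora
```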

## Tier 2: High-Leverage Sweeps

Run these around the best Tier 1 setting; a sweep-loop sketch follows the table.

| Name | Env overrides |
|---|---|
| `rank8` | `LORA_RANK=8` |
| `rank16` | `LORA_RANK=16` |
| `rank32` | `LORA_RANK=32` |
| `mtp_w005` | `MTP_WEIGHT=0.05` |
| `mtp_w010` | `MTP_WEIGHT=0.10` |
| `mtp_w015` | `MTP_WEIGHT=0.15` |
| `mtp_w025` | `MTP_WEIGHT=0.25` |
| `mtp1` | `MTP_HEADS=1` |
| `mtp2` | `MTP_HEADS=2` |
| `mtp4` | `MTP_HEADS=4` |
| `ema_099_0` | `EMA_DECAY=0.99 EMA_START_STEP=0` |
| `ema_0995_50` | `EMA_DECAY=0.995 EMA_START_STEP=50` |
| `ema_0998_100` | `EMA_DECAY=0.998 EMA_START_STEP=100` |
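
`launch_sweep.sh` takes one profile per invocation, so a sweep is just a loop; a minimal sketch:

```bash
# Run the Tier 2 profiles back to back; each run is tagged with its profile name.
for p in rank8 rank16 rank32 mtp_w005 mtp_w010 mtp_w015 mtp_w025; do
  NPROC_PER_NODE=8 ./launch_sweep.sh "$p"
done
```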

## Tier 3: Capacity / Layout Sweeps

Run these only after one variant is already clearly beating the baseline; a rough width pre-flight check follows the table.

| Name | Env overrides | Why |
|---|---|---|
| `dim768` | `MODEL_DIM=768` | Spend more of the artifact budget |
| `dim832` | `MODEL_DIM=832` | Aggressive width push if compressed size stays safe |
| `layout_4x3` | `NUM_UNIQUE_LAYERS=4 NUM_RECURRENCE=3` | More effective depth with stronger sharing |
| `layout_3x4` | `NUM_UNIQUE_LAYERS=3 NUM_RECURRENCE=4` | Push recurrence harder |
| `layout_3x5` | `NUM_UNIQUE_LAYERS=3 NUM_RECURRENCE=5` | PR #11-style deeper recurrence regime |
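
Before the width pushes, a rough pre-flight size check can help. The sketch below assumes parameter count grows roughly quadratically with `MODEL_DIM` from the ~18.1M-parameter 704-dim record; this is coarse (matrix weights scale with dim², the small 1024-entry embedding only with dim), so treat it as a sanity check, not a size guarantee:

```bash
# Coarse width-scaling estimate from the 704-dim / ~18.1M-param record.
for d in 768 832; do
  awk -v d="$d" 'BEGIN { printf "dim %d -> ~%.1fM params\n", d, 18.1 * (d / 704) ^ 2 }'
done
```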

## Recommended Run Order

1. `base`
2. `abl_no_lora`
3. `abl_no_mtp`
4. `abl_no_rec_scale`
5. `abl_no_ema`
6. Best of the above + `rank8`, `rank32`
7. Best LoRA setting + `mtp_w010`, `mtp_w025`
8. Best above + `ema_099_0`, `ema_0998_100`
9. Best above + `dim768`
10. Best above + `layout_4x3`

## What To Log

For each run, keep:

- `final_int8_zlib_roundtrip_exact val_bpb`
- `Total submission size int8+zlib`
- wallclock at stop
- whether it finished cleanly under the cap
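
A sketch for pulling these out of a run log; the field names match the log format in this record, and `logs/<run_id>.txt` is a placeholder for the actual log path:

```bash
LOG=logs/<run_id>.txt                            # substitute the real run log
grep 'final_int8_zlib_roundtrip_exact' "$LOG"    # exact post-roundtrip val_loss / val_bpb
grep 'Total submission size int8+zlib' "$LOG"    # compressed artifact size in bytes
grep 'stopping_early' "$LOG"                     # wallclock at stop, if the cap fired
```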
41 changes: 41 additions & 0 deletions records/track_10min_16mb/2026-03-18_ImprovedBaseline/README.md
This record captures an improved baseline for the Parameter Golf challenge.

Key architectural changes from the naive baseline:
- **Depth recurrence**: 5 unique transformer blocks looped 2x = 10 effective layers (vs baseline's 9 unique layers). Weight sharing reduces stored parameter count while maintaining effective model depth.
- **Wider model**: MODEL_DIM=704 (vs baseline 512), enabled by the parameter savings from weight sharing. Gives 88-dim attention heads vs 64.
- **Per-recurrence learned scales**: Each effective layer gets a learned scale vector so shared blocks can distinguish which recurrence pass they're on, allowing the same block to behave differently in encoder vs decoder roles.
- **Encoder-decoder skip connections**: Maintained from baseline, adapted for recurrent depth.

Configuration:
- Layout: `VOCAB_SIZE=1024 NUM_UNIQUE_LAYERS=5 NUM_RECURRENCE=2 MODEL_DIM=704 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
- Effective depth: 10 layers (5 unique x 2 recurrence passes)
- Tied output/input embeddings: `TIE_EMBEDDINGS=1`
- Batching: `TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024`

Command (track-relevant params):
```bash
RUN_ID=improved_baseline \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_LOG_EVERY=50 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Design rationale (a step-budget sketch follows this list):
- Depth recurrence with 5 unique blocks x 2 passes keeps per-step compute at ~2.1x baseline (est ~91ms/step vs ~44ms), allowing ~6500 steps / ~3.4B tokens in 10 minutes.
- The wider dimension (704 vs 512) improves per-head representational capacity.
- Learned per-recurrence scales (10 x 704 params) let the model route information differently on each pass through the shared weights — critical for encoder/decoder role differentiation.
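
The step and token budget follows from simple arithmetic; a sketch using the ~91 ms/step estimate and the 600 s cap:

```bash
# Back-of-envelope budget: steps under the wallclock cap, then tokens seen.
echo "steps:  $(( 600000 / 91 ))"       # ~6593 steps at 91 ms/step in 600 s
echo "tokens: $(( 6500 * 524288 ))"     # ~3.4B tokens at 524288 tokens/step
```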

Estimated model size (arithmetic sketch after the list):
- ~18.1M parameters (unique)
- Estimated int8+zlib: ~8.9 MB
- Estimated code: ~44.3 KB
- Estimated total: ~9.0 MB (under 16 MB cap)
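
The arithmetic behind the size estimate, as a sketch (int8 stores about one byte per parameter; the ~2x zlib factor is this record's assumption, not a measurement):

```bash
# ~18.1M params -> ~18.1 MB raw int8; the ~8.9 MB figure implies roughly 2x zlib.
awk 'BEGIN { raw = 18.1; printf "raw int8: ~%.1f MB, with ~2x zlib: ~%.1f MB\n", raw, raw / 2 }'
```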

Included files:
- `train_gpt.py` (code snapshot for the run)
- `submission.json` (leaderboard metadata)
- `README.md` (this file)
100 changes: 100 additions & 0 deletions records/track_10min_16mb/2026-03-18_ImprovedBaseline/launch_sweep.sh
#!/usr/bin/env bash
set -euo pipefail
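
# Select a sweep profile by name, apply its env overrides, and exec training.
# Usage: [NPROC_PER_NODE=N] ./launch_sweep.sh <profile> [extra train_gpt.py args]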

profile="${1:-base}"
shift || true

ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

export DATA_PATH="${DATA_PATH:-./data/datasets/fineweb10B_sp1024}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-./data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"
export RUN_ID="${RUN_ID:-$profile}"

case "$profile" in
base)
;;
abl_no_lora)
export LORA_RANK=0
;;
abl_no_mtp)
export MTP_HEADS=0
export MTP_WEIGHT=0.0
;;
abl_no_ema)
export USE_EMA=0
;;
abl_no_rec_scale)
export USE_RECURRENCE_SCALES=0
;;
abl_relu2)
export USE_SWIGLU=0
;;
rank8)
export LORA_RANK=8
;;
rank16)
export LORA_RANK=16
;;
rank32)
export LORA_RANK=32
;;
mtp_w005)
export MTP_WEIGHT=0.05
;;
mtp_w010)
export MTP_WEIGHT=0.10
;;
mtp_w015)
export MTP_WEIGHT=0.15
;;
mtp_w025)
export MTP_WEIGHT=0.25
;;
mtp1)
export MTP_HEADS=1
;;
mtp2)
export MTP_HEADS=2
;;
mtp4)
export MTP_HEADS=4
;;
ema_099_0)
export EMA_DECAY=0.99
export EMA_START_STEP=0
;;
ema_0995_50)
export EMA_DECAY=0.995
export EMA_START_STEP=50
;;
ema_0998_100)
export EMA_DECAY=0.998
export EMA_START_STEP=100
;;
dim768)
export MODEL_DIM=768
;;
dim832)
export MODEL_DIM=832
;;
layout_4x3)
export NUM_UNIQUE_LAYERS=4
export NUM_RECURRENCE=3
;;
layout_3x4)
export NUM_UNIQUE_LAYERS=3
export NUM_RECURRENCE=4
;;
layout_3x5)
export NUM_UNIQUE_LAYERS=3
export NUM_RECURRENCE=5
;;
*)
echo "Unknown profile: $profile" >&2
exit 1
;;
esac

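# Run from the record directory so the relative data/tokenizer defaults resolve;
# any extra CLI args fall through to train_gpt.py.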
cd "$ROOT_DIR"
exec torchrun --standalone --nproc_per_node="${NPROC_PER_NODE:-1}" train_gpt.py "$@"
logs/68f891e0-fa95-49a2-88cd-f338efb12663.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:17671712 (effective_layers:12 unique_layers:4 recurrence:3)
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.05 head_lr:0.0 matrix_lr:0.04 scalar_lr:0.04
train_batch_tokens:524288 train_seq_len:1024 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9622 val_bpb:4.1234 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:8.0069 train_time:1214ms step_avg:1213.96ms
step:2/20000 train_loss:21.2216 train_time:1311ms step_avg:655.50ms
step:3/20000 train_loss:14.6102 train_time:1441ms step_avg:480.44ms
step:4/20000 train_loss:8.2296 train_time:1572ms step_avg:393.07ms
step:5/20000 train_loss:7.6861 train_time:1703ms step_avg:340.60ms
step:6/20000 train_loss:8.6377 train_time:1834ms step_avg:305.61ms
step:7/20000 train_loss:7.7169 train_time:1964ms step_avg:280.63ms
step:8/20000 train_loss:7.6308 train_time:2095ms step_avg:261.89ms
step:9/20000 train_loss:7.4202 train_time:2226ms step_avg:247.37ms
step:10/20000 train_loss:7.2083 train_time:2357ms step_avg:235.70ms
step:200/20000 train_loss:3.7648 train_time:27369ms step_avg:136.84ms
step:400/20000 train_loss:3.1536 train_time:53804ms step_avg:134.51ms
step:600/20000 train_loss:3.3251 train_time:80295ms step_avg:133.83ms
step:800/20000 train_loss:3.0375 train_time:106798ms step_avg:133.50ms
step:1000/20000 train_loss:3.1249 train_time:133309ms step_avg:133.31ms
step:1000/20000 val_loss:2.3808 val_bpb:1.4101 train_time:133341ms step_avg:133.34ms
step:1200/20000 train_loss:3.1493 train_time:159835ms step_avg:133.20ms
step:1400/20000 train_loss:3.1933 train_time:186356ms step_avg:133.11ms
step:1600/20000 train_loss:2.7988 train_time:212878ms step_avg:133.05ms
step:1800/20000 train_loss:2.9320 train_time:239419ms step_avg:133.01ms
step:2000/20000 train_loss:2.9923 train_time:266050ms step_avg:133.03ms
step:2000/20000 val_loss:2.2877 val_bpb:1.3549 train_time:266084ms step_avg:133.04ms
step:2200/20000 train_loss:2.7845 train_time:292593ms step_avg:133.00ms
step:2400/20000 train_loss:2.9360 train_time:319133ms step_avg:132.97ms
step:2600/20000 train_loss:3.1746 train_time:345656ms step_avg:132.94ms
step:2800/20000 train_loss:2.9731 train_time:372176ms step_avg:132.92ms
step:3000/20000 train_loss:2.9641 train_time:398700ms step_avg:132.90ms
step:3000/20000 val_loss:2.2486 val_bpb:1.3318 train_time:398733ms step_avg:132.91ms
step:3200/20000 train_loss:2.9128 train_time:425229ms step_avg:132.88ms
step:3400/20000 train_loss:2.8763 train_time:451758ms step_avg:132.87ms
step:3600/20000 train_loss:2.8249 train_time:478283ms step_avg:132.86ms
step:3800/20000 train_loss:2.9308 train_time:504814ms step_avg:132.85ms
step:4000/20000 train_loss:2.8545 train_time:531351ms step_avg:132.84ms
step:4000/20000 val_loss:2.1955 val_bpb:1.3003 train_time:531385ms step_avg:132.85ms
step:4200/20000 train_loss:2.8580 train_time:557942ms step_avg:132.84ms
step:4400/20000 train_loss:2.7688 train_time:584473ms step_avg:132.83ms
step:4518/20000 val_loss:2.1687 val_bpb:1.2844 train_time:600133ms step_avg:132.83ms
stopping_early: wallclock_cap train_time:600133ms step:4518/20000
peak memory allocated: 28955 MiB reserved: 30950 MiB
Loaded EMA weights for evaluation and serialization
Serialized model: 69118949 bytes
Code size: 52440 bytes
Total submission size: 69171389 bytes
Serialized model int8+zlib: 16507603 bytes (payload:18118784 raw_torch:18146818 payload_ratio:3.81x)
Total submission size int8+zlib: 16560043 bytes
final_int8_zlib_roundtrip val_loss:3.9270 val_bpb:2.3258 eval_time:3712ms
final_int8_zlib_roundtrip_exact val_loss:3.92703062 val_bpb:2.32580873
{
"author": "Monroe Stephenson",
"github_id": "monroestephenson",
"name": "Improved Baseline",
"blurb": "SP-1024 5x2=10eff 704dim KV4 depth-recurrence with learned per-recurrence scales; targets sub-1.22 BPB under 16MB cap.",
"date": "2026-03-18T00:00:00Z",
"val_loss": null,
"val_bpb": null,
"bytes_total": null,
"bytes_code": 44327
}