# Asynchronous Prefetching — submission notes

### 1191 (with technique) vs 1137 (default) steps in 600s on local compute

## Key changes

Same model, optimizer, data layout, and training math as the baseline. This is **a general-purpose rework that could be layered onto most other approaches** for a modest speedup: overlap CPU data prep and host→device copies with GPU work so the GPU spends less time idle.

| Area | Original (`train_gpt_og_linux.py`) | Improved |
|------|-------------------------------------|----------|
| **Training batches** | Each step: read tokens on CPU, then H2D — all on the main thread before forward. | **Background thread** (`PrefetchingDistributedTokenLoader`) builds the **next** pinned CPU batch while the GPU runs the current step. Primary win: **CPU work overlaps GPU compute** (not GPU-side double-buffering of H2D vs forward). |
| **H2D** | Single default stream. | Optional **dedicated CUDA copy stream** (`TRAIN_COPY_STREAM`, off when timing diagnostics are on). Transfers use pinned memory; the training path **still waits for that step’s H2D** before forward (`wait_stream`). |
| **Validation** | Simple loop: slice → GPU → forward; BPB byte math on GPU. | **Prefetch thread** for pinned CPU batches; **double-buffered** H2D with copy stream + events so the **next** batch can copy while the **current** forward runs. Default **`VAL_BYTECOUNT_DEVICE=cpu`** moves BPB byte counting off the GPU vs the original (set **`cuda`** to mirror baseline GPU LUT math). |
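The training-side pattern in the table above (a background thread builds the next batch while the consumer runs the current step) can be sketched framework-free. This is a minimal illustration, not the script's actual `PrefetchingDistributedTokenLoader` API: the class name `PrefetchingLoader`, the `make_batch` callable, and the plain-list batches are all stand-ins; the real loader would fill pinned tensors (e.g. `torch.empty(..., pin_memory=True)`) so the later H2D copy can be asynchronous.

```python
import queue
import threading

class PrefetchingLoader:
    """Minimal sketch of background batch prefetch: a worker thread
    prepares the next batch while the consumer (the GPU step) uses the
    current one, so CPU batch-prep time overlaps compute."""

    def __init__(self, make_batch, depth=2):
        self._make_batch = make_batch          # callable producing one CPU batch, or None when exhausted
        self._q = queue.Queue(maxsize=depth)   # bounded: run at most `depth` batches ahead
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._fill, daemon=True)
        self._worker.start()

    def _fill(self):
        while not self._stop.is_set():
            batch = self._make_batch()         # CPU work happens here, off the main thread
            self._q.put(batch)                 # blocks when `depth` batches are already queued
            if batch is None:                  # producer exhausted; signal end-of-data
                return

    def next(self):
        return self._q.get()                   # returns immediately if the worker kept up

    def close(self):
        self._stop.set()

# Usage: drain five "batches" (integers stand in for token tensors).
it = iter(range(5))
loader = PrefetchingLoader(lambda: next(it, None))
batches = []
while (b := loader.next()) is not None:
    batches.append(b)
loader.close()
```

The bounded queue is the important design choice: an unbounded queue would let the worker race arbitrarily far ahead and pin memory for batches the GPU will not need for a while.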

## Diagnostics

To measure how much time this actually saves, I added **`TRAINING_TIMING_BREAKDOWN`** (batch CPU vs H2D vs FWD/BWD/opt vs val; adds syncs). When enabled, a timing line is logged every **`TRAINING_TIMING_EVERY`** steps (default 200) and for the first 10 steps. Extra logs: train/val I/O mode, `val_stage_time_ms`, train vs val wall-time split.

**`VAL_BYTECOUNT_DEVICE`** defaults to **`cpu`** in the improved script (not an extra flag you must set). Use **`cuda`** if you want validation byte math on the GPU like the original.

Optional **`VAL_PROGRESS_LOG_EVERY`** (default **0**): set to a positive value to log per-batch validation progress (`val_progress:...`).

## Defaults & toggles

Overlap features are **on by default** (`TRAIN_PREFETCH`, `TRAIN_COPY_STREAM`, `VAL_PREFETCH`, `VAL_COPY_STREAM`, etc.) and can be turned off via env vars if needed. **`TRAINING_TIMING_BREAKDOWN`** defaults to 0 and is not displayed. Prefetch/overlap are **automatically disabled** when `TRAINING_TIMING_BREAKDOWN=1` so timings stay interpretable.
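A minimal sketch of how these toggles can be resolved (env-var names are from this README; the helper `env_flag` and the exact parsing rules are illustrative assumptions, not the script's actual code):

```python
import os

def env_flag(name, default):
    """Read a 0/1-style toggle from the environment, falling back to `default`."""
    return os.environ.get(name, str(int(default))) not in ("0", "", "false", "False")

# Diagnostics default off; overlap features default on, but are forced
# off whenever the timing breakdown is enabled so timings stay interpretable.
timing = env_flag("TRAINING_TIMING_BREAKDOWN", False)
train_prefetch    = env_flag("TRAIN_PREFETCH", True)    and not timing
train_copy_stream = env_flag("TRAIN_COPY_STREAM", True) and not timing
val_prefetch      = env_flag("VAL_PREFETCH", True)      and not timing
val_copy_stream   = env_flag("VAL_COPY_STREAM", True)   and not timing
```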

## Idea

**Prefetch training and validation batches asynchronously and parallelize CPU ↔ GPU transfers with compute** to minimize pipeline bubbles under a fixed wall-clock budget.
It is a simple, intuitive change, and it could help submissions with real research and architectural advancements place slightly higher.

## Why this may be unimpactful in some cases

With **`TRAINING_TIMING_BREAKDOWN=1`**, early-step lines look like this (same hardware / config as above; `grad_accum_steps=8`, per-micro averages for batch/forward/backward):

```text
timing_breakdown step:1 micro_steps:8 batch_cpu_ms:0.29 batch_h2d_ms:0.35 forward_ms:30.54 backward_ms:64.93 grad_clip_ms:0.00 optimizer_ms:55.37 val_ms:121092.09 explicit_sync_ms:0.16 (per_optimizer_step; forward/backward/batch averaged over micro_steps; grad_accum_steps=8)
timing_breakdown step:2 micro_steps:8 batch_cpu_ms:0.29 batch_h2d_ms:0.35 forward_ms:30.29 backward_ms:64.72 grad_clip_ms:0.00 optimizer_ms:55.08 val_ms:0.00 explicit_sync_ms:0.00 (per_optimizer_step; forward/backward/batch averaged over micro_steps; grad_accum_steps=8)
timing_breakdown step:3 micro_steps:8 batch_cpu_ms:0.28 batch_h2d_ms:0.37 forward_ms:30.66 backward_ms:65.18 grad_clip_ms:0.00 optimizer_ms:54.45 val_ms:0.00 explicit_sync_ms:0.00 (per_optimizer_step; forward/backward/batch averaged over micro_steps; grad_accum_steps=8)
timing_breakdown step:4 micro_steps:8 batch_cpu_ms:0.31 batch_h2d_ms:0.34 forward_ms:30.34 backward_ms:64.43 grad_clip_ms:0.00 optimizer_ms:55.19 val_ms:0.00 explicit_sync_ms:0.00 (per_optimizer_step; forward/backward/batch averaged over micro_steps; grad_accum_steps=8)
```

**How to read this:** `batch_cpu_ms` and `batch_h2d_ms` are ~0.3 ms per micro-step; `forward_ms` and `backward_ms` are ~30 ms and ~65 ms per micro-step. Scaled by 8 micro-steps, batch prep + H2D is on the order of **~5 ms per optimizer step**, while forward + backward + optimizer is on the order of **~800+ ms**. So **data movement is a tiny slice** of the step; overlapping it cannot move wall-clock time much when the GPU is already busy with compute for almost the entire step.
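Plugging the step-2 numbers from the log above into that arithmetic makes the ratio concrete:

```python
# Per-optimizer-step split using the step-2 timing line
# (forward/backward/batch are per-micro averages, grad_accum_steps=8).
micro = 8
data_ms = micro * (0.29 + 0.35)               # batch prep + H2D  -> ~5.1 ms
compute_ms = micro * (30.29 + 64.72) + 55.08  # fwd + bwd per micro, plus optimizer -> ~815 ms
fraction = data_ms / (data_ms + compute_ms)   # data movement is well under 1% of the step
```

Even a perfect overlap can only reclaim `data_ms`, so the best-case saving here is under 1% of step time.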

**Caveat:** On a **much faster GPU** (or smaller model / larger batch so steps are shorter), the same CPU+H2D work could become a **larger fraction** of the step, and prefetch or val overlap might show up more in profiles. The breakdown above is **not** universal; it only shows why the optimization can be a no-op when **compute is the bottleneck**.
{
"id": "f9ba5b9c-412b-48cd-b1bc-7d2c0fc65ee7",
"started_at": "2026-03-24T00:48:35.433731+00:00",
"finished_at": "2026-03-24T01:11:47.811831+00:00",
"exit_code": 0,
"script": "train_gpt_improved_linux.py",
"argv": [
"wsl",
"-d",
"Ubuntu",
"bash",
"-lc",
"cd '/mnt/c/Users/REDACTED/parameter-golf' && export ADAM_EPS=1e-8 && export BETA1=0.9 && export BETA2=0.95 && export DATA_PATH=./data/datasets/fineweb10B_sp1024 && export EMBED_LR=0.6 && export GRAD_CLIP_NORM=0.0 && export HEAD_LR=0.008 && export ITERATIONS=20000 && export LOGIT_SOFTCAP=30.0 && export MATRIX_LR=0.04 && export MAX_WALLCLOCK_SECONDS=600.0 && export MLP_MULT=2 && export MODEL_DIM=512 && export MUON_BACKEND_STEPS=5 && export MUON_MOMENTUM=0.95 && export MUON_MOMENTUM_WARMUP_START=0.85 && export MUON_MOMENTUM_WARMUP_STEPS=500 && export NUM_HEADS=8 && export NUM_KV_HEADS=4 && export NUM_LAYERS=9 && export PYTHONUNBUFFERED=1 && export QK_GAIN_INIT=1.5 && export ROPE_BASE=10000.0 && export RUN_ID=8gb_vram && export SCALAR_LR=0.04 && export SEED=1337 && export TIED_EMBED_INIT_STD=0.005 && export TIED_EMBED_LR=0.05 && export TIE_EMBEDDINGS=1 && export TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model && export TRAIN_BATCH_TOKENS=65536 && export TRAIN_LOG_EVERY=200 && export TRAIN_SEQ_LEN=1024 && export VAL_BATCH_SIZE=524288 && export VAL_LOSS_EVERY=1000 && export VAL_PROGRESS_LOG_EVERY=0 && export VOCAB_SIZE=1024 && export WARMDOWN_ITERS=1200 && export WARMUP_STEPS=20 && ./.venv/bin/python -u train_gpt_improved_linux.py"
],
"env_snapshot": {
"DATA_PATH": "./data/datasets/fineweb10B_sp1024",
"TOKENIZER_PATH": "./data/tokenizers/fineweb_1024_bpe.model",
"RUN_ID": "8gb_vram",
"SEED": "1337",
"VAL_BATCH_SIZE": "524288",
"VAL_LOSS_EVERY": "1000",
"VAL_PROGRESS_LOG_EVERY": "0",
"TRAIN_LOG_EVERY": "200",
"ITERATIONS": "20000",
"WARMDOWN_ITERS": "1200",
"WARMUP_STEPS": "20",
"TRAIN_BATCH_TOKENS": "65536",
"TRAIN_SEQ_LEN": "1024",
"MAX_WALLCLOCK_SECONDS": "600.0",
"QK_GAIN_INIT": "1.5",
"VOCAB_SIZE": "1024",
"NUM_LAYERS": "9",
"NUM_KV_HEADS": "4",
"MODEL_DIM": "512",
"NUM_HEADS": "8",
"MLP_MULT": "2",
"TIE_EMBEDDINGS": "1",
"ROPE_BASE": "10000.0",
"LOGIT_SOFTCAP": "30.0",
"EMBED_LR": "0.6",
"HEAD_LR": "0.008",
"TIED_EMBED_LR": "0.05",
"TIED_EMBED_INIT_STD": "0.005",
"MATRIX_LR": "0.04",
"SCALAR_LR": "0.04",
"MUON_MOMENTUM": "0.95",
"MUON_BACKEND_STEPS": "5",
"MUON_MOMENTUM_WARMUP_START": "0.85",
"MUON_MOMENTUM_WARMUP_STEPS": "500",
"BETA1": "0.9",
"BETA2": "0.95",
"ADAM_EPS": "1e-8",
"GRAD_CLIP_NORM": "0.0",
"PYTHONUNBUFFERED": "1"
},
"command_powershell": "(WSL runtime selected; native PowerShell command disabled.)",
"command_bash": "cd '/mnt/c/Users/REDACTED/parameter-golf' && export ADAM_EPS=1e-8 && export BETA1=0.9 && export BETA2=0.95 && export DATA_PATH=./data/datasets/fineweb10B_sp1024 && export EMBED_LR=0.6 && export GRAD_CLIP_NORM=0.0 && export HEAD_LR=0.008 && export ITERATIONS=20000 && export LOGIT_SOFTCAP=30.0 && export MATRIX_LR=0.04 && export MAX_WALLCLOCK_SECONDS=600.0 && export MLP_MULT=2 && export MODEL_DIM=512 && export MUON_BACKEND_STEPS=5 && export MUON_MOMENTUM=0.95 && export MUON_MOMENTUM_WARMUP_START=0.85 && export MUON_MOMENTUM_WARMUP_STEPS=500 && export NUM_HEADS=8 && export NUM_KV_HEADS=4 && export NUM_LAYERS=9 && export PYTHONUNBUFFERED=1 && export QK_GAIN_INIT=1.5 && export ROPE_BASE=10000.0 && export RUN_ID=8gb_vram && export SCALAR_LR=0.04 && export SEED=1337 && export TIED_EMBED_INIT_STD=0.005 && export TIED_EMBED_LR=0.05 && export TIE_EMBEDDINGS=1 && export TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model && export TRAIN_BATCH_TOKENS=65536 && export TRAIN_LOG_EVERY=200 && export TRAIN_SEQ_LEN=1024 && export VAL_BATCH_SIZE=524288 && export VAL_LOSS_EVERY=1000 && export VAL_PROGRESS_LOG_EVERY=0 && export VOCAB_SIZE=1024 && export WARMDOWN_ITERS=1200 && export WARMUP_STEPS=20 && ./.venv/bin/python -u train_gpt_improved_linux.py",
"log_file": null,
"metrics": {
"timing_val_stage_total_ms": 366515,
"timing_train_loop_ms": 600433,
"timing_val_total_ms": 366515,
"peak_memory_line": "peak memory allocated: 1552 MiB reserved: 2280 MiB",
"serialized_model_line": "Serialized model: 67224983 bytes",
"code_bytes": 70245,
"serialized_int8_zlib_line": "Serialized model int8+zlib: 12907011 bytes (payload:17178912 raw_torch:17224025 payload_ratio:3.91x)",
"model_int8_zlib_bytes": 12907011,
"submission_size_line": "Total submission size int8+zlib: 12977256 bytes",
"submission_total_bytes_int8_zlib": 12977256,
"timing_quantize_ms": 1082,
"timing_roundtrip_dequant_ms": 407,
"timing_roundtrip_eval_ms": 125190,
"final_roundtrip_line": "final_int8_zlib_roundtrip val_loss:2.5457 val_bpb:1.5077 eval_time:125190ms",
"val_loss": 2.5457,
"val_bpb": 1.5077,
"final_roundtrip_exact_line": "final_int8_zlib_roundtrip_exact val_loss:2.54570429 val_bpb:1.50770948",
"val_loss_exact": 2.54570429,
"val_bpb_exact": 1.50770948
}
}