HANDOFF.md: 98 additions, 0 deletions

# Parameter Golf — full map (where everything lives)

This file is the **single place** that describes what is on GitHub vs what is only on your Mac.

## 1. What counts for the challenge (on GitHub)

| Item | Location |
|------|----------|
| Your fork | `https://github.com/0xjaishy/parameter-golf` |
| PR to OpenAI | `https://github.com/openai/parameter-golf/pull/223` |
| Git branch | `submission/allinone-smeargate-int6qat-slidingwindow` |
| **Submission folder** | `records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum/` |
| Entry script | `train_gpt.py` (run with `torchrun --standalone --nproc_per_node=8` on **8×H100 SXM**) |
| Metadata | `submission.json` — set `val_loss`, `val_bpb`, `bytes_total` after a real run |
| Short write-up | `README.md` in that same folder |

**Rule:** For a clean competition PR, reviewers mainly care about that **records/...** directory. Everything else in the repo is optional.

## 2. One folder on your Mac (no duplicate repo)

Use **only this git clone** (your fork), e.g.:

**`/Users/shivashish/Desktop/parameter-golf-fork`**

Open **that** path in Cursor/VS Code. Do **not** keep a second `parameter-golf` copy on Desktop — it wastes space and drifts out of sync.

| Path | Purpose |
|------|---------|
| `HANDOFF.md` (this file) | Map of URLs, paths, commands |
| `README.md` | Upstream readme + Mac / prep notes |
| `scripts/check_submission_local.py` | CPU/MPS smoke test for a `train_gpt.py` |
| `scripts/sample_fineweb_tokens.py` | Decode shard samples |
| `scripts/validate_submission.py` | AST + sliding ref + import + forward/quant (defaults to SOTA `train_gpt.py`) |
| `data/datasets/`, `data/tokenizers/` | Downloaded data (**gitignored**, stays local) |
| `.venv/` | Python venv (**gitignored**, stays local) |

Committed to git: code, `records/`, `HANDOFF.md`, `scripts/`. **Not** committed: `data/datasets`, `.venv` (see `.gitignore`).

## 3. Commands (copy-paste)

**Download minimal data (val + 1 train shard):**

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
```

**Peek at real text in the corpus:**

```bash
python3 scripts/sample_fineweb_tokens.py --shard val --num-samples 5 --length 96
```

**Smoke-test your submission file (paths may differ):**

```bash
python3 scripts/check_submission_local.py \
records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum/train_gpt.py
```

**Validate (AST + optional torch checks; same default path):**

```bash
python3 scripts/validate_submission.py
```

**MLX baseline on Apple Silicon:**

```bash
RUN_ID=mlx_smoke ITERATIONS=200 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 python3 train_gpt_mlx.py
```

**Official training (8×H100, from submission directory):**

```bash
cd records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## 4. Git workflow

Work inside this repo only. Edit, then:

```bash
git add -A && git status
git commit -m "your message"
git push origin submission/allinone-smeargate-int6qat-slidingwindow
```

## 5. After you get a real GPU run

1. Note `val_bpb` (and `val_loss`, artifact bytes) from the log.
2. Update `submission.json` in the submission folder (a small helper sketch follows this list).
3. Commit and push the fork branch; PR #223 updates automatically.
4. Mark the PR ready for review when you meet record rules (e.g. multiple seeds if required).
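
A minimal sketch of step 2, using only the standard library and the field names from `submission.json`; the metric values are placeholders you would copy from the run log:

```python
import json
from pathlib import Path

# Placeholder values -- copy the real ones from your 8xH100 run log.
run_metrics = {"val_loss": 0.0, "val_bpb": 0.0, "bytes_total": 0}

sub_path = Path(
    "records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum/submission.json"
)
sub = json.loads(sub_path.read_text())
sub.update(run_metrics)  # fill in val_loss, val_bpb, bytes_total
sub_path.write_text(json.dumps(sub, indent=2) + "\n")
print("updated", sub_path)
```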

---

*Last aligned with: SOTA+ submission (PR #198 base + RoPE50K + EMA + curriculum + TTT).*
README.md: 24 additions, 1 deletion

@@ -102,6 +102,12 @@ python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
By default this downloads the full validation split plus 80 training shards (8B tokens). For a smaller local smoke subset, pass `--train-shards 1`, for example `python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1`.

After downloading, you can **inspect raw text** (decoded BPE) from a shard:

```bash
python3 scripts/sample_fineweb_tokens.py --shard val --num-samples 5 --length 96
```

Then run a small MLX training job:

```bash
# ... (earlier env vars are collapsed in this diff hunk)
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py
```

Validation always runs on the full `fineweb_val_*` split, which is the fixed first-50k-document set. The smoke command above skips periodic validation and just prints the final `val_loss` and `val_bpb` once at the end (that final full-val pass can still take a while if `VAL_BATCH_SIZE` is very small; use the default `VAL_BATCH_SIZE` unless you are debugging memory).
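
For reference, `val_bpb` is the next-token loss rescaled from nats per token to bits per UTF-8 byte. A minimal sketch with made-up numbers (the real scripts take the byte counts from the tokenizer byte tables):

```python
import math

# Hypothetical numbers: mean next-token loss in nats, plus the token and
# byte totals of the validation split.
val_loss_nats_per_token = 2.95
val_tokens = 10_000_000
val_bytes = 38_000_000

total_nats = val_loss_nats_per_token * val_tokens
val_bpb = total_nats / (math.log(2) * val_bytes)  # bits per byte
print(f"val_bpb ~ {val_bpb:.4f}")
```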

#### Your `records/.../train_gpt.py` submission on a Mac

Leaderboard submissions target **CUDA + 8 GPUs** (`torchrun`). That path will not run end-to-end on Apple Silicon. You can still de-risk locally:

1. **MLX** (`train_gpt_mlx.py`) — same FineWeb shards and the same **BPB idea** (next-token loss converted with the tokenizer byte tables), but a **different** training stack than your CUDA submission. Use it to learn the data pipeline and see loss/BPB trends.
2. **Smoke test** — load your submission file as a module and run a tiny forward pass plus an int6 export roundtrip on **CPU or MPS** (no official score; a toy int6 sketch follows this list):

```bash
python3 scripts/check_submission_local.py records/track_10min_16mb/<your_run>/train_gpt.py
```

Optional: `LOCAL_SMOKE_LAYERS=2 LOCAL_SMOKE_DIM=128` for an even smaller model. A real **competition BPB** (sliding window, 10-minute train, quantized artifact) still requires a CUDA machine (e.g. Runpod).
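
To build intuition for the int6 export roundtrip mentioned in step 2, here is a toy symmetric int6 quantize/dequantize in PyTorch. It is only a sketch: the submission's real export (including any packing and the zstd-22 step) lives in `train_gpt.py` and is not reproduced here.

```python
import torch

def int6_roundtrip(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int6 quantization: 64 levels in [-32, 31]."""
    scale = w.abs().max() / 31.0
    q = torch.clamp(torch.round(w / scale), -32, 31)  # int6 code values
    return q * scale                                   # dequantized weights

w = torch.randn(512, 1536)  # e.g. one MLP projection
w_hat = int6_roundtrip(w)
print(f"max abs roundtrip error: {(w - w_hat).abs().max().item():.5f}")
```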

**Prep checklist (in order):**

1. Download data and run `sample_fineweb_tokens.py` so you've seen the corpus.
2. MLX smoke run with `train_gpt_mlx.py`.
3. `scripts/check_submission_local.py` on your `records/.../train_gpt.py`.
4. Full `torchrun` on **8×H100 SXM** and log `val_bpb`.
5. Multiple seeds if you claim a record.

**Use one clone on your machine:** keep a single repo folder (this fork), with `data/` and `.venv` local only — see `HANDOFF.md` at repo root.

### Scaling Up to a Remote Machine

New file: 79 additions

# SOTA+ TTT + RoPE50K + EMA + Curriculum

**Target: sub-1.13 BPB** | 8xH100 SXM, 600s | Pending compute run

## Base: PR #198 Stack (1.1326 BPB)

Every proven technique from the current #1 submission:

| Technique | Detail |
|-----------|--------|
| 11 layers | Deeper model, funded by int6 compression |
| Int6 MLP+Attn / Int8 Embed | Mixed precision quantization + zstd-22 |
| MLP 3x (1536 hidden) | Wider feed-forward, enabled by int6 savings |
| SmearGate | Learned per-dim gate blending token with predecessor |
| BigramHash (2048 buckets) | Hash-based token-pair embeddings |
| OrthoInit + muP | Orthogonal weight init with output scaling |
| WD=0.04 (Muon + Adam) | Quantization-friendly weight distribution |
| FA3 with SDPA fallback | FlashAttention 3 on H100, PyTorch SDPA locally |
| Sliding window eval (s64) | Near-full context for every scored token |
| FP16 tied embedding | Embedding never quantized |
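
As one plausible reading of the SmearGate row above (not the submission's actual module), a learned per-dimension gate that blends each token's embedding with its predecessor could look like:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Per-dimension learned gate that blends each token with its predecessor."""
    def __init__(self, dim: int):
        super().__init__()
        # Start near the identity: sigmoid(-4) ~ 0.018, so almost no smearing at init.
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)   # shift right by one token
        g = torch.sigmoid(self.gate)                      # per-dim blend in (0, 1)
        return (1.0 - g) * x + g * prev

x = torch.randn(2, 16, 512)
print(SmearGate(512)(x).shape)  # torch.Size([2, 16, 512])
```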

## New: Four Untried Improvements

### 1. RoPE Base 50K (was 10K)

Smoother position interpolation at seq2048. Validated by PR #206 (1.1507 on 9L).
Zero parameter/compute cost. Expected gain: ~0.002 BPB.
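
The change is only to the base used to build the rotary frequency table. A minimal sketch of the standard RoPE inverse frequencies (helper name is illustrative):

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    # One inverse frequency per pair of channels, as in standard RoPE;
    # the rotation angle at position p is p * inv_freq.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 64  # 512 dim / 8 heads
old = rope_inv_freq(head_dim, 10_000.0)
new = rope_inv_freq(head_dim, 50_000.0)
# Larger base -> slower-rotating low frequencies -> smoother long-range
# position handling at seq2048.
print(old[-1].item(), new[-1].item())
```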

### 2. LAWA-EMA (replaces periodic SWA)

Exponential moving average (decay=0.995) updated every step during warmdown,
instead of periodic SWA checkpoints every 200 steps. Smoother weight averaging
should reduce noise in the final model. Expected gain: ~0.002 BPB.
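
A minimal sketch of the per-step EMA update (decay 0.995); the warmdown gating and the tiny stand-in model are illustrative:

```python
import copy
import torch
import torch.nn as nn

DECAY = 0.995

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module, decay: float = DECAY) -> None:
    # ema <- decay * ema + (1 - decay) * online weights, called once per step.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Tiny demo; in the real run this would only fire during warmdown and the
# EMA weights would supply the final scored model.
model = nn.Linear(8, 8)
ema_model = copy.deepcopy(model)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(ema_model, model)
```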

### 3. Context-Length Curriculum

Train at seq1024 for the first 60% of wallclock (~50ms/step), then switch to seq2048
(~81ms/step). The short-context phase yields ~60% more optimizer steps, building
a stronger feature representation before introducing long context. Expected gain: ~0.003 BPB.
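
A minimal sketch of the wallclock-based context switch; the constants come from the description above, everything else is illustrative:

```python
import time

TRAIN_SECONDS = 600          # 10-minute budget
SHORT_FRAC = 0.60            # first 60% of wallclock at short context
SHORT_SEQ, LONG_SEQ = 1024, 2048

start = time.time()

def current_seq_len() -> int:
    elapsed = time.time() - start
    return SHORT_SEQ if elapsed < SHORT_FRAC * TRAIN_SECONDS else LONG_SEQ

# Each step the data loader slices batches to current_seq_len(); the shorter
# phase (~50 ms/step vs ~81 ms/step) buys roughly 60% more optimizer steps.
print(current_seq_len())
```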

### 4. Full-Model SGD Test-Time Training

After training, run 1 epoch of SGD (lr=3e-4, momentum=0.95) over the validation
set before scoring. Each token is predicted with backward-looking context only
(the causal model ensures no leakage), which adapts the model to the evaluation distribution.

Without SmearGate, TTT improved BPB by ~0.033 (PR #152); with SmearGate on a 9L model,
by only ~0.001 (PR #178). The true gain on the full 11L stack is the critical unknown.
Expected gain: 0.001 to 0.033 BPB.
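
A hedged sketch of one leak-free way to implement this: walk the validation stream in order and score each block before taking the SGD step on it, so every scored token only benefits from earlier context. Names and the loss interface are illustrative, not the submission's API.

```python
import torch

def ttt_and_score(model, val_blocks, lr=3e-4, momentum=0.95):
    """One SGD epoch over the validation stream, scoring each block before updating."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nats, total_tokens = 0.0, 0
    model.train()
    for inputs, targets in val_blocks:                 # blocks in document order
        loss = model(inputs, targets)                  # causal next-token loss (nats)
        total_nats += loss.item() * targets.numel()    # score BEFORE updating
        total_tokens += targets.numel()
        opt.zero_grad()
        loss.backward()                                # then adapt on the same block
        opt.step()
    return total_nats / total_tokens                   # per-token nats, for BPB conversion
```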

## Expected Outcome

| Scenario | BPB | Delta vs #198 |
|----------|-----|---------------|
| Conservative (TTT ~0.001) | ~1.125 | -0.008 |
| Moderate (TTT ~0.010) | ~1.116 | -0.017 |
| Aggressive (TTT ~0.033) | ~1.093 | -0.040 |

## Run Command

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are baked into defaults. Override with env vars if needed:

```bash
EMA_ENABLED=0 TTT_ENABLED=0 CURRICULUM_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Architecture

- 11 layers, 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x (hidden=1536), relu-squared activation
- Vocab 1024 (SentencePiece BPE), tied embeddings
- RoPE base 50K, logit softcapping (30.0)
- U-Net skip connections with learned weights
- ~26.8M parameters, ~15.7MB artifact (int6+zstd-22)
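
The same spec written out as an illustrative config object (field names are made up for this sketch; the real defaults are baked into `train_gpt.py`):

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    n_layers: int = 11
    d_model: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4        # GQA
    mlp_hidden: int = 1536     # 3x width, relu-squared activation
    vocab_size: int = 1024     # SentencePiece BPE, tied embeddings
    rope_base: float = 50_000.0
    logit_softcap: float = 30.0
    # U-Net-style skips with learned weights; ~26.8M params,
    # ~15.7MB artifact after int6 quantization + zstd-22.

print(ModelSpec())
```
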
New file: 13 additions

{
"track": "10min_16mb",
"author": "0xjaishy",
"github_id": "0xjaishy",
"name": "SOTA+ TTT + RoPE50K + EMA + Curriculum",
"blurb": "PR#198 stack (11L Int6 MLP3x SmearGate BigramHash OrthoInit WD0.04) + RoPE 50K + LAWA-EMA + context-length curriculum + full-model SGD test-time training",
"val_loss": 0.0,
"val_bpb": 0.0,
"bytes_code": 67947,
"bytes_total": 0,
"date": "2026-03-20",
"notes": "Pending 8xH100 SXM run. Conservative target: 1.125 BPB. Aggressive target: <1.10 BPB."
}