HANDOFF.md: 98 additions, 0 deletions

# Parameter Golf — full map (where everything lives)

This file is the **single place** that describes what is on GitHub vs what is only on your Mac.

## 1. What counts for the challenge (on GitHub)

| Item | Location |
|------|----------|
| Your fork | `https://github.com/0xjaishy/parameter-golf` |
| PR to OpenAI | `https://github.com/openai/parameter-golf/pull/223` |
| Git branch | `submission/allinone-smeargate-int6qat-slidingwindow` |
| **Submission folder** | `records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum/` |
| Entry script | `train_gpt.py` (run with `torchrun --standalone --nproc_per_node=8` on **8×H100 SXM**) |
| Metadata | `submission.json` — set `val_loss`, `val_bpb`, `bytes_total` after a real run |
| Short write-up | `README.md` in that same folder |

**Rule:** For a clean competition PR, reviewers mainly care about that **records/...** directory. Everything else in the repo is optional.

## 2. One folder on your Mac (no duplicate repo)

Use **only this git clone** (your fork), e.g.:

**`/Users/shivashish/Desktop/parameter-golf-fork`**

Open **that** path in Cursor/VS Code. Do **not** keep a second `parameter-golf` copy on Desktop — it wastes space and drifts out of sync.

| Path | Purpose |
|------|---------|
| `HANDOFF.md` (this file) | Map of URLs, paths, commands |
| `README.md` | Upstream readme + Mac / prep notes |
| `scripts/check_submission_local.py` | CPU/MPS smoke test for a `train_gpt.py` |
| `scripts/sample_fineweb_tokens.py` | Decode shard samples |
| `scripts/validate_submission.py` | AST + sliding ref + import + forward/quant (defaults to SOTA `train_gpt.py`) |
| `data/datasets/`, `data/tokenizers/` | Downloaded data (**gitignored**, stays local) |
| `.venv/` | Python venv (**gitignored**, stays local) |

Committed to git: code, `records/`, `HANDOFF.md`, `scripts/`. **Not** committed: `data/datasets`, `.venv` (see `.gitignore`).

## 3. Commands (copy-paste)

**Download minimal data (val + 1 train shard):**

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
```

**Peek at real text in the corpus:**

```bash
python3 scripts/sample_fineweb_tokens.py --shard val --num-samples 5 --length 96
```

**Smoke-test your submission file (paths may differ):**

```bash
python3 scripts/check_submission_local.py \
records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum/train_gpt.py
```

**Validate (AST + optional torch checks; same default path):**

```bash
python3 scripts/validate_submission.py
```

**MLX baseline on Apple Silicon:**

```bash
RUN_ID=mlx_smoke ITERATIONS=200 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 python3 train_gpt_mlx.py
```

**Official training (8×H100, from submission directory):**

```bash
cd records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## 4. Git workflow

Work inside this repo only. Edit, then:

```bash
git add -A && git status
git commit -m "your message"
git push origin submission/allinone-smeargate-int6qat-slidingwindow
```

## 5. After you get a real GPU run

1. Note `val_bpb` (and `val_loss`, artifact bytes) from the log.
2. Update `submission.json` in the submission folder (a small helper sketch follows this list).
3. Commit and push the fork branch; PR #223 updates automatically.
4. Mark the PR ready for review when you meet record rules (e.g. multiple seeds if required).
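
A minimal sketch of step 2, using only the standard library and the field names from `submission.json`; the metric values are placeholders you would copy from the run log:

```python
import json
from pathlib import Path

# Placeholder values -- copy the real ones from your 8xH100 run log.
run_metrics = {"val_loss": 0.0, "val_bpb": 0.0, "bytes_total": 0}

sub_path = Path(
    "records/track_10min_16mb/2026-03-20_SOTA_TTT_RoPE50K_EMA_Curriculum/submission.json"
)
sub = json.loads(sub_path.read_text())
sub.update(run_metrics)  # fill in val_loss, val_bpb, bytes_total
sub_path.write_text(json.dumps(sub, indent=2) + "\n")
print("updated", sub_path)
```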

---

*Last aligned with: SOTA+ submission (PR #198 base + RoPE50K + EMA + curriculum + TTT).*
README.md: 24 additions, 1 deletion

@@ -102,6 +102,12 @@ python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
By default this downloads the full validation split plus 80 training shards (8B tokens). For a smaller local smoke subset, pass `--train-shards 1`, for example `python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1`.

After downloading, you can **inspect raw text** (decoded BPE) from a shard:

```bash
python3 scripts/sample_fineweb_tokens.py --shard val --num-samples 5 --length 96
```

Then run a small MLX training job:

```bash
# ... (earlier env vars are collapsed in this diff hunk)
VAL_BATCH_SIZE=8192 \
python3 train_gpt_mlx.py
```

Validation always runs on the full `fineweb_val_*` split, which is the fixed first-50k-document set. The smoke command above skips periodic validation and just prints the final `val_loss` and `val_bpb` once at the end (that final full-val pass can still take a while if `VAL_BATCH_SIZE` is very small; use the default `VAL_BATCH_SIZE` unless you are debugging memory).
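
For reference, `val_bpb` is the next-token loss rescaled from nats per token to bits per UTF-8 byte. A minimal sketch with made-up numbers (the real scripts take the byte counts from the tokenizer byte tables):

```python
import math

# Hypothetical numbers: mean next-token loss in nats, plus the token and
# byte totals of the validation split.
val_loss_nats_per_token = 2.95
val_tokens = 10_000_000
val_bytes = 38_000_000

total_nats = val_loss_nats_per_token * val_tokens
val_bpb = total_nats / (math.log(2) * val_bytes)  # bits per byte
print(f"val_bpb ~ {val_bpb:.4f}")
```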

#### Your `records/.../train_gpt.py` submission on a Mac

Leaderboard submissions target **CUDA + 8 GPUs** (`torchrun`). That path will not run end-to-end on Apple Silicon. You can still de-risk locally:

1. **MLX** (`train_gpt_mlx.py`) — same FineWeb shards and the same **BPB idea** (next-token loss converted with the tokenizer byte tables), but a **different** training stack than your CUDA submission. Use it to learn the data pipeline and see loss/BPB trends.
2. **Smoke test** — load your submission file as a module and run a tiny forward pass plus an int6 export roundtrip on **CPU or MPS** (no official score; a toy int6 sketch follows this list):

```bash
python3 scripts/check_submission_local.py records/track_10min_16mb/<your_run>/train_gpt.py
```

Optional: `LOCAL_SMOKE_LAYERS=2 LOCAL_SMOKE_DIM=128` for an even smaller model. A real **competition BPB** (sliding window, 10-minute train, quantized artifact) still requires a CUDA machine (e.g. Runpod).
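
To build intuition for the int6 export roundtrip mentioned in step 2, here is a toy symmetric int6 quantize/dequantize in PyTorch. It is only a sketch: the submission's real export (including any packing and the zstd-22 step) lives in `train_gpt.py` and is not reproduced here.

```python
import torch

def int6_roundtrip(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int6 quantization: 64 levels in [-32, 31]."""
    scale = w.abs().max() / 31.0
    q = torch.clamp(torch.round(w / scale), -32, 31)  # int6 code values
    return q * scale                                   # dequantized weights

w = torch.randn(512, 1536)  # e.g. one MLP projection
w_hat = int6_roundtrip(w)
print(f"max abs roundtrip error: {(w - w_hat).abs().max().item():.5f}")
```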

**Prep checklist (in order):**

1. Download data and run `sample_fineweb_tokens.py` so you've seen the corpus.
2. MLX smoke run with `train_gpt_mlx.py`.
3. `scripts/check_submission_local.py` on your `records/.../train_gpt.py`.
4. Full `torchrun` on **8×H100 SXM** and log `val_bpb`.
5. Multiple seeds if you claim a record.

**Use one clone on your machine:** keep a single repo folder (this fork), with `data/` and `.venv` local only — see `HANDOFF.md` at repo root.

### Scaling Up to a Remote Machine

New file: 79 additions

# SOTA+ TTT + RoPE50K + EMA + Curriculum

**Target: sub-1.13 BPB** | 8xH100 SXM, 600s | Pending compute run

## Base: PR #198 Stack (1.1326 BPB)

Every proven technique from the current #1 submission:

| Technique | Detail |
|-----------|--------|
| 11 layers | Deeper model, funded by int6 compression |
| Int6 MLP+Attn / Int8 Embed | Mixed precision quantization + zstd-22 |
| MLP 3x (1536 hidden) | Wider feed-forward, enabled by int6 savings |
| SmearGate | Learned per-dim gate blending token with predecessor |
| BigramHash (2048 buckets) | Hash-based token-pair embeddings |
| OrthoInit + muP | Orthogonal weight init with output scaling |
| WD=0.04 (Muon + Adam) | Quantization-friendly weight distribution |
| FA3 with SDPA fallback | FlashAttention 3 on H100, PyTorch SDPA locally |
| Sliding window eval (s64) | Near-full context for every scored token |
| FP16 tied embedding | Embedding never quantized |
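
As one plausible reading of the SmearGate row above (not the submission's actual module), a learned per-dimension gate that blends each token's embedding with its predecessor could look like:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Per-dimension learned gate that blends each token with its predecessor."""
    def __init__(self, dim: int):
        super().__init__()
        # Start near the identity: sigmoid(-4) ~ 0.018, so almost no smearing at init.
        self.gate = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)   # shift right by one token
        g = torch.sigmoid(self.gate)                      # per-dim blend in (0, 1)
        return (1.0 - g) * x + g * prev

x = torch.randn(2, 16, 512)
print(SmearGate(512)(x).shape)  # torch.Size([2, 16, 512])
```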

## New: Four Untried Improvements

### 1. RoPE Base 50K (was 10K)

Smoother position interpolation at seq2048. Validated by PR #206 (1.1507 on 9L).
Zero parameter/compute cost. Expected gain: ~0.002 BPB.
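
The change is only to the base used to build the rotary frequency table. A minimal sketch of the standard RoPE inverse frequencies (helper name is illustrative):

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    # One inverse frequency per pair of channels, as in standard RoPE;
    # the rotation angle at position p is p * inv_freq.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 64  # 512 dim / 8 heads
old = rope_inv_freq(head_dim, 10_000.0)
new = rope_inv_freq(head_dim, 50_000.0)
# Larger base -> slower-rotating low frequencies -> smoother long-range
# position handling at seq2048.
print(old[-1].item(), new[-1].item())
```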

### 2. LAWA-EMA (replaces periodic SWA)

Exponential moving average (decay=0.995) updated every step during warmdown,
instead of periodic SWA checkpoints every 200 steps. Smoother weight averaging
should reduce noise in the final model. Expected gain: ~0.002 BPB.
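
A minimal sketch of the per-step EMA update (decay 0.995); the warmdown gating and the tiny stand-in model are illustrative:

```python
import copy
import torch
import torch.nn as nn

DECAY = 0.995

@torch.no_grad()
def ema_update(ema_model: nn.Module, model: nn.Module, decay: float = DECAY) -> None:
    # ema <- decay * ema + (1 - decay) * online weights, called once per step.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Tiny demo; in the real run this would only fire during warmdown and the
# EMA weights would supply the final scored model.
model = nn.Linear(8, 8)
ema_model = copy.deepcopy(model)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(ema_model, model)
```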

### 3. Context-Length Curriculum

Train at seq1024 for the first 60% of wallclock (~50ms/step), then switch to seq2048
(~81ms/step). The short-context phase yields ~60% more optimizer steps, building
a stronger feature representation before introducing long context. Expected gain: ~0.003 BPB.
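
A minimal sketch of the wallclock-based context switch; the constants come from the description above, everything else is illustrative:

```python
import time

TRAIN_SECONDS = 600          # 10-minute budget
SHORT_FRAC = 0.60            # first 60% of wallclock at short context
SHORT_SEQ, LONG_SEQ = 1024, 2048

start = time.time()

def current_seq_len() -> int:
    elapsed = time.time() - start
    return SHORT_SEQ if elapsed < SHORT_FRAC * TRAIN_SECONDS else LONG_SEQ

# Each step the data loader slices batches to current_seq_len(); the shorter
# phase (~50 ms/step vs ~81 ms/step) buys roughly 60% more optimizer steps.
print(current_seq_len())
```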

### 4. Full-Model SGD Test-Time Training

After training, run 1 epoch of SGD (lr=3e-4, momentum=0.95) over the validation
set before scoring. Each token is predicted with backward-looking context only
(the causal model ensures no leakage), which adapts the model to the evaluation distribution.

Without SmearGate, TTT improved BPB by ~0.033 (PR #152); with SmearGate on a 9L model,
by only ~0.001 (PR #178). The true gain on the full 11L stack is the critical unknown.
Expected gain: 0.001 to 0.033 BPB.
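
A hedged sketch of one leak-free way to implement this: walk the validation stream in order and score each block before taking the SGD step on it, so every scored token only benefits from earlier context. Names and the loss interface are illustrative, not the submission's API.

```python
import torch

def ttt_and_score(model, val_blocks, lr=3e-4, momentum=0.95):
    """One SGD epoch over the validation stream, scoring each block before updating."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nats, total_tokens = 0.0, 0
    model.train()
    for inputs, targets in val_blocks:                 # blocks in document order
        loss = model(inputs, targets)                  # causal next-token loss (nats)
        total_nats += loss.item() * targets.numel()    # score BEFORE updating
        total_tokens += targets.numel()
        opt.zero_grad()
        loss.backward()                                # then adapt on the same block
        opt.step()
    return total_nats / total_tokens                   # per-token nats, for BPB conversion
```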

## Expected Outcome

| Scenario | BPB | Delta vs #198 |
|----------|-----|---------------|
| Conservative (TTT ~0.001) | ~1.125 | -0.008 |
| Moderate (TTT ~0.010) | ~1.116 | -0.017 |
| Aggressive (TTT ~0.033) | ~1.093 | -0.040 |

## Run Command

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All hyperparameters are baked into defaults. Override with env vars if needed:

```bash
EMA_ENABLED=0 TTT_ENABLED=0 CURRICULUM_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Architecture

- 11 layers, 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x (hidden=1536), relu-squared activation
- Vocab 1024 (SentencePiece BPE), tied embeddings
- RoPE base 50K, logit softcapping (30.0)
- U-Net skip connections with learned weights
- ~26.8M parameters, ~15.7MB artifact (int6+zstd-22)
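
The same spec written out as an illustrative config object (field names are made up for this sketch; the real defaults are baked into `train_gpt.py`):

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    n_layers: int = 11
    d_model: int = 512
    n_heads: int = 8
    n_kv_heads: int = 4        # GQA
    mlp_hidden: int = 1536     # 3x width, relu-squared activation
    vocab_size: int = 1024     # SentencePiece BPE, tied embeddings
    rope_base: float = 50_000.0
    logit_softcap: float = 30.0
    # U-Net-style skips with learned weights; ~26.8M params,
    # ~15.7MB artifact after int6 quantization + zstd-22.

print(ModelSpec())
```
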
New file: 13 additions

{
"track": "10min_16mb",
"author": "0xjaishy",
"github_id": "0xjaishy",
"name": "SOTA+ TTT + RoPE50K + EMA + Curriculum",
"blurb": "PR#198 stack (11L Int6 MLP3x SmearGate BigramHash OrthoInit WD0.04) + RoPE 50K + LAWA-EMA + context-length curriculum + full-model SGD test-time training",
"val_loss": 0.0,
"val_bpb": 0.0,
"bytes_code": 67947,
"bytes_total": 0,
"date": "2026-03-20",
"notes": "Pending 8xH100 SXM run. Conservative target: 1.125 BPB. Aggressive target: <1.10 BPB."
}