Phase 2 skeleton files: PHASE2_TROUBLESHOOTING.md + PHASE2_RESULTS.md

Takoda Mundy · claude · Takoda Mundy · commit 6a32164887b2 · 2026-04-09T13:24:52.000+10:00
Mirror the Phase 1 append-only log format, pre-seeded with:

- Phase 1 baseline context (what we're improving from)
- Comp anchors (PR openai#1477/openai#1482/openai#1485) as the target ceiling
- Shot-by-shot result slots for S1-S8
- A cumulative speedup tracker table
- Phase 2 targets: ≥5x speedup, val_bpb 1.10-1.18 on 1xH100 SXM

The Phase 1 dry run val_bpb placeholder will be filled in once Pod L's run completes (~03:30Z). That number becomes the Phase 2 floor we're optimizing against.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/PHASE2_RESULTS.md b/PHASE2_RESULTS.md
@@ -0,0 +1,77 @@
+# PHASE2_RESULTS.md — append-only speedup + val_bpb ledger
+
+**Comp**: openai/parameter-golf
+**Phase**: 2 (speed work)
+**Plan**: PHASE2_PLAN.md
+**Model invariant**: Phase 1 locked-in stack (train.py at 731 lines, 10 patches, git HEAD 3dfc868)
+
+Each row: shot id, hardware, wallclock, steps achieved, ms/step, tok/s, val_bpb, artifact_bytes, speedup vs Phase 1 baseline, status, UTC timestamp.
+
+| shot | hardware | wallclock | steps | ms/step | tok/s | val_bpb | artifact_bytes | speedup | status | utc |
+|---|---|---|---|---|---|---|---|---|---|---|
+| (P1 baseline) | 1×H100 SXM 80GB | 600s | 180 | ~3300 | ~280K | TBD | TBD | 1.0× | Phase 1 dry run | 20260409T0230Z approx |
+
+---
+
+## Phase 1 baseline context
+
+Phase 1 hit 180 steps in 600s because:
+- `torch.compile` disabled (~3-5× penalty)
+- FA3 not installed, SDPA fallback (~30-50% penalty)
+- N-gram bias forward overhead (~5-10%)
+- 3-layer recurrence adds 13% more layers
+- Small model on a big GPU — kernel launch overhead dominates
+
+**Per-GPU rate**: 0.30 steps/sec (180 steps / 600 s) vs comp records' 4.17 steps/sec/GPU, i.e. ~14× slower.
+
+## Comp anchors (the target)
+
+| PR | stack | val_bpb | hardware |
+|---|---|---|---|
+| #1485 | 1477 + 3L recurrence + Pre-Quant AdamW TTT + EMA 0.9965 + QK5 | **1.0679** | 8×H100 SXM |
+| #1477 | SP8192 + Parallel Residuals + Score-First TTT | 1.0822 | 8×H100 SXM |
+| #1482 | SP8192 + Pre-Quant TTT QK 5.25 8ep freeze-1 | 1.0787 | 8×H100 SXM |
+
+**Phase 2 target on 1×H100 SXM**: val_bpb in the **1.10-1.18 range** (within roughly 0.10 of comp records). This won't match the 8×H100 results because we have 1/8 of the raw compute, but we should close most of the gap, relative to the 8× vs 1× ratio, once the code path is optimized.
+
+---
+
+## Shot-by-shot results
+
+### Shot 1 — torch.compile re-enable
+<!-- fill in when run -->
+
+### Shot 2 — FA3 sourcing
+<!-- fill in when run -->
+
+### Shot 3 — Persistent CUDAGraph capture
+<!-- fill in when run -->
+
+### Shot 4 — Fused n-gram bias Triton kernel
+<!-- fill in when run -->
+
+### Shot 5 — GPTQ int6 dequant + matmul fusion
+<!-- fill in when run -->
+
+### Shot 6 — Custom SDPA replacement
+<!-- fill in when run (probably skipped if FA3 lands in Shot 2) -->
+
+### Shot 7 — Int8 tabulation hash GPU gather
+<!-- fill in when run (probably skipped) -->
+
+### Shot 8 — FP8 compute paths
+<!-- fill in when run (probably skipped) -->
+
+---
+
+## Cumulative speedup tracker
+
+| after shot | ms/step | vs P1 baseline | steps in 600s | val_bpb | Δ val_bpb vs P1 |
+|---|---|---|---|---|---|
+| P1 (baseline) | ~3300 | 1.0× | 180 | TBD | — |
+| +S1 (compile) | TBD | TBD | TBD | TBD | TBD |
+| +S2 (FA3) | TBD | TBD | TBD | TBD | TBD |
+| +S3 (CUDAGraph) | TBD | TBD | TBD | TBD | TBD |
+| +S4 (fused ngram) | TBD | TBD | TBD | TBD | TBD |
+| +S5 (GPTQ fusion, eval only) | TBD | TBD | TBD | TBD | TBD |
+| Phase 2 done | **target ≥5× / ≤660 ms/step / ≥900 steps / val_bpb 1.10-1.18** | | | | |
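
The tracker columns in PHASE2_RESULTS.md all follow from two measurements: wallclock and steps. A minimal sketch of that arithmetic, assuming the 600 s budget and the approximate 3300 ms/step Phase 1 baseline quoted in the ledger. The function names are hypothetical and are not part of `submission/`.

```python
WALLCLOCK_S = 600       # run budget in seconds, as quoted in the ledger
P1_MS_PER_STEP = 3300   # approximate Phase 1 baseline, as quoted in the ledger


def ms_per_step(steps: int, wallclock_s: float = WALLCLOCK_S) -> float:
    """Average milliseconds per step for a run that reached `steps`."""
    return wallclock_s * 1000 / steps


def speedup(ms: float, baseline_ms: float = P1_MS_PER_STEP) -> float:
    """Speedup factor relative to the Phase 1 baseline step time."""
    return baseline_ms / ms


def steps_in_budget(ms: float, wallclock_s: float = WALLCLOCK_S) -> int:
    """Whole steps that fit in the wallclock budget at `ms` per step."""
    return int(wallclock_s * 1000 // ms)


if __name__ == "__main__":
    # The Phase 2 target row: 660 ms/step is exactly 5x the rounded baseline.
    print(speedup(660))          # 5.0
    print(steps_in_budget(660))  # 909
```

By the same arithmetic, 900 steps in 600 s works out to about 667 ms/step, so the "≥900 steps" figure in the target row is a rounded form of the "≤660 ms/step" target.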
diff --git a/PHASE2_TROUBLESHOOTING.md b/PHASE2_TROUBLESHOOTING.md
@@ -0,0 +1,29 @@
+# PHASE2_TROUBLESHOOTING.md — append-only log
+
+**Comp**: openai/parameter-golf
+**Phase**: 2 (speed work on the locked-in Phase 1 model)
+**Hardware**: cheap 3090/4070 Ti pods (NO H100 until final submission run)
+**Plan reference**: PHASE2_PLAN.md
+**Model invariant**: submission/train.py is LOCKED from Phase 1 (10 patches + PR #1477 base). Phase 2 only changes *how* the math runs, not *what* math runs.
+
+This file is append-only. Each entry: timestamp, what broke, what we did, why, and
+whether the fix is "permanent" (in the repo) or "ad-hoc" (lives only on a specific pod).
+
+## Operating rules (inherited from Phase 1)
+
+1. **Clean python files only** — all changes land in `submission/train.py`, `submission/run.sh`, or new files under `submission/kernels/`. No patcher hunks.
+2. **Every workaround must be repo-checked-in** — if you SSH and `rm`/`mv` files, that's an ad-hoc fix and you must follow up with a permanent fix in the repo so the next clean pod boot can reproduce. Mark each entry below as PERMANENT or AD-HOC.
+3. **Document the WHY** — not just what command, but what error/symptom led to it.
+4. **Never bypass safety** — never `--no-verify`, never `git push --force`.
+5. **val_bpb invariant** — every Phase 2 change must keep val_bpb within ε=0.005 of the Phase 1 baseline. Log any drift in this file.
+
+## Phase 1 baseline to preserve (the floor)
+
+- train.py: 731 lines, git HEAD at 3dfc868
+- 10 patches all active in run.sh defaults
+- Phase 1 dry run val_bpb: **TBD** (waiting on Pod L `55fzwdfhbg9n4u` dry run to land ~2026-04-09 03:30Z)
+- Reference speed: 180 steps in 600s wallclock on 1× H100 SXM (eager mode, SDPA fallback, no compile)
+
+---
+
+<!-- Phase 2 entries get appended below this line. -->
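
Operating rule 5 (the val_bpb invariant) can be checked mechanically before a result is logged. A minimal sketch, assuming "within ε" means an absolute difference of at most 0.005 in either direction. The Phase 1 baseline is still TBD, so it is passed in as an argument; the function names and the numbers in the usage lines are hypothetical placeholders, not measured results.

```python
EPSILON = 0.005  # tolerance from operating rule 5


def within_invariant(val_bpb: float, baseline: float, eps: float = EPSILON) -> bool:
    """True if val_bpb stays within eps of the Phase 1 baseline (assumed two-sided)."""
    return abs(val_bpb - baseline) <= eps


def drift_log_line(shot: str, val_bpb: float, baseline: float) -> str:
    """Format a one-line drift record for the troubleshooting log."""
    drift = val_bpb - baseline  # positive means higher (worse) bits per byte
    status = "OK" if within_invariant(val_bpb, baseline) else "DRIFT"
    return f"{shot}: val_bpb={val_bpb:.4f} drift={drift:+.4f} [{status}]"


if __name__ == "__main__":
    # Placeholder values for illustration only; the real baseline is TBD.
    print(drift_log_line("S1", 1.1030, 1.1000))
    print(drift_log_line("S2", 1.1100, 1.1000))
```

Keeping the baseline as a parameter avoids hard-coding a number that has not landed yet.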