
Commit 6a32164

Takoda Mundy and claude committed
Phase 2 skeleton files: PHASE2_TROUBLESHOOTING.md + PHASE2_RESULTS.md
Mirror the Phase 1 append-only log format, pre-seeded with:
- Phase 1 baseline context (what we're improving from)
- Comp anchors (PR openai#1477/openai#1482/openai#1485) as the target ceiling
- Shot-by-shot result slots for S1-S8
- A cumulative speedup tracker table
- Phase 2 targets: ≥5x speedup, val_bpb 1.10-1.18 on 1xH100 SXM

The Phase 1 dry run val_bpb placeholder will be filled in once Pod L's run
completes (~03:30Z). That number becomes the Phase 2 floor we're optimizing
against.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 3dfc868 commit 6a32164

2 files changed

Lines changed: 106 additions & 0 deletions


PHASE2_RESULTS.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# PHASE2_RESULTS.md: append-only speedup + val_bpb ledger

**Comp**: openai/parameter-golf
**Phase**: 2 (speed work)
**Plan**: PHASE2_PLAN.md
**Model invariant**: Phase 1 locked-in stack (train.py at 731 lines, 10 patches, git HEAD 3dfc868)

Each row: shot id, hardware, wallclock, steps achieved, ms/step, tok/s, val_bpb, artifact_bytes, speedup vs Phase 1 baseline, status, UTC timestamp.

| shot | hardware | wallclock | steps | ms/step | tok/s | val_bpb | artifact_bytes | speedup | status | utc |
|---|---|---|---|---|---|---|---|---|---|---|
| (P1 baseline) | 1×H100 SXM 80GB | 600s | 180 | ~3300 | ~280K | TBD | TBD | 1.0× | Phase 1 dry run | 20260409T0230Z approx |

---

## Phase 1 baseline context

Phase 1 reached only 180 steps in 600s because:
- `torch.compile` disabled (~3-5× penalty)
- FA3 (FlashAttention-3) not installed, so attention falls back to SDPA (~30-50% penalty)
- N-gram bias forward overhead (~5-10%)
- 3-layer recurrence adds 13% more layers
- Small model on a big GPU, so kernel launch overhead dominates

**Per-GPU rate**: 0.30 steps/sec (180 steps / 600s), vs comp records' 4.17 steps/sec/GPU, i.e. ~14× slower.
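
The baseline arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the helper names are made up for illustration and are not taken from `train.py`, and the only inputs are the step count, the wallclock, and the comp rate quoted in this file. Rounding in the quoted figures can shift the last digit.

```python
# Hypothetical helpers (not from train.py) that re-derive the Phase 1
# baseline figures from the two measured quantities: steps and wallclock.

P1_STEPS = 180                      # steps reached in the Phase 1 dry run
P1_WALLCLOCK_S = 600.0              # wallclock budget in seconds
COMP_STEPS_PER_SEC_PER_GPU = 4.17   # comp-record rate quoted in this file


def steps_per_sec(steps: int, wallclock_s: float) -> float:
    """Training throughput in optimizer steps per second."""
    return steps / wallclock_s


def ms_per_step(steps: int, wallclock_s: float) -> float:
    """Average wallclock cost of one step, in milliseconds."""
    return 1000.0 * wallclock_s / steps


rate = steps_per_sec(P1_STEPS, P1_WALLCLOCK_S)
step_ms = ms_per_step(P1_STEPS, P1_WALLCLOCK_S)
gap = COMP_STEPS_PER_SEC_PER_GPU / rate

print(f"{rate:.2f} steps/sec, {step_ms:.0f} ms/step, {gap:.1f}x behind comp records")
```

The same two helpers can be reused for every later shot, so each ledger row is derived from measured steps and wallclock rather than typed in by hand.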

## Comp anchors (the target)

| PR | stack | val_bpb | hardware |
|---|---|---|---|
| #1485 | #1477 + 3L recurrence + Pre-Quant AdamW TTT + EMA 0.9965 + QK5 | **1.0679** | 8×H100 SXM |
| #1477 | SP8192 + Parallel Residuals + Score-First TTT | 1.0822 | 8×H100 SXM |
| #1482 | SP8192 + Pre-Quant TTT QK 5.25 8ep freeze-1 | 1.0787 | 8×H100 SXM |

**Phase 2 target on 1×H100 SXM**: val_bpb in the **1.10-1.18 range** (within roughly 0.10 of the comp records). We won't match the 8×H100 numbers with 1/8 the raw compute, but once the code path is optimized, the remaining gap should come mostly from the 8× vs 1× compute difference rather than from slow code.

---

## Shot-by-shot results

### Shot 1: torch.compile re-enable
<!-- fill in when run -->

### Shot 2: FA3 sourcing
<!-- fill in when run -->

### Shot 3: Persistent CUDAGraph capture
<!-- fill in when run -->

### Shot 4: Fused n-gram bias Triton kernel
<!-- fill in when run -->

### Shot 5: GPTQ int6 dequant + matmul fusion
<!-- fill in when run -->

### Shot 6: Custom SDPA replacement
<!-- fill in when run (probably skipped if FA3 lands in Shot 2) -->

### Shot 7: Int8 tabulation hash GPU gather
<!-- fill in when run (probably skipped) -->

### Shot 8: FP8 compute paths
<!-- fill in when run (probably skipped) -->

---

## Cumulative speedup tracker

| after shot | ms/step | vs P1 baseline | steps in 600s | val_bpb | Δ val_bpb vs P1 |
|---|---|---|---|---|---|
| P1 (baseline) | ~3300 | 1.0× | 180 | TBD | n/a |
| +S1 (compile) | TBD | TBD | TBD | TBD | TBD |
| +S2 (FA3) | TBD | TBD | TBD | TBD | TBD |
| +S3 (CUDAGraph) | TBD | TBD | TBD | TBD | TBD |
| +S4 (fused ngram) | TBD | TBD | TBD | TBD | TBD |
| +S5 (GPTQ fusion, eval only) | TBD | TBD | TBD | TBD | TBD |
| Phase 2 done | **target ≥5× / ≤660 ms/step / ≥900 steps / val_bpb 1.10-1.18** | | | | |
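
The tracker columns all follow from a single measured ms/step value, so a small helper can fill them in consistently. The sketch below is illustrative only: `tracker_row` and `meets_phase2_target` are hypothetical names, and the baseline of 3300 ms/step is the approximate figure from the table, not a fresh measurement.

```python
# Hypothetical helper for filling in one tracker row from a measured
# ms/step value. The constants come from the table in this file.

WALLCLOCK_S = 600.0        # fixed wallclock budget per run
P1_MS_PER_STEP = 3300.0    # approximate Phase 1 baseline (table says ~3300)
TARGET_MS_PER_STEP = 660.0 # 3300 / 5, i.e. the >=5x speedup target
VAL_BPB_RANGE = (1.10, 1.18)


def tracker_row(ms_per_step: float) -> dict:
    """Derive speedup and steps-in-600s from a measured ms/step."""
    return {
        "ms_per_step": ms_per_step,
        "speedup": P1_MS_PER_STEP / ms_per_step,
        # Whole steps that fit in the budget. Because ~3300 is rounded,
        # the baseline comes out near 181 here while the ledger records 180.
        "steps_in_600s": int(WALLCLOCK_S * 1000.0 / ms_per_step),
    }


def meets_phase2_target(ms_per_step: float, val_bpb: float) -> bool:
    """True when both the speed target and the val_bpb range are met."""
    lo, hi = VAL_BPB_RANGE
    return ms_per_step <= TARGET_MS_PER_STEP and lo <= val_bpb <= hi


row = tracker_row(660.0)
print(row["speedup"], row["steps_in_600s"])
```

At exactly 660 ms/step the helper reports a 5× speedup and 909 whole steps, which is consistent with the three speed targets in the last row being the same requirement stated three ways.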

PHASE2_TROUBLESHOOTING.md

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
# PHASE2_TROUBLESHOOTING.md: append-only log

**Comp**: openai/parameter-golf
**Phase**: 2 (speed work on the locked-in Phase 1 model)
**Hardware**: cheap 3090/4070 Ti pods (NO H100 until final submission run)
**Plan reference**: PHASE2_PLAN.md
**Model invariant**: submission/train.py is LOCKED from Phase 1 (10 patches + PR #1477 base). Phase 2 only changes *how* the math runs, not *what* math runs.

This file is append-only. Each entry: timestamp, what broke, what we did, why, and
whether the fix is "permanent" (in the repo) or "ad-hoc" (lives only on a specific pod).
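
One way to keep entries uniform is to generate them from a small script. The following is a minimal sketch, not an existing tool in this repo: the function names, the `###` entry heading, and the field labels are all assumptions chosen to match the five fields listed above, and the timestamp format mirrors the `20260409T0230Z` style used in PHASE2_RESULTS.md.

```python
from datetime import datetime, timezone

# Hypothetical helpers for writing log entries; the entry layout is an
# assumption, not a format defined anywhere in this repo.

ALLOWED_PERMANENCE = ("PERMANENT", "AD-HOC")


def format_entry(what_broke, what_we_did, why, permanence, now=None):
    """Render one log entry as a markdown block."""
    if permanence not in ALLOWED_PERMANENCE:
        raise ValueError(f"permanence must be one of {ALLOWED_PERMANENCE}")
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%MZ")
    return (
        f"### {stamp} [{permanence}]\n\n"
        f"- **What broke**: {what_broke}\n"
        f"- **What we did**: {what_we_did}\n"
        f"- **Why**: {why}\n"
    )


def append_entry(path, entry):
    """Append an entry; mode 'a' never rewrites earlier entries."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n" + entry)


# Placeholder text only, to show the rendered shape of an entry.
example = format_entry(
    "example symptom",
    "example action",
    "example reason",
    "AD-HOC",
    now=datetime(2026, 4, 9, 2, 30, tzinfo=timezone.utc),
)
print(example)
```

Opening the file in append mode is what enforces the append-only rule in this sketch; nothing here edits or reorders earlier entries.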

## Operating rules (inherited from Phase 1)

1. **Clean Python files only**: all changes land in `submission/train.py`, `submission/run.sh`, or new files under `submission/kernels/`. No patcher hunks.
2. **Every workaround must be checked into the repo**: if you SSH in and `rm`/`mv` files, that's an ad-hoc fix, and you must follow up with a permanent fix in the repo so the next clean pod boot can reproduce it. Mark each entry below as PERMANENT or AD-HOC.
3. **Document the WHY**: not just which command you ran, but what error/symptom led to it.
4. **Never bypass safety**: never use `--no-verify`, never `git push --force`.
5. **val_bpb invariant**: every Phase 2 change must keep val_bpb within ε=0.005 of the Phase 1 baseline. Log any drift in this file.
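
Rule 5 is easy to check mechanically after each run. The sketch below assumes the Phase 1 baseline val_bpb is available as a plain float once the dry run lands; the function names are hypothetical and the numbers in the usage lines are placeholders, not measured results.

```python
# Hypothetical check for the val_bpb invariant (rule 5).

EPSILON = 0.005  # allowed drift from the Phase 1 baseline, per rule 5


def val_bpb_drift(baseline: float, candidate: float) -> float:
    """Signed drift; positive means the candidate is worse (higher bpb)."""
    return candidate - baseline


def within_invariant(baseline: float, candidate: float, eps: float = EPSILON) -> bool:
    """True if the candidate is within eps of the baseline in either direction.

    A large improvement is flagged too, since Phase 2 is only supposed to
    change how the math runs, not what math runs.
    """
    return abs(val_bpb_drift(baseline, candidate)) <= eps


# Placeholder values for illustration only; the real baseline is still TBD.
print(within_invariant(1.150, 1.153))  # small drift
print(within_invariant(1.150, 1.157))  # drift beyond epsilon
```

Treating drift as two-sided is a design choice in this sketch: an unexpectedly better val_bpb after a pure speed change is as much a sign of changed math as a worse one.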

## Phase 1 baseline to preserve (the floor)

- train.py: 731 lines, git HEAD at 3dfc868
- 10 patches all active in run.sh defaults
- Phase 1 dry run val_bpb: **TBD** (waiting on Pod L `55fzwdfhbg9n4u` dry run to land ~2026-04-09 03:30Z)
- Reference speed: 180 steps in 600s wallclock on 1× H100 SXM (eager mode, SDPA fallback, no compile)

---

<!-- Phase 2 entries get appended below this line. -->
