| 1 | +# PHASE2_RESULTS.md — append-only speedup + val_bpb ledger |
| 2 | + |
| 3 | +**Comp**: openai/parameter-golf |
| 4 | +**Phase**: 2 (speed work) |
| 5 | +**Plan**: PHASE2_PLAN.md |
| 6 | +**Model invariant**: Phase 1 locked-in stack (train.py at 731 lines, 10 patches, git HEAD 3dfc868) |
| 7 | + |
| 8 | +Each row: shot id, hardware, wallclock, steps achieved, ms/step, tok/s, val_bpb, artifact_bytes, speedup vs Phase 1 baseline, status, UTC timestamp. |
| 9 | + |
| 10 | +| shot | hardware | wallclock | steps | ms/step | tok/s | val_bpb | artifact_bytes | speedup | status | utc | |
| 11 | +|---|---|---|---|---|---|---|---|---|---|---| |
| 12 | +| (P1 baseline) | 1×H100 SXM 80GB | 600s | 180 | ~3300 | ~280K | TBD | TBD | 1.0× | Phase 1 dry run | 20260409T0230Z approx | |
| 13 | + |
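Since the ledger is append-only, deriving `ms/step` and `speedup` from the raw wallclock and step count keeps rows consistent with each other. The sketch below is one way to do that; `format_ledger_row` and `P1_MS_PER_STEP` are hypothetical names, not part of train.py, the baseline is the rounded ~3300 ms/step from the P1 row, and the example values are placeholders rather than measured results.

```python
from datetime import datetime, timezone

# Rounded Phase 1 baseline from the ledger (~3300 ms/step on 1×H100).
P1_MS_PER_STEP = 3300.0


def format_ledger_row(shot, hardware, wallclock_s, steps,
                      tok_s="TBD", val_bpb="TBD", artifact_bytes="TBD",
                      status="", utc=None):
    """Return one markdown table row in the ledger's column order."""
    ms_per_step = wallclock_s * 1000.0 / steps
    speedup = P1_MS_PER_STEP / ms_per_step
    if utc is None:
        # Same compact UTC stamp style as the existing baseline row.
        utc = datetime.now(timezone.utc).strftime("%Y%m%dT%H%MZ")
    cells = [
        shot, hardware, f"{wallclock_s:g}s", str(steps), f"{ms_per_step:.0f}",
        str(tok_s), str(val_bpb), str(artifact_bytes),
        f"{speedup:.2f}×", status, utc,
    ]
    return "| " + " | ".join(cells) + " |"


# Placeholder inputs for illustration only (not a real shot).
row = format_ledger_row("(example)", "1×H100 SXM 80GB", 600, 1000,
                        status="example only", utc="20260409T0300Z")
print(row)
```

Fields that are not yet measured default to `TBD`, matching how the existing rows are filled in.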
| 14 | +--- |
| 15 | + |
| 16 | +## Phase 1 baseline context |
| 17 | + |
| 18 | +Phase 1 hit 180 steps in 600s because: |
| 19 | +- `torch.compile` disabled (~3-5× penalty) |
| 20 | +- FA3 not installed, SDPA fallback (~30-50% penalty) |
| 21 | +- N-gram bias forward overhead (~5-10%) |
| 22 | +- 3-layer recurrence adds 13% more layers |
| 23 | +- Small model on a big GPU — kernel launch overhead dominates |
| 24 | + |
| 25 | +**Per-GPU rate**: 0.30 steps/sec (180 steps / 600s), vs the comp records' 4.17 steps/sec/GPU, i.e. ~14× slower. |
| 26 | + |
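As a rough sanity check on the list above, the per-GPU rate and the compounded penalty estimates can be computed directly. This is a sketch under loud assumptions: the penalties are treated as independent multiplicative slowdowns, "30-50%" is read as 1.3-1.5×, and "13% more layers" is read as 1.13× step time. None of that is confirmed by measurement.

```python
# Phase 1 dry run: 180 steps in a 600 s budget on one GPU.
p1_rate = 180 / 600                  # steps/sec
comp_rate = 4.17                     # steps/sec/GPU, from the comp records
gap = comp_rate / p1_rate

# Listed penalties as (low, high) slowdown factors. Reading "30-50%" as
# 1.3-1.5x and "13% more layers" as 1.13x step time is an assumption.
penalties = {
    "torch.compile disabled": (3.0, 5.0),
    "SDPA fallback (no FA3)": (1.30, 1.50),
    "n-gram bias forward": (1.05, 1.10),
    "3-layer recurrence": (1.13, 1.13),
}

low = high = 1.0
for lo_f, hi_f in penalties.values():
    low *= lo_f
    high *= hi_f

print(f"P1 rate: {p1_rate:.2f} steps/sec, gap vs comp: {gap:.1f}x")
print(f"itemized penalties compound to {low:.1f}x-{high:.1f}x")
print(f"residual not covered by the itemized list: {gap / high:.1f}x-{gap / low:.1f}x")
```

Under these assumptions the itemized penalties cover roughly 4.6-9.3× of the gap, which would leave about 1.5-3× for effects without a stated estimate, such as kernel launch overhead.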
| 27 | +## Comp anchors (the target) |
| 28 | + |
| 29 | +| PR | stack | val_bpb | hardware | |
| 30 | +|---|---|---|---| |
| 31 | +| #1485 | 1477 + 3L recurrence + Pre-Quant AdamW TTT + EMA 0.9965 + QK5 | **1.0679** | 8×H100 SXM | |
| 32 | +| #1477 | SP8192 + Parallel Residuals + Score-First TTT | 1.0822 | 8×H100 SXM | |
| 33 | +| #1482 | SP8192 + Pre-Quant TTT QK 5.25 8ep freeze-1 | 1.0787 | 8×H100 SXM | |
| 34 | + |
| 35 | +**Phase 2 target on 1×H100 SXM**: val_bpb in the **1.10-1.18 range** (within roughly 0.10 of the comp records). We won't match the 8×H100 results with 1/8 the raw compute, but once the code path is optimized, most of the remaining gap should be attributable to the 8× vs 1× hardware ratio alone. |
| 36 | + |
| 37 | +--- |
| 38 | + |
| 39 | +## Shot-by-shot results |
| 40 | + |
| 41 | +### Shot 1 — torch.compile re-enable |
| 42 | +<!-- fill in when run --> |
| 43 | + |
| 44 | +### Shot 2 — FA3 sourcing |
| 45 | +<!-- fill in when run --> |
| 46 | + |
| 47 | +### Shot 3 — Persistent CUDAGraph capture |
| 48 | +<!-- fill in when run --> |
| 49 | + |
| 50 | +### Shot 4 — Fused n-gram bias Triton kernel |
| 51 | +<!-- fill in when run --> |
| 52 | + |
| 53 | +### Shot 5 — GPTQ int6 dequant + matmul fusion |
| 54 | +<!-- fill in when run --> |
| 55 | + |
| 56 | +### Shot 6 — Custom SDPA replacement |
| 57 | +<!-- fill in when run (probably skipped if FA3 lands in Shot 2) --> |
| 58 | + |
| 59 | +### Shot 7 — Int8 tabulation hash GPU gather |
| 60 | +<!-- fill in when run (probably skipped) --> |
| 61 | + |
| 62 | +### Shot 8 — FP8 compute paths |
| 63 | +<!-- fill in when run (probably skipped) --> |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +## Cumulative speedup tracker |
| 68 | + |
| 69 | +| after shot | ms/step | vs P1 baseline | steps in 600s | val_bpb | Δ val_bpb vs P1 | |
| 70 | +|---|---|---|---|---|---| |
| 71 | +| P1 (baseline) | ~3300 | 1.0× | 180 | TBD | — | |
| 72 | +| +S1 (compile) | TBD | TBD | TBD | TBD | TBD | |
| 73 | +| +S2 (FA3) | TBD | TBD | TBD | TBD | TBD | |
| 74 | +| +S3 (CUDAGraph) | TBD | TBD | TBD | TBD | TBD | |
| 75 | +| +S4 (fused ngram) | TBD | TBD | TBD | TBD | TBD | |
| 76 | +| +S5 (GPTQ fusion, eval only) | TBD | TBD | TBD | TBD | TBD | |
| 77 | +| Phase 2 done (target) | **≤660** | **≥5×** | **≥900** | **1.10-1.18** | TBD | |
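The Phase 2 targets are tied together by the 600 s budget and the rounded ~3300 ms/step baseline. A minimal sketch of that arithmetic follows; the function and constant names are hypothetical and exist only for illustration.

```python
import math

P1_MS_PER_STEP = 3300.0   # rounded Phase 1 baseline from the tracker
BUDGET_S = 600.0          # wallclock budget per run


def target_ms_per_step(speedup: float) -> float:
    """ms/step needed to reach a given speedup over the P1 baseline."""
    return P1_MS_PER_STEP / speedup


def steps_in_budget(ms_per_step: float) -> int:
    """Whole steps that fit in the wallclock budget at a given ms/step."""
    return math.floor(BUDGET_S * 1000.0 / ms_per_step)


ms = target_ms_per_step(5.0)
print(f"5x speedup -> {ms:.0f} ms/step -> {steps_in_budget(ms)} steps in 600 s")
print(f"900 steps in 600 s needs <= {BUDGET_S * 1000.0 / 900:.1f} ms/step")
```

A 5× speedup gives 660 ms/step and 909 steps, so the ≥900 steps target (which only needs about 666.7 ms/step) is slightly looser than the ≤660 ms/step target.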