
Commit 5969a30

sunnypatneedi and claude committed
Add AdamW TTT (PR openai#481 recipe) to submission script
Upgrades TTT from PR openai#549's weak 3ep SGD (-0.0025 bpb) to PR openai#481's proven AdamW 30ep cosine + per-layer LR recipe (expected -0.01 to -0.025).

Changes:
- train_gpt.py: Added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: Added TTT env vars for 3-seed validation
- finalize_submission.py: Extracts pre/post TTT metrics from logs
- README.md + submission.json: Updated for TTT-enabled submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 53d1c27 commit 5969a30

5 files changed: 1954 additions & 32 deletions

README.md: 65 additions & 25 deletions
# LeakyReLU(0.5)^2 + AdamW TTT (30ep cosine + per-layer LR) + XSA + Int6

**val_bpb: FILL_BPB** (3-seed mean) | **FILL_MB MB** artifact | 8xH100 SXM, 600s train + ~585s eval

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB (s64) | Artifact |
|------|-------|-------------|---------------------|----------|
| 42 | FILL | FILL | FILL | FILL |
| 1337 | FILL | FILL | FILL | FILL |
| 2024 | FILL | FILL | FILL | FILL |

**Mean: FILL | Std: FILL**

## Key Innovation: AdamW TTT with cosine + per-layer LR on SOTA base

The merged SOTA (PR #549, 1.1194) uses a weak 3-epoch SGD TTT that gives only -0.0025 bpb. We replace it with PR #481's proven AdamW recipe:

1. **AdamW optimizer** (weight_decay=0) instead of SGD with momentum
2. **30 epochs** with **cosine LR decay** instead of 3 epochs flat
3. **Per-layer LR groups**: MLP output projections get 3x base LR (they are the most quant-damaged), MLP input projections get 0.5x, everything else 1x
4. **All blocks unfrozen** (freeze_blocks=0)

PR #481 demonstrated this recipe gives -0.066 bpb on their base (1.1577 -> 1.0970). On the stronger PR #549 base (~1.12 pre-TTT), we expect -0.010 to -0.025 bpb.
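The per-layer grouping in (3) can be sketched as follows. This is a sketch, not the actual train_gpt.py code: it assumes parameter names contain `mlp.proj` / `mlp.fc`, and `build_ttt_param_groups` is a hypothetical helper name.

```python
def build_ttt_param_groups(named_params, base_lr=5e-4):
    """Split (name, param) pairs into per-layer LR groups:
    MLP output projections at 3x base LR (most quant-damaged),
    MLP input projections at 0.5x, everything else at 1x."""
    proj, fc, other = [], [], []
    for name, p in named_params:
        if not getattr(p, "requires_grad", True):
            continue  # skip frozen parameters
        if "mlp.proj" in name:
            proj.append(p)
        elif "mlp.fc" in name:
            fc.append(p)
        else:
            other.append(p)
    return [
        {"params": proj, "lr": 3.0 * base_lr},
        {"params": fc, "lr": 0.5 * base_lr},
        {"params": other, "lr": base_lr},
    ]
```

The returned groups would then be passed to `torch.optim.AdamW(groups, weight_decay=0.0)` in place of `model.parameters()`.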

## Architecture (from PR #549 SOTA)

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 4KV GQA) |
| MLP | 3x expansion, **LeakyReLU(0.5)^2** |
| BigramHash | 2048 |
| XSA | Last 4 layers |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/sqrt(layer+1) |
| VE128 | Layers 9-10 |
| Weight avg | EMA(0.997) + SWA(every 50) |
| Quantization | GPTQ-lite int6 + zstd-22 |
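For context on the quantization row: GPTQ-lite compensates quantization error column by column, but the int6 range arithmetic alone can be illustrated with a minimal symmetric per-tensor round-trip (an illustrative sketch with hypothetical names, not the GPTQ-lite code):

```python
def quantize_int6(weights):
    """Map floats to symmetric int6 codes in [-31, 31] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0  # avoid a zero scale
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int6(codes, scale):
    """Reconstruct approximate floats; per-weight error is at most scale/2."""
    return [c * scale for c in codes]
```

In the real pipeline the 6-bit codes would be bit-packed and then compressed with zstd level 22 before counting toward the 16 MB artifact budget.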

## TTT Configuration

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW (weight_decay=0) |
| Base LR | 0.0005 |
| Per-layer LR | mlp.proj: 3x, mlp.fc: 0.5x, other: 1x |
| Epochs | 30 |
| Schedule | Cosine decay |
| Freeze blocks | 0 (all unfrozen) |
| Batch seqs | 64 per GPU (512 total) |
| Max steps/epoch | 300 |
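The configuration above maps onto a short adaptation loop. The following is a sketch of what the `ttt_adapt` phase might look like; the function signature, batch format, and loss are assumptions, and the real code additionally distributes work across 8 GPUs and passes per-layer LR groups to AdamW instead of `model.parameters()`.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def ttt_adapt(model, batches, base_lr=5e-4, epochs=30, max_steps_per_epoch=300):
    """Test-time training: AdamW (weight_decay=0) with per-step cosine LR decay."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.0)
    total_steps = epochs * min(len(batches), max_steps_per_epoch)
    sched = CosineAnnealingLR(opt, T_max=total_steps)  # LR decays to ~0 at the end
    model.train()
    for _ in range(epochs):
        for step, (x, y) in enumerate(batches):
            if step >= max_steps_per_epoch:
                break  # cap at 300 steps per epoch
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
            sched.step()
    return model
```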

## Timing Budget

| Phase | Time |
|-------|------|
| Training | 600s (10 min) |
| Int6 roundtrip eval (diagnostic) | ~20s |
| AdamW TTT (30 epochs) | ~465s |
| Sliding window eval (stride=64) | ~120s |
| **Total eval (TTT + sliding)** | **~585s (within the 600s budget)** |

## Run Command

```bash
cd /workspace/parameter-golf
SEED=42 XSA_LAST_N=4 TTT_ENABLED=1 TTT_LR=0.0005 TTT_EPOCHS=30 \
TTT_COSINE=1 TTT_PERLAYER=1 TTT_FREEZE_BLOCKS=0 TTT_BATCH_SEQS=64 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-24_sunnypatneedi_submission/train_gpt.py
```

## Provenance

Built on PR #549 (abaybektursun, merged SOTA 1.1194), with the TTT recipe from PR #481 (mrdavtan, 1.0970):

- PR #549 / PR #414 (signalrush) - base architecture, int6 GPTQ-lite, EMA/SWA, LeakyReLU
- PR #481 (mrdavtan) - AdamW TTT with cosine decay and per-layer LR
- PR #198 / PR #503 (jfprincz) - XSA (exclusive self-attention)
- PR #287 (jfprincz) - Partial RoPE + LN Scale

## Test Plan

- [ ] 3 seeds run on 8xH100 SXM
- [ ] All 3 seeds train in <=600s
- [ ] All 3 seeds total eval (TTT + sliding) in <=600s
- [ ] All 3 seeds artifact <=16,000,000 bytes
- [ ] Post-TTT sliding BPB beats 1.1194 by >=0.005 bpb
- [ ] Statistical significance p<0.01 across 3 seeds

finalize_submission.py: 140 additions & 0 deletions

#!/usr/bin/env python3
"""
Post-run script: reads 3 seed logs, fills in README.md and submission.json.
Run locally after scp-ing logs from RunPod.

Usage:
    python3 finalize_submission.py [submission_dir]
    # defaults to the directory containing this script
"""
import json
import re
import sys
from pathlib import Path


def extract_metrics(log_path: str) -> dict:
    """Extract key metrics from a training log."""
    text = Path(log_path).read_text()
    metrics = {}

    # Pre-TTT BPB (int6 roundtrip before TTT)
    m = re.findall(r"final_int6_roundtrip_exact.*?val_bpb:([\d.]+)", text)
    if m:
        metrics["pre_ttt_bpb"] = float(m[-1])

    # BPB from sliding window eval (the submission score, post-TTT)
    m = re.findall(r"final_int6_sliding_window_exact.*?val_bpb:([\d.]+)", text)
    if m:
        metrics["bpb"] = float(m[-1])

    # Artifact size
    m = re.findall(r"Total submission size.*?(\d+)\s*bytes", text)
    if m:
        metrics["artifact"] = int(m[-1])

    # Steps
    m = re.findall(r"stopping_early.*?step[: ]*(\d+)", text)
    if not m:
        m = re.findall(r"step[: ]*(\d+)", text)
    if m:
        metrics["steps"] = int(m[-1])

    return metrics


def main():
    sub_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(__file__).parent
    seeds = [42, 1337, 2024]
    results = {}

    print("Extracting metrics from logs...")
    for seed in seeds:
        log = sub_dir / f"train_seed{seed}.log"
        if not log.exists():
            print(f"  WARNING: {log} not found")
            continue
        m = extract_metrics(str(log))
        results[seed] = m
        print(f"  Seed {seed}: bpb={m.get('bpb', '?')}, artifact={m.get('artifact', '?')}, steps={m.get('steps', '?')}")

    if len(results) < 3:
        print(f"\nERROR: Only found {len(results)}/3 seed logs. Cannot finalize.")
        sys.exit(1)

    bpbs = [results[s]["bpb"] for s in seeds]
    mean_bpb = sum(bpbs) / len(bpbs)
    std_bpb = (sum((x - mean_bpb) ** 2 for x in bpbs) / len(bpbs)) ** 0.5
    max_artifact = max(results[s]["artifact"] for s in seeds)
    mean_artifact_mb = sum(results[s]["artifact"] for s in seeds) / 3 / 1_000_000

    print(f"\n  Mean BPB: {mean_bpb:.4f} (std {std_bpb:.4f})")
    print(f"  Max artifact: {max_artifact} bytes ({max_artifact/1_000_000:.2f} MB)")

    # Validation checks
    sota = 1.1194
    delta = mean_bpb - sota
    print(f"\n  vs SOTA ({sota}): {delta:+.4f} bpb")
    if delta < -0.005:
        print(f"  PASS: Beats SOTA by {abs(delta):.4f} bpb")
    elif delta < 0:
        print(f"  CLOSE: Improves by {abs(delta):.4f} bpb but < 0.005 threshold")
        print("  Consider submitting as non-record if techniques are novel.")
    else:
        print("  DOES NOT BEAT SOTA. Consider as non-record submission.")

    if max_artifact > 16_000_000:
        print(f"  FAIL: Artifact exceeds 16MB ({max_artifact} bytes)")
    else:
        print("  PASS: All artifacts under 16MB")

    # Update submission.json
    json_path = sub_dir / "submission.json"
    sj = json.loads(json_path.read_text())
    sj["val_bpb"] = round(mean_bpb, 4)
    sj["bytes_total"] = max_artifact
    sj["blurb"] = (
        f"LeakyReLU(0.5)^2 activation + AdamW TTT (30ep cosine + per-layer LR) "
        f"+ XSA on last 4 layers + Partial RoPE + LN Scale "
        f"+ VE128 + EMA/SWA + GPTQ-lite int6 + zstd-22. "
        f"Built on PR #549 stack. 3-seed mean: {mean_bpb:.4f} (std {std_bpb:.4f}). "
        f"All artifacts under 16MB."
    )
    json_path.write_text(json.dumps(sj, indent=2) + "\n")
    print(f"\n  Updated {json_path}")

    # Update README.md
    readme_path = sub_dir / "README.md"
    readme = readme_path.read_text()

    # Fill header
    readme = readme.replace("FILL_BPB", f"{mean_bpb:.4f}")
    readme = readme.replace("FILL_MB", f"{mean_artifact_mb:.2f}")

    # Fill results table (| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | Artifact |)
    for seed in seeds:
        r = results[seed]
        pre_ttt = f"{r['pre_ttt_bpb']:.4f}" if "pre_ttt_bpb" in r else "?"
        old_line = f"| {seed} | FILL | FILL | FILL | FILL |"
        new_line = (
            f"| {seed} | {r.get('steps', '?')} | {pre_ttt} | {r['bpb']:.4f} "
            f"| {r['artifact']/1_000_000:.2f} MB |"
        )
        readme = readme.replace(old_line, new_line)

    # Fill mean/std
    readme = readme.replace(
        "**Mean: FILL | Std: FILL**",
        f"**Mean: {mean_bpb:.4f} | Std: {std_bpb:.4f}**",
    )

126+
readme_path.write_text(readme)
127+
print(f" Updated {readme_path}")
128+
129+
print(f"\n{'='*50}")
130+
print("SUBMISSION READY. Next steps:")
131+
print(f" 1. Review README.md and submission.json")
132+
print(f" 2. git checkout -b submission/sunnypatneedi-leakyrelu-xsa")
133+
print(f" 3. git add {sub_dir.relative_to(sub_dir.parent.parent.parent)}/")
134+
print(f" 4. git commit -m 'Add submission: LeakyReLU + XSA'")
135+
print(f" 5. git push origin submission/sunnypatneedi-leakyrelu-xsa")
136+
print(f" 6. Open PR at: https://github.com/openai/parameter-golf/compare")
137+
138+
139+
if __name__ == "__main__":
140+
main()
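The three extraction regexes in the script can be sanity-checked against a synthetic log. The log lines below are invented examples shaped to match the patterns; the real train_gpt.py output may format these lines differently:

```python
import re

log = (
    "step:4200 final_int6_roundtrip_exact val_bpb:1.1210\n"
    "step:4200 final_int6_sliding_window_exact val_bpb:1.1071\n"
    "Total submission size: 15872345 bytes\n"
)

# Same patterns as extract_metrics(): last match wins for each metric.
pre = re.findall(r"final_int6_roundtrip_exact.*?val_bpb:([\d.]+)", log)
post = re.findall(r"final_int6_sliding_window_exact.*?val_bpb:([\d.]+)", log)
size = re.findall(r"Total submission size.*?(\d+)\s*bytes", log)

print(float(pre[-1]), float(post[-1]), int(size[-1]))
```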
