openai · PhamPhuHoa-23 · Mar 31, 2026 · Mar 31, 2026 · Mar 31, 2026 · Mar 31, 2026
diff --git a/NOTES.md b/NOTES.md
@@ -0,0 +1,80 @@
+# Experiment Notes
+
+## Key Competitor PRs (as of 2026-04-08)
+
+| PR | BPB | Vocab | Key Technique |
+|----|-----|-------|--------------|
+| [#1450](https://github.com/openai/parameter-golf/pull/1450) | 1.08480 | SP8192 | TMA Megakernel (+10.5% throughput, fused Triton MLP) |
+| [#1437](https://github.com/openai/parameter-golf/pull/1437) | 1.08091 | SP8192 | N-gram Tilt (`p *= exp(beta * 1[t==bigram_hint]) / Z`) |
+| [#1460](https://github.com/openai/parameter-golf/pull/1460) | 1.08269 | SP8192 | Score-first TTT + Eval-Time Hash Embedding |
+
+All top PRs use **SP8192** (8192 BPE vocab) vs our **SP1024** — this is the biggest gap.
+
+---
+
+## sota_16 Changes (from sota_15)
+
+### Eval-time only (no training change)
+
+**1. N-gram Tilt** (from PR #1437)
+- Bigram count table `bg_counts[vocab, vocab]`, add-1 smoothed
+- At scoring: `lf += beta * one_hot(argmax(bg_counts[prev_tok]))`
+- Table updated **AFTER** scoring each chunk (causal, score-first)
+- `NGRAM_BETA=0.5`, expected gain ~0.010–0.015 BPB
+
+**2. Eval-Time Hash Embedding** (from PR #1460)
+- `nn.Embedding(16384, 512)`, zero-init, created fresh at eval
+- `h = (prev_token * 2039 + curr_token) % 16384`
+- Added as residual to `tok_emb` via `register_forward_hook`
+- Trained in TTT SGD alongside model weights
+- `HASH_EMB_SIZE=16384`, expected gain ~0.0004 BPB
+
+**3. TTT LR fix** (2026-04-08, after comparing PR #1460)
+- LR: `0.001 → 0.005` (5× increase, matched to PR #1460)
+- Added **cosine LR decay** within each chunk's TTT steps
+  - `cos_lr = ttt_lr * 0.5 * (1 + cos(π * step / total_steps))`
+  - Starts at full LR, decays to 0 by end of each chunk
+
+---
+
+## sota_15 Changes (from sota_12)
+
+- **DyT** replaces all 6 `RMSNorm` sites: `forward = tanh(alpha * x)`, `alpha` init=0.5
+- **JEPA** auxiliary loss: `JEPAPredictor(512 → 64 → 512)`, weight=0.1
+  - Predicts `h[t+1]` from `h[t]` with cosine loss + stop-gradient target
+  - Training only, zero parameter overhead at eval
+
+---
+
+## Architecture Baseline (sota_12)
+
+- 11L / 512d / 8H / 4KV GQA
+- XSA all layers
+- Full Hessian GPTQ int6
+- Legal score-first TTT
+- MTP (2 heads, weight 0.1)
+- Depth recurrence (L2,3,4,5, starts step 1500)
+- Parallel residuals (L5+)
+- Trigram + VE (L8,9,10)
+- Warmdown 5500 iters
+
+---
+
+## TTT Tips
+
+- **LR**: 0.005 works better than 0.001 (PR #1460 uses 0.005)
+- **Cosine decay** within chunk: start full LR → 0 over all steps in chunk
+- **Momentum**: 0.9 SGD
+- **Epochs**: 3 per chunk
+- **Chunk size**: 32768 tokens
+- **Score-first**: always `inference_mode` score before any `backward`
+
+---
+
+## Todo / Ideas
+
+- [ ] SP8192 tokenizer + dataset (biggest unlock, ~0.01-0.02 BPB)
+- [ ] TMA Megakernel (Triton, H100 TMA, +10.5% steps = ~700 extra iters)
+- [ ] Tune `NGRAM_BETA` in {0.3, 0.5, 0.8, 1.0} if sota_16 underperforms
+- [ ] Try trigram tilt (not just bigram)
+- [ ] Larger hash embedding size (32768, 65536)
diff --git a/..._record_16mb/2026-04-08_XSA11_ParallelResidual_DepthRecurrence_1xH100/README.md b/..._record_16mb/2026-04-08_XSA11_ParallelResidual_DepthRecurrence_1xH100/README.md
@@ -0,0 +1,92 @@
+# Non-Record: XSA-11 + Parallel Residual (L7+) + Depth Recurrence — val_bpb 1.1056 (1-seed, 1×H100)
+
+**Track:** 10-minute / 16MB  
+**Hardware:** 1×H100 80GB SXM  
+**Seeds:** 42 (1 seed — non-record)  
+**Submission size:** 15,652,295 bytes (~15.65 MB)  
+**TTT:** disabled
+
+---
+
+## Results
+
+| Seed | Steps | val_bpb (roundtrip) | val_bpb (sliding, stride 64) | Size (bytes) |
+|------|-------|---------------------|------------------------------|--------------|
+| 42   | 6,927 | 1.12955             | **1.10562**                  | 15,652,295   |
+
+---
+
+## Architecture
+
+| Component | Config | Source |
+|-----------|--------|--------|
+| Layers | 11 (512d, 8 GQA / 4 KV heads) | Baseline |
+| MLP | 3× (1536), LeakyReLU(0.5)² | PR #493 |
+| XSA | All 11 layers (`xsa_last_n=11`) | PR #478 |
+| BigramHash | 3072 × 112 | PR #162 |
+| RoPE | Partial (16/64 dims) | PR #315 |
+| LN Scale | 1/√(layer+1) | PR #315 |
+| VE128 | Layers 9, 10 | PR #374 |
+| SmearGate | Position-mixing gate | PR #65 |
+| Parallel Residual | Layers 7+ | PR #289 |
+| Depth Recurrence | Layers 4, 5 (activated at step 3000) | PR #363 |
+| Weight avg | EMA(0.997) + SWA(every 50) | PR #401 |
+| Quantization | Full Hessian GPTQ int6 (128 AR self-gen seqs × 2048 tokens) | PR #535 |
+| Compression | Brotli-11 | — |
+| Warmdown | 3500 iterations | — |
+| Optimizer | Parallel Muon | PR #399 |
+| Late QAT | STE at LR scale < 0.15 (step 2000) | PR #286 |
+| Flash Attention | Enabled | PR #122 |
-| Compression | Brotli-11 | — |
-| Warmdown | 3500 iterations | — |
-| Optimizer | Parallel Muon | PR #399 |
-| Late QAT | STE at LR scale < 0.15 (step 2000) | PR #286 |
-| Flash Attention | Enabled | PR #122 |
+| Compression | lzma | `train_gpt.py` |
+| Warmdown | 3500 iterations | — |
+| Optimizer | Parallel Muon | PR #399 |
+| Late QAT | STE at LR scale < 0.15 (step 2000) | PR #286 |
+| Attention backend | PyTorch SDPA | `train_gpt.py` |
-| Compression | Brotli-11 | — |
-| Warmdown | 3500 iterations | — |
-| Optimizer | Parallel Muon | PR #399 |
-| Late QAT | STE at LR scale < 0.15 (step 2000) | PR #286 |
-| Flash Attention | Enabled | PR #122 |
+| Compression | lzma | `train_gpt.py` |
+| Warmdown | 3500 iterations | — |
+| Optimizer | Parallel Muon | PR #399 |
+| Late QAT | STE at LR scale < 0.15 (step 2000) | PR #286 |
+| Attention backend | PyTorch SDPA | `train_gpt.py` |
+
+---
+
+## Training Dynamics
+
+| Step | val_bpb | Note |
+|------|---------|------|
+| 0 | 4.1048 | Init |
+| 4000 | 1.2040 | Mid-training checkpoint |
+| 6927 | 1.1266 | End of training |
+| post-EMA | 1.1257 | EMA selected over SWA (14 snapshots) |
+| int6 roundtrip | 1.1295 | After Full Hessian GPTQ |
+| **int6 sliding (stride 64)** | **1.1056** | **Final reported BPB** |
+
+Peak GPU memory: 29,726 MiB allocated / 29,994 MiB reserved.  
+Training time: ~6,186s (~1.72h). Step avg: ~893ms/step.  
+GPTQ calibration: 128 AR self-generated sequences × 2048 tokens, temp=0.8, generated in 478s.  
+Selective ±1 pruning: not needed (model fits at 14.93MB < 15.9MB target).
+
+---
+
+## Run Command
+
+```bash
+SEED=42 \
+DATA_PATH=/kaggle/input/datasets/haphmph/parameter-golf/data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=/kaggle/input/datasets/haphmph/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
+ITERATIONS=6927 \
+TARGET_MB=15.9 \
+QK_GAIN_INIT=4.0 \
+BIGRAM_DIM=112 \
+PARALLEL_RESIDUAL=1 \
+PARALLEL_START_LAYER=7 \
+RECUR_LAYERS=4,5 \
+RECUR_START_STEP=3000 \
+WARMDOWN_ITERS=3500 \
+GPTQ_AR_SEQS=128 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py
+```
+
+---
+
+## Notes
+
+This is a 1-seed non-record submission documenting the baseline performance of the XSA-11 + Parallel Residual + Depth Recurrence stack on a **single H100 80GB GPU**. Most leaderboard submissions use 8×H100 or similar multi-GPU setups; this run establishes what the same architecture achieves on accessible hardware in ~1.72 hours of wall-clock time.
+
+Key observations:
+- Depth recurrence (layers 4,5) activates at step 3000, causing a noticeable step-time increase (~810ms → ~893ms) but improves final BPB.
+- EMA(0.997) was selected over SWA (14 snapshots), `val_loss 1.9007 < 1.9024`.
+- Full Hessian GPTQ with AR self-gen calibration adds only +0.0023 BPB gap (roundtrip vs pre-quant), consistent with PR #1019 findings.
+- The submission fits inside 16MB without any selective pruning needed.
+
+🤖 Generated with [Claude Sonnet 4.5](https://claude.ai)
diff --git a/..._non_record_16mb/2026-04-08_XSA11_ParallelResidual_DepthRecurrence_1xH100/submission.json b/..._non_record_16mb/2026-04-08_XSA11_ParallelResidual_DepthRecurrence_1xH100/submission.json
@@ -0,0 +1,15 @@
+{
+  "track": "non_record_16mb",
+  "date": "2026-04-08",
+  "name": "XSA-11 + Parallel Residual (L7+) + Depth Recurrence (layers 4,5) — 1×H100",
+  "author": "angela231005",
+  "github_id": "angela231005",
+  "seeds": [42],
+  "val_bpb_sliding_window": 1.10562,
+  "val_bpb_roundtrip": 1.12955,
+  "val_loss": 1.9072,
+  "bytes_total": 15652295,
+  "hardware": "1×H100 80GB",
+  "steps": 6927,
+  "ttt_enabled": false
+}