Merged (47 commits)

- `926a242` Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bp… (dexhunter, Mar 31, 2026)
- `6772e6a` fix: clarify PR #1019 attribution (not our merged PR) (dexhunter, Mar 31, 2026)
- `1a2ba5d` Copy PR 1179 record trainer to repo root (msisovic, Mar 31, 2026)
- `d95a637` Decompress PR 1179 root trainer wrapper (msisovic, Mar 31, 2026)
- `23e26fb` Port mixed quant to PR 1179 root trainer (msisovic, Mar 31, 2026)
- `2571aa6` Port depth recurrence to PR 1179 root trainer (msisovic, Mar 31, 2026)
- `1486b6a` Log full untie depth recurrence result (msisovic, Mar 31, 2026)
- `101fe21` Prioritize shared params in GPTQ (msisovic, Mar 31, 2026)
- `d9a7d19` Somewhat working (msisovic, Mar 31, 2026)
- `2a2fe32` Log partitioned residual result (msisovic, Mar 31, 2026)
- `60857e5` Train wallclock log (msisovic, Mar 31, 2026)
- `0ebead6` logging fix (msisovic, Mar 31, 2026)
- `e5a01dc` First run in (msisovic, Mar 31, 2026)
- `78cf56a` Parallel Residuals readme entry (msisovic, Mar 31, 2026)
- `ad65f02` Update submission README and add seed logs (msisovic, Apr 1, 2026)
- `b7c4931` Update submission reproducibility notes (msisovic, Apr 1, 2026)
- `18e14d3` Add submission metadata for ParallelResiduals run (msisovic, Apr 1, 2026)
- `7441f45` Clean root for submission branch (msisovic, Apr 1, 2026)
- `ae2f7b7` Restore root files for submission (msisovic, Apr 1, 2026)
- `4421a8d` Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_b… (clarkkev, Apr 1, 2026)
- `e01b7e1` Add train_gpt.py to submission (msisovic, Apr 2, 2026)
- `e1cf90a` Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 — val_bpb 1… (dexhunter, Apr 3, 2026)
- `be8c14d` Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R — v… (aryanbhosale, Apr 4, 2026)
- `f4fa11f` Record: SP8192 + GPTQ Embeddings + SDClip + Loop45x2 — val_bpb 1.0856… (clarkkev, Apr 5, 2026)
- `3f1e814` Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.… (Robby955, Apr 6, 2026)
- `4b57791` Fix LaTeX rendering (Robby955, Apr 6, 2026)
- `cd04a8b` Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 … (dexhunter, Apr 6, 2026)
- `5470a39` Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.082… (aryanbhosale, Apr 8, 2026)
- `857de47` Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5. (Apr 9, 2026)
- `c7d01f4` Merge pull request #1179 from dexhunter/submission/splitlr-dim160-gpt… (cocohearts, Apr 9, 2026)
- `d96169d` Merge pull request #1218 from clarkkev/submission/vocab4096-mlpmult4-… (cocohearts, Apr 9, 2026)
- `444b4f7` Merge pull request #1204 from msisovic/hyperconnections_submission (cocohearts, Apr 9, 2026)
- `ee38f46` Merge pull request #1285 from dexhunter/muoneqr-recurrence-wd090-allint6 (cocohearts, Apr 9, 2026)
- `ed829c0` Merge pull request #1334 from aryanbhosale/submission/sp4096-no-slot-v4 (cocohearts, Apr 9, 2026)
- `8d62bdd` Merge pull request #1394 from clarkkev/submission/sp8192-gptq-emb-sdc… (cocohearts, Apr 9, 2026)
- `6f92b13` Merge pull request #1413 from dexhunter/record/sp8192-qk5-legal-ttt-1… (cocohearts, Apr 9, 2026)
- `905ef58` Merge pull request #1477 from aryanbhosale/submission/sp8192-parallel… (cocohearts, Apr 9, 2026)
- `bac888c` Merge pull request #1493 from bigbag/submission/sp8192-ttt-clean (cocohearts, Apr 9, 2026)
- `c714a4d` Merge pull request #1412 from Robby955/submission/parallel-residuals-… (cocohearts, Apr 9, 2026)
- `81b6bd7` Update README leaderboard for April records (cocohearts, Apr 9, 2026)
- `75700cb` Merge pull request #1511 from openai/codex/update-april-leaderboard-r… (cocohearts, Apr 9, 2026)
- `2bd7916` Record: 11L Muon TTT + Entropy-Adaptive Epochs (8xH100) — val_bpb 1.1… (aamodbhatt, Apr 23, 2026)
- `f56ef88` Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-s… (dexhunter, Apr 23, 2026)
- `8b148a0` Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed m… (icryo, Apr 23, 2026)
- `7427de2` Update README leaderboard with recent record submissions + revert bad… (cocohearts, Apr 26, 2026)
- `51f0301` Draft SOTA implementation: MLA + 3.5x MLP + Int6 QAT for tracking (GodlyDonuts, Apr 1, 2026)
- `cd67ad0` Add Opus working folder: leaderboard push plan and SOTA decode (GodlyDonuts, Apr 27, 2026)

30 changes: 30 additions & 0 deletions EXPERIMENTAL_LOG.md
# Antigravity Deep-Golf Experimental Log

This document tracks technical insights, failure modes, and learning encountered during the 16MB Parameter Golf challenge.

## Infrastructure & Environment Insights

### [2026-04-01] FlashAttention GQA Broadcast Error
- **Problem**: SDPA fallback crashed with `torch.compile` during the forward pass.
- **Insight**: `F.scaled_dot_product_attention` (fallback) failed because the Query heads (H) and KV heads (Hkv) were not broadcastable in the transposed [B, H, T, D] format used by SDPA.
- **Fix**: Added explicit `torch.repeat_interleave(dim=2)` for KV heads when `H != Hkv` inside the `flash_attn_3_func` fallback (sketched below).
- **Impact**: Training now successfully enters the `torch.compile` and iteration loops on standard PyTorch images.
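
A minimal sketch of the shape of that fix (tensor layout and helper name are illustrative, not the record code):

```python
import torch
import torch.nn.functional as F

def sdpa_gqa_fallback(q, k, v, causal=True):
    # Assumed layout: q is [B, T, H, D]; k and v are [B, T, Hkv, D], with H % Hkv == 0.
    H, Hkv = q.shape[2], k.shape[2]
    if H != Hkv:
        # SDPA will not broadcast mismatched head counts, so materialize
        # the repeated KV heads before transposing.
        k = torch.repeat_interleave(k, H // Hkv, dim=2)
        v = torch.repeat_interleave(v, H // Hkv, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> [B, H, T, D] for SDPA
    y = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return y.transpose(1, 2)  # back to [B, T, H, D], matching flash_attn_3_func
```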

### [2026-04-01] GPU Utilization Diagnostic
- **Observation**: 0% GPU utilization with 4% CPU utilization.
- **Cause**: Repeated initialization crashes (FA3 import, kv_bank attribute, SDPA GQA mismatch).
- **Status**: **RESOLVED**. The model now loads into VRAM (4.6GB/80GB) on all 8 H100s. Currently waiting on `torch.compile` (~60–90 s) before the first step logs appear.

## Architectural Findings (Antigravity Deep-Golf)

### MLA (Multi-head Latent Attention) Integration
- **Concept**: Compressing KV projections into a latent vector (`kv_latent_dim`) to save parameters.
- **Optimization**: Saved parameters are reinvested into the MLP width (scaling from 3.0x to 3.5x).
- **Implementation Note**: Fixed the `late_qat_step` and `Muon` optimizer banking to include `kv_latent_bank` and `kv_up_bank`.
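
A sketch of the compression idea (module and dimensions are illustrative; at dim=512 and kv_dim=256, direct K/V projections cost 2×512×256 ≈ 262K params, while a 64-dim latent costs 512×64 + 2×64×256 ≈ 65K):

```python
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, dim=512, kv_latent_dim=64, kv_dim=256):
        super().__init__()
        self.kv_down = nn.Linear(dim, kv_latent_dim, bias=False)  # shared down-projection
        self.k_up = nn.Linear(kv_latent_dim, kv_dim, bias=False)  # K up-projection
        self.v_up = nn.Linear(kv_latent_dim, kv_dim, bias=False)  # V up-projection

    def forward(self, x):
        z = self.kv_down(x)  # [B, T, kv_latent_dim]; the parameter savings live here
        return self.k_up(z), self.v_up(z)
```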

## Micro-Sweep Results

| Run ID | LR | MLP Mult | Latent Dim | BPB @ 500 | BPB @ 1500 | Status |
|--------|----|----------|------------|-----------|------------|--------|
| `micro_lr0.025_mlp3.5_ldim64` | 0.025 | 3.5 | 64 | TBD | TBD | **Executing** |
| ... | ... | ... | ... | ... | ... | ... |
37 changes: 37 additions & 0 deletions Opus/DECISIONS.md
# Decision Log

Audit trail for non-obvious calls. Each entry: date, decision, alternatives considered, reasoning. New entries at top.

---

## 2026-04-27 — Build on PR #1493 SOTA, not on `train_antigravity.py`

**Decision:** Use the SOTA file at `records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py` as the base for the leaderboard push. Treat `train_antigravity.py` as a separate, parallel non-record submission.

**Alternatives considered:**
- (A) Push only the antigravity stack (MLA + 3.5× MLP + Int6 QAT, vocab 1024) for the record.
- (B) Bolt antigravity ideas (MLA, MTP) onto the SOTA stack from scratch.
- (C) Build directly on PR #1493 SOTA, keep antigravity as a side submission. **← chosen**

**Reasoning:**
- The antigravity stack is missing every layer of the current SOTA (no SP8192, no GPTQ-SDClip, no recurrence, no parallel residuals, no legal TTT). Reaching parity is weeks of work, not 3 days.
- The SOTA author chain is on the same code surface — adding to it is incremental and stylistically expected by the reviewers.
- Keeping antigravity alive as a non-record submission is ~free in compute and gives us a guaranteed submission either way.

---

## 2026-04-27 — Top angle: smarter TTT, not more architecture

**Decision:** Spend the bulk of compute exploring TTT variants (param-selective, chunk-size sweeps, momentum schedules) rather than architectural changes (MLA, MTP, depth changes).

**Alternatives considered:**
- New attention variant (MLA / linear / state-space hybrid) — requires retraining from scratch and weeks of tuning.
- More depth recurrence loops — already at 3 layers × 2 loops; diminishing returns.
- New optimizer (Lion, Sophia) — Muon is hard to beat at this scale.
- Mixed-bit GPTQ — kept as fallback (Day 2 pivot).

**Reasoning:**
- TTT is the newest layer in the SOTA stack (added in PR #549, refined through #1413, #1493). Less time for the community to optimize it.
- Current TTT is naïve: vanilla SGD on **all** params. A quantized model has only a small fp32 surface (`q_gain`, `attn_scale`, `mlp_scale`, `skip_weights`, `skip_gates`, `resid_mix`, `ln_scale_factor`); training only those is faster, lower-variance, and avoids fighting GPTQ rounding errors.
- TTT runs at eval time, so we can iterate on a fixed checkpoint without re-paying the 10-min training cost per experiment. This makes Day 2 triage 10× cheaper than retraining sweeps.
- Theoretical ceiling: TTT currently gives ~0.002 BPB (1.0827 sliding → 1.0810 TTT). If we can extract another 0.005 from the same mechanism, that's our submission.
85 changes: 85 additions & 0 deletions Opus/PLAN.md
# 3-Day Execution Plan

**Total budget:** $500 RunPod credits
**Deadline:** 2026-04-30 (submission window closes)
**Today:** 2026-04-27

## Bar to clear

- Beat SOTA `val_bpb = 1.0810` by **≥0.005 BPB** → target ≤ **1.0760** (3-seed mean)
- Statistical significance **p < 0.01** across seeds {42, 314, 999}
- Train under 600s on 8×H100 SXM
- Eval (sliding + TTT) under 600s
- Artifact ≤ 16,000,000 bytes (decimal MB)

## Why this stack

The PR #1493 SOTA is a long compounded chain. Recent record-to-record deltas have been 0.0006–0.002 BPB. To clear 0.005, we need an **orthogonal** improvement, not a hyperparameter tweak. The four candidates, from highest to lowest EV:

1. **Smarter TTT** — the current TTT trains *all* params with vanilla SGD. Selective param TTT (only adapt fp32 control tensors that survived quantization) is faster, lower variance, and theoretically better-suited to a quantized model.
2. **Code-golf the wrapper** — every byte saved becomes a byte we can spend on weights or precision.
3. **Mixed-bit GPTQ** — per-layer bit allocation by Hessian sensitivity.
4. **MLA + reinvest** — too much surface area to reach parity in 3 days; keep as non-record submission via `train_antigravity.py`.

## Day-by-day

### Day 1 — 2026-04-27 (today) — budget ~$50

**Goal:** Stand up infra. Reproduce SOTA seed=42 to within ±0.0003 of the published 1.08079.

Tasks:
- [ ] Spin up 1×H100 RunPod with the official Parameter Golf template (~$3/hr)
- [ ] Clone repo, download `sp8192` data variant
- [ ] Run `records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py` with `SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 ...`
- Note: 1×H100 means `grad_accum_steps=8`; one full run is much slower than 8×H100. Budget ~2 hours wallclock for the full reproduction.
- [ ] Confirm `val_bpb ≈ 1.08079`. If off by more than 0.0003, debug before any other work — this is a leaderboard-validity blocker.
- [ ] Save the trained checkpoint (`final_model.pt`) — Day 2 reuses it for TTT-only experiments without retraining each time.

**Cost estimate:** ~12 h on 1×H100 (debugging plus one full reproduction) ≈ $36. Buffer to $50.

### Day 2 — 2026-04-28 — budget ~$200

**Goal:** Find a TTT variant that beats the baseline by ≥0.003 BPB single-seed (so it has a shot at clearing 0.005 across 3 seeds).

Tasks:
- [ ] Switch to 2×H100 ($6/hr) for faster iteration
- [ ] Reuse Day 1 checkpoint — TTT runs at eval time, no retraining needed for triage
- [ ] Run the experiment matrix in `experiments/` (one .md per experiment):
- `ttt_selective_scales.md` — TTT only on `q_gain`, `attn_scale`, `mlp_scale`, `skip_weights`, `resid_mix`
- `ttt_chunk_sweep.md` — chunk sizes {16K, 32K, 64K}
- `ttt_lr_sweep.md` — {0.001, 0.003, 0.005, 0.008, 0.012} × {2,3,5} epochs
- `ttt_momentum_reset.md` — reset momentum between chunks
- `ttt_with_wd.md` — add small weight decay during TTT
- `ttt_grad_accum.md` — accumulate gradients across chunks before stepping
- [ ] Each experiment: 1 seed, log `val_bpb_ttt`, mark winner/loser
- [ ] By end of day, lock in best config

**In parallel:** I draft a more aggressively golfed code wrapper (target: shave 5KB off the current 16.6KB).

**Cost estimate:** ~30 hours of 2×H100 = $180. Buffer to $200.

### Day 3 — 2026-04-29 — budget ~$230, $20 reserve

**Goal:** Validate winner across 3 seeds on 8×H100. Submit before midnight.

Tasks:
- [ ] 3 × 8×H100 full runs with seeds {42, 314, 999} using the locked Day 2 config
- Each run: ~12 min total (10 train + ~2 eval) × $20/hr = ~$4 each = $12 total compute
- Add ~30 min of pod warm-up overhead per run = $10 each ≈ $30 total
- Realistic: $40 for 3 seeds
- [ ] If 3-seed mean clears 1.0760 with std ≤ 0.0007 (so p<0.01 by one-sample t-test vs 1.0810; worked check below this list): proceed to PR
- [ ] If marginal (mean 1.0765–1.0775 or std too high): one more 3-seed run with seeds {7, 11, 13} to either confirm or kill
- [ ] Write the submission README + `submission.json` modeled on PR #1493's
- [ ] Open the PR on `openai/parameter-golf` with all 3 train logs attached
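
For reference, a worked version of that significance check (seed values hypothetical; assumes scipy is available):

```python
from math import sqrt
from scipy import stats

seeds_bpb = [1.0753, 1.0760, 1.0767]  # hypothetical 3-seed results
n = len(seeds_bpb)
mean = sum(seeds_bpb) / n                                      # 1.0760
std = sqrt(sum((x - mean) ** 2 for x in seeds_bpb) / (n - 1))  # 0.0007
t = (1.0810 - mean) / (std / sqrt(n))                          # ~12.4
p = stats.t.sf(t, df=n - 1)                                    # one-sided, ~0.003 < 0.01
print(f"mean={mean:.4f} std={std:.4f} t={t:.1f} p={p:.4f}")
```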

**Reserve $20:** if everything goes sideways, one final 8×H100 run to test the most-promising untried idea.

## Stopping rules

- If Day 1 reproduction is off by >0.0005 BPB: **stop and debug** before spending more.
- If by Day 2 noon no TTT variant has hit a ≥0.002 BPB single-seed gain: **pivot remaining $$ to mixed-bit GPTQ**.
- If Day 3 morning and 3-seed mean is >1.0790: **abandon record attempt**, focus the remaining hours on shipping `train_antigravity.py` as a polished non-record submission.

## Parallel non-record track (low priority, low cost)

`train_antigravity.py` (MLA + 3.5× MLP + Int6 QAT) — keep this alive as a non-record submission to the unlimited-compute track. Even at modest BPB it goes on the record as a creative submission. It touches none of the leaderboard budget as long as we don't spend extra compute on it; we package whatever the latest `EXPERIMENTAL_LOG.md` shows.
41 changes: 41 additions & 0 deletions Opus/README.md
# Opus — Parameter Golf Leaderboard Push

This folder is **Opus's working directory** for the OpenAI Parameter Golf challenge (16MB / 10-min track). It exists alongside Antigravity's work so the two agents don't trip over each other.

- **Antigravity's work**: `Antigravity/` (separate folder, not yet present)
- **Opus's work**: this folder

Opus and Antigravity work independently. Opus may peek at Antigravity's folder to look for ideas worth borrowing, but does not coordinate or hand off work.

## Current goal

Beat the standing SOTA of **1.0810 BPB** (PR #1493 by bigbag, 2026-04-09) by **≥0.005 BPB** with a 3-seed mean and p<0.01. Submission deadline: **2026-04-30**.

## Strategy in one sentence

Build directly on the PR #1493 SOTA stack. Push hard on **TTT improvements** (highest EV given the tight timeline) with **mixed-bit GPTQ** as a fallback angle. Keep `train_antigravity.py` alive in parallel as a non-record submission.

## Folder layout

```
Opus/
├── README.md # this file — high-level status and pointers
├── PLAN.md # 3-day execution plan with budget breakdown
├── DECISIONS.md # log of why we chose / rejected directions
├── experiments/ # one .md per experiment (results, configs, logs)
├── notes/ # technical notes (SOTA architecture decode, etc.)
└── (any code we add) # e.g. opus_train_gpt.py — variants we're testing
```

## Status

| Date | Phase | Status |
|------|-------|--------|
| 2026-04-27 | Setup + reproduction | Not started — awaiting confirmation to spend compute |

## How to read this folder

- Start with `PLAN.md` for what we're doing and why.
- `experiments/` is the source of truth for what we've actually run. Each experiment file has: hypothesis, config, command, result, decision.
- `DECISIONS.md` is the audit trail for big calls (e.g. "killed mixed-bit GPTQ angle on Day 2 because TTT was tracking").
- Anything not in this folder is either Antigravity's or shared infrastructure.
42 changes: 42 additions & 0 deletions Opus/experiments/000_template.md
# Experiment NNN — short title

**Date:** YYYY-MM-DD
**Hypothesis:** What we expect and why (1–2 sentences).
**Baseline:** What we're comparing against (e.g. SOTA seed=42 = 1.08079).
**Cost:** Estimated $ and wallclock.

## Config

Diff from baseline (env vars, code patches, etc.):

```bash
TTT_LR=0.003 TTT_EPOCHS=5 TTT_CHUNK_TOKENS=16384 ...
```

If a code patch: link to `Opus/<patch_file>` and quote the relevant hunk.

## Command

Exact command(s) run:

```bash
SEED=42 ... torchrun --standalone --nproc_per_node=N train_gpt.py
```

## Result

| Metric | Value |
|--------|-------|
| `val_bpb_sliding` | |
| `val_bpb_ttt` | |
| Wallclock train | |
| Wallclock eval | |
| Artifact bytes | |

## Decision

- ✅ Promising → next step
- ⚠️ Marginal → re-run with different seed?
- ❌ Killed — reason

Notes / surprises / things to follow up.
139 changes: 139 additions & 0 deletions Opus/notes/sota_architecture.md
# SOTA Architecture Decode (PR #1493 / 2026-04-09)

Decoded from `records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py` (LZMA-compressed, ~16.6KB on disk → ~48KB Python).

## Model

- **Vocab:** 8192 (SentencePiece, `sp8192` data variant)
- **Layers:** 11
- **Hidden:** model_dim=512, embedding_dim=512 (no projection)
- **Heads:** 8 query heads, 4 KV heads (GQA), head_dim=64
- **MLP:** 4× expansion, LeakyReLU(0.5)² activation (`fc → leaky_relu(0.5).square() → proj`; sketched after this list)
- **RoPE:** partial, 16/64 dims rotated, base=10000, train_seq_len=2048
- **Attention scaling:** learnable per-head `q_gain` initialized to 5.0
- **LN:** RMSNorm (no learnable scale), with per-block multiplicative `attn_scale`/`mlp_scale` parameters and `ln_scale_factor = 1/sqrt(layer_idx+1)`
- **Tied embeddings:** yes; init std=0.005
- **Logit softcap:** 30 × tanh(logits/30)
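
For readability, the MLP nonlinearity from the list above in isolation (a sketch assuming plain bias-free linears, not the record code):

```python
import torch.nn as nn
import torch.nn.functional as F

def mlp_forward(x, fc: nn.Linear, proj: nn.Linear):
    # fc: 512 -> 2048, proj: 2048 -> 512
    h = F.leaky_relu(fc(x), negative_slope=0.5).square()  # LeakyReLU(0.5)^2
    return proj(h)
```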

## Block

```python
mix = resid_mix.to(dtype) # [2, dim] — value residual mixing
x_in = mix[0] * x + mix[1] * x0 # x0 = post-embed signal carried through
attn_out = attn(attn_norm(x_in) * ln_scale_factor)
if parallel: # layers 7+
mlp_out = mlp(mlp_norm(x_in) * ln_scale_factor)
x_out = x_in + attn_scale * attn_out + mlp_scale * mlp_out
else:
x_out = x_in + attn_scale * attn_out
x_out = x_out + mlp_scale * mlp(mlp_norm(x_out) * ln_scale_factor)
```

- **Parallel residuals** from layer 7 onwards (GPT-J style — attn and MLP read same input, write to same output).
- **Sequential residuals** for layers 0–6.

## Depth recurrence

- `num_loops=2`, `loop_start=3`, `loop_end=5` — segment is `[3,4,5]`
- All-indices construction: `[0,1,2] + [3,4,5] + [3,4,5] + [3,4,5] + [6,7,8,9,10]` = 17 virtual layers
- Split at midpoint into encoder / decoder for U-Net skips
- **Skip connections** with learnable `skip_weights` (per-dim) and `skip_gates` (per-dim sigmoid for lerp)
- **Activation:** triggers at `enable_looping_at=0.35` (i.e. ~step 1592 of 4550)
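
The schedule construction, spelled out (a sketch inferred from the all-indices line above):

```python
loop_start, loop_end, num_loops, n_layers = 3, 5, 2, 11
segment = list(range(loop_start, loop_end + 1))   # [3, 4, 5]
order = (list(range(loop_start))                  # [0, 1, 2]
         + segment * (num_loops + 1)              # [3, 4, 5] repeated 3x
         + list(range(loop_end + 1, n_layers)))   # [6, 7, 8, 9, 10]
assert len(order) == 17                           # virtual layers
mid = len(order) // 2
encoder, decoder = order[:mid], order[mid:]       # U-Net halves
# One plausible reading of the gated skip:
#   x_dec = lerp(x_dec, skip_weights[i] * x_enc, sigmoid(skip_gates[i]))
```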

## Attention (XSA on all 11 layers)

```python
y = flash_attn_3_func(q, k, v, causal=True) # standard FA3
if use_xsa:
# Subtract value-direction projection per KV-group
y_g = y.reshape(B, T, Hkv, group, D)
vn = F.normalize(v, dim=-1).unsqueeze(-2)
proj = (y_g * vn).sum(-1, keepdim=True) * vn
y = (y_g - proj).reshape(B, T, H, D)
```

## Optimizers

- **Muon** (custom): row-normalized, 5 Newton-Schulz steps, momentum 0.99 (warmup 0.92→0.99 over 1500 steps), nesterov, weight decay 0.095. Applies to all 2D matrices in blocks **except** control tensors.
- **AdamW** (token embeddings): lr=0.03 (tied) or 0.6 (untied), wd=0.085
- **AdamW** (scalars + control tensors): lr=0.02, wd=0.02
- **Adam** (lm_head, only if untied): lr=0.008
- **Control tensor patterns** (excluded from Muon): `attn_scale, attn_scales, mlp_scale, mlp_scales, resid_mix, resid_mixes, q_gain, skip_weight, skip_weights, skip_gates`
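
A sketch of how that partition could be expressed (substring matching on parameter names is an assumption, not the record code):

```python
import torch.nn as nn

CONTROL = ("attn_scale", "mlp_scale", "resid_mix", "q_gain",
           "skip_weight", "skip_gates")  # also catches the plural forms above

def partition_params(model: nn.Module):
    muon, embed, control, rest = [], [], [], []
    for name, p in model.named_parameters():
        if any(tok in name for tok in CONTROL):
            control.append(p)  # AdamW lr=0.02, wd=0.02
        elif "emb" in name:
            embed.append(p)    # AdamW lr=0.03 tied / 0.6 untied, wd=0.085
        elif p.ndim == 2:
            muon.append(p)     # Muon (an untied lm_head would instead get Adam lr=0.008)
        else:
            rest.append(p)     # remaining scalars join the AdamW scalar group
    return muon, embed, control, rest
```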

## Training schedule

- **Iterations:** 20000 max, but capped at 600s wallclock (≈ 4550 steps actual)
- **Batch:** train_batch_tokens=786432, train_seq_len=2048
- **Warmup:** 20 steps
- **Warmdown:** 0.72 of training (linear to min_lr=0)
- **Grad clip:** 0.3 norm
- **EMA:** decay 0.9965 (applied during training, EMA weights used for eval)
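
The EMA update in isolation (a sketch; the record code's bookkeeping may differ):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.9965):
    for e, p in zip(ema_params, params):
        e.lerp_(p, 1.0 - decay)  # e = decay * e + (1 - decay) * p
```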

## Quantization (GPTQ + SDClip + Brotli)

- **Bits:** matrices=int6, embeddings=int8
- **Clip:** `clip = k * std(row)`, k=12.85 for matrices, k=20.0 for embeddings
- **GPTQ:** Hessian-aware per-column rounding, block_size=128
- **Calibration:** 64 batches from training data
- **Compression:** byte-shuffle (stride=2) → Brotli-11
- **Reserve:** 12s of training budget reserved for GPTQ at end
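
A sketch of the final packing stage (the shuffle layout is assumed; `brotli` is the standard PyPI package):

```python
import brotli

def pack(raw: bytes) -> bytes:
    # Stride-2 byte shuffle: group same-position bytes together so Brotli
    # sees longer runs of similar values. Even-length payload assumed.
    shuffled = raw[0::2] + raw[1::2]
    return brotli.compress(shuffled, quality=11)

def unpack(blob: bytes) -> bytes:
    s = brotli.decompress(blob)
    half = len(s) // 2
    out = bytearray(len(s))
    out[0::2], out[1::2] = s[:half], s[half:]
    return bytes(out)
```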

## Eval

Three eval modes available, selected by env vars:

1. `eval_val` — single-pass cross entropy (baseline, fast)
2. `eval_val_sliding` — sliding window with `eval_stride=64` (already a win, used as the "before TTT" number)
3. `eval_val_ttt` — sliding window + chunk-based TTT (the SOTA result)

### TTT (the part we're attacking)

```python
ttt_params = list(model.parameters()) # ALL params, no filter
optimizer = SGD(ttt_params, lr=0.005, momentum=0.9)

for ci, windows in enumerate(chunk_windows):
# 1. Score all windows in this chunk under no_grad → accumulates loss/tokens/bytes
# 2. If not last chunk and ttt_epochs > 0:
# cos_lr = ttt_lr * 0.5 * (1 + cos(pi * ci / (num_chunks-1)))
# for ep in range(ttt_epochs): # default 3
# for window in shuffled(windows):
# forward + backward
# for p in ttt_params: all_reduce(p.grad)
# clip_grad_norm_(1.0)
# optimizer.step()
```

- **Chunk size:** 32768 tokens (`ttt_chunk_tokens`)
- **Epochs per chunk:** 3
- **LR schedule:** cosine across chunks, no schedule within a chunk
- **Distributed:** params live on each rank, gradients all-reduced manually (since ttt_params are fp32 and not DDP-wrapped)
- **Score-first compliance:** scoring of chunk N happens *before* training on chunk N (legality requirement)

## Key tensors and parameter counts

To estimate after architectural changes:
- Token embedding: 8192 × 512 = 4.19M params (int8 → 4.19MB)
- 11 blocks × per-block:
- Attention: q(512×512), k(512×256), v(512×256), proj(512×512) = 655K
- MLP: fc(512×2048), proj(2048×512) = 2.10M
- Scales/control: ~3K
- Block subtotal: ~2.76M
- 11 blocks: ~30.3M
- Skip weights/gates: 2 × num_skips × 512 ≈ 8K
- **Total params: ~34.5M**
- At int6 + int8 embed + brotli: ~16MB artifact

## Parameters that are NOT quantized (TTT-relevant!)

These stay fp32 because they're scalar/vector control tensors, **not** in the GPTQ pipeline:

- `q_gain` per block: 8 floats (per attn) × 11 blocks = 88 floats
- `attn_scale` per block: 512 floats × 11 = 5632
- `mlp_scale` per block: 512 floats × 11 = 5632
- `resid_mix` per block: 2 × 512 × 11 = 11264
- `skip_weights`: num_skips × 512
- `skip_gates`: num_skips × 512
- `tok_emb` is the exception: int8 rather than int6, but still quantized, so it is not part of this fp32 surface

**Total non-quantized control surface:** ~25K floats. Tiny. **This is what selective-TTT can adapt without dequant/requant overhead.**
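
A sketch of the selective-TTT swap this enables (name matching is an assumption):

```python
FP32_SURFACE = ("q_gain", "attn_scale", "mlp_scale",
                "resid_mix", "skip_weights", "skip_gates")

def selective_ttt_params(model):
    return [p for name, p in model.named_parameters()
            if any(tok in name for tok in FP32_SURFACE)]

# Drop-in replacement for the "ALL params" line in the TTT pseudocode above:
#   ttt_params = selective_ttt_params(model)
#   optimizer = torch.optim.SGD(ttt_params, lr=0.005, momentum=0.9)
```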