Merged (47 commits)

- `926a242` Record: Split-LR + BigramHash(2816x160) + Full GPTQ + Brotli — val_bp… (dexhunter, Mar 31, 2026)
- `6772e6a` fix: clarify PR #1019 attribution (not our merged PR) (dexhunter, Mar 31, 2026)
- `1a2ba5d` Copy PR 1179 record trainer to repo root (msisovic, Mar 31, 2026)
- `d95a637` Decompress PR 1179 root trainer wrapper (msisovic, Mar 31, 2026)
- `23e26fb` Port mixed quant to PR 1179 root trainer (msisovic, Mar 31, 2026)
- `2571aa6` Port depth recurrence to PR 1179 root trainer (msisovic, Mar 31, 2026)
- `1486b6a` Log full untie depth recurrence result (msisovic, Mar 31, 2026)
- `101fe21` Prioritize shared params in GPTQ (msisovic, Mar 31, 2026)
- `d9a7d19` Somewhat working (msisovic, Mar 31, 2026)
- `2a2fe32` Log partitioned residual result (msisovic, Mar 31, 2026)
- `60857e5` Train wallclock log (msisovic, Mar 31, 2026)
- `0ebead6` logging fix (msisovic, Mar 31, 2026)
- `e5a01dc` First run in (msisovic, Mar 31, 2026)
- `78cf56a` Parallel Residuals readme entry (msisovic, Mar 31, 2026)
- `ad65f02` Update submission README and add seed logs (msisovic, Apr 1, 2026)
- `b7c4931` Update submission reproducibility notes (msisovic, Apr 1, 2026)
- `18e14d3` Add submission metadata for ParallelResiduals run (msisovic, Apr 1, 2026)
- `7441f45` Clean root for submission branch (msisovic, Apr 1, 2026)
- `ae2f7b7` Restore root files for submission (msisovic, Apr 1, 2026)
- `4421a8d` Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_b… (clarkkev, Apr 1, 2026)
- `e01b7e1` Add train_gpt.py to submission (msisovic, Apr 2, 2026)
- `e1cf90a` Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 — val_bpb 1… (dexhunter, Apr 3, 2026)
- `be8c14d` Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R — v… (aryanbhosale, Apr 4, 2026)
- `f4fa11f` Record: SP8192 + GPTQ Embeddings + SDClip + Loop45x2 — val_bpb 1.0856… (clarkkev, Apr 5, 2026)
- `3f1e814` Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.… (Robby955, Apr 6, 2026)
- `4b57791` Fix LaTeX rendering (Robby955, Apr 6, 2026)
- `cd04a8b` Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 … (dexhunter, Apr 6, 2026)
- `5470a39` Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.082… (aryanbhosale, Apr 8, 2026)
- `857de47` Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5. (Apr 9, 2026)
- `c7d01f4` Merge pull request #1179 from dexhunter/submission/splitlr-dim160-gpt… (cocohearts, Apr 9, 2026)
- `d96169d` Merge pull request #1218 from clarkkev/submission/vocab4096-mlpmult4-… (cocohearts, Apr 9, 2026)
- `444b4f7` Merge pull request #1204 from msisovic/hyperconnections_submission (cocohearts, Apr 9, 2026)
- `ee38f46` Merge pull request #1285 from dexhunter/muoneqr-recurrence-wd090-allint6 (cocohearts, Apr 9, 2026)
- `ed829c0` Merge pull request #1334 from aryanbhosale/submission/sp4096-no-slot-v4 (cocohearts, Apr 9, 2026)
- `8d62bdd` Merge pull request #1394 from clarkkev/submission/sp8192-gptq-emb-sdc… (cocohearts, Apr 9, 2026)
- `6f92b13` Merge pull request #1413 from dexhunter/record/sp8192-qk5-legal-ttt-1… (cocohearts, Apr 9, 2026)
- `905ef58` Merge pull request #1477 from aryanbhosale/submission/sp8192-parallel… (cocohearts, Apr 9, 2026)
- `bac888c` Merge pull request #1493 from bigbag/submission/sp8192-ttt-clean (cocohearts, Apr 9, 2026)
- `c714a4d` Merge pull request #1412 from Robby955/submission/parallel-residuals-… (cocohearts, Apr 9, 2026)
- `81b6bd7` Update README leaderboard for April records (cocohearts, Apr 9, 2026)
- `75700cb` Merge pull request #1511 from openai/codex/update-april-leaderboard-r… (cocohearts, Apr 9, 2026)
- `2bd7916` Record: 11L Muon TTT + Entropy-Adaptive Epochs (8xH100) — val_bpb 1.1… (aamodbhatt, Apr 23, 2026)
- `f56ef88` Record: 1.1122 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-s… (dexhunter, Apr 23, 2026)
- `8b148a0` Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed m… (icryo, Apr 23, 2026)
- `7427de2` Update README leaderboard with recent record submissions + revert bad… (cocohearts, Apr 26, 2026)
- `51f0301` Draft SOTA implementation: MLA + 3.5x MLP + Int6 QAT for tracking (GodlyDonuts, Apr 1, 2026)
- `cd67ad0` Add Opus working folder: leaderboard push plan and SOTA decode (GodlyDonuts, Apr 27, 2026)

30 changes: 30 additions & 0 deletions EXPERIMENTAL_LOG.md
# Antigravity Deep-Golf Experimental Log

This document tracks technical insights, failure modes, and learning encountered during the 16MB Parameter Golf challenge.

## Infrastructure & Environment Insights

### [2026-04-01] FlashAttention GQA Broadcast Error
- **Problem**: SDPA fallback crashed with `torch.compile` during the forward pass.
- **Insight**: `F.scaled_dot_product_attention` (fallback) failed because the Query heads (H) and KV heads (Hkv) were not broadcastable in the transposed [B, H, T, D] format used by SDPA.
- **Fix**: Added explicit `torch.repeat_interleave(dim=2)` for KV heads when `H != Hkv` inside the `flash_attn_3_func` fallback (sketched below).
- **Impact**: Training now successfully enters the `torch.compile` and iteration loops on standard PyTorch images.
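
A minimal sketch of the shape of that fix (tensor layout and helper name are illustrative, not the record code):

```python
import torch
import torch.nn.functional as F

def sdpa_gqa_fallback(q, k, v, causal=True):
    # Assumed layout: q is [B, T, H, D]; k and v are [B, T, Hkv, D], with H % Hkv == 0.
    H, Hkv = q.shape[2], k.shape[2]
    if H != Hkv:
        # SDPA will not broadcast mismatched head counts, so materialize
        # the repeated KV heads before transposing.
        k = torch.repeat_interleave(k, H // Hkv, dim=2)
        v = torch.repeat_interleave(v, H // Hkv, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> [B, H, T, D] for SDPA
    y = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return y.transpose(1, 2)  # back to [B, T, H, D], matching flash_attn_3_func
```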

### [2026-04-01] GPU Utilization Diagnostic
- **Observation**: 0% GPU utilization with 4% CPU utilization.
- **Cause**: Repeated initialization crashes (FA3 import, kv_bank attribute, SDPA GQA mismatch).
- **Status**: **RESOLVED**. The model now loads into VRAM (4.6GB/80GB) on all 8 H100s. Currently waiting on `torch.compile` (~60–90 s) before the first step logs appear.

## Architectural Findings (Antigravity Deep-Golf)

### MLA (Multi-head Latent Attention) Integration
- **Concept**: Compressing KV projections into a latent vector (`kv_latent_dim`) to save parameters.
- **Optimization**: Saved parameters are reinvested into the MLP width (scaling from 3.0x to 3.5x).
- **Implementation Note**: Fixed the `late_qat_step` and `Muon` optimizer banking to include `kv_latent_bank` and `kv_up_bank`.
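
A sketch of the compression idea (module and dimensions are illustrative; at dim=512 and kv_dim=256, direct K/V projections cost 2×512×256 ≈ 262K params, while a 64-dim latent costs 512×64 + 2×64×256 ≈ 65K):

```python
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, dim=512, kv_latent_dim=64, kv_dim=256):
        super().__init__()
        self.kv_down = nn.Linear(dim, kv_latent_dim, bias=False)  # shared down-projection
        self.k_up = nn.Linear(kv_latent_dim, kv_dim, bias=False)  # K up-projection
        self.v_up = nn.Linear(kv_latent_dim, kv_dim, bias=False)  # V up-projection

    def forward(self, x):
        z = self.kv_down(x)  # [B, T, kv_latent_dim]; the parameter savings live here
        return self.k_up(z), self.v_up(z)
```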

## Micro-Sweep Results

| Run ID | LR | MLP Mult | Latent Dim | BPB @ 500 | BPB @ 1500 | Status |
|--------|----|----------|------------|-----------|------------|--------|
| `micro_lr0.025_mlp3.5_ldim64` | 0.025 | 3.5 | 64 | TBD | TBD | **Executing** |
| ... | ... | ... | ... | ... | ... | ... |
37 changes: 37 additions & 0 deletions Opus/DECISIONS.md
# Decision Log

Audit trail for non-obvious calls. Each entry: date, decision, alternatives considered, reasoning. New entries at top.

---

## 2026-04-27 — Build on PR #1493 SOTA, not on `train_antigravity.py`

**Decision:** Use the SOTA file at `records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py` as the base for the leaderboard push. Treat `train_antigravity.py` as a separate, parallel non-record submission.

**Alternatives considered:**
- (A) Push only the antigravity stack (MLA + 3.5× MLP + Int6 QAT, vocab 1024) for the record.
- (B) Bolt antigravity ideas (MLA, MTP) onto the SOTA stack from scratch.
- (C) Build directly on PR #1493 SOTA, keep antigravity as a side submission. **← chosen**

**Reasoning:**
- The antigravity stack is missing every layer of the current SOTA (no SP8192, no GPTQ-SDClip, no recurrence, no parallel residuals, no legal TTT). Reaching parity is weeks of work, not 3 days.
- The SOTA author chain is on the same code surface — adding to it is incremental and stylistically expected by the reviewers.
- Keeping antigravity alive as a non-record submission is ~free in compute and gives us a guaranteed submission either way.

---

## 2026-04-27 — Top angle: smarter TTT, not more architecture

**Decision:** Spend the bulk of compute exploring TTT variants (param-selective, chunk-size sweeps, momentum schedules) rather than architectural changes (MLA, MTP, depth changes).

**Alternatives considered:**
- New attention variant (MLA / linear / state-space hybrid) — requires retraining from scratch and weeks of tuning.
- More depth recurrence loops — already at 3 layers × 2 loops; diminishing returns.
- New optimizer (Lion, Sophia) — Muon is hard to beat at this scale.
- Mixed-bit GPTQ — kept as fallback (Day 2 pivot).

**Reasoning:**
- TTT is the newest layer in the SOTA stack (added in PR #549, refined through #1413, #1493). Less time for the community to optimize it.
- Current TTT is naïve: vanilla SGD on **all** params. A quantized model has only a small fp32 surface (`q_gain`, `attn_scale`, `mlp_scale`, `skip_weights`, `skip_gates`, `resid_mix`, `ln_scale_factor`); training only those is faster, lower-variance, and avoids fighting GPTQ rounding errors.
- TTT runs at eval time, so we can iterate on a fixed checkpoint without re-paying the 10-min training cost per experiment. This makes Day 2 triage 10× cheaper than retraining sweeps.
- Theoretical ceiling: TTT currently gives ~0.002 BPB (1.0827 sliding → 1.0810 TTT). If we can extract another 0.005 from the same mechanism, that's our submission.
85 changes: 85 additions & 0 deletions Opus/PLAN.md
# 3-Day Execution Plan

**Total budget:** $500 RunPod credits
**Deadline:** 2026-04-30 (submission window closes)
**Today:** 2026-04-27

## Bar to clear

- Beat SOTA `val_bpb = 1.0810` by **≥0.005 BPB** → target ≤ **1.0760** (3-seed mean)
- Statistical significance **p < 0.01** across seeds {42, 314, 999}
- Train under 600s on 8×H100 SXM
- Eval (sliding + TTT) under 600s
- Artifact ≤ 16,000,000 bytes (decimal MB)

## Why this stack

The PR #1493 SOTA is a long compounded chain. Recent record-to-record deltas have been 0.0006–0.002 BPB. To clear 0.005, we need an **orthogonal** improvement, not a hyperparameter tweak. The four candidates, from highest to lowest EV:

1. **Smarter TTT** — the current TTT trains *all* params with vanilla SGD. Selective param TTT (only adapt fp32 control tensors that survived quantization) is faster, lower variance, and theoretically better-suited to a quantized model.
2. **Code-golf the wrapper** — every byte saved becomes a byte we can spend on weights or precision.
3. **Mixed-bit GPTQ** — per-layer bit allocation by Hessian sensitivity.
4. **MLA + reinvest** — too much surface area to reach parity in 3 days; keep as non-record submission via `train_antigravity.py`.

## Day-by-day

### Day 1 — 2026-04-27 (today) — budget ~$50

**Goal:** Stand up infra. Reproduce SOTA seed=42 to within ±0.0003 of the published 1.08079.

Tasks:
- [ ] Spin up 1×H100 RunPod with the official Parameter Golf template (~$3/hr)
- [ ] Clone repo, download `sp8192` data variant
- [ ] Run `records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py` with `SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 ...`
- Note: 1×H100 means `grad_accum_steps=8`; one full run is much slower than 8×H100. Budget ~2 hours wallclock for the full reproduction.
- [ ] Confirm `val_bpb ≈ 1.08079`. If off by more than 0.0003, debug before any other work — this is a leaderboard-validity blocker.
- [ ] Save the trained checkpoint (`final_model.pt`) — Day 2 reuses it for TTT-only experiments without retraining each time.

**Cost estimate:** ~12 h on 1×H100 (debugging plus one full reproduction) ≈ $36. Buffer to $50.

### Day 2 — 2026-04-28 — budget ~$200

**Goal:** Find a TTT variant that beats the baseline by ≥0.003 BPB single-seed (so it has a shot at clearing 0.005 across 3 seeds).

Tasks:
- [ ] Switch to 2×H100 ($6/hr) for faster iteration
- [ ] Reuse Day 1 checkpoint — TTT runs at eval time, no retraining needed for triage
- [ ] Run the experiment matrix in `experiments/` (one .md per experiment):
- `ttt_selective_scales.md` — TTT only on `q_gain`, `attn_scale`, `mlp_scale`, `skip_weights`, `resid_mix`
- `ttt_chunk_sweep.md` — chunk sizes {16K, 32K, 64K}
- `ttt_lr_sweep.md` — {0.001, 0.003, 0.005, 0.008, 0.012} × {2,3,5} epochs
- `ttt_momentum_reset.md` — reset momentum between chunks
- `ttt_with_wd.md` — add small weight decay during TTT
- `ttt_grad_accum.md` — accumulate gradients across chunks before stepping
- [ ] Each experiment: 1 seed, log `val_bpb_ttt`, mark winner/loser
- [ ] By end of day, lock in best config

**In parallel:** I draft a more aggressively golfed code wrapper (target: shave 5KB off the current 16.6KB).

**Cost estimate:** ~30 hours of 2×H100 = $180. Buffer to $200.

### Day 3 — 2026-04-29 — budget ~$230, $20 reserve

**Goal:** Validate winner across 3 seeds on 8×H100. Submit before midnight.

Tasks:
- [ ] 3 × 8×H100 full runs with seeds {42, 314, 999} using the locked Day 2 config
- Each run: ~12 min total (10 train + ~2 eval) × $20/hr = ~$4 each = $12 total compute
- Add ~30 min of pod warm-up overhead per run = $10 each ≈ $30 total
- Realistic: $40 for 3 seeds
- [ ] If 3-seed mean clears 1.0760 with std ≤ 0.0007 (so p<0.01 by one-sample t-test vs 1.0810; worked check below this list): proceed to PR
- [ ] If marginal (mean 1.0765–1.0775 or std too high): one more 3-seed run with seeds {7, 11, 13} to either confirm or kill
- [ ] Write the submission README + `submission.json` modeled on PR #1493's
- [ ] Open the PR on `openai/parameter-golf` with all 3 train logs attached
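
For reference, a worked version of that significance check (seed values hypothetical; assumes scipy is available):

```python
from math import sqrt
from scipy import stats

seeds_bpb = [1.0753, 1.0760, 1.0767]  # hypothetical 3-seed results
n = len(seeds_bpb)
mean = sum(seeds_bpb) / n                                      # 1.0760
std = sqrt(sum((x - mean) ** 2 for x in seeds_bpb) / (n - 1))  # 0.0007
t = (1.0810 - mean) / (std / sqrt(n))                          # ~12.4
p = stats.t.sf(t, df=n - 1)                                    # one-sided, ~0.003 < 0.01
print(f"mean={mean:.4f} std={std:.4f} t={t:.1f} p={p:.4f}")
```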

**Reserve $20:** if everything goes sideways, one final 8×H100 run to test the most-promising untried idea.

## Stopping rules

- If Day 1 reproduction is off by >0.0005 BPB: **stop and debug** before spending more.
- If by Day 2 noon no TTT variant has hit a ≥0.002 BPB single-seed gain: **pivot remaining $$ to mixed-bit GPTQ**.
- If Day 3 morning and 3-seed mean is >1.0790: **abandon record attempt**, focus the remaining hours on shipping `train_antigravity.py` as a polished non-record submission.

## Parallel non-record track (low priority, low cost)

`train_antigravity.py` (MLA + 3.5× MLP + Int6 QAT) — keep this alive as a non-record submission to the unlimited-compute track. Even at modest BPB it goes on the record as a creative submission. It touches none of the leaderboard budget as long as we don't spend extra compute on it; we package whatever the latest `EXPERIMENTAL_LOG.md` shows.
41 changes: 41 additions & 0 deletions Opus/README.md
# Opus — Parameter Golf Leaderboard Push

This folder is **Opus's working directory** for the OpenAI Parameter Golf challenge (16MB / 10-min track). It exists alongside Antigravity's work so the two agents don't trip over each other.

- **Antigravity's work**: `Antigravity/` (separate folder, not yet present)
- **Opus's work**: this folder

Opus and Antigravity work independently. Opus may peek at Antigravity's folder to look for ideas worth borrowing, but does not coordinate or hand off work.

## Current goal

Beat the standing SOTA of **1.0810 BPB** (PR #1493 by bigbag, 2026-04-09) by **≥0.005 BPB** with a 3-seed mean and p<0.01. Submission deadline: **2026-04-30**.

## Strategy in one sentence

Build directly on the PR #1493 SOTA stack. Push hard on **TTT improvements** (highest EV given the tight timeline) with **mixed-bit GPTQ** as a fallback angle. Keep `train_antigravity.py` alive in parallel as a non-record submission.

## Folder layout

```
Opus/
├── README.md # this file — high-level status and pointers
├── PLAN.md # 3-day execution plan with budget breakdown
├── DECISIONS.md # log of why we chose / rejected directions
├── experiments/ # one .md per experiment (results, configs, logs)
├── notes/ # technical notes (SOTA architecture decode, etc.)
└── (any code we add) # e.g. opus_train_gpt.py — variants we're testing
```

## Status

| Date | Phase | Status |
|------|-------|--------|
| 2026-04-27 | Setup + reproduction | Not started — awaiting confirmation to spend compute |

## How to read this folder

- Start with `PLAN.md` for what we're doing and why.
- `experiments/` is the source of truth for what we've actually run. Each experiment file has: hypothesis, config, command, result, decision.
- `DECISIONS.md` is the audit trail for big calls (e.g. "killed mixed-bit GPTQ angle on Day 2 because TTT was tracking").
- Anything not in this folder is either Antigravity's or shared infrastructure.
42 changes: 42 additions & 0 deletions Opus/experiments/000_template.md
# Experiment NNN — short title

**Date:** YYYY-MM-DD
**Hypothesis:** What we expect and why (1–2 sentences).
**Baseline:** What we're comparing against (e.g. SOTA seed=42 = 1.08079).
**Cost:** Estimated $ and wallclock.

## Config

Diff from baseline (env vars, code patches, etc.):

```bash
TTT_LR=0.003 TTT_EPOCHS=5 TTT_CHUNK_TOKENS=16384 ...
```

If a code patch: link to `Opus/<patch_file>` and quote the relevant hunk.

## Command

Exact command(s) run:

```bash
SEED=42 ... torchrun --standalone --nproc_per_node=N train_gpt.py
```

## Result

| Metric | Value |
|--------|-------|
| `val_bpb_sliding` | |
| `val_bpb_ttt` | |
| Wallclock train | |
| Wallclock eval | |
| Artifact bytes | |

## Decision

- ✅ Promising → next step
- ⚠️ Marginal → re-run with different seed?
- ❌ Killed — reason

Notes / surprises / things to follow up.
139 changes: 139 additions & 0 deletions Opus/notes/sota_architecture.md
# SOTA Architecture Decode (PR #1493 / 2026-04-09)

Decoded from `records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/train_gpt.py` (LZMA-compressed, ~16.6KB on disk → ~48KB Python).

## Model

- **Vocab:** 8192 (SentencePiece, `sp8192` data variant)
- **Layers:** 11
- **Hidden:** model_dim=512, embedding_dim=512 (no projection)
- **Heads:** 8 query heads, 4 KV heads (GQA), head_dim=64
- **MLP:** 4× expansion, LeakyReLU(0.5)² activation (`fc → leaky_relu(0.5).square() → proj`; sketched after this list)
- **RoPE:** partial, 16/64 dims rotated, base=10000, train_seq_len=2048
- **Attention scaling:** learnable per-head `q_gain` initialized to 5.0
- **LN:** RMSNorm (no learnable scale), with per-block multiplicative `attn_scale`/`mlp_scale` parameters and `ln_scale_factor = 1/sqrt(layer_idx+1)`
- **Tied embeddings:** yes; init std=0.005
- **Logit softcap:** 30 × tanh(logits/30)
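
For readability, the MLP nonlinearity from the list above in isolation (a sketch assuming plain bias-free linears, not the record code):

```python
import torch.nn as nn
import torch.nn.functional as F

def mlp_forward(x, fc: nn.Linear, proj: nn.Linear):
    # fc: 512 -> 2048, proj: 2048 -> 512
    h = F.leaky_relu(fc(x), negative_slope=0.5).square()  # LeakyReLU(0.5)^2
    return proj(h)
```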

## Block

```python
mix = resid_mix.to(dtype) # [2, dim] — value residual mixing
x_in = mix[0] * x + mix[1] * x0 # x0 = post-embed signal carried through
attn_out = attn(attn_norm(x_in) * ln_scale_factor)
if parallel: # layers 7+
mlp_out = mlp(mlp_norm(x_in) * ln_scale_factor)
x_out = x_in + attn_scale * attn_out + mlp_scale * mlp_out
else:
x_out = x_in + attn_scale * attn_out
x_out = x_out + mlp_scale * mlp(mlp_norm(x_out) * ln_scale_factor)
```

- **Parallel residuals** from layer 7 onwards (GPT-J style — attn and MLP read same input, write to same output).
- **Sequential residuals** for layers 0–6.

## Depth recurrence

- `num_loops=2`, `loop_start=3`, `loop_end=5` — segment is `[3,4,5]`
- All-indices construction: `[0,1,2] + [3,4,5] + [3,4,5] + [3,4,5] + [6,7,8,9,10]` = 17 virtual layers
- Split at midpoint into encoder / decoder for U-Net skips
- **Skip connections** with learnable `skip_weights` (per-dim) and `skip_gates` (per-dim sigmoid for lerp)
- **Activation:** triggers at `enable_looping_at=0.35` (i.e. ~step 1592 of 4550)
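
The schedule construction, spelled out (a sketch inferred from the all-indices line above):

```python
loop_start, loop_end, num_loops, n_layers = 3, 5, 2, 11
segment = list(range(loop_start, loop_end + 1))   # [3, 4, 5]
order = (list(range(loop_start))                  # [0, 1, 2]
         + segment * (num_loops + 1)              # [3, 4, 5] repeated 3x
         + list(range(loop_end + 1, n_layers)))   # [6, 7, 8, 9, 10]
assert len(order) == 17                           # virtual layers
mid = len(order) // 2
encoder, decoder = order[:mid], order[mid:]       # U-Net halves
# One plausible reading of the gated skip:
#   x_dec = lerp(x_dec, skip_weights[i] * x_enc, sigmoid(skip_gates[i]))
```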

## Attention (XSA on all 11 layers)

```python
y = flash_attn_3_func(q, k, v, causal=True) # standard FA3
if use_xsa:
# Subtract value-direction projection per KV-group
y_g = y.reshape(B, T, Hkv, group, D)
vn = F.normalize(v, dim=-1).unsqueeze(-2)
proj = (y_g * vn).sum(-1, keepdim=True) * vn
y = (y_g - proj).reshape(B, T, H, D)
```

## Optimizers

- **Muon** (custom): row-normalized, 5 Newton-Schulz steps, momentum 0.99 (warmup 0.92→0.99 over 1500 steps), nesterov, weight decay 0.095. Applies to all 2D matrices in blocks **except** control tensors.
- **AdamW** (token embeddings): lr=0.03 (tied) or 0.6 (untied), wd=0.085
- **AdamW** (scalars + control tensors): lr=0.02, wd=0.02
- **Adam** (lm_head, only if untied): lr=0.008
- **Control tensor patterns** (excluded from Muon): `attn_scale, attn_scales, mlp_scale, mlp_scales, resid_mix, resid_mixes, q_gain, skip_weight, skip_weights, skip_gates`
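
A sketch of how that partition could be expressed (substring matching on parameter names is an assumption, not the record code):

```python
import torch.nn as nn

CONTROL = ("attn_scale", "mlp_scale", "resid_mix", "q_gain",
           "skip_weight", "skip_gates")  # also catches the plural forms above

def partition_params(model: nn.Module):
    muon, embed, control, rest = [], [], [], []
    for name, p in model.named_parameters():
        if any(tok in name for tok in CONTROL):
            control.append(p)  # AdamW lr=0.02, wd=0.02
        elif "emb" in name:
            embed.append(p)    # AdamW lr=0.03 tied / 0.6 untied, wd=0.085
        elif p.ndim == 2:
            muon.append(p)     # Muon (an untied lm_head would instead get Adam lr=0.008)
        else:
            rest.append(p)     # remaining scalars join the AdamW scalar group
    return muon, embed, control, rest
```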

## Training schedule

- **Iterations:** 20000 max, but capped at 600s wallclock (≈ 4550 steps actual)
- **Batch:** train_batch_tokens=786432, train_seq_len=2048
- **Warmup:** 20 steps
- **Warmdown:** 0.72 of training (linear to min_lr=0)
- **Grad clip:** 0.3 norm
- **EMA:** decay 0.9965 (applied during training, EMA weights used for eval)
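
The EMA update in isolation (a sketch; the record code's bookkeeping may differ):

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.9965):
    for e, p in zip(ema_params, params):
        e.lerp_(p, 1.0 - decay)  # e = decay * e + (1 - decay) * p
```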

## Quantization (GPTQ + SDClip + Brotli)

- **Bits:** matrices=int6, embeddings=int8
- **Clip:** `clip = k * std(row)`, k=12.85 for matrices, k=20.0 for embeddings
- **GPTQ:** Hessian-aware per-column rounding, block_size=128
- **Calibration:** 64 batches from training data
- **Compression:** byte-shuffle (stride=2) → Brotli-11
- **Reserve:** 12s of training budget reserved for GPTQ at end
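
A sketch of the final packing stage (the shuffle layout is assumed; `brotli` is the standard PyPI package):

```python
import brotli

def pack(raw: bytes) -> bytes:
    # Stride-2 byte shuffle: group same-position bytes together so Brotli
    # sees longer runs of similar values. Even-length payload assumed.
    shuffled = raw[0::2] + raw[1::2]
    return brotli.compress(shuffled, quality=11)

def unpack(blob: bytes) -> bytes:
    s = brotli.decompress(blob)
    half = len(s) // 2
    out = bytearray(len(s))
    out[0::2], out[1::2] = s[:half], s[half:]
    return bytes(out)
```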

## Eval

Three eval modes available, selected by env vars:

1. `eval_val` — single-pass cross entropy (baseline, fast)
2. `eval_val_sliding` — sliding window with `eval_stride=64` (already a win, used as the "before TTT" number)
3. `eval_val_ttt` — sliding window + chunk-based TTT (the SOTA result)

### TTT (the part we're attacking)

```python
ttt_params = list(model.parameters()) # ALL params, no filter
optimizer = SGD(ttt_params, lr=0.005, momentum=0.9)

for ci, windows in enumerate(chunk_windows):
# 1. Score all windows in this chunk under no_grad → accumulates loss/tokens/bytes
# 2. If not last chunk and ttt_epochs > 0:
# cos_lr = ttt_lr * 0.5 * (1 + cos(pi * ci / (num_chunks-1)))
# for ep in range(ttt_epochs): # default 3
# for window in shuffled(windows):
# forward + backward
# for p in ttt_params: all_reduce(p.grad)
# clip_grad_norm_(1.0)
# optimizer.step()
```

- **Chunk size:** 32768 tokens (`ttt_chunk_tokens`)
- **Epochs per chunk:** 3
- **LR schedule:** cosine across chunks, no schedule within a chunk
- **Distributed:** params live on each rank, gradients all-reduced manually (since ttt_params are fp32 and not DDP-wrapped)
- **Score-first compliance:** scoring of chunk N happens *before* training on chunk N (legality requirement)

## Key tensors and parameter counts

To estimate after architectural changes:
- Token embedding: 8192 × 512 = 4.19M params (int8 → 4.19MB)
- 11 blocks × per-block:
- Attention: q(512×512), k(512×256), v(512×256), proj(512×512) = 655K
- MLP: fc(512×2048), proj(2048×512) = 2.10M
- Scales/control: ~3K
- Block subtotal: ~2.76M
- 11 blocks: ~30.3M
- Skip weights/gates: 2 × num_skips × 512 ≈ 8K
- **Total params: ~34.5M**
- At int6 + int8 embed + brotli: ~16MB artifact

## Parameters that are NOT quantized (TTT-relevant!)

These stay fp32 because they're scalar/vector control tensors, **not** in the GPTQ pipeline:

- `q_gain` per block: 8 floats (per attn) × 11 blocks = 88 floats
- `attn_scale` per block: 512 floats × 11 = 5632
- `mlp_scale` per block: 512 floats × 11 = 5632
- `resid_mix` per block: 2 × 512 × 11 = 11264
- `skip_weights`: num_skips × 512
- `skip_gates`: num_skips × 512
- `tok_emb` is the exception: int8 rather than int6, but still quantized, so it is not part of this fp32 surface

**Total non-quantized control surface:** ~25K floats. Tiny. **This is what selective-TTT can adapt without dequant/requant overhead.**
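
A sketch of the selective-TTT swap this enables (name matching is an assumption):

```python
FP32_SURFACE = ("q_gain", "attn_scale", "mlp_scale",
                "resid_mix", "skip_weights", "skip_gates")

def selective_ttt_params(model):
    return [p for name, p in model.named_parameters()
            if any(tok in name for tok in FP32_SURFACE)]

# Drop-in replacement for the "ALL params" line in the TTT pseudocode above:
#   ttt_params = selective_ttt_params(model)
#   optimizer = torch.optim.SGD(ttt_params, lr=0.005, momentum=0.9)
```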