
Commit f56f98a

leon2k2k2k and claude committed
spec 015: Recur-Alpha learnable per-pass blending
Active research thread's first experiment. Pinned to commit a9aa141 on exp/recur-alpha. Key decisions baked in:

- Screening mode first (~$6 total, skip TTT/GPTQ/EMA)
- TRAIN_LOG_EVERY=100 for diagnostic resolution
- p2p cosine diagnostic off by default (torch.compile concerns)
- Single seed 42; conditional 3-seed + full TTT only if Δ ≤ -0.001
- Identity-at-init safety: α=0 = passthrough, worst case no change

Three disproven recurrence-class experiments are explicitly NOT in this spec (earlier activation openai#1726, schedule smoothing openai#1663, position shift openai#1726). Those would be wasted spend per existing PG evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 18b3775 commit f56f98a

1 file changed

Lines changed: 221 additions & 0 deletions

research/specs/015-recur-alpha.md
# Spec 015 — Recur-Alpha learnable per-pass blending (port from #1714)

**Slug:** `recur-alpha`
**Created:** 2026-04-21
**Links to idea:** `research/ideas/recurrence-parallel-literature.md`.
## Hypothesis

In #1736's Loop345 (layers 3-5 × 3 passes = 17 virtual layers), every pass fully commits its block output to the residual stream — there is no learned control over how much each pass contributes. Recur-Alpha adds a learnable blend scalar per (non-first pass, looped layer), initialized to zero:

```
y = block(x_current)
x_new = α × y + (1 − α) × x_current
```

- At α=0: the pass is a pure passthrough (block output ignored, gradient to the block zeroed)
- At α=1: standard Loop345 behavior
- At α∈(0,1): partial commitment

The model learns α via gradient descent. If extra passes carry useful signal, α moves toward 1; if not, α stays near 0, effectively opting that pass out of recurrence.
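To make the mechanism concrete, here is a minimal PyTorch sketch under stated assumptions: `AlphaBlendedLoop`, `num_extra_passes`, and the loop structure are illustrative stand-ins, not the patch's actual identifiers.

```python
import torch
import torch.nn as nn

class AlphaBlendedLoop(nn.Module):
    """Illustrative: looped layers with a learnable per-(pass, layer) blend."""

    def __init__(self, blocks: nn.ModuleList, num_extra_passes: int = 2):
        super().__init__()
        self.blocks = blocks
        # Identity at init: all alphas start at exactly zero.
        # 2 extra passes x 3 looped layers = 6 scalars in the spec's setup.
        self.recur_alpha = nn.Parameter(torch.zeros(num_extra_passes, len(blocks)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First pass commits fully, as in baseline Loop345.
        for block in self.blocks:
            x = block(x)
        # Extra passes blend: x_new = alpha * y + (1 - alpha) * x_current.
        for p in range(self.recur_alpha.shape[0]):
            for i, block in enumerate(self.blocks):
                y = block(x)
                a = self.recur_alpha[p, i]
                x = a * y + (1.0 - a) * x
        return x
```

At init the blend reduces to `x = 0·y + 1·x`, so the module's output equals the baseline's; gradient still reaches each α through the `y − x` direction, which is how the model can learn to opt in.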
**Source:** #1714 (Anakintano, Apr 18) tested this on a simpler pre-#1736 stack and got 1.0857 pre-TTT. Their compute grant ran out before the phased-TTT eval, so **Recur-Alpha's composition with #1736's full phased-TTT / CaseOps / gates / XSA stack has never been measured.** We are uniquely positioned to fill this gap.
## Baseline

Spec 008's seed-42 val_bpb (`runs/008-1736-reproduction/seed_42/final.json`) = **1.0697** (endpoint bare, screening mode).
## Expected Δ

Asymmetric outcome distribution:

- **−0.001 to −0.003 bpb** (best case): α moves to useful values, recurrence becomes more efficient
- **Null (±0.0005)**: α stays near 0 or 1, whichever is optimal; no effective change
- **Very unlikely**: regression — identity-at-init plus the small param count blocks catastrophic pathways

Rationale for the thin range: #1714's 1.0857 was a gain of roughly 0.002 over their ~1.08 baseline. Porting onto #1736's stronger base retains at most half that gain, because (a) other architectural levers already capture some of the benefit, and (b) TTT absorbs upstream deltas, per the spec 010 finding.
## Thoughts (rationale + risks)

### Why this is the strongest remaining recurrence lever

Per `research/ideas/recurrence-parallel-literature.md`, three previously proposed recurrence experiments have been **directly tested on this stack** and shelved:

- Earlier activation (#1726): 0.15 → +0.050 worse; #1739 step-0 catastrophic
- Smooth-vs-hard schedule (#1663): no difference
- Position shift / range expansion (#1726): layers 5-6 +0.006 worse, layers 2-7 +0.163 worse

Recur-Alpha is the one recurrence-class lever that (a) has positive evidence elsewhere (#1714), (b) has NOT been composed with #1736's full stack, and (c) has identity-at-init safety properties.
### What makes it safe

1. **Identity at init.** All 6 alphas start at 0. The first forward pass is behaviorally equivalent to baseline, with all extra passes contributing zero (passthrough). If the model never learns to move α, the worst case is "training under an effective NUM_LOOPS=0 regime" — a known benign state. (A verification sketch follows this list.)
2. **Small parameter count.** 6 scalars total, 24 bytes quantized. No artifact-budget concern.
3. **No compute overhead.** The blend is 6 scalar multiplies into the residual stream per forward pass. Negligible.
4. **Torch.compile friendly.** Python-level conditionals on precomputed static lists; integers known at trace time.
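One cheap way to exercise the identity-at-init property during the smoke run, sketched as a hypothetical check (not part of the patch; the `recur_alpha` attribute access is assumed from the code-changes section below):

```python
import torch

@torch.no_grad()
def check_identity_at_init(model_alpha, model_baseline, batch):
    """Verify the alpha-blended forward matches baseline while alphas are 0."""
    assert torch.all(model_alpha.recur_alpha == 0.0), "alphas must still be at init"
    out_alpha = model_alpha(batch)
    out_base = model_baseline(batch)
    # allclose rather than strict equality: tolerate floating-point nondeterminism.
    assert torch.allclose(out_alpha, out_base, atol=1e-5), (
        "alpha=0 forward diverged from baseline; blend wiring is wrong"
    )
```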
### What makes it interesting

Recur-Alpha turns "do we loop?" from a training-time hyperparameter (currently forced: NUM_LOOPS=2) into a *learned decision* per pass. Three diagnostic outcomes:

| Final α pattern | Interpretation |
|---|---|
| All near 0 | Model opted out of recurrence; extra passes contribute nothing useful |
| All near 1 | Standard Loop345 behavior is already optimal; α flexibility unused |
| Mixed / intermediate | Model finds partial commitment useful on some passes |

All three outcomes teach us something concrete about whether our current recurrence config is well tuned.
### Optional p2p cosine diagnostic

Separately env-gated (`RECUR_DIAG_P2P_COS=1`). Computes the cosine similarity between consecutive pass deltas for each looped layer, logged alongside α. It tells us whether pass outputs point in similar directions (redundancy, in which case cross-pass XSA is the next research question) or diverse directions (no redundancy, pointing to a different research direction). A sketch of the computation follows.

**Off by default for this run.** Reason: the diagnostic uses a Python dict (`self._diag_prev_deltas`) mutated inside `forward_logits`, which may not compose cleanly with `torch.compile(fullgraph=True)`. If we want this data, we would either flip the flag and accept a potential compile fallback, or do a follow-up run.
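For concreteness, a minimal sketch of what the diagnostic would compute, assuming a pass delta is the change a pass makes to the residual stream (`y − x`, flattened); the function name and delta definition are assumptions, not the patch's exact code:

```python
import torch
import torch.nn.functional as F

def p2p_cosine(prev_delta: torch.Tensor, cur_delta: torch.Tensor) -> float:
    """Cosine similarity between consecutive pass deltas for one looped layer.

    Values near 1 mean consecutive passes push the residual stream in similar
    directions (redundancy); values near 0 mean they are doing different work.
    """
    return F.cosine_similarity(prev_delta.flatten(), cur_delta.flatten(), dim=0).item()
```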
## Accept criteria

- Training completes without NaN / divergence.
- α gradient norms are non-zero (the optimizer is actually updating α).
- Endpoint bare val_bpb measured at `stopping_early: wallclock_cap`.
- **Decision criterion** (restated as a helper sketch after this list):
  - Δ ≤ −0.001 → promote; optionally enable the p2p diagnostic in a follow-up; consider 3-seed confirmation + a full TTT run
  - Δ ∈ (−0.001, −0.0003] → weak positive; weigh against the cost of 3 seeds
  - Δ ∈ (−0.0003, +0.001) → null; shelve for this push, document the α trajectory
  - Δ > +0.001 → regression (unexpected given identity-at-init); investigate
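A small helper restating the bands above, for the run notes (hypothetical, not part of the patch):

```python
def decision(delta_bpb: float) -> str:
    """Map screening delta (candidate val_bpb minus baseline 1.0697) to an action."""
    if delta_bpb <= -0.001:
        return "promote; consider 3-seed confirmation + full TTT run"
    if delta_bpb <= -0.0003:
        return "weak positive; weigh against cost of 3 seeds"
    if delta_bpb < 0.001:
        return "null; shelve for this push, document alpha trajectory"
    return "regression; unexpected given identity-at-init, investigate"
```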
## Config diff vs spec 008

```
RECUR_ALPHA_ENABLED=1
TRAIN_LOG_EVERY=100  # increased from the default 500 for diagnostic resolution
```

Optional:

```
RECUR_DIAG_P2P_COS=1  # off by default for this run, see reasoning above
```

No other changes.
## Code changes

- **Branch:** `exp/recur-alpha` (worktree at `worktrees/recur-alpha/`).
- **Commit:** `a9aa141`.
- **Patch target:** `records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py`.
- **Patch scope:** 132 insertions, 3 deletions (the 3 deletions come from a cosmetic restructure of `x → x_before, x_new`). Components:
  - 2 new `Hyperparameters` fields
  - `GPT.__init__`: recur_alpha Parameter + precomputed encoder/decoder alpha_info lists (using shared visit-count state spanning encoder + decoder)
  - `forward_logits`: the encoder loop and the decoder single-lane path apply alpha-blending when configured
  - `Optimizers.__init__`: recur_alpha routed to the scalar AdamW
  - Per-step logging of α values / grad norm / p2p cos (when enabled)
  - Startup log echoes the config
- **Default-off invariant:** with `RECUR_ALPHA_ENABLED=0` (unset), all new code paths guard on `self.recur_alpha is None` and fall through to baseline logic. Verified byte-equivalent to the original. The guard pattern is sketched below.
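The default-off guard, sketched under assumptions (the surrounding loop structure and index names are invented for illustration; only the `self.recur_alpha is None` guard comes from the spec):

```python
# Inside the looped-layer path of forward_logits (structure illustrative):
y = block(x)
if self.recur_alpha is None:
    # RECUR_ALPHA_ENABLED=0 (unset): baseline Loop345, the pass commits fully.
    x = y
else:
    # Learned per-(pass, layer) blend; alpha starts at 0 (identity at init).
    a = self.recur_alpha[pass_idx, layer_idx]
    x = a * y + (1.0 - a) * x
```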
## Hardware ladder

- [x] **2×H100 smoke** (~5 min, ~$1): correctness only, 500 steps; watch for NaN and confirm α grad norms are non-zero. `ITERATIONS=500 RECUR_ALPHA_ENABLED=1 torchrun --nproc_per_node=2 train_gpt.py`. Do NOT read val_bpb.
- [x] **8×H100 screening run** (~$5, seed 42): endpoint bare val_bpb, no TTT/GPTQ/sliding. Primary measurement.
- [x] **(Conditional)** If the screen shows Δ ≤ −0.001 → **8×H100 full run** (~$20) for TTT confirmation + a proper submission number.
### Pre-registered expectations

Unlike BigramHash (zero-init projection, late divergence), Recur-Alpha's α starts at zero and we expect:

| Step range | Expected behavior |
|---|---|
| 0–300 | train_loss near-identical to spec 008 at matched step (α=0 means no recurrence contribution) |
| 300–1500 | If recurrence is useful, α starts drifting from 0; small grad norms build up |
| 1500–3500 | First real signal on whether the model wants α>0. Check α values + p2p cos. |
| 3500–4500 | α trajectory stabilizes; final values inform interpretation |
| Endpoint | Δ measured against spec 008's 1.0697 |

Surprises that would indicate a bug:

- α moves to negative values (it should converge positive, if anywhere)
- α grad norms exactly zero for many steps (optimizer not registering the parameter)
- train_loss significantly worse than spec 008 in the first 500 steps (identity-at-init should prevent this)
### Early-stop guidance

Same joint-executor+user pattern as prior specs. Automatic kill on NaN / inf / step-time blow-up. Joint decision on "train_loss much worse than spec 008 across multiple late-training log entries." Default to finishing when ambiguous: α staying at 0 is an INFORMATIVE null, not a failure.
## Seed plan

Single seed (42) for the screen. 3-seed confirmation only if Δ ≤ −0.001.
## Inputs

- Data: same CaseOps dataset as spec 008
- Tokenizer: bundled with the #1736 submission dir
- Hotstart: none, full from-scratch training
## Execution protocol

```bash
cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT

mkdir -p /workspace/runs/015-recur-alpha/seed_42

NCCL_NET=Socket DATA_DIR=/workspace/data \
ARTIFACT_DIR=/workspace/runs/015-recur-alpha/seed_42 \
CASEOPS_ENABLED=1 \
PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \
EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
MATRIX_LR=0.026 \
GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \
RECUR_ALPHA_ENABLED=1 \
TRAIN_LOG_EVERY=100 \
SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  > /workspace/runs/015-recur-alpha/seed_42/train.log 2>&1
```

Expected startup log line:
`recur_alpha: enabled=True num_loops=2 loop_start=3 loop_end=5 diag_p2p_cos=False`

Example of an expected log entry around step 2000:

```
2000/20000 train_loss: 2.85 train_time: 4.2m tok/s: 120000
recur_alpha: values=[[0.02, 0.01, 0.03], [0.00, 0.01, 0.02]] grad_norm=0.0008
```
## Checkpoints / artifacts to emit

- `final_model.pt` (pre-GPTQ FP) — standard, reusable for analysis
- `train.log` (~50 log lines with the α trajectory)
- `screen_endpoint.txt` snapshot
- `notes.md` execution narrative

**No intermediate model checkpoints** for this first run. We can add them if the α trajectory reveals something requiring mid-training inspection.
## Stop-early criteria

- NaN / inf in train_loss → halt
- Step time > 2× spec 008 → halt (indicates a compile failure or unexpected overhead)
- α grad norms exactly zero for 5+ consecutive log entries → halt; optimizer routing is broken (a watcher sketch follows)
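The first and third rules are mechanical enough to automate. A hypothetical log-watcher sketch, assuming the log format shown in the execution protocol above (the step-time rule is omitted because it needs spec 008's reference timing):

```python
import math
import re

def should_halt(log_lines: list[str], zero_grad_limit: int = 5) -> str | None:
    """Apply the stop-early rules to train.log entries (sketch, format assumed)."""
    zero_grad_streak = 0
    for line in log_lines:
        loss = re.search(r"train_loss: (\S+)", line)
        if loss and not math.isfinite(float(loss.group(1))):
            return "halt: NaN/inf in train_loss"
        grad = re.search(r"grad_norm=(\S+)", line)
        if grad:
            zero_grad_streak = zero_grad_streak + 1 if float(grad.group(1)) == 0.0 else 0
            if zero_grad_streak >= zero_grad_limit:
                return "halt: alpha grad norms exactly zero; optimizer routing broken"
    return None
```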
## Cost estimate

| Item | Cost |
|---|---|
| 2×H100 smoke | ~$1 |
| 8×H100 screening run | ~$5 |
| **First-pass total** | **~$6** |
| (Conditional) 8×H100 full run with TTT | ~$20 |
| (Conditional) 3-seed confirmation | ~$30 additional |
## Open questions for interview

1. **Should we enable `RECUR_DIAG_P2P_COS=1` in this run?** Plan: off the first time. Enabling might cause torch.compile's fullgraph mode to fall back to eager (~15% slower). If we absolutely want the cross-pass XSA data from this run, flip it on and accept the potential slowdown.
2. **Does the `_diag_prev_deltas` dict mutation work under torch.compile?** Unknown until tested; the smoke run will reveal it. If it crashes, disable the p2p diag for the run.
3. **Will compile time surprise us?** The patch adds a few Python-level conditionals to the forward, so torch.compile has to re-trace. First-step compile time might be ~2× normal. Not a correctness issue; just a wallclock minute or two eaten upfront.
## What this spec does NOT do

- Does not change the recurrence position (proven bad by #1726)
- Does not change the activation schedule (proven bad/irrelevant by #1663/#1726)
- Does not implement cross-pass XSA (deferred as a follow-up if the p2p cos diagnostic reveals stationarity)
- Does not touch parallel residuals or skip connections
- Does not change the main model architecture in any way beyond adding 6 scalars + the blend op
- Does not run 3 seeds (single-seed screen only)
