Commit 9a51da1

exp_072_slot_qkgain: Score-first SLOT (FAILED, 1.1493 BPB)
Embedding-space delta optimized with 8 AdamW steps per chunk. Worse than both sliding window (1.1246) and naive eval (1.1479). Lesson: SLOT needs L-BFGS in logit space (see exp_075), not AdamW in embedding space. 8 steps underfits, and the embedding-space loss surface is non-convex. Also bumped QK-Gain 1.5 -> 4.0 (free -0.006 BPB from PR openai#1125).
1 parent 520b587 commit 9a51da1

# exp_072_slot_qkgain — Score-First SLOT (FAILED, 1.1493 BPB)

**Result**: 1.1493 BPB — **worse** than the sliding-window baseline (1.1246) and naive eval (1.1479).

## Hypothesis

SLOT (Stochastic Latent Optimization at Test-time) with a **score-first** ordering should be legal and should improve eval BPB:

1. Score chunk N with delta from chunk N-1 (first chunk: delta=0)
2. Then optimize delta on chunk N (for use on chunk N+1)

This follows the same causal information flow as legal TTT.
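
The ordering above can be sketched as a minimal loop. This is illustrative only: `score_fn` and `optimize_fn` are hypothetical stand-ins for the real eval and delta-update code in `train_gpt.py`; the point is that chunk N is always scored *before* its own delta is fit, so scoring never sees its own chunk's adaptation.

```python
def score_first_slot(chunks, score_fn, optimize_fn):
    """Score-first SLOT: chunk N is scored with the delta fit on chunk N-1."""
    delta = 0.0  # first chunk is scored with delta = 0 (no adaptation)
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        bits, n_tok = score_fn(chunk, delta)  # uses only past information
        total_bits += bits
        total_tokens += n_tok
        delta = optimize_fn(chunk, delta)     # fit now, used on chunk N+1
    return total_bits / total_tokens          # aggregate BPB
```

Swapping the two lines inside the loop (optimize, then score) would leak chunk N into its own score, which is exactly what this ordering avoids.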

## Implementation

- Delta in **embedding space**: 512-dim vector added after token embedding + smear
- 8 AdamW steps per chunk, LR 0.01
- Non-overlapping chunks (~59 batches, ~83 s eval)
- QK-Gain bumped from 1.5 → 4.0 (free −0.006 BPB from PR #1125)
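
A minimal sketch of the inner loop, under stated assumptions: `fit_embedding_delta` and `head` are hypothetical names (the real code lives inside `train_gpt.py`'s forward pass), but the shape of the idea matches the bullets above: one 512-dim delta shared across positions, 8 AdamW steps at LR 0.01.

```python
import torch
import torch.nn.functional as F

def fit_embedding_delta(embeds, targets, head, steps=8, lr=0.01):
    """Fit one embedding-space delta (shared across positions) with AdamW.

    embeds:  (T, d_model) token embeddings for the chunk (post-smear)
    targets: (T,) next-token ids
    head:    callable mapping embeddings to vocab logits (stand-in for the model)
    """
    delta = torch.zeros(embeds.shape[-1], requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):
        logits = head(embeds + delta)  # delta broadcasts over the sequence
        loss = F.cross_entropy(logits, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```

With only 8 steps from a zero initialization, the delta barely moves, which is the underfitting failure mode described below.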

## Why it failed

Embedding-space deltas optimized with AdamW for only 8 steps underfit, and the
embedding-space loss surface is non-convex. The top SLOT submissions
([PR #1350](https://github.com/openai/parameter-golf/pull/1350)) instead
optimize a **logit-space** delta with **L-BFGS (25 iters, history=20)**,
reaching ~1.005 BPB, a fundamentally different approach.

See `exp_075_lbfgs_slot/` for the L-BFGS logit-space version.
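
For contrast, a sketch of the logit-space alternative (this is not PR #1350's actual code; `fit_logit_delta` is a hypothetical name, and the real version presumably operates on the model's output logits). The key property: cross-entropy of `logits + delta` is convex in `delta` (log-sum-exp minus a linear term), so a quasi-Newton method like L-BFGS converges quickly, unlike AdamW on the non-convex embedding-space objective.

```python
import torch
import torch.nn.functional as F

def fit_logit_delta(logits, targets, iters=25, history=20):
    """Fit a vocab-sized logit-space delta with L-BFGS (convex in delta)."""
    delta = torch.zeros(logits.shape[-1], requires_grad=True)
    opt = torch.optim.LBFGS([delta], max_iter=iters, history_size=history)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits + delta, targets)
        loss.backward()
        return loss

    opt.step(closure)  # one call runs up to `iters` L-BFGS iterations
    return delta.detach()
```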

## Running

```bash
GPTQ_ENABLED=1 GPTQ_N_BATCHES=64 SLOT_ENABLED=1 SLOT_STEPS=8 SLOT_LR=0.01 \
SLOT_BATCH_SEQS=64 TTT_ENABLED=0 EVAL_STRIDE=64 SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
