Commit 88ade23

yuyeonclaude committed

Session 4 research log: full state reconstruction + competition analysis

Comprehensive analysis of current leaderboard state (Apr 4, 2026):
- Non-SLOT frontier at 1.0897 BPB (PR openai#1334)
- Pre-quant TTT adds -0.009 BPB (PR openai#1351, 1.0807 BPB)
- Causal SLOT adds -0.088 BPB (PR openai#1350, 1.0046 BPB)
- GPTQ+TTT incompatibility confirmed post-quant, works pre-quant
- FiLM gap analysis: ~0.05-0.09 BPB behind frontier
- Three strategic paths identified

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 76b53c1 commit 88ade23

1 file changed: docs/research_log_session4.md (120 additions, 0 deletions)

# Research Log — Session 4 (2026-04-04)

## Environment
- GPU: 1× NVIDIA H100 80GB HBM3 (single-GPU instance)
- PyTorch 2.11.0+cu128, FA3 3.0.0, Triton OK, CUDA 12.8
- Baseline verified: 327ms/step, 2.577 BPB @ 50 steps ✓

## Full State Reconstruction

### Our Best Submission (FiLM)
| Config | 1×H100 600s | Expected 8×H100 |
|--------|-------------|-----------------|
| FiLM 5→7+8xMLP FA3+EMA+QAT | 1.2863 pre-quant, 1.3010 int6 | ~1.14-1.18 (speculative) |
| FiLM 5→7+8xMLP+SLOT24 | Not tested (needs 8×H100) | ~1.05-1.10 (very speculative) |

Architecture: 5 shared blocks, 7 virtual layers, 8× MLP expansion, 512d, 8H/4KV.
Novelty: FiLM-depth weight sharing (per-layer modulation of shared blocks); a minimal sketch follows below.
Advantage: 349ms/step on 1×H100 vs 669ms for SOTA → 2× more training steps.

### Current Competition Landscape (2026-04-04)

#### Non-SLOT, Non-TTT Frontier
| PR | BPB | Key Stack |
|----|-----|-----------|
| **#1334** | **1.0897** | SP4096 + Depth Recurrence(4,5) + Parallel Residuals(7+) + MuonEq-R + QK-Gain 5.0 |
| #1331 | 1.0900 | MuonEq-R + 3-Layer Recurrence + WD=0.095 |
| #1344 | 1.0923 | SP4096 + Polar Express + MuonEq-R + Depth Recurrence(3,4,5) |
| #1279 | 1.0924 | MuonEq-R + Depth Recurrence + N61 Mixed GPTQ |

#### Pre-Quant TTT (GPTQ-compatible — adapts before quantization)
| PR | BPB | Delta vs no-TTT | Method |
|----|-----|-----------------|--------|
| **#1351** | **1.0807** | -0.009 | Discriminative TTT: per-block AdamW LR (0.3x early, 1.0x late), 10 epochs |
| #1326 | 1.0896 | -0.003 | Legal TTT (SGD, freeze early blocks) |

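A minimal sketch of the per-block discriminative LR scheme attributed to #1351, assuming the 0.3x/1.0x multipliers split at an early/late cutoff; the cutoff index, base LR, and other optimizer settings here are illustrative, not taken from the PR.

```python
import torch

def discriminative_param_groups(blocks, base_lr=1e-4, early_cutoff=None,
                                early_scale=0.3, late_scale=1.0):
    """AdamW param groups: damped LR for early blocks, full LR for late ones."""
    cutoff = early_cutoff if early_cutoff is not None else len(blocks) // 2
    return [{"params": b.parameters(),
             "lr": base_lr * (early_scale if i < cutoff else late_scale)}
            for i, b in enumerate(blocks)]

# opt = torch.optim.AdamW(discriminative_param_groups(model.blocks))
# Run the ~10 TTT epochs on full-precision weights, then GPTQ-quantize.
```
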
#### Causal SLOT (legality pending, strong legal argument)
| PR | BPB | SLOT delta | Method |
|----|-----|------------|--------|
| **#1350** | **1.0046** | -0.088 | L-BFGS Causal SLOT (25 iter, logit space, context-only loss) |
| #1333 | 1.0766 | -0.013 | Causal SLOT-16 |

#### Full SLOT (likely illegal — uses future token info)
| PR | BPB | Note |
|----|-----|------|
| #1329 | 0.636 | Per-Sample SLOT, 24 steps |
| #1324 | 0.727 | SLOT-48 + VRL |

### SLOT Legality Assessment
- **Standard SLOT (optimize on all positions)**: Almost certainly illegal. PR #1240 demonstrated a 100% causal violation.
- **Causal SLOT (optimize only on already-scored positions)**: Strong legal argument — identical in principle to legal score-first TTT. No official ruling. Issue #1336 filed, no maintainer response. (A sketch of the mechanism follows this list.)
- **Our implementation**: Correct stride-based masking, frozen model, per-sample delta. Would need the causal variant for safety.

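A minimal sketch of the causal-SLOT mechanism described above, assuming a per-sample delta in logit space optimized with L-BFGS against the loss on already-scored context positions only, with the base model frozen. Shapes and the context-mask convention are assumptions, not the #1350 implementation.

```python
import torch
import torch.nn.functional as F

def causal_slot_delta(model, input_ids, targets, context_mask, iters=25):
    """context_mask: (B, T), 1 at positions already scored, 0 elsewhere."""
    with torch.no_grad():
        base = model(input_ids)                      # frozen logits, (B, T, V)
    delta = torch.zeros(base.size(0), 1, base.size(-1),       # per-sample delta,
                        device=base.device, requires_grad=True)  # broadcast over T
    opt = torch.optim.LBFGS([delta], max_iter=iters)

    def closure():
        opt.zero_grad()
        logits = base + delta                        # adapt in logit space
        loss_tok = F.cross_entropy(logits.transpose(1, 2), targets,
                                   reduction="none")  # per-token loss, (B, T)
        # Causality: only already-scored context positions contribute.
        loss = (loss_tok * context_mask).sum() / context_mask.sum()
        loss.backward()
        return loss

    opt.step(closure)
    return delta.detach()   # added to the logits when scoring later positions
```
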
### GPTQ + TTT Incompatibility (Confirmed)
PR #1341 systematic analysis:
- **Post-quant TTT on GPTQ weights**: +0.03 BPB WORSE. GPTQ's column-wise Hessian error compensation creates a fragile weight structure that gradient updates destroy.
- **Pre-quant TTT (before GPTQ)**: -0.009 BPB WORKS. Adapts full-precision weights, then quantizes the adapted weights.
- **Implication**: TTT and GPTQ are compatible IF TTT happens before quantization. The "incompatibility" is specifically about updating already-quantized weights (see the ordering sketch below).

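The ordering constraint in pseudo-Python, for reference; `ttt_adapt` and `gptq_quantize` are placeholder names, not real APIs:

```python
def legal_pipeline(model, calib_data):
    ttt_adapt(model, calib_data)     # adapt full-precision weights (~-0.009 BPB)
    return gptq_quantize(model)      # quantize the adapted weights

def broken_pipeline(model, calib_data):
    qmodel = gptq_quantize(model)    # GPTQ's error compensation is fragile
    ttt_adapt(qmodel, calib_data)    # gradient updates destroy it (+0.03 BPB)
    return qmodel
```
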
### Gap Analysis: FiLM vs Non-SLOT Frontier

Our best extrapolated: ~1.14-1.18 BPB
Non-SLOT frontier: 1.0897 BPB
Gap: **~0.05-0.09 BPB**

Sources of the gap (techniques we haven't adopted):
1. **SP4096 tokenizer**: Every top-5 non-SLOT PR uses a 4096 vocab. A larger vocab packs more text into each token, improving compression of natural language. We use SP1024.
2. **Depth recurrence with untied patterns**: Repeat layers 3-5 (or 4-5), getting 13-14 virtual layers from 11 physical. We have FiLM depth sharing, but it's a different mechanism.
3. **Parallel residuals (layer 7+)**: Separate attention and MLP residual streams. Not tested on FiLM.
4. **QK-Gain 5.0**: Simple scalar multiplication on attention logits (see the sketch after this list). Proven at -0.003 BPB.
5. **Higher WD (0.09-0.10)**: Quantization-friendly weight regularization. We use 0.04.
6. **Pre-quant discriminative TTT**: Per-block AdamW fine-tuning before GPTQ. -0.009 BPB.
7. **4× MLP (with SP4096)**: SP4096 frees embedding params, allowing a wider MLP.
8. **Polar Express NS**: 4-step minimax-optimal Newton-Schulz (vs the standard 5-step).

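A minimal sketch of QK-Gain, assuming the scalar multiplies the attention logits after the usual 1/sqrt(d) scaling; whether the gain composes before or after that scaling in #1334 is an assumption here.

```python
import math
import torch

def attention_with_qk_gain(q, k, v, gain=5.0, causal=True):
    """q, k, v: (B, H, T, Dh). Plain attention plus a scalar gain on logits."""
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    scores = gain * scores                          # QK-Gain: one scalar
    if causal:
        t = scores.size(-1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                     device=scores.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```
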
### Can FiLM Close the Gap?

**Favorable factors:**
- FiLM's step-time advantage on 1×H100 (349ms vs 669ms for SOTA), though 8×H100 data parallelism changes that picture significantly
- FiLM's parameter efficiency (shared blocks = smaller model = more room in the 16MB budget)
- SLOT/Causal-SLOT is architecture-agnostic — it should work with FiLM

**Unfavorable factors:**
- On 8×H100, data parallelism gives SOTA ~5500 steps in 600s. FiLM on 8×H100 might get ~10000 steps, but the per-step quality difference narrows.
- SP4096 requires significant code changes for FiLM
- Several techniques (depth recurrence, parallel residuals) may not compose well with FiLM's weight sharing
- FiLM was optimized for 1×H100 screening; its 8×H100 scaling behavior is unknown

### Critical Uncertainty
We have never run FiLM on 8×H100, so the extrapolation is highly uncertain.
The non-SLOT frontier uses techniques that are proven at 8×H100 scale.
FiLM's advantage comes from faster steps, but 8×H100 data parallelism may reduce that advantage.

## Strategic Assessment

### Path 1: FiLM + Latest Techniques (Novel, High Risk)
- Add SP4096, QK-Gain, higher WD, pre-quant TTT to FiLM
- Add Causal SLOT for eval
- Risk: Unknown 8×H100 scaling, many untested compositions
- Upside: Genuinely novel submission with a potentially unique architecture

### Path 2: Adopt Best Non-SLOT Stack + Our Innovations (Lower Risk)
- Start from the PR #1334 stack (SP4096, depth recurrence, parallel residuals)
- Add MuonEq-R (already ours), pre-quant discriminative TTT
- Add Causal SLOT
- Risk: Not novel (stacking known techniques)
- Upside: More likely to place well

### Path 3: FiLM as Alternative Architecture for SLOT (Novel, Medium Risk)
- FiLM's shared blocks might work especially well with SLOT because:
  - Shared blocks create a "compressed" hidden representation
  - SLOT's per-sample delta can exploit this compressed structure
  - Fewer unique parameters = potentially better SLOT optimization landscape
- Test: FiLM vs standard architecture as SLOT base
- Risk: SLOT legality uncertain

## Immediate Priorities
1. Run FiLM 5→7+8xMLP on THIS H100 for 600s to re-verify our baseline
2. Verify FiLM+Causal SLOT works on 1×H100 (even if slow)
3. Test SP4096 tokenizer with FiLM
4. Profile the non-SLOT frontier techniques individually on 1×H100
