# Research Log — Session 4 (2026-04-04)

## Environment
- GPU: 1× NVIDIA H100 80GB HBM3 (single GPU instance)
- PyTorch 2.11.0+cu128, FA3 3.0.0, Triton OK, CUDA 12.8
- Baseline verified: 327ms/step, 2.577 BPB @ 50 steps ✓

## Full State Reconstruction

### Our Best Submission (FiLM)
| Config | 1×H100 600s | Expected 8×H100 |
|--------|-------------|-----------------|
| FiLM 5→7+8xMLP FA3+EMA+QAT | 1.2863 pre-quant, 1.3010 int6 | ~1.14-1.18 (speculative) |
| FiLM 5→7+8xMLP+SLOT24 | Not tested (needs 8×H100) | ~1.05-1.10 (very speculative) |

Architecture: 5 shared blocks, 7 virtual layers, 8× MLP expansion, 512d, 8H/4KV.
Novelty: FiLM-depth weight sharing (per-layer modulation of shared blocks).
Advantage: 349ms/step on 1×H100 vs 669ms for SOTA → ~2× more training steps.
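For reference, a minimal sketch of the FiLM-depth idea as described above: a small pool of shared blocks is reused across a deeper virtual stack, and each virtual layer gets its own cheap FiLM (scale, shift) parameters that modulate the shared computation. The 5→7 reuse pattern, the FiLM application point, and the MLP-only block (attention omitted for brevity) are illustrative assumptions, not the actual submission code.

```python
import torch
import torch.nn as nn

D, PHYSICAL, VIRTUAL = 512, 5, 7

class SharedBlock(nn.Module):
    """One physical block; reused at several virtual depths."""
    def __init__(self, d=D):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 8 * d), nn.GELU(), nn.Linear(8 * d, d))  # 8x MLP

    def forward(self, x, gamma, beta):
        # FiLM: per-virtual-layer affine modulation of the normalized input
        return x + self.mlp(self.norm(x) * gamma + beta)

blocks = nn.ModuleList(SharedBlock() for _ in range(PHYSICAL))
# Per-virtual-layer FiLM parameters: tiny next to a full extra block's weights
gammas = nn.Parameter(torch.ones(VIRTUAL, D))
betas = nn.Parameter(torch.zeros(VIRTUAL, D))
layer_to_block = [0, 1, 2, 3, 4, 2, 3]  # assumed 5->7 reuse pattern

x = torch.randn(2, 16, D)
for layer, b in enumerate(layer_to_block):
    x = blocks[b](x, gammas[layer], betas[layer])
```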

### Current Competition Landscape (2026-04-04)

#### Non-SLOT, Non-TTT Frontier
| PR | BPB | Key Stack |
|----|-----|-----------|
| **#1334** | **1.0897** | SP4096 + Depth Recurrence(4,5) + Parallel Residuals(7+) + MuonEq-R + QK-Gain 5.0 |
| #1331 | 1.0900 | MuonEq-R + 3-Layer Recurrence + WD=0.095 |
| #1344 | 1.0923 | SP4096 + Polar Express + MuonEq-R + Depth Recurrence(3,4,5) |
| #1279 | 1.0924 | MuonEq-R + Depth Recurrence + N61 Mixed GPTQ |

#### Pre-Quant TTT (GPTQ-compatible — adapts before quantization)
| PR | BPB | Delta vs no-TTT | Method |
|----|-----|-----------------|--------|
| **#1351** | **1.0807** | -0.009 | Discriminative TTT: per-block AdamW LR (0.3x early, 1.0x late), 10 epochs |
| #1326 | 1.0896 | -0.003 | Legal TTT (SGD, freeze early blocks) |
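A hedged sketch of the per-block learning-rate scheme reported for #1351. The 0.3x/1.0x scales and 10 epochs come from the table; `model.blocks`, the base LR, and the early/late split point are illustrative assumptions.

```python
import torch

def discriminative_ttt_optimizer(model, base_lr=1e-4, early_frac=0.5):
    """One AdamW param group per block: 0.3x LR early, 1.0x LR late."""
    n = len(model.blocks)
    groups = []
    for i, block in enumerate(model.blocks):
        scale = 0.3 if i < n * early_frac else 1.0
        groups.append({"params": block.parameters(), "lr": base_lr * scale})
    return torch.optim.AdamW(groups)

# Run the TTT loop for 10 epochs over the adaptation data, then quantize
# (pre-quant ordering; see the GPTQ + TTT section below).
```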

#### Causal SLOT (legality pending, strong legal argument)
| PR | BPB | SLOT delta | Method |
|----|-----|------------|--------|
| **#1350** | **1.0046** | -0.088 | L-BFGS Causal SLOT (25 iter, logit space, context-only loss) |
| #1333 | 1.0766 | -0.013 | Causal SLOT-16 |

#### Full SLOT (likely illegal — uses future token info)
| PR | BPB | Note |
|----|-----|------|
| #1329 | 0.636 | Per-Sample SLOT, 24 steps |
| #1324 | 0.727 | SLOT-48 + VRL |

### SLOT Legality Assessment
- **Standard SLOT (optimize on all positions)**: Almost certainly illegal. PR #1240 proved 100% causal violation.
- **Causal SLOT (optimize only on already-scored positions)**: Strong legal argument — identical principle to legal score-first TTT. No official ruling. Issue #1336 filed, no maintainer response.
- **Our implementation**: Correct stride-based masking, frozen model, per-sample delta. Would need the causal variant for safety.
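A minimal sketch of the causal-SLOT constraint: the base model stays frozen, a per-sample delta is applied in logit space (as reported for #1350), and the adaptation loss is computed only on positions that have already been scored, so no future-token information leaks in. The function names, the L-BFGS settings beyond `max_iter=25`, and the 1-D token layout are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def frozen_logits(model, tokens):
    return model(tokens)  # (T, V); model parameters stay untouched

def causal_slot(model, tokens, scored_upto, iters=25):
    """Adapt a per-sample logit-space delta using only tokens[:scored_upto]."""
    logits = frozen_logits(model, tokens)
    delta = torch.zeros_like(logits, requires_grad=True)
    opt = torch.optim.LBFGS([delta], max_iter=iters, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        # Context-only loss: next-token NLL on already-scored positions only
        adapted = logits[: scored_upto - 1] + delta[: scored_upto - 1]
        loss = F.cross_entropy(adapted, tokens[1:scored_upto])
        loss.backward()
        return loss

    opt.step(closure)
    return logits + delta.detach()  # adapted logits for this sample
```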

### GPTQ + TTT Incompatibility (Confirmed)
PR #1341 systematic analysis:
- **Post-quant TTT on GPTQ weights**: +0.03 BPB WORSE. GPTQ's column-wise Hessian error compensation creates a fragile weight structure that gradient updates destroy.
- **Pre-quant TTT (before GPTQ)**: -0.009 BPB WORKS. Adapts the full-precision weights, then quantizes the adapted weights.
- **Implication**: TTT and GPTQ are compatible IF TTT happens before quantization. The "incompatibility" is specifically about updating already-quantized weights.
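A sketch of the ordering that works, per the analysis above: adapt the full-precision weights first, then quantize the adapted weights. A naive symmetric int6 round-to-nearest stands in for GPTQ purely to show the pipeline order; `calib_batches`, `loss_fn`, and the step counts are assumptions.

```python
import torch

def pre_quant_ttt(model, calib_batches, loss_fn, adapt_steps=10, lr=1e-4):
    """Step 1: TTT on full-precision weights (the order that works)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(adapt_steps):
        for batch in calib_batches:
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()

def int6_fake_quant_(model):
    """Step 2: THEN quantize the adapted weights (stand-in for GPTQ)."""
    qmax = 2 ** 5 - 1  # symmetric int6: levels in [-31, 31]
    for p in model.parameters():
        scale = p.detach().abs().max() / qmax
        p.data = (p.data / scale).round().clamp(-qmax, qmax) * scale

# Reversing the order (gradient steps on already-quantized weights) is the
# combination reported above as +0.03 BPB worse: the updates destroy GPTQ's
# column-wise error compensation.
```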

### Gap Analysis: FiLM vs Non-SLOT Frontier

Our best extrapolated: ~1.14-1.18 BPB
Non-SLOT frontier: 1.0897 BPB
Gap: **~0.05-0.09 BPB**

Sources of the gap (techniques we haven't adopted):
1. **SP4096 tokenizer**: Every top-5 non-SLOT PR uses a 4096 vocab. A larger vocab carries more information per token and covers more bytes per token, which typically improves bits-per-byte. We use SP1024.
2. **Depth recurrence with untied patterns**: Repeat layers 3-5 (or 4-5), getting 13-14 virtual layers from 11 physical. We have FiLM depth sharing, but it's a different mechanism.
3. **Parallel residuals (layer 7+)**: Separate attention and MLP residual streams. Not tested on FiLM.
4. **QK-Gain 5.0**: Simple scalar multiplier on attention logits. Proven at -0.003 BPB. (See the sketch of this and depth recurrence after this list.)
5. **Higher WD (0.09-0.10)**: Quantization-friendly weight regularization. We use 0.04.
6. **Pre-quant discriminative TTT**: Per-block AdamW fine-tuning before GPTQ. -0.009 BPB.
7. **4× MLP (with SP4096)**: SP4096 frees embedding params, allowing a wider MLP.
8. **Polar Express NS**: 4-step minimax-optimal Newton-Schulz (vs standard 5-step).
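A hedged sketch of two items from this list: depth recurrence (item 2) and QK-Gain (item 4). Only the repeat pattern (layers 3-5), the 11→14 virtual layer count, and the 5.0 gain come from the PRs above; the block internals, the recurrence insertion point, and the head layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

QK_GAIN = 5.0  # item 4: fixed scalar multiplier on attention logits

class Block(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.heads, self.dh = heads, d // heads
        self.qkv = nn.Linear(d, 3 * d)
        self.proj = nn.Linear(d, d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.heads, self.dh).transpose(1, 2)
        k = k.view(B, T, self.heads, self.dh).transpose(1, 2)
        v = v.view(B, T, self.heads, self.dh).transpose(1, 2)
        # QK-Gain: scale the attention logits by a constant before softmax
        att = (q @ k.transpose(-2, -1)) * (QK_GAIN / self.dh ** 0.5)
        att = att.masked_fill(torch.triu(torch.ones(T, T, dtype=torch.bool), 1), float("-inf"))
        y = (F.softmax(att, dim=-1) @ v).transpose(1, 2).reshape(B, T, D)
        x = x + self.proj(y)
        return x + self.mlp(x)

blocks = nn.ModuleList(Block() for _ in range(11))  # 11 physical layers
# Item 2, depth recurrence: revisit layers 3-5 once, 11 physical -> 14 virtual
schedule = list(range(11))
schedule[6:6] = [3, 4, 5]  # assumed insertion point; the PRs don't pin it down here

x = torch.randn(2, 16, 512)
for i in schedule:
    x = blocks[i](x)
```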

### Can FiLM Close the Gap?

**Favorable factors:**
- FiLM's step-time advantage (349ms vs 669ms per step on 1×H100), though 8×H100 data parallelism changes the picture significantly
- FiLM's parameter efficiency (shared blocks = smaller model = more room in 16MB)
- SLOT/Causal-SLOT is architecture-agnostic — should work with FiLM

**Unfavorable factors:**
- On 8×H100, data parallelism gives SOTA ~5500 steps in 600s. FiLM on 8×H100 might get ~10000 steps, but the relative step-count advantage narrows.
- SP4096 requires significant code changes for FiLM
- Several techniques (depth recurrence, parallel residuals) may not compose well with FiLM's weight sharing
- FiLM was optimized for 1×H100 screening; the 8×H100 scaling behavior is unknown

### Critical Uncertainty
We have never run FiLM on 8×H100, so the extrapolation is highly uncertain.
The non-SLOT frontier uses techniques that are proven at 8×H100 scale.
FiLM's advantage comes from faster steps, but 8×H100 data parallelism may reduce that advantage.

## Strategic Assessment

### Path 1: FiLM + Latest Techniques (Novel, High Risk)
- Add SP4096, QK-Gain, higher WD, pre-quant TTT to FiLM
- Add Causal SLOT for eval
- Risk: Unknown 8×H100 scaling, many untested compositions
- Upside: Genuinely novel submission with a potentially unique architecture

### Path 2: Adopt Best Non-SLOT Stack + Our Innovations (Lower Risk)
- Start from the PR #1334 stack (SP4096, depth recurrence, parallel residuals)
- Add MuonEq-R (already ours), pre-quant discriminative TTT
- Add Causal SLOT
- Risk: Not novel (stacking known techniques)
- Upside: More likely to place well

### Path 3: FiLM as Alternative Architecture for SLOT (Novel, Medium Risk)
- FiLM's shared blocks might work especially well with SLOT because:
  - Shared blocks create a "compressed" hidden representation
  - SLOT's per-sample delta can exploit this compressed structure
  - Fewer unique parameters = potentially better SLOT optimization landscape
- Test: FiLM vs standard architecture as the SLOT base
- Risk: SLOT legality uncertain

## Immediate Priorities
1. Run FiLM 5→7+8xMLP on THIS H100 for 600s to re-verify our baseline
2. Verify FiLM+Causal SLOT works on 1×H100 (even if slow)
3. Test SP4096 tokenizer with FiLM (see the training sketch below)
4. Profile the non-SLOT frontier techniques individually on 1×H100
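For priority 3, a minimal SentencePiece training sketch for an SP4096 vocab. The corpus path, BPE model type, and `byte_fallback` flag are assumptions about the tokenizer setup, not confirmed competition settings.

```python
import sentencepiece as spm

# Train a 4096-token tokenizer (assumed corpus path and settings)
spm.SentencePieceTrainer.train(
    input="train_corpus.txt",  # assumption: plain-text training corpus
    model_prefix="sp4096",
    vocab_size=4096,
    model_type="bpe",
    byte_fallback=True,        # avoid UNKs on raw bytes
)

sp = spm.SentencePieceProcessor(model_file="sp4096.model")
ids = sp.encode("example text", out_type=int)
```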