Commit 4ede046

resouer and claude committed
Record: L-BFGS Causal SLOT in Logit Space — val_bpb 1.0046 (3-seed mean)
3-seed mean 1.0046 (std 0.0003). Beats merged SOTA (1.1147) by 0.110. Novel: L-BFGS causal SLOT — optimizer (L-BFGS), space (logit), and constraint (causal, context-only positions). Passes flip test (PR openai#1240). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 9d070df commit 4ede046

7 files changed

Lines changed: 2800 additions & 0 deletions

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# Record: L-BFGS Causal SLOT in Logit Space — val_bpb 1.0046 (3-seed mean)

## Result

**val_bpb: 1.0046** (3-seed mean, std 0.0003) | ~15.8 MB | 8xH100 SXM

| Seed | BPB | val_loss | Artifact (bytes) |
|------|-----|----------|------------------|
| 1337 | 1.0043 | 1.6957 | 15,803,625 |
| 42 | 1.0048 | 1.6965 | 15,808,775 |
| 2025 | 1.0047 | 1.6964 | 15,794,277 |
| **Mean** | **1.0046** | **1.6962** | |

Delta vs merged SOTA (PR #1019, 1.1147): **-0.1101 BPB** (-0.110 nats, p < 0.001).

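The summary row can be re-derived from the per-seed numbers. The sketch below is a quick sanity check, not part of the submission: it assumes the reported std is the sample standard deviation (n-1), which is the variant that rounds to 0.0003, and the variable names are illustrative rather than taken from `train_gpt.py`.

```python
import statistics

# Per-seed results copied from the table above: seed -> (val_bpb, val_loss).
results = {
    1337: (1.0043, 1.6957),
    42: (1.0048, 1.6965),
    2025: (1.0047, 1.6964),
}
merged_sota_bpb = 1.1147  # PR #1019, as quoted above

bpbs = [bpb for bpb, _ in results.values()]
losses = [loss for _, loss in results.values()]

mean_bpb = statistics.mean(bpbs)
std_bpb = statistics.stdev(bpbs)  # sample std (n-1); an assumption about how std was computed
mean_loss = statistics.mean(losses)
delta_bpb = round(mean_bpb, 4) - merged_sota_bpb

print(f"mean BPB {mean_bpb:.4f}, std {std_bpb:.4f}, mean val_loss {mean_loss:.4f}")
print(f"delta vs merged SOTA: {delta_bpb:+.4f} BPB")
```
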
## Novelty: L-BFGS Causal SLOT in Logit Space

**Nearest comparable PR:** PR #1318 (L-BFGS SLOT in logit space, 1.0096 BPB)

**What we share:** L-BFGS optimizer for eval-time delta optimization, logit-space parameterization, focal loss on last 128 tokens per window, warm-start delta across windows, delta clamp +/-5.

**What is mechanistically different:** Our SLOT is **provably causal**. The loss is computed ONLY on already-scored context positions (tokens at indices < stride in each window). Standard SLOT (#1318, #1313, #1229) optimizes over ALL scored positions, including the newly scored tokens in the current window, so predictions at position t depend on tokens at positions t+1, t+2, ... (PR #1240 proved a 100% violation rate for standard SLOT).

Our causal constraint means:
- `P(x_t)` depends only on artifact `A` and prefix `x_1...x_{t-1}` (NoesisGenesis condition 1)
- The delta vector is optimized using gradients only from positions where the true token was already known before this window was scored
- Flip test: changing a target token in the scored region does NOT affect predictions at other positions (verified)

This is a new mechanism, not parameter tuning: the causal constraint changes the optimization landscape (fewer gradient sources per window), and L-BFGS's faster convergence is needed to compensate. AdamW causal SLOT achieves only -0.009 BPB; L-BFGS causal SLOT achieves -0.087 BPB (9.7x improvement).

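The flip test from the bullet list can be illustrated on a toy model. This is a minimal sketch, not the submission's evaluation code: the delta fit is replaced by a count-based logit bias (a stand-in for the L-BFGS optimization), and `fit_delta`, `predict`, `flip_test`, and the 6/2 context/scored split are hypothetical names and layout chosen for illustration.

```python
import math

VOCAB = 8  # toy vocabulary; the submission uses 1024

def fit_delta(targets, positions):
    """Stand-in for delta optimization: a logit bias from target counts at `positions`."""
    delta = [0.0] * VOCAB
    for pos in positions:
        delta[targets[pos]] += 1.0
    return delta

def predict(base_logits, delta):
    """Softmax of the base logits plus the shared delta, one row per position."""
    out = []
    for row in base_logits:
        shifted = [logit + d for logit, d in zip(row, delta)]
        m = max(shifted)
        exps = [math.exp(s - m) for s in shifted]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def flip_test(base_logits, targets, context, scored, causal):
    """Return True if flipping one scored target leaves every prediction unchanged."""
    fit_positions = context if causal else context + scored
    before = predict(base_logits, fit_delta(targets, fit_positions))
    flipped = list(targets)
    flipped[scored[0]] = (flipped[scored[0]] + 1) % VOCAB
    after = predict(base_logits, fit_delta(flipped, fit_positions))
    return before == after

# Toy window: 6 context positions (already scored) followed by 2 newly scored positions.
base_logits = [[0.1 * ((i + j) % VOCAB) for j in range(VOCAB)] for i in range(8)]
targets = [1, 3, 3, 5, 1, 0, 2, 7]
context, scored = list(range(6)), [6, 7]

causal_ok = flip_test(base_logits, targets, context, scored, causal=True)
noncausal_ok = flip_test(base_logits, targets, context, scored, causal=False)
print("causal passes flip test:", causal_ok)
print("non-causal passes flip test:", noncausal_ok)
```

In the causal variant the flipped target never enters the fit, so the predictions are identical by construction; in the non-causal variant the same flip changes the delta and therefore every prediction in the window.
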
## Technique Stack

| Component | Detail |
|-----------|--------|
| Base | PR #1019 fork (Full Hessian GPTQ, XSA-all, BigramHash 2048x128) |
| Training | Parallel Muon, ~87ms/step, ~6900 steps in 600s |
| Pre-quant TTT | AdamW, 6 epochs, lr=0.0005, freeze first 2 blocks |
| Quantization | Full Hessian GPTQ int6, damp=0.005, AR self-gen calibration |
| Config | QK_GAIN=5.0, WARMDOWN=4000 |
| **SLOT (novel)** | **L-BFGS (max_iter=25, history=20, strong_wolfe), logit-space delta [1,1,1024], focal loss (last 128 tokens intersected with causal context), warm-start, clamp +/-5** |
| Coprime loader | Weighted random shard sampling with coprime stride |

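The SLOT row maps naturally onto `torch.optim.LBFGS`. The following is a minimal sketch under stated assumptions, not the code in `train_gpt.py`: `fit_logit_delta` and its arguments are hypothetical, plain cross-entropy stands in for the focal loss (whose exact form is not given here), and the clamp is applied once after the optimizer step, which may differ from the submission.

```python
import torch
import torch.nn.functional as F

def fit_logit_delta(logits, targets, loss_mask, delta_init=None,
                    max_iter=25, history=20, clamp=5.0):
    """Fit a shared logit-space delta on the masked positions only.

    logits:    [T, V] frozen model outputs for one window
    targets:   [T] true token ids
    loss_mask: [T] bool, True only at already-scored context positions
    """
    vocab = logits.shape[-1]
    # Warm start from the previous window's delta if one is supplied.
    delta = torch.zeros(1, 1, vocab) if delta_init is None else delta_init.clone()
    delta.requires_grad_(True)
    opt = torch.optim.LBFGS([delta], max_iter=max_iter, history_size=history,
                            line_search_fn="strong_wolfe")
    base = logits.detach().unsqueeze(0)  # [1, T, V]; model weights are never touched

    def closure():
        opt.zero_grad()
        shifted = (base + delta)[0]  # [T, V]
        # Assumption: plain cross-entropy instead of the submission's focal loss.
        loss = F.cross_entropy(shifted[loss_mask], targets[loss_mask])
        loss.backward()
        return loss

    opt.step(closure)  # one step runs up to max_iter L-BFGS iterations
    with torch.no_grad():
        delta.clamp_(-clamp, clamp)  # assumption: clamp applied after optimization
    return delta.detach()

torch.manual_seed(0)
logits = torch.randn(192, 1024)            # toy window of frozen logits
targets = torch.randint(0, 1024, (192,))
loss_mask = torch.zeros(192, dtype=torch.bool)
loss_mask[:128] = True                     # only context positions contribute to the loss
delta = fit_logit_delta(logits, targets, loss_mask)
```

Passing the returned `delta` as `delta_init` for the next window would give the warm-start behavior listed in the table.
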
## Pipeline

1. Training: 600s on 8xH100 (~87ms/step, ~6900 steps)
2. Pre-quant AdamW TTT: 6 epochs (~110s)
3. GPTQ int6 quantization: ~23s
4. Sliding window eval (stride=64): ~115s
5. **L-BFGS Causal SLOT eval: ~556s** (within 10-min eval budget)

Total: ~24 min (training 10 min + eval 10 min + overhead).

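As a rough check on the total, the listed stage times can be summed directly. The durations below are the approximate values from the list; the stage labels are just dictionary keys for illustration.

```python
# Stage durations in seconds, copied from the pipeline list above (all approximate).
stages = {
    "training": 600,
    "pre-quant AdamW TTT": 110,
    "GPTQ int6 quantization": 23,
    "sliding window eval": 115,
    "L-BFGS causal SLOT eval": 556,
}
total_s = sum(stages.values())
for name, secs in stages.items():
    print(f"{name:<26} {secs:>4d} s")
print(f"{'total':<26} {total_s:>4d} s (~{total_s / 60:.1f} min)")
```

The listed stages add up to about 23.4 minutes, which is consistent with the quoted ~24 min once overhead is included.
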
## Compliance

This submission satisfies all four NoesisGenesis conditions (endorsed by @valerio-oai, Issue #677):

1. **Causal dependence:** `p_t` depends only on artifact `A` and `x_1...x_{t-1}`. SLOT delta is optimized using loss from already-scored context positions only. No future token information leaks into predictions.
2. **Full distribution:** Standard softmax over full 1024-token vocabulary. No cutoff or reranking.
3. **Score-before-update:** Tokens are scored before the SLOT delta is updated for the next window. Current window's scored tokens do not influence their own scores (causal mask ensures this).
4. **Single left-to-right pass:** One sliding-window pass with stride=64. No rescoring, no second pass.

Model weights are NEVER modified during evaluation. Only the per-window throwaway delta vector (1024 floats) and its optimizer state are updated, then discarded after each window.

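Conditions 3 and 4 can be expressed as a check on the window schedule. The sketch below assumes a particular layout that is not spelled out here: each window scores the next `stride` tokens and may use up to `seq_len - stride` preceding tokens as context. `window_schedule`, `check_schedule`, and `seq_len=256` are hypothetical; only `stride=64` comes from the text above.

```python
def window_schedule(num_tokens, seq_len, stride):
    """Yield (context, scored) index ranges for a single left-to-right pass.

    Assumed layout: each window scores the next `stride` tokens and uses up to
    `seq_len - stride` preceding tokens as already-scored context.
    """
    for start in range(0, num_tokens, stride):
        end = min(start + stride, num_tokens)
        ctx_start = max(0, end - seq_len)
        yield range(ctx_start, start), range(start, end)

def check_schedule(num_tokens, seq_len, stride):
    """Check that context is always already scored and each token is scored once."""
    times_scored = [0] * num_tokens
    for context, scored in window_schedule(num_tokens, seq_len, stride):
        # The delta for this window may only use loss from `context` positions,
        # all of which must have been scored by earlier windows.
        assert all(times_scored[p] == 1 for p in context)
        # Newly scored tokens must not have been scored before (no rescoring).
        assert all(times_scored[p] == 0 for p in scored)
        for p in scored:
            times_scored[p] += 1
    return all(count == 1 for count in times_scored)

schedule_ok = check_schedule(num_tokens=1000, seq_len=256, stride=64)
print("schedule is causal and single-pass:", schedule_ok)
```
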
## Reproduction

```bash
# Install FA3
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Run (seed 1337 is default)
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

No env vars needed. All config is hardcoded as defaults.

## Credits

- Base: PR #549 (@sanjeevmadhav), PR #1019 (@abaybektursun)
- Pre-quant AdamW TTT: PR #1006 (@abaybektursun)
- Coprime loader: PR #1184 (@icryo)
- L-BFGS SLOT concept: PR #1318 (L-BFGS logit-space SLOT, non-causal)
- Causal SLOT constraint: our PR #1306
- QK-Gain: PR #1217 (@bigbag)
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
  "name": "L-BFGS Causal SLOT in Logit Space + Pre-quant TTT + Config Batch",
  "blurb": "L-BFGS optimizer for causal (provably compliant) SLOT in logit space with focal loss and warm-start. Causal constraint: loss from already-scored context positions only. Combined with pre-quant AdamW TTT (6ep), coprime loader, QK_GAIN=5.0, WARMDOWN=4000, GPTQ damp=0.005. 3-seed mean 1.0046 BPB.",
  "track": "10min_16mb",
  "hardware": "8xH100_SXM",
  "seeds": [1337, 42, 2025],
  "mean_val_bpb": 1.0046,
  "std_val_bpb": 0.0003,
  "technique_summary": "L-BFGS Causal SLOT (logit-space, max_iter=25, history=20, focal_tokens=128, warm-start, clamp=5) + Pre-quant AdamW TTT 6ep + QK_GAIN=5.0 + WARMDOWN=4000 + GPTQ damp=0.005 + Coprime Loader"
}
