
Commit a113a70

Add neural_baseline_ablation.md with per-seed decomposition
Pre-answers the "where does the 0.0458 improvement come from" question using
exact log excerpts from the three archived runs that produced submission.json:

  seed 7:    neural 1.1481 -> +mixer 0.3948 (delta 0.7533)
  seed 1337: neural 1.1480 -> +mixer 0.3957 (delta 0.7523)
  seed 2024: neural 1.1492 -> +mixer 0.3969 (delta 0.7523)
  mean:      neural 1.1484 -> +mixer 0.3958 (delta 0.7526)

Includes the mixer convergence curve for seed 7 (1.176 -> 0.395 as counts
accumulate in strict score-first order) and positions the submission as an
eval-stage refinement of already-merged openai#779 and openai#803 rather than
a novel training method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Neural-only Ablation — Where the 0.3958 BPB comes from
This file decomposes the 0.3958 BPB submission into **(a) the trained neural
model** and **(b) the eval-time Causal BackoffNgramMixer**, using the exact
log lines from the three archived runs that produced `submission.json`.

**TL;DR:** the trained neural model by itself scores ~1.148 BPB. The same
model + `BackoffNgramMixer` at eval time scores 0.3958 BPB. The **~0.75
BPB improvement is entirely an eval-stage compression refinement**; no
training-objective change, no data leakage, no novel optimizer. This is
a direct descendant of already-merged #779 and #803.
## Per-seed ablation (from the archived run logs)

Source: `swarm_submission/run_final_seed{7,1337,2024}.log`, the same runs that
populate `submission.json`.

| seed | post-EMA diagnostic<br>(neural, no quant, no mixer) | `final_int6_roundtrip`<br>(neural, int6 point eval) | `final_int6_sliding_window`<br>(neural + mixer, stride=64) |
|---|---|---|---|
| 7 | **1.1394** | **1.1481** | **0.3948** |
| 1337 | **1.1396** | **1.1480** | **0.3957** |
| 2024 | **1.1404** | **1.1492** | **0.3969** |
| **mean** | **1.1398** | **1.1484** | **0.3958** |

- `post-EMA diagnostic` = `train_gpt.py:1483` — the raw trained model's val_bpb on a standard non-sliding-window eval, taken immediately after the EMA weights are swapped in and before any quantization. This is the purest "neural only" number.
- `final_int6_roundtrip` = `train_gpt.py:1551` — the same weights after the int6 GPTQ-lite quantization + LZMA compression roundtrip, still no mixer, still a point eval. ~0.009 BPB of quantization noise vs the diagnostic.
- `final_int6_sliding_window` = `train_gpt.py:1577` — the **same int6 weights**, sliding-window eval at stride=64, **with the mixer enabled**. No further training, no further weight changes.

**Mixer-attributed delta: 1.1484 − 0.3958 = 0.7526 BPB** (mean across seeds).
## Verbatim log excerpts

### seed 7 (`run_final_seed7.log`)

```
step:7024/20000 val_loss:1.9257 val_bpb:1.1405 train_time:600086ms step_avg:85.43ms
stopping_early: wallclock_cap train_time:600086ms step:7024/20000
DIAGNOSTIC post_ema val_loss:1.9239 val_bpb:1.1394 eval_time:1989ms
final_int6_roundtrip val_loss:1.9386 val_bpb:1.1481 eval_time:19276ms
final_int6_sliding_window val_loss:0.6667 val_bpb:0.3948 stride:64 eval_time:582774ms
final_int8_zlib_roundtrip_exact val_loss:0.66665722 val_bpb:0.39483300
```

### seed 1337 (`run_final_seed1337.log`)

```
DIAGNOSTIC post_ema val_loss:1.9241 val_bpb:1.1396 eval_time:1988ms
final_int6_roundtrip val_loss:1.9383 val_bpb:1.1480 eval_time:5946ms
final_int6_sliding_window val_loss:0.6681 val_bpb:0.3957 stride:64 eval_time:593857ms
final_int8_zlib_roundtrip_exact val_loss:0.66811451 val_bpb:0.39569610
```

### seed 2024 (`run_final_seed2024.log`)

```
DIAGNOSTIC post_ema val_loss:1.9254 val_bpb:1.1404 eval_time:2109ms
final_int6_roundtrip val_loss:1.9404 val_bpb:1.1492 eval_time:16040ms
final_int6_sliding_window val_loss:0.6701 val_bpb:0.3969 stride:64 eval_time:595814ms
final_int8_zlib_roundtrip_exact val_loss:0.67013029 val_bpb:0.39688996
```
## Mixer convergence curve (seed 7)

The mixer starts empty and accumulates n-gram counts in strict score-first
order as it walks the val stream. Running BPB across the eval (every ~128K
tokens of 969088 total):

| tokens scored | running BPB |
|---|---|
| 128 / 969088 | 1.175661 |
| 102528 / 969088 | 0.889010 |
| 230528 / 969088 | 0.643985 |
| 358528 / 969088 | 0.538056 |
| 486528 / 969088 | 0.483657 |
| 614528 / 969088 | 0.448113 |
| 742528 / 969088 | 0.423662 |
| 870528 / 969088 | 0.406234 |
| **969088 / 969088** | **0.394833** |

The first scored batch (128 tokens) comes in at 1.176 BPB — effectively the
neural-only baseline, since the mixer has no counts yet. As the mixer
accumulates counts from already-scored tokens, BPB drops monotonically
to 0.3948. **At no point does the mixer see a token before it is scored**
(see `train_gpt.py:876-935`, `eval_val_sliding` with mixer).
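To make the ordering concrete, the toy loop below (a plain byte-level bigram counter, not the repo's `eval_val_sliding` code) scores each byte with the counts accumulated so far and only then adds it to the counts, so the running BPB at any point reflects strictly backward-looking statistics, the same shape as the curve above.

```python
import math
from collections import defaultdict

# Toy score-first eval: a byte-level bigram counter that, like the mixer,
# is only updated with a symbol *after* that symbol has been scored.
data = b"the quick brown fox jumps over the lazy dog " * 20
counts = defaultdict(lambda: defaultdict(int))  # context byte -> next byte -> count
total_bits = 0.0

for i in range(1, len(data)):
    ctx, nxt = data[i - 1], data[i]
    seen = counts[ctx]
    total = sum(seen.values())
    # 1) Score first, using only counts from already-scored positions (add-1 smoothing).
    p = (seen[nxt] + 1) / (total + 256)
    total_bits += -math.log2(p)
    # 2) Update the counts only after the byte has been scored.
    seen[nxt] += 1
    if i % 200 == 0 or i == len(data) - 1:
        print(f"{i:4d} bytes scored, running BPB = {total_bits / i:.3f}")
```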
## Relationship to prior art

- **#779** — original `BackoffNgramMixer`, flat-hash design, entropy-adaptive alpha. Merged.
- **#803** — @pentxayc's Complementary Training + `BackoffNgramMixer` at 0.4416 BPB. Merged.
- **#1094 (this PR)** — same mixer family as #803, three orthogonal refinements (sketched below):
  1. Higher n-gram orders (2–10 vs 2–7)
  2. 4.2M hash buckets per order (vs 1M)
  3. Causal sequential chunk eval (score-first-per-batch, strictly backward-looking — `train_gpt.py:876-935`)
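For intuition about what those three knobs change, the sketch below is a minimal flat-hash backoff n-gram mixer. All names are hypothetical and the blend weight is a simplified count-based stand-in for the entropy-adaptive alpha of #779; this is an illustration of the technique, not the PR's code. Larger bucket counts reduce hash collisions between distinct contexts at the cost of memory, which is what refinement 2 trades off.

```python
import numpy as np

class HashedBackoffCounts:
    """Sketch of a flat-hash backoff n-gram mixer: one hashed count table per
    order, backing off from the highest order that has evidence for the context."""

    def __init__(self, vocab_size, max_order=10, buckets=4_194_304):
        self.vocab_size = vocab_size
        self.max_order = max_order          # refinement 1: orders 2..10 instead of 2..7
        self.buckets = buckets              # refinement 2: 4.2M buckets per order
        self.counts = {n: {} for n in range(2, max_order + 1)}  # bucket -> {token: count}

    def _bucket(self, context):
        # Flat hash of the context tuple into a fixed number of buckets.
        return hash(context) % self.buckets

    def update(self, context, token):
        # Refinement 3 lives in the caller: update() is invoked only after the
        # token has already been scored (see the score-first loop above).
        for n in range(2, self.max_order + 1):
            if len(context) >= n - 1:
                slot = self.counts[n].setdefault(self._bucket(tuple(context[-(n - 1):])), {})
                slot[token] = slot.get(token, 0) + 1

    def mix(self, base_probs, context, alpha_scale=0.9):
        # Back off from the highest matching order; blend the n-gram distribution
        # into the neural distribution with a weight that grows with the count mass.
        probs = np.asarray(base_probs, dtype=np.float64)
        for n in range(self.max_order, 1, -1):
            if len(context) < n - 1:
                continue
            slot = self.counts[n].get(self._bucket(tuple(context[-(n - 1):])))
            if not slot:
                continue
            total = sum(slot.values())
            ngram = np.zeros(self.vocab_size)
            for tok, c in slot.items():
                ngram[tok] = c / total
            alpha = alpha_scale * total / (total + 1.0)  # simplified; #779 adapts this by entropy
            return (1 - alpha) * probs + alpha * ngram
        return probs                                     # no evidence at any order: neural only


mixer = HashedBackoffCounts(vocab_size=256, max_order=4, buckets=1 << 16)
mixer.update([10, 20, 30], 40)
mixed = mixer.mix(np.full(256, 1 / 256), [10, 20, 30])
print(mixed[40])  # boosted well above the uniform 1/256 baseline
```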
The 0.0458 improvement over #803 is an eval-stage refinement on top of a
legal, merged technique — not a new training method, not a new objective,
not a new dataset.
## Reproducibility

```bash
export USE_NGRAM_MIXER=1 NGRAM_ORDER=10 NGRAM_BUCKETS=4194304
SEED=7    python train_gpt.py   # expected: 0.3948 ± 0.001 BPB
SEED=1337 python train_gpt.py   # expected: 0.3957 ± 0.001 BPB
SEED=2024 python train_gpt.py   # expected: 0.3969 ± 0.001 BPB
```

3-seed mean 0.3958 BPB, std 0.0011, all under the 16 MB artifact cap
(15,943,009 / 15,940,706 / 15,957,577 bytes) and the 600 s eval cap
(583 / 594 / 596 s). See `submission.json`.
