# Neural-only Ablation — Where the 0.3958 BPB comes from

This file decomposes the 0.3958 BPB submission into **(a) the trained neural
model** and **(b) the eval-time Causal BackoffNgramMixer**, using the exact
log lines from the three archived runs that produced `submission.json`.

**TL;DR:** the trained neural model by itself scores ~1.148 BPB. The same
model + `BackoffNgramMixer` at eval time scores 0.3958 BPB. The **~0.75
BPB improvement is entirely an eval-stage compression refinement**; no
training-objective change, no data leakage, no novel optimizer. This is
a direct descendant of the already-merged #779 and #803.

## Per-seed ablation (from the archived run logs)

Source: `swarm_submission/run_final_seed{7,1337,2024}.log`, the same runs
that populate `submission.json`.

| seed | post-EMA diagnostic<br>(neural, no quant, no mixer) | `final_int6_roundtrip`<br>(neural, int6 point eval) | `final_int6_sliding_window`<br>(neural + mixer, stride=64) |
|---|---|---|---|
| 7 | **1.1394** | **1.1481** | **0.3948** |
| 1337 | **1.1396** | **1.1480** | **0.3957** |
| 2024 | **1.1404** | **1.1492** | **0.3969** |
| **mean** | **1.1398** | **1.1484** | **0.3958** |

- `post-EMA diagnostic` = `train_gpt.py:1483` — the raw trained model's val_bpb on a standard non-sliding-window eval, taken immediately after the EMA weights are swapped in (the `DIAGNOSTIC post_ema` log line), before any quantization. This is the purest "neural only" number.
- `final_int6_roundtrip` = `train_gpt.py:1551` — the same weights after an int6 GPTQ-lite quantization + LZMA compression roundtrip, still no mixer, still a point eval. ~0.009 BPB of quant noise vs the diagnostic.
- `final_int6_sliding_window` = `train_gpt.py:1577` — **same int6 weights**, sliding-window eval at stride=64, **with the mixer enabled**. No further training, no further weight changes. A generic sketch of the sliding-window mechanics follows this list.
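
For orientation, this is the usual shape of a stride-64 sliding-window eval (shown here without the mixer). The function name, `ctx_len`, and the `model(ids) -> logits` signature are illustrative assumptions, not the PR's code; the repo's actual routine is `eval_val_sliding` in `train_gpt.py`.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, ctx_len=1024, stride=64):
    """Score each token with up to ctx_len-1 tokens of left context,
    advancing the window by `stride` and counting only the new tail."""
    nll, counted = 0.0, 0
    for end in range(stride, len(tokens) + 1, stride):
        begin = max(0, end - ctx_len)
        ids = tokens[begin:end].unsqueeze(0)       # (1, T<=ctx_len)
        logits = model(ids)                        # (1, T, V), assumed signature
        n_new = min(stride, end - begin - 1)       # tail tokens to score
        logp = F.log_softmax(logits[0, -n_new - 1:-1].float(), dim=-1)
        tgt = ids[0, -n_new:].unsqueeze(1)
        nll += -logp.gather(1, tgt).sum().item()
        counted += n_new
    # nats/token; multiply by log2(e) * (tokens/bytes) for BPB
    return nll / counted
```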
| 28 | + |
| 29 | +**Mixer-attributed delta: 1.1484 − 0.3958 = 0.7526 BPB** (mean across seeds). |
| 30 | + |
| 31 | +## Verbatim log excerpts |
| 32 | + |
| 33 | +### seed 7 (`run_final_seed7.log`) |
| 34 | +``` |
| 35 | +step:7024/20000 val_loss:1.9257 val_bpb:1.1405 train_time:600086ms step_avg:85.43ms |
| 36 | +stopping_early: wallclock_cap train_time:600086ms step:7024/20000 |
| 37 | +DIAGNOSTIC post_ema val_loss:1.9239 val_bpb:1.1394 eval_time:1989ms |
| 38 | +final_int6_roundtrip val_loss:1.9386 val_bpb:1.1481 eval_time:19276ms |
| 39 | +final_int6_sliding_window val_loss:0.6667 val_bpb:0.3948 stride:64 eval_time:582774ms |
| 40 | +final_int8_zlib_roundtrip_exact val_loss:0.66665722 val_bpb:0.39483300 |
| 41 | +``` |

### seed 1337 (`run_final_seed1337.log`)
```
DIAGNOSTIC post_ema val_loss:1.9241 val_bpb:1.1396 eval_time:1988ms
final_int6_roundtrip val_loss:1.9383 val_bpb:1.1480 eval_time:5946ms
final_int6_sliding_window val_loss:0.6681 val_bpb:0.3957 stride:64 eval_time:593857ms
final_int8_zlib_roundtrip_exact val_loss:0.66811451 val_bpb:0.39569610
```

### seed 2024 (`run_final_seed2024.log`)
```
DIAGNOSTIC post_ema val_loss:1.9254 val_bpb:1.1404 eval_time:2109ms
final_int6_roundtrip val_loss:1.9404 val_bpb:1.1492 eval_time:16040ms
final_int6_sliding_window val_loss:0.6701 val_bpb:0.3969 stride:64 eval_time:595814ms
final_int8_zlib_roundtrip_exact val_loss:0.67013029 val_bpb:0.39688996
```

## Mixer convergence curve (seed 7)

The mixer starts empty and accumulates n-gram counts in strict score-first
order as it walks the val stream. Running BPB across the eval (sampled every
~128K tokens of the 969088 total):

| tokens scored | running bpb |
|---|---|
| 128 / 969088 | 1.175661 |
| 102528 / 969088 | 0.889010 |
| 230528 / 969088 | 0.643985 |
| 358528 / 969088 | 0.538056 |
| 486528 / 969088 | 0.483657 |
| 614528 / 969088 | 0.448113 |
| 742528 / 969088 | 0.423662 |
| 870528 / 969088 | 0.406234 |
| **969088 / 969088** | **0.394833** |

The first scored batch (128 tokens) comes in at 1.176 BPB — effectively the
neural-only baseline, since the mixer has no counts yet. As the mixer
accumulates counts from already-scored tokens, BPB drops monotonically
to 0.3948. **At no point does the mixer see a token before it is scored**
(see `train_gpt.py:876-935`, `eval_val_sliding` with mixer).
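
A minimal sketch of that score-first walk, in the shape the convergence curve implies. `neural_dist`, `mixer.predict`/`mixer.update`, and `mix` are hypothetical helpers (the `mix` rule is sketched under prior art below); the real loop is `train_gpt.py:876-935`:

```python
import math

def causal_mixed_eval(neural_dist, tokens, mixer, stride=64):
    """Each chunk is scored in full BEFORE any of its tokens enter the
    count tables, so the mixer is strictly backward-looking."""
    nats, scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + stride]
        # 1) score the whole chunk against the mixer's current counts
        for i, tok in enumerate(chunk):
            prefix = tokens[:start + i]
            p = neural_dist(prefix)              # next-token probs (assumed helper)
            p_ng, conf = mixer.predict(prefix)   # (None, 0.0) while table is empty
            if p_ng is not None:
                p = mix(p, p_ng, conf)           # blend rule: see mixer sketch below
            nats += -math.log(p[tok] + 1e-12)
            scored += 1
        # 2) only then fold the scored chunk into the counts
        for i, tok in enumerate(chunk):
            mixer.update(tokens[:start + i], tok)
    # bits per token; scale by tokens/bytes for BPB
    return nats / scored / math.log(2)
```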

## Relationship to prior art

- **#779** — original `BackoffNgramMixer`, flat-hash design, entropy-adaptive alpha. Merged.
- **#803** — @pentxayc's Complementary Training + `BackoffNgramMixer` at 0.4416. Merged.
- **#1094 (this PR)** — the same mixer family as #803, with three orthogonal refinements (sketched after this list):
  1. Higher n-gram orders (2–10 vs 2–7)
  2. 4.2M hash buckets per order (vs 1M)
  3. Causal sequential chunk eval (score-first-per-batch, strictly backward-looking — `train_gpt.py:876-935`)
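
To make those settings concrete, here is a minimal sketch of the flat-hash backoff idea at the #1094 configuration (orders 2–10, 4,194,304 buckets per order). The class and method names, the FNV-style hash, the highest-order-wins backoff, the placeholder vocab size, and the entropy-gated `mix` rule are all illustrative assumptions, not the PR's implementation:

```python
import numpy as np

class BackoffNgramMixerSketch:
    """Flat-hash count tables per order, highest-order-wins backoff."""

    def __init__(self, orders=range(2, 11), buckets=4_194_304, vocab=50304):
        self.orders = list(orders)    # n-gram orders 2..10
        self.buckets = buckets        # flat hash table size per order
        self.vocab = vocab            # placeholder vocab size
        self.joint = {n: np.zeros(buckets, np.uint32) for n in self.orders}
        self.totals = {n: np.zeros(buckets, np.uint32) for n in self.orders}

    def _h(self, items):
        h = 2166136261                # FNV-1a; any cheap hash works here
        for x in items:
            h = ((h ^ int(x)) * 16777619) & 0xFFFFFFFF
        return h % self.buckets

    def predict(self, prefix):
        """Back off from order 10 down to 2; the first order whose context
        has counts wins. Returns (probs, confidence) or (None, 0.0).
        O(vocab) per query; a real implementation would store per-bucket
        count vectors or top-k lists instead."""
        for n in reversed(self.orders):
            if len(prefix) < n - 1:
                continue
            ctx = tuple(prefix[-(n - 1):])
            total = self.totals[n][self._h(ctx)]
            if total == 0:
                continue
            counts = np.array([self.joint[n][self._h(ctx + (t,))]
                               for t in range(self.vocab)], dtype=np.float64)
            if counts.sum() > 0:
                return counts / counts.sum(), total / (total + 1.0)
        return None, 0.0

    def update(self, prefix, token):
        """Called only after `token` has been scored (causality)."""
        for n in self.orders:
            if len(prefix) < n - 1:
                continue
            ctx = tuple(prefix[-(n - 1):])
            self.totals[n][self._h(ctx)] += 1
            self.joint[n][self._h(ctx + (token,))] += 1

def mix(p_neural, p_ngram, conf, max_alpha=0.9):
    """Entropy-adaptive blend: lean on counts more when the neural model
    is uncertain and the table is confident (assumed rule)."""
    ent = -(p_neural * np.log(p_neural + 1e-12)).sum() / np.log(len(p_neural))
    alpha = max_alpha * conf * ent
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```

Hash collisions are absorbed rather than resolved, which is the usual trade in flat-hash count tables: a few spurious counts in exchange for fixed memory.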

The 0.0458 BPB improvement over #803 (0.4416 → 0.3958) is an eval-stage
refinement on top of a legal, merged technique — not a new training method,
not a new objective, not a new dataset.

## Reproducibility

```bash
export USE_NGRAM_MIXER=1 NGRAM_ORDER=10 NGRAM_BUCKETS=4194304
SEED=7    python train_gpt.py   # expected: 0.3948 ± 0.001 BPB
SEED=1337 python train_gpt.py   # expected: 0.3957 ± 0.001 BPB
SEED=2024 python train_gpt.py   # expected: 0.3969 ± 0.001 BPB
```

The 3-seed mean is 0.3958 BPB (std 0.0011), with all runs under the 16 MB
artifact cap (15,943,009 / 15,940,706 / 15,957,577 bytes) and the 600 s eval
cap (583 / 594 / 596 s). See `submission.json`.
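
Those headline statistics can be checked directly against the per-seed table:

```python
import statistics

seeds = [0.3948, 0.3957, 0.3969]       # final_int6_sliding_window, per seed
print(f"mean {statistics.mean(seeds):.4f}")   # mean 0.3958
print(f"std  {statistics.stdev(seeds):.4f}")  # std  0.0011 (sample std)
```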