# Ablation: WiderGate, RoPE dims, activation slopes, hparam stack (8xH100)

Systematic ablation of 10 configurations on the PR #1693 architecture with CaseOps SP8192. All runs on 8xH100 SXM, 600s wallclock, single seed unless noted.

## Results

Base config: 11L/512d, CaseOps SP8192, AttnOutGate + SmearGate, Polar Express NS, MIN_LR=0.10, LQER OFF, brotli.

| # | Experiment | Change | Pre-quant | Post-quant | Post-TTT | Artifact (16 MB cap) | Delta vs baseline |
|---|-----------|--------|-----------|------------|----------|----------------------|-------------------|
| 1 | gates_caseops | baseline (EMBED_BITS=8) | 1.0712 | 1.0781 | **1.0674** | 16.93 MB ❌ | — |
| 2 | optimized_v1 | WARMDOWN=0.85, BETA2=0.99, clip sigmas | 1.0712 | 1.0806 | **1.0703** | 17.26 MB ❌ | +0.003 worse |
| 3 | rope24 | ROPE_DIMS=24 | 1.0715 | 1.0815 | **1.0706** | 16.39 MB ❌ | +0.003 worse |
| 4 | rope32 | ROPE_DIMS=32 | 1.0705 | 1.0806 | **1.0698** | 16.39 MB ❌ | +0.002 worse |
| 5 | v2_baseline | EMBED_BITS=6 | 1.0718 | 1.0941 | **1.0819** | 15.15 MB ✅ | +0.015 worse |
| 6 | v2_slope03 | LEAKY_SLOPE=0.3 | 1.0729 | 1.0942 | **1.0822** | 15.15 MB ✅ | +0.000 neutral |
| 7 | v2_slope00 | LEAKY_SLOPE=0.0 (ReLU²) | 1.0737 | 1.0958 | **1.0837** | 15.16 MB ✅ | +0.002 worse |
| 8 | **v2_gate32** | **GATE_WIDTH=32** | **1.0700** | **1.0908** | **1.0788** | **15.89 MB ✅** | **-0.003 better** |

Experiments 5-8 use EMBED_BITS=6 (int6 embeddings) to fit under 16MB without LQER. Deltas for rows 6-8 are measured against v2_baseline (#5); row 5's delta is measured against the int8 baseline (#1).

## Key Findings

### 1. Wider attention gates help (GATE_WIDTH=32)

Increasing AttnOutGate input from 12 to 32 dimensions gives **-0.002 pre-quant** and **-0.003 post-TTT** improvement. The wider gate sees more of the residual stream for its per-head gating decision. Cost: 1,760 extra float16 params (negligible).

```python
import torch
import torch.nn.functional as F

def attn_out_gate(attn_output, x_orig, gate_w):
    # gate_w: (n_heads, gate_width); width=12 reads x[:, :, :12],
    # width=32 reads x[:, :, :32] of the pre-attention residual stream.
    gate_in = x_orig[:, :, :gate_w.shape[-1]]
    gate = 2.0 * torch.sigmoid(F.linear(gate_in, gate_w))  # per-head gate in (0, 2)
    return attn_output * gate.unsqueeze(-1)  # broadcast over head_dim
```

**Recommendation:** Adopt GATE_WIDTH=32 as default. Free improvement.

### 2. More RoPE dims hurt post-quantization

| ROPE_DIMS | Pre-quant | Post-quant | Quant gap |
|-----------|-----------|------------|-----------|
| 16 (default) | 1.0712 | 1.0781 | 0.0069 |
| 24 | 1.0715 | 1.0815 | 0.0100 |
| 32 | 1.0705 | 1.0806 | 0.0101 |

ROPE_DIMS=32 improves pre-quant by -0.0007 but **increases the quant gap** from 0.007 to 0.010. More rotated dimensions create weight distributions that GPTQ handles worse. **Keep ROPE_DIMS=16.**
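
For reference, a minimal sketch of what ROPE_DIMS controls, assuming the usual partial-rotary layout where only the first ROPE_DIMS channels of each head are rotated and the rest pass through position-free. Names and tensor layout are illustrative, not the repo's exact code:

```python
import torch

def partial_rope(x, cos, sin, rope_dims=16):
    # x: (B, n_heads, T, head_dim); cos/sin: (T, rope_dims // 2).
    # Rotate only the first rope_dims channels; the rest are untouched.
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)
```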

### 3. Activation slope changes are neutral or worse

| Slope | Pre-quant | Post-TTT | Note |
|-------|-----------|----------|------|
| 0.5 (default) | 1.0718 | 1.0819 | baseline |
| 0.3 (PR #1948) | 1.0729 | 1.0822 | neutral |
| 0.0 (pure ReLU²) | 1.0737 | 1.0837 | +0.002 worse |

PR #1948 reported slope=0.3 as optimal on a different base config. On the CaseOps+gates stack with EMBED_BITS=6, **slope 0.5 remains optimal**. Pure ReLU² hurts: the leaky negative slope provides useful gradient flow.
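
A sketch of the activation family being swept, under the assumption that LEAKY_SLOPE sets the negative-branch slope of a ReLU²-style activation, with slope 0.0 recovering pure ReLU². The exact form in `train_gpt_v2.py` may differ:

```python
import torch

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # Positive branch: squared ReLU. Negative branch: a linear leak that
    # keeps gradient flowing for x < 0; slope=0.0 is pure ReLU².
    return torch.where(x > 0, x * x, slope * x)
```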

### 4. PR #1855 hparam stack does not transfer

| Hparam | Default | PR #1855 | Result |
|--------|---------|----------|--------|
| WARMDOWN_FRAC | 0.75 | 0.85 | Neutral pre-quant |
| BETA2 | 0.95 | 0.99 | Neutral pre-quant |
| EMBED_CLIP_SIGMAS | 20.0 | 14.0 | **Worse** quant gap (+0.0025) |
| MLP_CLIP_SIGMAS | 10.0 | 11.5 | **Worse** quant gap |
| TTT_BETA2 | 0.999 | 0.99 | Neutral |

The 9-hparam stack from PR #1855 was greedy-validated on a different config (SparseAttnGate, no SmearGate widening). On our CaseOps+AttnOutGate stack, **tighter clip sigmas hurt quantization** and WARMDOWN/BETA2 changes are neutral. **Keep defaults.**
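
For context, a sketch of what the `*_CLIP_SIGMAS` knobs presumably do, assuming the standard pattern of clamping weights to ±N standard deviations before quantization. Illustrative only, not the submission's exact code:

```python
import torch

def clip_to_sigmas(w: torch.Tensor, sigmas: float) -> torch.Tensor:
    # Clamp weights to +/- sigmas * std before quantization. A tighter
    # clip (e.g. 14 vs 20) shrinks the quantization range but discards
    # more outlier mass, which these runs suggest hurts on this stack.
    bound = sigmas * w.std()
    return w.clamp(-bound, bound)
```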

### 5. EMBED_BITS=6 costs +0.014 BPB

Dropping embedding precision from int8 to int6 trims the artifact from 16.93 MB to 15.15 MB but costs +0.014 BPB post-TTT. This is the price of fitting under 16MB without LQER.
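
A minimal sketch of symmetric intN quantization of the embedding table, assuming EMBED_BITS selects the integer width; scale handling and bit-packing in the real pipeline will differ:

```python
import torch

def quantize_embedding(w: torch.Tensor, bits: int = 6):
    # Symmetric per-tensor quantization: int6 gives levels in [-32, 31]
    # vs int8's [-128, 127], which is where the +0.014 BPB comes from.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale  # dequantize as q.float() * scale
```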

### 6. LZMA compression is worse than brotli

LZMA produced artifacts ~300KB larger than brotli-11 on this architecture. **Use brotli.**
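
The comparison is easy to reproduce on any serialized checkpoint. A sketch using the standard `brotli` and `lzma` Python bindings; the quality/preset values are the usual maxima, not necessarily the submission's exact settings:

```python
import brotli  # pip install brotli
import lzma

def compare_compressors(blob: bytes) -> dict:
    # Compress one serialized checkpoint both ways and report sizes.
    return {
        "brotli-11": len(brotli.compress(blob, quality=11)),
        "lzma-9": len(lzma.compress(blob, preset=9)),
    }
```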

## Negative Results Summary

| Technique | Expected | Actual | Verdict |
|-----------|----------|--------|---------|
| WARMDOWN_FRAC=0.85 | -0.002 | 0.000 | Dead |
| BETA2=0.99 | -0.001 | 0.000 | Dead |
| EMBED_CLIP_SIGMAS=14 | better quant | +0.0025 worse | Dead |
| ROPE_DIMS=24/32 | -0.003 | +0.003/+0.002 | Dead |
| LeakyReLU slope=0.3 | -0.001 | 0.000 | Dead |
| Pure ReLU² | -0.003 | +0.002 | Dead |
| LZMA compressor | better compression | +300KB larger | Dead |
| LQER + Gates combo | both help | over 16MB | Incompatible |

## Configuration

All experiments use `train_gpt.py` from the record submission (PR #1969) with env var overrides. No code changes are needed except for GATE_WIDTH and LEAKY_SLOPE, which require `train_gpt_v2.py`.
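
The override pattern is the usual env-var-with-default read at module load. A hypothetical sketch: the variable names and defaults come from this report, but the exact parsing in `train_gpt.py` is not shown here:

```python
import os

# Hypothetical: how the swept knobs might be read as env var overrides.
GATE_WIDTH = int(os.environ.get("GATE_WIDTH", "12"))       # train_gpt_v2.py only
LEAKY_SLOPE = float(os.environ.get("LEAKY_SLOPE", "0.5"))  # train_gpt_v2.py only
ROPE_DIMS = int(os.environ.get("ROPE_DIMS", "16"))
EMBED_BITS = int(os.environ.get("EMBED_BITS", "8"))
```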