
Commit 4b37b42

Ablation: WiderGate32, RoPE dims, activation slopes, hparam stack (8xH100)

1 parent 6bb607a · 2 files changed · 121 additions, 0 deletions

File 1 of 2 · 93 additions

# Ablation: WiderGate, RoPE dims, activation slopes, hparam stack (8xH100)

Systematic ablation of 10 configurations on the PR #1693 architecture with CaseOps SP8192: the 8 runs in the results table plus the LZMA and LQER+gates probes summarized under Negative Results. All runs on 8xH100 SXM, 600s wallclock, single seed unless noted.

## Results

Base config: 11L/512d, CaseOps SP8192, AttnOutGate + SmearGate, Polar Express NS, MIN_LR=0.10, LQER OFF, brotli.

| # | Experiment | Change | Pre-quant | Post-quant | Post-TTT | Artifact | Delta |
|---|------------|--------|-----------|------------|----------|----------|-------|
| 1 | gates_caseops | baseline (EMBED_BITS=8) | 1.0712 | 1.0781 | **1.0674** | 16.93 MB ❌ | baseline |
| 2 | optimized_v1 | WARMDOWN=0.85, BETA2=0.99, clip sigmas | 1.0712 | 1.0806 | **1.0703** | 17.26 MB ❌ | +0.003 worse |
| 3 | rope24 | ROPE_DIMS=24 | 1.0715 | 1.0815 | **1.0706** | 16.39 MB ❌ | +0.003 worse |
| 4 | rope32 | ROPE_DIMS=32 | 1.0705 | 1.0806 | **1.0698** | 16.39 MB ❌ | +0.002 worse |
| 5 | v2_baseline | EMBED_BITS=6 | 1.0718 | 1.0941 | **1.0819** | 15.89 MB ✅ | +0.015 worse |
| 6 | v2_slope03 | LEAKY_SLOPE=0.3 | 1.0729 | 1.0942 | **1.0822** | 15.89 MB ✅ | +0.000 neutral |
| 7 | v2_slope00 | LEAKY_SLOPE=0.0 (ReLU²) | 1.0737 | 1.0958 | **1.0837** | 15.89 MB ✅ | +0.002 worse |
| 8 | **v2_gate32** | **GATE_WIDTH=32** | **1.0700** | **1.0908** | **1.0788** | **15.89 MB ✅** | **-0.003 better** |

Deltas are on post-TTT val BPB: rows 2-5 are measured against the row-1 baseline, rows 6-8 against v2_baseline (row 5). Artifact sizes are decimal MB; ✅/❌ marks the 16 MB artifact limit. Experiments 5-8 use EMBED_BITS=6 (int6 embeddings) to fit under 16 MB without LQER.

## Key Findings

### 1. Wider attention gates help (GATE_WIDTH=32)

Increasing the AttnOutGate input from 12 to 32 dimensions gives a **-0.002 pre-quant** and **-0.003 post-TTT** improvement. The wider gate sees more of the residual stream for its per-head gating decision. Cost: 1,760 extra float16 params (negligible).

```python
import torch
import torch.nn.functional as F

# Standard (width=12) sees x_orig[:, :, :12]; wider (width=32) sees x_orig[:, :, :32].
# Assumed shapes: x_orig [B, T, d_model], gate_w [n_heads, gate_width], attn_output [B, T, n_heads, head_dim].
gate_in = x_orig[:, :, :gate_w.shape[-1]].contiguous()
gate = (2.0 * torch.sigmoid(F.linear(gate_in, gate_w))).contiguous()  # per-head gate in (0, 2), [B, T, n_heads]
return attn_output * gate.unsqueeze(-1)
```

**Recommendation:** Adopt GATE_WIDTH=32 as default. Free improvement.

### 2. More RoPE dims hurt post-quantization

| ROPE_DIMS | Pre-quant | Post-quant | Quant gap |
|-----------|-----------|------------|-----------|
| 16 (default) | 1.0712 | 1.0781 | 0.0069 |
| 24 | 1.0715 | 1.0815 | 0.0100 |
| 32 | 1.0705 | 1.0806 | 0.0101 |

RoPE 32 improves pre-quant by -0.0007 but **increases quant gap** from 0.007 to 0.010. More rotated dimensions create weight distributions that GPTQ handles worse. **Keep ROPE_DIMS=16.**
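
ROPE_DIMS here appears to control partial RoPE: only the first ROPE_DIMS of each head's dimensions get rotated, and the rest pass through. A minimal sketch of that idea; the function name, shapes, and cos/sin layout are assumptions, not the repo's actual code:

```python
import torch

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rope_dims: int = 16) -> torch.Tensor:
    # x: [B, T, n_heads, head_dim] (assumed); cos/sin: [1, T, 1, rope_dims // 2],
    # broadcastable over batch and heads (assumed layout).
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)                                    # pair up dims for 2D rotations
    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)                        # unrotated dims pass through
```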

### 3. Activation slope changes are neutral or worse

| Slope | Pre-quant | Post-TTT | Note |
|-------|-----------|----------|------|
| 0.5 (default) | 1.0718 | 1.0819 | baseline |
| 0.3 (PR #1948) | 1.0729 | 1.0822 | neutral |
| 0.0 (pure ReLU²) | 1.0737 | 1.0837 | +0.002 worse |

PR #1948 reported slope=0.3 as optimal on a different base config. On the CaseOps+gates stack with EMBED_BITS=6, **slope 0.5 remains optimal**. Pure ReLU² hurts — the leaky negative slope provides useful gradient flow.
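
For reference, one plausible reading of the LEAKY_SLOPE knob, assuming a squared ReLU with a linear leak on the negative side (the function name and the exact negative branch are assumptions):

```python
import torch

def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # Positive side: x^2 (ReLU²). Negative side: slope * x (assumed linear leak).
    # slope=0.0 recovers pure ReLU²; slope=0.5 is the default in these runs.
    return torch.where(x > 0, x * x, slope * x)
```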

### 4. PR #1855 hparam stack does not transfer

| Hparam | Default | PR #1855 | Result |
|--------|---------|----------|--------|
| WARMDOWN_FRAC | 0.75 | 0.85 | Neutral pre-quant |
| BETA2 | 0.95 | 0.99 | Neutral pre-quant |
| EMBED_CLIP_SIGMAS | 20.0 | 14.0 | **Worse** quant gap (+0.0025) |
| MLP_CLIP_SIGMAS | 10.0 | 11.5 | **Worse** quant gap |
| TTT_BETA2 | 0.999 | 0.99 | Neutral |

The 9-hparam stack from PR #1855 was greedy-validated on a different config (SparseAttnGate, no SmearGate widening). On our CaseOps+AttnOutGate stack, **tighter clip sigmas hurt quantization** and WARMDOWN/BETA2 changes are neutral. **Keep defaults.**
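
Clip sigmas presumably clamp weights to ±N standard deviations before quantization; a minimal sketch of that assumed semantics (the function name is hypothetical):

```python
import torch

def clip_sigmas(w: torch.Tensor, n_sigmas: float) -> torch.Tensor:
    # Clamp weights to +/- n_sigmas standard deviations, e.g. EMBED_CLIP_SIGMAS=20.0
    # (assumed semantics; tighter clips hurt the quant gap on this config).
    s = w.std()
    return w.clamp(-n_sigmas * s, n_sigmas * s)
```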

### 5. EMBED_BITS=6 costs +0.014 BPB

Dropping embedding precision from int8 to int6 saves ~1 MB on the artifact (roughly 500 KB per bit) but costs +0.014 BPB post-TTT. This is the price of fitting under 16 MB without per-group lrzip compression.
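
A sketch of symmetric per-tensor int6 quantization, one plausible scheme for EMBED_BITS=6 (the scale choice and function name are assumptions):

```python
import torch

def quantize_int6(w: torch.Tensor):
    # 6 bits -> symmetric integer levels in [-31, 31] (assumed scheme).
    scale = w.abs().max() / 31.0
    q = torch.round(w / scale).clamp(-31, 31).to(torch.int8)  # stored packed to 6 bits in the artifact
    return q, scale

# Dequantize for use: w_hat = q.float() * scale
```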

### 6. LZMA compression is worse than brotli

LZMA produced artifacts ~300KB larger than brotli-11 on this architecture. **Use brotli.**
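
A minimal harness for this kind of codec check (the artifact path is hypothetical; `brotli` is the PyPI package, and the preset/quality values are each codec's maximum):

```python
import lzma
import brotli  # pip install brotli

blob = open("artifact.bin", "rb").read()  # hypothetical artifact path
b = brotli.compress(blob, quality=11)     # brotli-11 (max quality)
z = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
print(f"brotli-11: {len(b):,} B  lzma-9e: {len(z):,} B  diff: {len(z) - len(b):+,} B")
```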

## Negative Results Summary

| Technique | Expected | Actual | Verdict |
|-----------|----------|--------|---------|
| WARMDOWN_FRAC=0.85 | -0.002 | 0.000 | Dead |
| BETA2=0.99 | -0.001 | 0.000 | Dead |
| EMBED_CLIP_SIGMAS=14 | better quant | +0.0025 worse | Dead |
| ROPE_DIMS=24/32 | -0.003 | +0.003/+0.002 | Dead |
| LeakyReLU slope=0.3 | -0.001 | 0.000 | Dead |
| Pure ReLU² | -0.003 | +0.002 | Dead |
| LZMA compressor | better compression | +300KB larger | Dead |
| LQER + Gates combo | both help | over 16MB | Incompatible |

## Configuration

All experiments use `train_gpt.py` from the record submission (PR #1969) with env var overrides. No code changes are needed except for GATE_WIDTH and LEAKY_SLOPE, which require `train_gpt_v2.py`.
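
For concreteness, a hypothetical 8-GPU launch; the torchrun invocation and the chosen values are assumptions, and only the env var names come from this report:

```python
import os
import subprocess

# Hypothetical launcher: override hparams via env vars, then launch on 8 GPUs.
env = dict(os.environ, GATE_WIDTH="32", LEAKY_SLOPE="0.5", EMBED_BITS="6", ROPE_DIMS="16")
subprocess.run(
    ["torchrun", "--nproc_per_node=8", "train_gpt_v2.py"],
    env=env, check=True,
)
```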

File 2 of 2 · 28 additions

{
  "author": "Kamil Krawczyk",
  "github_id": "bsisduck",
  "name": "Ablation: WiderGate, RoPE dims, activation slopes, hparam stack",
  "date": "2026-04-30",
  "track": "non_record_16mb",
  "type": "ablation",
  "hardware": "8xH100 80GB SXM",
  "base_config": "PR #1693 + CaseOps SP8192 + AttnOutGate + SmearGate + PolarNS",
  "experiments": {
    "gates_caseops": {"val_bpb": 1.0674, "change": "baseline", "artifact_bytes": 16934743},
    "optimized_v1": {"val_bpb": 1.0703, "change": "WARMDOWN=0.85 BETA2=0.99 clips=tighter", "artifact_bytes": 17256881},
    "rope24": {"val_bpb": 1.0706, "change": "ROPE_DIMS=24", "artifact_bytes": 16394907},
    "rope32": {"val_bpb": 1.0698, "change": "ROPE_DIMS=32", "artifact_bytes": 16392886},
    "v2_baseline": {"val_bpb": 1.0819, "change": "EMBED_BITS=6 LQER_ENABLED=0", "artifact_bytes": 15885705},
    "v2_slope03": {"val_bpb": 1.0822, "change": "LEAKY_SLOPE=0.3", "artifact_bytes": 15889757},
    "v2_slope00": {"val_bpb": 1.0837, "change": "LEAKY_SLOPE=0.0 (ReLU²)", "artifact_bytes": 15892782},
    "v2_gate32": {"val_bpb": 1.0788, "change": "GATE_WIDTH=32", "artifact_bytes": 15891157}
  },
  "key_findings": [
    "GATE_WIDTH=32 improves val_bpb by -0.003 (novel finding)",
    "ROPE_DIMS >16 hurts post-quantization despite better pre-quant",
    "LeakyReLU slope 0.3/0.0 neutral or worse vs 0.5",
    "PR #1855 hparam stack does not transfer to this config",
    "EMBED_BITS=6 costs +0.014 BPB",
    "LZMA worse than brotli by +300KB"
  ]
}
