diff --git a/records/track_non_record_16mb/2026-04-30_Ablation_WiderGate_RoPE_Activation_Hparams_8xH100/README.md b/records/track_non_record_16mb/2026-04-30_Ablation_WiderGate_RoPE_Activation_Hparams_8xH100/README.md
new file mode 100644
index 0000000000..2868c4ca0e
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_Ablation_WiderGate_RoPE_Activation_Hparams_8xH100/README.md
@@ -0,0 +1,93 @@
+# Ablation: WiderGate, RoPE dims, activation slopes, hparam stack (8xH100)
+
+Systematic ablation of 10 configurations on the PR #1693 architecture with CaseOps SP8192. All runs on 8xH100 SXM, 600s wallclock, single seed unless noted.
+
+## Results
+
+Base config: 11L/512d, CaseOps SP8192, AttnOutGate + SmearGate, Polar Express NS, MIN_LR=0.10, LQER OFF, brotli.
+
+| # | Experiment | Change | Pre-quant | Post-quant | Post-TTT | Artifact | Delta vs baseline |
+|---|-----------|--------|-----------|------------|----------|----------|-------------------|
+| 1 | gates_caseops | baseline (EMBED_BITS=8) | 1.0712 | 1.0781 | **1.0674** | 16.93 MB ❌ | — |
+| 2 | optimized_v1 | WARMDOWN=0.85, BETA2=0.99, clip sigmas | 1.0712 | 1.0806 | **1.0703** | 17.26 MB ❌ | +0.003 worse |
+| 3 | rope24 | ROPE_DIMS=24 | 1.0715 | 1.0815 | **1.0706** | 16.39 MB ❌ | +0.003 worse |
+| 4 | rope32 | ROPE_DIMS=32 | 1.0705 | 1.0806 | **1.0698** | 16.39 MB ❌ | +0.002 worse |
+| 5 | v2_baseline | EMBED_BITS=6 | 1.0718 | 1.0941 | **1.0819** | 15.89 MB ✅ | +0.015 worse |
+| 6 | v2_slope03 | LEAKY_SLOPE=0.3 | 1.0729 | 1.0942 | **1.0822** | 15.89 MB ✅ | +0.000 neutral |
+| 7 | v2_slope00 | LEAKY_SLOPE=0.0 (ReLU²) | 1.0737 | 1.0958 | **1.0837** | 15.89 MB ✅ | +0.002 worse |
+| 8 | **v2_gate32** | **GATE_WIDTH=32** | **1.0700** | **1.0908** | **1.0788** | **15.89 MB ✅** | **-0.003 better** |
+
+Experiments 5-8 use EMBED_BITS=6 (int6 embeddings) to fit under 16MB without LQER. Deltas for #2-5 are relative to #1; deltas for #6-8 are relative to #5 (v2_baseline).
+
+## Key Findings
+
+### 1. Wider attention gates help (GATE_WIDTH=32)
+
+Increasing the AttnOutGate input from 12 to 32 dimensions gives a **-0.002 pre-quant** and **-0.003 post-TTT** improvement. The wider gate sees more of the residual stream for its per-head gating decision. Cost: 1,760 extra float16 params (negligible).
+
+```python
+# Inside the attention forward pass; gate_w is the per-head gate projection.
+# Standard (width=12): gate sees x[:, :, :12]
+# Wider (width=32): gate sees x[:, :, :32]
+gate_in = x_orig[:, :, :gate_w.shape[-1]].contiguous()
+gate = (2.0 * torch.sigmoid(F.linear(gate_in, gate_w))).contiguous()  # per-head gate in (0, 2)
+return attn_output * gate.unsqueeze(-1)  # broadcast over head_dim
+```
+
+**Recommendation:** Adopt GATE_WIDTH=32 as the default. Free improvement.
+
+### 2. More RoPE dims hurt post-quantization
+
+| ROPE_DIMS | Pre-quant | Post-quant | Quant gap |
+|-----------|-----------|------------|-----------|
+| 16 (default) | 1.0712 | 1.0781 | 0.0069 |
+| 24 | 1.0715 | 1.0815 | 0.0100 |
+| 32 | 1.0705 | 1.0806 | 0.0101 |
+
+ROPE_DIMS=32 improves pre-quant by -0.0007 but **increases the quant gap** from 0.007 to 0.010. More rotated dimensions create weight distributions that GPTQ handles worse. **Keep ROPE_DIMS=16.**
+
+### 3. Activation slope changes are neutral or worse
+
+| Slope | Pre-quant | Post-TTT | Note |
+|-------|-----------|----------|------|
+| 0.5 (default) | 1.0718 | 1.0819 | baseline |
+| 0.3 (PR #1948) | 1.0729 | 1.0822 | neutral |
+| 0.0 (pure ReLU²) | 1.0737 | 1.0837 | +0.002 worse |
+
+PR #1948 reported slope=0.3 as optimal on a different base config. On the CaseOps+gates stack with EMBED_BITS=6, **slope 0.5 remains optimal**. Pure ReLU² hurts — the leaky negative slope provides useful gradient flow.
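+
+A sketch of the activation swept here, under the assumption that LEAKY_SLOPE linearly scales the negative branch of a squared ReLU; the exact parameterization in `train_gpt_v2.py` may differ.
+
+```python
+import torch
+
+def leaky_relu_squared(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
+    # Assumed form: squared ReLU on the positive branch, linear leak on the
+    # negative branch. slope=0.0 recovers pure ReLU² (#7); slope=0.5 is the
+    # default used by the baselines.
+    return torch.where(x > 0, x * x, slope * x)
+```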
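+
+Likewise, for finding 2's knob, a minimal sketch of partial rotary embedding, assuming the usual half-split rotation applied to only the first ROPE_DIMS channels of each head; names and shapes are illustrative, and `train_gpt.py` is authoritative.
+
+```python
+import torch
+
+def apply_partial_rope(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
+                       rope_dims: int = 16) -> torch.Tensor:
+    # q: [B, T, n_heads, head_dim]; cos/sin: [T, rope_dims // 2] (assumed shapes).
+    # Only the first `rope_dims` channels are rotated; the rest pass through.
+    x, x_pass = q[..., :rope_dims], q[..., rope_dims:]
+    x1, x2 = x.chunk(2, dim=-1)
+    cos, sin = cos[:, None, :], sin[:, None, :]  # broadcast over heads
+    rotated = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
+    return torch.cat((rotated, x_pass), dim=-1)
+```
+
+Raising `rope_dims` from 16 to 24 or 32 rotates more of each head, which is the change that widened the quant gap in the finding 2 table.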
+
+### 4. PR #1855 hparam stack does not transfer
+
+| Hparam | Default | PR #1855 | Result |
+|--------|---------|----------|--------|
+| WARMDOWN_FRAC | 0.75 | 0.85 | Neutral pre-quant |
+| BETA2 | 0.95 | 0.99 | Neutral pre-quant |
+| EMBED_CLIP_SIGMAS | 20.0 | 14.0 | **Worse** quant gap (+0.0025) |
+| MLP_CLIP_SIGMAS | 10.0 | 11.5 | **Worse** quant gap |
+| TTT_BETA2 | 0.999 | 0.99 | Neutral |
+
+The 9-hparam stack from PR #1855 was greedy-validated on a different config (SparseAttnGate, no SmearGate widening). On our CaseOps+AttnOutGate stack, **the clip-sigma changes hurt quantization** and the WARMDOWN/BETA2 changes are neutral. **Keep defaults.**
+
+### 5. EMBED_BITS=6 costs +0.014 BPB
+
+Dropping embedding precision from int8 to int6 saves ~500KB but costs +0.014 BPB post-TTT. This is the price of fitting under 16MB without per-group lrzip compression.
+
+### 6. LZMA compression is worse than brotli
+
+LZMA produced artifacts ~300KB larger than brotli-11 on this architecture. **Use brotli.**
+
+## Negative Results Summary
+
+| Technique | Expected | Actual | Verdict |
+|-----------|----------|--------|---------|
+| WARMDOWN_FRAC=0.85 | -0.002 | 0.000 | Dead |
+| BETA2=0.99 | -0.001 | 0.000 | Dead |
+| EMBED_CLIP_SIGMAS=14 | better quant | +0.0025 worse | Dead |
+| ROPE_DIMS=24/32 | -0.003 | +0.003/+0.002 | Dead |
+| LeakyReLU slope=0.3 | -0.001 | 0.000 | Dead |
+| Pure ReLU² | -0.003 | +0.002 | Dead |
+| LZMA compressor | better compression | +300KB larger | Dead |
+| LQER + Gates combo | both help | over 16MB | Incompatible |
+
+## Configuration
+
+All experiments use `train_gpt.py` from the record submission (PR #1969) with env var overrides; a sketch of the pattern follows below. No code changes are needed except for GATE_WIDTH and LEAKY_SLOPE, which require `train_gpt_v2.py`.
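+
+The override mechanism, sketched under the assumption of a plain environment-variable pattern; the variable names come from this README's tables, but the parsing code itself is illustrative, not quoted from `train_gpt.py`.
+
+```python
+import os
+
+# Assumed env-var override pattern; defaults match the baseline rows above.
+GATE_WIDTH = int(os.environ.get("GATE_WIDTH", "12"))
+LEAKY_SLOPE = float(os.environ.get("LEAKY_SLOPE", "0.5"))
+EMBED_BITS = int(os.environ.get("EMBED_BITS", "8"))
+ROPE_DIMS = int(os.environ.get("ROPE_DIMS", "16"))
+WARMDOWN_FRAC = float(os.environ.get("WARMDOWN_FRAC", "0.75"))
+```
+
+A run like experiment #8 would then set `GATE_WIDTH=32 EMBED_BITS=6` in the environment before launching `train_gpt_v2.py`.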
diff --git a/records/track_non_record_16mb/2026-04-30_Ablation_WiderGate_RoPE_Activation_Hparams_8xH100/submission.json b/records/track_non_record_16mb/2026-04-30_Ablation_WiderGate_RoPE_Activation_Hparams_8xH100/submission.json
new file mode 100644
index 0000000000..2d11dd9b20
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-30_Ablation_WiderGate_RoPE_Activation_Hparams_8xH100/submission.json
@@ -0,0 +1,28 @@
+{
+  "author": "Kamil Krawczyk",
+  "github_id": "bsisduck",
+  "name": "Ablation: WiderGate, RoPE dims, activation slopes, hparam stack",
+  "date": "2026-04-30",
+  "track": "non_record_16mb",
+  "type": "ablation",
+  "hardware": "8xH100 80GB SXM",
+  "base_config": "PR #1693 + CaseOps SP8192 + AttnOutGate + SmearGate + PolarNS",
+  "experiments": {
+    "gates_caseops": {"val_bpb": 1.0674, "change": "baseline", "artifact_bytes": 16934743},
+    "optimized_v1": {"val_bpb": 1.0703, "change": "WARMDOWN=0.85 BETA2=0.99 clips=tighter", "artifact_bytes": 17256881},
+    "rope24": {"val_bpb": 1.0706, "change": "ROPE_DIMS=24", "artifact_bytes": 16394907},
+    "rope32": {"val_bpb": 1.0698, "change": "ROPE_DIMS=32", "artifact_bytes": 16392886},
+    "v2_baseline": {"val_bpb": 1.0819, "change": "EMBED_BITS=6 LQER_ENABLED=0", "artifact_bytes": 15885705},
+    "v2_slope03": {"val_bpb": 1.0822, "change": "LEAKY_SLOPE=0.3", "artifact_bytes": 15889757},
+    "v2_slope00": {"val_bpb": 1.0837, "change": "LEAKY_SLOPE=0.0 (ReLU²)", "artifact_bytes": 15892782},
+    "v2_gate32": {"val_bpb": 1.0788, "change": "GATE_WIDTH=32", "artifact_bytes": 15891157}
+  },
+  "key_findings": [
+    "GATE_WIDTH=32 improves val_bpb by -0.003 (novel finding)",
+    "ROPE_DIMS >16 hurts post-quantization despite better pre-quant",
+    "LeakyReLU slope 0.3/0.0 neutral or worse vs 0.5",
+    "PR #1855 hparam stack does not transfer to this config",
+    "EMBED_BITS=6 costs +0.014 BPB",
+    "LZMA worse than brotli by +300KB"
+  ]
+}