|
| 1 | +# Submission: SP8192 CaseOps + WiderGate32 + PolarNS Muon + GPTQ-int6 |
| 2 | + |
| 3 | +**val_bpb: 1.08037** (3-seed mean, std 0.00139) | **~15.9 MB** | 8×H100 SXM, 600s wallclock | TTT eval |
| 4 | + |
| 5 | +## Results |
| 6 | + |
| 7 | +| Seed | Pre-quant val_bpb | Post-quant val_bpb | **Post-TTT val_bpb** | Artifact | |
| 8 | +|------|-------------------|--------------------|----------------------|----------| |
| 9 | +| 0 | 1.07175 | 1.09419 | **1.08196** | 15,890,131 | |
| 10 | +| 42 | 1.07039 | 1.09076 | **1.07983** | 15,887,137 | |
| 11 | +| 1234 | 1.06982 | 1.09058 | **1.07932** | 15,888,516 | |
| 12 | +| **Mean** | | | **1.08037** | 15,888,595 | |
| 13 | + |
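The summary figures can be rechecked from the table with a few lines of Python. This is a minimal sketch, assuming the reported std is the sample standard deviation (n − 1 denominator); the variable names are illustrative and not taken from the submission code.

```python
from statistics import mean, stdev

# Post-TTT val_bpb per seed (0, 42, 1234), copied from the Results table.
post_ttt_bpb = [1.08196, 1.07983, 1.07932]
# Artifact sizes in bytes for the same seeds.
artifact_bytes = [15_890_131, 15_887_137, 15_888_516]

bpb_mean = mean(post_ttt_bpb)
bpb_std = stdev(post_ttt_bpb)      # sample std; an assumption about the reported figure
size_mean = mean(artifact_bytes)

print(f"mean post-TTT val_bpb: {bpb_mean:.5f}")
print(f"std  post-TTT val_bpb: {bpb_std:.5f}")
print(f"mean artifact size:    {size_mean:,.0f} bytes")
```

Because the table values are rounded to five decimals, the recomputed std can differ from the headline figure in the last digit.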
| 14 | +## Architecture |
| 15 | + |
| 16 | +| Component | Setting | Source | |
| 17 | +|-----------|---------|--------| |
| 18 | +| Layers | 11 (512d, 8 GQA heads, 4 KV heads) | Baseline | |
| 19 | +| MLP | 4× (2048) with LeakyReLU(0.5)² | [#493](https://github.com/openai/parameter-golf/pull/493) | |
| 20 | +| Attention | FA3, GQA 2:1 | Baseline | |
| 21 | +| RoPE | Partial (16/64 dims), base 10000 | [#315](https://github.com/openai/parameter-golf/pull/315) | |
| 22 | +| U-Net skips | Encoder-decoder skip connections + skip gates | [#289](https://github.com/openai/parameter-golf/pull/289) | |
| 23 | +| Parallel decoder | 2-lane parallel from layer 8+ | [#1530](https://github.com/openai/parameter-golf/pull/1530) | |
| 24 | +| Depth recurrence | Loop layers 3-5, NUM_LOOPS=2 (17 virtual layers) | [#1344](https://github.com/openai/parameter-golf/pull/1344) | |
| 25 | +| Logit softcap | 30 | Baseline | |
| 26 | +| **Wider AttnOutGate** | Per-head output gate, **GATE_WIDTH=32** (vs standard 12) | [#1787](https://github.com/openai/parameter-golf/pull/1787) + **this work** | |
| 27 | +| **SmearGate** | Position-mixing gate, width=32 | [#1667](https://github.com/openai/parameter-golf/pull/1667) | |
| 28 | +| **Polar-Express Muon** | 5 NS steps, per-iter minimax tuples, momentum 0.97 | [#1344](https://github.com/openai/parameter-golf/pull/1344) | |
| 29 | +| **MIN_LR floor** | 0.10 (warmdown LR floor) | [#1787](https://github.com/openai/parameter-golf/pull/1787) | |
| 30 | +| Quantization | GPTQ int6 all weights (EMBED_BITS=6) + brotli-11 | | |
| 31 | +| TTT | LoRA rank-96, 1 phase, 2000 prefix docs | [#1610](https://github.com/openai/parameter-golf/pull/1610) | |
| 32 | +| Tokenizer | SP8192 CaseOps (bijective case markers) | [#1729](https://github.com/openai/parameter-golf/pull/1729) | |
| 33 | + |
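The derived numbers in the table (17 virtual layers, GQA 2:1, 64-dim heads) follow from the base settings. The sketch below shows one way they fit together. It assumes `NUM_LOOPS` counts extra passes over layers 3-5 and that the looped block is repeated back to back; the actual execution order in `train_gpt.py` may differ, and `virtual_layer_schedule` is a hypothetical helper.

```python
NUM_LAYERS = 11               # physical layers
D_MODEL = 512
NUM_HEADS, NUM_KV_HEADS = 8, 4
LOOP_START, LOOP_END = 3, 5   # looped layers, inclusive
NUM_LOOPS = 2                 # assumed to mean extra passes over the looped block


def virtual_layer_schedule(num_layers, loop_start, loop_end, num_loops):
    """Return the physical layer index used at each virtual depth (hypothetical ordering)."""
    before = list(range(0, loop_start))
    looped = list(range(loop_start, loop_end + 1))
    after = list(range(loop_end + 1, num_layers))
    return before + looped * (1 + num_loops) + after


schedule = virtual_layer_schedule(NUM_LAYERS, LOOP_START, LOOP_END, NUM_LOOPS)
head_dim = D_MODEL // NUM_HEADS          # per-head width, matches the 64 in "16/64 dims"
gqa_ratio = NUM_HEADS // NUM_KV_HEADS    # query heads per KV head

print(len(schedule), head_dim, gqa_ratio)
```

Under these assumptions the schedule has 3 + 3 × 3 + 5 = 17 entries, which agrees with the "17 virtual layers" figure.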
| 34 | +## Key Innovation: Wider Attention Output Gates |
| 35 | + |
| 36 | +Standard AttnOutGate (PR #1787) uses 12 input dimensions from the residual stream to compute per-head gating: |
| 37 | + |
| 38 | +```python |
| 39 | +gate_in = x_orig[:, :, :12] # standard: 12 dims |
| 40 | +gate = 2.0 * sigmoid(linear(gate_in, gate_w)) # -> per-head scalar |
| 41 | +y = attn_output * gate |
| 42 | +``` |
| 43 | + |
| 44 | +We widen the gate input to the first 32 residual-stream dimensions (`GATE_WIDTH=32`), so each head's gate is computed from 32 features instead of 12: |
| 45 | + |
| 46 | +```python |
| 47 | +gate_in = x_orig[:, :, :gate_w.shape[-1]] # wider: 32 dims |
| 48 | +``` |
| 49 | + |
| 50 | +- Gate params per layer: 32 × 8 heads = 256 (vs 96 with width=12) |
| 51 | +- Total extra params: 1,760 across 11 layers (float16 passthrough, negligible) |
| 52 | +- **Pre-quant improvement: −0.002 BPB** vs width=12 |
| 53 | + |
| 54 | +The same widening is applied to SmearGate for consistency. |
| 55 | + |
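To make the gating and the parameter arithmetic concrete, here is a dependency-free sketch for a single token. It is not the submission's implementation, which operates on batched tensors; `attn_out_gate` and `gate_param_count` are hypothetical helpers, and the zero-initialised weights are chosen only so that the output is easy to check.

```python
import math


def attn_out_gate(x_token, attn_heads, gate_w):
    """Scale each head's output by 2 * sigmoid(w_h . x[:width]).

    x_token:    residual-stream vector for one token (length d_model)
    attn_heads: per-head attention outputs, shape [num_heads][head_dim]
    gate_w:     gate weights, shape [num_heads][width]
    """
    width = len(gate_w[0])
    gate_in = x_token[:width]  # first `width` residual dims (12 or 32)
    gated = []
    for w_h, head_out in zip(gate_w, attn_heads):
        logit = sum(w * x for w, x in zip(w_h, gate_in))
        gate = 2.0 / (1.0 + math.exp(-logit))  # 2 * sigmoid, so gate lies in (0, 2)
        gated.append([gate * v for v in head_out])
    return gated


def gate_param_count(width, num_heads=8, num_layers=11):
    """Total gate weights across all layers for a given input width."""
    return width * num_heads * num_layers


D_MODEL, NUM_HEADS, HEAD_DIM, GATE_WIDTH = 512, 8, 64, 32
x = [0.01 * i for i in range(D_MODEL)]
heads = [[1.0] * HEAD_DIM for _ in range(NUM_HEADS)]
# Zero weights are an illustrative choice, not a claim about the real init.
zero_w = [[0.0] * GATE_WIDTH for _ in range(NUM_HEADS)]

out = attn_out_gate(x, heads, zero_w)
extra_params = gate_param_count(32) - gate_param_count(12)
print(out[0][0], extra_params)
```

With zero weights every gate equals 2 × sigmoid(0) = 1, so the attention output passes through unchanged. The parameter difference between width 32 and width 12 comes out to the 1,760 quoted above.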
| 56 | +## Training Configuration |
| 57 | + |
| 58 | +```bash |
| 59 | +VOCAB_SIZE=8192 |
| 60 | +DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops |
| 61 | +TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model |
| 62 | +MAX_WALLCLOCK_SECONDS=600 |
| 63 | +POLAR_EXPRESS_NS=1 |
| 64 | +LQER_ENABLED=0 |
| 65 | +MIN_LR=0.10 |
| 66 | +EMBED_BITS=6 |
| 67 | +COMPRESSOR=brotli |
| 68 | +ATTN_OUT_GATE=1 |
| 69 | +SMEAR_GATE=1 |
| 70 | +GATE_WIDTH=32 |
| 71 | +``` |
| 72 | + |
| 73 | +## Reproduction |
| 74 | + |
| 75 | +```bash |
| 76 | +pip install "torch>=2.9.0" sentencepiece brotli triton |
| 77 | +python prepare_caseops_data.py |
| 78 | +torchrun --standalone --nproc_per_node=8 train_gpt.py |
| 79 | +``` |