Commit 6bb607a

Submission: SP8192 CaseOps + WiderGate32 + GPTQ-int6 — val_bpb 1.08037 (3-seed mean)
1 parent 9d070df commit 6bb607a

10 files changed

Lines changed: 6020 additions & 0 deletions

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# Submission: SP8192 CaseOps + WiderGate32 + PolarNS Muon + GPTQ-int6

**val_bpb: 1.08037** (3-seed mean, std 0.00139) | **~15.9 MB** | 8×H100 SXM, 600s wallclock | TTT eval

## Results

| Seed | Pre-quant val_bpb | Post-quant val_bpb | **Post-TTT val_bpb** | Artifact (bytes) |
|------|-------------------|--------------------|----------------------|------------------|
| 0 | 1.07175 | 1.09419 | **1.08196** | 15,890,131 |
| 42 | 1.07039 | 1.09076 | **1.07983** | 15,887,137 |
| 1234 | 1.06982 | 1.09058 | **1.07932** | 15,888,516 |
| **Mean** | | | **1.08037** | 15,888,595 |

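The summary figures can be re-derived from the table. The sketch below is a minimal check using only the rounded values shown above, so the standard deviation may differ from the reported 0.00139 in the last digit. The names `quant_cost` and `ttt_gain` are illustrative labels, not terms from the submission.

```python
from statistics import mean, stdev

# (pre-quant, post-quant, post-TTT) val_bpb per seed, copied from the Results table.
results = {
    0: (1.07175, 1.09419, 1.08196),
    42: (1.07039, 1.09076, 1.07983),
    1234: (1.06982, 1.09058, 1.07932),
}
artifact_bytes = [15_890_131, 15_887_137, 15_888_516]

post_ttt = [row[2] for row in results.values()]
mean_bpb = mean(post_ttt)
std_bpb = stdev(post_ttt)  # sample std (n - 1), from the rounded table values

for seed, (pre, post, ttt) in results.items():
    quant_cost = post - pre  # BPB lost when going from pre-quant to post-quant
    ttt_gain = post - ttt    # BPB recovered between post-quant and post-TTT
    print(f"seed {seed}: quant cost {quant_cost:+.5f}, TTT gain {ttt_gain:.5f}")

print(f"mean post-TTT val_bpb: {mean_bpb:.5f}")
print(f"sample std: {std_bpb:.5f}")
print(f"mean artifact size: {round(mean(artifact_bytes)):,} bytes")
```

In every seed, quantization costs roughly 0.02 BPB and TTT recovers a little over half of it.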
## Architecture

| Component | Setting | Source |
|-----------|---------|--------|
| Layers | 11 (512d, 8 GQA heads, 4 KV heads) | Baseline |
| MLP | 4× (2048) with LeakyReLU(0.5)² | [#493](https://github.com/openai/parameter-golf/pull/493) |
| Attention | FA3, GQA 2:1 | Baseline |
| RoPE | Partial (16/64 dims), base 10000 | [#315](https://github.com/openai/parameter-golf/pull/315) |
| U-Net skips | Encoder-decoder skip connections + skip gates | [#289](https://github.com/openai/parameter-golf/pull/289) |
| Parallel decoder | 2-lane parallel from layer 8+ | [#1530](https://github.com/openai/parameter-golf/pull/1530) |
| Depth recurrence | Loop layers 3-5, NUM_LOOPS=2 (17 virtual layers) | [#1344](https://github.com/openai/parameter-golf/pull/1344) |
| Logit softcap | 30 | Baseline |
| **Wider AttnOutGate** | Per-head output gate, **GATE_WIDTH=32** (vs standard 12) | [#1787](https://github.com/openai/parameter-golf/pull/1787) + **this work** |
| **SmearGate** | Position-mixing gate, width=32 | [#1667](https://github.com/openai/parameter-golf/pull/1667) |
| **Polar-Express Muon** | 5 NS steps, per-iter minimax tuples, momentum 0.97 | [#1344](https://github.com/openai/parameter-golf/pull/1344) |
| **MIN_LR floor** | 0.10 (warmdown LR floor) | [#1787](https://github.com/openai/parameter-golf/pull/1787) |
| Quantization | GPTQ int6 all weights (EMBED_BITS=6) + brotli-11 | |
| TTT | LoRA rank-96, 1 phase, 2000 prefix docs | [#1610](https://github.com/openai/parameter-golf/pull/1610) |
| Tokenizer | SP8192 CaseOps (bijective case markers) | [#1729](https://github.com/openai/parameter-golf/pull/1729) |

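Several numbers in the table follow from one another. The sketch below reproduces them under two assumptions that the table does not state outright: the head dimension is the model width divided by the number of query heads, and each looped layer runs `NUM_LOOPS` extra times on top of its normal pass. Variable names are illustrative.

```python
# Values copied from the Architecture table.
num_layers = 11
model_dim = 512
num_q_heads = 8
num_kv_heads = 4
loop_layers = [3, 4, 5]  # "Loop layers 3-5"
num_loops = 2            # NUM_LOOPS=2
rope_dims = 16           # partial RoPE rotates 16 dims per head

# Assumption: head_dim = model_dim / query heads, which gives the 64 in "16/64 dims".
head_dim = model_dim // num_q_heads
gqa_ratio = num_q_heads // num_kv_heads  # query heads sharing each KV head
mlp_hidden = 4 * model_dim

# Assumption: every looped layer is executed num_loops additional times.
virtual_layers = num_layers + len(loop_layers) * num_loops

print(f"head_dim={head_dim}, rotated fraction={rope_dims / head_dim:.2f}")
print(f"GQA ratio {gqa_ratio}:1, MLP hidden {mlp_hidden}")
print(f"virtual layers: {virtual_layers}")
```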
## Key Innovation: Wider Attention Output Gates

The standard AttnOutGate (PR #1787) computes a per-head gate from the first 12 dimensions of the residual stream:

```python
gate_in = x_orig[:, :, :12] # standard: 12 dims
gate = 2.0 * sigmoid(linear(gate_in, gate_w)) # -> per-head scalar
y = attn_output * gate
```

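We widen the gate input to the first 32 dimensions (`GATE_WIDTH=32`), so each head's gate is computed from more of the residual stream. The slice width is taken from the last dimension of the gate weight:
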
```python
gate_in = x_orig[:, :, :gate_w.shape[-1]] # wider: 32 dims
```

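To make the gating arithmetic concrete, here is a framework-free sketch for a single token. It is not the submission's code: the function name `attn_out_gate`, the nested-list layout, and the toy sizes are assumptions for illustration, and it assumes the per-head scalar is broadcast across that head's output vector.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def attn_out_gate(x_orig, attn_output, gate_w):
    """Gate one token's attention output with one scalar per head.

    x_orig:      residual-stream vector, length model_dim
    attn_output: per-head output vectors, shape [num_heads][head_dim]
    gate_w:      gate weights, shape [num_heads][gate_width]
    """
    gate_width = len(gate_w[0])    # mirrors gate_w.shape[-1]
    gate_in = x_orig[:gate_width]  # first gate_width residual dims
    gated = []
    for w_head, head_out in zip(gate_w, attn_output):
        logit = sum(w * x for w, x in zip(w_head, gate_in))
        gate = 2.0 * sigmoid(logit)  # lies in (0, 2); exactly 1 at logit 0
        gated.append([gate * v for v in head_out])
    return gated

# Toy sizes (hypothetical): 2 heads, head_dim 3, model_dim 40, gate width 32.
x = [0.1 * i for i in range(40)]
attn = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
zero_w = [[0.0] * 32 for _ in range(2)]
y = attn_out_gate(x, attn, zero_w)
print(y == attn)  # all-zero gate weights leave the attention output unchanged
```

The factor of 2 centers the gate at 1, so a head can be either attenuated or amplified depending on the sign of its gate logit.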
- Gate params per layer: 32 × 8 heads = 256 (vs 96 with width=12)
- Total extra params: 160 per layer × 11 layers = 1,760 (float16 passthrough, negligible)
- **Pre-quant improvement: −0.002 BPB** vs width=12

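The parameter counts above can be checked in a few lines. This sketch assumes one weight per (head, input dimension) pair with no bias term, which is what the 96 and 256 figures imply; the helper name is illustrative.

```python
num_heads = 8
num_layers = 11

def gate_params_per_layer(gate_width: int) -> int:
    # One weight per (head, input dim); no bias, matching the counts in the list above.
    return gate_width * num_heads

standard = gate_params_per_layer(12)
wide = gate_params_per_layer(32)
extra_total = (wide - standard) * num_layers
extra_bytes = extra_total * 2  # float16 stores 2 bytes per parameter

print(f"per layer: {standard} -> {wide}")
print(f"extra params: {extra_total:,} ({extra_bytes:,} bytes before compression)")
```

At a few kilobytes, the widened gates are a tiny fraction of the ~15.9 MB artifact.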
The same widening is applied to SmearGate for consistency.

## Training Configuration

```bash
VOCAB_SIZE=8192
DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
MAX_WALLCLOCK_SECONDS=600
POLAR_EXPRESS_NS=1
LQER_ENABLED=0
MIN_LR=0.10
EMBED_BITS=6
COMPRESSOR=brotli
ATTN_OUT_GATE=1
SMEAR_GATE=1
GATE_WIDTH=32
```

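One way to apply these settings programmatically is to parse the block into a dictionary and pass it as the environment of the training process. This is a sketch that assumes `train_gpt.py` reads its hyperparameters from environment variables, as the `KEY=VALUE` form suggests; `parse_env_block` is a hypothetical helper, and the command is built but not executed here.

```python
import os

# Settings copied from the Training Configuration block.
CONFIG = """
VOCAB_SIZE=8192
DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
MAX_WALLCLOCK_SECONDS=600
POLAR_EXPRESS_NS=1
LQER_ENABLED=0
MIN_LR=0.10
EMBED_BITS=6
COMPRESSOR=brotli
ATTN_OUT_GATE=1
SMEAR_GATE=1
GATE_WIDTH=32
"""

def parse_env_block(text: str) -> dict[str, str]:
    """Turn KEY=VALUE lines into a dict, splitting on the first '=' only."""
    env = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

overrides = parse_env_block(CONFIG)
env = {**os.environ, **overrides}  # suitable for subprocess.run(launch_cmd, env=env)
launch_cmd = ["torchrun", "--standalone", "--nproc_per_node=8", "train_gpt.py"]

print(f"{len(overrides)} settings, GATE_WIDTH={overrides['GATE_WIDTH']}")
```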
## Reproduction

```bash
# Quote the requirement so the shell does not treat ">" as an output redirect.
pip install "torch>=2.9.0" sentencepiece brotli triton
python prepare_caseops_data.py
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
