
SP8192 CaseOps + WiderGate32 + GPTQ-int6 — val_bpb 1.08037 (3-seed mean) #1969

Open
bsisduck wants to merge 3 commits into openai:main from bsisduck:submission/wider-gate32-caseops-1.0804

Conversation

@bsisduck

Submission: SP8192 CaseOps + WiderGate32 + PolarNS Muon + GPTQ-int6

val_bpb: 1.08037 (3-seed mean, std 0.00139) | ~15.9 MB | 8×H100 SXM, 600s wallclock | TTT eval

Results

| Seed | Pre-quant val_bpb | Post-quant val_bpb | Post-TTT val_bpb | Artifact size (bytes) |
|------|-------------------|--------------------|------------------|-----------------------|
| 0    | 1.07175 | 1.09419 | 1.08196 | 15,890,131 |
| 42   | 1.07039 | 1.09076 | 1.07983 | 15,887,137 |
| 1234 | 1.06982 | 1.09058 | 1.07932 | 15,888,516 |
| Mean |         |         | 1.08037 | 15,888,595 |

Architecture

| Component | Setting | Source |
|-----------|---------|--------|
| Layers | 11 (512d, 8 GQA heads, 4 KV heads) | Baseline |
| MLP | 4× (2048) with LeakyReLU(0.5)² | #493 |
| Attention | FA3, GQA 2:1 | Baseline |
| RoPE | Partial (16/64 dims), base 10000 | #315 |
| U-Net skips | Encoder-decoder skip connections + skip gates | #289 |
| Parallel decoder | 2-lane parallel from layer 8+ | #1530 |
| Depth recurrence | Loop layers 3-5, NUM_LOOPS=2 (17 virtual layers; see sketch below) | #1344 |
| Logit softcap | 30 | Baseline |
| Wider AttnOutGate | Per-head output gate, GATE_WIDTH=32 (vs standard 12) | #1787 + this work |
| SmearGate | Position-mixing gate, width=32 | #1667 |
| Polar-Express Muon | 5 NS steps, per-iter minimax tuples, momentum 0.97 | #1344 |
| MIN_LR floor | 0.10 (warmdown LR floor) | #1787 |
| Quantization | GPTQ int6 all weights (EMBED_BITS=6) + brotli-11 | |
| TTT | LoRA rank-96, 1 phase, 2000 prefix docs | #1610 |
| Tokenizer | SP8192 CaseOps (bijective case markers) | #1729 |
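
For reference, a minimal sketch of how the depth-recurrence row could be wired, assuming NUM_LOOPS counts extra replays of the looped span (which matches the stated 11 + 2×3 = 17 virtual layers). The module and argument names here are hypothetical, not taken from PR #1344:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Depth recurrence sketch: layers in [loop_start, loop_end] are
    replayed num_loops extra times, adding virtual depth with no new
    parameters."""

    def __init__(self, layers, loop_start=3, loop_end=5, num_loops=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.loop_start = loop_start
        self.loop_end = loop_end
        self.num_loops = num_loops

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.loop_end:
                # Replay layers [loop_start, loop_end] num_loops more times:
                # 11 physical passes + 2 * 3 replays = 17 virtual layers.
                for _ in range(self.num_loops):
                    for j in range(self.loop_start, self.loop_end + 1):
                        x = self.layers[j](x)
        return x
```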

Key Innovation: Wider Attention Output Gates

Standard AttnOutGate (PR #1787) takes the first 12 dimensions of the residual stream as input to a per-head gate:

```python
gate_in = x_orig[:, :, :12]                    # standard: 12 dims
gate = 2.0 * sigmoid(linear(gate_in, gate_w))  # -> one scalar per head, in (0, 2)
y = attn_output * gate
```

We widen the gate input to 32 dimensions (GATE_WIDTH=32), giving each head a richer view of the residual stream:

```python
gate_in = x_orig[:, :, :gate_w.shape[-1]]  # wider: 32 dims
```

- Gate params per layer: 32 × 8 heads = 256 (vs 96 with width=12)
- Total extra params: 1,760 across 11 layers (float16 passthrough, negligible)
- Pre-quant improvement: −0.002 BPB vs width=12

The same widening is applied to SmearGate for consistency.
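
Putting the two snippets together, a self-contained sketch of the gate module parameterized by GATE_WIDTH (the tensor layout and module name are assumptions, not copied from the submission):

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """Per-head attention output gate. Reads the first `gate_width` dims
    of the pre-attention residual stream and emits one scalar in (0, 2)
    per head."""

    def __init__(self, n_heads: int, gate_width: int = 32):
        super().__init__()
        # Weights per layer: gate_width * n_heads (32 * 8 = 256 vs 96 at width 12)
        self.proj = nn.Linear(gate_width, n_heads, bias=False)

    def forward(self, x_orig: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x_orig: (B, T, d_model); attn_out: (B, T, n_heads, head_dim)
        gate_in = x_orig[:, :, : self.proj.in_features]  # (B, T, gate_width)
        gate = 2.0 * torch.sigmoid(self.proj(gate_in))   # (B, T, n_heads)
        return attn_out * gate.unsqueeze(-1)             # broadcast over head_dim

# Extra parameters vs width 12: (32 - 12) * 8 heads * 11 layers = 1,760
```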

Training Configuration

```sh
VOCAB_SIZE=8192
DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
MAX_WALLCLOCK_SECONDS=600
POLAR_EXPRESS_NS=1
LQER_ENABLED=0
MIN_LR=0.10
EMBED_BITS=6
COMPRESSOR=brotli
ATTN_OUT_GATE=1
SMEAR_GATE=1
GATE_WIDTH=32
```
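
As a rough illustration of what EMBED_BITS=6 plus COMPRESSOR=brotli implies for the artifact, here is a hypothetical int6 bit-packing + brotli quality-11 step; the submission's actual GPTQ export code is not shown in this PR:

```python
import brotli
import numpy as np

def pack_int6_and_compress(q: np.ndarray) -> bytes:
    """Bit-pack int6 quantization codes (values in [0, 64)) and compress
    with brotli at quality 11. Hypothetical packing scheme."""
    assert q.min() >= 0 and q.max() < 64
    # Keep only the low 6 bits of each uint8 code, then repack densely.
    bits = np.unpackbits(q.astype(np.uint8).reshape(-1, 1), axis=1)[:, 2:]
    packed = np.packbits(bits.reshape(-1))
    return brotli.compress(packed.tobytes(), quality=11)
```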

Reproduction

```sh
pip install "torch>=2.9.0" sentencepiece brotli triton  # quoted so the shell doesn't treat >= as a redirect
python prepare_caseops_data.py
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
