
5 novel architecture ablations on SOTA baseline #584

Closed

ssatia wants to merge 1 commit into openai:main from ssatia:claude/brave-tereshkova

Conversation


@ssatia ssatia commented Mar 23, 2026

Summary

  • 5 self-contained ablation scripts built on top of the current SOTA record (11L EMA + GPTQ-lite, val_bpb=1.1233)
  • Each file is a complete training script with a single targeted modification, ready to run with torchrun --nproc_per_node=8
  • Includes baseline.py as the unmodified control

Ablations

| # | File | Technique | Param Cost | Env Vars |
|---|------|-----------|------------|----------|
| 1 | ablation1_swiglu.py | SwiGLU MLP (replaces relu²) | ~neutral | none |
| 2 | ablation2_sliding_window.py | Window attention on early layers | zero | SW_WINDOW_SIZE=256, SW_NUM_LAYERS=5 |
| 3 | ablation3_register_tokens.py | Learnable register/sink tokens | ~8 KB | NUM_REGISTERS=4 |
| 4 | ablation4_head_temperature.py | Gated V-norm (learned RMS norm on values) | 4 scalars | none |
| 5 | ablation5_mixture_of_softmax.py | Mixture of Softmax (breaks the rank bottleneck) | ~384 KB at K=2 | MOS_NUM_EXPERTS=2 |
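
For orientation, here is a minimal sketch of how an early-layer sliding-window mask could be built and applied with PyTorch SDPA. The function names and the os.environ plumbing are illustrative assumptions, not necessarily how ablation 2 implements it; the PR notes the ablation relies on FlashAttention 3's native window support instead.

```python
import os
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int, device=None) -> torch.Tensor:
    # Boolean mask: True where attention is allowed
    # (causal, each query sees itself plus the previous window - 1 tokens).
    pos = torch.arange(seq_len, device=device)
    rel = pos[None, :] - pos[:, None]  # key position minus query position
    return (rel <= 0) & (rel > -window)

def windowed_attention(q, k, v):
    # q, k, v: [batch, heads, seq, head_dim]; would only be used
    # in the first SW_NUM_LAYERS attention layers.
    window = int(os.environ.get("SW_WINDOW_SIZE", "256"))
    mask = sliding_window_mask(q.size(-2), window, device=q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```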

Rationale

These ablations target techniques not yet explored in the competition:

  • SwiGLU: Standard in LLaMA/Gemma, never compared against relu² in this codebase (see the sketch after this list)
  • Sliding Window: FlashAttn 3 supports it natively, saving FLOPs on early layers
  • Register Tokens: From "ViTs Need Registers" (ICLR 2024), absorbs attention sinks; a sketch follows this list
  • Gated V-Norm: Q and K are normalized but V is not, which should help quantization
  • MoS: Yang et al. 2018, "Breaking the Softmax Bottleneck"; directly targets the BPB metric
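
A minimal sketch of the SwiGLU swap in ablation 1, assuming the baseline MLP is a two-matrix relu² block; the module and weight names are mine, and the hidden width would typically be scaled down (roughly two thirds of the relu² hidden size) to keep the parameter count near-neutral, as the table above indicates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP: down(silu(gate(x)) * up(x)), replacing down(relu(up(x)) ** 2)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```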
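
A rough sketch of the register/sink-token idea from ablation 3: a few learned embeddings are prepended before the transformer blocks and stripped before the LM head, so they can only soak up attention. The class name and env-var plumbing are assumptions; with NUM_REGISTERS=4 and a typical model width in bf16 this lands in the ~8 KB range quoted in the table.

```python
import os
import torch
import torch.nn as nn

class RegisterTokens(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        num = int(os.environ.get("NUM_REGISTERS", "4"))
        self.registers = nn.Parameter(torch.randn(num, dim) * 0.02)

    def prepend(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, dim] -> [batch, num_registers + seq, dim]
        regs = self.registers.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([regs, x], dim=1)

    def strip(self, x: torch.Tensor) -> torch.Tensor:
        # Drop the register positions before the LM head / loss.
        return x[:, self.registers.size(0):, :]
```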

Test plan

  • Run baseline.py with SEED=1337 as control
  • Run each ablation with SEED=1337
  • Compare final_int6_sliding_window_exact val_bpb across all runs
  • Verify all artifacts fit within 16MB budget

🤖 Generated with Claude Code

Each ablation is a self-contained training script based on the latest
record, with a single targeted modification to isolate its effect:

1. SwiGLU MLP - replaces relu-squared with gated linear unit
2. Sliding Window Attention - window attn on early encoder layers
3. Register/Sink Tokens - learnable prefix tokens as attention sinks
4. Gated Value Normalization - learned RMS norm gate on V vectors
5. Mixture of Softmax - breaks the softmax rank bottleneck (sketched below)
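
A hedged sketch of a mixture-of-softmax head in the spirit of Yang et al. 2018: K expert projections plus a learned prior, combined in probability space so the output distribution is no longer rank-limited by d_model. The class name and layer shapes are placeholders; how the added parameters map onto the ~384 KB figure in the table depends on d_model and on how the actual ablation parameterizes the experts.

```python
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxHead(nn.Module):
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.k = int(os.environ.get("MOS_NUM_EXPERTS", "2"))
        self.prior = nn.Linear(dim, self.k, bias=False)        # mixture weights
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(self.k)
        )
        self.unembed = nn.Linear(dim, vocab_size, bias=False)  # shared unembedding

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq, dim] -> log-probs [batch, seq, vocab]
        pi = F.log_softmax(self.prior(h), dim=-1)               # [B, T, K]
        logps = torch.stack(
            [F.log_softmax(self.unembed(torch.tanh(e(h))), dim=-1) for e in self.experts],
            dim=-2,
        )                                                        # [B, T, K, V]
        # p(x) = sum_k pi_k * softmax_k(x), computed stably in log space.
        return torch.logsumexp(pi.unsqueeze(-1) + logps, dim=-2)
```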

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
