
5 novel architecture ablations on SOTA baseline #584

Closed

ssatia wants to merge 1 commit into openai:main from ssatia:claude/brave-tereshkova

Conversation


@ssatia ssatia commented Mar 23, 2026

Summary

  • 5 self-contained ablation scripts built on top of the current SOTA record (11L EMA + GPTQ-lite, val_bpb=1.1233)
  • Each file is a complete training script with a single targeted modification, ready to run with torchrun --nproc_per_node=8
  • Includes baseline.py as the unmodified control

Ablations

| # | File | Technique | Param Cost | Env Vars |
|---|------|-----------|------------|----------|
| 1 | ablation1_swiglu.py | SwiGLU MLP (replaces relu²) | ~neutral | none |
| 2 | ablation2_sliding_window.py | Window attention on early layers | zero | SW_WINDOW_SIZE=256, SW_NUM_LAYERS=5 |
| 3 | ablation3_register_tokens.py | Learnable register/sink tokens | ~8 KB | NUM_REGISTERS=4 |
| 4 | ablation4_head_temperature.py | Gated V-norm (learned RMS norm on values) | 4 scalars | none |
| 5 | ablation5_mixture_of_softmax.py | Mixture of Softmax (breaks the rank bottleneck) | ~384 KB at K=2 | MOS_NUM_EXPERTS=2 |
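
For orientation, here is a minimal sketch of how an early-layer sliding-window mask could be built and applied with PyTorch SDPA. The function names and the os.environ plumbing are illustrative assumptions, not necessarily how ablation 2 implements it; the PR notes the ablation relies on FlashAttention 3's native window support instead.

```python
import os
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int, device=None) -> torch.Tensor:
    # Boolean mask: True where attention is allowed
    # (causal, each query sees itself plus the previous window - 1 tokens).
    pos = torch.arange(seq_len, device=device)
    rel = pos[None, :] - pos[:, None]  # key position minus query position
    return (rel <= 0) & (rel > -window)

def windowed_attention(q, k, v):
    # q, k, v: [batch, heads, seq, head_dim]; would only be used
    # in the first SW_NUM_LAYERS attention layers.
    window = int(os.environ.get("SW_WINDOW_SIZE", "256"))
    mask = sliding_window_mask(q.size(-2), window, device=q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```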

Rationale

These ablations target techniques not yet explored in the competition:

  • SwiGLU: Standard in LLaMA/Gemma, never compared against relu² in this codebase (see the sketch after this list)
  • Sliding Window: FlashAttn 3 supports it natively, saving FLOPs on early layers
  • Register Tokens: From "ViTs Need Registers" (ICLR 2024), absorbs attention sinks; a sketch follows this list
  • Gated V-Norm: Q and K are normalized but V is not, which should help quantization
  • MoS: Yang et al. 2018, "Breaking the Softmax Bottleneck"; directly targets the BPB metric
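
A minimal sketch of the SwiGLU swap in ablation 1, assuming the baseline MLP is a two-matrix relu² block; the module and weight names are mine, and the hidden width would typically be scaled down (roughly two thirds of the relu² hidden size) to keep the parameter count near-neutral, as the table above indicates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP: down(silu(gate(x)) * up(x)), replacing down(relu(up(x)) ** 2)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```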
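
A rough sketch of the register/sink-token idea from ablation 3: a few learned embeddings are prepended before the transformer blocks and stripped before the LM head, so they can only soak up attention. The class name and env-var plumbing are assumptions; with NUM_REGISTERS=4 and a typical model width in bf16 this lands in the ~8 KB range quoted in the table.

```python
import os
import torch
import torch.nn as nn

class RegisterTokens(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        num = int(os.environ.get("NUM_REGISTERS", "4"))
        self.registers = nn.Parameter(torch.randn(num, dim) * 0.02)

    def prepend(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, dim] -> [batch, num_registers + seq, dim]
        regs = self.registers.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([regs, x], dim=1)

    def strip(self, x: torch.Tensor) -> torch.Tensor:
        # Drop the register positions before the LM head / loss.
        return x[:, self.registers.size(0):, :]
```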

Test plan

  • Run baseline.py with SEED=1337 as control
  • Run each ablation with SEED=1337
  • Compare final_int6_sliding_window_exact val_bpb across all runs
  • Verify all artifacts fit within 16MB budget

🤖 Generated with Claude Code

Each ablation is a self-contained training script based on the latest
record, with a single targeted modification to isolate its effect:

1. SwiGLU MLP - replaces relu-squared with gated linear unit
2. Sliding Window Attention - window attn on early encoder layers
3. Register/Sink Tokens - learnable prefix tokens as attention sinks
4. Gated Value Normalization - learned RMS norm gate on V vectors
5. Mixture of Softmax - breaks the softmax rank bottleneck (sketched below)
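
A hedged sketch of a mixture-of-softmax head in the spirit of Yang et al. 2018: K expert projections plus a learned prior, combined in probability space so the output distribution is no longer rank-limited by d_model. The class name and layer shapes are placeholders; how the added parameters map onto the ~384 KB figure in the table depends on d_model and on how the actual ablation parameterizes the experts.

```python
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxHead(nn.Module):
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.k = int(os.environ.get("MOS_NUM_EXPERTS", "2"))
        self.prior = nn.Linear(dim, self.k, bias=False)        # mixture weights
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(self.k)
        )
        self.unembed = nn.Linear(dim, vocab_size, bias=False)  # shared unembedding

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, seq, dim] -> log-probs [batch, seq, vocab]
        pi = F.log_softmax(self.prior(h), dim=-1)               # [B, T, K]
        logps = torch.stack(
            [F.log_softmax(self.unembed(torch.tanh(e(h))), dim=-1) for e in self.experts],
            dim=-2,
        )                                                        # [B, T, K, V]
        # p(x) = sum_k pi_k * softmax_k(x), computed stably in log space.
        return torch.logsumexp(pi.unsqueeze(-1) + logps, dim=-2)
```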

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
