
Record: 11L XSA + EMA + TTT + Partial RoPE + LN Scale — val_bpb=1.1401 #371

Closed
mrdavtan wants to merge 1 commit into openai:main from mrdavtan:11l-xsa-ema-ttt

Conversation

@mrdavtan

Summary

  • val_bpb = 1.1401 (seed=1337, sliding window stride=32)
  • artifact = 15.4 MB (int6 + zstd-22)
  • 8×H100 SXM, 600s wallclock, ~7100 steps at ~82ms/step

An 11-layer transformer stacking XSA (last 4 layers), EMA weight averaging (decay=0.997), Test-Time Training (3-epoch SGD), U-Net skip connections, Partial RoPE (16/64 dims), LN Scale, SmearGate, BigramHash, OrthoInit, and late QAT (absmax int6 with a straight-through estimator).
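Of the stacked techniques, EMA weight averaging is the simplest to state precisely. A minimal sketch of the update rule at decay=0.997, using plain floats; the actual training code applies this per torch tensor, and `ema_update` is an illustrative name, not the PR's:

```python
def ema_update(shadow, params, decay=0.997):
    """EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

# The shadow copy drifts toward the live weights very slowly at 0.997,
# smoothing out step-to-step optimizer noise before evaluation.
shadow = [0.0]
for _ in range(3):
    shadow = ema_update(shadow, [1.0])
print(round(shadow[0], 6))  # → 0.008973, i.e. 1 - 0.997**3
```

At ~7100 training steps, a decay of 0.997 gives the average an effective horizon of a few hundred steps, which is why it is evaluated instead of the raw weights.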

Key findings

  • SmearGate should use interpolation (torch.lerp(x, prev, g)), not additive blending. The additive formula inflates magnitude at default gate values.
  • BigramHash benefits from XOR-based hashing with large primes and a learned output scalar (init 0.05).
  • U-Net skip connections work best with the skip added before the decoder block; storing the skip weights in a single nn.Parameter tensor preserves torch.compile compatibility.
  • Partial RoPE (16/64 dims) and LN Scale (1/sqrt(layer+1)) provide improvements at zero parameter cost.
  • Optimizer coverage: SmearGate and BigramHash parameters must be explicitly added to optimizer groups — they silently freeze from initialization otherwise.
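The SmearGate finding can be seen with plain floats standing in for token activations; the PR itself uses `torch.lerp(x, prev, g)` on tensors, and the function names here are illustrative:

```python
def smear_lerp(x, prev, g):
    # torch.lerp(x, prev, g) == x + g * (prev - x): a convex blend,
    # so the output stays between x and prev for g in [0, 1].
    return x + g * (prev - x)

def smear_additive(x, prev, g):
    # The buggy additive form: output grows past x even at modest g.
    return x + g * prev

x, prev, g = 1.0, 1.0, 0.5
print(smear_lerp(x, prev, g))      # → 1.0, magnitude preserved
print(smear_additive(x, prev, g))  # → 1.5, inflated
```

The additive form effectively multiplies the residual stream by (1 + g) when neighboring activations are correlated, which is the magnitude inflation noted above.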
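A sketch of XOR-based bigram hashing in the spirit of the BigramHash finding; the primes, table size, and function name here are illustrative, not the PR's actual constants (the learned output scalar, init 0.05, would multiply the looked-up embedding):

```python
TABLE_SIZE = 65536
P1, P2 = 1000003, 998244353  # large primes (illustrative choices)

def bigram_slot(prev_tok, cur_tok):
    # Mix the two token ids with distinct prime multipliers, then XOR,
    # so the pair is order-sensitive and collisions spread evenly.
    return ((prev_tok * P1) ^ (cur_tok * P2)) % TABLE_SIZE

print(bigram_slot(17, 42) != bigram_slot(42, 17))  # → True, order matters
```

Using different primes for the two positions is what makes the hash asymmetric; a symmetric mix (e.g. XOR of the raw ids) would map "a b" and "b a" to the same slot.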
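The optimizer-coverage pitfall is easy to guard against mechanically. A sketch of such a check, with plain objects standing in for tensors; the function and variable names are illustrative, mirroring the dict-based param groups that torch optimizers use:

```python
def uncovered_params(named_params, param_groups):
    """Return names of parameters absent from every optimizer group;
    these receive no updates and silently stay at initialization."""
    covered = {id(p) for g in param_groups for p in g["params"]}
    return [name for name, p in named_params if id(p) not in covered]

# Toy example: the SmearGate parameter was left out of the groups.
w_attn, w_smear = object(), object()
named = [("attn.weight", w_attn), ("smear_gate.g", w_smear)]
groups = [{"params": [w_attn]}]
print(uncovered_params(named, groups))  # → ['smear_gate.g']
```

Running a check like this once at startup (over `model.named_parameters()` in the real code) turns a silent freeze into a loud failure.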

Full development log in records/track_10min_16mb/2026-03-21_11L_XSA_EMA_TTT/SESSION_LOG.md.

Reproduction

cd /workspace
git clone https://github.com/mrdavtan/parameter-golf.git
cd parameter-golf && git checkout 11l-xsa-ema-ttt
pip install flash-attn --no-cache-dir --no-build-isolation
pip install zstandard sentencepiece huggingface_hub
python3 data/cached_challenge_fineweb.py --variant sp1024

unset MLP_HIDDEN QUANT_BITS RUN_ID SEED TIER2_MODE && \
ROPE_DIMS=16 LN_SCALE=1 ROPE_BASE=10000 \
EVAL_STRIDE=32 DOC_ISOLATED_EVAL=0 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-21_11L_XSA_EMA_TTT/train_gpt.py

Hardware: 8×H100 SXM (RunPod), PyTorch 2.9.1+cu128, Flash Attention 2

@mrdavtan
Author

Closing in favor of PR #212 (earlier in queue, same code).
