Non-record: 11L XSA + SwiGLU + LoRA TTT (val_bpb=1.1573, 1xH100)#2

Open
swapp1990 wants to merge 109 commits into main from
submission/nonrecord-11l-xsa-lora-ttt

Conversation

@swapp1990

Summary

  • val_bpb: 1.1573 (LoRA TTT) | 15.02 MB artifact | 1xH100 PCIe, ~80 min
  • 11-layer transformer: XSA (last 4 layers), SwiGLU 3x MLP, SmearGate, U-Net skips, OrthoInit, Muon WD=0.04, SWA
  • Mixed quantization: int5-MLP + int6-attn + int8-embed + zstd-22
  • Score-then-train LoRA TTT (rank-8, 256-token chunks) brings val_bpb from 1.191 → 1.157
  • 18 experiments over 5 days, from val_bpb=3.10 to 1.1573 (~$50 total compute)
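The rank-8 LoRA TTT update in the summary can be sketched as follows. This is a minimal numpy sketch assuming the standard LoRA parameterization (frozen weight plus a scaled low-rank product); the `alpha` scaling and zero-init convention are common defaults, not taken from this PR's code:

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha=16.0):
    """Frozen weight W adapted by a rank-r LoRA pair: W + (alpha / r) * B @ A."""
    r = A.shape[0]                       # adapter rank, e.g. 8 in this submission
    return W + (alpha / r) * (B @ A)     # B: (d_out, r), A: (r, d_in)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 32, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))       # trained at test time on token chunks
B = np.zeros((d_out, r))                 # zero-init: adapter starts as a no-op
assert np.allclose(lora_effective_weight(W, A, B), W)
```

Score-then-train means each 256-token chunk is scored with the current weights before the adapter trains on it, so the reported bpb never reflects a chunk the adapter has already seen.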

Why Non-Record

Trained on 1xH100 PCIe with gradient accumulation (~80 min) rather than on 8xH100 in 10 minutes; the architecture is identical to what would run on 8xH100.
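Gradient accumulation here just means averaging micro-batch gradients before each optimizer step, so one GPU reproduces the larger effective batch. A toy sketch with a scalar mean-squared loss (names and the loss are illustrative, not the PR's code):

```python
# For a loss that is a mean over tokens, the mean of per-micro-batch gradients
# equals the full-batch gradient, so 1 GPU with k accumulation steps matches
# k GPUs each seeing one micro-batch per step.
def grad(batch, w):
    # gradient of mean((x - w)**2) with respect to w
    return sum(2 * (w - x) for x in batch) / len(batch)

def accumulated_grad(batches, w):
    return sum(grad(b, w) for b in batches) / len(batches)

full = list(range(8))
micro = [full[i:i + 2] for i in range(0, 8, 2)]   # 4 equal micro-batches
assert abs(grad(full, 0.5) - accumulated_grad(micro, 0.5)) < 1e-12
```

The equivalence holds exactly only for equal-sized micro-batches; unequal final batches need token-weighted averaging.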

Test plan

🤖 Generated with Claude Code

0hq and others added 30 commits March 18, 2026 09:33
MLX Timing Mismatch with Main Script
Fix MLX multi-batch validation memory growth
## Submission: Mixed Quantization (int6 blocks + int8 embeddings) + Sliding Window Eval

**val_bpb: 1.1630** | **Total size: 15,353,490 bytes** (under 16MB)

Four orthogonal improvements over the naive baseline:

1. **Wider MLP (MLP_MULT=3)** — 2x→3x expansion (hidden=1536), enabled by aggressive quantization
2. **Mixed-precision quantization** — int6 per-row (31 levels) on STE-protected block weights, int8 per-row (127 levels) on the token embedding which lacks STE fake-quant. Reduces quant penalty from +0.048 to +0.0015 BPB.
3. **Optimized throughput** — seq_len=1024 + batch=524K tokens for 48.4ms/step, ~6.5B total tokens in 10 minutes
4. **Sliding window eval (stride=64)** — each scored token gets 960 tokens of context, ~0.034 BPB improvement, zero artifact cost
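The mixed-precision scheme above can be sketched as symmetric per-row quantization: int6 with 31 levels maps each row to integers in [-15, 15] with a per-row scale, int8 with 127 levels uses [-63, 63]. A hedged numpy sketch — the submission's exact scale and rounding rules may differ:

```python
import numpy as np

def quantize_per_row(W, levels):
    """Symmetric per-row quantization to `levels` integer levels (assumed form)."""
    qmax = (levels - 1) // 2                          # 15 for int6, 63 for int8
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_per_row(q, scale):
    return q * scale

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 256))
q, s = quantize_per_row(W, levels=31)                 # int6-style, 31 levels
assert q.min() >= -15 and q.max() <= 15
# rounding error is bounded by half a quantization step per row
assert np.abs(dequantize_per_row(q, s) - W).max() <= s.max() / 2 + 1e-9
```

Per-row scales cost one float per row but keep outlier rows from blowing up the step size of well-behaved rows, which is why the quant penalty can drop to +0.0015 BPB.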

### Run command

```bash
RUN_ID=v2_int6_qat_mlp3 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=2000 TRAIN_LOG_EVERY=200 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Key metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 12,395 |
| int6/int8 sliding val_bpb | **1.1630** |
| Quantization penalty | +0.0015 BPB |
| Artifact size | 15,353,490 bytes |
… 1.2129)

10-layer transformer with mixed-precision export achieving mean val_bpb=1.2129
across 5 seeds on 8xH100 SXM, improving on the naive baseline by 0.0248 nats
(t=34.12, p<<0.001).

Key changes:
- 10 layers (vs 9 baseline)
- Lower LRs: MATRIX_LR=0.02, SCALAR_LR=0.02, TIED_EMBED_LR=0.03
- FP16 tied embedding export (reduces quant gap)
- Int6 quantization for middle layers 2-7 (fits under 16MB)

Mean artifact size: 15.36MB (under 16MB cap).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aluating the graph after each sub-batch step
Use eager mx.eval() to fix running train script on 16GB Mac devices
Keep tok_emb.weight in fp16 during int8 export (closing the quant gap),
shrink the MLP hidden size to 992 to fit under 16MB, and bump warmdown
to 3600 and the matrix LR to 0.06.

Tested on 8xH100 SXM (2 seeds) and 8xH200 SXM (3 seeds).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* SOTA attempt

* Improve score on SXM

---------

Co-authored-by: spokane-way <spokane@way>
Major upgrade from previous 10L submission (1.2129 -> 1.1652 BPB).

Key changes:
- 9L with MLP_MULT=3 (wider MLP, 3x expansion, 21.8M params)
- QAT: STE fake-quantize simulates int6 during training
- Int6 quantization on all block weights (layers 0-8)
- Sliding window eval (stride=64) for ~0.033 BPB free gain
- FP16 tied embedding + lower LRs (carried over)
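The STE fake-quantize step named above can be sketched as follows. Assumed form: the forward pass sees the int6-rounded weights while the backward pass treats rounding as identity, which autograd frameworks express as `w + stop_gradient(fq(w) - w)`:

```python
import numpy as np

def fake_quantize(w, scale, qmax=15):
    """Forward pass of int6-style fake quantization (31 levels at qmax=15).
    During QAT this is wrapped as w + stop_gradient(fake_quantize(w) - w),
    so gradients flow straight through to the latent fp weights."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.array([0.12, -0.49, 0.50, 1.70])
wq = fake_quantize(w, scale=0.1)        # values snap to the int6 grid,
                                        # and 1.70 clips to qmax * scale
```

Training against the quantized forward values is what shrinks the post-export gap, since the network learns weights that survive rounding.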

5-seed results on 8xH100 SXM:
  Mean slide_bpb: 1.1652 (std=0.0017)
  Mean rt_bpb:    1.1985
  t-statistic:    78.93 (p << 0.001)
  All artifacts under 16MB (mean: 15.64MB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The window_starts filter dropped windows shorter than stride,
silently skipping up to (stride-1) tokens at the end of the
validation set. Now includes all windows with >= 1 scoreable
token, and clamps the score start for short final windows.
Co-authored-by: spokane-way <spokane@way>
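The fixed enumeration described in that commit can be sketched as follows (a minimal sketch with assumed names: it emits every window with at least one scoreable token and keeps the short final window instead of dropping it):

```python
def window_plan(n_tokens, window=1024, stride=64):
    """Return (ctx_start, score_start, end) triples whose scored spans
    tile [0, n_tokens) exactly, including a short final window."""
    plans = [(0, 0, min(window, n_tokens))]        # first window scores everything
    score_start = min(window, n_tokens)
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)  # final window may be < stride
        plans.append((max(0, end - window), score_start, end))
        score_start = end
    return plans

# Every token is scored exactly once, even when n_tokens is not a multiple of
# the stride -- the case the old window_starts filter silently dropped.
plans = window_plan(2500)
scored = [t for _, ss, e in plans for t in range(ss, e)]
assert scored == list(range(2500))
```

With window=1024 and stride=64, every scored token outside the first window sees at least 960 tokens of context, matching the ~960-context claim above.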
abaybektursun and others added 25 commits March 23, 2026 11:27
…ed mean)

LeakyReLU(0.5)² activation (-0.003 vs relu²) + legal score-first TTT
(PR openai#461 recipe, 3ep SGD, all blocks unfrozen) + BigramHash(1536) on
openai#414 stack with Parameter Banking + Parallel Muon (PR openai#399).
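The activation swap above is relu² with a leaky negative branch: squaring makes the negative side contribute (0.5x)² instead of zero. A scalar sketch (the actual code presumably uses the framework's built-in leaky_relu on tensors):

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope) squared: x**2 for x > 0, (slope * x)**2 otherwise."""
    return (x if x > 0 else slope * x) ** 2

assert leaky_relu_sq(2.0) == 4.0      # positive branch: x**2
assert leaky_relu_sq(-2.0) == 1.0     # negative branch: (0.5 * -2)**2
assert leaky_relu_sq(0.0) == 0.0
```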

3-seed results:
  Seed 1337: 1.1192 bpb, 410s TTT, 15.98 MB
  Seed 42:   1.1200 bpb, 408s TTT, 15.88 MB
  Seed 2025: 1.1189 bpb, 408s TTT, 15.99 MB
  Mean:      1.1194 (std 0.0006)

All artifacts under 16MB. All eval under 10 min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Record: 11L + Efficient Partial XSA (val_bpb: 1.1307)
…nt6-mlp3x-wd04-1.1271

Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271)
…e-lateqat-1.1248

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
…-1.1233

Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)
…oard-merged-records

Update README leaderboard with merged record submissions
…u-legal-ttt-1.1183

Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
…U-Net + NeoMuon + 4xrelu²MLP + Smear + Fact Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8 + Bit-packing LZMA + Stride-16 Eval - 2h (openai#641)

* Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary U-Net (15L 768d 8192BPE relu² 4xMLP FP8 SmearGate, 50k steps)

* Updated README.md for Non-record submission.

---------

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
… relu² 4xMLP FP8) (openai#640)

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
…What Works, What Doesn't, and Why (openai#363)

* Non-record: depth recurrence + quantization error amplification finding

4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP
BigramHash + XSA + LoRA + Late STE QAT + int8+zstd

Key finding: quantization error amplifies ~900x through recurrence cycles,
making int6 incompatible with weight-sharing architectures. Int8 for shared
blocks reduces the gap from 1.14 to 0.37 bpb.

3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8)
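The amplification mechanism behind that finding can be illustrated with a toy linear loop (illustrative only, not the PR's model): a weight perturbation inside a block that is reused for k cycles compounds roughly like the block's gain to the k-th power, whereas a flat stack sees each perturbed weight only once.

```python
import numpy as np

def loop_output(W, x, cycles):
    # one shared block applied `cycles` times, as in a looped transformer
    for _ in range(cycles):
        x = W @ x
    return x

rng = np.random.default_rng(2)
W = 1.2 * np.eye(16)                         # mildly expansive shared block
x = rng.standard_normal(16)
dW = 1e-3 * rng.standard_normal((16, 16))    # small "quantization" perturbation

err1 = np.linalg.norm(loop_output(W + dW, x, 1) - loop_output(W, x, 1))
err12 = np.linalg.norm(loop_output(W + dW, x, 12) - loop_output(W, x, 12))
assert err12 > 10 * err1   # the same perturbation hurts far more through 12 cycles
```

To first order the 12-cycle error is about 12 · 1.2¹¹ ≈ 89x the single-pass error here; with real depth and gains the compounding can reach the ~900x the writeup reports, which is why int8 (a smaller perturbation) helps shared blocks so much more than int6.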

* docs: comprehensive depth recurrence research writeup

Complete 4-day experimental report on looped transformers in Parameter Golf:
- Controlled flat vs looped comparison: 1.1648 vs 1.1894 bpb (+0.025 gap)
- Noisy QAT: novel technique collapsing quant error from 0.37 to 0.002 bpb
- 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
- 12 negative results with specific numbers
- Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
- Updated training script with all experimental features

* Update README.md

me when I cant write

* fix: remove extra files, update writeup per reviewer feedback

- Remove pr325_train_gpt.py from PR (dev file, not submission)
- Restore original README.md
- Update records/ writeup with v2 content
- Add hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
- Clarify T=0.90 is activation-dependent (relu² specific, found via grid search)

---------

Co-authored-by: Evangeline Kamin <eve@aurora.lan>
…, 1xH100)

11-layer transformer with XSA, SwiGLU, SmearGate, and score-first LoRA TTT.
Trained on 1xH100 PCIe (~80 min). val_bpb: 1.1573, artifact: 15.02 MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
swapp1990 force-pushed the submission/nonrecord-11l-xsa-lora-ttt branch from 1f5091a to 370a048 on March 29, 2026 18:14
swapp1990 and others added 3 commits March 29, 2026 11:17
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
