
Commit a968e4e

SLOT + QK-Gain 4.0 + XSA-11 on VRL/LeakyReLU2 base
Integrates four proven post-March-25 techniques:

- QK-Gain 4.0 (PR openai#1125 sweep)
- XSA all 11 layers (PR openai#1176)
- SLOT per-sample delta + logit bias with scored-position masking (PR openai#1229)
- forward_hidden/compute_logits refactor for SLOT compatibility
1 parent 9d070df commit a968e4e

3 files changed

Lines changed: 1531 additions & 0 deletions

Lines changed: 35 additions & 0 deletions
# SLOT + QK-Gain 4.0 + XSA-11 + TTT

Record submission integrating four proven post-March-25 techniques onto the VRL + LeakyReLU2 base (PR #175).

## Architecture

11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)^2 MLP 3x, VRL, VE128, BigramHash(2048), XSA on all 11 layers, Partial RoPE 16/64, LN Scale, SmearGate, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, int6 + lzma, FA3 Hopper, Muon WD=0.04.
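The `LeakyReLU(0.5)^2` notation above is terse. As a reading aid, here is a minimal NumPy sketch of one plausible interpretation: LeakyReLU with negative slope 0.5 followed by squaring, by analogy with the ReLU^2 activation used in earlier speedrun records. The squaring convention (plain square, discarding sign) is an assumption, not confirmed by this commit:

```python
import numpy as np

def leaky_relu2(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Squared LeakyReLU: LeakyReLU(x, slope) ** 2.

    Note that squaring maps the negative branch to positive values;
    whether the actual kernel keeps the sign is not stated in the text.
    """
    y = np.where(x >= 0, x, slope * x)
    return y ** 2

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
leaky_relu2(x)  # [1.0, 0.25, 0.0, 1.0, 4.0]
```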
## New Techniques (over PR #175 base)

| Technique | Source | Expected Impact |
|-----------|--------|-----------------|
| QK-Gain 4.0 | PR #1125 (45-experiment sweep) | -0.006 BPB |
| XSA all 11 layers | PR #1176 | -0.002 BPB |
| SLOT (per-sample delta + logit bias) | PR #1229 (arXiv:2505.12392v2) | -0.021 to -0.060 BPB |
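QK-Gain 4.0 is specified here only by its initialization value from the PR #1125 sweep. A hedged sketch of one common construction it may follow: RMS-normalize queries and keys, then scale the attention logits by a (learnable) scalar gain initialized at 4.0 in place of the usual 1/sqrt(d). All names below are illustrative, not the repo's actual API:

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize the last axis to unit RMS."""
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def qk_gain_logits(q: np.ndarray, k: np.ndarray, gain: float = 4.0) -> np.ndarray:
    """Attention logits with QK normalization and a scalar gain.

    q, k: [T, d]. Normalizing q and k fixes the logit scale, so the
    1/sqrt(d) temperature is replaced by a gain, shown here at the
    sweep-selected init of 4.0. In training the gain would be a
    learnable parameter; here it is a plain float for illustration.
    """
    qn, kn = rms_norm(q), rms_norm(k)
    return gain * (qn @ kn.T)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
logits = qk_gain_logits(q, k)  # [8, 8] attention logits
```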
### SLOT Details

Scored-position Learned Output Tuning optimizes a per-sequence additive delta vector [bsz, 1, 512] at the last hidden layer plus a per-sequence logit bias [bsz, 1, vocab], with all model weights frozen. Optimization runs for 16 AdamW steps with a cosine LR schedule from 0.008 to 0.0008. Only scored positions (the last stride=64 tokens of each non-first window) contribute to the SLOT loss, aligning the optimization objective with the eval metric.
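The procedure above can be sketched end-to-end. This is a minimal NumPy sketch under stated assumptions: the frozen hidden states `h` stand in for `forward_hidden`'s output, the output head is a plain matrix `W`, AdamW weight decay is set to 0, and the masked-mean loss reduction is a guess; only the delta/bias shapes, the 16 AdamW steps, and the cosine 0.008 -> 0.0008 schedule come from the text:

```python
import numpy as np

def slot_adapt(h, W, targets, scored_mask, steps=16, lr0=0.008, lr1=0.0008):
    """Test-time SLOT sketch: per-sample delta + logit bias, frozen weights.

    h: frozen last hidden states [B, T, D]; W: frozen output head [D, V];
    targets: next-token ids [B, T]; scored_mask: [B, T] bool, True only on
    positions that count toward the eval metric. Returns the optimized
    delta [B, 1, D] and logit bias [B, 1, V].
    """
    B, T, D = h.shape
    V = W.shape[1]
    delta, bias = np.zeros((B, 1, D)), np.zeros((B, 1, V))
    m = [np.zeros_like(delta), np.zeros_like(bias)]
    v = [np.zeros_like(delta), np.zeros_like(bias)]
    b1, b2, eps, wd = 0.9, 0.999, 1e-8, 0.0   # wd=0 is an assumption
    mask = scored_mask[..., None].astype(h.dtype)         # [B, T, 1]
    onehot = np.eye(V)[targets]                           # [B, T, V]
    n = max(int(scored_mask.sum()), 1)
    for t in range(steps):
        # cosine schedule: lr0 at step 0, decaying toward lr1
        lr = lr1 + 0.5 * (lr0 - lr1) * (1 + np.cos(np.pi * t / steps))
        logits = (h + delta) @ W + bias                   # [B, T, V]
        z = logits - logits.max(-1, keepdims=True)
        p = np.exp(z); p /= p.sum(-1, keepdims=True)
        g = (p - onehot) * mask / n                       # dCE/dlogits, scored only
        grads = [(g @ W.T).sum(1, keepdims=True), g.sum(1, keepdims=True)]
        for i, (param, grad) in enumerate(zip((delta, bias), grads)):
            m[i] = b1 * m[i] + (1 - b1) * grad
            v[i] = b2 * v[i] + (1 - b2) * grad * grad
            mh, vh = m[i] / (1 - b1 ** (t + 1)), v[i] / (1 - b2 ** (t + 1))
            param -= lr * (mh / (np.sqrt(vh) + eps) + wd * param)
    return delta, bias

# toy usage with illustrative shapes (the real model uses D=512)
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8, 16)); W = 0.1 * rng.normal(size=(16, 32))
targets = rng.integers(0, 32, size=(2, 8))
scored = np.zeros((2, 8), bool); scored[:, -2:] = True  # stand-in for stride-64 mask
delta, bias = slot_adapt(h, W, targets, scored)
```

Because the logits are affine in (delta, bias), the masked cross-entropy is convex in the adapted parameters, so the 16 short AdamW steps reliably reduce the scored-position loss.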
## Reproduction

```bash
QK_GAIN_INIT=4.0 XSA_LAST_N=11 SLOT_ENABLED=1 SLOT_STEPS=16 SLOT_LR=0.008 \
SEED=1337 DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- Base model: PR #175 (anthony-maio)
- SLOT mechanism: Hu et al. arXiv:2505.12392v2, PR #1176 (@bigbag), PR #1229 (@resouer)
- QK-Gain 4.0: PR #1125 (45-experiment sweep)
- VRL: ResFormer (arXiv:2410.17897)
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
{
2+
"name": "SLOT_QKGain4_XSA11_TTT",
3+
"author": "Anthony Maio",
4+
"github_id": "anthony-maio",
5+
"date": "2026-04-02",
6+
"track": "10min_16mb",
7+
"num_gpus": 8,
8+
"gpu_type": "H100 SXM",
9+
"training_time_seconds": 600,
10+
"val_bpb": null,
11+
"seed_results": {},
12+
"bytes_total": null,
13+
"bytes_code": null,
14+
"blurb": "SLOT per-sample delta + logit bias (scored-position masked, cosine LR), QK-Gain 4.0, XSA all 11 layers, on VRL + LeakyReLU2 + BigramHash + EMA/SWA + int6+lzma base."
15+
}
