@@ -0,0 +1,73 @@
# Record: SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + EMA + QK5 + Pre-Quant AdamW TTT

**val_bpb = 1.0679** (3-seed mean, std 0.0012) | **~15.95 MB** | 8xH100 SXM

## 3-Seed Results

| Seed | Sliding BPB | Roundtrip BPB | Steps | Artifact |
|------|-------------|---------------|-------|----------|
| 42 | **1.06919475** | 1.08454243 | 5001 | 15,948,623 |
| 1337 | **1.06759772** | 1.08281588 | 5163 | 15,954,178 |
| 2024 | **1.06690869** | 1.08219302 | 5167 | 15,960,801 |
| **Mean** | **1.06790039** | | | |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0468 BPB**.
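
The mean, spread, and delta quoted above can be re-derived from the per-seed numbers. The snippet below is only an arithmetic check, not part of the submission; it assumes the reported std is the sample standard deviation (n - 1), which is the convention that reproduces 0.0012.

```python
from statistics import mean, stdev

# Sliding-window BPB per seed, copied from the results table.
sliding_bpb = {42: 1.06919475, 1337: 1.06759772, 2024: 1.06690869}
merged_sota_bpb = 1.1147  # PR #1019, as quoted in this README

mean_bpb = mean(sliding_bpb.values())
std_bpb = stdev(sliding_bpb.values())  # sample std (n - 1); assumed convention
delta_bpb = mean_bpb - merged_sota_bpb

print(f"mean={mean_bpb:.8f} std={std_bpb:.5f} delta={delta_bpb:+.4f}")
```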

## Novel Contribution

First submission combining **all six techniques** in one stack:

1. **3-layer depth recurrence** (layers 3,4,5 repeated -> 13 virtual layers from 11 physical)
2. **Parallel residuals** from layer 7 (GPT-J style separate attention/MLP lanes)
3. **EMA 0.9965** (exponential moving average of weights)
4. **QK-Gain 5.0** (learnable per-head, applied to Q only)
5. **Pre-quant AdamW TTT** (6 epochs on val data BEFORE GPTQ, baked into artifact)
6. **SDClip GPTQ int6** + int8 embeddings + brotli compression

Prior submissions each used only a subset:
- PR #1471: recurrence + par7 + EMA + QK5 (no TTT) -> 1.0866
- PR #1423: TTT + QK5 (no recurrence, no par7) -> 1.0791
- PR #1477: recurrence(2-layer) + par7 + score-first TTT -> 1.0822
- **This submission: all six combined -> 1.0679**
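
To illustrate what "GPT-J style" parallel residuals (item 2) means in general, here is a minimal sketch contrasting a sequential residual block with a parallel one. It is not taken from `train_gpt.py`: `toy_attn` and `toy_mlp` are hypothetical stand-ins for the real sublayers, and normalization is omitted for brevity.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Elementwise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_block(x: Vector, attn: Sublayer, mlp: Sublayer) -> Vector:
    """Standard residual block: the MLP reads the attention-updated stream."""
    x = add(x, attn(x))
    return add(x, mlp(x))


def parallel_block(x: Vector, attn: Sublayer, mlp: Sublayer) -> Vector:
    """GPT-J-style block: attention and MLP both read the same input."""
    return add(add(x, attn(x)), mlp(x))


def toy_attn(x: Vector) -> Vector:
    return [0.5 * v for v in x]  # hypothetical stand-in for attention


def toy_mlp(x: Vector) -> Vector:
    return [v * v for v in x]  # hypothetical stand-in for the MLP


x = [1.0, 2.0]
print(sequential_block(x, toy_attn, toy_mlp))  # [3.75, 12.0]
print(parallel_block(x, toy_attn, toy_mlp))    # [2.5, 7.0]
```

In the parallel form the two sublayers do not depend on each other's output, which is why they can be treated as separate lanes.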

## Architecture

| Component | Setting |
|-----------|---------|
| Tokenizer | SP8192 (SentencePiece BPE) |
| Layers | 11 physical, 13 virtual (recurrence 3,4,5) |
| Dim | 512, 8 heads, 4 KV (GQA 2:1) |
| MLP | 4x, squared LeakyReLU (slope 0.5) |
| Activation | leaky_relu(x, 0.5).square() |
| Optimizer | MuonEq-R (row-normalized Newton-Schulz) |
| Recurrence | Layers [3,4,5] after step 3000 |
| Parallel | GPT-J style from layer 7 |
| EMA | decay=0.9965 |
| QK-Gain | 5.0, learnable per-head |
| Skip gates | Sigmoid gates on U-Net connections |
| Pre-quant TTT | AdamW, lr=0.0005, 6 epochs, freeze 2 blocks, cosine decay |
| Quantization | SDClip GPTQ int6 (k=12.85) + int8 embed (k=20.0) |
| Compression | brotli |
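
Two rows of the table are simple enough to sketch in plain Python. The activation follows the expression `leaky_relu(x, 0.5).square()` given above; the EMA update uses the textbook form, which is an assumption here since the README states only the decay value. Both functions are illustrative scalar/list versions, not the tensor code in `train_gpt.py`.

```python
from typing import List


def squared_leaky_relu(x: float, slope: float = 0.5) -> float:
    """Scalar form of leaky_relu(x, 0.5).square()."""
    y = x if x >= 0.0 else slope * x
    return y * y


def ema_update(ema: List[float], weights: List[float],
               decay: float = 0.9965) -> List[float]:
    """One EMA step (assumed textbook form): ema <- decay*ema + (1-decay)*w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


print(squared_leaky_relu(2.0))    # 4.0
print(squared_leaky_relu(-2.0))   # 1.0, since (0.5 * -2.0) ** 2 == 1.0
print(ema_update([0.0], [1.0]))   # one step moves the average 0.35% of the way
```

With decay 0.9965, each step blends in 0.35% of the current weights, so the average spans roughly 1 / (1 - 0.9965) ≈ 286 recent steps.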

## Compliance (Track A)

- Pre-quant TTT trains on validation data BEFORE quantization
- Result baked into artifact at submission time
- No eval-time adaptation, no SLOT, no n-gram cache
- Fixed predictor at evaluation time
- All training within 600s wallclock on 8xH100

## Reproduction

```bash
pip install brotli sentencepiece kernels
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
VOCAB_SIZE=8192 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py
```
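
The command above runs seed 42 only. A small driver such as the following could cover all three seeds from the results table; it assumes the other runs differ only in the `SEED` environment variable, which the README does not state explicitly. `build_run` is a hypothetical helper, and the launch call is left commented out so the sketch is a dry run.

```python
import os
from typing import Dict, List, Tuple

SEEDS = [42, 1337, 2024]  # seeds listed in the results table


def build_run(seed: int) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) mirroring the torchrun line above for one seed."""
    env = dict(os.environ, VOCAB_SIZE="8192", SEED=str(seed))
    argv = ["torchrun", "--standalone", "--nproc_per_node=8", "train_gpt.py"]
    return env, argv


for seed in SEEDS:
    env, argv = build_run(seed)
    print(f"VOCAB_SIZE={env['VOCAB_SIZE']} SEED={env['SEED']}", " ".join(argv))
    # To actually launch (requires the 8-GPU setup):
    # import subprocess; subprocess.run(argv, env=env, check=True)
```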

## Credits

- PR #1471 @X-Abhishek-X (base architecture, depth recurrence, parallel residuals)
- PR #1423 @aryanbhosale (pre-quant AdamW TTT technique)
- PR #1394 @clarkkev (SDClip, GPTQ embeddings)
- PR #1204 @msisovic (MuonEq-R optimizer)
@@ -0,0 +1,25 @@
{
"author": "ndokutovich",
"github_id": "ndokutovich",
"name": "SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + EMA + QK5 + Pre-Quant AdamW TTT",
"blurb": "Merged stack: PR #1471 base (11L, 512dim, GQA 8/4, squared LeakyReLU, MuonEq-R) with 3-layer depth recurrence (layers 3,4,5 -> 13 virtual layers), parallel residuals from layer 7, EMA 0.9965, QK-Gain 5.0, skip gates, and pre-quantization AdamW TTT (6 epochs, lr=0.0005, freeze 2 blocks, cosine decay) ported from PR #1423. SDClip GPTQ int6 (k=12.85) + int8 embeddings (k=20.0) + brotli compression. Track A: no eval-time adaptation.",
"date": "2026-04-09T00:00:00Z",
"val_loss": 2.75849894,
"val_bpb": 1.06790039,
"val_loss_std": 0.00241,
"val_bpb_std": 0.00117,
"seeds": [42, 1337, 2024],
"seed_results": {
"42": {"val_loss": 2.76184107, "val_bpb": 1.06919475},
"1337": {"val_loss": 2.75771579, "val_bpb": 1.06759772},
"2024": {"val_loss": 2.75593596, "val_bpb": 1.06690869}
},
"pre_quant_val_loss": 2.81114117,
"pre_quant_val_bpb": 1.08828035,
"step_stop": 5001,
"wallclock_seconds": 590.057,
"eval_time_seconds": 104.389,
"bytes_total": 15948623,
"bytes_model_int6_brotli": 15861058,
"bytes_code": 87565
}
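
A quick consistency check on the metadata above, written as a sketch rather than as part of the submission. The values are copied from the JSON; the 600 s limit is the wallclock budget stated in the README's compliance section.

```python
# Fields copied from the submission JSON above.
submission = {
    "wallclock_seconds": 590.057,
    "bytes_total": 15948623,
    "bytes_model_int6_brotli": 15861058,
    "bytes_code": 87565,
}

WALLCLOCK_LIMIT_S = 600.0  # from the README: "All training within 600s wallclock"

# The artifact size should be the compressed model plus the code.
size_ok = (
    submission["bytes_model_int6_brotli"] + submission["bytes_code"]
    == submission["bytes_total"]
)
time_ok = submission["wallclock_seconds"] <= WALLCLOCK_LIMIT_S

print(f"size_ok={size_ok} time_ok={time_ok}")
```

Note that `bytes_total` and `step_stop` match the seed 42 row of the README's results table, so the top-level size and timing fields appear to describe that run.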