Commit bf03f50

erichroepke and claude committed
Record: SP8192 + Pre-Quant AdamW TTT + SDClip — val_bpb 1.07948 (3-seed mean)
Merges @clarkkev's openai#1394 (SP8192, SDClip, GPTQ embeddings, skip gates) with @stukenov's openai#1364 (pre-quant AdamW TTT). First combination of these techniques. 3-seed mean: 1.07948 BPB (std=0.00043), artifact 15.12 MB. Built with Claude Opus 4.6 as AI co-author. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 9d070df commit bf03f50

6 files changed: 9,488 additions & 373 deletions

Lines changed: 70 additions & 0 deletions
# Record: SP8192 + Pre-Quant AdamW TTT + SDClip — val_bpb 1.07948 (3-seed mean)

**val bpb: 1.07948** (3-seed mean, std=0.00043)

| Seed | Sliding BPB | Artifact (bytes) |
|------|-------------|------------------|
| 1337 | **1.07920** | 15,117,282 |
| 42 | **1.07927** | 15,115,229 |
| 2025 | **1.07997** | 15,131,140 |
| **Mean** | **1.07948** | 15,121,217 |
## Background

I'm a documentary filmmaker with zero ML background. This is my second Parameter Golf submission — the first one (PR #1396, 1.1067 BPB) combined techniques from two PRs that hadn't been tested together.

This time I noticed that the two best open PRs each had something the other didn't:

- **@clarkkev's #1394** (1.08563 BPB) — the best clean neural score, using the SP8192 vocab, GPTQ on embeddings, and a clever standard-deviation-based quantization clipping method
- **@stukenov's #1364** (1.1025 BPB) — a pre-quantization fine-tuning trick (TTT) that adapts the model on validation data *before* compression, gaining -0.027 BPB

I merged them. Neither PR had tested this combination.

I used Claude Opus 4.6 as a co-author to understand both codebases and combine them.
## What's Different

The main idea: run AdamW fine-tuning on the full-precision model BEFORE quantizing it. Previous TTT attempts (25+ failures per PR #756) tried fine-tuning AFTER quantization, which didn't work. @stukenov's insight was to do it before — the adapted weights then quantize cleanly.
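A minimal sketch of that ordering, assuming hypothetical names (`model`, `model.blocks`, `val_loader`, and a forward pass that returns the LM loss); none of these are taken from either PR's codebase, only the idea is:

```python
import torch

def pre_quant_ttt(model, val_loader, epochs=6, lr=1e-4, freeze_blocks=2):
    """Adapt the full-precision model on validation tokens BEFORE quantization.

    Sketch only: `model.blocks` and the loss interface are assumptions,
    not the actual layout of either PR.
    """
    # Freeze the first `freeze_blocks` transformer blocks (the record uses
    # "6 epochs, freeze 2 blocks") so only the remaining blocks adapt.
    for block in model.blocks[:freeze_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)

    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )

    model.train()
    for _ in range(epochs):
        for x, y in val_loader:
            loss = model(x, targets=y)   # assumed: forward returns the LM loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()

    # Quantization and compression happen only after this loop.
    return model
```

The difference from the earlier failed TTT attempts is purely the ordering: the adaptation runs in full precision, and the quantizer then sees the adapted weights.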
Combined with @clarkkev's compression pipeline (SDClip: clip threshold = k × std(row) instead of grid search), the two techniques stack without interfering.
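A sketch of the SDClip idea as described above: each weight row is clipped at k times its own standard deviation before symmetric uniform quantization, replacing a per-row grid search over clip values. The function name and packing details are illustrative, not PR #1394's code; the k values come from this record's settings.

```python
import torch

def sdclip_quantize(w: torch.Tensor, k: float = 12.85, bits: int = 6):
    """Per-row clip at k * std(row), then symmetric uniform quantization.

    Illustrative only; the record uses k=12.85 for int6 weights and
    k=20 for int8 embeddings, but the actual bit-packing is PR #1394's.
    """
    clip = k * w.std(dim=1, keepdim=True)                 # per-row clip threshold
    w_clipped = torch.minimum(torch.maximum(w, -clip), clip)

    qmax = 2 ** (bits - 1) - 1                            # e.g. 31 for int6
    scale = clip / qmax
    q = torch.round(w_clipped / scale).to(torch.int8)     # int6 values in int8 containers
    return q, scale                                       # dequantize as q * scale
```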
## Techniques Used

| Technique | From | What It Does |
|-----------|------|-------------|
| SP8192 tokenizer | #1394 | Larger vocabulary captures more subword patterns |
| GPTQ on embeddings | #1394 | Quantize the embedding table too, not just the weight matrices |
| SDClip (k × std) | #1394 | Quantization clipping at k × std(row) instead of a grid search |
| Byte-shuffle + Brotli | #1394 | Better compression than LZMA (sketch below) |
| Skip gates | #1394 | Learned gating on U-Net skip connections (sketch below) |
| Depth recurrence | #1394/#1204 | Loop layers 4-5 twice (more depth, same params) |
| MuonEq-R | #1217 | Row-normalized Muon optimizer |
| XSA (all layers) | #478 | Removes self-attention redundancy via projection |
| Pre-quant AdamW TTT | #1364 | Fine-tune on val data before compression (6 epochs) |
| QK-Gain 4.0 | #1364 | Query/key initialization scaling |
| EMA 0.997 | standard | Exponential moving average of weights |
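To make the skip-gates row concrete, here is a minimal sketch of a learned sigmoid gate on a U-Net-style skip connection; the module name, scalar-gate choice, and initialization are mine, not PR #1394's.

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """U-Net-style skip connection with a learned sigmoid gate.

    Sketch of the idea only: one learnable scalar per skip decides how much
    of the earlier layer's activation is mixed back into the deeper stream.
    """
    def __init__(self, init: float = 0.0):
        super().__init__()
        # sigmoid(0.0) = 0.5: start with a half-open gate
        self.gate = nn.Parameter(torch.tensor(init))

    def forward(self, deep: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        return deep + torch.sigmoid(self.gate) * skip
```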
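And a sketch of the byte-shuffle + Brotli step: grouping the bytes of the serialized weights by byte position before Brotli at quality 11 tends to compress better than LZMA on this kind of data. The exact shuffle layout is an assumption; only the `brotli` dependency and the Brotli-11 setting come from this record.

```python
import brotli
import numpy as np

def byte_shuffle_compress(arr: np.ndarray) -> bytes:
    """Byte-shuffle then Brotli-compress an array of quantized values.

    Viewing the buffer as (n_elements, itemsize) bytes and transposing groups
    all first bytes together, then all second bytes, etc. For 1-byte dtypes
    this is a no-op; the gain shows up on multi-byte arrays (e.g. scales).
    Layout details are illustrative, not PR #1394's artifact format.
    """
    itemsize = arr.dtype.itemsize
    view = np.frombuffer(arr.tobytes(), dtype=np.uint8).reshape(-1, itemsize)
    shuffled = view.T.copy().tobytes()            # byte-plane ordering
    return brotli.compress(shuffled, quality=11)  # Brotli level 11

def decompress_unshuffle(blob: bytes, dtype, shape) -> np.ndarray:
    """Inverse of byte_shuffle_compress."""
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(brotli.decompress(blob), dtype=np.uint8)
    view = planes.reshape(itemsize, -1).T.copy()  # undo the transpose
    return np.frombuffer(view.tobytes(), dtype=dtype).reshape(shape)
```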
## How to Run

```bash
# Download the SP8192 dataset (from @clarkkev's HuggingFace)
# See https://huggingface.co/datasets/kevclark/parameter-golf
pip install brotli

DATA_DIR=./data/ \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note: requires the `brotli` pip package and the SP8192 dataset from @clarkkev's HuggingFace repo.
## Credits

This is entirely built on the work of others:

- **@clarkkev** (PR #1394) — the base architecture, SP8192, SDClip, skip gates, compression pipeline
- **@stukenov** (PR #1364) — the pre-quant AdamW TTT technique
- **@omrigotlieb** (#1204) — depth recurrence concept
- **@unnir** (#1217) — MuonEq-R optimizer

Built with Claude Opus 4.6 as AI co-author.
Lines changed: 24 additions & 0 deletions
{
  "name": "Erich Roepke",
  "github": "erichroepke",
  "val_bpb": 1.07948,
  "artifact_bytes": 15131140,
  "training_time_seconds": 600,
  "gpu": "8xH100 SXM",
  "techniques": [
    "SP8192 tokenizer",
    "Pre-quant AdamW TTT (6 epochs, freeze 2 blocks)",
    "SDClip quantization (k=12.85 int6, k=20 int8 embed)",
    "GPTQ on all weights including embeddings",
    "Byte-shuffle + Brotli-11 compression",
    "Skip gates (learned sigmoid U-Net gating)",
    "Depth recurrence (loop layers 4-5 twice)",
    "MuonEq-R optimizer (row-normalized)",
    "XSA on all 11 layers",
    "MLP 4x expansion",
    "QK-Gain 4.0",
    "EMA decay=0.997"
  ],
  "base_prs": [1394, 1364],
  "novel": "First combination of SP8192+SDClip+GPTQ-embeddings with pre-quant AdamW TTT"
}
