# Record: SP8192 + Pre-Quant AdamW TTT + SDClip — val_bpb 1.07948 (3-seed mean)

**val bpb: 1.07948** (3-seed mean, std=0.00043)

| Seed | Sliding BPB | Artifact (bytes) |
|------|-------------|------------------|
| 1337 | **1.07920** | 15,117,282 |
| 42 | **1.07927** | 15,115,229 |
| 2025 | **1.07997** | 15,131,140 |
| **Mean** | **1.07948** | 15,121,217 |

## Background

I'm a documentary filmmaker with zero ML background. This is my second Parameter Golf submission — the first one (PR #1396, 1.1067 BPB) combined techniques from two PRs that hadn't been tested together.

This time I noticed that the two best open PRs each had something the other didn't:

- **@clarkkev's #1394** (1.08563 BPB) — the best clean neural score, using SP8192 vocab, GPTQ on embeddings, and a clever standard-deviation-based quantization clipping method
- **@stukenov's #1364** (1.1025 BPB) — a pre-quantization fine-tuning trick (TTT) that adapts the model on validation data *before* compression, worth -0.027 BPB

I merged them. Neither had tested this combination.

I used Claude Opus 4.6 as a co-author to understand both codebases and combine them.

## What's Different

The main idea: run AdamW fine-tuning on the full-precision model BEFORE quantizing it. Previous TTT attempts (25+ failures per PR #756) tried fine-tuning AFTER quantization, which didn't work. @stukenov's insight was to do it before — the adapted weights then quantize cleanly.
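
Here is a minimal sketch of that ordering, assuming a generic PyTorch setup; the names `model`, `val_loader`, and `quantize_and_pack`, and the loss call signature, are illustrative placeholders rather than the actual interfaces in `train_gpt.py`:

```python
import torch

def prequant_ttt(model, val_loader, quantize_and_pack, epochs=6, lr=1e-4):
    """Pre-quantization TTT sketch: adapt the full-precision weights on the
    validation data first, and only quantize after fine-tuning is done."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):                # 6 epochs, matching the table below
        for inputs, targets in val_loader:
            loss = model(inputs, targets)  # assumes the model returns its loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
    # Quantization happens only now, on the adapted full-precision weights.
    return quantize_and_pack(model)
```

The only hard constraint is the order of operations: AdamW never touches quantized weights.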

Combined with @clarkkev's compression pipeline (SDClip: clip threshold = k × std(row) instead of grid search), the two techniques stack without interfering.
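
Roughly, the SDClip idea as I understand it looks like the sketch below; the function name, the symmetric 4-bit-style rounding, and the default `k` are my own assumptions, not the exact code from #1394:

```python
import torch

def sdclip_quantize(weight: torch.Tensor, k: float = 3.0, n_bits: int = 4):
    """Illustrative SDClip-style quantization: clip each row at k * std(row),
    then quantize symmetrically to n_bits. Not the exact PR #1394 code."""
    clip = k * weight.std(dim=1, keepdim=True)        # per-row clip threshold
    clipped = weight.clamp(min=-clip, max=clip)       # no grid search needed
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for 4-bit symmetric
    scale = (clip / qmax).clamp_min(1e-8)             # per-row scale, avoid /0
    q = torch.round(clipped / scale).to(torch.int8)   # quantized codes
    dequant = q.float() * scale                       # what the model actually sees
    return q, scale, dequant
```

The point is that the clip threshold comes from a single per-row statistic rather than a per-tensor grid search.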

## Techniques Used

| Technique | From | What It Does |
|-----------|------|-------------|
| SP8192 tokenizer | #1394 | Larger vocabulary captures more subword patterns |
| GPTQ on embeddings | #1394 | Quantize the embedding table too, not just weight matrices |
| SDClip (k × std) | #1394 | Smarter quantization clipping that accounts for compression |
| Byte-shuffle + Brotli | #1394 | Better compression than lzma |
| Skip gates | #1394 | Learned gating on U-Net skip connections |
| Depth recurrence | #1394/#1204 | Loop layers 4-5 twice (more depth, same params) |
| MuonEq-R | #1217 | Row-normalized Muon optimizer |
| XSA (all layers) | #478 | Removes self-attention redundancy via projection |
| Pre-quant AdamW TTT | #1364 | Fine-tune on val data before compression (6 epochs) |
| QK-Gain 4.0 | #1364 | Query/key initialization scaling |
| EMA 0.997 | standard | Exponential moving average of weights |
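
As one concrete example from the table, the depth-recurrence row ("loop layers 4-5 twice") could look something like this in PyTorch; the class and attribute names are my own placeholders rather than the code in #1394/#1204:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Sketch of depth recurrence: selected blocks are applied twice per
    forward pass, adding effective depth without adding any parameters."""
    def __init__(self, blocks: nn.ModuleList, recur_ids=(4, 5), n_loops=2):
        super().__init__()
        self.blocks = blocks             # the usual transformer blocks
        self.recur_ids = set(recur_ids)  # which layer indices get looped
        self.n_loops = n_loops           # how many times those layers run

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            reps = self.n_loops if i in self.recur_ids else 1
            for _ in range(reps):        # reuse the same weights on each pass
                x = block(x)
        return x
```

Because the looped layers reuse their weights, the parameter count (and therefore the compressed artifact size) stays the same while the network gets more effective depth.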

## How to Run

```bash
# Download SP8192 dataset (from @clarkkev's HuggingFace)
# See https://huggingface.co/datasets/kevclark/parameter-golf
pip install brotli

DATA_DIR=./data/ \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note: requires the `brotli` pip package and the SP8192 dataset from @clarkkev's HuggingFace repo.

## Credits

This is entirely built on the work of others:

- **@clarkkev** (PR #1394) — the base architecture, SP8192, SDClip, skip gates, compression pipeline
- **@stukenov** (PR #1364) — the pre-quant AdamW TTT technique
- **@omrigotlieb** (#1204) — depth recurrence concept
- **@unnir** (#1217) — MuonEq-R optimizer

Built with Claude Opus 4.6 as an AI co-author.