@@ -0,0 +1,50 @@
# 10L Int5-MLP + MuonWD=0.04 + SWA/50 + SmearGate + BigramHash

**val_bpb: 1.14526** (sliding-window eval, stride=64, after the mixed int5/int6 + zstd quantization roundtrip)

## Key Innovation: Mixed Int5/Int6 Quantization

Instead of uniform int6 quantization for all weights, use:
- **Int5 [-16,15]** for MLP weights (largest tensors, most compressible)
- **Int6 [-32,31]** for attention weights (more precision-sensitive)
- **FP16** for tied embeddings and last-layer key projections

Int5 values stored one per int8 byte carry **3 redundant high bits** versus 2 for int6, so zstd-22 compresses the int5 stream at 1.88x versus 1.51x for int6, saving **1.86MB** (14.07MB vs 16.05MB). That saving funds a **10th transformer layer** while staying under 16MB.
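A minimal sketch of the storage-and-compression idea, assuming symmetric per-tensor quantization and the `zstandard` Python bindings. `quantize_symmetric` and `zstd_size` are hypothetical helpers, not the submission's code, and ratios on random weights will not match the 1.88x/1.51x measured on the trained model:

```python
import numpy as np
import zstandard  # pip install zstandard

def quantize_symmetric(w: np.ndarray, bits: int) -> np.ndarray:
    """Round float weights onto a signed `bits`-wide grid stored one value
    per int8 byte: bits=5 -> [-16, 15], bits=6 -> [-32, 31]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)

def zstd_size(q: np.ndarray, level: int = 22) -> int:
    """Compressed size of the int8 byte stream at the given zstd level."""
    return len(zstandard.ZstdCompressor(level=level).compress(q.tobytes()))

# Toy comparison on one MLP-shaped tensor (512 x 1536 = 786,432 bytes raw):
w = np.random.randn(512, 1536).astype(np.float32)
for bits in (5, 6):
    q = quantize_symmetric(w, bits)
    print(f"int{bits}: {q.nbytes / zstd_size(q):.2f}x compression")
```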

## Architecture
- 10 layers, 512 dim, 8 heads, 4 KV heads (GQA)
- MLP 3x expansion (hidden=1536), relu² activation (sketched after this list)
- SmearGate + BigramHash (4096 buckets, dim=128)
- Orthogonal init with muP-scaled output projections
- U-Net skip connections
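A config-level sketch of the pieces above, loosely in the style of train_gpt.py; the names (`Config`, `ReluSquaredMLP`) are hypothetical, and the attention/skip wiring is omitted:

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class Config:
    n_layer: int = 10       # the 10th layer funded by the int5 MLP savings
    n_embd: int = 512
    n_head: int = 8         # query heads
    n_kv_head: int = 4      # GQA: every 2 query heads share one KV head
    mlp_hidden: int = 1536  # 3x expansion

class ReluSquaredMLP(nn.Module):
    """MLP block with relu² activation, as listed above."""
    def __init__(self, cfg: Config):
        super().__init__()
        self.up = nn.Linear(cfg.n_embd, cfg.mlp_hidden, bias=False)
        self.down = nn.Linear(cfg.mlp_hidden, cfg.n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)
```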

## Training
- Weight decay: 0.04 on both the Muon and AdamW parameter groups; improves quantization friendliness
- SWA (stochastic weight averaging): snapshot every 50 steps, ~29 checkpoints averaged during warmdown (see the sketch after this list)
- Muon optimizer: momentum 0.99, warmup 0.92→0.99 over 1500 steps
- matrix_lr=0.02, scalar_lr=0.02, tied_embed_lr=0.03
- seq_len=2048, batch=786K tokens, warmdown=3000 iters
- Sliding window eval: stride=64
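A sketch of the two schedule pieces, assuming a simple incremental mean for SWA and a linear momentum warmup; `muon_momentum` and `SWA` are hypothetical names, not train_gpt.py's API:

```python
import torch

def muon_momentum(step: int, warmup_steps: int = 1500) -> float:
    """Linear warmup of Muon momentum from 0.92 to 0.99 over 1500 steps."""
    t = min(step / warmup_steps, 1.0)
    return 0.92 + t * (0.99 - 0.92)

class SWA:
    """Running average of parameter snapshots taken every `every` steps;
    with every=50 over part of the 3000-iter warmdown this accumulates
    the ~29 checkpoints reported above."""
    def __init__(self, every: int = 50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step: int, params: dict[str, torch.Tensor]) -> None:
        if step % self.every != 0:
            return
        if self.avg is None:
            self.avg = {k: v.detach().float().clone() for k, v in params.items()}
        else:
            for k, v in params.items():
                # incremental mean: m += (x - m) / (n + 1)
                self.avg[k] += (v.detach().float() - self.avg[k]) / (self.n + 1)
        self.n += 1
```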

## Metrics
- Model params: 24,730,705
- Steps: 6,694 in 600s (89.5 ms/step)
- Pre-quant val_bpb: 1.1748
- Post-quant (mixed int5/int6 + zstd roundtrip): **val_bpb: 1.14526** (sliding-window eval sketched below)
- Artifact: 15,521,020 bytes (15.52MB)
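For reference, a sketch of the strided sliding-window evaluation as I read it: the window advances by stride=64 and each window scores only its newest targets, so most tokens are scored with near-full left context. The function and its byte accounting are assumptions, not the actual eval/eval.sh:

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, total_bytes: int,
                       seq_len: int = 2048, stride: int = 64) -> float:
    """Bits-per-byte with overlapping windows: advance by `stride`, score
    only the targets not covered by the previous window."""
    total_nll, prev_end = 0.0, 0
    n = tokens.numel()
    for begin in range(0, n - 1, stride):
        end = min(begin + seq_len, n - 1)
        logits = model(tokens[begin:end].unsqueeze(0))[0]  # (T, vocab)
        nll = F.cross_entropy(logits, tokens[begin + 1:end + 1],
                              reduction="none")
        new = end - prev_end  # targets first seen in this window
        total_nll += nll[-new:].sum().item()
        prev_end = end
        if end == n - 1:
            break
    return total_nll / math.log(2) / total_bytes  # nats -> bits, per byte
```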

## Run Command
```bash
NUM_LAYERS=10 bash eval/eval.sh
```
All other defaults are baked into train_gpt.py.

## Ablation Summary
| Change | val_bpb | Δ vs baseline |
|--------|-----|-------|
| 9L int6 (PR #162 base) | 1.14847 | baseline |
| + int5 MLP (9L) | 1.15663 | +0.008 (quant cost) |
| + 10th layer (int5 MLP) | 1.14803 | -0.0005 (depth compensates) |
| + WD=0.04 + SWA/50 | **1.14526** | **-0.003** |

Built on PR #162 by @unnir (SmearGate, BigramHash, OrthoInit).
@@ -0,0 +1,7 @@
{
"name": "10L Int5-MLP + MuonWD=0.04 + SWA/50 + SmearGate + BigramHash",
"val_bpb": 1.14526,
"artifact_bytes": 15521020,
"blurb": "10 layers with mixed int5/int6 quantization (int5 for MLP, int6 for attention). Int5 MLP compression saves 1.86MB vs uniform int6, funding the extra layer. Weight decay=0.04 improves quantization friendliness. SWA every 50 steps (~29 checkpoint average). SmearGate + BigramHash + OrthoInit + zstd-22.",
"authors": ["thwu1"]
}