# 2026-03-19: Vocab Size, NorMuon, Selective Quantization

Day 1! This record contains three main new ideas - a larger vocab size, NorMuon, and selective quantization - as well as some tweaks to the baseline. I had several ideas I wanted to try today, and these are the ones that worked; I want to push further on quantization in the coming days.

Changes in this model:
- Vocab size 1024 -> 8192
- New "sp8192" tokenizer trained using
```bash
./data/download_hf_docs_and_tokenize.py --output-root ./data --tokenizer-config ./data/tokenizer_specs.json --max-train-tokens 8000000000 --tokenizer-train-docs 100000
```
with this tokenizer_spec:
```json
{
  "tokenizers": [
    {
      "name": "sp_bpe_1024",
      "dataset_suffix": "sp1024",
      "vocab_size": 1024
    },
    {
      "name": "sp_bpe_8192",
      "dataset_suffix": "sp8192",
      "vocab_size": 8192
    }
  ]
}
```
which produces a 50/50 val/train split. Tokenizers for sp1024, sp2048, sp4096 and sp8192, along with their tokenized data, are publicly available on [my huggingface](https://huggingface.co/sproos/parameter-golf-tokenizers/tree/main).
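For the curious: assuming "sp" here means sentencepiece, which the `fineweb_8192_bpe.model` filename suggests, the tokenizer-training step underneath boils down to roughly this call, with `docs.txt` as a hypothetical stand-in for the sampled training documents:
```python
import sentencepiece as spm

# Train a BPE model with an 8192-token vocab; "docs.txt" is a
# hypothetical stand-in for the sampled training documents.
spm.SentencePieceTrainer.train(
    input="docs.txt",
    model_prefix="fineweb_8192_bpe",  # yields fineweb_8192_bpe.model + .vocab
    vocab_size=8192,
    model_type="bpe",
)
```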
- NorMuon implementation from [the original paper](https://github.com/zichongli5/NorMuon), popularized by `modded-nanogpt`, replacing Muon (a sketch of the core idea follows this list)
- Selective Quantization: weights are quantized to int6, while the embeddings are kept at int8 (see the second sketch below). I'm not sure this is optimal, and I've seen plenty of weird behaviour from it, but I think it's in the right direction; being precise about precision will be really key to this challenge and I want to dig into it more. From now on there will be a lot of trading off precision between areas of the model!
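
NorMuon's core idea, as I understand it from the paper: keep Muon's Newton-Schulz orthogonalization step, then normalize the orthogonalized update per neuron (per row) with an Adam-style second-moment estimate. A minimal single-matrix sketch - the names and the final rescaling convention are my assumptions, not the reference implementation:
```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    # as in Muon / modded-nanogpt.
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    transposed = G.size(-2) > G.size(-1)
    if transposed:
        X = X.mT
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.mT
    return X

def normuon_update(grad, momentum, row_moment, beta=0.95, beta2=0.95, eps=1e-8):
    # Muon half: update the momentum buffer, then orthogonalize it.
    momentum.lerp_(grad, 1 - beta)
    update = zeropower_via_newtonschulz5(momentum).float()
    # NorMuon half: per-neuron (row-wise) second moment of the orthogonalized
    # update, used to equalize row norms so no neuron dominates the step.
    row_moment.lerp_(update.pow(2).mean(dim=-1), 1 - beta2)
    update = update / (row_moment.sqrt().unsqueeze(-1) + eps)
    # Rescale the whole matrix back to a Muon-like magnitude (my convention).
    return update * (update.size(-2) ** 0.5 / (update.norm() + eps))
```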

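And for reference, a toy version of what the w6e8 roundtrip means (symmetric per-tensor fake-quantization; the real code may well use per-channel scales and pack the integers before zlib - this is just the shape of the idea):
```python
import torch

def fake_quant_roundtrip(t: torch.Tensor, bits: int) -> torch.Tensor:
    # Quantize to a signed `bits`-bit integer grid, then dequantize back.
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().amax().clamp(min=1e-12) / qmax
    return (t / scale).round().clamp(-qmax - 1, qmax) * scale

w = torch.randn(512, 512)  # stand-in for a weight matrix
print((fake_quant_roundtrip(w, 6) - w).abs().max())  # int6: coarser grid
print((fake_quant_roundtrip(w, 8) - w).abs().max())  # int8: ~4x finer
```
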
Configuration:
- All hyperparams as in the default NaiveBaseline except VOCAB_SIZE, TRAIN_SEQ_LEN, WARMDOWN_ITERS and NUM_LAYERS; unfortunately, to fit the increased vocab size we have to sacrifice a layer (back-of-envelope below). I'm sure there's a better architectural setup here, but I don't know if it's recurrence.
- Tested on a Hyperbolic Labs 8xH100 SXM5 node; reproduced the baseline with `step_avg:43.67ms` and `final_int8_zlib_roundtrip_exact val_bpb:1.22731147` immediately before.
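
Back-of-envelope for the vocab/layer trade, with a hypothetical d_model of 768 purely for illustration: at int8 (one byte per parameter) an embedding table costs vocab_size × d_model bytes, so going 1024 -> 8192 grows it from about 1024 × 768 ≈ 0.79 MB to about 8192 × 768 ≈ 6.29 MB of the track's 16 MB budget - the dropped layer is what pays for that.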

Command:
```bash
NCCL_IB_DISABLE=1 \
RUN_ID=verify_sp8192_w6e8_8gpu \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
VOCAB_SIZE=8192 \
MAX_WALLCLOCK_SECONDS=600 \
WEIGHT_QUANTIZATION_BITS=6 \
EMBED_QUANTIZATION_BITS=8 \
WARMDOWN_ITERS=3000 \
TRAIN_SEQ_LEN=4096 \
NUM_LAYERS=8 \
torchrun --standalone --nproc_per_node=8 ./records/track_10min_16mb/2026-03-19_VocabSize_NorMuon_SelectiveQuant/train_gpt.py
```

Key metrics (from `train.log`):
- Timed training stopped at `9359/20000` steps due to the wallclock cap.
- Pre-quant eval at stop: `val_loss:3.0261`, `val_bpb:1.1717`
- Post-quant roundtrip eval: `val_loss:3.06233041`, `val_bpb:1.18576208`
- Train time: `600075ms` (`step_avg:64.12ms`)
- Serialized model w6e8+zlib: `14743224 bytes`
- Code size: `53612 bytes`
- Total submission size w6e8+zlib: `14796836 bytes`
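
Two derived sanity checks, using only the numbers above: the byte accounting is consistent, 14743224 model bytes + 53612 code bytes = 14796836 bytes ≈ 14.8 MB, under the track's 16 MB cap; and assuming the usual conversion `val_bpb = (val_loss / ln 2) / bytes_per_token`, the post-quant numbers imply the sp8192 tokenizer averages about (3.06233041 / ln 2) / 1.18576208 ≈ 3.73 bytes per token on val.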

Training volume:
- Global batch: `524288` tokens/step
- Total train tokens seen: `7224688640`

Included files:
- `train_gpt.py` (code snapshot used for the run)
- `train.log` (exact remote training log)
- `submission.json` (leaderboard metadata)
`submission.json`:
```json
{
  "author": "Spruce Campbell",
  "github_id": "mtybadger",
  "name": "8192 Vocab Size, NorMuon, Selective Quantization",
  "blurb": "SP-8192 8xH100 SXM5 run using a new tokenizer and NorMuon implementation from the original paper, 1.185 bpb on val, 14.79 MB.",
  "date": "2026-03-19T12:09:29Z",
  "val_loss": 3.06233041,
  "val_bpb": 1.18576208,
  "bytes_total": 14796836,
  "bytes_code": 53612
}
```