`records/track_10min_16mb/2026-03-18_FP16Embed_WD3600/README.md`:

Kept the tied embedding in fp16 instead of quantizing it to int8, and tuned the LR schedule. Turns out the embedding is by far the most sensitive tensor to quantize — it's pulling double duty as the output head, so every bit of precision matters.

## what changed

**fp16 embedding passthrough**: one-line change in the quantization function. Instead of int8-quantizing `tok_emb.weight`, I pass it through as fp16. This drops the post-quant BPB degradation from ~0.007 to basically nothing (~0.0005). The tradeoff is ~500KB extra in the artifact, so I shrank the MLP hidden from 1024 to 992 to stay under 16MB.

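For concreteness, here is a minimal sketch of what the passthrough amounts to. This is not the actual quantization function in `train_gpt.py`: the key name `tok_emb.weight` comes from the description above, but the per-tensor symmetric int8 scheme and the packing format are assumptions.

```python
import torch

def quantize_state_dict(state_dict, passthrough_keys=("tok_emb.weight",)):
    # Sketch only: per-tensor symmetric int8 quantization, with an fp16
    # passthrough for the tied embedding (which also serves as the output head).
    packed = {}
    for name, w in state_dict.items():
        if name in passthrough_keys:
            packed[name] = {"dtype": "fp16", "data": w.to(torch.float16)}
        else:
            scale = w.abs().max().clamp(min=1e-8) / 127.0
            q = (w / scale).round().clamp(-127, 127).to(torch.int8)
            packed[name] = {"dtype": "int8", "data": q, "scale": scale.to(torch.float16)}
    return packed

def dequantize_state_dict(packed):
    # Inverse mapping applied at load time in this sketch.
    out = {}
    for name, entry in packed.items():
        if entry["dtype"] == "fp16":
            out[name] = entry["data"].float()
        else:
            out[name] = entry["data"].float() * entry["scale"].float()
    return out
```

The size math checks out against the config below: at `VOCAB_SIZE=1024` and `MODEL_DIM=512` the embedding is 1024 x 512 = 524,288 parameters, so fp16 instead of int8 costs one extra byte per parameter, roughly the ~500KB quoted above.
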
**warmdown + LR**: bumped `WARMDOWN_ITERS` from 1200 to 3600 and `MATRIX_LR` from 0.04 to 0.06. The default schedule assumes way more steps than you actually get in 10 minutes, so a longer warmdown and higher LR help the model converge properly.

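For intuition, a sketch of the kind of schedule these knobs control, assuming the common hold-then-linear-warmdown shape; the actual schedule in `train_gpt.py` may differ in detail (e.g. it may also include a warmup):

```python
def matrix_lr_at(step, total_steps, base_lr=0.06, warmdown_iters=3600):
    # Assumed shape: hold base_lr, then decay linearly to zero over the
    # final warmdown_iters steps. MATRIX_LR=0.06 and WARMDOWN_ITERS=3600
    # are the values used in this run.
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * steps_left / warmdown_iters

# With ~13,700 steps fitting in the 10-minute budget, WARMDOWN_ITERS=3600
# means roughly the last quarter of the run is spent decaying, versus only
# the last ~9% with the default 1200.
```
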
## config

```
VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4
MLP_HIDDEN=992 TIE_EMBEDDINGS=1 WARMDOWN_ITERS=3600 MATRIX_LR=0.06
```

## run command

```bash
RUN_ID=fp16embed \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_HIDDEN=992 \
WARMDOWN_ITERS=3600 \
MATRIX_LR=0.06 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note: don't set `NCCL_IB_DISABLE=1` — it tanks step throughput on pods with IB/NVLink (~60ms vs ~44ms per step).

## results

8xH100 SXM (RunPod secure cloud):

| seed | steps | val_loss | val_bpb | artifact size |
|------|-------|----------|---------|---------------|
| 1337 | 13,692 | 2.0595 | 1.2197 | 15.90MB |
| 42 | 13,722 | 2.0600 | 1.2201 | 15.90MB |

Pre-quant vs post-quant gap: ~0.0005 BPB (baseline gap is ~0.007).

Improvement over baseline: ~0.013 nats.

Also ran 3 seeds on 8xH200 SXM (all consistent, 1.2163-1.2179 BPB).

## things I tried that didn't work

- **SwiGLU**: better per-step quality but 45% slower per step on 8 GPUs, so fewer total steps in the time budget. Net negative.
- **depth recurrence** (looping layers): promising idea but needs way more steps than 10 min allows.
- **QAT**: tried both full-training and late-stage. The overhead per step wasn't worth the small quant gap reduction.
- **lzma compression**: actually compresses worse than zlib for int8 weight data (quick comparison sketch after this list).
- **higher embed LR** (0.08 vs 0.05): hurt convergence.
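
If you want to check the zlib-vs-lzma point on your own artifact, a standard-library comparison is enough. The script name and input path here are hypothetical; point it at the raw int8 weight payload:

```python
import lzma
import sys
import zlib

# Usage: python compare_compress.py <path-to-int8-payload>
# (hypothetical helper, not part of this record)
raw = open(sys.argv[1], "rb").read()
print(f"raw : {len(raw):>12,} bytes")
print(f"zlib: {len(zlib.compress(raw, level=9)):>12,} bytes")
print(f"lzma: {len(lzma.compress(raw, preset=9)):>12,} bytes")
```
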

## files

- `train_gpt.py` — modified training script
- `train.log` — 8xH100 log (seed 1337)
- `train_seed42.log` — 8xH100 log (seed 42)
- `submission.json`

`submission.json`:

{
"author": "Renier Velazco",
"github_id": "chonchiog",
"name": "FP16 Tied Embedding + LR/Warmdown Tuning",
"blurb": "Keep tok_emb.weight in fp16 during int8 quantization to eliminate the output-head quantization gap (0.007 -> 0.0005 BPB). Slightly reduce MLP hidden (992 vs 1024) to fit within 16MB. Tune warmdown (3600 vs 1200) and matrix LR (0.06 vs 0.04) for better convergence under the 10-min wallclock cap.",
"date": "2026-03-18",
"val_loss": 2.05945460,
"val_bpb": 1.21972502,
"bytes_total": 15896222,
"bytes_code": 48125
}