Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
"blurb": "Baseline 9x512 SP-1024 architecture with sliding window evaluation at stride=64. Each token is scored with 960+ tokens of context instead of the baseline's 0-1023. Training is identical to the naive baseline; the improvement comes entirely from the evaluation strategy. Post-quant int8+zlib roundtrip under the 16,000,000-byte cap.",
"date": "2026-03-19T04:48:00Z",
"val_loss": 2.01348383,
"val_bpb": 1.19250007,
"val_bpb": 1.1637,
"pre_quant_val_loss": 2.0592,
"pre_quant_val_bpb": 1.2196,
"step_stop": 13450,
Expand All @@ -14,4 +14,4 @@
"bytes_total": 15874829,
"bytes_model_int8_zlib": 15816489,
"bytes_code": 58340
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ class Hyperparameters:
val_files = os.path.join(data_path, "fineweb_val_*.bin")
tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model")
run_id = os.environ.get("RUN_ID", str(uuid.uuid4()))
seed = int(os.environ.get("SEED", 1337))
seed = int(os.environ.get("SEED", 7))

# Validation cadence and batch size. Validation always uses the full fineweb_val split.
val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288))
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Vocab 4096 + MLP 3x + Sliding Window Eval

**mean val_bpb: 1.1642** across 3 seeds (1.1650, 1.1640, 1.1637) | **Artifact: ~15.85 MB** (under 16MB)

## Summary

Six improvements stacked on the baseline 9-layer GPT:

1. **Vocab 4096** (up from 1024) — custom SentencePiece BPE tokenizer. Larger vocab means more bytes per token, fewer predictions per byte, directly improving BPB.

2. **3x MLP expansion** (hidden=1536, up from 1024) — enabled by int6 quantization savings. Wider feedforward provides better per-token modeling.

3. **Int6 per-row quantization with STE** — fake int6 quantization during training via Straight-Through Estimator. Model learns weight distributions that survive post-training quantization. Quant gap: +0.005 BPB.

4. **Seq4096 training** — 4x longer context per sequence than the baseline's 1024.

5. **SWA (Stochastic Weight Averaging)** — average of 7 checkpoints during warmdown phase.

6. **Sliding window evaluation** (stride=256, seq_len=4096) — each scored token gets 3840+ tokens of context. Eval time: 148s on 8xH100.

## Configuration

```
VOCAB_SIZE=4096 NUM_LAYERS=8 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4
MLP_MULT=3 TIE_EMBEDDINGS=1
TRAIN_SEQ_LEN=4096 EVAL_STRIDE=256
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500
WARMDOWN_ITERS=3000
WEIGHT_QUANTIZATION_BITS=6 EMBED_QUANTIZATION_BITS=8
SWA_ENABLED=1
MAX_WALLCLOCK_SECONDS=600
```

## Command

```bash
NCCL_IB_DISABLE=1 \
RUN_ID=v4096_mlp3x \
VOCAB_SIZE=4096 NUM_LAYERS=8 TRAIN_SEQ_LEN=4096 MLP_MULT=3 \
WARMDOWN_ITERS=3000 WEIGHT_QUANTIZATION_BITS=6 EMBED_QUANTIZATION_BITS=8 \
EVAL_STRIDE=256 SWA_ENABLED=1 \
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
MAX_WALLCLOCK_SECONDS=600 \
DATA_PATH=./data/datasets/fineweb10B_sp4096 \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Key Metrics

| Metric | Value |
|--------|-------|
| Steps (10 min cap) | 8,984 |
| Step time | 66.8 ms |
| Model params | 20,994,112 |
| Pre-quant val_bpb | 1.1603 |
| Post-quant sliding window val_bpb | **1.1655** |
| Quantization gap | +0.005 BPB |
| Artifact size | 15,846,785 bytes |
| Eval time (sliding window) | 148s |
| Peak GPU memory | 10,571 MiB |

## 3-Seed Validation

| Seed | val_bpb | Artifact |
|------|---------|----------|
| 1337 | 1.1650 | 15,846,785 bytes |
| 42 | 1.1640 | 15,846,550 bytes |
| 7 | 1.1637 | 15,846,550 bytes |

**Mean: 1.1642, Std: 0.0007**

One-sample t-test against baseline (1.2244): t=-157.3, **p < 0.0001**

## Tokenizer

Custom SentencePiece BPE tokenizer with 4096 vocab, trained on FineWeb. Included as `fineweb_4096_bpe.model`. Tokenizer and pre-tokenized dataset available at [sproos/parameter-golf-tokenizers](https://huggingface.co/sproos/parameter-golf-tokenizers).

## Included Files

- `train_gpt.py` — self-contained training script (1390 lines)
- `train.log` — full training log (seed 1337)
- `train_seed1337.log`, `train_seed42.log`, `train_seed7.log` — 3-seed validation logs
- `submission.json` — leaderboard metadata
- `fineweb_4096_bpe.model` — SentencePiece tokenizer
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
{
"name": "Sai Krishna Rallabandi",
"github_id": "saikrishnarallabandi",
"val_bpb": 1.1642,
"val_bpb_seeds": [1.1650, 1.1640, 1.1637],
"val_bpb_std": 0.0007,
"val_loss": 2.7138,
"artifact_bytes": 15846785,
"model_bytes": 15787592,
"code_bytes": 59193,
"steps": 8984,
"step_avg_ms": 66.80,
"train_time_s": 600,
"eval_time_s": 148,
"hardware": "8xH100 SXM",
"vocab_size": 4096,
"num_layers": 8,
"model_dim": 512,
"num_heads": 8,
"num_kv_heads": 4,
"mlp_mult": 3,
"train_seq_len": 4096,
"weight_quantization_bits": 6,
"embed_quantization_bits": 8,
"eval_stride": 256,
"swa_checkpoints": 7,
"tokenizer": "fineweb_4096_bpe.model"
}
Loading