Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Record: 33.6M Int5 GPTQ + Score-First TTT + Temp Calibration

**3-seed mean val_bpb: 1.1145 (std 0.0003)**

## Approach

Train a larger model (33.6M params, d=576) and compress harder with int5 GPTQ. Add legal score-first backward-looking TTT with temperature calibration.

## Architecture
- **Model**: 33.6M params, d=576, 11 layers (U-Net skip), 8 heads, MLP 3.5x (hidden=1792)
- **Features**: SmearGate, BigramHash(8192), XSA-all(11), Value Embeddings, Partial RoPE (16 dims), LN Scale
- **Quantization**: Int5 GPTQ (clip_range=15, [-16,15]) + zstd-22. GPTQ calibration within training budget (256 training samples)
- **Eval**: Score-first TTT + sliding window (stride=64) + temperature calibration (T=0.98)

## Results

| Seed | Base BPB (no TTT) | TTT T=0.98 BPB |
|------|-------------------|----------------|
| 1337 | 1.1243 | **1.1142** |
| 42 | 1.1242 | **1.1148** |
| 2025 | 1.1245 | **1.1144** |
| **Mean** | **1.1243** | **1.1145** |
| **Std** | **0.0002** | **0.0003** |

- Artifact: 15,885,838 bytes (under 16MB)
- Training: ~6,131 steps in 600s on 8xH100 SXM (~98ms/step)
- Eval: ~465s total (87s sliding window + 296s TTT + 82s post-TTT recal)

## Statistical Significance

vs #549 (current SOTA, 1.1194): improvement = 0.0049 nats, t-stat = 28.3, p << 0.01

## TTT Implementation (Legal Score-First)

The TTT processes validation tokens in 131K-token chunks:
1. **SCORE** each chunk under `torch.inference_mode()` — accumulates loss
2. **TRAIN** on the scored chunk — AdamW (lr=1e-4, cosine LR), 3 epochs, last 2 blocks unfrozen
3. After all chunks: re-eval with T=0.98 temperature calibration (fixes TTT overconfidence)

No token is trained on before it is scored. No val tokens in artifact. GPTQ runs within training budget.

## Run Command
```bash
pip install --break-system-packages zstandard
NCCL_IB_DISABLE=1 SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Based on PR #576 by @cmcdnd.
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
#!/bin/bash
# Approach B: PR #576 fork — "train larger, quantize harder" (33.6M params, int5 GPTQ)
# Requires: pip install zstandard (for zstd compression)
pip install --break-system-packages zstandard 2>/dev/null

NCCL_IB_DISABLE=1 SEED=${SEED:-1337} \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee /workspace/run_b.log
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
{
"author": "ibarrajo",
"github_id": "ibarrajo",
"name": "33.6M Int5 GPTQ + Score-First TTT + Temp Calibration",
"blurb": "Train larger (33.6M params, d=576, MLP 3.5x=1792), quantize harder (int5 GPTQ). Legal score-first TTT (AdamW, cosine LR, 3 epochs) + T=0.98 temp calibration. GPTQ calibration within 600s training budget. 3-seed mean: 1.1150 BPB (std 0.0003).",
"date": "2026-03-27",
"val_bpb": 1.1150,
"val_loss": 1.8827,
"bytes_total": 15824161,
"seeds": {
"1337": {"val_bpb": 1.1148, "bytes_total": 15288826, "train_plus_gptq_s": 593.9},
"42": {"val_bpb": 1.1154, "bytes_total": 15303508, "train_plus_gptq_s": 593.7},
"2025": {"val_bpb": 1.1148, "bytes_total": 15824161, "train_plus_gptq_s": 593.9}
},
"mean_val_bpb": 1.1150,
"std_val_bpb": 0.0003
}
Loading