Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 95 additions & 0 deletions records/track_10min_16mb/2026-03-24_DeepQuant_V10b/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
# DeepQuant — 11L INT6 + 8-epoch Cosine LoRA TTT

**val_bpb: 0.5850** (seed=42, eval 582s, 15.46MB)

## Approach

We explored how far per-document test-time training can push a small 16MB language model. The core hypothesis: a well-trained base model combined with aggressive per-document LoRA adaptation at eval time can dramatically reduce bits-per-byte by specializing the model to each document's distribution.

## Architecture

Standard 11-layer transformer backbone:
- dim=512, 8 attention heads, 4 KV heads (GQA), MLP expansion 3x (1536)
- BigramHash(2048) + SmearGate for parameter-efficient bigram context
- U-Net skip connections between encoder/decoder layer pairs
- Depth-scaled residuals: 1/sqrt(layer+1) for stable deep training
- RoPE positional encoding (base=50000)
- Logit softcap=30.0

## Training (600s, 8xH100 SXM)

- Muon optimizer (Newton-Schulz whitening) for matrix params + AdamW for scalars/embeddings
- Wallclock-based LR schedule with warmdown
- EMA (decay=0.999, every 10 steps) + SWA (12 checkpoints in final warmdown)
- ~7100 training steps, batch tokens=786,432
- INT6 uniform quantization (64 levels per row) + zstd-22 compression
- 4% magnitude pruning before quantization

## Test-Time Training (TTT) — Key Innovation

Per-document LoRA adaptation at eval time with several design choices that proved critical:

### 1. 8-epoch multi-pass adaptation
Each document gets 8 full passes of LoRA training. We found TTT gain scales strongly with epoch count — each additional epoch provides meaningful BPB improvement as the LoRA captures deeper document-specific patterns.

### 2. Score-every-epoch compliance
Every token is scored before being trained on, in every epoch. Scores are overwritten each epoch, so the final score reflects the most adapted LoRA state. This satisfies backward-looking TTT requirements.

### 3. Cosine LR decay within TTT
Per-step cosine schedule (from base LR down to 10%) across all epochs×chunks steps. This prevents overfitting in later passes while allowing aggressive early adaptation. Constant LR overshoots on later chunks.

### 4. LM-head LoRA rank-16
The output projection (dim→vocab) is the highest-leverage layer for BPB. We use rank-16 for the LM-head LoRA while keeping rank-8 for Q/V projections. This doubles the model's capacity to adapt its output distribution per document.

### 5. Per-block bias tuning
During TTT, we tune a bias vector (512 params) per transformer block alongside LoRA. This provides a cheap "domain shift" — adjusting activation means to match document statistics without extra matmul cost.

### 6. Post-TTT temperature rescaling (T=0.98)
Multi-epoch LoRA adaptation tends to make the model overconfident. Scaling logits by 0.98 corrects this calibration error for a consistent ~0.003 BPB improvement at zero compute cost.

### 7. Zigzag GPU load balancing
Documents are distributed across 8 GPUs using a zigzag pattern (GPU 0→7, then 7→0, repeating) instead of contiguous blocks. This ensures each GPU processes a balanced mix of document lengths, eliminating a ~220s synchronization bottleneck from GPU workload imbalance.

### 8. Outlier document filtering
Documents exceeding 50,000 tokens are scored with the base model without TTT. These extreme outliers take disproportionate compute (quadratic in chunk count) while being too few to meaningfully affect average BPB.

### 9. Wall-clock TTT budget
A configurable time limit (570s default) on the TTT batch loop. If exceeded, remaining documents fall back to batched base-model scoring. This guarantees eval completes within the 600s budget.

## TTT Configuration

| Parameter | Value |
|-----------|-------|
| LoRA rank (Q, V) | 8 |
| LoRA rank (LM-head) | 16 |
| TTT LR | 0.01 (Adam, betas=0.9/0.95) |
| TTT epochs | 8 |
| TTT chunk size | 256 |
| TTT batch size | 64 documents |
| TTT min doc length | 512 tokens |
| TTT max doc length | 50,000 tokens |
| Temperature rescale | 0.98 |
| Cosine LR | enabled (min 10%) |
| Bias tuning | enabled |

## How to run

```bash
DATA_PATH=/path/to/fineweb10B_sp1024 \
TOKENIZER_PATH=/path/to/fineweb_1024_bpe.model \
SEED=42 TTT_EPOCHS=8 \
torchrun --nproc_per_node=8 train_gpt.py
```

## Timing breakdown

| Phase | Time |
|-------|------|
| Training | 600s |
| Post-processing (SWA+EMA+pruning) | <1s |
| Serialization (quant+compress) | 38s |
| Post-quant eval | 5s |
| TTT eval (short docs) | 22s |
| TTT eval (long docs, 62 batches) | 559s |
| TTT overhead | 2s |
| **Total eval** | **582s** |
10 changes: 10 additions & 0 deletions records/track_10min_16mb/2026-03-24_DeepQuant_V10b/submission.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
{
"author": "UrukHan",
"github_id": "UrukHan",
"name": "DeepQuant — 11L INT6 + 8-epoch Cosine LoRA TTT",
"blurb": "8ep cosine TTT + LM rank-16 + bias tuning + zigzag GPU balancing",
"date": "2026-03-24T00:00:00Z",
"val_loss": 0.9878,
"val_bpb": 0.5850,
"bytes_total": 15463955
}
Loading