# SP8192 + BOS-Fix SmearGate + LQER Asym + Phased TTT (10L)

**val_bpb: 1.07171** (single seed 314) | **15.37 MB** | 8xH100 SXM, 596s

Applies the full SOTA stack from PR #1851 (BOS-fixed SmearGate + LQER Asymmetric + Phased TTT + layer looping) with the SP8192 tokenizer instead of SP1024. Uses 10 transformer layers instead of 11 to fit the larger embedding table (8192 vocab) under the 16MB artifact limit with brotli compression.
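
For intuition on the layer-for-vocab trade, here is a back-of-envelope budget in Python. The per-layer shapes (4x MLP, GQA projections derived from 512d / 8 heads / 4 KV heads) and the flat 6 bits per weight are illustrative assumptions, not the actual artifact layout, and brotli shrinks both further:

```python
# Rough budget: why the larger vocab costs about one 512d layer.
# ASSUMPTIONS (not from this PR): untied 4x MLP, GQA projections,
# and a flat 6 bits per weight with no compression.
D = 512

def layer_params(d: int = D, kv_frac: float = 0.5, mlp_ratio: int = 4) -> int:
    attn = 2 * d * d + 2 * d * int(d * kv_frac)  # Wq, Wo + Wk, Wv (GQA)
    mlp = 2 * d * (mlp_ratio * d)                # up + down projections
    return attn + mlp

def int6_mb(params: int) -> float:
    return params * 6 / 8 / 1e6  # 6 bits per weight, reported in MB

extra_embed = (8192 - 1024) * D  # extra embedding rows, SP8192 vs SP1024
print(f"extra embedding: {int6_mb(extra_embed):.2f} MB")    # ~2.75 MB
print(f"one 512d layer:  {int6_mb(layer_params()):.2f} MB")  # ~2.16 MB
```

Under these assumptions the extra embedding rows cost on the order of one layer's worth of budget, consistent with dropping a layer to make room.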

## Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.07399 |
| Post-GPTQ val_bpb | 1.08251 |
| Post-TTT val_bpb | 1.07171 |
| Artifact size | 15,373,365 bytes |
| Training steps | 5,218 |
| Training time | 596s |
| Hardware | 8xH100 80GB SXM |
| Seed | 314 |
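
For reference, val_bpb is bits of cross-entropy per byte of validation text. A minimal conversion, assuming the evaluation loss is a summed cross-entropy in nats and the raw validation byte count is known:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed cross-entropy (nats) over the validation set, converted
    to bits (divide by ln 2) and normalized by the raw byte count."""
    return total_nll_nats / math.log(2) / total_bytes

# The post-TTT figure above corresponds to ~1.07171 bits of code
# length per byte of validation data.
```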

## Changes vs PR #1851

- **SP8192 tokenizer** instead of SP1024: better tokenization efficiency (more bytes covered per token), worth ~0.02 BPB; see the sketch after this list
- **10 layers** instead of 11 — required to fit under 16MB with the larger 8192-vocab embedding table
- All other settings identical to PR #1851: BOS-fixed SmearGate, GPTQ int6 + LQER Asymmetric, Phased TTT (1 phase, 2000 prefix docs), layer looping, SparseAttnGate
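
A quick way to compare the two vocabularies' efficiency with sentencepiece; the model filenames and sample corpus path are placeholders (the PR takes SP8192 from sproos/parameter-golf-tokenizers):

```python
import sentencepiece as spm

def bytes_per_token(model_file: str, text: str) -> float:
    """Average UTF-8 bytes per token: higher means fewer sequence
    positions (and less compute) spent per byte of text."""
    sp = spm.SentencePieceProcessor(model_file=model_file)
    return len(text.encode("utf-8")) / len(sp.encode(text))

sample = open("val_sample.txt", encoding="utf-8").read()  # placeholder path
for model in ("sp1024.model", "sp8192.model"):            # placeholder names
    print(model, f"{bytes_per_token(model, sample):.2f} bytes/token")
```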

## Architecture

| Component | Setting |
|---|---|
| Layers | 10 (512d, 8 heads, 4 KV heads) |
| Vocabulary | SP8192 |
| Layer loop | encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9] (sketched below) |
| SmearGate | BOS-boundary fixed (PR #1851) |
| Quantization | GPTQ int6 + LQER Asymmetric |
| TTT | Phased score-first LoRA TTT (rank 96, 1 phase) |
| Compression | brotli quality=11 |
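
The loop schedule above gets 16 layers of depth out of 10 layers of parameters by revisiting blocks. A minimal sketch of applying such a schedule, assuming a plain `nn.ModuleList` of transformer blocks (names are illustrative, not the PR's code):

```python
import torch.nn as nn

# Schedules from the table: 10 unique blocks, 16 forward positions.
ENCODER_LOOP = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER_LOOP = [5, 3, 4, 5, 6, 7, 8, 9]

class LoopedStack(nn.Module):
    """Runs a shared list of blocks in a fixed repetition order, so
    parameter count stays at 10 layers while effective depth is 16."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x):
        for idx in ENCODER_LOOP + DECODER_LOOP:
            x = self.blocks[idx](x)
        return x
```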

## Run Command

```bash
TORCHINDUCTOR_CACHE_DIR=/workspace/inductor_cache \
RUN_ID=sota_sp8192_10L SEED=314 VOCAB_SIZE=8192 NUM_LAYERS=10 \
SMEAR_GATE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
LQER_ENABLED=1 LQER_ASYM_ENABLED=1 \
MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
WARMDOWN_FRAC=0.85 MIN_LR=0.1 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
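
The run is configured entirely through environment variables. A hedged sketch of how train_gpt.py might consume them (the helper names and defaults are assumptions, shown with this run's values):

```python
import os

def env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, str(default)))

def env_flag(name: str, default: bool = False) -> bool:
    return os.environ.get(name, "1" if default else "0") == "1"

SEED = env_int("SEED", 314)
VOCAB_SIZE = env_int("VOCAB_SIZE", 8192)
NUM_LAYERS = env_int("NUM_LAYERS", 10)
SMEAR_GATE_ENABLED = env_flag("SMEAR_GATE_ENABLED")
SPARSE_ATTN_GATE_SCALE = float(os.environ.get("SPARSE_ATTN_GATE_SCALE", "0.5"))
WARMDOWN_FRAC = float(os.environ.get("WARMDOWN_FRAC", "0.85"))
```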

## Lineage

```
PR #1851 (1.0614 BPB) — BOS-fix SmearGate + LQER Asym + Phased TTT
└── This work:
    ├── SP8192 tokenizer (from sproos/parameter-golf-tokenizers)
    └── 10 layers (down from 11 to fit 16MB with larger vocab embedding)
```
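
The 16MB cap applies to the brotli-compressed artifact. A minimal sketch of the pack-and-check step, assuming the artifact is a torch-serialized state dict (the filename, save format, and helper name are assumptions):

```python
import io
import brotli
import torch

def pack_artifact(state_dict, path="artifact.bin.br",
                  limit=16 * 1024 * 1024) -> int:
    buf = io.BytesIO()
    torch.save(state_dict, buf)                         # serialize weights
    blob = brotli.compress(buf.getvalue(), quality=11)  # max-effort brotli
    assert len(blob) <= limit, f"{len(blob):,} bytes exceeds the 16MB cap"
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)  # this run reports 15,373,365 bytes total
```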

## Dependencies

```
torch==2.9.1+cu128
sentencepiece==0.2.1
flash-attn==2.8.3
numpy
brotli
```

## Metadata

```json
{
"author": "am-commits",
"github_id": "am-commits",
"name": "SP8192 + BOS-Fix SmearGate + LQER Asym + Phased TTT (10L)",
"blurb": "PR #1851 SOTA stack (BOS-fixed SmearGate + LQER Asymmetric + Phased TTT + layer looping) with SP8192 tokenizer. 10 layers instead of 11 to fit under 16MB with larger 8192-vocab embedding table.",
"date": "2026-04-30",
"track": "10min_16mb",
"val_bpb": 1.07170518,
"pre_quant_val_bpb": 1.07398921,
"step_stop": 5218,
"wallclock_seconds": 596,
"bytes_total": 15373365,
"gpu": "8xH100-80GB-SXM",
"seed": 314,
"vocab_size": 8192,
"num_layers": 10,
"comparison_baseline_pr": 1851,
"metrics": {
"pre_quant_post_ema_val_bpb": 1.07398921,
"post_quant_val_bpb": 1.08251175,
"post_ttt_val_bpb": 1.07170518,
"gptq_time_s": 10.0,
"ttt_eval_time_s": 487
}
}
```