`records/track_10min_16mb/2026-03-24_11L_SOTA_MLP35x/README.md`

# Record: 11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330)

**3-seed mean val_bpb: 1.1330** (std=0.0007)

| Seed | val_bpb | val_loss | Steps |
|------|---------|----------|-------|
| 1337 | 1.1334 | 1.9136 | 3842 |
| 42 | 1.1322 | 1.9116 | 3885 |
| 2024 | 1.1334 | 1.9136 | 3857 |

## Architecture (31.4M parameters)
- 11 transformer layers, dim=512, 8 query heads / 4 KV heads (GQA)
- MLP with 3.5x expansion (hidden=1792) and a **LeakyReLU(0.5)^2** activation (see the sketch after this list)
- **SmearGate** + **BigramHash(10240, dim=128)** + **TrigramHash(4096, dim=128)**
- **Value Residual (ResFormer)** — V from layer 0 is cached and blended into later layers via a learned lambda (sketched below)
- **Gated Attention** — per-head sigmoid gate (nn.Linear, bias initialized to 4.0; sketched below)
- **XSA (exclusive self-attention) on all 11 layers**
- **Partial RoPE** — rotary embeddings applied to 16 of 64 head dimensions
- Tied FP16 embeddings, U-Net skip connections, orthogonal initialization
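
A minimal PyTorch sketch of the MLP block, assuming the activation squares the LeakyReLU output elementwise (the record does not spell out the composition); module and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakySquaredMLP(nn.Module):
    """Hypothetical 3.5x-expansion MLP (512 -> 1792 -> 512) with a
    LeakyReLU(0.5)^2 activation: square the LeakyReLU output elementwise."""

    def __init__(self, dim: int = 512, hidden: int = 1792):
        super().__init__()
        self.fc_in = nn.Linear(dim, hidden, bias=False)
        self.fc_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.leaky_relu(self.fc_in(x), negative_slope=0.5)
        return self.fc_out(h * h)  # the "(0.5)^2": negative inputs map to (0.5x)^2
```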
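
Similarly, hedged sketches of the value-residual blend and the per-head attention gate. Parameter names, tensor layouts, and the sigmoid parameterization of lambda are assumptions; only the layer-0 V cache, the learned blend, and the bias init of 4.0 come from the record:

```python
import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    """ResFormer-style value residual: blend this layer's V with the cached
    V from layer 0 via a learned scalar lambda (assumed sigmoid-squashed)."""

    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.zeros(()))  # sigmoid(0) = 0.5 blend at init

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.lam)
        return lam * v + (1.0 - lam) * v0

class HeadGate(nn.Module):
    """Per-head sigmoid gate on the attention output. Bias init 4.0 gives
    sigmoid(4) ~ 0.98, so gates start nearly open."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, n_heads)
        nn.init.constant_(self.proj.bias, 4.0)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); attn_out: (B, T, n_heads, head_dim) -- layout assumed
        gate = torch.sigmoid(self.proj(x))   # (B, T, n_heads)
        return attn_out * gate.unsqueeze(-1)  # gate each head independently
```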

## Training
- Muon optimizer: lr=0.03, momentum ramped 0.92→0.99 over the first 1500 steps, weight decay 0.04 (momentum schedule and EMA sketched after this list)
- Adam for embeddings (lr=0.035) and scalars (lr=0.03)
- Batch size 786,432 tokens, seq_len 2048
- EMA of weights (decay=0.997); learning-rate warmdown over 3500 iterations
- Late QAT (quantization-aware training) via a straight-through estimator, enabled for the final 15% of wallclock
- Gradient clipping at 0.3
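
Hedged sketches of the momentum ramp and the EMA update. The linear ramp shape is an assumption (the record only gives the endpoints and step count); function names are illustrative:

```python
import torch

def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  ramp_steps: int = 1500) -> float:
    """Momentum schedule 0.92 -> 0.99 over the first 1500 steps, then held
    constant. A linear ramp is assumed."""
    frac = min(step / ramp_steps, 1.0)
    return start + frac * (end - start)

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.997) -> None:
    """One EMA step: ema <- decay * ema + (1 - decay) * online weights."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - decay)
```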

## Quantization
- Int6 uniform per-row quantization with GPTQ-lite: a clip search over 5 percentile candidates per row (sketched below)
- FP16 passthrough for tied embeddings
- zstd-22 compression
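
A sketch of the per-row int6 quantizer under stated assumptions: symmetric codes in [-32, 31], one fp16 scale per row, and a 5-candidate percentile grid for the clip search (the record gives the candidate count but not the exact percentiles):

```python
import torch

def quantize_int6_per_row(w: torch.Tensor):
    """Per-row uniform int6 quantization with a small clip search: for each
    row, try 5 clip thresholds (percentiles of |w|, grid assumed) and keep
    the one with the lowest reconstruction MSE."""
    qmax = 31.0                                    # int6 symmetric: [-32, 31]
    rows = w.shape[0]
    best_err = torch.full((rows,), float("inf"))
    best_q = torch.zeros_like(w, dtype=torch.int8)
    best_scale = torch.ones(rows, 1)
    for p in (0.995, 0.9975, 0.999, 0.9995, 1.0):  # 5 candidates (assumed grid)
        clip = torch.quantile(w.abs().float(), p, dim=1, keepdim=True).clamp_min(1e-8)
        scale = clip / qmax
        q = torch.clamp(torch.round(w / scale), -32, 31)
        err = ((q * scale - w) ** 2).sum(dim=1)
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_q[better] = q[better].to(torch.int8)
        best_scale[better] = scale[better]
    return best_q, best_scale.half()  # dequant: best_q.float() * best_scale.float()
```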

## Evaluation
- Sliding-window evaluation with stride=64 (sketched below)
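
A minimal sketch of stride-64 sliding-window evaluation, assuming a model that returns per-token logits: only the last 64 tokens of each window are scored, so each scored token sees near-full left context. Converting bits-per-token to bits-per-byte would additionally require the tokens-to-bytes ratio, omitted here:

```python
import math
import torch

@torch.no_grad()
def sliding_window_bits_per_token(model, tokens: torch.Tensor,
                                  window: int = 2048, stride: int = 64) -> float:
    """Score a long token stream with overlapping windows; only the final
    `stride` positions of each window contribute to the loss."""
    nll, count = 0.0, 0
    for start in range(0, tokens.numel() - window, stride):
        chunk = tokens[start : start + window + 1]
        logits = model(chunk[:-1].unsqueeze(0))      # (1, window, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        tgt = chunk[1:].unsqueeze(0).unsqueeze(-1)   # (1, window, 1)
        tok_nll = -logp.gather(-1, tgt).squeeze(-1)  # (1, window)
        nll += tok_nll[0, -stride:].sum().item()     # score the fresh tail only
        count += stride
    return (nll / count) / math.log(2)               # nats -> bits
```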

## Development Process
Developed with a 30-experiment autoresearch loop on 1xH100 (~8 hours), then validated across 3 seeds on 8xH100 SXM.

### Feature ablation (measured on 1xH100)

| Feature | Δ val_bpb (negative = better) |
|---------|-------------------------------|
| Value Residual | -0.017 |
| SmearGate | -0.010 |
| XSA all 11 layers | -0.005 |
| Gated Attention | -0.004 |
| Partial RoPE (16/64) | -0.004 |
| TrigramHash | -0.002 |
| Late QAT | -0.002 |

## Record metadata
```json
{
  "author": "Aryan Bhosale",
  "github_id": "aryanbhosale",
  "name": "11L MLP3.5x LeakyReLU(0.5)^2 + Full SOTA Stack (mean val_bpb=1.1330)",
  "blurb": "11-layer 512d transformer with MLP 3.5x LeakyReLU(0.5)^2, SmearGate, BigramHash(10240), TrigramHash(4096), Value Residual, Gated Attention, XSA-all-11, Partial RoPE(16/64). Muon lr=0.03 WD=0.04, EMA(0.997), Late QAT, int6+GPTQ-lite+zstd-22. 3-seed mean 1.1330 (std=0.0007) on 8xH100 SXM.",
  "date": "2026-03-24T12:00:00Z",
  "val_loss": 1.9129,
  "val_bpb": 1.1330,
  "bytes_total": 10500000,
  "bytes_code": 70872,
  "seeds": {
    "1337": {"val_bpb": 1.1334, "val_loss": 1.9136, "steps": 3842},
    "42": {"val_bpb": 1.1322, "val_loss": 1.9116, "steps": 3885},
    "2024": {"val_bpb": 1.1334, "val_loss": 1.9136, "steps": 3857}
  }
}
```