@@ -0,0 +1,77 @@
# Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT

**val_bpb = 1.07785** (3-seed mean, std 0.00047) | **~15.99 MB** | 8xH100 SXM

## 3-Seed Results

| Seed | Sliding bpb | **TTT bpb** | Artifact (bytes) |
|------|-------------|-------------|-------------------|
| 42 | 1.07880 | **1.07718** | 15,990,780 |
| 314 | 1.07959 | **1.07810** | 15,987,449 |
| 999 | 1.07963 | **1.07826** | 15,987,550 |
| **Mean** | **1.07934** | **1.07785** | **15,988,593** |
| **Std** | **0.00039** | **0.00047** | |

Merged SOTA (PR #1493, our previous record): **1.0810 bpb**. Delta: **-0.0032 bpb**.

## Key Techniques

1. **Improved Parallel Residuals** (from PR #1529 @msisovic) -- cross-lane routing where attention and MLP outputs route to BOTH lanes via learned scalars. 66 new scalar params (`par_post[11,2,2]` + `par_resid[11,2]`). Final output = MLP lane (lane1). Starts at layer 7.

2. **Muon Momentum 0.97** (from PR #1514 @dexhunter) -- reduced from 0.99. The shorter momentum memory horizon (≈ 1/(1 - 0.97) ≈ 33 steps) better tracks the rapidly changing loss surface during warmdown.

3. **MATRIX_LR = 0.03** -- re-tuned for momentum 0.97 (a higher LR pairs with the lower momentum). Sweep (val bpb): 0.022 → 1.0797, 0.03 → 1.0795, 0.04 → 1.0811.

4. **3-Layer Depth Recurrence** (L3-5, activated at frac=0.35) -- 17 virtual layers from 11 physical; see the schedule sketch after this list.

5. **QK-Gain 5.25** -- raising the init from 4.0 to 5.25 improved val bpb monotonically.

6. **Legal Score-First TTT** -- SGD (lr=0.005, mom=0.9), 3 epochs per 32K-token chunk, cosine LR decay.

7. **SP8192 + GPTQ SDClip** -- int6 matrices (clip at k=12.85 sigmas), int8 embeddings (clip at k=20.0 sigmas), Brotli-11 compression.

8. **Tuned Hyperparameters** -- WD=0.095, EMA=0.9965, warmdown=0.72.
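
A minimal sketch of the virtual-layer schedule for technique 4, assuming the loop parameters from the training log (`loop_start=3`, `loop_end=5`, `num_loops=2`); the function and argument names are illustrative, not the actual `train_gpt.py` API:

```python
def depth_recurrence_schedule(num_layers=11, loop_start=3, loop_end=5, num_loops=2):
    """Physical block indices visited in one forward pass once looping
    activates (frac >= 0.35): 17 virtual layers from 11 physical blocks."""
    loop = list(range(loop_start, loop_end + 1))          # [3, 4, 5]
    return (list(range(loop_start))                       # [0, 1, 2]
            + loop * (1 + num_loops)                      # [3, 4, 5] repeated 3x
            + list(range(loop_end + 1, num_layers)))      # [6, ..., 10]

schedule = depth_recurrence_schedule()
assert len(schedule) == 17
# Equals the logged encoder [0,1,2,3,4,5,3,4] followed by decoder [5,3,4,5,6,7,8,9,10].

def run_blocks(x, blocks):
    # blocks: the 11 physical transformer blocks (e.g. an nn.ModuleList)
    for idx in depth_recurrence_schedule():
        x = blocks[idx](x)
    return x
```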

## Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10]. Improved parallel residuals from layer 7: attention reads from lane0, MLP reads from lane1, both outputs route to both lanes via learned `par_post` and `par_resid` scalars. Skip gates (sigmoid-gated U-Net connections).
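
A minimal sketch of the cross-lane routing described above, assuming a per-layer scalar layout of `[branch, lane]` and an identity initialization; class and argument names are illustrative rather than the actual `train_gpt.py` modules:

```python
import torch
import torch.nn as nn

class ImprovedParallelResidualBlock(nn.Module):
    """One layer with two residual lanes (used from layer 7 onward).
    Attention reads lane 0, the MLP reads lane 1, and each branch output is
    routed to BOTH lanes through learned scalars."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, dim: int):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm0, self.norm1 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # par_resid[lane]: scale on the carried residual of each lane
        self.par_resid = nn.Parameter(torch.ones(2))
        # par_post[branch, lane]: routing from branch (0 = attn, 1 = MLP) to lane
        # (identity init is an assumption: start as plain parallel residuals)
        self.par_post = nn.Parameter(torch.eye(2))

    def forward(self, lane0: torch.Tensor, lane1: torch.Tensor):
        a = self.attn(self.norm0(lane0))   # attention branch reads lane 0
        m = self.mlp(self.norm1(lane1))    # MLP branch reads lane 1
        new0 = self.par_resid[0] * lane0 + self.par_post[0, 0] * a + self.par_post[1, 0] * m
        new1 = self.par_resid[1] * lane1 + self.par_post[0, 1] * a + self.par_post[1, 1] * m
        return new0, new1                  # the final model output is taken from lane 1
```

Six scalars per layer (2 residual + 4 routing) stacked over 11 layers gives the 66 parameters counted above (`par_post[11,2,2]` + `par_resid[11,2]`).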

## Compliance (Track B)

Per Issue #1017:
- **Condition 1 (Causality):** Sliding-window eval, prefix only
- **Condition 2 (Normalized):** Standard softmax, no n-gram/logit bias
- **Condition 3 (Score before update):** Each chunk scored under `torch.no_grad()` BEFORE SGD
- **Condition 4 (Single pass):** Each token scored once, no rescoring

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.
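
A minimal sketch of the score-first loop enforcing Conditions 3 and 4, assuming `chunks` yields the 32K-token eval chunks in order and the model returns per-token losses; the harness names and the global cosine schedule are assumptions, not the exact eval code:

```python
import math
import torch

def score_first_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3):
    """Score each chunk with frozen weights BEFORE adapting on it (single pass)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tokens = 0.0, 0
    total_updates = len(chunks) * epochs
    for i, (inputs, targets) in enumerate(chunks):
        # Conditions 3/4: score first, under no_grad, each token scored exactly once.
        with torch.no_grad():
            total_nll += model(inputs, targets).sum().item()
        total_tokens += targets.numel()
        # Only afterwards: 3 SGD epochs on the just-scored chunk, cosine LR decay.
        for e in range(epochs):
            step = i * epochs + e
            for g in opt.param_groups:
                g["lr"] = lr * 0.5 * (1.0 + math.cos(math.pi * step / total_updates))
            opt.zero_grad(set_to_none=True)
            model(inputs, targets).mean().backward()
            opt.step()
    return total_nll / total_tokens  # mean NLL per token; the harness converts this to bpb
```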

## Reproduction

```bash
SEED=42 QK_GAIN_INIT=5.25 MUON_MOMENTUM=0.97 MATRIX_LR=0.03 \
TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
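
The two remaining seeds reuse the same command with `SEED=314` and `SEED=999`; only the seed differs between the three runs.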

## Credits

- **@msisovic** -- Improved parallel residuals (PR #1529, #1204)
- **@clarkkev** -- SP8192 + GPTQ + SDClip + MuonEq-R (PR #1394)
- **@dexhunter** -- Muon 0.97 (PR #1514), depth recurrence (PR #1331, #1437), TTT on SP8192 (PR #1413)
- **@abaybektursun** -- Score-first TTT framework (PR #549)
- **@X-Abhishek-X** -- Hyperparameter tuning (PR #1445, #1471)
- **@Robby955** -- Parallel residuals on SP8192 (PR #1412)

## Acknowledgements

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod).

## Included Files

- `README.md` (this file)
- `submission.json`
- `train_gpt.py`
- `train_seed42.log`
- `train_seed314.log`
- `train_seed999.log`
@@ -0,0 +1,37 @@
{
  "author": "bigbag",
  "github_id": "bigbag",
  "name": "SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal Score-First TTT",
  "date": "2026-04-11",
  "track": "10min_16mb",
  "val_bpb": 1.07785,
  "val_bpb_std": 0.00047,
  "seeds": [42, 314, 999],
  "seed_results": {
    "42": {"val_bpb": 1.07718, "artifact_bytes": 15990780},
    "314": {"val_bpb": 1.07810, "artifact_bytes": 15987449},
    "999": {"val_bpb": 1.07826, "artifact_bytes": 15987550}
  },
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.9.1+cu128",
  "technique_summary": "SP8192 + Improved Parallel Residuals (cross-lane routing L7+) + 3-Layer Depth Recurrence (L3-5) + Muon 0.97 + LR 0.03 + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Score-First TTT (SGD 3ep) + GPTQ SDClip + Brotli",
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": true,
    "eval_under_600s": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "score_first_ttt": true,
    "three_seeds": true
  },
  "attribution": {
    "sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
    "improved_parallel_residuals": "@msisovic (PR #1529, #1204)",
    "legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)",
    "muon_097": "@dexhunter (PR #1514)",
    "hyperparameter_tuning": "@X-Abhishek-X (PR #1445)"
  }
}

Large diffs are not rendered by default.

@@ -0,0 +1,149 @@
W0411 02:14:48.169000 47104 torch/distributed/run.py:803]
W0411 02:14:48.169000 47104 torch/distributed/run.py:803] *****************************************
W0411 02:14:48.169000 47104 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0411 02:14:48.169000 47104 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/3f0916f2-8576-4b2d-95dd-fae9c621b1a2.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.03
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.97
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
qk_gain_init: 5.25
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 3f0916f2-8576-4b2d-95dd-fae9c621b1a2
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 3
ttt_hash_buckets: 16384
ttt_hash_embed: True
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 128
val_tokens: 40540160
model_params:35944602
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0096 val_bpb: 3.4879
1/20000 train_loss: 9.0112 train_time: 0.0m tok/s: 7915600
2/20000 train_loss: 12.5748 train_time: 0.0m tok/s: 7809948
3/20000 train_loss: 11.4700 train_time: 0.0m tok/s: 7718096
4/20000 train_loss: 9.7598 train_time: 0.0m tok/s: 7668274
5/20000 train_loss: 8.5575 train_time: 0.0m tok/s: 7639140
500/20000 train_loss: 3.3436 train_time: 0.9m tok/s: 7404018
1000/20000 train_loss: 3.2122 train_time: 1.8m tok/s: 7388854
1500/20000 train_loss: 3.1225 train_time: 2.7m tok/s: 7389355
layer_loop:enabled step:1935 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2000/20000 train_loss: 3.0884 train_time: 3.6m tok/s: 7284216
2500/20000 train_loss: 3.0880 train_time: 4.9m tok/s: 6705687
3000/20000 train_loss: 2.9604 train_time: 6.2m tok/s: 6371157
3500/20000 train_loss: 2.9751 train_time: 7.5m tok/s: 6152445
4000/20000 train_loss: 2.9034 train_time: 8.7m tok/s: 5998426
4000/20000 val_loss: 2.8647 val_bpb: 1.1090
4414/20000 val_loss: 2.8071 val_bpb: 1.0867
stopping_early: wallclock_cap train_time: 588150ms step: 4414/20000
peak memory allocated: 39718 MiB reserved: 39742 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80693411 val_bpb:1.08665167 eval_time:6466ms
Serialized model: 135431741 bytes
Code size: 17184 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 13.2s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, par_post, par_resid, skip_gates, skip_weights
Serialized model quantized+brotli: 15970265 bytes
Total submission size quantized+brotli: 15987449 bytes
quantized val_loss:2.83210781 val_bpb:1.09639719 eval_time:9232ms
quantized_sliding_window val_loss:2.78869620 val_bpb:1.07959120 eval_time:93165ms
ttt:start chunks=1238 ttt_lr=0.005 ttt_epochs=3
quantized_ttt val_loss:2.78483468 val_bpb:1.07809629 eval_time:334646ms