# Depth Recurrence Sweep: Mapping the Layer Loop Design Space

**Non-Record Submission (Research Contribution)**
**Author:** [@krishs0404](https://github.com/krishs0404)
**Date:** April 15–17, 2026
**Hardware:** RunPod 1×H100 SXM 80GB, 6 runs × 10-min wallclock cap ≈ 60 GPU-minutes total
**Base:** Current SOTA stack (sp8192, int6 block weights, int8 embeddings, brotli, depth recurrence)
**Best result:** 1.4689 post-quant bpb (SOTA baseline, included as reference)

---

## Summary

Systematic ablation of the depth recurrence loop configuration in the current SOTA training stack. Five variants were tested against the baseline across three axes: where the loop sits (`LOOP_START`/`LOOP_END`), how many layers it spans, and when it activates (`ENABLE_LOOPING_AT`). Every variant was worse than the SOTA config. The SOTA authors found the right hyperparameters.

The key finding: the middle-layer sweet spot (layers 3–5) is genuine, not arbitrary. Moving the loop to earlier layers, later layers, expanding it, or activating it early all hurt. The minimal 2-layer variant (layers 5–6 only) is surprisingly competitive at +0.006 bpb, suggesting most of the recurrence benefit is concentrated in those two layers specifically.

---

## Motivation

Depth recurrence is one of the distinguishing features of current top submissions, but no public ablation documents which layers to reuse, how many to loop, or when to activate the loop. The current SOTA uses `LOOP_START=3`, `LOOP_END=5`, `ENABLE_LOOPING_AT=0.35` — but these look like they could equally well be 2–6 or 4–7. This sweep answers: does the specific layer range actually matter?

The short answer is yes, and the details are non-obvious: the middle layers (3–5) are uniquely important, early/late alternatives are significantly worse, and the activation timing matters almost as much as the layer selection.

---

## Experimental Setup

All runs used the SOTA training script with identical configuration except the looping parameters:

- **Hardware**: RunPod 1×H100 SXM 80GB (132 SMs, HBM3)
- **Tokenizer**: sp8192 (8192-vocab SentencePiece)
- **Model**: 11 layers, 512d, 8 heads / 4 KV heads, tied embeddings
- **Training budget**: `MAX_WALLCLOCK_SECONDS=600` (10 min, same as competition runs)
- **GPTQ reserve**: 12s, so effective training = 588s
- **Quantization**: GPTQ int6 block weights, GPTQ int8 embeddings, brotli compression
- **Baseline looping config**: `LOOP_START=3`, `LOOP_END=5`, `ENABLE_LOOPING_AT=0.35`, `NUM_LOOPS=2`
- **Evaluation**: Standard val_bpb + sliding window val_bpb, after dequantization

How the loop index lists are constructed from `LOOP_START`/`LOOP_END`: the looped segment `[loop_start, loop_end]` is repeated `NUM_LOOPS` additional times, so it appears `NUM_LOOPS + 1` times in the full layer sequence. The layers before and after the segment (`[0, loop_start)` and `(loop_end, num_layers)`) stay in place, and the resulting sequence is split roughly in half into an encoder (first half) and a decoder (second half) joined by U-Net skip connections. For the baseline, this yields `encoder:[0,1,2,3,4,5,3,4]` and `decoder:[5,3,4,5,6,7,8,9,10]`.
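A sketch of this construction that reproduces all six encoder/decoder lists in the results tables below (the helper name and the exact midpoint split are assumptions; the actual `train_gpt_sota.py` implementation may differ):

```python
# Hypothetical reconstruction of the loop index-list construction described
# above. It reproduces the encoder/decoder lists for all six configs; the
# real train_gpt_sota.py code may implement this differently.
def build_loop_indices(num_layers, loop_start, loop_end, num_loops):
    segment = list(range(loop_start, loop_end + 1))
    # The looped segment appears num_loops + 1 times: once in place plus
    # num_loops extra passes. Layers before/after the segment stay in place.
    full = (
        list(range(loop_start))
        + segment * (num_loops + 1)
        + list(range(loop_end + 1, num_layers))
    )
    half = len(full) // 2  # encoder takes the first (shorter) half
    return full[:half], full[half:]

# Baseline: LOOP_START=3, LOOP_END=5, NUM_LOOPS=2, 11 layers
enc, dec = build_loop_indices(11, 3, 5, 2)
print(enc)  # [0, 1, 2, 3, 4, 5, 3, 4]
print(dec)  # [5, 3, 4, 5, 6, 7, 8, 9, 10]
```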

---

## Results

| Experiment | `LOOP_START` | `LOOP_END` | `ENABLE_LOOPING_AT` | Steps | val_bpb (pre-quant) | val_bpb (post-quant) | Δ vs baseline |
|---|---|---|---|---|---|---|---|
| **Baseline (SOTA)** | 3 | 5 | 0.35 | **568** | **1.2885** | **1.4689** | — |
| A — minimal reuse | 5 | 6 | 0.35 | 565 | 1.2971 | 1.4750 | +0.006 |
| D — early layers | 1 | 4 | 0.35 | 538 | 1.2930 | 1.5072 | +0.038 |
| C — late layers | 7 | 10 | 0.35 | 541 | 1.3052 | 1.5181 | +0.049 |
| E — early activation | 3 | 5 | 0.15 | 522 | 1.2985 | 1.5190 | +0.050 |
| B — heavy reuse | 2 | 7 | 0.35 | 451 | 1.3189 | 1.6321 | +0.163 |

The encoder/decoder index lists for each experiment:

| Experiment | Encoder indices | Decoder indices |
|---|---|---|
| Baseline | [0,1,2,3,4,5,3,4] | [5,3,4,5,6,7,8,9,10] |
| A — minimal (LS=5, LE=6) | [0,1,2,3,4,5,6] | [5,6,5,6,7,8,9,10] |
| D — early (LS=1, LE=4) | [0,1,2,3,4,1,2,3,4] | [1,2,3,4,5,6,7,8,9,10] |
| C — late (LS=7, LE=10) | [0,1,2,3,4,5,6,7,8] | [9,10,7,8,9,10,7,8,9,10] |
| E — early act (LS=3, LE=5, ELA=0.15) | [0,1,2,3,4,5,3,4] | [5,3,4,5,6,7,8,9,10] |
| B — heavy (LS=2, LE=7) | [0,1,2,3,4,5,6,7,2,3,4] | [5,6,7,2,3,4,5,6,7,8,9,10] |

---

## Key Findings

### The SOTA config is genuinely optimal, not a lucky guess

Every variant tested was worse than the baseline. The ordering — minimal reuse (A) is closest, then early/late shifts (D, C), then early activation (E), then heavy reuse (B) catastrophically worse — forms a coherent picture of what makes depth recurrence work at the 10-minute training budget.

This is not a case where the SOTA config was picked arbitrarily and any similar config would do. Moving the loop range by just a few layers (early: D, late: C) costs roughly +0.04–0.05 bpb. That's a large penalty for a small change.

### Minimal reuse (2 layers, Exp A) is surprisingly competitive at +0.006

Experiment A loops only layers 5–6 instead of the baseline's 3–5. Despite reusing one fewer layer, performance drops by only 0.006 post-quant bpb. This is the closest any variant came to the baseline, and the gap is small enough to be within run-to-run variance at this training budget.

The implication is that the recurrence benefit is concentrated specifically in layers 5–6; layers 3 and 4 contribute only marginally. Why layers 5–6? This is speculative, but these are mid-depth layers where the model has built reasonable representations from the embedding and early layers but has not yet committed to the final abstract features. Reusing them lets the model refine intermediate representations without perturbing early feature extraction or the output head.

On an 8×H100 run where far more training steps are possible, this minimal configuration might be preferable: fewer looped layers means more training steps per wall-clock minute, and the accuracy gap may close with more iterations.

### Heavy reuse is catastrophically worse (+0.163) due to throughput loss

Experiment B expands the loop to layers 2–7 — six layers instead of three. The result is a disaster: post-quant bpb of 1.6321, the worst result by a wide margin, and only 451 training steps completed versus 568 for the baseline.

The step count difference is the key. Once looping activates, experiment B runs 23 layer applications per forward pass versus 17 for the baseline (12 extra applications from the wider loop versus 6), roughly 35% more compute per step. Over a 10-minute budget this costs ~117 training steps. At this early stage of training (sub-600 steps vs. a 20,000-step schedule), each step matters enormously — the model is still rapidly descending from the initial loss. Losing 117 steps to compute overhead is a severe penalty that the additional depth cannot compensate for.
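The layer-application counts can be read directly off the index tables above; here is that tally together with the completed step counts from the results table (a bookkeeping snippet for illustration, not output from the training script):

```python
# Layer applications per forward pass once looping is active, taken from the
# encoder/decoder index tables above; step counts are from the results table.
configs = {
    "baseline (3-5)":  ([0,1,2,3,4,5,3,4],        [5,3,4,5,6,7,8,9,10],       568),
    "A minimal (5-6)": ([0,1,2,3,4,5,6],          [5,6,5,6,7,8,9,10],         565),
    "D early (1-4)":   ([0,1,2,3,4,1,2,3,4],      [1,2,3,4,5,6,7,8,9,10],     538),
    "C late (7-10)":   ([0,1,2,3,4,5,6,7,8],      [9,10,7,8,9,10,7,8,9,10],   541),
    "B heavy (2-7)":   ([0,1,2,3,4,5,6,7,2,3,4],  [5,6,7,2,3,4,5,6,7,8,9,10], 451),
}
base_apps = sum(map(len, configs["baseline (3-5)"][:2]))  # 17
for name, (enc, dec, steps) in configs.items():
    apps = len(enc) + len(dec)
    print(f"{name:16s} {apps:2d} apps/pass  {apps / base_apps:.2f}x baseline  {steps} steps")
```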

Heavier looping is only justified if the accuracy-per-step improvement exceeds the step-count penalty. At the 1×H100 / 10-minute scale tested here, it clearly does not. This might change at longer training budgets where the model has already extracted most of the easy gradient signal and additional depth becomes more valuable.

### Early/late layer shifts are symmetrically bad (+0.038 to +0.049)

Experiments C (late: 7–10) and D (early: 1–4) both hurt significantly. The losses are roughly symmetric around the baseline, with late layers slightly worse than early layers. Neither extreme is good.

The early layers (D) fail because layers 1–4 are closest to the raw embedding. Reusing them means the model runs embedding-proximal computation multiple times, but the abstract representations needed for useful recurrence haven't formed yet. The late layers (C) fail for the opposite reason: layers 7–10 are already computing high-level features close to the output. Reusing them duplicates computation that should be done at most once before the final projection.

The middle layers (3–5 in the SOTA) sit at the point where the model has built enough abstraction to benefit from recurrence without those abstractions being so finalized that recomputation is wasteful.

### Early loop activation hurts: the model needs stable representations first

Experiment E uses the same LOOP_START=3, LOOP_END=5 as baseline but activates the loop at 15% of training progress instead of 35%. This yields 522 steps and 1.5190 post-quant bpb — a +0.050 penalty and 46 fewer training steps.

Two effects combine here. First, activating earlier introduces loop overhead earlier, costing steps. Second, and likely more important, the loop activates before the model has learned stable intermediate representations. At 15% progress (roughly step 90), the model's layer-5 outputs are still changing rapidly. Reusing them via the U-Net encoder/decoder causes the looped representations to be built on shifting foundations, degrading the benefit.

The SOTA's 35% threshold appears to be calibrated for when representations have stabilized sufficiently for reuse to be helpful. Earlier than this, recurrence introduces noise rather than refinement.

---

## Implications for Future Work

### Asymmetric recurrence: loop only the highest-impact layers

The +0.006 gap for minimal reuse (layers 5–6 only) compared to the full baseline (layers 3–5) suggests a promising direction: identify the single highest-impact layer within the loop, loop only that one, and spend the recovered wall-clock time on additional training steps.
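One way to run that search with the existing env-var interface (a sketch only: the run IDs are made up, and whether the script accepts `LOOP_START == LOOP_END` for a single-layer loop is an assumption worth checking first):

```python
# Hypothetical single-layer sweep using the same env-var interface as the
# commands in "Reproducing" below.
import os
import subprocess

CMD = ["torchrun", "--standalone", "--nproc_per_node=1", "train_gpt_sota.py"]

for layer in range(3, 7):  # candidate layers drawn from the 3-6 region
    env = dict(
        os.environ,
        MAX_WALLCLOCK_SECONDS="600",
        LOOP_START=str(layer),
        LOOP_END=str(layer),   # single-layer loop; assumes the script allows it
        ENABLE_LOOPING_AT="0.35",
        RUN_ID=f"exp_single_layer_{layer}",
    )
    subprocess.run(CMD, env=env, check=True)
```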

On 8×H100 where ~4,500–5,500 steps complete in 10 minutes, the tradeoff changes dramatically. The step-count savings from a smaller loop matter less as a fraction of total training, while the accuracy-per-depth benefit of using middle-layer recurrence could accumulate over more steps. This sweep was run at 1×H100 scale; the Pareto frontier between loop width and step count will be different at full competition scale.

### Adaptive loop activation scheduling

The `ENABLE_LOOPING_AT` parameter is currently a fixed fraction of total training. A more principled approach would monitor validation loss, gradient norms, or representation similarity (CKA between layers) and activate the loop when representations have stabilized. This would be especially valuable in runs with different training budgets or batch sizes, where the fixed 35% threshold may not correspond to the same stage of model maturation.
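A minimal sketch of what such a trigger could look like, assuming activations of the last looped layer can be collected on a fixed probe batch at each check (the linear-CKA formula is standard; the class, threshold, and hook points are hypothetical, not part of the SOTA script):

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_tokens, dim)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = torch.linalg.norm(y.T @ x) ** 2          # ||Y^T X||_F^2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()

class LoopActivationTrigger:
    """Enable the layer loop once representations on a fixed probe batch
    stop changing, instead of at a fixed ENABLE_LOOPING_AT fraction."""

    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.prev = None
        self.enabled = False

    def update(self, probe_acts: torch.Tensor) -> bool:
        # probe_acts: (n_tokens, dim) activations of the last looped layer,
        # collected on the same probe batch at every check.
        acts = probe_acts.detach().float().cpu()
        if not self.enabled and self.prev is not None:
            self.enabled = linear_cka(self.prev, acts) > self.threshold
        self.prev = acts
        return self.enabled
```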

### 8×H100 validation

All findings here are from 1×H100 runs at sub-600-step training. The rankings may hold at 8×H100 scale, but given that heavy reuse's main failure mode is step-count loss (which is a smaller relative penalty with more steps), the hierarchy is not guaranteed to be stable. In particular, Exp B (heavy reuse) might be less catastrophic at full scale if the 117-step loss is a smaller fraction of 5,000+ total steps. Exp A (minimal reuse) might close the gap further with more steps to leverage the saved per-step compute.

---

## Hardware

- **Pod**: RunPod 1×H100 SXM 80GB (HBM3)
- **CUDA**: 12.8
- **Total GPU time**: ~60 minutes across 6 experiments (baseline + 5 ablations)
- **Total cost**: ~$3 at RunPod spot rates

---

## Logs

Raw training logs for all experiments are included in this directory:

| File | Description |
|---|---|
| `baseline.txt` | SOTA baseline (LOOP_START=3, LOOP_END=5, ELA=0.35) |
| `exp_a_minimal.txt` | Exp A: minimal reuse (LOOP_START=5, LOOP_END=6) |
| `exp_b_heavy.txt` | Exp B: heavy reuse (LOOP_START=2, LOOP_END=7) |
| `exp_c_late.txt` | Exp C: late layers (LOOP_START=7, LOOP_END=10) |
| `exp_d_early.txt` | Exp D: early layers (LOOP_START=1, LOOP_END=4) |
| `exp_e_early_act.txt` | Exp E: early activation (ELA=0.15, same loop as baseline) |
| `gptq_ablation.log` | Bonus: simple int8 vs GPTQ int8 for embeddings (+0.003 bpb for GPTQ) |
| `ref_1gpu.txt` | Reference run log from April 14 (575 steps, 1.4684 post-quant bpb) |

---

## Reproducing

All experiments use `train_gpt_sota.py` (the competition SOTA script) with `MAX_WALLCLOCK_SECONDS=600`:

```bash
# Baseline
MAX_WALLCLOCK_SECONDS=600 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 \
RUN_ID=baseline torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp A — minimal reuse
MAX_WALLCLOCK_SECONDS=600 LOOP_START=5 LOOP_END=6 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_a_minimal torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp B — heavy reuse (warning: ~117 fewer training steps at 1xH100)
MAX_WALLCLOCK_SECONDS=600 LOOP_START=2 LOOP_END=7 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_b_heavy torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp C — late layers
MAX_WALLCLOCK_SECONDS=600 LOOP_START=7 LOOP_END=10 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_c_late torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp D — early layers
MAX_WALLCLOCK_SECONDS=600 LOOP_START=1 LOOP_END=4 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_d_early torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp E — early activation
MAX_WALLCLOCK_SECONDS=600 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.15 \
RUN_ID=exp_e_early_act torchrun --standalone --nproc_per_node=1 train_gpt_sota.py
```

The `run_sweep.sh` script (included in the sweep_results directory) runs all six sequentially on a single GPU.

---

*Baseline: 568 steps, 1.2885 pre-quant bpb, 1.4689 post-quant bpb, 16,005,909 bytes | Best ablation: Exp A at 1.4750 (+0.006) | Worst: Exp B at 1.6321 (+0.163)*
---

**Appendix: `baseline.txt` (raw training log)**
====================================================================================================
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: False
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 8
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/baseline.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: baseline
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: False
ttt_epochs: 3
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 1
xsa_last_n: 11
====================================================================================================
Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
Running PyTorch 2.9.1+cu128
Fri Apr 17 18:55:11 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 27C P0 99W / 700W | 527MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 53426 C python3 518MiB |
+-----------------------------------------------------------------------------------------+

====================================================================================================
train_shards: 56
val_tokens: 40540160
model_params:35944536
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0047 val_bpb: 3.6831
1/20000 train_loss: 9.0047 train_time: 0.0m tok/s: 986727
2/20000 train_loss: 12.2965 train_time: 0.0m tok/s: 976995
3/20000 train_loss: 11.0398 train_time: 0.0m tok/s: 970965
4/20000 train_loss: 9.4833 train_time: 0.1m tok/s: 968533
5/20000 train_loss: 8.3441 train_time: 0.1m tok/s: 967300
layer_loop:enabled step:253 frac:0.351 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
500/20000 train_loss: 3.1760 train_time: 8.4m tok/s: 778651
568/20000 val_loss: 3.1501 val_bpb: 1.2885
stopping_early: wallclock_cap train_time: 588325ms step: 568/20000
peak memory allocated: 39152 MiB reserved: 39190 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.57865007 val_bpb:1.46373941 eval_time:23729ms
Serialized model: 135431033 bytes
Code size: 16594 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 14.6s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15989315 bytes
Total submission size quantized+brotli: 16005909 bytes
quantized val_loss:3.59131523 val_bpb:1.46891972 eval_time:38254ms
quantized_sliding_window val_loss:3.55189052 val_bpb:1.45279422 eval_time:781002ms