# Depth Recurrence Sweep: Mapping the Layer Loop Design Space

**Non-Record Submission (Research Contribution)**
**Author:** [@krishs0404](https://github.com/krishs0404)
**Date:** April 15–17, 2026
**Hardware:** RunPod 1×H100 SXM 80GB, 6 runs × 10-min wallclock cap ≈ 60 GPU-minutes total
**Base:** Current SOTA stack (sp8192, int6 block weights, int8 embeddings, brotli, depth recurrence)
**Best result:** 1.4689 post-quant bpb (SOTA baseline, included as reference)

---

## Summary

Systematic ablation of the depth recurrence loop configuration in the current SOTA training stack. Five variants were tested against the baseline across three axes: where the loop sits (`LOOP_START`/`LOOP_END`), how many layers it spans, and when it activates (`ENABLE_LOOPING_AT`). Every variant was worse than the SOTA config. The SOTA authors found the right hyperparameters.

The key finding: the middle-layer sweet spot (layers 3–5) is genuine, not arbitrary. Moving the loop to earlier layers, later layers, expanding it, or activating it early all hurt. The minimal 2-layer variant (layers 5–6 only) is surprisingly competitive at +0.006 bpb, suggesting most of the recurrence benefit is concentrated in those two layers specifically.

---

## Motivation

Depth recurrence is one of the distinguishing features of current top submissions, but no public ablation documents which layers to reuse, how many to loop, or when to activate the loop. The current SOTA uses `LOOP_START=3`, `LOOP_END=5`, `ENABLE_LOOPING_AT=0.35` — but these look like they could equally well be 2–6 or 4–7. This sweep answers: does the specific layer range actually matter?

The short answer is yes, and the details are non-obvious: the middle layers (3–5) are uniquely important, early/late alternatives are significantly worse, and the activation timing matters almost as much as the layer selection.

---

## Experimental Setup

All runs used the SOTA training script with identical configuration except the looping parameters:

- **Hardware**: RunPod 1×H100 SXM 80GB (132 SMs, HBM3)
- **Tokenizer**: sp8192 (8192-vocab SentencePiece)
- **Model**: 11 layers, 512d, 8 heads / 4 KV heads, tied embeddings
- **Training budget**: `MAX_WALLCLOCK_SECONDS=600` (10 min, same as competition runs)
- **GPTQ reserve**: 12s, so effective training = 588s
- **Quantization**: GPTQ int6 block weights, GPTQ int8 embeddings, brotli compression
- **Baseline looping config**: `LOOP_START=3`, `LOOP_END=5`, `ENABLE_LOOPING_AT=0.35`, `NUM_LOOPS=2`
- **Evaluation**: Standard val_bpb + sliding window val_bpb, after dequantization

How the loop index lists are constructed from `LOOP_START`/`LOOP_END`: the looped segment `[loop_start, loop_end]` is repeated `NUM_LOOPS` additional times, so it appears `NUM_LOOPS + 1` times in the full layer sequence. The layers before and after the segment (`[0, loop_start)` and `(loop_end, num_layers)`) stay in place, and the resulting sequence is split roughly in half into an encoder (first half) and a decoder (second half) joined by U-Net skip connections. For the baseline, this yields `encoder:[0,1,2,3,4,5,3,4]` and `decoder:[5,3,4,5,6,7,8,9,10]`.
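A sketch of this construction that reproduces all six encoder/decoder lists in the results tables below (the helper name and the exact midpoint split are assumptions; the actual `train_gpt_sota.py` implementation may differ):

```python
# Hypothetical reconstruction of the loop index-list construction described
# above. It reproduces the encoder/decoder lists for all six configs; the
# real train_gpt_sota.py code may implement this differently.
def build_loop_indices(num_layers, loop_start, loop_end, num_loops):
    segment = list(range(loop_start, loop_end + 1))
    # The looped segment appears num_loops + 1 times: once in place plus
    # num_loops extra passes. Layers before/after the segment stay in place.
    full = (
        list(range(loop_start))
        + segment * (num_loops + 1)
        + list(range(loop_end + 1, num_layers))
    )
    half = len(full) // 2  # encoder takes the first (shorter) half
    return full[:half], full[half:]

# Baseline: LOOP_START=3, LOOP_END=5, NUM_LOOPS=2, 11 layers
enc, dec = build_loop_indices(11, 3, 5, 2)
print(enc)  # [0, 1, 2, 3, 4, 5, 3, 4]
print(dec)  # [5, 3, 4, 5, 6, 7, 8, 9, 10]
```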

---

## Results

| Experiment | `LOOP_START` | `LOOP_END` | `ENABLE_LOOPING_AT` | Steps | val_bpb (pre-quant) | val_bpb (post-quant) | Δ vs baseline |
|---|---|---|---|---|---|---|---|
| **Baseline (SOTA)** | 3 | 5 | 0.35 | **568** | **1.2885** | **1.4689** | — |
| A — minimal reuse | 5 | 6 | 0.35 | 565 | 1.2971 | 1.4750 | +0.006 |
| D — early layers | 1 | 4 | 0.35 | 538 | 1.2930 | 1.5072 | +0.038 |
| C — late layers | 7 | 10 | 0.35 | 541 | 1.3052 | 1.5181 | +0.049 |
| E — early activation | 3 | 5 | 0.15 | 522 | 1.2985 | 1.5190 | +0.050 |
| B — heavy reuse | 2 | 7 | 0.35 | 451 | 1.3189 | 1.6321 | +0.163 |

The encoder/decoder index lists for each experiment:

| Experiment | Encoder indices | Decoder indices |
|---|---|---|
| Baseline | [0,1,2,3,4,5,3,4] | [5,3,4,5,6,7,8,9,10] |
| A — minimal (LS=5, LE=6) | [0,1,2,3,4,5,6] | [5,6,5,6,7,8,9,10] |
| D — early (LS=1, LE=4) | [0,1,2,3,4,1,2,3,4] | [1,2,3,4,5,6,7,8,9,10] |
| C — late (LS=7, LE=10) | [0,1,2,3,4,5,6,7,8] | [9,10,7,8,9,10,7,8,9,10] |
| E — early act (LS=3, LE=5, ELA=0.15) | [0,1,2,3,4,5,3,4] | [5,3,4,5,6,7,8,9,10] |
| B — heavy (LS=2, LE=7) | [0,1,2,3,4,5,6,7,2,3,4] | [5,6,7,2,3,4,5,6,7,8,9,10] |

---

## Key Findings

### The SOTA config is genuinely optimal, not a lucky guess

Every variant tested was worse than the baseline. The ordering — minimal reuse (A) is closest, then early/late shifts (D, C), then early activation (E), then heavy reuse (B) catastrophically worse — forms a coherent picture of what makes depth recurrence work at the 10-minute training budget.

This is not a case where the SOTA config was picked arbitrarily and any similar config would do. Moving the loop range by just a few layers (early: D, late: C) costs roughly +0.04–0.05 bpb. That's a large penalty for a small change.

### Minimal reuse (2 layers, Exp A) is surprisingly competitive at +0.006

Experiment A loops only layers 5–6 instead of the baseline's 3–5. Despite reusing one fewer layer, performance drops by only 0.006 post-quant bpb. This is the closest any variant came to the baseline, and the gap is small enough to be within run-to-run variance at this training budget.

The implication is that the recurrence benefit is concentrated specifically in layers 5–6; layers 3 and 4 contribute only marginally. Why layers 5–6? This is speculative, but these are mid-depth layers where the model has built reasonable representations from the embedding and early layers but has not yet committed to the final abstract features. Reusing them lets the model refine intermediate representations without perturbing early feature extraction or the output head.

On an 8×H100 run where far more training steps are possible, this minimal configuration might be preferable: fewer looped layers means more training steps per wall-clock minute, and the accuracy gap may close with more iterations.

### Heavy reuse is catastrophically worse (+0.163) due to throughput loss

Experiment B expands the loop to layers 2–7 — six layers instead of three. The result is a disaster: post-quant bpb of 1.6321, the worst result by a wide margin, and only 451 training steps completed versus 568 for the baseline.

The step count difference is the key. Once looping activates, experiment B runs 23 layer applications per forward pass versus 17 for the baseline (12 extra applications from the wider loop versus 6), roughly 35% more compute per step. Over a 10-minute budget this costs ~117 training steps. At this early stage of training (sub-600 steps vs. a 20,000-step schedule), each step matters enormously — the model is still rapidly descending from the initial loss. Losing 117 steps to compute overhead is a severe penalty that the additional depth cannot compensate for.
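The layer-application counts can be read directly off the index tables above; here is that tally together with the completed step counts from the results table (a bookkeeping snippet for illustration, not output from the training script):

```python
# Layer applications per forward pass once looping is active, taken from the
# encoder/decoder index tables above; step counts are from the results table.
configs = {
    "baseline (3-5)":  ([0,1,2,3,4,5,3,4],        [5,3,4,5,6,7,8,9,10],       568),
    "A minimal (5-6)": ([0,1,2,3,4,5,6],          [5,6,5,6,7,8,9,10],         565),
    "D early (1-4)":   ([0,1,2,3,4,1,2,3,4],      [1,2,3,4,5,6,7,8,9,10],     538),
    "C late (7-10)":   ([0,1,2,3,4,5,6,7,8],      [9,10,7,8,9,10,7,8,9,10],   541),
    "B heavy (2-7)":   ([0,1,2,3,4,5,6,7,2,3,4],  [5,6,7,2,3,4,5,6,7,8,9,10], 451),
}
base_apps = sum(map(len, configs["baseline (3-5)"][:2]))  # 17
for name, (enc, dec, steps) in configs.items():
    apps = len(enc) + len(dec)
    print(f"{name:16s} {apps:2d} apps/pass  {apps / base_apps:.2f}x baseline  {steps} steps")
```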

Heavier looping is only justified if the accuracy-per-step improvement exceeds the step-count penalty. At the 1×H100 / 10-minute scale tested here, it clearly does not. This might change at longer training budgets where the model has already extracted most of the easy gradient signal and additional depth becomes more valuable.

### Early/late layer shifts are symmetrically bad (+0.038 to +0.049)

Experiments C (late: 7–10) and D (early: 1–4) both hurt significantly. The losses are roughly symmetric around the baseline, with late layers slightly worse than early layers. Neither extreme is good.

The early layers (D) fail because layers 1–4 are closest to the raw embedding. Reusing them means the model runs embedding-proximal computation multiple times, but the abstract representations needed for useful recurrence haven't formed yet. The late layers (C) fail for the opposite reason: layers 7–10 are already computing high-level features close to the output. Reusing them duplicates computation that should be done at most once before the final projection.

The middle layers (3–5 in the SOTA) sit at the point where the model has built enough abstraction to benefit from recurrence without those abstractions being so finalized that recomputation is wasteful.

### Early loop activation hurts: the model needs stable representations first

Experiment E uses the same LOOP_START=3, LOOP_END=5 as baseline but activates the loop at 15% of training progress instead of 35%. This yields 522 steps and 1.5190 post-quant bpb — a +0.050 penalty and 46 fewer training steps.

Two effects combine here. First, activating earlier introduces loop overhead earlier, costing steps. Second, and likely more important, the loop activates before the model has learned stable intermediate representations. At 15% progress (roughly step 90), the model's layer-5 outputs are still changing rapidly. Reusing them via the U-Net encoder/decoder causes the looped representations to be built on shifting foundations, degrading the benefit.

The SOTA's 35% threshold appears to be calibrated for when representations have stabilized sufficiently for reuse to be helpful. Earlier than this, recurrence introduces noise rather than refinement.

---

## Implications for Future Work

### Asymmetric recurrence: loop only the highest-impact layers

The +0.006 gap for minimal reuse (layers 5–6 only) compared to the full baseline (layers 3–5) suggests a promising direction: identify the single highest-impact layer within the loop, loop only that one, and spend the recovered wall-clock time on additional training steps.
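One way to run that search with the existing env-var interface (a sketch only: the run IDs are made up, and whether the script accepts `LOOP_START == LOOP_END` for a single-layer loop is an assumption worth checking first):

```python
# Hypothetical single-layer sweep using the same env-var interface as the
# commands in "Reproducing" below.
import os
import subprocess

CMD = ["torchrun", "--standalone", "--nproc_per_node=1", "train_gpt_sota.py"]

for layer in range(3, 7):  # candidate layers drawn from the 3-6 region
    env = dict(
        os.environ,
        MAX_WALLCLOCK_SECONDS="600",
        LOOP_START=str(layer),
        LOOP_END=str(layer),   # single-layer loop; assumes the script allows it
        ENABLE_LOOPING_AT="0.35",
        RUN_ID=f"exp_single_layer_{layer}",
    )
    subprocess.run(CMD, env=env, check=True)
```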

On 8×H100 where ~4,500–5,500 steps complete in 10 minutes, the tradeoff changes dramatically. The step-count savings from a smaller loop matter less as a fraction of total training, while the accuracy-per-depth benefit of using middle-layer recurrence could accumulate over more steps. This sweep was run at 1×H100 scale; the Pareto frontier between loop width and step count will be different at full competition scale.

### Adaptive loop activation scheduling

The `ENABLE_LOOPING_AT` parameter is currently a fixed fraction of total training. A more principled approach would monitor validation loss, gradient norms, or representation similarity (CKA between layers) and activate the loop when representations have stabilized. This would be especially valuable in runs with different training budgets or batch sizes, where the fixed 35% threshold may not correspond to the same stage of model maturation.
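A minimal sketch of what such a trigger could look like, assuming activations of the last looped layer can be collected on a fixed probe batch at each check (the linear-CKA formula is standard; the class, threshold, and hook points are hypothetical, not part of the SOTA script):

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_tokens, dim)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = torch.linalg.norm(y.T @ x) ** 2          # ||Y^T X||_F^2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()

class LoopActivationTrigger:
    """Enable the layer loop once representations on a fixed probe batch
    stop changing, instead of at a fixed ENABLE_LOOPING_AT fraction."""

    def __init__(self, threshold: float = 0.98):
        self.threshold = threshold
        self.prev = None
        self.enabled = False

    def update(self, probe_acts: torch.Tensor) -> bool:
        # probe_acts: (n_tokens, dim) activations of the last looped layer,
        # collected on the same probe batch at every check.
        acts = probe_acts.detach().float().cpu()
        if not self.enabled and self.prev is not None:
            self.enabled = linear_cka(self.prev, acts) > self.threshold
        self.prev = acts
        return self.enabled
```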

### 8×H100 validation

All findings here are from 1×H100 runs at sub-600-step training. The rankings may hold at 8×H100 scale, but given that heavy reuse's main failure mode is step-count loss (which is a smaller relative penalty with more steps), the hierarchy is not guaranteed to be stable. In particular, Exp B (heavy reuse) might be less catastrophic at full scale if the 117-step loss is a smaller fraction of 5,000+ total steps. Exp A (minimal reuse) might close the gap further with more steps to leverage the saved per-step compute.

---

## Hardware

- **Pod**: RunPod 1×H100 SXM 80GB (HBM3)
- **CUDA**: 12.8
- **Total GPU time**: ~60 minutes across 6 experiments (baseline + 5 ablations)
- **Total cost**: ~$3 at RunPod spot rates

---

## Logs

Raw training logs for all experiments are included in this directory:

| File | Description |
|---|---|
| `baseline.txt` | SOTA baseline (LOOP_START=3, LOOP_END=5, ELA=0.35) |
| `exp_a_minimal.txt` | Exp A: minimal reuse (LOOP_START=5, LOOP_END=6) |
| `exp_b_heavy.txt` | Exp B: heavy reuse (LOOP_START=2, LOOP_END=7) |
| `exp_c_late.txt` | Exp C: late layers (LOOP_START=7, LOOP_END=10) |
| `exp_d_early.txt` | Exp D: early layers (LOOP_START=1, LOOP_END=4) |
| `exp_e_early_act.txt` | Exp E: early activation (ELA=0.15, same loop as baseline) |
| `gptq_ablation.log` | Bonus: simple int8 vs GPTQ int8 for embeddings (+0.003 bpb for GPTQ) |
| `ref_1gpu.txt` | Reference run log from April 14 (575 steps, 1.4684 post-quant bpb) |

---

## Reproducing

All experiments use `train_gpt_sota.py` (the competition SOTA script) with `MAX_WALLCLOCK_SECONDS=600`:

```bash
# Baseline
MAX_WALLCLOCK_SECONDS=600 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.35 \
RUN_ID=baseline torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp A — minimal reuse
MAX_WALLCLOCK_SECONDS=600 LOOP_START=5 LOOP_END=6 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_a_minimal torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp B — heavy reuse (warning: ~117 fewer training steps at 1xH100)
MAX_WALLCLOCK_SECONDS=600 LOOP_START=2 LOOP_END=7 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_b_heavy torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp C — late layers
MAX_WALLCLOCK_SECONDS=600 LOOP_START=7 LOOP_END=10 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_c_late torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp D — early layers
MAX_WALLCLOCK_SECONDS=600 LOOP_START=1 LOOP_END=4 ENABLE_LOOPING_AT=0.35 \
RUN_ID=exp_d_early torchrun --standalone --nproc_per_node=1 train_gpt_sota.py

# Exp E — early activation
MAX_WALLCLOCK_SECONDS=600 LOOP_START=3 LOOP_END=5 ENABLE_LOOPING_AT=0.15 \
RUN_ID=exp_e_early_act torchrun --standalone --nproc_per_node=1 train_gpt_sota.py
```

The `run_sweep.sh` script (included in the sweep_results directory) runs all six sequentially on a single GPU.

---

*Baseline: 568 steps, 1.2885 pre-quant bpb, 1.4689 post-quant bpb, 16,005,909 bytes | Best ablation: Exp A at 1.4750 (+0.006) | Worst: Exp B at 1.6321 (+0.163)*
---

**Appendix: `baseline.txt` (raw training log)**
====================================================================================================
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: False
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 8
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/baseline.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: baseline
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: False
ttt_epochs: 3
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 1
xsa_last_n: 11
====================================================================================================
Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
Running PyTorch 2.9.1+cu128
Fri Apr 17 18:55:11 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 27C P0 99W / 700W | 527MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 53426 C python3 518MiB |
+-----------------------------------------------------------------------------------------+

====================================================================================================
train_shards: 56
val_tokens: 40540160
model_params:35944536
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0047 val_bpb: 3.6831
1/20000 train_loss: 9.0047 train_time: 0.0m tok/s: 986727
2/20000 train_loss: 12.2965 train_time: 0.0m tok/s: 976995
3/20000 train_loss: 11.0398 train_time: 0.0m tok/s: 970965
4/20000 train_loss: 9.4833 train_time: 0.1m tok/s: 968533
5/20000 train_loss: 8.3441 train_time: 0.1m tok/s: 967300
layer_loop:enabled step:253 frac:0.351 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
500/20000 train_loss: 3.1760 train_time: 8.4m tok/s: 778651
568/20000 val_loss: 3.1501 val_bpb: 1.2885
stopping_early: wallclock_cap train_time: 588325ms step: 568/20000
peak memory allocated: 39152 MiB reserved: 39190 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.57865007 val_bpb:1.46373941 eval_time:23729ms
Serialized model: 135431033 bytes
Code size: 16594 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 14.6s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15989315 bytes
Total submission size quantized+brotli: 16005909 bytes
quantized val_loss:3.59131523 val_bpb:1.46891972 eval_time:38254ms
quantized_sliding_window val_loss:3.55189052 val_bpb:1.45279422 eval_time:781002ms