Ivan Verbovoy (@iverbovoy) · 20.03.2026 → 20.04.2026
Single-person submission exploring depth recurrence (3 shared transformer blocks × 4 repeats = 12 effective layers) as an alternative to the flat 10-11 layer architectures used by the leaderboard. Best result: val_bpb 1.1324 (3-seed mean) on the 10-min track (PR #1453). An additional 4-hour non-record run reached 1.0889 (PR #895). OpenAI acknowledged the approach as novel and published a dedicated non-record PR #363 inspired by similar exploration.
tok_emb (+ optional BigramHash) + value_embeds × 2
    │
for repeat in {0..3}:
    for block in {A, B, C}:                                  # 3 shared blocks
        x += loop_embed[layer_idx]                           # per effective layer
        x += Σ value_scales[l,e] * ve_e                      # per effective layer
        x += cross_repeat_scale * block_out_prev_repeat      # stateful recurrence
        x = block(x, x0, use_xsa=(layer_idx ≥ xsa_start))
    │
final_norm + tied LM head + softcap
Key weight-sharing components:
- `loop_embed` `(effective_depth, model_dim)`: positional signal per effective layer
- `cross_repeat_scales` `(num_blocks, num_repeats-1, dim)`: stateful residual from the previous repeat
- `resid_mix`: learned per-dim mix between current and block-0 residual
- XSA: last 4 effective layers subtract self-value projection
- Hedge Mixer: eval-time online mixture of Neural + Unigram + Bigram + Trigram (hash 65K) + Entropy experts
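The mapping from the 3 shared blocks to 12 effective layers can be sketched as below. This is a minimal illustration, not the submission's code: the names `EffectiveLayer` and `layer_schedule` are hypothetical, and the indexing `layer_idx = repeat * num_blocks + block` is an assumption. Enabling the cross-repeat term only from the second repeat onward is also an assumption, chosen because it matches the `num_repeats-1` dimension of `cross_repeat_scales`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectiveLayer:
    layer_idx: int          # index into loop_embed (one row per effective layer)
    repeat: int             # which pass over the shared blocks
    block: int              # which shared block's weights are reused
    use_xsa: bool           # True for the last xsa_last_n effective layers
    has_cross_repeat: bool  # True when a previous-repeat output exists to add


def layer_schedule(num_blocks=3, num_repeats=4, xsa_last_n=4):
    """Enumerate effective layers in execution order (hypothetical helper)."""
    effective_depth = num_blocks * num_repeats
    xsa_start = effective_depth - xsa_last_n
    schedule = []
    for repeat in range(num_repeats):
        for block in range(num_blocks):
            # Assumed indexing: repeats are the outer loop, blocks the inner loop.
            layer_idx = repeat * num_blocks + block
            schedule.append(
                EffectiveLayer(
                    layer_idx=layer_idx,
                    repeat=repeat,
                    block=block,
                    use_xsa=layer_idx >= xsa_start,
                    has_cross_repeat=repeat > 0,
                )
            )
    return schedule


schedule = layer_schedule()
for layer in schedule:
    print(layer)
```

Under these assumptions each shared block is visited four times, and only per-layer quantities such as `loop_embed` distinguish one visit from the next.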
| Date | PR | Track | Key idea | val_bpb |
|---|---|---|---|---|
| 20.03 | #148 | 10min | Depth Recurrence + Cross-Repeat Skip | 1.2196 |
| 25.03 | #784 | 10min | + XSA(4) + LeakyReLU²(0.5) | 1.2065 |
| 26.03 | #835 | 10min | + Progressive Depth (2→3→4 repeats) | 1.1980 |
| 26.03 | #856 | 10min | + Hedge Mixer | 1.1454 |
| 26.03 | #895 | 4h | 4-hour Progressive Depth | 1.0889 |
| 05.04 | #1384 | 10min | + tuned schedule + WD + SWA (3-seed) | 1.1441 |
| 07.04 | #1453 | 10min | + Int7 attn + Int5 MLP mixed quant (3-seed) | 1.1324 |
| Technique | Effect | Notes |
|---|---|---|
| Depth Recurrence 3×4 | — | Core architecture, enables 23.7M params in 16MB |
| Cross-Repeat Skip | −0.03 | Prev-repeat residual makes recurrence stateful |
| Value embeds (2 tables) | −0.07 | Critical. Adds per-layer token lookup |
| XSA last 4 | −0.01 | Self-value bias removal at top layers |
| Progressive Depth (0.30:2, 0.50:3, 1.0:4) | −0.005 | Ramp repeats during training |
| SWA (start 0.6, every 30) | −0.01 | ~44 checkpoints averaged |
| Hedge Mixer (5 experts) | −0.05 | Eval-time mixture, but stochastic (std 0.013) |
| Int7 attn + Int5 MLP mixed quant | −0.012 | Frees 2MB for d=880 mlp×3 vs d=832 mlp×2 |
| Muon optimizer + WD=0.04 | — | Standard for challenge |
The following were tested on 1–3 seeds; in each case we verified that neither the sliding nor the hedge-mean metric improved:
| Technique | Result | Why |
|---|---|---|
| BigramHash 2048×112 | −0.005 ❌ | Too few buckets, hash collisions dominate |
| BigramHash 3072×112 | +0.005 ❌ | Single-seed −0.003 but 3-seed mean worse: stabilizes hedge but cuts peaks (seed 7 went 1.1193→1.1444) |
| BigramHash 4096×112 | +0.004 ❌ | Past sweet spot, sparse buckets degrade |
| Noisy QAT (default) | +0.011 ❌ | Noise on int5 MLP too large (~amax/15), SWA collects pre-QAT checkpoints |
| LoRA rank-2 per-repeat (attn.proj, mlp.proj) | +0.013 ❌ | Per-repeat signal already saturated by loop_embed + cross_repeat_scales |
| XSA-all (12 layers) | worse | Optimum is last 4, early XSA hurts |
| Inter-repeat RMSNorm | worse | Breaks scaling balance |
| EMA (τ=0.997) | +22ms/step | CPU overhead > benefit at our scale |
| Partial RoPE + VRL + LN Scale (combined) | worse | Too many interacting changes |
| MuonEq-R optimizer | diverged | Incompatible with our Muon setup |
| Auxiliary losses (edge-of-chaos regularization) | neutral | χ stabilized but bpb unchanged at 5 repeats |
| 3×6 d=960 | worse | Fewer steps dominates |
| 6×2 d=640/736 | worse | Too narrow |
| 4L × 3rep | worse | Fewer unique blocks in limited compute |
| TTT (LoRA-based) | −0.002 | Positive but 410s eval; dropped for budget |
| SD-clip k=3.5, k=10 | worse | Percentile-search already near optimum for int8 |
Implemented column-wise GPTQ with training-data calibration (no access to val). Collects X^T X per nn.Linear over 5 training batches, then column-by-column quantization with Cholesky(H_inv) error compensation. ~100 lines added to 1496-line submission.
| Seed | roundtrip Δ | sliding Δ | hedge Δ |
|---|---|---|---|
| 1337 | −0.0034 | −0.0033 | +0.008 |
| 42 | −0.0007 | −0.0008 | −0.0006 |
| 7 | −0.0013 | −0.0013 | +0.023 |
| 3-seed mean | −0.0018 | −0.0018 | +0.010 |
Deterministic improvement on sliding/roundtrip (both −0.002). Hedge mean worse by +0.010 — submission #1453's seed 7 hedge was unusually low (1.1193) and we couldn't reproduce that luck in our session.
Implication: GPTQ makes the model genuinely better (sliding/roundtrip = deterministic metric of model quality), but val_bpb is scored on hedge which has ±0.013 seed variance + ±0.008 session variance. The model-level gain gets dominated by hedge stochasticity.
Not submitting GPTQ as replacement — #1453 remains the best hedge-mean result. GPTQ-enhanced code kept as reference.
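The 3-seed means in the GPTQ table, and the difference in spread between the deterministic and hedge metrics, can be rechecked with a few lines of arithmetic. The per-seed deltas are copied from the table; the dictionary layout and variable names are only for illustration.

```python
from statistics import mean, stdev

# Per-seed deltas (GPTQ minus baseline) for seeds 1337, 42, 7, from the table.
deltas = {
    "roundtrip": [-0.0034, -0.0007, -0.0013],
    "sliding": [-0.0033, -0.0008, -0.0013],
    "hedge": [0.008, -0.0006, 0.023],
}

# Mean and sample standard deviation of the deltas for each metric.
summary = {name: (mean(vals), stdev(vals)) for name, vals in deltas.items()}

for name, (mu, sigma) in summary.items():
    print(f"{name:10s} mean={mu:+.4f} stdev={sigma:.4f}")
```

The sample standard deviation of the hedge deltas comes out at roughly 0.012, against roughly 0.001 for sliding, so a model-level gain of about 0.002 is well inside the hedge spread.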
Our 1.1324 (3-seed) vs SOTA 1.1147 (abaybektursun's flat 11×512 + AR Self-Gen GPTQ + BigramHash 3072×112). Gap ~0.018. Evangelinehelsinki's separate exploration found flat 11L beats 3×3 recurrence by ~0.025 at same trick stack. Recurrence trades unique parameters for effective depth, which helps fit 23.7M params in 16MB but underperforms flat architecture per-layer.
Hedge gives ~−0.05 bpb lift over sliding but has huge variance:
- ±0.013 bpb between seeds (same config)
- ±0.008 bpb between sessions at identical model weights (sanity-run confirmed roundtrip/sliding match to 0.0002, hedge diverged 0.008)
Most architectural gains get absorbed by hedge noise. Deterministic metrics (sliding, roundtrip) are the reliable signal.
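For reference, the core of a Hedge (exponential-weights) mixture can be sketched as follows. This is a minimal illustration, not the submission's mixer: the function name `hedge_mix`, the toy expert probabilities, and the choice to mix in probability space are all assumptions. Only `eta=0.1` mirrors `HEDGE_ETA` from the submitted config. The sketch is deterministic, so it does not reproduce the seed or session variance described above.

```python
import math


def normalize(log_w):
    """Turn unnormalized log-weights into a probability vector."""
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    return [x / z for x in w]


def hedge_mix(expert_probs, eta=0.1):
    """Online Hedge mixture (hypothetical helper).

    expert_probs[t][i] is the probability (> 0) that expert i assigned to the
    token actually observed at step t. Returns the mean code length of the
    mixture in bits per token and the final expert weights.
    """
    num_experts = len(expert_probs[0])
    log_w = [0.0] * num_experts  # start from uniform weights
    total_bits = 0.0
    for probs in expert_probs:
        w = normalize(log_w)
        # Score the token with the current mixture before seeing its loss.
        p_mix = sum(wi * pi for wi, pi in zip(w, probs))
        total_bits += -math.log2(p_mix)
        # Multiplicative update: each expert is penalized by eta * its NLL.
        log_w = [lw + eta * math.log(pi) for lw, pi in zip(log_w, probs)]
    return total_bits / len(expert_probs), normalize(log_w)


# Toy stream: expert 0 always assigns 0.5 to the true token, expert 1 only 0.1.
stream = [[0.5, 0.1]] * 200
bits, weights = hedge_mix(stream)
print(bits, weights)
```

The weights are updated only after each token has been scored, so the mixture never uses a token's own loss to predict it.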
On 3×4 recurrence:
- loop_embed + cross_repeat_scales + value_scales already provide per-repeat variance
- LoRA per-repeat on top hurt (+0.006 sliding) — the model was already using available capacity
- Inter-repeat RMSNorm also hurt
Additional per-repeat degrees of freedom have diminishing/negative returns.
Shifting the schedule from (0.40:2, 0.65:3, 1.0:4) to (0.30:2, 0.50:3, 1.0:4) gave −0.004 bpb, with 55% more full-depth training steps. This was combined with a longer warmdown (3000 vs 2000 iterations) and denser SWA (every 30 vs 50) from a later start fraction (0.6 vs 0.4), giving ~44 averaged checkpoints.
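A schedule string in the `PROG_DEPTH` format can be interpreted as in the sketch below. The helper names `parse_prog_depth` and `repeats_at` are hypothetical, and the reading of each `frac:repeats` pair as "use this many repeats until this fraction of the training budget has elapsed" is an assumption. Whether the fraction is measured in steps or in wall-clock time is not specified here.

```python
def parse_prog_depth(spec):
    """Parse 'frac:repeats,...' into a sorted list of (end_fraction, repeats)."""
    stages = []
    for item in spec.split(","):
        frac, repeats = item.split(":")
        stages.append((float(frac), int(repeats)))
    return sorted(stages)


def repeats_at(progress, stages):
    """Number of repeats to run at a given training progress in [0, 1]."""
    for end_frac, repeats in stages:
        if progress < end_frac:
            return repeats
    # At or past the last boundary, stay at the final (full) depth.
    return stages[-1][1]


stages = parse_prog_depth("0.30:2,0.50:3,1.0:4")
for progress in (0.0, 0.29, 0.30, 0.49, 0.50, 1.0):
    print(progress, repeats_at(progress, stages))
```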
Separating attn (int7, 63 levels) from MLP (int5, 16 levels):
- Attention quality drop dominates total loss at low precision → keep attn higher
- MLP tolerates aggressive quantization → allows 2MB saving
- 2MB saved → model width up from d=832 mlp×2 → d=880 mlp×3
Gain: −0.012 bpb.
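The precision split can be illustrated with a symmetric per-row quantizer whose integer range differs by tensor type. This is a sketch under stated assumptions, not the submission's quantizer: `quantize_row` and `max_abs_error` are hypothetical names, the toy row stands in for real weights, and `qmax` (the largest absolute integer code) is set to 63 for attention and 15 for MLP as one plausible reading of the level counts above.

```python
import math


def quantize_row(row, qmax):
    """Symmetric per-row quantization to integer codes in [-qmax, qmax]."""
    amax = max(abs(v) for v in row)
    if amax == 0.0:
        return [0] * len(row), 1.0
    scale = amax / qmax  # one float scale stored per row
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def max_abs_error(row, qmax):
    """Worst-case reconstruction error of a row, plus the scale that was used."""
    codes, scale = quantize_row(row, qmax)
    err = max(abs(v - c * scale) for v, c in zip(row, codes))
    return err, scale


row = [math.sin(0.37 * i) for i in range(64)]  # toy stand-in for a weight row

attn_err, attn_scale = max_abs_error(row, qmax=63)  # assumed attention setting
mlp_err, mlp_scale = max_abs_error(row, qmax=15)    # assumed MLP setting
print(attn_err, attn_scale)
print(mlp_err, mlp_scale)
```

The worst-case rounding error is half the scale, so the coarser MLP grid accepts a larger per-weight error in exchange for fewer bits per code.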
Original percentile-search GPTQ ("GPTQ-lite" in our code) only optimizes per-row clip point via MSE. Full GPTQ with column-wise Hessian error compensation gave deterministic −0.002..−0.003 on sliding. Training-data calibration worked; AR self-gen calibration would likely stabilize further.
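The per-row clip search that the lighter variant performs can be sketched as follows. This is an illustrative stand-in rather than the code in the submission: `fake_quant`, `mse`, and `best_clip` are hypothetical names, the candidate grid of clip fractions is an assumption (the real search is described as percentile-based), and no Hessian error compensation is included.

```python
def fake_quant(row, qmax, clip):
    """Quantize then dequantize a row with codes in [-qmax, qmax] and range clip."""
    scale = clip / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_clip(row, qmax, fractions=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the clip point (as a fraction of the row max) with the lowest MSE."""
    amax = max(abs(v) for v in row)
    if amax == 0.0:
        return 1.0, 0.0
    best_frac, best_err = None, float("inf")
    for frac in fractions:
        err = mse(row, fake_quant(row, qmax, frac * amax))
        if err < best_err:
            best_frac, best_err = frac, err
    return best_frac, best_err


# Toy row: many small weights plus one outlier that stretches the range.
row = [0.01 * i for i in range(-20, 21)] + [1.0]
frac, err = best_clip(row, qmax=15)
baseline = mse(row, fake_quant(row, 15, max(abs(v) for v in row)))
print(frac, err, baseline)
```

Because the search scores each row in isolation, it cannot shift rounding error onto later columns, which is the part that the column-wise Hessian compensation adds.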
- Main submission: `records/track_non_record_16mb/2026-04-08_DepthRecurrence_Int7MixedQuant_HedgeMixer/` (PR #1453 backing)
- 4-hour submission: PR #895
- Experimental code variants in repo root: `train_gpt_refactored.py`, `train_gpt_exp1.py`, etc.
Config used for PR #1453 (submitted):
MODEL_DIM=880 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=3
NUM_LAYERS=3 NUM_REPEATS=4
QUANT_LEVELS=63 MLP_QUANT_LEVELS=16
PROG_DEPTH="0.30:2,0.50:3,1.0:4"
WARMDOWN_ITERS=3000
SWA_START_FRAC=0.6 SWA_EVERY=30
MATRIX_LR=0.018 MUON_WD=0.04
XSA_LAST_N=4 QK_GAIN_INIT=1.5
USE_HEDGE=1 HEDGE_ETA=0.1
MAX_WALLCLOCK_SECONDS=600
3 seeds tested (1337, 42, 7) on 8× H100 SXM 80GB, PyTorch 2.5.1.
- RunPod compute grant: ~$950 of $1000 used
- ~25 full training runs + calibration experiments
- 1 person, 32 days
Thanks to OpenAI for running this challenge and sponsoring the compute grant. Thanks to abaybektursun, thwu1, Raahil Shah, Evangelinehelsinki for publishing detailed submissions that informed several of my experiments (particularly GPTQ calibration, BigramHash sizing, and the noisy-QAT analysis for recurrent architectures).