Commit 3204b0f

evangelinehelsinki and Evangeline Kamin authored
Non-record: Depth Recurrence in Parameter-Constrained Transformers — What Works, What Doesn't, and Why (openai#363)
* Non-record: depth recurrence + quantization error amplification finding

  4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP (see the
  layout sketch below). BigramHash + XSA + LoRA + Late STE QAT + int8+zstd.

  Key finding: quantization error amplifies ~900x through the recurrence
  cycles, making int6 incompatible with weight-sharing architectures. Int8
  for the shared blocks reduces the gap from 1.14 to 0.37 bpb.

  3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8).

* docs: comprehensive depth recurrence research writeup

  Complete 4-day experimental report on looped transformers in Parameter Golf:
  - Controlled flat vs. looped comparison: 1.1648 vs. 1.1894 bpb (+0.025 gap)
  - Noisy QAT: a novel technique collapsing quant error from 0.37 to 0.002 bpb
    (see the QAT sketch below)
  - 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
  - 12 negative results with specific numbers
  - Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
  - Updated training script with all experimental features

* Update README.md

  me when I can't write

* fix: remove extra files, update writeup per reviewer feedback

  - Remove pr325_train_gpt.py from the PR (dev file, not part of the submission)
  - Restore the original README.md
  - Update the records/ writeup with v2 content
  - Add a hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
  - Clarify that T=0.90 is activation-dependent (relu²-specific, found via
    grid search)

---------

Co-authored-by: Evangeline Kamin <eve@aurora.lan>
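A minimal sketch of the looped layout described above, assuming a stock `nn.TransformerEncoderLayer` as a stand-in block: `LoopedStack`, `make_block`, and `nhead=12` are illustrative assumptions, and the PR's real blocks (BigramHash, XSA, LoRA) are not reproduced here. Only the stated 768d / 3x MLP / 4-blocks-times-3-cycles layout comes from the commit.

```python
# Hedged sketch of depth recurrence (weight sharing across cycles).
import torch
import torch.nn as nn

class LoopedStack(nn.Module):
    """4 unique blocks replayed for 3 cycles -> 12 effective layers,
    while only 4 blocks' worth of parameters are stored."""
    def __init__(self, block_fn, n_unique: int = 4, n_cycles: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn() for _ in range(n_unique)])
        self.n_cycles = n_cycles

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_cycles):      # same parameters every cycle
            for block in self.blocks:
                x = block(x)
        return x

def make_block() -> nn.Module:
    # Stand-in matching the stated shape: 768d, 3x MLP (ffn = 3 * 768).
    return nn.TransformerEncoderLayer(
        d_model=768, nhead=12, dim_feedforward=3 * 768,
        norm_first=True, batch_first=True)

stack = LoopedStack(make_block)
out = stack(torch.randn(1, 16, 768))        # (batch, seq, dim) -> same shape
```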
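The "Late STE QAT" and "Noisy QAT" items both rest on fake quantization during training. Below is a hedged sketch of the two mechanisms for symmetric per-tensor int8: the straight-through estimator is standard, while the noise-injection variant is one plausible reading of "Noisy QAT" (the PR's exact recipe, late-start schedule, and quantization granularity are assumptions; `fake_quant_ste` and `noisy_quant` are illustrative names).

```python
# Hedged sketch: symmetric per-tensor int8 fake quantization.
import torch

def fake_quant_ste(w: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: quantized forward, identity backward."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    w_q = torch.round(w / scale).clamp(-127, 127) * scale
    return w + (w_q - w).detach()   # gradient flows as if unquantized

def noisy_quant(w: torch.Tensor) -> torch.Tensor:
    """Replace hard rounding with uniform noise of one step width, so the
    network trains against the distribution of rounding errors."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    noise = (torch.rand_like(w) - 0.5) * scale
    return w + noise.detach()

# Why recurrence amplifies the error: a shared quantized block is applied
# once per cycle, so its per-step rounding error is re-injected into the
# residual stream on every pass instead of once, compounding across cycles.
```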
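"int8+zstd" reads as scoring the int8 weights by their zstd-compressed size. A minimal sketch of that measurement, assuming the third-party `zstandard` package (`pip install zstandard`) and an arbitrary compression level; the actual scoring pipeline is not shown in this commit.

```python
# Hedged sketch: measure an int8 weight buffer's size after zstd compression.
import numpy as np
import zstandard as zstd   # third-party: pip install zstandard

def compressed_bytes(w_int8: np.ndarray, level: int = 19) -> int:
    """Size in bytes of the int8 weight buffer after zstd compression."""
    assert w_int8.dtype == np.int8
    return len(zstd.ZstdCompressor(level=level).compress(w_int8.tobytes()))

weights = np.random.randint(-127, 128, size=1_000_000, dtype=np.int8)
print(compressed_bytes(weights))   # random data compresses poorly
```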
1 parent 847cd9f · commit 3204b0f

4 files changed: 1,755 additions, 0 deletions

0 commit comments
