Commit 3480a20
Non-record: Depth Recurrence in Parameter-Constrained Transformers — What Works, What Doesn't, and Why (#363)
* Non-record: depth recurrence + quantization error amplification finding
4 unique blocks × 3 cycles = 12 effective depth, 768d, 3x MLP
BigramHash + XSA + LoRA + Late STE QAT + int8+zstd
Key finding: quantization error is amplified ~900x as it compounds through the
recurrence cycles, making int6 incompatible with weight-sharing architectures
(the looped forward pass is sketched below). Using int8 for the shared blocks
cuts the post-quantization gap from 1.14 to 0.37 bpb.
3-seed mean: 2.0711 bpb (pre-quant), 2.4402 bpb (post-quant int8)
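
For orientation, the 4×3 configuration above is a looped (depth-recurrent)
transformer: a short stack of unique blocks re-applied for several cycles. A
minimal sketch of that forward pass, assuming a standard pre-norm block with
the 768d / 3x-MLP / relu² settings mentioned in this PR (BigramHash, XSA,
LoRA, the embedding/head, and the causal mask are omitted; the block
internals are illustrative, not the submission's code):

```python
import torch
import torch.nn as nn

class ReLUSquared(nn.Module):
    """relu^2 activation referenced in the writeup."""
    def forward(self, x):
        return torch.relu(x).square()

class Block(nn.Module):
    """Pre-norm attention + 3x-width MLP (causal mask omitted for brevity)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 3 * d_model),  # 3x MLP expansion
            ReLUSquared(),
            nn.Linear(3 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class LoopedTransformer(nn.Module):
    """4 unique blocks re-applied for 3 cycles: 12 effective layers of
    compute, but only 4 blocks' worth of parameters. Because the same
    weights are hit on every cycle, a fixed rounding error in them is
    re-incurred on each pass, which is the mechanism behind the error
    amplification noted above."""
    def __init__(self, d_model=768, n_unique=4, n_cycles=3):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d_model) for _ in range(n_unique))
        self.n_cycles = n_cycles

    def forward(self, x):
        for _ in range(self.n_cycles):
            for block in self.blocks:  # same parameters every cycle
                x = block(x)
        return x
```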
* docs: comprehensive depth recurrence research writeup
Complete 4-day experimental report on looped transformers in Parameter Golf:
- Controlled flat vs looped comparison: 1.1648 vs 1.1894 bpb (+0.025 gap)
- Noisy QAT: a novel technique that collapses the quantization gap from 0.37 to 0.002 bpb (sketched after this list)
- 3x3 > 2x5 loop finding: more unique blocks with fewer repeats wins
- 12 negative results with specific numbers
- Hyperparameter sweep data (EMA, warmdown, MTP, WD, grad clip)
- Updated training script with all experimental features
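
On the Noisy QAT line above: the commit doesn't spell out the formulation,
but one standard reading, and the one sketched here, is to replace the
deterministic rounding of late STE QAT with injected noise of
quantization-step magnitude during training, so the shared weights learn to
tolerate the error that the int8 round-trip later re-applies on every
recurrence cycle. Function names are hypothetical; see the records/ writeup
for the actual method:

```python
import torch

def int8_scale(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int8 scale."""
    return w.abs().max() / 127.0

def ste_fake_quant(w: torch.Tensor) -> torch.Tensor:
    """Late STE QAT baseline: deterministic int8 round-trip in the forward
    pass, identity gradient in the backward pass (straight-through)."""
    s = int8_scale(w)
    q = torch.round(w / s).clamp_(-127, 127) * s
    return w + (q - w).detach()

def noisy_fake_quant(w: torch.Tensor) -> torch.Tensor:
    """Noisy QAT sketch: swap the deterministic rounding for uniform noise
    of rounding-error magnitude, U(-s/2, +s/2). Each step the weight sees a
    fresh error sample instead of one fixed rounding pattern, so the
    recurrence cycles cannot repeatedly amplify the same bias."""
    s = int8_scale(w)
    noise = (torch.rand_like(w) - 0.5) * s
    return w + noise.detach()  # detached: noise contributes no gradient
```

At export time the deterministic round-trip (plus zstd over the int8
tensors, per the int8+zstd line above) would replace the noise path.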
* Update README.md
me when I can't write
* fix: remove extra files, update writeup per reviewer feedback
- Remove pr325_train_gpt.py from PR (dev file, not submission)
- Restore original README.md
- Update records/ writeup with v2 content
- Add hyperlink for Ciprian-Florin Ifrim (FIIZiK_)
- Clarify T=0.90 is activation-dependent (relu² specific, found via grid search)
---------
Co-authored-by: Evangeline Kamin <eve@aurora.lan>
1 parent 999b940 · commit 3480a20
4 files changed: 1755 additions & 0 deletions
File tree
- records/track_non_record_16mb/2026-03-21_DepthRecurrence_MixedPrecisionQuant