A proof-of-concept implementation of head-wise adaptive RoPE (Rotary Position Embedding), in which each attention head learns its own frequency and phase scaling to improve long-context recall and copy-style in-context learning.
This project implements and evaluates head-wise adaptive RoPE variants in which each attention head learns:
- A per-head frequency scale (how fast the rotation angle advances with position)
- A per-head phase offset (an initial rotation phase)
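As a concrete illustration, here is a minimal sketch of how per-head frequency and phase parameters can enter the rotation. The class and parameter names (`AdaptiveRoPE`, `log_freq_scale`, `phase`) are illustrative assumptions, not the actual API in `rope.py`, and the half-dim rotation pairing used here is one common RoPE convention among several:

```python
import torch
import torch.nn as nn

class AdaptiveRoPE(nn.Module):
    """Sketch of head-wise adaptive RoPE (names are assumptions, not the repo's API)."""

    def __init__(self, n_heads: int, head_dim: int, base: float = 10000.0):
        super().__init__()
        # Standard RoPE inverse frequencies, shared across heads.
        inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
        self.register_buffer("inv_freq", inv_freq)
        # Per-head learnables: log-parameterized frequency scale
        # (exp(0) = 1.0 at init, i.e. baseline RoPE) and additive phase offset.
        self.log_freq_scale = nn.Parameter(torch.zeros(n_heads))
        self.phase = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_heads, seq_len, head_dim)
        b, h, t, d = x.shape
        pos = torch.arange(t, device=x.device, dtype=torch.float32)
        # Rotation angles per head: scale * position * inv_freq + phase
        # -> shape (n_heads, seq_len, head_dim // 2)
        angles = (
            torch.exp(self.log_freq_scale)[:, None, None]
            * pos[None, :, None]
            * self.inv_freq[None, None, :]
            + self.phase[:, None, None]
        )
        cos, sin = angles.cos()[None], angles.sin()[None]
        # Rotate the two halves of each head dimension as (x1, x2) pairs.
        x1, x2 = x[..., : d // 2], x[..., d // 2 :]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In use, the module would be applied to both queries and keys (shaped `(batch, heads, seq, head_dim)`) before the attention score computation; a frequency scale below 1.0 slows a head's rotation, which is the kind of per-head behavior the long-context experiments are designed to surface.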
The repository provides:
- Baseline RoPE implementation
- Head-wise adaptive RoPE with learnable scales and phases
- Synthetic copy and associative recall tasks
- Mechanistic analysis tools (attention patching, head scale analysis)
- Evaluation on context lengths from 2k to 16k
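To make the task setup concrete, here is a hedged sketch of an associative recall batch generator in the spirit of `tasks.py`. The function name, token layout (interleaved key-value pairs followed by a query key), and vocabulary split are illustrative assumptions, not the repo's actual data format:

```python
import torch

def associative_recall_batch(batch_size: int, n_pairs: int, vocab_size: int,
                             device: str = "cpu"):
    """Build a batch of key-value recall sequences: k1 v1 ... kN vN q.

    Keys come from the lower half of the vocabulary, values from the
    upper half; the target is the value paired with the query key.
    """
    half = vocab_size // 2
    # Draw unique keys per row so the queried key has a single answer.
    keys = torch.stack(
        [torch.randperm(half, device=device)[:n_pairs] for _ in range(batch_size)]
    )
    vals = torch.randint(half, vocab_size, (batch_size, n_pairs), device=device)
    # Interleave into k1 v1 k2 v2 ... kN vN.
    seq = torch.stack([keys, vals], dim=-1).reshape(batch_size, 2 * n_pairs)
    # Append a query key drawn from the stored pairs; target is its value.
    idx = torch.randint(0, n_pairs, (batch_size,), device=device)
    rows = torch.arange(batch_size, device=device)
    inputs = torch.cat([seq, keys[rows, idx].unsqueeze(1)], dim=1)
    targets = vals[rows, idx]
    return inputs, targets
```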
Install dependencies:

```bash
pip install -r requirements.txt
```

Example training and evaluation run:

```bash
python train.py --model_size 4 --context_length 2048 --test_length 16384
```

Files:

- `rope.py` - RoPE implementations (baseline and adaptive)
- `model.py` - Transformer model with adaptive RoPE
- `tasks.py` - Synthetic copy and associative recall tasks
- `train.py` - Training and evaluation script
- `analysis.py` - Head analysis and attention patching tools
- `experiments.py` - Experiment configurations and comparisons
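As one example of the head scale analysis listed above, here is a minimal sketch that dumps each layer's learned per-head scales, assuming the `log_freq_scale`/`phase` parameterization from the earlier sketch; the actual inspection utilities in `analysis.py` may differ:

```python
import torch

def dump_head_scales(model: torch.nn.Module) -> None:
    """Print learned per-head frequency scales and phases for every
    adaptive-RoPE module found in the model (parameter names assumed)."""
    for name, module in model.named_modules():
        if hasattr(module, "log_freq_scale") and hasattr(module, "phase"):
            freq_scales = torch.exp(module.log_freq_scale.detach())
            phases = module.phase.detach()
            for h, (s, p) in enumerate(zip(freq_scales.tolist(), phases.tolist())):
                print(f"{name}: head {h} freq_scale={s:.3f} phase={p:.3f}")
```

Heads whose frequency scale drifts well below 1.0 rotate slowly and are plausible candidates for long-range copy behavior, which the attention-patching tools can then probe directly.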