# Random Linear Maps + Learned LoRA Adapters

## Summary

This submission implements the **"Learning adapters on random linear maps"** idea from the challenge wishlist — a previously unclaimed approach that inverts the standard train→compress paradigm.

**Core idea**: Instead of training all weights and then compressing to fit in 16MB, we:

1. **Freeze most weights as pseudo-random projections** initialized from a deterministic seed (stored in code = 0 bytes in artifact).
2. **Only train small LoRA-style low-rank adapters** (rank 16) on each layer, plus embeddings, norms, and control parameters.
3. **At save time**, serialize only the trained adapter weights + seed.
4. **At load time**, regenerate the full random backbone from the seed and apply the trained adapters (a sketch follows this list).
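
A minimal sketch of this seeded regeneration (steps 1 and 4), assuming PyTorch; the helper name, seed-derivation scheme, and example tensor name are illustrative, not the submission's exact code:

```python
import zlib

import torch

BACKBONE_SEED = 42  # lives in code, so it costs 0 bytes in the saved artifact


def frozen_random_weight(name: str, shape: tuple, seed: int = BACKBONE_SEED) -> torch.Tensor:
    # Derive a deterministic per-tensor seed from the global seed and the
    # parameter name, so every load rebuilds exactly the same backbone.
    tensor_seed = (seed * 1_000_003 + zlib.crc32(name.encode())) % (2**31)
    g = torch.Generator().manual_seed(tensor_seed)
    # Scaled-normal init; any scheme works as long as it is reproducible.
    return torch.randn(*shape, generator=g) * shape[-1] ** -0.5


# e.g. regenerate layer 3's frozen MLP up-projection at load time
w_up_3 = frozen_random_weight("layers.3.mlp.up", (2304, 768))
```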

## Why This Is Interesting

- **Massive model for free**: The frozen random backbone (12 layers, 768 dim, 3x MLP) has ~70M+ parameters but costs **0 bytes** in the artifact since it's reproducible from a seed.
- **Only ~5-10M trainable params**: After int8 + zlib compression these fit comfortably in 16MB, leaving headroom for wider/deeper architectures.
- **Theoretically motivated**: Random features are surprisingly powerful (Random Kitchen Sinks, Lottery Ticket Hypothesis). The frozen random projections provide a rich feature basis that the adapters learn to combine.
- **Novel for this challenge**: Nobody has tried this approach — it's fundamentally different from the quantization-focused submissions on the leaderboard.

## Architecture

| Component | Value |
| -------------------- | -------------------- |
| Layers | 12 |
| Model dim | 768 |
| Heads | 12 (4 KV heads, GQA) |
| MLP mult | 3x |
| LoRA rank | 16 |
| Vocab | 1024 (sp1024) |
| Backbone seed | 42 |
| Trainable params | ~5-10M |
| Frozen random params | ~70M+ |

## Key Components

### LoRALinear Module

Each linear layer has:

- A **frozen random base weight** `W` (from deterministic seed, stored as buffer)
- **Trainable low-rank adapters** `A` (rank×in) and `B` (out×rank)
- Output: `W@x + (B@A)@x * scale`
- `B` is initialized to zero, so the initial behavior is a pure random projection (see the sketch below)
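
A minimal PyTorch sketch of such a module, reusing the illustrative `frozen_random_weight` helper from the summary; the shapes, init, and `scale` convention are assumptions rather than the submission's exact choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    def __init__(self, name: str, in_dim: int, out_dim: int, rank: int = 16, scale: float = 1.0):
        super().__init__()
        # Frozen random base weight W: a buffer, not a Parameter, so it is
        # skipped by the optimizer and excluded from the saved artifact.
        self.register_buffer("weight", frozen_random_weight(name, (out_dim, in_dim)))
        # Trainable low-rank adapters: A is (rank x in), B is (out x rank).
        self.lora_A = nn.Parameter(torch.randn(rank, in_dim) * in_dim**-0.5)
        self.lora_B = nn.Parameter(torch.zeros(out_dim, rank))  # zero init => starts as a pure random projection
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W@x + (B@A)@x * scale, with x laid out as (..., in_dim)
        base = F.linear(x, self.weight)
        delta = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return base + self.scale * delta
```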

### Optimizer Split

- **Muon**: LoRA adapter matrices (lora_A, lora_B)
- **Adam**: Token embeddings, scalar/control parameters, norms (see the sketch below)
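
A minimal sketch of the parameter grouping, assuming a `Muon` implementation is importable from the training codebase (the `muon` module, hyperparameters, and the `lora_` naming filter are assumptions):

```python
import torch

from muon import Muon  # hypothetical import; any Muon implementation taking (params, lr, momentum) works


def build_optimizers(model: torch.nn.Module):
    lora_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen backbone weights are buffers and never appear here anyway
        (lora_params if "lora_" in name else other_params).append(param)
    # Muon on the 2-D LoRA adapter matrices, Adam on embeddings / norms / control scalars.
    muon_opt = Muon(lora_params, lr=0.02, momentum=0.95)
    adam_opt = torch.optim.Adam(other_params, lr=3e-4, betas=(0.9, 0.95))
    return muon_opt, adam_opt
```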

### Serialization

- Only trainable parameters are saved (not frozen buffers)
- Int8 + zlib compression on the trainable subset
- At load time: regenerate the random backbone from the seed, then apply the dequantized adapters (sketched below)
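
A minimal sketch of the save/load path, assuming symmetric per-tensor int8 quantization; the file layout and function names are illustrative:

```python
import pickle
import zlib

import numpy as np
import torch


def save_adapters(model: torch.nn.Module, path: str, seed: int = 42) -> None:
    blob = {"seed": seed, "tensors": {}}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # only the trained subset is serialized; frozen buffers are skipped
        t = param.detach().to(torch.float32).cpu()
        scale = float(t.abs().max().clamp(min=1e-8)) / 127.0  # per-tensor symmetric int8
        q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
        blob["tensors"][name] = (q.numpy().tobytes(), tuple(t.shape), scale)
    with open(path, "wb") as f:
        f.write(zlib.compress(pickle.dumps(blob), level=9))


def load_adapters(model: torch.nn.Module, path: str) -> None:
    with open(path, "rb") as f:
        blob = pickle.loads(zlib.decompress(f.read()))
    # The frozen backbone is rebuilt from blob["seed"] when the model is constructed;
    # here we only restore the dequantized trained tensors on top of it.
    state = {}
    for name, (raw, shape, scale) in blob["tensors"].items():
        q = torch.from_numpy(np.frombuffer(raw, dtype=np.int8).copy()).reshape(shape)
        state[name] = q.to(torch.float32) * scale
    model.load_state_dict(state, strict=False)
```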

## Running

```bash
cd /workspace/parameter-golf

python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1

RUN_ID=random_lora_v1 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 records/track_10min_16mb/2026-04-03_RandomLinearMaps_LoRA_Adapters/train_gpt.py
```

## Potential Improvements

- **Selective unfreezing**: Unfreeze first/last layer base weights for better embedding-to-hidden and hidden-to-logit projections.
- **Larger LoRA rank** on critical layers (attention Q/K vs MLP).
- **Different random initialization** schemes (orthogonal, spectral norm matching).
- **Hybrid**: Freeze only MLP base weights (largest), train attention fully.
- **Combine with proven techniques**: BigramHash, sliding eval, EMA/SWA.

## Theoretical Background

- [Random Features for Large-Scale Kernel Machines](https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html) (Rahimi & Recht, 2007)
- [The Lottery Ticket Hypothesis](https://arxiv.org/abs/1803.03635) (Frankle & Carbin, 2018)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) (Hu et al., 2021)
- [Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning](https://arxiv.org/abs/2012.13255) (Aghajanyan et al., 2020)