Commit c4a33ee

Add Random Linear Maps + LoRA Adapters submission
1 parent 7ba5111 commit c4a33ee

3 files changed

Lines changed: 1042 additions & 0 deletions

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Random Linear Maps + Learned LoRA Adapters

## Summary

This submission implements the **"Learning adapters on random linear maps"** idea from the challenge wishlist — a previously unclaimed approach that inverts the standard train→compress paradigm.

**Core idea**: Instead of training all weights and then compressing them to fit in 16MB, we:

1. **Freeze most weights as pseudo-random projections** initialized from a deterministic seed (stored in code = 0 bytes in the artifact); see the sketch after this list.
2. **Only train small LoRA-style low-rank adapters** (rank 16) on each layer, plus embeddings, norms, and control parameters.
3. **At save time**, serialize only the trained adapter weights + seed.
4. **At load time**, regenerate the full random backbone from the seed and apply the trained adapters.
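A minimal sketch of the seeded regeneration behind steps 1 and 4, assuming a PyTorch backbone; the function and constant names are illustrative, not taken from the submission's `train_gpt.py`:

```python
import torch

BACKBONE_SEED = 42  # lives in code, so it costs 0 bytes in the saved artifact


def random_base_weight(out_features: int, in_features: int, layer_idx: int,
                       seed: int = BACKBONE_SEED) -> torch.Tensor:
    """Deterministically regenerate one frozen base weight from the seed.

    Mixing the layer index into the seed keeps every layer's projection
    distinct while remaining exactly reproducible at load time.
    """
    gen = torch.Generator().manual_seed(seed + layer_idx)
    w = torch.randn(out_features, in_features, generator=gen)
    return w / (in_features ** 0.5)  # scale like a standard init so activations stay well-behaved


# Save time writes only the trained adapters + the seed; load time replays the
# same generator calls, so the large random backbone never touches the disk.
```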
## Why This Is Interesting

- **Massive model for free**: The frozen random backbone (12 layers, 768 dim, 3x MLP) has ~70M+ parameters but costs **0 bytes** in the artifact since it's reproducible from a seed.
- **Only ~5-10M trainable params**: These fit easily in 16MB even at FP16, leaving headroom for wider/deeper architectures.
- **Theoretically motivated**: Random features are surprisingly powerful (Random Kitchen Sinks, Lottery Ticket Hypothesis). The frozen random projections provide a rich feature basis that the adapters learn to combine.
- **Novel for this challenge**: Nobody has tried this approach — it's fundamentally different from the quantization-focused submissions on the leaderboard.
## Architecture

| Component            | Value                |
| -------------------- | -------------------- |
| Layers               | 12                   |
| Model dim            | 768                  |
| Heads                | 12 (4 KV heads, GQA) |
| MLP mult             | 3x                   |
| LoRA rank            | 16                   |
| Vocab                | 1024 (sp1024)        |
| Backbone seed        | 42                   |
| Trainable params     | ~5-10M               |
| Frozen random params | ~70M+                |
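For concreteness, a configuration object matching the table might look like the following; the dataclass and field names are illustrative, not necessarily those used in `train_gpt.py`:

```python
from dataclasses import dataclass


@dataclass
class RandomLoRAConfig:
    n_layer: int = 12        # transformer blocks with frozen random base weights
    n_embd: int = 768        # model dimension
    n_head: int = 12         # query heads
    n_kv_head: int = 4       # GQA: 4 key/value heads
    mlp_mult: int = 3        # MLP hidden size = 3 * n_embd
    lora_rank: int = 16      # rank of the trainable adapters
    vocab_size: int = 1024   # sp1024 tokenizer
    backbone_seed: int = 42  # regenerates every frozen weight at load time
```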
## Key Components

### LoRALinear Module

Each linear layer has (see the sketch after this list):

- A **frozen random base weight** `W` (from the deterministic seed, stored as a buffer)
- **Trainable low-rank adapters** `A` (rank×in) and `B` (out×rank)
- Output: `W@x + (B@A)@x * scale`
- `B` initialized to zero, so the initial behavior is a pure random projection
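A minimal sketch of such a layer in PyTorch; the exact initialization and scaling in the submission may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen random base weight plus a trainable rank-r adapter."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16,
                 scale: float = 1.0, seed: int = 42):
        super().__init__()
        # Frozen random base: registered as a buffer, so it is excluded from
        # parameters(), the optimizer, and the saved artifact.
        gen = torch.Generator().manual_seed(seed)
        base = torch.randn(out_features, in_features, generator=gen) / in_features ** 0.5
        self.register_buffer("weight", base)
        # Trainable low-rank adapters: A is (rank x in), B is (out x rank).
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) / in_features ** 0.5)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init => starts as a pure random projection
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base_out = F.linear(x, self.weight)                          # W @ x
        lora_out = F.linear(F.linear(x, self.lora_A), self.lora_B)   # (B @ A) @ x
        return base_out + self.scale * lora_out
```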
### Optimizer Split

- **Muon**: LoRA adapter matrices (`lora_A`, `lora_B`); see the grouping sketch below
- **Adam**: Token embeddings, scalar/control parameters, norms
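A sketch of how the parameters might be routed by name; `Muon` refers to the Muon optimizer used by this repo's training scripts, and its import path, constructor arguments, and the learning rates shown are assumptions, not values from the submission:

```python
import torch

from train_gpt import Muon  # assumption: Muon is defined alongside the training script


def build_optimizers(model: torch.nn.Module):
    # LoRA adapter matrices go to Muon; everything else that is trainable
    # (token embeddings, norms, scalar/control parameters) goes to Adam.
    def is_lora(name: str) -> bool:
        return "lora_A" in name or "lora_B" in name

    lora_params = [p for n, p in model.named_parameters() if p.requires_grad and is_lora(n)]
    other_params = [p for n, p in model.named_parameters() if p.requires_grad and not is_lora(n)]
    muon_opt = Muon(lora_params, lr=0.02, momentum=0.95)                   # illustrative hyperparameters
    adam_opt = torch.optim.Adam(other_params, lr=3e-3, betas=(0.9, 0.95))  # illustrative hyperparameters
    return muon_opt, adam_opt
```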
### Serialization

- Only trainable parameters are saved (not the frozen buffers)
- Int8 + zlib compression on the trainable subset
- At load time: regenerate the random backbone from the seed and apply the dequantized adapters (see the sketch below)
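A minimal sketch of the int8 + zlib round-trip over the trainable subset, assuming per-tensor symmetric quantization; the artifact's actual layout and metadata handling are simplified here:

```python
import io
import zlib

import torch


def save_adapters(model: torch.nn.Module, path: str) -> None:
    # Quantize only the trainable parameters; frozen random buffers are
    # regenerated from the seed at load time and never written to disk.
    packed = {}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        t = p.detach().float()
        scale = t.abs().max().clamp(min=1e-8) / 127.0
        q = (t / scale).round().clamp(-127, 127).to(torch.int8)
        packed[name] = (q, scale)
    buf = io.BytesIO()
    torch.save(packed, buf)
    with open(path, "wb") as f:
        f.write(zlib.compress(buf.getvalue(), level=9))


def load_adapters(model: torch.nn.Module, path: str) -> None:
    with open(path, "rb") as f:
        packed = torch.load(io.BytesIO(zlib.decompress(f.read())))
    # Dequantize and load only the trained subset; the frozen random backbone
    # has already been rebuilt from the seed, so strict=False skips it.
    state = {name: q.float() * scale for name, (q, scale) in packed.items()}
    model.load_state_dict(state, strict=False)
```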
## Running

```bash
cd /workspace/parameter-golf

python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1

RUN_ID=random_lora_v1 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 records/track_10min_16mb/2026-04-03_RandomLinearMaps_LoRA_Adapters/train_gpt.py
```
## Potential Improvements

- **Selective unfreezing**: Unfreeze first/last layer base weights for better embedding-to-hidden and hidden-to-logit projections.
- **Larger LoRA rank** on critical layers (attention Q/K vs MLP).
- **Different random initialization** schemes (orthogonal, spectral norm matching).
- **Hybrid**: Freeze only MLP base weights (largest), train attention fully.
- **Combine with proven techniques**: BigramHash, sliding eval, EMA/SWA.
## Theoretical Background

- [Random Features for Large-Scale Kernel Machines](https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html) (Rahimi & Recht, 2007)
- [The Lottery Ticket Hypothesis](https://arxiv.org/abs/1803.03635) (Frankle & Carbin, 2018)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) (Hu et al., 2021)
- [Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning](https://arxiv.org/abs/2012.13255) (Aghajanyan et al., 2020)
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@

{
  "author": "austinluk",
  "github_id": "austinluk",
  "date": "2026-04-03",
  "val_bpb": null,
  "description": "Random Linear Maps + Learned LoRA Adapters: freeze most weights as pseudo-random (from seed = 0 bytes in artifact), train only low-rank adapters. Enables a much larger backbone (12L, 768d, 3x MLP) under 16MB since frozen weights cost nothing in the artifact.",
  "track": "10min_16mb",
  "tags": ["random-linear-maps", "lora", "adapters", "creative"]
}
