Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017) #1195

dentity007 wants to merge 3 commits into openai:main
Conversation
Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission: 200 training steps, sp1024, no torch.compile.

Finding

All configurations land near 2.51 BPB regardless of adapter configuration. Wider adapters (RND-2) actually hurt, landing at 2.63. The progressive unfreezing strategy (RND-4) shows no benefit at 200 steps. The gap from 2.51 to the trained baseline (~1.57 at 200 steps) represents the value of learning actual projection directions, not just scales.

Interesting implication: the transformer architecture itself (attention patterns, residuals, norms) provides most of the inductive bias; learned weights add ~1 BPB of improvement on top of random projections.

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08
Community Review — Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1195 ("Learning Adapters on Random Linear Maps") by NathanMaine submits a non-record entry with val_bpb=2.2017 on 1×RTX5090.

Checks performed

- ILLEGAL n-gram family bug: NOT PRESENT.
- ILLEGAL pre-quant TTT (multi-epoch on val_tokens without score-first): NOT PRESENT.
- LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard): NOT PRESENT. No TTT of any kind exists.
- HOLD scored-region SLOT: NOT PRESENT. No scored-region SLOT mechanism.
- CLEAN pure neural: CONFIRMED. The architecture is a standard transformer with frozen random orthogonal projections.

Conclusion

Pure neural, standard training loop on the train split, inference-only evaluation on the val split, post-training quantization. No cheating patterns present. Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
Thanks for the audit. Confirming your read: the frozen orthogonal projections are initialized once as non-trainable buffers, and the only trainable parameters are the diagonal scale/shift adapters plus the embeddings. Classic frozen-backbone-with-diagonal-adapters setup. The research finding is actually the interesting part of this one. With only 0.5% of parameters trainable (roughly 600K), the model reaches 2.51 BPB on sp1024 (vs 1.57 for the full baseline at the same step count). That gap quantifies how much of the transformer's inductive bias comes from the attention pattern, residual connections, and normalization rather than from learning specific projection directions. About 1 BPB of the baseline's performance comes from learning the actual weights; the rest comes from the architecture itself.
Non-record: Learning Adapters on Random Linear Maps
val_bpb: 2.2017 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024
Implements OpenAI's requested "Learning adapters on random linear maps" research direction.
Architecture

Standard transformer in which every linear projection is a frozen random orthogonal matrix, stored as a non-trainable buffer and regenerable from the seed. The only trainable parameters are per-channel diagonal scale/shift adapters on those projections, plus the embeddings and layer norms: roughly 600K parameters, about 0.5% of the total. A minimal sketch of this building block follows.
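A sketch of the core block as described above, assuming PyTorch; the module and attribute names (`RandomLinearAdapter`, `scale`, `shift`) are illustrative and not taken from the submission's source:

```python
import torch
import torch.nn as nn


class RandomLinearAdapter(nn.Module):
    """Frozen random orthogonal projection with a trainable diagonal adapter.

    The weight is generated once from a seed and registered as a buffer, so
    it receives no gradients and need not be stored in the artifact. Only
    `scale` and `shift` (2 * d_out values) are trainable.
    """

    def __init__(self, d_in: int, d_out: int, seed: int):
        super().__init__()
        gen = torch.Generator().manual_seed(seed)
        # Random orthogonal init via QR of a Gaussian matrix.
        a = torch.randn(max(d_in, d_out), min(d_in, d_out), generator=gen)
        q, _ = torch.linalg.qr(a)           # q has orthonormal columns
        w = q if d_out >= d_in else q.T     # shape (d_out, d_in)
        self.register_buffer("weight", w.contiguous())  # frozen, regenerable
        self.scale = nn.Parameter(torch.ones(d_out))    # trainable diagonal
        self.shift = nn.Parameter(torch.zeros(d_out))   # trainable bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.weight.T) + self.shift
```

Because the projection lives in a buffer rather than a parameter, an optimizer built from `model.parameters()` trains only the diagonal adapters, and the buffer can be rebuilt from the same seed at eval time.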
Results

val_bpb 2.2017 on sp1024, trained in 180s wallclock on a single RTX 5090 (16GB). The fully trained baseline reaches 1.22; uniform prediction over the 1024-token vocabulary sits near 8 BPB.
Key Findings
Random projections carry surprising capacity. At 2.2017 BPB, the model learns meaningful language structure despite 99.5% of weights being completely random. This is far better than chance (~8 BPB for uniform prediction over 1024 tokens).
The result aligns with the lottery ticket hypothesis: random networks contain useful substructures, and the diagonal adapters find and amplify them.
Artifact size is minimal. Only the adapters, embeddings, and layer norms need to be stored; the random weights can be regenerated from the seed at eval time (see the sketch after these findings).
The gap to the baseline (2.20 vs 1.22) quantifies what learned linear maps contribute. Roughly 1.0 BPB of the baseline's performance comes from learning the actual projection directions, not just their scales.
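On the artifact point above, a sketch of how seed-based regeneration could keep the checkpoint small, reusing the hypothetical `RandomLinearAdapter` from the earlier sketch; the function names and checkpoint layout are assumptions, not the submission's actual serialization code:

```python
import torch


def save_artifact(model: torch.nn.Module, path: str, seed: int) -> None:
    # Keep only trainable state (adapters, embeddings, norms); the frozen
    # random projections live in buffers and are regenerated from the seed.
    params = dict(model.named_parameters())
    trainable = {k: v.detach().cpu()
                 for k, v in model.state_dict().items() if k in params}
    torch.save({"seed": seed, "params": trainable}, path)


def load_artifact(model: torch.nn.Module, path: str) -> torch.nn.Module:
    ckpt = torch.load(path, map_location="cpu")
    # The model must be constructed with ckpt["seed"] so the regenerated
    # buffers match; strict=False then loads only the stored parameters
    # while leaving the buffers untouched.
    model.load_state_dict(ckpt["params"], strict=False)
    return model
```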
Comparison to Naive Baseline

Uniform prediction over the 1024-token vocabulary gives roughly 8 BPB. At 2.2017, the frozen-random model recovers most of the distance from that floor to the trained baseline's 1.22.
Reproduction

Run train_gpt_random_adapters.py. The frozen projections are regenerated from the fixed seed at eval time, so only the adapters, embeddings, and norms are loaded from the artifact.
Discussion
This result establishes a useful lower bound: how much can you learn with frozen random structure? The 2.2017 BPB result suggests that element-wise scaling of random projections captures roughly 85% of the distance from uniform to a trained baseline. This has implications for model compression, pruning, and understanding what neural networks actually learn.
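Taking the document's own numbers at face value (uniform ≈ 8 BPB, trained baseline 1.22), the fraction of the gap closed works out to:

$$
\frac{8.0 - 2.2017}{8.0 - 1.22} = \frac{5.7983}{6.78} \approx 0.855
$$

The uniform figure is an estimate, so treat the fraction as approximate.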
Potential directions: combining random projections with low-rank learned corrections (LoRA on top of random), or using structured random matrices (Hadamard, Toeplitz) instead of dense orthogonal.
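A minimal sketch of the first direction (a learned low-rank correction on top of a frozen random map), again with illustrative names and a hypothetical rank; zero-initializing `lora_b` makes the correction start at exactly the random-plus-diagonal model:

```python
import torch
import torch.nn as nn


class RandomLinearLoRA(nn.Module):
    """Frozen random projection plus a trainable rank-r correction.

    Output: scale * (x @ W_rand.T) + (x @ A) @ B, where W_rand is a frozen
    buffer and only scale, A, B are trained. A rank-r pair adds
    r * (d_in + d_out) parameters on top of the diagonal adapter.
    """

    def __init__(self, d_in: int, d_out: int, seed: int, rank: int = 8):
        super().__init__()
        gen = torch.Generator().manual_seed(seed)
        self.register_buffer(
            "weight", torch.randn(d_out, d_in, generator=gen) / d_in ** 0.5
        )
        self.scale = nn.Parameter(torch.ones(d_out))  # diagonal adapter
        self.lora_a = nn.Parameter(
            torch.randn(d_in, rank, generator=gen) / d_in ** 0.5
        )
        # Zero init: the low-rank term contributes nothing at step 0.
        self.lora_b = nn.Parameter(torch.zeros(rank, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = self.scale * (x @ self.weight.T)
        return base + (x @ self.lora_a) @ self.lora_b
```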
Would appreciate thoughts on whether this direction is worth pursuing further.
Credits
Script: train_gpt_random_adapters.py, which implements OpenAI's requested "Learning adapters on random linear maps" direction from the README.