Skip to content

Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)#1195

Open
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:research/random-adapters
Open

Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)#1195
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:research/random-adapters

Conversation

@dentity007
Copy link
Copy Markdown

@dentity007 dentity007 commented Mar 31, 2026

Non-record: Learning Adapters on Random Linear Maps

val_bpb: 2.2017 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024

Implements OpenAI's requested "Learning adapters on random linear maps" research direction.

Architecture

  • All major weight matrices (Q, K, V, output projection, MLP up, MLP down) are frozen random orthogonal projections
  • Only diagonal scale and shift adapters are trainable (~0.5% of total parameters)
  • Tests whether random linear maps provide sufficient structure for language modeling when combined with learned element-wise transformations
  • Base config: 9 layers, d=512, 8 heads, sp1024 vocab, MLP 2x

Results

Metric Value
val_bpb (final) 2.2017
Trainable params ~0.5% of total
Training time 180s (1x RTX 5090)

Key Findings

  1. Random projections carry surprising capacity. At 2.2017 BPB, the model learns meaningful language structure despite 99.5% of weights being completely random. This is far better than chance (~8 BPB for uniform prediction over 1024 tokens).

  2. The result aligns with the lottery ticket hypothesis. Random networks contain useful substructures. The diagonal adapters find and amplify these structures.

  3. Artifact size is minimal. Only the adapters, embeddings, and layer norms need to be stored. The random weights can be regenerated from the seed at eval time.

  4. The gap to the baseline (2.20 vs 1.22) quantifies what learned linear maps contribute. Roughly 1.0 BPB of the baseline's performance comes from learning the actual projection directions, not just their scales.

Comparison to Naive Baseline

Naive Baseline Random Adapters
Trainable weights 100% 0.5%
val_bpb 1.2244 2.2017
Artifact size ~15 MB < 1 MB (adapters only)

Reproduction

pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
MAX_WALLCLOCK_SECONDS=180 python3 train_gpt_random_adapters.py

Discussion

This result establishes a useful lower bound: how much can you learn with frozen random structure? The 2.2017 BPB result suggests that element-wise scaling of random projections captures roughly 75% of the distance from uniform to a trained baseline. This has implications for model compression, pruning, and understanding what neural networks actually learn.

Potential directions: combining random projections with low-rank learned corrections (LoRA on top of random), or using structured random matrices (Hadamard, Toeplitz) instead of dense orthogonal.

Would appreciate thoughts on whether this direction is worth pursuing further.

Credits

Script: train_gpt_random_adapters.py
Implements OpenAI's requested "Learning adapters on random linear maps" direction from the README.

@dentity007 dentity007 closed this Apr 1, 2026
@dentity007 dentity007 reopened this Apr 1, 2026
@dentity007
Copy link
Copy Markdown
Author

Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results

Run Config Trainable val_bpb ms/step
RND-1 Default (0.5%) ~600K 2.5123 894
RND-2 Wider adapters (rank 8) ~480K 2.6323 881
RND-3 5% unfrozen ~600K 2.5120 895
RND-4 Progressive unfreeze ~600K 2.5122 894

Finding

All configurations land near 2.51 BPB regardless of adapter configuration. Wider adapters (RND-2) actually hurt, landing at 2.63. The progressive unfreezing strategy (RND-4) shows no benefit at 200 steps.

The gap from 2.51 to the trained baseline (~1.57 at 200 steps) represents the value of learning actual projection directions, not just scales. Interesting implication: transformer architecture itself (attention patterns, residuals, norms) provides most of the inductive bias. Learned weights add ~1 BPB of improvement on top of random projections.

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1195 ("Learning Adapters on Random Linear Maps") by NathanMaine submits a non-record entry with val_bpb=2.2017 on 1×RTX5090. The submission directory is records/track_non_record_16mb/2026-03-31_RandomLinearMapAdapters/.

Checks performed

ILLEGAL n-gram family bug

NOT PRESENT. Searched for: ngram, bigram, BigramHash, trigram, hash, XOR. Zero matches. No hash-key manipulation or target-XOR'd-into-hash patterns anywhere in the file.

ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

NOT PRESENT. The only references to val_tokens are:

  • Line 871: val_tokens = load_validation_tokens(...) — loading at startup
  • Line 243: total_seqs = (val_tokens.numel() - 1) // args.train_seq_len — inside eval_val()

val_tokens is never passed to any training loop. The training loop (lines 1056–1138) draws exclusively from DistributedTokenLoader backed by train_files. No multi-epoch or any-epoch fine-tuning on validation data occurs.

LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

NOT PRESENT. No TTT of any kind exists. Searched for: ttt, test.time, score.first, is_last_chunk. Zero matches.

HOLD scored-region SLOT

NOT PRESENT. No scored-region SLOT mechanism. Searched for scored.region, SLOT. Zero matches.

CLEAN pure neural

CONFIRMED. The architecture is a standard transformer with frozen random orthogonal projections (AdaptedLinear, lines 497–547) serving as non-trainable buffers, and small learned per-output-channel scale/shift adapters. Training uses Adam and Muon optimizers on train data only. Validation is inference-only (torch.inference_mode(), line 251). Quantization is post-training (lines 1157–1180), applied once after the training loop ends. No illegal feedback mechanisms detected.

Conclusion

Pure neural, standard training loop on train split, inference-only evaluation on val split, post-training quantization. No cheating patterns present.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTankaThe Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

@dentity007
Copy link
Copy Markdown
Author

Thanks for the audit. Confirming your read: the frozen orthogonal projections are initialized once as non-trainable buffers and the only trainable parameters are the diagonal scale/shift adapters plus embeddings. Classic frozen-backbone-with-diagonal-adapters setup.

The research finding is actually the interesting part of this one. With only 0.5% of parameters trainable (roughly 600K), the model reaches 2.51 BPB on sp1024 (vs 1.57 for the full baseline at the same step count). That gap quantifies how much of the transformer's inductive bias comes from the attention pattern, residual connections, and normalization rather than from learning specific projection directions. About 1 BPB of the baseline's performance comes from learning the actual weights, the rest comes from the architecture itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants