Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017) by dentity007 · Pull Request #1195 · openai/parameter-golf

dentity007 · 2026-03-31T20:41:01Z

Non-record: Learning Adapters on Random Linear Maps

val_bpb: 2.2017 | 1x RTX 5090 Ada 16GB, 180s wallclock | sp1024

Implements OpenAI's requested "Learning adapters on random linear maps" research direction.

Architecture

All major weight matrices (Q, K, V, output projection, MLP up, MLP down) are frozen random orthogonal projections
Only diagonal scale and shift adapters are trainable (~0.5% of total parameters)
Tests whether random linear maps provide sufficient structure for language modeling when combined with learned element-wise transformations
Base config: 9 layers, d=512, 8 heads, sp1024 vocab, MLP 2x

Results

Metric	Value
val_bpb (final)	2.2017
Trainable params	~0.5% of total
Training time	180s (1x RTX 5090)

Key Findings

Random projections carry surprising capacity. At 2.2017 BPB, the model learns meaningful language structure despite 99.5% of weights being completely random. This is far better than chance (~8 BPB for uniform prediction over 1024 tokens).
The result aligns with the lottery ticket hypothesis. Random networks contain useful substructures. The diagonal adapters find and amplify these structures.
Artifact size is minimal. Only the adapters, embeddings, and layer norms need to be stored. The random weights can be regenerated from the seed at eval time.
The gap to the baseline (2.20 vs 1.22) quantifies what learned linear maps contribute. Roughly 1.0 BPB of the baseline's performance comes from learning the actual projection directions, not just their scales.

Comparison to Naive Baseline

	Naive Baseline	Random Adapters
Trainable weights	100%	0.5%
val_bpb	1.2244	2.2017
Artifact size	~15 MB	< 1 MB (adapters only)

Reproduction

pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
MAX_WALLCLOCK_SECONDS=180 python3 train_gpt_random_adapters.py

Discussion

This result establishes a useful lower bound: how much can you learn with frozen random structure? The 2.2017 BPB result suggests that element-wise scaling of random projections captures roughly 75% of the distance from uniform to a trained baseline. This has implications for model compression, pruning, and understanding what neural networks actually learn.

Potential directions: combining random projections with low-rank learned corrections (LoRA on top of random), or using structured random matrices (Hadamard, Toeplitz) instead of dense orthogonal.

Would appreciate thoughts on whether this direction is worth pursuing further.

Credits

Script: train_gpt_random_adapters.py
Implements OpenAI's requested "Learning adapters on random linear maps" direction from the README.

…er optimization, and SSM exploration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dentity007 · 2026-04-11T20:38:32Z

Research Expansion: Ablation Results

Ran an overnight ablation study on DGX Spark GB10 to expand on this submission. 200 training steps, sp1024, no torch.compile.

Results

Run	Config	Trainable	val_bpb	ms/step
RND-1	Default (0.5%)	~600K	2.5123	894
RND-2	Wider adapters (rank 8)	~480K	2.6323	881
RND-3	5% unfrozen	~600K	2.5120	895
RND-4	Progressive unfreeze	~600K	2.5122	894

Finding

All configurations land near 2.51 BPB regardless of adapter configuration. Wider adapters (RND-2) actually hurt, landing at 2.63. The progressive unfreezing strategy (RND-4) shows no benefit at 200 steps.

The gap from 2.51 to the trained baseline (~1.57 at 200 steps) represents the value of learning actual projection directions, not just scales. Interesting implication: transformer architecture itself (attention patterns, residuals, norms) provides most of the inductive bias. Learned weights add ~1 BPB of improvement on top of random projections.

Full raw data and logs: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

MatoTeziTanka · 2026-04-12T06:02:57Z

Community Review — Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1195 ("Learning Adapters on Random Linear Maps") by NathanMaine submits a non-record entry with val_bpb=2.2017 on 1×RTX5090. The submission directory is records/track_non_record_16mb/2026-03-31_RandomLinearMapAdapters/.

Checks performed

ILLEGAL n-gram family bug

NOT PRESENT. Searched for: ngram, bigram, BigramHash, trigram, hash, XOR. Zero matches. No hash-key manipulation or target-XOR'd-into-hash patterns anywhere in the file.

ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

NOT PRESENT. The only references to val_tokens are:

Line 871: val_tokens = load_validation_tokens(...) — loading at startup
Line 243: total_seqs = (val_tokens.numel() - 1) // args.train_seq_len — inside eval_val()

val_tokens is never passed to any training loop. The training loop (lines 1056–1138) draws exclusively from DistributedTokenLoader backed by train_files. No multi-epoch or any-epoch fine-tuning on validation data occurs.

LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

NOT PRESENT. No TTT of any kind exists. Searched for: ttt, test.time, score.first, is_last_chunk. Zero matches.

HOLD scored-region SLOT

NOT PRESENT. No scored-region SLOT mechanism. Searched for scored.region, SLOT. Zero matches.

CLEAN pure neural

CONFIRMED. The architecture is a standard transformer with frozen random orthogonal projections (AdaptedLinear, lines 497–547) serving as non-trainable buffers, and small learned per-output-channel scale/shift adapters. Training uses Adam and Muon optimizers on train data only. Validation is inference-only (torch.inference_mode(), line 251). Quantization is post-training (lines 1157–1180), applied once after the training loop ends. No illegal feedback mechanisms detected.

Conclusion

Pure neural, standard training loop on train split, inference-only evaluation on val split, post-training quantization. No cheating patterns present.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

dentity007 · 2026-04-15T03:23:56Z

Thanks for the audit. Confirming your read: the frozen orthogonal projections are initialized once as non-trainable buffers and the only trainable parameters are the diagonal scale/shift adapters plus embeddings. Classic frozen-backbone-with-diagonal-adapters setup.

The research finding is actually the interesting part of this one. With only 0.5% of parameters trainable (roughly 600K), the model reaches 2.51 BPB on sp1024 (vs 1.57 for the full baseline at the same step count). That gap quantifies how much of the transformer's inductive bias comes from the attention pattern, residual connections, and normalization rather than from learning specific projection directions. About 1 BPB of the baseline's performance comes from learning the actual weights, the rest comes from the architecture itself.

dentity007 and others added 3 commits March 30, 2026 19:12

Add approach notes for parameter golf challenge

ad23b7f

Update approach with depth recurrence, factorized embeddings, tokeniz…

300eb5c

…er optimization, and SSM exploration

Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)

f2c462d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dentity007 closed this Apr 1, 2026

dentity007 reopened this Apr 1, 2026

This was referenced Apr 13, 2026

Non-record: 11L XSA4 + EMA + SDTTT (3-seed mean val_bpb=1.1287) #406

Open

Record: 11L XSA4 + EMA + LoRA TTT + Partial RoPE + dim480 — val_bpb 1.13112 (3-seed) #1127

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)#1195

Non-record: Learning Adapters on Random Linear Maps (val_bpb 2.2017)#1195
dentity007 wants to merge 3 commits intoopenai:mainfrom
NathanMaine:research/random-adapters

dentity007 commented Mar 31, 2026 •

edited

Loading

Uh oh!

dentity007 commented Apr 11, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

dentity007 commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dentity007 commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!