81 changes: 81 additions & 0 deletions records/track_10min_16mb/2026-04-16_RandomBasisMLP_LoRA/README.md
@@ -0,0 +1,81 @@
# Random-Basis MLPs + LoRA

This PR does **not** attempt to beat the 1.0985 SOTA and is not claiming to.

## TL;DR of the partial seed-42 run

| Seed | Steps | Stopped | Pre-quant val_bpb | Post-quant val_bpb | Artifact |
|------|-------|---------|-------------------|---------------------|----------|
| 42 | 2564 / 20000 | wallclock_cap | **1.25536** | _not captured¹_ | _not captured¹_ |

¹ The first training pass crashed at `serialize(...)` because the pod's image lacked `brotli`, and the pod then promptly ran out of Runpod credits.
The partial training log itself is attached as [`train_seed42_partial.log`](train_seed42_partial.log).
With a (re-)deployed pod it is straightforward to produce a complete pre-/post-quant + artifact-size line, since the training loop itself runs
cleanly end-to-end.

## Why it's interesting


Compression is not the same as entropy coding. Most entries compete to pack trained weights tighter into int6 to squeeze out every bit of redundancy. Random matrices don't work that way: they have maximum entropy, so entropy coders can't touch them. But a 4-byte seed can regenerate an arbitrarily large random matrix from scratch, and the 16 MB budget counts only the bytes written into the artifact, not the computation that rebuilds weights at load time. So instead of compressing weights, this PR stores seeds.
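To make the seed-not-weights point concrete, here is a minimal illustration (not the repo's code; the 8192×512 shape is the up-projection implied by `dim=512`, `mlp_mult=16`): the same 4-byte seed always rematerializes the same Gaussian matrix, so only the seed has to be stored.

```python
import torch

# A 4-byte seed deterministically regenerates an arbitrarily large Gaussian matrix,
# so the matrix itself never needs to appear in the artifact.
def materialize(seed: int, shape=(8192, 512)) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    return torch.randn(shape, generator=g)

assert torch.equal(materialize(0xD15EA5E), materialize(0xD15EA5E))
```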

Rahimi & Recht (2007) showed that random feature maps can stand in for learned ones: given enough width, a frozen Gaussian random projection followed by a fixed nonlinearity (ReLU² here) can, in principle, represent any target function. All of that capacity comes for free, from nothing but the seed. LoRA (Hu et al., 2021) then provides the small learnable layer on top: a rank-16 correction and a per-hidden gate that take the random features and push them toward directions that are actually useful for the task.

Of the learnable pieces, the per-hidden diagonal gate does the most work. A gate of 0 kills a feature entirely (the same logic as Lottery Ticket pruning); a gate > 1 amplifies it. This also explains the hidden-width choice: everyone else pays int6 cost per element, so wider hidden layers are expensive, while the random hidden layer here costs nothing but the seed. That is why 4× the SOTA hidden width (mlp_mult=16 vs 4) becomes affordable.
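A minimal sketch of the resulting MLP block, assuming the shapes from the artifact-math table below and the per-layer seed rule from the hyperparameters section; the LoRA placement and scaling here are illustrative guesses, not the exact module in `train_gpt.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomBasisMLP(nn.Module):
    """Frozen seed-generated Gaussian basis + LoRA correction + per-hidden gate (sketch)."""

    def __init__(self, dim=512, mlp_mult=16, rank=16, layer_idx=0, base_seed=0xD15EA5E):
        super().__init__()
        hidden = dim * mlp_mult                      # 8192 at dim=512, mlp_mult=16
        seed = base_seed + 1009 * layer_idx          # deterministic per-layer seed
        g = torch.Generator().manual_seed(seed)
        # Frozen random bases: rematerialized from the seed at load time, never serialized
        # (persistent=False keeps them out of the state_dict). 1/sqrt(fan_in) scaling is a guess.
        self.register_buffer("W_up", torch.randn(hidden, dim, generator=g) / dim**0.5,
                             persistent=False)
        self.register_buffer("W_down", torch.randn(hidden, dim, generator=g).T / hidden**0.5,
                             persistent=False)
        # Learnable LoRA pairs; both share the (hidden, rank) x (rank, dim) shapes from the table.
        self.A_up = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B_up = nn.Parameter(torch.zeros(hidden, rank))
        self.A_down = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B_down = nn.Parameter(torch.zeros(hidden, rank))
        # Learnable per-hidden diagonal gate: 0 prunes a random feature, >1 amplifies it.
        self.gate = nn.Parameter(torch.ones(hidden))

    def forward(self, x):
        h = F.linear(x, self.W_up + self.B_up @ self.A_up)   # (..., hidden)
        h = F.relu(h).square() * self.gate                    # ReLU^2 features, then the gate
        return F.linear(h, self.W_down + (self.B_down @ self.A_down).T)
```

The only tensors this block contributes to the artifact are the four small LoRA matrices and the gate; the two large bases come back from the seed at load time.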


## Artifact math

At `dim=512`, `mlp_mult=16`, `lora_rank=16`, `num_layers=11`:

| Component | Baseline (mlp_mult=4) | This submission (mlp_mult=16) |
|----------------------------|--------------------------------|--------------------------------------------------------|
| MLP weights stored | 11 × (2 M params) int6 ≈ 17 MB | 11 × 0 params ≈ 4 bytes total (one seed) |
| LoRA A (16, 512) × 2 | — | 16 K params × 11 (small -> fp16 passthrough) ≈ 0.35 MB |
| LoRA B (8192, 16) × 2 | — | 262 K params × 11 int6 ≈ 2.4 MB |
| Per-hidden gate (8192,) | — | 8 K params × 11 fp32 passthrough ≈ 0.35 MB |
| Attention, embed, norms | ~8 MB | ~8 MB |
| Code | 68 KB | 74 KB |
| **Total (pre-compression)** | ~25 MB -> ~16 MB after brotli | ~11 MB -> ~8–10 MB after brotli |

We net ~4–6 MB of headroom even while running a 4× wider hidden
dimension. That headroom is available to spend on more layers, a bigger
vocab (SP8192/SP12288), or a longer training schedule.
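A quick back-of-the-envelope check of the table rows (int6 ≈ 0.75 bytes/param; GPTQ group-scale metadata and container overhead are left out, so these land slightly under the table's figures):

```python
dim, mult, rank, layers = 512, 16, 16, 11
hidden = dim * mult                                  # 8192
MB = 1e6

lora_A = 2 * rank * dim * layers * 2                 # two A matrices per layer, fp16 (2 B/param)
lora_B = 2 * hidden * rank * layers * 0.75           # two B matrices per layer, int6 (0.75 B/param)
gate   = hidden * layers * 4                         # per-hidden gate, fp32 passthrough
baseline_mlp = 2 * dim * (4 * dim) * layers * 0.75   # mlp_mult=4 baseline MLP weights, int6

print(f"LoRA A : {lora_A / MB:.2f} MB")              # ~0.36 MB
print(f"LoRA B : {lora_B / MB:.2f} MB")              # ~2.2 MB
print(f"gate   : {gate / MB:.2f} MB")                # ~0.36 MB
print(f"baseline MLP (int6): {baseline_mlp / MB:.1f} MB")   # ~17.3 MB
```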

## Hyperparameters

```bash
# Defaults in Hyperparameters — can be overridden via env vars
MLP_MULT=16
LORA_RANK=16
RANDOM_BASIS_ENABLED=1
RANDOM_BASIS_SEED=0xD15EA5E # deterministic per-layer via base_seed + 1009*layer_idx
NUM_LAYERS=11
MODEL_DIM=512
VOCAB_SIZE=4096 # same tokenizer path as PR #1218 backbone
```
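A hypothetical sketch of how these overrides could be read (the real plumbing lives in `train_gpt.py`'s Hyperparameters; only the names, defaults, and seed rule above are taken from this README):

```python
import os

def env_int(name: str, default: str) -> int:
    # base=0 accepts both plain decimals and hex strings like "0xD15EA5E"
    return int(os.environ.get(name, default), 0)

MLP_MULT          = env_int("MLP_MULT", "16")
LORA_RANK         = env_int("LORA_RANK", "16")
RANDOM_BASIS_SEED = env_int("RANDOM_BASIS_SEED", "0xD15EA5E")
NUM_LAYERS        = env_int("NUM_LAYERS", "11")

# per-layer seed, as noted above: base_seed + 1009 * layer_idx
layer_seeds = [RANDOM_BASIS_SEED + 1009 * i for i in range(NUM_LAYERS)]
```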

Training/optimizer hyperparameters (Muon, warmdown, WD, EMA, GPTQ) are inherited unchanged from the forked backbone
(`records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/`) to keep the ablation clean: the only change is the MLP module.

## Command(s)

```bash
RUN_ID=rbmlp_seed42 \
DATA_PATH=./data/datasets/fineweb10B_sp4096/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \
SEED=42 VOCAB_SIZE=4096 \
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-04-16_RandomBasisMLP_LoRA/train_gpt.py
```


## Attribution

- Backbone forked from
[`records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085`](../2026-04-01_Vocab4096_MLPMult4_WD085/)
(Kevin Clark, PR #1218).
- The LoRA pattern is a direct adaptation of the one in
[`records/track_10min_16mb/2026-03-17_LoRA_TTT`](../2026-03-17_LoRA_TTT/)
(samacqua).
- Random-features math: Rahimi & Recht, "Random Features for Large-Scale Kernel Machines", NeurIPS 2007. [nips.cc](https://papers.nips.cc/paper_files/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html)
- Low-rank adaptation: Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", ICLR 2022. [arxiv.org](https://arxiv.org/abs/2106.09685)
@@ -0,0 +1,6 @@
# torch and flash-attn-3 are installed separately in setup.sh
brotli
huggingface-hub
numpy
sentencepiece
tqdm
@@ -0,0 +1,33 @@
{
"author": "Camden Rush",
"github_id": "camden-git",
"name": "Random-Basis MLPs + LoRA (Free Weights via PRNG Seed)",
"blurb": "MLP weight matrices are stored as a 4-byte PRNG seed. The Gaussian basis is frozen and rematerialized at eval, so nothing large ever hits the artifact, just a LoRA correction (r=16) and a per-hidden diagonal gate. This frees up the hidden width. SOTA runs at 4x expansion; this runs at 16x, which similar to the Gaussian basis does not touch the file size budget. First partial run: val_bpb=1.2554 pre-quant at 2564/20000 steps, wallclock-capped with loss still descending. The 4x wider hidden cuts throughput to roughly 70% of baseline, which is the expected tradeoff.",
"date": "2026-04-16",
"track": "10min_16mb",
"val_loss": null,
"val_bpb": null,
"val_bpb_std": null,
"seeds": [42],
"seed_results": {
"42": {
"val_loss_pre_quant": 2.88861509,
"val_bpb_pre_quant": 1.25535884,
"steps": 2564,
"steps_planned": 20000,
"stopped": "wallclock_cap",
"train_time_ms": 590061,
"tok_per_sec": 3417153,
"model_params": 14486619,
"post_quant_val_bpb": null,
"artifact_bytes": null,
"notes": "Serialize path (GPTQ + brotli) not exercised in this run due to a missing brotli dep on the pod. Fixed in requirements.txt."
}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"bytes_total": null,
"bytes_code": 74191,
"status": "non_record_exploratory",
"notes": "Submission format filed under non-record / exploratory track per repo rules."
}