Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs #1146
Open
nguthiru wants to merge 1 commit into openai:main from
Conversation
sisegod added a commit to sisegod/parameter-golf that referenced this pull request on Apr 8, 2026
Reviewer pointed out that the algorithm's originality was scattered across the PR body (one block quote under Headline + a rANS-baseline table in the middle + a Shannon-floor section at the bottom) and wasn't clearly attributable. This commit adds a dedicated '## Originality' section right after the Headline / trajectory table in both PR_BODY.md and README.md, enumerating seven discrete contributions in order of impact:

1. Custom rANS entropy codec for NN weights (prior in chain, openai#1123/openai#1146). THE ONLY submission in the entire competition pushing mixed-precision weights through a rANS codec -- MLP-up 2.32 bits/weight, MLP-down 1.20 bits/weight, vs ~4.0 bits/weight for a naive Int4 baseline. This is why a 32.8 M-parameter model fits in 15 MB at all.
2. Aggressive SLOT tuning for the 32 M regime (prior in chain, openai#1146). PR openai#1176's lr=0.003 steps=5 defaults are ~33x too small at 32 M scale. A stride=64 full-eval sweep showed SLOT is monotonically helpful up to steps=100 lr=0.1, delivering -0.087 bpb over the base eval.
3. Phase 1A int6 tied-embedding quantization (new in this PR). EMBED_QUANT_BITS=6 EMBED_QUANT_TOK_EMB=1 is a free -0.6 MB on the rANS artifact with zero bpb regression. The Phase 1A sanity sweep established that int6 is the right operating point (vs a pent_tok regression of +0.043).
4. Phase 5a trivial-wins composition (new in this PR). QK-Gain 5.0 + MuonEq-R + EMA 0.9965 + hidden_mult 5 + int6 tied embed, all stacked on top of the rANS HybridQuant backbone. -0.010124 bpb over v6.1 SLOT-100.
5. Shannon-floor empirical check (new in this PR). An inter-layer delta-prediction experiment showed delta entropy >= raw-weight entropy across all 11 layers; rANS reaches 2.32 bits/weight on MLP-up vs a Shannon theoretical minimum of 2.28 bits/weight on the same tensors. First empirical confirmation in the competition that HybridQuant rANS is already entropy-bound at the single-token coder level.
6. Negative-results catalog for the 32 M regime (new in this PR). 11 completed-to-eval experiments (Phase 1B / 1C / 2A-C / 3 / 5b / 5b') documented so other submitters can skip them.
7. Legal Muon-TTT non-competitive finding (new in this PR). 3-seed full-eval TTT mean 1.205215 vs SLOT-100 mean 1.136399; SLOT wins by 0.069 bpb. Strong negative result: aggressive SLOT already captures most of what TTT can extract for a 32 M model.

Each item is tagged '(prior in this chain)' or '(new in this PR)' so reviewers can cleanly separate what was introduced earlier in the v6.1 chain from what this specific PR contributes. No changes to the reported bpb numbers -- this is purely an originality-claim clarification pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
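For readers unfamiliar with the int6 tied-embedding step in item 3, a minimal numpy sketch of symmetric per-tensor 6-bit quantization is shown below. The function names, per-tensor scaling choice, and shapes are illustrative assumptions, not the parameter-golf code behind EMBED_QUANT_BITS=6.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor quantization to 6-bit signed integers in [-31, 31]."""
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover float weights; a tied embedding is reused as the output head."""
    return q.astype(np.float32) * scale

# Illustrative tied embedding table (vocab x d_model), not the real checkpoint
emb = (np.random.default_rng(0).standard_normal((50_304, 384)) * 0.02).astype(np.float32)
q, s = quantize_int6(emb)
print("max abs reconstruction error:", float(np.abs(dequantize_int6(q, s) - emb).max()))
```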
Community Review — Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

## Architecture

PR #1146 ("EAM IS THE MODEL") is a non-standard architecture: it trains a Teacher transformer (discarded after training), then distills knowledge into an Elastic Associative Memory (EAM) with a small Encoder and Decoder. At inference, only Encoder + EAM + Decoder are used — no transformer runs at eval time.

## Checks

### N-gram / BigramHash family bug

No n-gram logic, BigramHash, or XOR hash key construction is present anywhere in the file. The EAM operates on normalized float key vectors produced by the Encoder, not discrete hash lookups. Not applicable.

### Pre-Quant TTT (multi-epoch on val_tokens)

The EAM, Encoder, and Decoder are all trained exclusively on
Non-Record: Discarding Transformers. Elastic Associative Memory as a Language Model
We replace the transformer at inference with an Elastic Associative Memory (EAM). Train a teacher transformer, write its hidden states into EAM through additive counter superposition, throw the teacher away.
Artifact: Encoder (426K params) + EAM (12K self-organized locations) + Decoder (1.6M params) = 14.1 MB. Zero transformer layers at eval time.
Results (FineWeb validation)
The bpb looks underwhelming because the teacher is deliberately weak — basic AdamW, 1 GPU, 68 seconds, 1 data shard, loss 3.69. Swap in a better teacher and the EAM gets better proportionally. The architecture doesn't care what sits upstream.
How it works
Eval path:
tokens → encoder → keys → EAM.read(keys) → decoder → logits, plus a flat kNN store using a score-first protocol.
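A minimal numpy sketch of that read path, with stand-in encoder/decoder weights and an assumed similarity-weighted counter readout; the flat kNN store and its score-first mixing are left out, and none of the names below come from the submission's train_gpt.py:

```python
import numpy as np

def eam_read(keys, eam_keys, eam_counters, top_k=32):
    """Similarity-weighted readout from the accumulate-only counter memory."""
    sims = keys @ eam_keys.T                          # (T, n_locations)
    idx = np.argsort(-sims, axis=1)[:, :top_k]        # strongest locations per token
    out = np.zeros((keys.shape[0], eam_counters.shape[1]), dtype=np.float32)
    for t in range(keys.shape[0]):
        w = np.exp(sims[t, idx[t]] - sims[t, idx[t]].max())
        out[t] = (w / w.sum()) @ eam_counters[idx[t]]
    return out

# Illustrative shapes only: 12K locations, d=256, GPT-2-style vocab
d, n_loc, vocab = 256, 12_000, 50_304
rng = np.random.default_rng(0)
eam_keys     = rng.standard_normal((n_loc, d)).astype(np.float32)
eam_keys    /= np.linalg.norm(eam_keys, axis=1, keepdims=True)
eam_counters = rng.standard_normal((n_loc, d)).astype(np.float32)
W_dec        = (rng.standard_normal((d, vocab)) * 0.02).astype(np.float32)

keys  = rng.standard_normal((8, d)).astype(np.float32)   # stand-in encoder output
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
logits = eam_read(keys, eam_keys, eam_counters) @ W_dec   # (8, vocab)
```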
What's going on here

14.1 MB of associative memory holds onto 97.9% of a transformer's learned representations. These are stored as counter superposition, not weight matrices. Because EAM counters are accumulate-only, new writes don't clobber old patterns — there's no mechanism for catastrophic forgetting, it just can't happen.
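A toy illustration of why accumulate-only writes cannot clobber earlier patterns, assuming a nearest-location assignment and plain additive counters (both assumptions on our part, not the paper's exact update rule): after a second batch of writes, the first batch's contribution is still present in full, by construction.

```python
import numpy as np

def eam_write(keys, hidden, eam_keys, eam_counters):
    """Additive counter superposition: writes only ever accumulate."""
    loc = np.argmax(keys @ eam_keys.T, axis=1)   # nearest location per token (assumed rule)
    np.add.at(eam_counters, loc, hidden)         # unbuffered scatter-add, never overwrite

rng = np.random.default_rng(0)
d, n_loc = 64, 1_000
eam_keys = rng.standard_normal((n_loc, d)).astype(np.float32)
counters = np.zeros((n_loc, d), dtype=np.float32)

old_keys = rng.standard_normal((128, d)).astype(np.float32)
old_hid  = rng.standard_normal((128, d)).astype(np.float32)
new_keys = rng.standard_normal((128, d)).astype(np.float32)
new_hid  = rng.standard_normal((128, d)).astype(np.float32)

eam_write(old_keys, old_hid, eam_keys, counters)
snapshot = counters.copy()
eam_write(new_keys, new_hid, eam_keys, counters)

new_only = np.zeros_like(counters)
eam_write(new_keys, new_hid, eam_keys, new_only)
assert np.allclose(counters, snapshot + new_only)  # earlier writes were not clobbered
```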
The kNN store turned out to be more than a fallback. EAM + kNN beats the teacher at every scale we tested, which we didn't expect going in. Superposed patterns and exact-match retrieval seem to cover for each other's weaknesses.
Reconstruction quality tracks location count predictably (0.768 at 10K, 0.805 at 20K, 0.844 at 50K), and inference is fixed at 4 layers regardless of how deep the original teacher was.
The accumulate-only property also opens the door to continual learning — you can keep writing new knowledge into the same EAM without degrading what's already there, which is the opposite of how fine-tuning usually goes. Distillation becomes a write operation instead of a training run: any teacher's representations go straight into counters. And because the EAM itself is just a lookup table with simple arithmetic, inference doesn't need a GPU. The whole eval pipeline could run on a CPU or edge device at 14.1 MB.
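Under that framing, a distillation pass is just a streaming write of teacher hidden states. A hypothetical sketch follows; teacher_hidden and encoder_keys are placeholders for a real frozen teacher and the trained encoder, not the submission's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_loc = 64, 1_000
eam_keys     = rng.standard_normal((n_loc, d)).astype(np.float32)
eam_counters = np.zeros((n_loc, d), dtype=np.float32)

def write(keys, hidden):
    """One 'distillation step' is an additive write: no gradients, no optimizer."""
    loc = np.argmax(keys @ eam_keys.T, axis=1)
    np.add.at(eam_counters, loc, hidden)

def teacher_hidden(tokens):
    # Placeholder for a frozen teacher's hidden states at these positions.
    return rng.standard_normal((len(tokens), d)).astype(np.float32)

def encoder_keys(tokens):
    # Placeholder for the small trained encoder that produces normalized keys.
    k = rng.standard_normal((len(tokens), d)).astype(np.float32)
    return k / np.linalg.norm(k, axis=1, keepdims=True)

for batch in np.array_split(np.arange(4_096), 8):  # stream the corpus once
    write(encoder_keys(batch), teacher_hidden(batch))
# After the loop the teacher can be deleted; only eam_keys/eam_counters (plus
# the encoder and decoder weights) go into the shipped artifact.
```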
Based on [Elastic Associative Memory](https://doi.org/10.5281/zenodo.18783160) (Nguthiru, 2026). Details in the README.
Checklist
- README.md
- submission.json
- train_gpt.py (runs standalone)
- train_log.txt
- requirements.txt
- Artifact under 16 MB (14.1 MB)