Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs#1146

Open
nguthiru wants to merge 1 commit into openai:main from nguthiru:eam-is-model

Conversation


@nguthiru nguthiru commented Mar 30, 2026

Non-Record: Discarding Transformers. Elastic Associative Memory as a Language Model

We replace the transformer at inference with an Elastic Associative Memory (EAM). Train a teacher transformer, write its hidden states into EAM through additive counter superposition, throw the teacher away.

Artifact: Encoder (426K params) + EAM (12K self-organized locations) + Decoder (1.6M params) = 14.1 MB. Zero transformer layers at eval time.

Results (FineWeb validation)

| Metric | Value |
| --- | --- |
| val_bpb | 2.4646 |
| Intelligence transfer (teacher → EAM) | 97.9% |
| EAM + kNN vs teacher | wins by 5.9% |
| Reconstruction cosine sim | 0.844 (at 50K locations) |

The bpb looks underwhelming because the teacher is deliberately weak — basic AdamW, 1 GPU, 68 seconds, 1 data shard, loss 3.69. Swap in a better teacher and the EAM gets better proportionally. The architecture doesn't care what sits upstream.

How it works

  1. Train a teacher transformer
  2. Write all teacher hidden states into EAM (self-organizing locations accumulate patterns via additive superposition)
  3. Train an encoder: tokens → EAM key space
  4. Train a decoder: EAM readout → logits
  5. Delete the teacher. Ship encoder + EAM + decoder.

Eval path: tokens → encoder → keys → EAM.read(keys) → decoder → logits, plus a flat kNN store queried under a score-first protocol (a minimal sketch of the path follows).
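
Concretely, the read side can be pictured as a soft top-k lookup over location keys. The sketch below is illustrative only: the class and function names (EAM, eval_step), the dimensions, and the softmax-weighted top-k read rule are assumptions, not the PR's actual implementation.

```python
# Illustrative sketch of the eval path; class/function names, dimensions,
# and the softmax top-k read rule are assumptions, not the PR's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EAM(nn.Module):
    """Elastic Associative Memory: a table of location keys plus
    accumulate-only counters, read by a soft top-k lookup."""
    def __init__(self, n_locations=12_000, key_dim=256, value_dim=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.register_buffer("loc_keys", F.normalize(torch.randn(n_locations, key_dim), dim=-1))
        self.register_buffer("counters", torch.zeros(n_locations, value_dim))

    def read(self, keys):
        # keys: (B, key_dim) query keys produced by the encoder
        sims = keys @ self.loc_keys.T                 # (B, n_locations)
        top = sims.topk(self.top_k, dim=-1)
        w = torch.softmax(top.values, dim=-1)         # (B, top_k) mixing weights
        return (w.unsqueeze(-1) * self.counters[top.indices]).sum(dim=1)

def eval_step(tokens, encoder, eam, decoder):
    keys = encoder(tokens)       # tokens -> EAM key space
    readout = eam.read(keys)     # superposed-pattern lookup, no transformer
    return decoder(readout)      # readout -> next-token logits
```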

What's going on here

14.1 MB of associative memory holds onto 97.9% of a transformer's learned representations. These are stored as counter superposition, not weight matrices. Because EAM counters are accumulate-only, new writes don't clobber old patterns — there's no mechanism for catastrophic forgetting, it just can't happen.
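
A minimal sketch of what "accumulate-only" means at the write step, continuing the illustrative EAM above; write(), the top-k routing, and the index_add_ formulation are assumptions rather than the PR's actual code.

```python
# Continuing the illustrative EAM sketch above: additive counter superposition.
# write() and the top-k routing are assumptions, not the PR's actual API.
import torch

@torch.no_grad()
def write(eam, keys, values):
    """Superpose a batch of teacher hidden states into the EAM counters.

    keys:   (B, key_dim)   keys for the states being stored
    values: (B, value_dim) teacher hidden states
    Counters are only ever added to, never overwritten or decayed, so
    earlier writes stay recoverable (up to superposition interference).
    """
    sims = keys @ eam.loc_keys.T                           # (B, n_locations)
    idx = sims.topk(eam.top_k, dim=-1).indices             # route to nearest locations
    flat_idx = idx.reshape(-1)                             # (B * top_k,)
    flat_val = values.repeat_interleave(eam.top_k, dim=0)  # (B * top_k, value_dim)
    eam.counters.index_add_(0, flat_idx, flat_val)         # accumulate only
```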

The kNN store turned out to be more than a fallback. EAM + kNN beats the teacher at every scale we tested, which we didn't expect going in. Superposed patterns and exact-match retrieval seem to cover for each other's weaknesses.
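
For reference, a rough sketch of how a score-first kNN blend can sit next to the EAM path: the current window is scored before anything from it enters the store, retrieval only ever sees earlier windows, and the write happens after scoring. FlatStore, the fixed blend weight, and all names here are illustrative assumptions, not the PR's eval loop.

```python
# Rough sketch of the score-first kNN protocol; FlatStore, the fixed blend
# weight, and all names are illustrative assumptions, not the PR's eval loop.
import torch

class FlatStore:
    def __init__(self):
        self.keys, self.probs = [], []

    def add(self, keys, probs):
        self.keys.append(keys)
        self.probs.append(probs)

    def lookup(self, query, k=4):
        if not self.keys:
            return None
        keys, probs = torch.cat(self.keys), torch.cat(self.probs)
        idx = (query @ keys.T).topk(min(k, keys.shape[0]), dim=-1).indices
        return probs[idx].mean(dim=1)              # (B, vocab) retrieved distribution

@torch.no_grad()
def score_window(tokens, targets, encoder, eam, decoder, store, alpha=0.3):
    keys = encoder(tokens)
    probs = decoder(eam.read(keys)).softmax(-1)    # score from the EAM path first
    retrieved = store.lookup(keys)                 # consult only earlier windows
    if retrieved is not None:
        probs = (1 - alpha) * probs + alpha * retrieved   # optional blend
    nll = -probs.gather(-1, targets[:, None]).log().mean()
    store.add(keys, probs)                         # write AFTER scoring, never before
    return nll
```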

Reconstruction quality tracks location count predictably (0.768 at 10K, 0.805 at 20K, 0.844 at 50K), and inference is fixed at 4 layers regardless of how deep the original teacher was.

The accumulate-only property also opens the door to continual learning — you can keep writing new knowledge into the same EAM without degrading what's already there, which is the opposite of how fine-tuning usually goes. Distillation becomes a write operation instead of a training run: any teacher's representations go straight into counters. And because the EAM itself is just a lookup table with simple arithmetic, inference doesn't need a GPU. The whole eval pipeline could run on a CPU or edge device at 14.1 MB.
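
As a toy usage example built on the hypothetical EAM/write sketches above (random tensors stand in for new teacher states), continual learning is just more calls to the same write path:

```python
# Toy continual-learning loop on top of the hypothetical EAM / write sketches
# above; random tensors stand in for new teacher hidden states.
import torch
import torch.nn.functional as F

eam = EAM()                                        # same illustrative memory as above
for step in range(3):                              # new knowledge arriving over time
    keys = F.normalize(torch.randn(64, 256), dim=-1)
    values = torch.randn(64, 256)                  # stand-in for fresh teacher states
    write(eam, keys, values)                       # pure addition: earlier counters untouched
# No gradients or optimizer state involved, so the same loop runs on a CPU.
```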

Based on [Elastic Associative Memory](https://doi.org/10.5281/zenodo.18783160) (Nguthiru, 2026). Details in the README.

Checklist

  • README.md
  • submission.json
  • train_gpt.py (runs standalone)
  • train_log.txt
  • requirements.txt
  • Artifact under 16 MB (14.1 MB)

@nguthiru nguthiru changed the title Non-record: Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs Mar 30, 2026
sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
Reviewer pointed out that the algorithm's originality was scattered across
the PR body (one block quote under Headline + a rANS-baseline table in the
middle + a Shannon-floor section at the bottom) and wasn't clearly
attributable. This commit adds a dedicated '## Originality' section right
after the Headline / trajectory table in both PR_BODY.md and README.md,
enumerating seven discrete contributions in order of impact:

  1. Custom rANS entropy codec for NN weights (prior in chain, openai#1123/openai#1146).
     THE ONLY submission in the entire competition pushing mixed-precision
     weights through a rANS codec -- MLP-up 2.32 bits/weight, MLP-down 1.20
     bits/weight, vs ~4.0 bits/weight for a naive Int4 baseline. This is
     why a 32.8 M-parameter model fits in 15 MB at all.

  2. Aggressive SLOT tuning for the 32 M regime (prior in chain, openai#1146).
     PR openai#1176's lr=0.003 steps=5 defaults are ~33x too small at 32 M scale.
     Stride=64 full-eval sweep showed SLOT is monotonically helpful up to
     steps=100 lr=0.1, delivering -0.087 bpb over the base eval.

  3. Phase 1A int6 tied-embedding quantization (new in this PR). EMBED_QUANT_BITS=6
     EMBED_QUANT_TOK_EMB=1 is a free -0.6 MB on the rANS artifact with zero
     bpb regression. Phase 1A sanity sweep established that int6 is the right
     operating point (vs pent_tok regression of +0.043).

  4. Phase 5a trivial-wins composition (new in this PR). QK-Gain 5.0 +
     MuonEq-R + EMA 0.9965 + hidden_mult 5 + int6 tied embed, all stacked on
     top of the rANS HybridQuant backbone. -0.010124 bpb over v6.1 SLOT-100.

  5. Shannon-floor empirical check (new in this PR). Inter-layer delta
     prediction experiment showed delta entropy >= raw-weight entropy across
     all 11 layers; rANS reaches 2.32 bits/weight on MLP-up vs a Shannon
     theoretical minimum of 2.28 bits/weight on the same tensors. First
     empirical confirmation in the competition that HybridQuant rANS is
     already entropy-bound at the single-token coder level.

  6. Negative-results catalog for the 32 M regime (new in this PR). 11
     completed-to-eval experiments (Phase 1B / 1C / 2A-C / 3 / 5b / 5b')
     documented so other submitters can skip them.

  7. Legal Muon-TTT non-competitive finding (new in this PR). 3-seed
     full-eval TTT mean 1.205215 vs SLOT-100 mean 1.136399, SLOT wins by
     0.069 bpb. Strong negative result: aggressive SLOT already captures
     most of what TTT can extract for a 32 M model.

Each item is tagged '(prior in this chain)' or '(new in this PR)' so
reviewers can cleanly separate what was introduced earlier in the v6.1
chain from what this specific PR contributes. No changes to the reported
bpb numbers -- this is purely an originality-claim clarification pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

## Architecture

PR #1146 ("EAM IS THE MODEL") is a non-standard architecture: it trains a Teacher transformer (discarded after training), then distills knowledge into an Elastic Associative Memory (EAM) with a small Encoder and Decoder. At inference, only Encoder + EAM + Decoder are used — no transformer runs at eval time.

## Checks

### N-gram / BigramHash family bug

No n-gram logic, BigramHash, or XOR hash key construction is present anywhere in the file. The EAM operates on normalized float key vectors produced by the Encoder, not discrete hash lookups. Not applicable.

### Pre-Quant TTT (multi-epoch on val_tokens)

The EAM, Encoder, and Decoder are all trained exclusively on train_tokens / train_x / train_y (lines 372–512). No gradient updates touch val_tokens at any point. val_tokens is only read during the sliding-window evaluation loop (lines 589–617) and never used for weight updates. No Pre-Quant TTT violation.

### Score-first TTT / flat_store

The FlatStore is a kNN retrieval augmentation used during eval. At lines 641–660, the eval loop:

1. Computes nll (scored) from the EAM model before consulting the flat_store (line 637).
2. Retrieves from flat_store (line 642) — only used to optionally blend probabilities if they improve the score (lines 643–650).
3. Accumulates loss_sum / token_count (lines 652–657).
4. Then writes current window keys/tokens to flat_store (line 660, commented "after scoring — legal").

This is a strict score-first, write-after pattern. The kNN store is never written before scoring the current window. There is no is_last_chunk guard here, but none is needed...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
