Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs#1146

Open
nguthiru wants to merge 1 commit into openai:main from nguthiru:eam-is-model

Conversation


@nguthiru nguthiru commented Mar 30, 2026

Non-Record: Discarding Transformers. Elastic Associative Memory as a Language Model

We replace the transformer at inference with an Elastic Associative Memory (EAM). Train a teacher transformer, write its hidden states into EAM through additive counter superposition, throw the teacher away.

Artifact: Encoder (426K params) + EAM (12K self-organized locations) + Decoder (1.6M params) = 14.1 MB. Zero transformer layers at eval time.

Results (FineWeb validation)

| Metric | Value |
| --- | --- |
| val_bpb | 2.4646 |
| Intelligence transfer (teacher → EAM) | 97.9% |
| EAM + kNN vs teacher | wins by 5.9% |
| Reconstruction cosine sim | 0.844 (at 50K locations) |

The bpb looks underwhelming because the teacher is deliberately weak — basic AdamW, 1 GPU, 68 seconds, 1 data shard, loss 3.69. Swap in a better teacher and the EAM gets better proportionally. The architecture doesn't care what sits upstream.

How it works

  1. Train a teacher transformer
  2. Write all teacher hidden states into EAM (self-organizing locations accumulate patterns via additive superposition)
  3. Train an encoder: tokens → EAM key space
  4. Train a decoder: EAM readout → logits
  5. Delete the teacher. Ship encoder + EAM + decoder.

Eval path: tokens → encoder → keys → EAM.read(keys) → decoder → logits, plus a flat kNN store queried under a score-first protocol (a minimal sketch of the path follows).
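
Concretely, the read side can be pictured as a soft top-k lookup over location keys. The sketch below is illustrative only: the class and function names (EAM, eval_step), the dimensions, and the softmax-weighted top-k read rule are assumptions, not the PR's actual implementation.

```python
# Illustrative sketch of the eval path; class/function names, dimensions,
# and the softmax top-k read rule are assumptions, not the PR's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EAM(nn.Module):
    """Elastic Associative Memory: a table of location keys plus
    accumulate-only counters, read by a soft top-k lookup."""
    def __init__(self, n_locations=12_000, key_dim=256, value_dim=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.register_buffer("loc_keys", F.normalize(torch.randn(n_locations, key_dim), dim=-1))
        self.register_buffer("counters", torch.zeros(n_locations, value_dim))

    def read(self, keys):
        # keys: (B, key_dim) query keys produced by the encoder
        sims = keys @ self.loc_keys.T                 # (B, n_locations)
        top = sims.topk(self.top_k, dim=-1)
        w = torch.softmax(top.values, dim=-1)         # (B, top_k) mixing weights
        return (w.unsqueeze(-1) * self.counters[top.indices]).sum(dim=1)

def eval_step(tokens, encoder, eam, decoder):
    keys = encoder(tokens)       # tokens -> EAM key space
    readout = eam.read(keys)     # superposed-pattern lookup, no transformer
    return decoder(readout)      # readout -> next-token logits
```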

What's going on here

14.1 MB of associative memory holds onto 97.9% of a transformer's learned representations. These are stored as counter superposition, not weight matrices. Because EAM counters are accumulate-only, new writes don't clobber old patterns — there's no mechanism for catastrophic forgetting, it just can't happen.
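
A minimal sketch of what "accumulate-only" means at the write step, continuing the illustrative EAM above; write(), the top-k routing, and the index_add_ formulation are assumptions rather than the PR's actual code.

```python
# Continuing the illustrative EAM sketch above: additive counter superposition.
# write() and the top-k routing are assumptions, not the PR's actual API.
import torch

@torch.no_grad()
def write(eam, keys, values):
    """Superpose a batch of teacher hidden states into the EAM counters.

    keys:   (B, key_dim)   keys for the states being stored
    values: (B, value_dim) teacher hidden states
    Counters are only ever added to, never overwritten or decayed, so
    earlier writes stay recoverable (up to superposition interference).
    """
    sims = keys @ eam.loc_keys.T                           # (B, n_locations)
    idx = sims.topk(eam.top_k, dim=-1).indices             # route to nearest locations
    flat_idx = idx.reshape(-1)                             # (B * top_k,)
    flat_val = values.repeat_interleave(eam.top_k, dim=0)  # (B * top_k, value_dim)
    eam.counters.index_add_(0, flat_idx, flat_val)         # accumulate only
```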

The kNN store turned out to be more than a fallback. EAM + kNN beats the teacher at every scale we tested, which we didn't expect going in. Superposed patterns and exact-match retrieval seem to cover for each other's weaknesses.
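
For reference, a rough sketch of how a score-first kNN blend can sit next to the EAM path: the current window is scored before anything from it enters the store, retrieval only ever sees earlier windows, and the write happens after scoring. FlatStore, the fixed blend weight, and all names here are illustrative assumptions, not the PR's eval loop.

```python
# Rough sketch of the score-first kNN protocol; FlatStore, the fixed blend
# weight, and all names are illustrative assumptions, not the PR's eval loop.
import torch

class FlatStore:
    def __init__(self):
        self.keys, self.probs = [], []

    def add(self, keys, probs):
        self.keys.append(keys)
        self.probs.append(probs)

    def lookup(self, query, k=4):
        if not self.keys:
            return None
        keys, probs = torch.cat(self.keys), torch.cat(self.probs)
        idx = (query @ keys.T).topk(min(k, keys.shape[0]), dim=-1).indices
        return probs[idx].mean(dim=1)              # (B, vocab) retrieved distribution

@torch.no_grad()
def score_window(tokens, targets, encoder, eam, decoder, store, alpha=0.3):
    keys = encoder(tokens)
    probs = decoder(eam.read(keys)).softmax(-1)    # score from the EAM path first
    retrieved = store.lookup(keys)                 # consult only earlier windows
    if retrieved is not None:
        probs = (1 - alpha) * probs + alpha * retrieved   # optional blend
    nll = -probs.gather(-1, targets[:, None]).log().mean()
    store.add(keys, probs)                         # write AFTER scoring, never before
    return nll
```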

Reconstruction quality tracks location count predictably (0.768 at 10K, 0.805 at 20K, 0.844 at 50K), and inference is fixed at 4 layers regardless of how deep the original teacher was.

The accumulate-only property also opens the door to continual learning — you can keep writing new knowledge into the same EAM without degrading what's already there, which is the opposite of how fine-tuning usually goes. Distillation becomes a write operation instead of a training run: any teacher's representations go straight into counters. And because the EAM itself is just a lookup table with simple arithmetic, inference doesn't need a GPU. The whole eval pipeline could run on a CPU or edge device at 14.1 MB.
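
As a toy usage example built on the hypothetical EAM/write sketches above (random tensors stand in for new teacher states), continual learning is just more calls to the same write path:

```python
# Toy continual-learning loop on top of the hypothetical EAM / write sketches
# above; random tensors stand in for new teacher hidden states.
import torch
import torch.nn.functional as F

eam = EAM()                                        # same illustrative memory as above
for step in range(3):                              # new knowledge arriving over time
    keys = F.normalize(torch.randn(64, 256), dim=-1)
    values = torch.randn(64, 256)                  # stand-in for fresh teacher states
    write(eam, keys, values)                       # pure addition: earlier counters untouched
# No gradients or optimizer state involved, so the same loop runs on a CPU.
```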

Based on [Elastic Associative Memory](https://doi.org/10.5281/zenodo.18783160) (Nguthiru, 2026). Details in the README.

Checklist

  • README.md
  • submission.json
  • train_gpt.py (runs standalone)
  • train_log.txt
  • requirements.txt
  • Artifact under 16 MB (14.1 MB)

@nguthiru nguthiru changed the title Non-record: Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs Mar 30, 2026
sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
Reviewer pointed out that the algorithm's originality was scattered across
the PR body (one block quote under Headline + a rANS-baseline table in the
middle + a Shannon-floor section at the bottom) and wasn't clearly
attributable. This commit adds a dedicated '## Originality' section right
after the Headline / trajectory table in both PR_BODY.md and README.md,
enumerating seven discrete contributions in order of impact:

  1. Custom rANS entropy codec for NN weights (prior in chain, openai#1123/openai#1146).
     THE ONLY submission in the entire competition pushing mixed-precision
     weights through a rANS codec -- MLP-up 2.32 bits/weight, MLP-down 1.20
     bits/weight, vs ~4.0 bits/weight for a naive Int4 baseline. This is
     why a 32.8 M-parameter model fits in 15 MB at all.

  2. Aggressive SLOT tuning for the 32 M regime (prior in chain, openai#1146).
     PR openai#1176's lr=0.003 steps=5 defaults are ~33x too small at 32 M scale.
     Stride=64 full-eval sweep showed SLOT is monotonically helpful up to
     steps=100 lr=0.1, delivering -0.087 bpb over the base eval.

  3. Phase 1A int6 tied-embedding quantization (new in this PR). EMBED_QUANT_BITS=6
     EMBED_QUANT_TOK_EMB=1 is a free -0.6 MB on the rANS artifact with zero
     bpb regression. Phase 1A sanity sweep established that int6 is the right
     operating point (vs pent_tok regression of +0.043).

  4. Phase 5a trivial-wins composition (new in this PR). QK-Gain 5.0 +
     MuonEq-R + EMA 0.9965 + hidden_mult 5 + int6 tied embed, all stacked on
     top of the rANS HybridQuant backbone. -0.010124 bpb over v6.1 SLOT-100.

  5. Shannon-floor empirical check (new in this PR). Inter-layer delta
     prediction experiment showed delta entropy >= raw-weight entropy across
     all 11 layers; rANS reaches 2.32 bits/weight on MLP-up vs a Shannon
     theoretical minimum of 2.28 bits/weight on the same tensors. First
     empirical confirmation in the competition that HybridQuant rANS is
     already entropy-bound at the single-token coder level.

  6. Negative-results catalog for the 32 M regime (new in this PR). 11
     completed-to-eval experiments (Phase 1B / 1C / 2A-C / 3 / 5b / 5b')
     documented so other submitters can skip them.

  7. Legal Muon-TTT non-competitive finding (new in this PR). 3-seed
     full-eval TTT mean 1.205215 vs SLOT-100 mean 1.136399, SLOT wins by
     0.069 bpb. Strong negative result: aggressive SLOT already captures
     most of what TTT can extract for a 32 M model.

Each item is tagged '(prior in this chain)' or '(new in this PR)' so
reviewers can cleanly separate what was introduced earlier in the v6.1
chain from what this specific PR contributes. No changes to the reported
bpb numbers -- this is purely an originality-claim clarification pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Discarding Transformers. Elastic Associative Memory as a Language Model, 97% Intelligence Transfer in 14MBs

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

## Architecture

PR #1146 ("EAM IS THE MODEL") is a non-standard architecture: it trains a Teacher transformer (discarded after training), then distills knowledge into an Elastic Associative Memory (EAM) with a small Encoder and Decoder. At inference, only Encoder + EAM + Decoder are used — no transformer runs at eval time.

## Checks

### N-gram / BigramHash family bug

No n-gram logic, BigramHash, or XOR hash key construction is present anywhere in the file. The EAM operates on normalized float key vectors produced by the Encoder, not discrete hash lookups. Not applicable.

### Pre-Quant TTT (multi-epoch on val_tokens)

The EAM, Encoder, and Decoder are all trained exclusively on train_tokens / train_x / train_y (lines 372–512). No gradient updates touch val_tokens at any point. val_tokens is only read during the sliding-window evaluation loop (lines 589–617) and never used for weight updates. No Pre-Quant TTT violation.

### Score-first TTT / flat_store

The FlatStore is a kNN retrieval augmentation used during eval. At lines 641–660, the eval loop:

1. Computes nll (scored) from the EAM model before consulting the flat_store (line 637).
2. Retrieves from flat_store (line 642) — only used to optionally blend probabilities if they improve the score (lines 643–650).
3. Accumulates loss_sum / token_count (lines 652–657).
4. Then writes current window keys/tokens to flat_store (line 660, commented "after scoring — legal").

This is a strict score-first, write-after pattern. The kNN store is never written before scoring the current window. There is no is_last_chunk guard here, but none is needed...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
