
Non-record: notes on the recurrence band (mixing parameters, MLP sizing, loop sizing) #2137

Open

leon2k2k2k wants to merge 84 commits into openai:main from leon2k2k2k:writeup/recurrence-band-notes

Conversation

leon2k2k2k commented May 1, 2026

Notes on the recurrence band in compressed transformers

A small set of architectural studies on the loop band (layers 3–5) of the
#1736 / 060A baseline. Each section is independent.


Section 1 — Learning mixing parameters in depth-recurrent loops

A depth-recurrent loop runs the canonical Markov iteration through the loop
band (layers 3–5):

x_{k+1} = f(x_k)

Each pass uses only the previous pass's output. We replace this with a
learned mixing rule, train it end-to-end, and observe that the learned
mixing coefficients converge to a stable, nearly seed-invariant pattern
within a few hundred steps after looping activates. Once stabilized, the
coefficients can be read off the trained model and used as fixed constants
in a fresh training run.

Recurrent α-β

We add learnable scalars to control how each pass commits to the residual
and to allow detached cross-layer carries within the same pass:

x_{k+1} = β_k · f(x_k) + Σ_j α_{k,j} · stop_grad(x_k^{(j)})

with β_k initialized to 1 and α_{k,j} initialized to 0, so the loop
starts from the canonical Markov rule. Across the loop band (layers 3–5,
NL=2) this is just twelve scalars (three β's and a 3×3 α, shared across
passes); they are routed to the scalar optimizer and trained jointly with
the rest of the model.
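
A minimal PyTorch sketch of the rule, assuming a contiguous three-block
band; the class and field names (`AlphaBetaLoopBand`, `n_loops`) are
illustrative, not the submission's actual code:

```python
import torch
import torch.nn as nn

class AlphaBetaLoopBand(nn.Module):
    """Hypothetical sketch of the recurrent α-β mixing rule.

    `blocks` holds the band's transformer blocks (layers 3-5 here);
    n_loops=2 (NL=2) means each block runs three times per forward."""

    def __init__(self, blocks, n_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.n_loops = n_loops
        n = len(blocks)
        # β = 1, α = 0 at init: the band starts exactly at the canonical
        # Markov rule x_{k+1} = f(x_k), so looping is identity-at-init.
        self.beta = nn.Parameter(torch.ones(n))
        self.alpha = nn.Parameter(torch.zeros(n, n))

    def forward(self, x):
        carries = [None] * len(self.blocks)   # latest output of each block
        for _ in range(self.n_loops + 1):
            for k, block in enumerate(self.blocks):
                out = self.beta[k] * block(x)
                # Detached cross-layer carries: gradient flows through
                # β_k · f(x_k) only, matching the stop_grad in the rule.
                for j, carry in enumerate(carries):
                    if carry is not None:
                        out = out + self.alpha[k, j] * carry.detach()
                carries[k] = out
                x = out
        return x
```

In a real run the two scalar tensors would be routed to the scalar
optimizer, as described above.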

During a full training run on the #1736 base, the scalars drift off their
initialization once looping activates at frac=0.35, then plateau. The
final values are reproducible across seeds — for example, layer 4 converges
to a self-subtract pattern at α ≈ −0.348 (a learned gate), and layer 5
stabilizes into a positive aggregation of the signals from layers 3 and 4.

Freezing the learned values

We then read the converged values off the trained model and use them as
fixed constants in a new training run from scratch. The optimizer state
and per-step gradient on these scalars are dropped; only the values
survive. Because the loop now starts at the converged mixing pattern
rather than at the canonical Markov rule, the run is no longer
identity-at-init, but training-end quality matches.

This is shipped as PR #1779 on top of #1736:

| Submission | Mixing rule in loop band | val_bpb (3-seed mean) | Δ vs #1736 |
|---|---|---|---|
| #1736 (base) | canonical Markov | 1.06549 | |
| #1779 (frozen α-β) | fixed α-β with cross-layer carry | 1.06421 | −0.00128 |

3-seed std on #1779 is 0.00023, so the gain is well outside seed noise.
Artifact size is unchanged (the frozen scalars are baked into the model
weights serialized into the 16 MB budget).

The converged values used as fixed constants in #1779 are:

β = [1.5973, 1.8828, 1.9922]                          # layers 3, 4, 5

        from L3   from L4   from L5
α = [[ 0.2520, −0.0210, −0.0124],     # carries into L3's update
     [ 0.0669, −0.3477,  0.0031],     # carries into L4's update
     [ 0.1387,  0.2412,  0.0272]]     # carries into L5's update

Two patterns stand out. Every β is well above 1, so each pass amplifies
its own block output rather than damping it — the optimizer chose to
overshoot the canonical Markov rule. And the diagonal of α is mixed: L3
adds back ~25% of itself, L4 subtracts ~35% of itself (the learned-gate
self-subtract behavior), L5 leaves itself roughly alone but absorbs ~24%
of L4. The off-diagonal entries in row L5 also confirm L5 acts as an
aggregator over L3 and L4.
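
A sketch of the freeze step, reusing the hypothetical `AlphaBetaLoopBand`
from the sketch above: the converged values are re-registered as buffers,
so they serialize with the model weights (inside the 16 MB budget) but
carry no gradient or optimizer state.

```python
# Converged values read off the trained model (layers 3, 4, 5).
BETA_FROZEN = torch.tensor([1.5973, 1.8828, 1.9922])
ALPHA_FROZEN = torch.tensor([[ 0.2520, -0.0210, -0.0124],
                             [ 0.0669, -0.3477,  0.0031],
                             [ 0.1387,  0.2412,  0.0272]])

class FrozenAlphaBetaLoopBand(AlphaBetaLoopBand):
    def __init__(self, blocks, n_loops=2):
        super().__init__(blocks, n_loops)
        # Swap the nn.Parameters for buffers: the values are baked into
        # the serialized weights but are invisible to the optimizer.
        del self.beta, self.alpha
        self.register_buffer("beta", BETA_FROZEN.clone())
        self.register_buffer("alpha", ALPHA_FROZEN.clone())
```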

Anderson acceleration with frozen coefficients

The same idea applies to a different mixing rule. Anderson acceleration
replaces the Markov iteration with a length-m mix of past iterates,
solved per batch via a small least-squares problem:

g_i = f(x_i) − x_i                                     # residuals
α* = argmin_α  ‖Σ_{i=k−m+1..k} α_i · g_i‖²,  Σ α_i = 1
x_{k+1} = Σ α*_i · f(x_i)
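
A sketch of the constrained solve, simplified here to a single solve over
flattened residuals where the run above solved it per batch. With the Gram
matrix M[i,j] = ⟨g_i, g_j⟩, the minimizer has the closed form
α = M⁻¹1 / (1ᵀM⁻¹1):

```python
import torch

def anderson_alpha(g_hist, ridge=1e-6):
    """Solve  min_α ‖Σ α_i g_i‖²  s.t.  Σ α_i = 1  in closed form.
    g_hist: the last m residuals g_i = f(x_i) − x_i."""
    G = torch.stack([g.flatten() for g in g_hist])          # (m, d)
    m = G.shape[0]
    gram = G @ G.T + ridge * torch.eye(m, device=G.device)  # M + λI
    alpha = torch.linalg.solve(gram, torch.ones(m, device=G.device))
    return alpha / alpha.sum()                              # enforce Σα = 1

def anderson_step(x_hist, fx_hist):
    """x_{k+1} = Σ α*_i · f(x_i) over the last m iterates."""
    alpha = anderson_alpha([fx - x for x, fx in zip(x_hist, fx_hist)])
    return sum(a * fx for a, fx in zip(alpha, fx_hist))
```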

Trained end-to-end (length-3 Anderson, per-batch LS), quality lands in
the noise band of canonical recurrence while paying a ~25% throughput
penalty for the per-batch solve. Inspecting the trained model, the
per-batch α distribution concentrates tightly around

α ≈ [+0.55, −0.67, +1.12]

Following the same procedure as for α-β, we drop the LS solve and
hardcode these coefficients as constants. The result is a
fixed-coefficient extrapolation across the last three iterates with no
runtime overhead beyond the canonical loop.
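
The frozen variant then reduces to a fixed linear mix; a sketch, assuming
the coefficients are ordered oldest to newest (the text above does not
state the order). Note they already satisfy the Σα = 1 constraint:
0.55 − 0.67 + 1.12 = 1.00.

```python
# Frozen coefficients, assumed ordered oldest → newest; they sum to 1.
ANDERSON_ALPHA = (0.55, -0.67, 1.12)

def frozen_anderson_step(fx_hist):
    """Fixed mix of the last three f(x_i): no LS solve, no overhead."""
    assert len(fx_hist) == len(ANDERSON_ALPHA)
    return sum(a * fx for a, fx in zip(ANDERSON_ALPHA, fx_hist))
```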

| Variant | Mixing rule | Throughput vs canonical | val_bpb (single seed) |
|---|---|---|---|
| Canonical | Markov | 1.00× | 1.06108 |
| Anderson, learned per-batch α | length-3 LS | 0.75× | 1.06083 |
| Anderson, frozen α | fixed [+0.55, −0.67, +1.12] | 1.00× | 1.05968 |

The frozen-Anderson result is single-seed; multi-seed confirmation has
not been run.


Section 2 — MLP sizing across the three stages

The loop band runs each of layers 3, 4, 5 three times per forward pass
(NL=2). Each pass reads the same FFN weights, so the loop band's FFN
parameters see roughly 3× the per-token use of their counterparts in the
non-looped layers. A natural question is whether the loop band deserves
more FFN capacity than the rest of the model at fixed total parameters —
i.e., whether reallocating width from the non-looped layers into the
loop band is a free win.

We split the 11 physical layers into three positional stages and
parameterize the FFN width as a per-stage multiplier of model_dim:

stage     layers    width multiplier
early     0–2       MLP_EARLY_MULT
middle    3–5       MLP_MIDDLE_MULT     # the loop band
late      6–10      MLP_LATE_MULT

The baseline uses 4.0 everywhere, for a total of 11 × 4.0 = 44.0
width-units. We tried three reallocation schemes that hold the total
fixed at 44.0 width-units while widening the middle stage to 5.0:

| arm | early | middle | late | direction |
|---|---|---|---|---|
| baseline | 4.0 | 4.0 | 4.0 | uniform |
| 040A | 3.625 | 5.0 | 3.625 | shrink both sides evenly |
| 040B | 3.0 | 5.0 | 4.0 | shrink early, keep late |
| 040C | 4.0 | 5.0 | 3.4 | keep early, shrink late |
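
A sketch of the per-stage width schedule and the fixed-budget constraint,
shown with arm 040C's multipliers; `MODEL_DIM` and the structure of the
config are illustrative, not the actual research-line code:

```python
MODEL_DIM = 1024  # illustrative, not the actual 060A value

# (layer range, FFN width multiplier) per stage; values are arm 040C.
STAGES = {
    "early":  (range(0, 3),  4.0),   # MLP_EARLY_MULT
    "middle": (range(3, 6),  5.0),   # MLP_MIDDLE_MULT, the loop band
    "late":   (range(6, 11), 3.4),   # MLP_LATE_MULT
}

def ffn_hidden_dim(layer: int) -> int:
    for layers, mult in STAGES.values():
        if layer in layers:
            return round(mult * MODEL_DIM)
    raise ValueError(f"layer {layer} outside the 11-layer stack")

# Reallocation constraint: total width-units stay at 11 × 4.0 = 44.0.
total_units = sum(len(layers) * mult for layers, mult in STAGES.values())
assert abs(total_units - 44.0) < 1e-9  # 3×4.0 + 3×5.0 + 5×3.4 = 44.0
```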

Single-seed training-only screen on the 038/039 fullfloat research line:
2×H100, 600 s wallclock cap, no quantization or TTT. The absolute val_bpb
values are pre-quant post-EMA from this short screen and are not directly
comparable to the post-quant post-TTT numbers in Section 1; this is a
relative comparison of training quality between MLP schedules, not an
endpoint number. Pre-quant post-EMA val_bpb on the validation set:

| arm | val_bpb (pre-quant post-EMA) | Δ vs uniform |
|---|---|---|
| baseline (uniform 4.0) | 1.16501 | |
| 040A (3.625 / 5.0 / 3.625) | 1.16742 | +0.00241 |
| 040B (3.0 / 5.0 / 4.0) | 1.16744 | +0.00244 |
| 040C (4.0 / 5.0 / 3.4) | 1.16484 | −0.00017 |

Three observations:

  • Widening the middle is at most a marginal win. 040C is the only
    reallocation that doesn't regress, and its gain (Δ ≈ −0.00017) is
    comfortably inside single-seed noise on a screen with no seed
    averaging. Treat it as tied with baseline, not a win.
  • Shrinking the early stage is more expensive than shrinking the
    late stage. 040B (early shrunk to 3.0, late kept at 4.0) regresses by
    +0.00244; 040C (early kept at 4.0, late shrunk to 3.4) comes in
    −0.00017 ahead. A symmetric shrink (040A) lands close to 040B. The
    early layers (0–2) are doing work that doesn't compress; the late
    layers (6–10) tolerate the squeeze.
  • The middle-stage gain is bounded above by what the late-shrink
    costs.
    Whatever extra capacity the middle stage absorbs from going
    4.0 → 5.0, the late stage gives back roughly the same amount when it
    goes 4.0 → 3.4. The two effects nearly cancel. The implication is that
    the loop band is not obviously starved for FFN capacity at the
    uniform baseline.

Section 3 — Sizing the loop band

The canonical 060A loop band is the contiguous set {3, 4, 5} run at
NL=2, so each of layers 3, 4, 5 is visited three times per forward
pass. The full forward does 17 layer-applications, with 9 of them
inside the loop band. Two knobs control the total compute spent inside
the band: which layers form the band (band-set), and how many times
each is visited (NL). We screened both directions on 060A.
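
A sketch of the screened forward pass, assuming a contiguous band and the
visits = NL + 1 convention used throughout this section:

```python
def forward_with_loop_band(x, blocks, band=(3, 4, 5), nl=2):
    """Run 11 physical blocks, re-running the contiguous `band` so each
    band layer is visited nl + 1 times. With band={3,4,5} and nl=2 this
    gives 17 layer-applications, 9 of them inside the band."""
    band = sorted(band)
    i = 0
    while i < len(blocks):
        if i == band[0]:
            for _ in range(nl + 1):
                for j in band:
                    x = blocks[j](x)
            i = band[-1] + 1   # the whole band has been handled
        else:
            x = blocks[i](x)
            i += 1
    return x
```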

| spec | band-set | NL | loop-band passes | description |
|---|---|---|---|---|
| 060A (canonical) | {3,4,5} | 2 | 9 | reference |
| 041B | {3,4,5} | 1 | 6 | one fewer visit per layer; half the extra loop compute |
| 041D | {5} | 2 | 3 | single-layer band, only layer 5 |
| 041H | {4,5} | 2 | 6 | drop the front of the band |
| 070 | {3,4} | 2 | 6 | drop the back of the band |
| 041L | {3,4,5} | 3 | 12 | more visits per layer |
| 041N | {3,4,5} | 4 | 15 | more still |

Same screen protocol throughout: single seed 42, 4×H100, 1200s
wallclock, no TTT. Pre-quant post-EMA val_bpb:

| spec | structure | pre-quant post-EMA | Δ vs canonical |
|---|---|---|---|
| 060A | canonical {3,4,5} NL=2 | 1.06358 | |
| 041B | {3,4,5} NL=1 | 1.06842 | +0.00484 |
| 041D | {5} NL=2 | 1.06993 | +0.00635 |
| 041H | {4,5} NL=2 | 1.06693 | +0.00335 |
| 070 | {3,4} NL=2 | 1.06595 | +0.00237 |
| 041L | {3,4,5} NL=3 | 1.06615 | +0.00257 |
| 041N | {3,4,5} NL=4 | 1.06888 | +0.00530 |

Two observations:

  • Canonical is locally optimal in both directions. Both shrinking
    (NL=1, single-layer band, drop a layer) and growing (NL=3, NL=4) lose
    to the canonical {3,4,5} NL=2, and the loss grows with distance from
    canonical. NL=3 (+0.00257) is the closest miss; NL=4 (+0.00530) loses
    about as much as halving the extra loop compute (041B, +0.00484).
  • Band shape is roughly position-symmetric. Dropping layer 3 (041H,
    +0.00335) and dropping layer 5 (070, +0.00237) cost similar amounts.
    Reducing to a single layer (041D, +0.00635) is worse than either, but
    in the same direction. There's no specific layer in {3,4,5} that's
    uniquely load-bearing; the band-as-a-whole is what matters.

The 041L NL=3 result is interesting in isolation — the gap to
canonical (+0.00257) is small enough that with multi-seed averaging
it may close. We did not promote it past the screen.

Three short sections on architectural studies of the loop band (layers 3-5)
of the openai#1736 / 060A baseline:

1. Learning mixing parameters in depth-recurrent loops (recur-α-β + Anderson;
   the freezing-after-stabilization recipe shipped as PR openai#1779).
2. MLP sizing across the three stages (040A/B/C width-reallocation screen).
3. Sizing the loop band (band-set + NL count variants on 060A).

Draft — single-seed screens unless noted, no submission artifact, no
leaderboard claim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leon2k2k2k leon2k2k2k marked this pull request as ready for review May 1, 2026 23:59
leon2k2k2k and others added 28 commits May 2, 2026 10:54
… cleanup

Merging uncommitted changes from 7 dirty worktrees ahead of worktree removal.
Parts: competition setup, key techniques (tokenizer/training/quant/TTT),
disqualification zoo, CaseOps val-set leak, final leaderboard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SmearGate: correct from "position-mixing gate at document boundaries"
  to "learned gate blending each token's hidden state with the previous
  token's"; BOS fix masks gate at document-start positions
- AWQ-lite: correct from "asymmetric per-channel scaling" to
  "activation-aware salient-group int8 promotion"
- MuonEq-R: make precise — row-normalizes gradient matrices before
  Newton-Schulz orthogonalization; not just "equivariant variant"
- Parallel residuals: fix layer number from "8+" to "7+" (PARALLEL_START_LAYER=7
  confirmed in all records)
- Final SOTA: correct openai#1855 1.0611 → openai#2135 1.0565 in intro and end-of-Part-1
  snapshot reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…detail

Parallel residuals: original PR used layer 7 but final lineage (PR openai#1530+)
consistently uses layer 8; corrected in the architecture section. The
n-gram cross-check was separate.

N-gram tilt: expand to explain the three-expert structure — token expert
is causal (prefix-only hash), within-word and word-start experts are
non-causal (read boundary_lut[tokens[i]] = TARGET token's type). Two of
three were disabled to make the kernel legal; only the token expert survives
in the final implementation. Added to both Part 1 (TTT section) and
Part 2 (N-gram: Hard to Do Right).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CaseOps: expand mechanics — lowercase tokenization + capitalization bitmask
sidecar; explains *why* it works for a half-technical audience.

N-gram/PPM closing argument: add the orthogonality framing — NNs are already
well-calibrated on their own uncertainty; classical methods help only when
they provide signal orthogonal to the NN's learned distribution; n-gram/PPM
are mostly correlated, not orthogonal. Token-only tilt is the narrow exception
(exact prefix repetition, which softmax smooths over).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restructures Part 0 into flowing prose: setup/timing facts, scoring from
first principles (log-prob formula, 256-option multiple-choice analogy),
C1-C4 rules with analogies. Removes 'deliberately extreme' paragraph and
old subsection headers. Removes Parts 1-4 placeholders from draft.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Part 1 opening and baseline walkthrough across five components:
tokenizer (SP1024), model architecture (U-Net with math block, GQA, RoPE),
training (Muon), quantization (bf16→int6, precision tradeoff explained),
post-training (none — sets up TTT). Removes BigramHash (not in original
baseline — verified against source). Renames Part 1 title.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Parallel residuals: credit PR openai#1204 (first) and PR openai#1529 (layer 8)
- Depth recurrence: clarify 2-pass origin (openai#1344), 3-pass is final config
- Loop curriculum frac=0.35: correct PR to openai#1420 (not openai#1344)
- SmearGate: note original PR openai#162, SP8192 reintroduction in openai#1667
- Attention output gate: PR openai#1667 first, PR openai#1787 narrowed it
- LeakyReLU²: replaced relu², not GELU
- EMA decay: remove specific wrong value, defer to PR openai#287
- GPTQ first: correct to PR openai#535 (not openai#1019)
- Global SGD phase: 2000 docs (openai#1610) refined to 2500 (openai#1626)
- Training closing line: "halfway point" → "35% mark"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
leon2k2k2k and others added 30 commits May 6, 2026 00:36
Clarified explanation of the gate's behavior and updated the description of the fix in PR openai#1514.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Depth recurrence: openai#1344 → openai#1204 (first appearance on leaderboard)
- GPTQ: openai#535 → openai#374 (GPTQ-lite first introduced in openai#374)
- int7 embeddings: openai#1586 → openai#1626 (openai#1586 not in leaderboard)
- LQER: openai#1797 → openai#1851 (technique evolution credits openai#1851)
- XSA table: openai#287 → openai#265 (openai#265 is "first XSA")
- SmearGate table: openai#1667 → openai#1851 (openai#1851 fixed BOS bug; openai#1667 just reused SmearGate)
- LeakyReLU² table: openai#493 → openai#549 (openai#549 is "first" per leaderboard)
- AWQ-lite table: openai#1908 → openai#1945 (openai#1908 not in leaderboard)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rboard)

The previous commit changed these to leaderboard-first values, but the
actual first introduction appeared in earlier non-leaderboard PRs:
- LeakyReLU²: openai#549 → openai#493 (PR openai#493 README confirms introduction)
- int7 embeddings: openai#1626 → openai#1586 (PR openai#1586 credited in later records)
- LQER: openai#1851 → openai#1797 (PR openai#1797 README credits rank-4 quant correction)
- AWQ-lite: openai#1945 → openai#1908 (git history references openai#1908 as origin)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…penai#65) after direct PR inspection

- GPTQ: openai#374 had no GPTQ; openai#414 was GPTQ-lite (clip percentile, no Hessian);
  openai#535 is "Full GPTQ with Hessian calibration" which matches the draft description
- SmearGate: openai#65 (2026-03-19) introduced SmearGate one day before openai#162 (2026-03-20);
  PR openai#65 body contains full SmearGate spec identical to openai#162

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>