
Non-record: notes on the recurrence band (mixing parameters, MLP sizing, loop sizing) #2137

Open

leon2k2k2k wants to merge 84 commits into openai:main from leon2k2k2k:writeup/recurrence-band-notes

Conversation

leon2k2k2k commented May 1, 2026

Notes on the recurrence band in compressed transformers

A small set of architectural studies on the loop band (layers 3–5) of the
#1736 / 060A baseline. Each section is independent.


Section 1 — Learning mixing parameters in depth-recurrent loops

A depth-recurrent loop runs the canonical Markov iteration through the loop
band (layers 3–5):

x_{k+1} = f(x_k)

Each pass uses only the previous pass's output. We replace this with a
learned mixing rule, train it end-to-end, and observe that the learned
mixing coefficients converge to a stable, nearly seed-invariant pattern
within a few hundred steps after looping activates. Once stabilized, the
coefficients can be read off the trained model and used as fixed constants
in a fresh training run.

Recurrent α-β

We add learnable scalars to control how each pass commits to the residual
and to allow detached cross-layer carries within the same pass:

x_{k+1} = β_k · f(x_k) + Σ_j α_{k,j} · stop_grad(x_k^{(j)})

with β_k initialized to 1 and α_{k,j} initialized to 0, so the loop
starts from the canonical Markov rule. Across the loop band (layers 3–5,
NL=2) this is just twelve scalars (three β's and a 3×3 α, shared across
passes); they are routed to the scalar optimizer and trained jointly with
the rest of the model.
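
A minimal PyTorch sketch of the rule, assuming a contiguous three-block
band; the class and field names (`AlphaBetaLoopBand`, `n_loops`) are
illustrative, not the submission's actual code:

```python
import torch
import torch.nn as nn

class AlphaBetaLoopBand(nn.Module):
    """Hypothetical sketch of the recurrent α-β mixing rule.

    `blocks` holds the band's transformer blocks (layers 3-5 here);
    n_loops=2 (NL=2) means each block runs three times per forward."""

    def __init__(self, blocks, n_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.n_loops = n_loops
        n = len(blocks)
        # β = 1, α = 0 at init: the band starts exactly at the canonical
        # Markov rule x_{k+1} = f(x_k), so looping is identity-at-init.
        self.beta = nn.Parameter(torch.ones(n))
        self.alpha = nn.Parameter(torch.zeros(n, n))

    def forward(self, x):
        carries = [None] * len(self.blocks)   # latest output of each block
        for _ in range(self.n_loops + 1):
            for k, block in enumerate(self.blocks):
                out = self.beta[k] * block(x)
                # Detached cross-layer carries: gradient flows through
                # β_k · f(x_k) only, matching the stop_grad in the rule.
                for j, carry in enumerate(carries):
                    if carry is not None:
                        out = out + self.alpha[k, j] * carry.detach()
                carries[k] = out
                x = out
        return x
```

In a real run the two scalar tensors would be routed to the scalar
optimizer, as described above.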

During a full training run on the #1736 base, the scalars drift off their
initialization once looping activates at frac=0.35, then plateau. The
final values are reproducible across seeds — for example, layer 4 converges
to a self-subtract pattern at α ≈ −0.348 (a learned gate), and layer 5
stabilizes into a positive aggregation of the signals from layers 3 and 4.

Freezing the learned values

We then read the converged values off the trained model and use them as
fixed constants in a new training run from scratch. The optimizer state
and per-step gradient on these scalars are dropped; only the values
survive. Because the loop now starts at the converged mixing pattern
rather than at the canonical Markov rule, the run is no longer
identity-at-init, but training-end quality matches.

This is shipped as PR #1779 on top of #1736:

| Submission | Mixing rule in loop band | val_bpb (3-seed mean) | Δ vs #1736 |
|---|---|---|---|
| #1736 (base) | canonical Markov | 1.06549 | |
| #1779 (frozen α-β) | fixed α-β with cross-layer carry | 1.06421 | −0.00128 |

3-seed std on #1779 is 0.00023, so the gain is well outside seed noise.
Artifact size is unchanged (the frozen scalars are baked into the model
weights serialized into the 16 MB budget).

The converged values used as fixed constants in #1779 are:

β = [1.5973, 1.8828, 1.9922]                          # layers 3, 4, 5

        from L3   from L4   from L5
α = [[ 0.2520, −0.0210, −0.0124],     # carries into L3's update
     [ 0.0669, −0.3477,  0.0031],     # carries into L4's update
     [ 0.1387,  0.2412,  0.0272]]     # carries into L5's update

Two patterns stand out. Every β is well above 1, so each pass amplifies
its own block output rather than damping it — the optimizer chose to
overshoot the canonical Markov rule. And the diagonal of α is mixed: L3
adds back ~25% of itself, L4 subtracts ~35% of itself (the learned-gate
self-subtract behavior), L5 leaves itself roughly alone but absorbs ~24%
of L4. The off-diagonal entries in row L5 also confirm L5 acts as an
aggregator over L3 and L4.
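
A sketch of the freeze step, reusing the hypothetical `AlphaBetaLoopBand`
from the sketch above: the converged values are re-registered as buffers,
so they serialize with the model weights (inside the 16 MB budget) but
carry no gradient or optimizer state.

```python
# Converged values read off the trained model (layers 3, 4, 5).
BETA_FROZEN = torch.tensor([1.5973, 1.8828, 1.9922])
ALPHA_FROZEN = torch.tensor([[ 0.2520, -0.0210, -0.0124],
                             [ 0.0669, -0.3477,  0.0031],
                             [ 0.1387,  0.2412,  0.0272]])

class FrozenAlphaBetaLoopBand(AlphaBetaLoopBand):
    def __init__(self, blocks, n_loops=2):
        super().__init__(blocks, n_loops)
        # Swap the nn.Parameters for buffers: the values are baked into
        # the serialized weights but are invisible to the optimizer.
        del self.beta, self.alpha
        self.register_buffer("beta", BETA_FROZEN.clone())
        self.register_buffer("alpha", ALPHA_FROZEN.clone())
```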

Anderson acceleration with frozen coefficients

The same idea applies to a different mixing rule. Anderson acceleration
replaces the Markov iteration with a length-m mix of past iterates,
solved per batch via a small least-squares problem:

g_i = f(x_i) − x_i                                     # residuals
α* = argmin_α  ‖Σ_{i=k−m+1..k} α_i · g_i‖²,  Σ α_i = 1
x_{k+1} = Σ α*_i · f(x_i)
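
A sketch of the constrained solve, simplified here to a single solve over
flattened residuals where the run above solved it per batch. With the Gram
matrix M[i,j] = ⟨g_i, g_j⟩, the minimizer has the closed form
α = M⁻¹1 / (1ᵀM⁻¹1):

```python
import torch

def anderson_alpha(g_hist, ridge=1e-6):
    """Solve  min_α ‖Σ α_i g_i‖²  s.t.  Σ α_i = 1  in closed form.
    g_hist: the last m residuals g_i = f(x_i) − x_i."""
    G = torch.stack([g.flatten() for g in g_hist])          # (m, d)
    m = G.shape[0]
    gram = G @ G.T + ridge * torch.eye(m, device=G.device)  # M + λI
    alpha = torch.linalg.solve(gram, torch.ones(m, device=G.device))
    return alpha / alpha.sum()                              # enforce Σα = 1

def anderson_step(x_hist, fx_hist):
    """x_{k+1} = Σ α*_i · f(x_i) over the last m iterates."""
    alpha = anderson_alpha([fx - x for x, fx in zip(x_hist, fx_hist)])
    return sum(a * fx for a, fx in zip(alpha, fx_hist))
```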

Trained end-to-end (length-3 Anderson, per-batch LS), quality lands in
the noise band of canonical recurrence while paying a ~25% throughput
penalty for the per-batch solve. Inspecting the trained model, the
per-batch α distribution concentrates tightly around

α ≈ [+0.55, −0.67, +1.12]

Following the same procedure as for α-β, we drop the LS solve and
hardcode these coefficients as constants. The result is a
fixed-coefficient extrapolation across the last three iterates with no
runtime overhead beyond the canonical loop.
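
The frozen variant then reduces to a fixed linear mix; a sketch, assuming
the coefficients are ordered oldest to newest (the text above does not
state the order). Note they already satisfy the Σα = 1 constraint:
0.55 − 0.67 + 1.12 = 1.00.

```python
# Frozen coefficients, assumed ordered oldest → newest; they sum to 1.
ANDERSON_ALPHA = (0.55, -0.67, 1.12)

def frozen_anderson_step(fx_hist):
    """Fixed mix of the last three f(x_i): no LS solve, no overhead."""
    assert len(fx_hist) == len(ANDERSON_ALPHA)
    return sum(a * fx for a, fx in zip(ANDERSON_ALPHA, fx_hist))
```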

| Variant | Mixing rule | Throughput vs canonical | val_bpb (single seed) |
|---|---|---|---|
| Canonical | Markov | 1.00× | 1.06108 |
| Anderson, learned per-batch α | length-3 LS | 0.75× | 1.06083 |
| Anderson, frozen α | fixed [+0.55, −0.67, +1.12] | 1.00× | 1.05968 |

The frozen-Anderson result is single-seed; multi-seed confirmation has
not been run.


Section 2 — MLP sizing across the three stages

The loop band runs each of layers 3, 4, 5 three times per forward pass
(NL=2). Each pass reads the same FFN weights, so the loop band's FFN
parameters see roughly 3× the per-token use of their counterparts in the
non-looped layers. A natural question is whether the loop band deserves
more FFN capacity than the rest of the model at fixed total parameters —
i.e., whether reallocating width from the non-looped layers into the
loop band is a free win.

We split the 11 physical layers into three positional stages and
parameterize the FFN width as a per-stage multiplier of model_dim:

stage     layers    width multiplier
early     0–2       MLP_EARLY_MULT
middle    3–5       MLP_MIDDLE_MULT     # the loop band
late      6–10      MLP_LATE_MULT

The baseline uses 4.0 everywhere, for a total of 11 × 4.0 = 44.0
width-units. We tried three reallocation schemes that hold the total
fixed at 44.0 width-units while widening the middle stage to 5.0:

| arm | early | middle | late | direction |
|---|---|---|---|---|
| baseline | 4.0 | 4.0 | 4.0 | uniform |
| 040A | 3.625 | 5.0 | 3.625 | shrink both sides evenly |
| 040B | 3.0 | 5.0 | 4.0 | shrink early, keep late |
| 040C | 4.0 | 5.0 | 3.4 | keep early, shrink late |
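
A sketch of the per-stage width schedule and the fixed-budget constraint,
shown with arm 040C's multipliers; `MODEL_DIM` and the structure of the
config are illustrative, not the actual research-line code:

```python
MODEL_DIM = 1024  # illustrative, not the actual 060A value

# (layer range, FFN width multiplier) per stage; values are arm 040C.
STAGES = {
    "early":  (range(0, 3),  4.0),   # MLP_EARLY_MULT
    "middle": (range(3, 6),  5.0),   # MLP_MIDDLE_MULT, the loop band
    "late":   (range(6, 11), 3.4),   # MLP_LATE_MULT
}

def ffn_hidden_dim(layer: int) -> int:
    for layers, mult in STAGES.values():
        if layer in layers:
            return round(mult * MODEL_DIM)
    raise ValueError(f"layer {layer} outside the 11-layer stack")

# Reallocation constraint: total width-units stay at 11 × 4.0 = 44.0.
total_units = sum(len(layers) * mult for layers, mult in STAGES.values())
assert abs(total_units - 44.0) < 1e-9  # 3×4.0 + 3×5.0 + 5×3.4 = 44.0
```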

Single-seed training-only screen on the 038/039 fullfloat research line:
2×H100, 600 s wallclock cap, no quantization or TTT. The absolute val_bpb
values are pre-quant post-EMA from this short screen and are not directly
comparable to the post-quant post-TTT numbers in Section 1; this is a
relative comparison of training quality between MLP schedules, not an
endpoint number. Pre-quant post-EMA val_bpb on the validation set:

| arm | val_bpb (pre-quant post-EMA) | Δ vs uniform |
|---|---|---|
| baseline (uniform 4.0) | 1.16501 | |
| 040A (3.625 / 5.0 / 3.625) | 1.16742 | +0.00241 |
| 040B (3.0 / 5.0 / 4.0) | 1.16744 | +0.00244 |
| 040C (4.0 / 5.0 / 3.4) | 1.16484 | −0.00017 |

Three observations:

  • Widening the middle is at most a marginal win. 040C is the only
    reallocation that doesn't regress, and its gain (Δ ≈ −0.00017) is
    comfortably inside single-seed noise on a screen with no seed
    averaging. Treat it as tied with baseline, not a win.
  • Shrinking the early stage is more expensive than shrinking the
    late stage. 040B (early shrunk to 3.0, late kept at 4.0) regresses by
    +0.00244; 040C (early kept at 4.0, late shrunk to 3.4) comes in
    −0.00017 ahead. A symmetric shrink (040A) lands close to 040B. The
    early layers (0–2) are doing work that doesn't compress; the late
    layers (6–10) tolerate the squeeze.
  • The middle-stage gain is bounded above by what the late-shrink
    costs.
    Whatever extra capacity the middle stage absorbs from going
    4.0 → 5.0, the late stage gives back roughly the same amount when it
    goes 4.0 → 3.4. The two effects nearly cancel. The implication is that
    the loop band is not obviously starved for FFN capacity at the
    uniform baseline.

Section 3 — Sizing the loop band

The canonical 060A loop band is the contiguous set {3, 4, 5} run at
NL=2, so each of layers 3, 4, 5 is visited three times per forward
pass. The full forward does 17 layer-applications, with 9 of them
inside the loop band. Two knobs control the total compute spent inside
the band: which layers form the band (band-set), and how many times
each is visited (NL). We screened both directions on 060A.
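
A sketch of the screened forward pass, assuming a contiguous band and the
visits = NL + 1 convention used throughout this section:

```python
def forward_with_loop_band(x, blocks, band=(3, 4, 5), nl=2):
    """Run 11 physical blocks, re-running the contiguous `band` so each
    band layer is visited nl + 1 times. With band={3,4,5} and nl=2 this
    gives 17 layer-applications, 9 of them inside the band."""
    band = sorted(band)
    i = 0
    while i < len(blocks):
        if i == band[0]:
            for _ in range(nl + 1):
                for j in band:
                    x = blocks[j](x)
            i = band[-1] + 1   # the whole band has been handled
        else:
            x = blocks[i](x)
            i += 1
    return x
```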

| spec | band-set | NL | loop-band passes | description |
|---|---|---|---|---|
| 060A (canonical) | {3,4,5} | 2 | 9 | reference |
| 041B | {3,4,5} | 1 | 6 | one fewer visit per layer; half the extra loop compute |
| 041D | {5} | 2 | 3 | single-layer band, only layer 5 |
| 041H | {4,5} | 2 | 6 | drop the front of the band |
| 070 | {3,4} | 2 | 6 | drop the back of the band |
| 041L | {3,4,5} | 3 | 12 | more visits per layer |
| 041N | {3,4,5} | 4 | 15 | more still |

Same screen protocol throughout: single seed 42, 4×H100, 1200s
wallclock, no TTT. Pre-quant post-EMA val_bpb:

| spec | structure | pre-quant post-EMA | Δ vs canonical |
|---|---|---|---|
| 060A | canonical {3,4,5} NL=2 | 1.06358 | |
| 041B | {3,4,5} NL=1 | 1.06842 | +0.00484 |
| 041D | {5} NL=2 | 1.06993 | +0.00635 |
| 041H | {4,5} NL=2 | 1.06693 | +0.00335 |
| 070 | {3,4} NL=2 | 1.06595 | +0.00237 |
| 041L | {3,4,5} NL=3 | 1.06615 | +0.00257 |
| 041N | {3,4,5} NL=4 | 1.06888 | +0.00530 |

Two observations:

  • Canonical is locally optimal in both directions. Both shrinking
    (NL=1, single-layer band, drop a layer) and growing (NL=3, NL=4) lose
    to the canonical {3,4,5} NL=2, and the loss grows with distance from
    canonical. NL=3 (+0.00257) is the closest miss; NL=4 (+0.00530) loses
    about as much as halving the extra loop compute (041B, +0.00484).
  • Band shape is roughly position-symmetric. Dropping layer 3 (041H,
    +0.00335) and dropping layer 5 (070, +0.00237) cost similar amounts.
    Reducing to a single layer (041D, +0.00635) is worse than either, but
    in the same direction. There's no specific layer in {3,4,5} that's
    uniquely load-bearing; the band-as-a-whole is what matters.

The 041L NL=3 result is interesting in isolation — the gap to
canonical (+0.00257) is small enough that with multi-seed averaging
it may close. We did not promote it past the screen.

Three short sections on architectural studies of the loop band (layers 3-5)
of the openai#1736 / 060A baseline:

1. Learning mixing parameters in depth-recurrent loops (recur-α-β + Anderson;
   the freezing-after-stabilization recipe shipped as PR openai#1779).
2. MLP sizing across the three stages (040A/B/C width-reallocation screen).
3. Sizing the loop band (band-set + NL count variants on 060A).

Draft — single-seed screens unless noted, no submission artifact, no
leaderboard claim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leon2k2k2k leon2k2k2k marked this pull request as ready for review May 1, 2026 23:59
leon2k2k2k and others added 28 commits May 2, 2026 10:54
… cleanup

Merging uncommitted changes from 7 dirty worktrees ahead of worktree removal.
Parts: competition setup, key techniques (tokenizer/training/quant/TTT),
disqualification zoo, CaseOps val-set leak, final leaderboard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SmearGate: correct from "position-mixing gate at document boundaries"
  to "learned gate blending each token's hidden state with the previous
  token's"; BOS fix masks gate at document-start positions
- AWQ-lite: correct from "asymmetric per-channel scaling" to
  "activation-aware salient-group int8 promotion"
- MuonEq-R: make precise — row-normalizes gradient matrices before
  Newton-Schulz orthogonalization; not just "equivariant variant"
- Parallel residuals: fix layer number from "8+" to "7+" (PARALLEL_START_LAYER=7
  confirmed in all records)
- Final SOTA: correct openai#1855 1.0611 → openai#2135 1.0565 in intro and end-of-Part-1
  snapshot reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…detail

Parallel residuals: original PR used layer 7 but final lineage (PR openai#1530+)
consistently uses layer 8; corrected in the architecture section. The
n-gram cross-check was separate.

N-gram tilt: expand to explain the three-expert structure — token expert
is causal (prefix-only hash), within-word and word-start experts are
non-causal (read boundary_lut[tokens[i]] = TARGET token's type). Two of
three were disabled to make the kernel legal; only the token expert survives
in the final implementation. Added to both Part 1 (TTT section) and
Part 2 (N-gram: Hard to Do Right).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CaseOps: expand mechanics — lowercase tokenization + capitalization bitmask
sidecar; explains *why* it works for a half-technical audience.

N-gram/PPM closing argument: add the orthogonality framing — NNs are already
well-calibrated on their own uncertainty; classical methods help only when
they provide signal orthogonal to the NN's learned distribution; n-gram/PPM
are mostly correlated, not orthogonal. Token-only tilt is the narrow exception
(exact prefix repetition, which softmax smooths over).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restructures Part 0 into flowing prose: setup/timing facts, scoring from
first principles (log-prob formula, 256-option multiple-choice analogy),
C1-C4 rules with analogies. Removes 'deliberately extreme' paragraph and
old subsection headers. Removes Parts 1-4 placeholders from draft.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Part 1 opening and baseline walkthrough across five components:
tokenizer (SP1024), model architecture (U-Net with math block, GQA, RoPE),
training (Muon), quantization (bf16→int6, precision tradeoff explained),
post-training (none — sets up TTT). Removes BigramHash (not in original
baseline — verified against source). Renames Part 1 title.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Parallel residuals: credit PR openai#1204 (first) and PR openai#1529 (layer 8)
- Depth recurrence: clarify 2-pass origin (openai#1344), 3-pass is final config
- Loop curriculum frac=0.35: correct PR to openai#1420 (not openai#1344)
- SmearGate: note original PR openai#162, SP8192 reintroduction in openai#1667
- Attention output gate: PR openai#1667 first, PR openai#1787 narrowed it
- LeakyReLU²: replaced relu², not GELU
- EMA decay: remove specific wrong value, defer to PR openai#287
- GPTQ first: correct to PR openai#535 (not openai#1019)
- Global SGD phase: 2000 docs (openai#1610) refined to 2500 (openai#1626)
- Training closing line: "halfway point" → "35% mark"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
leon2k2k2k and others added 30 commits May 6, 2026 00:36
Clarified explanation of the gate's behavior and updated the description of the fix in PR openai#1514.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Depth recurrence: openai#1344 → openai#1204 (first appearance on leaderboard)
- GPTQ: openai#535 → openai#374 (GPTQ-lite first introduced in openai#374)
- int7 embeddings: openai#1586 → openai#1626 (openai#1586 not in leaderboard)
- LQER: openai#1797 → openai#1851 (technique evolution credits openai#1851)
- XSA table: openai#287 → openai#265 (openai#265 is "first XSA")
- SmearGate table: openai#1667 → openai#1851 (openai#1851 fixed BOS bug; openai#1667 just reused SmearGate)
- LeakyReLU² table: openai#493 → openai#549 (openai#549 is "first" per leaderboard)
- AWQ-lite table: openai#1908 → openai#1945 (openai#1908 not in leaderboard)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rboard)

The previous commit changed these to leaderboard-first values, but the
actual first introduction appeared in earlier non-leaderboard PRs:
- LeakyReLU²: openai#549 → openai#493 (PR openai#493 README confirms introduction)
- int7 embeddings: openai#1626 → openai#1586 (PR openai#1586 credited in later records)
- LQER: openai#1851 → openai#1797 (PR openai#1797 README credits rank-4 quant correction)
- AWQ-lite: openai#1945 → openai#1908 (git history references openai#1908 as origin)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…penai#65) after direct PR inspection

- GPTQ: openai#374 had no GPTQ; openai#414 was GPTQ-lite (clip percentile, no Hessian);
  openai#535 is "Full GPTQ with Hessian calibration" which matches the draft description
- SmearGate: openai#65 (2026-03-19) introduced SmearGate one day before openai#162 (2026-03-20);
  PR openai#65 body contains full SmearGate spec identical to openai#162

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>