Non-record: notes on the recurrence band (mixing parameters, MLP sizing, loop sizing)#2137
Open
leon2k2k2k wants to merge 84 commits into openai:main from
Conversation
Three short sections on architectural studies of the loop band (layers 3-5) of the openai#1736 / 060A baseline: 1. Learning mixing parameters in depth-recurrent loops (recur-α-β + Anderson; the freezing-after-stabilization recipe shipped as PR openai#1779). 2. MLP sizing across the three stages (040A/B/C width-reallocation screen). 3. Sizing the loop band (band-set + NL count variants on 060A). Draft — single-seed screens unless noted, no submission artifact, no leaderboard claim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cleanup Merging uncommitted changes from 7 dirty worktrees ahead of worktree removal.
Parts: competition setup, key techniques (tokenizer/training/quant/TTT), disqualification zoo, CaseOps val-set leak, final leaderboard. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SmearGate: correct from "position-mixing gate at document boundaries" to "learned gate blending each token's hidden state with the previous token's"; BOS fix masks gate at document-start positions - AWQ-lite: correct from "asymmetric per-channel scaling" to "activation-aware salient-group int8 promotion" - MuonEq-R: make precise — row-normalizes gradient matrices before Newton-Schulz orthogonalization; not just "equivariant variant" - Parallel residuals: fix layer number from "8+" to "7+" (PARALLEL_START_LAYER=7 confirmed in all records) - Final SOTA: correct openai#1855 1.0611 → openai#2135 1.0565 in intro and end-of-Part-1 snapshot reference Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…detail Parallel residuals: original PR used layer 7 but final lineage (PR openai#1530+) consistently uses layer 8; corrected in the architecture section; the n-gram cross-check was separate. N-gram tilt: expand to explain the three-expert structure — token expert is causal (prefix-only hash), within-word and word-start experts are non-causal (read boundary_lut[tokens[i]] = TARGET token's type). Two of three were disabled to make the kernel legal; only the token expert survives in the final implementation. Added to both Part 1 (TTT section) and Part 2 (N-gram: Hard to Do Right). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CaseOps: expand mechanics — lowercase tokenization + capitalization bitmask sidecar; explains *why* it works for a half-technical audience. N-gram/PPM closing argument: add the orthogonality framing — NNs are already well-calibrated on their own uncertainty; classical methods help only when they provide signal orthogonal to the NN's learned distribution; n-gram/PPM are mostly correlated, not orthogonal. Token-only tilt is the narrow exception (exact prefix repetition, which softmax smooths over). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restructures Part 0 into flowing prose: setup/timing facts, scoring from first principles (log-prob formula, 256-option multiple-choice analogy), C1-C4 rules with analogies. Removes 'deliberately extreme' paragraph and old subsection headers. Removes Parts 1-4 placeholders from draft. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Part 1 opening and baseline walkthrough across five components: tokenizer (SP1024), model architecture (U-Net with math block, GQA, RoPE), training (Muon), quantization (bf16→int6, precision tradeoff explained), post-training (none — sets up TTT). Removes BigramHash (not in original baseline — verified against source). Renames Part 1 title. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Parallel residuals: credit PR openai#1204 (first) and PR openai#1529 (layer 8) - Depth recurrence: clarify 2-pass origin (openai#1344), 3-pass is final config - Loop curriculum frac=0.35: correct PR to openai#1420 (not openai#1344) - SmearGate: note original PR openai#162, SP8192 reintroduction in openai#1667 - Attention output gate: PR openai#1667 first, PR openai#1787 narrowed it - LeakyReLU²: replaced relu², not GELU - EMA decay: remove specific wrong value, defer to PR openai#287 - GPTQ first: correct to PR openai#535 (not openai#1019) - Global SGD phase: 2000 docs (openai#1610) refined to 2500 (openai#1626) - Training closing line: "halfway point" → "35% mark" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n closing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Clarified explanation of the gate's behavior and updated the description of the fix in PR openai#1514. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Depth recurrence: openai#1344 → openai#1204 (first appearance on leaderboard) - GPTQ: openai#535 → openai#374 (GPTQ-lite first introduced in openai#374) - int7 embeddings: openai#1586 → openai#1626 (openai#1586 not in leaderboard) - LQER: openai#1797 → openai#1851 (technique evolution credits openai#1851) - XSA table: openai#287 → openai#265 (openai#265 is "first XSA") - SmearGate table: openai#1667 → openai#1851 (openai#1851 fixed BOS bug; openai#1667 just reused SmearGate) - LeakyReLU² table: openai#493 → openai#549 (openai#549 is "first" per leaderboard) - AWQ-lite table: openai#1908 → openai#1945 (openai#1908 not in leaderboard) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rboard) The previous commit changed these to leaderboard-first values, but the actual first introduction appeared in earlier non-leaderboard PRs: - LeakyReLU²: openai#549 → openai#493 (PR openai#493 README confirms introduction) - int7 embeddings: openai#1626 → openai#1586 (PR openai#1586 credited in later records) - LQER: openai#1851 → openai#1797 (PR openai#1797 README credits rank-4 quant correction) - AWQ-lite: openai#1945 → openai#1908 (git history references openai#1908 as origin) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…penai#65) after direct PR inspection - GPTQ: openai#374 had no GPTQ; openai#414 was GPTQ-lite (clip percentile, no Hessian); openai#535 is "Full GPTQ with Hessian calibration" which matches the draft description - SmearGate: openai#65 (2026-03-19) introduced SmearGate one day before openai#162 (2026-03-20); PR openai#65 body contains full SmearGate spec identical to openai#162 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Notes on the recurrence band in compressed transformers
A small set of architectural studies on the loop band (layers 3–5) of the
#1736 / 060A baseline. Each section is independent.
Section 1 — Learning mixing parameters in depth-recurrent loops
A depth-recurrent loop runs the canonical Markov iteration through the loop
band (layers 3–5):
Each pass uses only the previous pass's output. We replace this with a
learned mixing rule, train it end-to-end, and observe that the learned
mixing coefficients converge to a stable, nearly seed-invariant pattern
within a few hundred steps after looping activates. Once stabilized, the
coefficients can be read off the trained model and used as fixed constants
in a fresh training run.
Recurrent α-β
We add learnable scalars to control how each pass commits to the residual
and to allow detached cross-layer carries within the same pass:
with
β_k initialized to 1 and α_{k,j} initialized to 0, so the loop starts from
the canonical Markov rule. Across the loop band (layers 3–5, NL=2) this is
a small number of scalars; they are routed to the scalar optimizer and
trained jointly with the rest of the model.
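The mixing rule can be sketched in miniature. This is a hedged toy, not the PR's code: it assumes a residual-form update, `f` stands in for a loop-band layer's block, and in the real model the cross-layer carries are detached from the autograd graph.

```python
def f(h):
    # stand-in for one loop-band layer's block (attention + FFN);
    # a toy scalar transform keeps the sketch self-contained
    return 0.5 * h

def markov_pass(h, n_passes=3):
    # canonical rule: each pass sees only the previous pass's output
    for _ in range(n_passes):
        h = h + f(h)
    return h

def alpha_beta_pass(h, betas, alphas):
    # beta_k scales how hard pass k commits to the residual;
    # alpha_{k,j} mixes in carries from earlier iterates
    # (detached in actual training)
    history = [h]
    for k, beta in enumerate(betas):
        carry = sum(a * hj for a, hj in zip(alphas[k], history))
        h = h + beta * f(h) + carry
        history.append(h)
    return h

# at init (beta=1, alpha=0) the learned rule reduces to the Markov rule
zeros = [[0.0] * k for k in range(1, 4)]
assert alpha_beta_pass(2.0, [1.0] * 3, zeros) == markov_pass(2.0)
```

The identity-at-init property above is the point of the initialization choice: training starts exactly at the canonical recurrence and only drifts away if the optimizer finds it useful.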
During a full training run on the #1736 base, the scalars drift off their
initialization once looping activates at frac=0.35, then plateau. The
final values are reproducible across seeds — for example, layer 4 converges
to a self-subtract pattern at α ≈ −0.348 (a learned gate), and layer 5
stabilizes into a positive aggregation of the signals from layers 3 and 4.
Freezing the learned values
We then read the converged values off the trained model and use them as
fixed constants in a new training run from scratch. The optimizer state
and per-step gradient on these scalars are dropped; only the values
survive. Because the loop now starts at the converged mixing pattern
rather than at the canonical Markov rule, the run is no longer
identity-at-init, but training-end quality matches.
This is shipped as PR #1779 on top of #1736:
3-seed std on #1779 is 0.00023, so the gain is well outside seed noise.
Artifact size is unchanged (the frozen scalars are baked into the model
weights serialized into the 16 MB budget).
The converged values used as fixed constants in #1779 are:
Two patterns stand out. Every β is well above 1, so each pass amplifies
its own block output rather than damping it — the optimizer chose to
overshoot the canonical Markov rule. And the diagonal of α is mixed: L3
adds back ~25% of itself, L4 subtracts ~35% of itself (the learned-gate
self-subtract behavior), L5 leaves itself roughly alone but absorbs ~24%
of L4. The off-diagonal entries in row L5 also confirm L5 acts as an
aggregator over L3 and L4.
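The freeze-after-stabilization recipe can be sketched as follows. This is a hedged toy: the class and attribute names are illustrative, and the β value 1.3 is a made-up stand-in (the text says only that every β converges well above 1); α ≈ −0.348 for layer 4 is the one value quoted above.

```python
class LoopBand:
    def __init__(self, mixing=None):
        if mixing is None:
            # fresh run: learnable scalars at the canonical Markov init
            self.mixing = {"beta_4": 1.0, "alpha_4_4": 0.0}
            self.trainable = True
        else:
            # frozen run: converged values baked in as plain constants;
            # no optimizer state or per-step gradients survive
            self.mixing = dict(mixing)
            self.trainable = False

trained = LoopBand()
# stand-in for a full training run; layer 4's learned self-subtract gate
trained.mixing.update({"beta_4": 1.3, "alpha_4_4": -0.348})

# new run from scratch: only the values carry over
fresh = LoopBand(mixing=trained.mixing)
assert fresh.mixing["alpha_4_4"] == -0.348 and not fresh.trainable
```

Because the frozen run starts from the converged mixing pattern rather than the Markov rule, it gives up identity-at-init but, per the result above, matches training-end quality.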
Anderson acceleration with frozen coefficients
The same idea applies to a different mixing rule. Anderson acceleration
replaces the Markov iteration with a length-m mix of past iterates,
solved per batch via a small least-squares problem:
Trained end-to-end (length-3 Anderson, per-batch LS), the coefficients
land in the noise band of canonical recurrence but pay a ~25% throughput
penalty for the per-batch solve. Inspecting the trained model, the
per-batch α distribution concentrates tightly around [+0.55, −0.67, +1.12].
Following the same procedure as for α-β, we drop the LS solve and
hardcode these coefficients as constants. The result is a
fixed-coefficient extrapolation across the last three iterates with no
runtime overhead beyond the canonical loop.
The frozen-Anderson result is single-seed; multi-seed confirmation has
not been run.
Section 2 — MLP sizing across the three stages
The loop band runs each of layers 3, 4, 5 three times per forward pass
(NL=2). Each pass reads the same FFN weights, so the parameters in the
loop band see roughly 3× the use per token of the FFN parameters in the
non-looped layers. A natural question is whether the loop band deserves
more FFN capacity than the rest of the model at fixed total parameters —
i.e., whether reallocating width from the non-looped layers into the
loop band is a free win.
We split the 11 physical layers into three positional stages and
parameterize the FFN width as a per-stage multiplier of model_dim.
The baseline uses 4.0 everywhere, for a total of 11 × 4.0 = 44.0
width-units. We tried three reallocation schemes that hold the total
fixed at 44.0 width-units while widening the middle stage to 5.0:
Single-seed training-only screen on the 038/039 fullfloat research line,
2×H100, 600s wallclock cap, no quantization or TTT. The absolute val_bpb
values are pre-quant post-EMA from this short screen, not directly
comparable to the post-quant post-TTT numbers in Section 1 — this is a
relative comparison of training quality between MLP schedules, not an
endpoint number. Pre-quant post-EMA val_bpb on the validation set:
Three observations:
1. 040C is the only reallocation that doesn't regress, and the gain is
comfortably inside single-seed noise (Δ ≈ −0.0002 on a screen with no
seed average). Treat it as "tied with baseline," not a win.
2. Shrinking the early stage costs more than shrinking the
late stage. 040B (early shrunk to 3.0, late kept at 4.0) loses
+0.00244; 040C (early kept at 4.0, late shrunk to 3.4) gains
−0.00017. A symmetric shrink (040A) lands close to 040B. The early
layers (0–2) are doing work that doesn't compress; the late layers
(6–10) tolerate it.
3. The middle-stage gain is roughly offset by what the late-stage shrink
costs. Whatever extra capacity the middle stage absorbs from going
4.0 → 5.0, the late stage gives back roughly the same amount when it
goes 4.0 → 3.4. The two effects nearly cancel. The implication is that
the loop band is not obviously starved for FFN capacity at the
uniform baseline.
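The fixed-total accounting behind these schemes can be checked directly. The stage boundaries (early 0–2, mid 3–5, late 6–10) follow the layer ranges quoted above:

```python
def total_width_units(early, mid, late):
    # 3 early + 3 middle + 5 late layers = 11 physical layers,
    # each contributing its stage's width multiplier
    return 3 * early + 3 * mid + 5 * late

assert total_width_units(4.0, 4.0, 4.0) == 44.0             # uniform baseline
assert total_width_units(3.0, 5.0, 4.0) == 44.0             # 040B: early -> 3.0
assert abs(total_width_units(4.0, 5.0, 3.4) - 44.0) < 1e-9  # 040C: late -> 3.4
```

Note how the constraint pins the knobs: with the middle fixed at 5.0, widening it by 3 units forces exactly 3 units out of the other eight layers, which is why 040C's late multiplier lands at the odd-looking 3.4.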
Section 3 — Sizing the loop band
The canonical 060A loop band is the contiguous set {3, 4, 5} run at
NL=2, so each of layers 3, 4, 5 is visited three times per forward
pass. The full forward does 17 layer-applications, with 9 of them
inside the loop band. Two knobs control the total compute spent inside
the band: which layers form the band (band-set), and how many times
each is visited (NL). We screened both directions on 060A.
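The application counts fall out of the band arithmetic. A quick check, treating NL as the number of extra visits per band layer (so NL=2 means three visits total):

```python
N_LAYERS = 11      # physical layers 0-10
BAND = {3, 4, 5}   # the loop band
NL = 2             # extra visits per band layer

total_applications = N_LAYERS + len(BAND) * NL  # 11 + 6 = 17
band_applications = len(BAND) * (1 + NL)        # 3 layers x 3 visits = 9

assert total_applications == 17
assert band_applications == 9
```

Under this convention NL=1 halves the band's extra compute (14 total applications) and NL=3 raises it to 20, which is the axis the screen below varies.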
Same screen protocol throughout: single seed 42, 4×H100, 1200s
wallclock, no TTT. Pre-quant post-EMA val_bpb:
Two observations:
1. Both shrinking the loop compute
(NL=1, single-layer band, drop a layer) and growing it (NL=3, NL=4) lose
to the canonical {3,4,5} NL=2 — the loss is monotonic in how far the
configuration sits from canonical. NL=3 (+0.00257) is the closest
miss; NL=4 (+0.00530) loses about as much as halving the loop
compute.
2. Dropping one band layer
(+0.00335) and dropping layer 5 (070, +0.00237) cost similar amounts.
Reducing to a single layer (041D, +0.00635) is worse than either, but
in the same direction. There's no specific layer in {3,4,5} that's
uniquely load-bearing; the band-as-a-whole is what matters.
The 041L NL=3 result is interesting in isolation — the gap to
canonical (+0.00257) is small enough that with multi-seed averaging
it may close. We did not promote it past the screen.