Record: Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT — val_bpb 1.05733 (3-seed mean) #1693
Open
dexhunter wants to merge 1 commit into openai:main from
Conversation
…TTT — val_bpb 1.05733 (3-seed mean)

Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base. Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42: val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0: val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality is pending organizer review at Issue openai#1604. AttnOutGate and SmearGate are pure architectural additions and comply with all Issue openai#1017 conditions (causality, normalized distribution, score-before-update, single pass).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 17, 2026:
…ct concern; PR openai#1687 CLOSED BPB bug; PR openai#1693 casefold 1.05733; SOTA Day 8; Session 16 https://claude.ai/code/session_01LVvBLAM46dRKg53renpkq4
amrayach added a commit to amrayach/parameter-golf that referenced this pull request on Apr 18, 2026:
…verlay Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and run_all.sh/README alignment; new pin reflects the pipeline-patch commit. Also records the live-guidance absolute-BPB overlay and 04b deprecation driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request on Apr 19, 2026:
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192. Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without any test-time adaptation. Single seed 1337; compute-constrained non-record submission. The VM went down before the run log could be pushed, so it is not attached; metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop injection, Gemma-style global/local attention, Gram Newton-Schulz) + PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel + AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept, @MarioPaerle reintroduction) + new layered local sliding windows (512 on early/loop layers, 1024 on post-loop layers, split at index 6). KV-tying on globals dropped vs PR openai#1674.

TTT scaffolding (phased global-SGD + per-doc LoRA, from PR openai#1693 lineage) remains in the file for experiments but is disabled by default for this submission.
Meirzhan05 added a commit to Meirzhan05/parameter-golf that referenced this pull request on Apr 27, 2026:
Per PR openai#1667/openai#1693: a tiny linear (gate_width x num_heads, default 12x8 = 96 weights per layer) projects the first 12 dims of the input into per-head gate values. Scaled to (0, 2) via 2*sigmoid for symmetric pass-through at zero-init. Total: 1056 extra params (8 heads x 12 width x 11 layers) — ~1KB at fp16. Zero-init = identity at start (transparent). Lets each head dynamically suppress noise per-token. Compatible with depth recurrence, parallel residuals, XSA, and GPTQ (gate weights pass through as fp16, numel < 65536).
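The gate described above can be sketched in PyTorch. This is a minimal reconstruction from the stated spec (12-dim input slice, 8 heads, 2*sigmoid scaling, zero init), not the PR's actual code; the class and argument names are assumptions:

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """Per-head multiplicative gate on the attention output.

    A tiny zero-initialized linear maps the first `gate_width` input
    dims to one gate value per head; 2*sigmoid scales it into (0, 2),
    so zero weights give gate == 1.0 (identity pass-through at init).
    Sketch only -- names/shapes are assumptions, not the PR's code.
    """

    def __init__(self, num_heads: int = 8, gate_width: int = 12):
        super().__init__()
        self.gate_width = gate_width
        # gate_width x num_heads = 12 x 8 = 96 weights per layer
        self.proj = nn.Linear(gate_width, num_heads, bias=False)
        nn.init.zeros_(self.proj.weight)  # zero-init => transparent at start

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) residual-stream input; attn_out: (B, T, H, Dh)
        g = 2.0 * torch.sigmoid(self.proj(x[..., : self.gate_width]))  # (B, T, H)
        return attn_out * g.unsqueeze(-1)
```

With 11 layers this gives 96 x 11 = 1,056 extra parameters, matching the count in the commit message; at init every gate evaluates to exactly 1.0, so the module is a no-op until training moves the weights.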
Summary
Key Innovation
Attention Output Gate (from PR #1667 @MarioPaerle)
Per-head multiplicative gate on the attention output, weight-initialized to zero so every head passes through at scale 1.0 at init. Implemented as an inline-safe op with `.contiguous()` barriers so it composes with `fullgraph=True` `torch.compile`.
SmearGate
Input-dependent per-channel residual mixer (13 parameters). Mixes the current token with the previous token (strictly backward-looking, so causal). Zero-initialized lambda means it starts as the identity on the residual stream.
Casefold V4 + Multi-Phase Global SGD TTT (from PR #1670)
Same tokenizer and TTT protocol as PR #1670. Casefold legality is pending organizer review at Issue #1604.
Results
Record Comparison
Compliance (Issue #1017 Track B)
Scoring runs under `torch.no_grad()` before any SGD update. AttnOutGate and SmearGate are pure architectural additions with trained weights; they have no eval-time effect beyond those weights. Analogous gating constructs have precedent in skip gates (PR #549 family), parallel-lane gating (PR #1204 family), and SmearGate (modded-nanogpt).
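The score-before-update discipline mentioned above can be sketched as a loop. This is a hedged illustration of the constraint (each chunk is scored under `torch.no_grad()` before any SGD step sees it, in a single pass over the document), not the PR's actual TTT implementation; all names here are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ttt_eval(model, chunks, optimizer, loss_fn):
    """Score-before-update test-time-training loop (sketch).

    chunks: iterable of (inputs, targets) in document order.
    Each chunk is scored frozen first, then used for one SGD step,
    so the reported loss never benefits from adaptation on that chunk.
    """
    total_loss, total_targets = 0.0, 0
    for x, y in chunks:  # single pass, document order => causal
        with torch.no_grad():  # score first, weights frozen
            total_loss += loss_fn(model(x), y).item() * y.numel()
            total_targets += y.numel()
        loss_fn(model(x), y).backward()  # then adapt on the same chunk
        optimizer.step()
        optimizer.zero_grad()
    return total_loss / total_targets
```

The key property is the ordering inside the loop body: the `no_grad` scoring block precedes the backward/step pair, so no chunk's score reflects an update computed from that chunk.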
Casefold tokenizer: pending organizer review at Issue #1604. The tokenizer is retrained from scratch on casefolded FineWeb — it is not a modification of the standard SP8192 tokenizer. Byte-level BPB is computed correctly via the sentencepiece piece table on the original (non-casefolded) validation bytes.
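The byte-level metric referenced above reduces to a simple conversion: summed cross-entropy in nats, divided by ln(2) times the byte count of the original (non-casefolded) validation text. A minimal sketch of that conversion (function name is an assumption):

```python
import math

def val_bpb(total_loss_nats: float, total_bytes: int) -> float:
    """Bits-per-byte from summed cross-entropy (nats) over the
    original, non-casefolded validation bytes.

    Dividing by the raw byte count (rather than token count) is what
    makes results comparable across tokenizers, including a retrained
    casefold tokenizer with a different vocabulary.
    """
    return total_loss_nats / (math.log(2) * total_bytes)
```

Because the denominator is byte count rather than token count, a tokenizer that compresses better does not get a free metric improvement; only genuinely lower total code length does.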
Lineage
PR #1530 (@samacqua) -> PR #1626 (@dexhunter, multi-phase SGD TTT) -> PR #1670 (@dexhunter, casefold v4 + phased TTT) -> this PR (+ AttnOutGate from PR #1667 @MarioPaerle + SmearGate)
Credits
Test plan