
Record: Casefold V4 + AttnOutGate + Multi-Phase Global SGD TTT — val_bpb 1.05733 (3-seed mean)#1693

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:dexhunter/casefold-v4-attn-outgate-phased-ttt

Conversation

@dexhunter
Contributor

Summary

Key Innovation

Attention Output Gate (from PR #1667 @MarioPaerle)

Per-head multiplicative gate on the attention output, weight-initialized to zero so all heads pass through at scale 1.0 at init. Implemented as an inline-safe op with .contiguous() barriers so it composes with fullgraph=True torch.compile:

def _apply_attn_out_gate_inline(y, x_orig, gate_w):
    # Gate input: the first 12 channels of the pre-attention residual x_orig.
    gate_in = x_orig[:, :, :12].contiguous()
    # Per-head gate in (0, 2); zero-initialized gate_w gives 2 * sigmoid(0) = 1.0,
    # so every head passes through at exactly scale 1.0 at init.
    gate = (2.0 * torch.sigmoid(F.linear(gate_in, gate_w.to(gate_in.dtype)))).contiguous()
    # Broadcast the per-head gate over each head's output channels.
    return y * gate.unsqueeze(-1)
  • 12 (gate width) x 8 heads x 11 layers = 1,056 new parameters
  • Applied in all three attention paths (standard, parallel-residual, depth-recurrent)
  • Negligible throughput cost (<2%)
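The stated parameter count follows directly from the gate shape (a quick arithmetic check using the dimensions above):

```python
# Per layer, the gate is one (num_heads x gate_width) linear: 8 x 12 = 96 weights.
gate_width, num_heads, num_layers = 12, 8, 11
gate_params = gate_width * num_heads * num_layers
assert gate_params == 1056  # matches the 1,056 new parameters claimed above
```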

SmearGate

Input-dependent per-channel residual mixer (13 parameters). Mixes the current token with the previous token (strictly backward-looking, so causal). Zero-initialized lambda means it starts as the identity on the residual stream.
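A minimal sketch of such a mixer, under the assumption that the 13 parameters are 12 gate weights over the leading channels plus one zero-initialized scalar lambda (names and shapes here are hypothetical; the PR's actual per-channel formulation may differ):

```python
import torch
import torch.nn.functional as F

def smear_gate(x, gate_w, smear_lambda):
    """Mix each token with its predecessor, gated per token.

    x: (B, T, D) residual stream; gate_w: (12,); smear_lambda: scalar.
    Strictly backward-looking (token t only sees token t-1), so causal.
    Zero-initialized smear_lambda makes this the identity at init.
    """
    # Previous-token view: shift right by one position, zero-pad position 0.
    prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
    # Input-dependent gate from the first 12 channels of the current token.
    gate = torch.sigmoid(x[:, :, :12] @ gate_w)  # (B, T)
    return x + smear_lambda * gate.unsqueeze(-1) * prev
```

With smear_lambda at its zero init, the residual stream passes through untouched, matching the identity-at-init claim above.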

Casefold V4 + Multi-Phase Global SGD TTT (from PR #1670)

Same tokenizer and TTT protocol as PR #1670. Casefold legality is pending organizer review at Issue #1604.

Results

| Seed | Pre-TTT BPB | Post-TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|--------------|-----------------|------------------|
| 42   | 1.06633     | 1.05693      | 3.04604         | 15,936,269       |
| 0    | 1.06674     | 1.05730      | 3.04712         | 15,937,514       |
| 1234 | 1.06714     | 1.05777      | 3.04846         | 15,938,772       |
| Mean | 1.06674     | 1.05733      | 3.04721         | 15,937,518       |
| Std  |             | 0.00035      |                 |                  |

Record Comparison

| Submission | val_bpb | Delta BPB vs prior | Delta nats vs prior |
|------------|---------|--------------------|---------------------|
| Merged SOTA (PR #1493) | 1.08100 | - | - |
| PR #1530 @samacqua | 1.07336 | - | - |
| PR #1585 @codemath3000 (casefold leader) | 1.06390 | - | - |
| PR #1670 @dexhunter (casefold v4 + phased TTT) | 1.05970 | -0.00420 vs #1585 | - |
| This PR | 1.05733 | -0.00657 vs #1585 / -0.00237 vs #1670 | -0.01697 vs #1585 / -0.00680 vs #1670 |

Compliance (Issue #1017 Track B)

  • Condition 1 (Causality): Sliding-window eval is strictly causal. AttnOutGate is a position-local sigmoid of the current token's channels. SmearGate mixes with the previous token only (backward-looking).
  • Condition 2 (Normalized distribution): Standard softmax over full vocab. Gates modulate hidden states, not logits.
  • Condition 3 (Score before update): Each phase fully scored under torch.no_grad() before any SGD update.
  • Condition 4 (Single pass): Each token scored exactly once per phase.
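Conditions 3 and 4 together pin down the loop ordering. A hypothetical sketch (assuming `model(inputs, targets)` returns the mean loss; the real harness and the phase boundaries [666, 1333, 2000] are more involved):

```python
import torch

def phased_ttt(model, optimizer, phases):
    """Score-before-update over sequential phases of the eval stream.

    phases: list of iterables of (inputs, targets) batches. Each phase is
    fully scored under no_grad (Condition 3), each token exactly once per
    phase (Condition 4), before any SGD step touches the weights.
    """
    total_loss, total_tokens = 0.0, 0
    for phase in phases:
        # 1) Score this entire phase with frozen weights.
        with torch.no_grad():
            for inputs, targets in phase:
                loss = model(inputs, targets)
                total_loss += loss.item() * targets.numel()
                total_tokens += targets.numel()
        # 2) Only then adapt via SGD on the already-scored phase.
        for inputs, targets in phase:
            optimizer.zero_grad()
            model(inputs, targets).backward()
            optimizer.step()
    return total_loss / total_tokens  # mean nats per token
```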

AttnOutGate and SmearGate are pure training-time architectural additions with trained weights; no eval-time effect beyond the trained weights. Analogous gating constructs have precedent in skip gates (PR #549 family), parallel-lane gating (PR #1204 family), and SmearGate (modded-nanogpt).

Casefold tokenizer: pending organizer review at Issue #1604. The tokenizer is retrained from scratch on casefolded FineWeb — it is not a modification of the standard SP8192 tokenizer. Byte-level BPB is computed correctly via the sentencepiece piece table on the original (non-casefolded) validation bytes.
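The normalization described above reduces to dividing total nats by the original byte count (a sketch of the arithmetic only, not the eval harness itself):

```python
import math

def bits_per_byte(total_nats, total_original_bytes):
    """Summed token-level NLL (nats) -> bits per ORIGINAL byte.

    The denominator is the byte length of the non-casefolded validation
    text, so casefolding the tokenizer cannot shrink the normalizer.
    """
    return total_nats / (total_original_bytes * math.log(2))
```

Consistent with the results table: 3.04721 nats/token at 1.05733 BPB implies roughly 3.04721 / (1.05733 · ln 2) ≈ 4.16 bytes per token on the validation set.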

Lineage

PR #1530 (@samacqua) -> PR #1626 (@dexhunter, multi-phase SGD TTT) -> PR #1670 (@dexhunter, casefold v4 + phased TTT) -> this PR (+ AttnOutGate from PR #1667 @MarioPaerle + SmearGate)

Credits

Test plan

  • 3-seed training on 8xH100 SXM (seeds 42, 0, 1234) — all under 600s train, all under 600s eval
  • All artifacts under 16 MB (max 15,938,772 B / 16,000,000 B)
  • Code-size and hyperparameter consistency verified across all 3 seed logs
  • Score-before-update ordering held across all 3 phases (phase boundaries [666, 1333, 2000])
  • Clears the 0.005-nat bar vs merged SOTA and vs casefold leader
  • Judges verify reproducibility
  • Judges confirm casefold legality (Issue #1604: "Clarify which text normalizations are allowed for custom tokenizers")

…TTT — val_bpb 1.05733 (3-seed mean)

Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate
on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base.
Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42:   val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0:    val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality pending organizer review at Issue openai#1604.
AttnOutGate and SmearGate are pure architectural additions and comply with
all Issue openai#1017 conditions (causality, normalized distribution, score-before-
update, single pass).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 17, 2026
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 18, 2026
…verlay

Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds
provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and
run_all.sh/README alignment; new pin reflects the pipeline-patch commit.

Also records the live-guidance absolute-BPB overlay and 04b deprecation
driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request Apr 19, 2026
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192.
Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without
any test-time adaptation. Single seed 1337; compute-constrained
non-record submission — VM went down before the run log could be pushed
so it is not attached. Metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop
injection, Gemma-style global/local attention, Gram Newton-Schulz) +
PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel +
AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept,
@MarioPaerle reintroduction) + new layered local sliding windows
(512 on early/loop layers, 1024 on post-loop layers, split at index 6).

KV-tying on globals dropped vs PR openai#1674. TTT scaffolding (phased
global-SGD + per-doc LoRA, from PR openai#1693 lineage) remains in the file
for experiments but is disabled by default for this submission.
Meirzhan05 added a commit to Meirzhan05/parameter-golf that referenced this pull request Apr 27, 2026
Per PR openai#1667/openai#1693: a tiny linear (gate_width x num_heads, default 12x8 = 96
weights per layer) projects the first 12 dims of the input into per-head gate
values. Scaled to (0, 2) via 2*sigmoid for symmetric pass-through at zero-init.

Total: 1056 extra params (8 heads x 12 width x 11 layers) — ~1KB at fp16.
Zero-init = identity at start (transparent). Lets each head dynamically
suppress noise per-token. Compatible with depth recurrence, parallel residuals,
XSA, and GPTQ (gate weights pass through as fp16, numel < 65536).
