Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations by anantdgoel · Pull Request #413 · openai/parameter-golf

anantdgoel · 2026-03-22T07:06:50Z

Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations

val_bpb: 1.4525 (sliding window, stride=128, GA+VR combined) | 13.2 MB | 1xRTX3090, 1000 steps, 131K batch

Two novel architecture modifications and one negative result. Sharing validated techniques with controlled ablation data.

Contributions

Value Residual (ResFormer) -- -0.015 BPB. Cache V vectors from layer 0, mix into all subsequent layers via learnable scalars. 18 params total. arXiv:2410.17897 (ACL 2025). Enable: VALUE_RESIDUAL=1.
Gated Attention -- -0.003 BPB. Per-head sigmoid gate after SDPA output, eliminating attention sinks. ~37K params. arXiv:2505.06708 (NeurIPS 2025 Best Paper). Enable: GATED_ATTENTION=1.
PPM-C Context Mixer -- +0.0018 BPB (negative result). Classical compression blended with neural softmax. Dilutes predictions on SmearGate+BigramHash models.

The two positive techniques stack additively for -0.017 BPB combined.

Ablation Results

v1024 9L 2xMLP, SmearGate + BigramHash + OrthoInit + WD 0.04, 131K batch, 1000 steps.

Config	Sliding BPB	Delta vs Control
Control	1.4697	--
Gated Attention only	1.4665	-0.0032
Value Residual only	1.4546	-0.0151
GA + VR combined	1.4525	-0.0172
PPM-C (eval-only)	1.2900	+0.0018 (worse)

A production run (11L MLP3x + full community stack + VR + GA, 9500 steps) is in progress. Results in a follow-up submission if competitive.

Files

README.md -- Full writeup with technique details and reproducibility
submission.json -- Metadata
train_gpt.py -- Training script with Value Residual, Gated Attention, XSA, EMA, Partial RoPE, LN Scale

…) with ablations Two novel architecture modifications validated with controlled ablations: - Value Residual: layer-0 V shortcut, 18 scalars, -0.015 BPB - Gated Attention: per-head sigmoid gate, -0.003 BPB - PPM-C: negative result (+0.002 BPB on SmearGate+BigramHash) Combined: -0.017 BPB additive, no interference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…+ Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690) First submission combining 6 independently-proven architecture improvements: - Catalytic Residuals (PR openai#450, -0.024 bpb) - Value Residual/ResFormer (PR openai#413, -0.015 bpb) - Gated Attention (PR openai#413, -0.003 bpb) - BigramHash(10240) (PR openai#450, -0.070 bpb vs 2048) - 12 Layers (-0.023 bpb vs 11L) - 3x MLP 8xH100 SXM: 6981 steps, 85.78ms/step, 15.3MB artifact (int6+zstd)

MatoTeziTanka · 2026-04-12T14:18:57Z

Sibling Draft Review for PR #413

Date: 2026-04-12
Reviewer: Claude Agent (for Mato/@MatoTeziTanka)
NEEDS_LLM_REVIEW: Yes
Status: Ready for detailed code review

PR Summary

Title: Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations
State: OPEN
Mato blocking comment: NO

Marker Analysis

Marker	Found	Notes
target_in_key	False	Custom loss key injection pattern
TTT	True	Test-Time Training integration
SLOT	False	Slot-based attention variant
custom_tokenizer	True	3 patterns detected

Architecture Changes

train_gpt.py modifications: ~1659 lines changed
Quantization: True
EMA: True
LoRA: False
GPTQ: False

Assessment

⚠ REQUIRES_REVIEW - Custom modifications detected:

TTT (Test-Time Training) integration detected
Custom tokenizer patterns found (3 instances)

Recommendation

REVIEW — Detailed code inspection needed before merge

Review Checklist

Code style and formatting (PEP 8, consistency)
Numerical stability of modifications
Integration with existing pipeline
Reproducibility (seed handling, RNG control)
Performance metrics reported correctly
No undocumented dependencies

Next Steps

Full train_gpt.py code review
Validate loss computation and gradients
Check for integration issues with existing pipeline
Verify metrics reproducibility

Generated for: Mato (@MatoTeziTanka)
Scope: NEEDS_LLM_REVIEW sweep
Action: Draft only—no posts to GitHub

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE-pending standard record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

MatoTeziTanka · 2026-04-12T14:18:59Z

Sibling Draft Review for PR #413

Date: 2026-04-12
Reviewer: Claude Agent (for Mato/@MatoTeziTanka)
NEEDS_LLM_REVIEW: Yes
Status: Ready for detailed code review

PR Summary

Title: Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations
State: OPEN
Mato blocking comment: NO

Marker Analysis

Marker	Found	Notes
target_in_key	False	Custom loss key injection pattern
TTT	True	Test-Time Training integration
SLOT	False	Slot-based attention variant
custom_tokenizer	True	3 patterns detected

Architecture Changes

train_gpt.py modifications: ~1659 lines changed
Quantization: True
EMA: True
LoRA: False
GPTQ: False

Assessment

⚠ REQUIRES_REVIEW - Custom modifications detected:

TTT (Test-Time Training) integration detected
Custom tokenizer patterns found (3 instances)

Recommendation

REVIEW — Detailed code inspection needed before merge

Review Checklist

Code style and formatting (PEP 8, consistency)
Numerical stability of modifications
Integration with existing pipeline
Reproducibility (seed handling, RNG control)
Performance metrics reported correctly
No undocumented dependencies

Next Steps

Full train_gpt.py code review
Validate loss computation and gradients
Check for integration issues with existing pipeline
Verify metrics reproducibility

Generated for: Mato (@MatoTeziTanka)
Scope: NEEDS_LLM_REVIEW sweep
Action: Draft only—no posts to GitHub

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE-pending standard record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

MatoTeziTanka · 2026-04-12T14:21:02Z

Community Review — Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations

Compliance flag: Pre-Quant TTT violation (multi-epoch SGD on val_tokens before scoring)

PR 413 — Non-record: Value Residual + Gated Attention (anantdgoel)

Head SHA: c0a2939
Track: non-record-16mb
Reported val_bpb: 1.4525 (pre_quant_val_bpb: 1.4525)
Hardware: 1×RTX3090 (ablations), 1×A6000 (validation)

Check 1: N-gram Family Bug (target token in hash key)

CLEAN. The BigramHash at line 930–936 computes:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

Position i hashes tokens [i] (current) and [i-1] (previous). The current token t[i] is the token being embedded, not the target (next token). This is causally clean — no look-ahead. This is the legal BigramHash pattern (same as baseline). No N-gram family bug.

Check 2: Pre-Quant TTT (multi-epoch AdamW on val_tokens without score-first)

CLOSE TRIGGER. The SGD TTT at lines 429–442 (eval_val_sgd_ttt) runs a two-phase procedure:

Phase 1 (Adapt): multi-epoch SGD over all val_tokens — iterates args.sgd_ttt_epochs (default: 2) passes over the entire validation set, computing loss and updating weights.
Phase 2 (Score): sliding-window scoring on the already-adapted model.

This is adapt-first, score-second on the full val set — the banned Pre-Quant TTT pattern. The optimizer is SGD (not AdamW), but the structure is identical: gradient updates on val_tokens before scoring, across multiple epochs, with no score-first-per-chunk interleaving. The SGD vs. AdamW distinction does not change the violation — the invariant is about adapting on val data before scoring, which this does unambiguously.

Verdict: CLOSE — Pre-Quant TTT violation (multi-epoch SGD on val_tokens before scoring).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: Recommend CLOSE unless the author restructures the flagged code.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.

notapplica mentioned this pull request Mar 22, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

joshuaswarren mentioned this pull request Mar 22, 2026

Non-record: 6-Technique Stack — Catalytic Residuals + Value Residual + Gated Attention + BigramHash(10240) + 12L (val_bpb=1.1690) #474

Open

amaljithkuttamath mentioned this pull request Mar 23, 2026

Record: 11L VR + GA + LeakyReLU² + Legal Score-First TTT (val_bpb=pending) #490

Draft

pentxayc mentioned this pull request Mar 26, 2026

Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer #803

Open

3 tasks

Naazimsnh02 mentioned this pull request Mar 28, 2026

Record: 0.4311 BPB - Complementary Training + Backoff N-gram Mixer + TTT #1033

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations#413

Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations#413
anantdgoel wants to merge 1 commit intoopenai:mainfrom
anantdgoel:value-residual-gated-attention

anantdgoel commented Mar 22, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

anantdgoel commented Mar 22, 2026