Non-Record Universal Transformer submission. (2x Attention layers, 3 Layer MLP, depth scheduling)#1088
serdardoesml wants to merge 26 commits into openai:main
Conversation
Can re-open with clean commit history if needed.
Community Review — Non-Record Universal Transformer submission. (2x Attention layers, 3 Layer MLP, depth scheduling)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Analysis
PR #1088 — Checks performed:
1. ILLEGAL n-gram family bug (target XOR'd into hash key, NOT BigramHash)
2. ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)
3. LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)
4. HOLD scored-region SLOT
5. Architecture novelty

Conclusion
This is a straightforward pure-neural submission. No illegal techniques detected. Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against the deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
This run uses a shared UT-style recurrent block with two attention layers before the MLP, so the model can form circuits like induction heads before passing through the MLP. I also changed the feedforward to a 3-layer MLP (adding a fully connected layer between the up- and down-projections), which felt like a cleaner use of parameters than pushing a standard MLP all the way to 16x just to match KV-pair count (the KV-pair view of MLPs is described in https://arxiv.org/pdf/2505.19488v1).
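To make the block layout concrete, here is a minimal PyTorch sketch of a weight-shared block with two attention sub-layers followed by a 3-layer MLP. This is an illustration rather than the PR's code: the class names (SharedUTBlock, ThreeLayerMLP), the use of nn.MultiheadAttention, the ReLU activations, and the 4x hidden width are all assumptions, and the per-depth pre-norms discussed below are left out here.

```python
import torch.nn as nn
import torch.nn.functional as F

class ThreeLayerMLP(nn.Module):
    """Feedforward with an extra fully connected layer between the up- and down-projections."""
    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        h = hidden_mult * dim
        self.up = nn.Linear(dim, h, bias=False)
        self.mid = nn.Linear(h, h, bias=False)    # the added middle layer
        self.down = nn.Linear(h, dim, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.mid(F.relu(self.up(x)))))

class SharedUTBlock(nn.Module):
    """One recurrent block whose weights are reused at every depth:
    two attention layers, then the 3-layer MLP."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = ThreeLayerMLP(dim)

    def forward(self, x, attn_mask=None):
        # two attention layers back to back, so circuits like induction heads
        # can form before the MLP is applied
        a1, _ = self.attn1(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = x + a1
        a2, _ = self.attn2(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = x + a2
        return x + self.mlp(x)
```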
The norms stay independent across depth, and I add a bias to pre-norms. The bias is important to get this to work, since it acts like a depth embedding, adding a different vector at each depth while still sharing the main weights. For quantization, I reused the noisy QAT idea from the other non-record DepthRecurrence submission. I am not sure how optimal it is here, but it helped a bit on quantized BPB.
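Below is a rough sketch of the two ideas in that paragraph, assuming a LayerNorm-style pre-norm: one independent norm per depth, whose learnable bias injects a different vector at each depth while the block weights stay shared, plus a simple noisy-QAT weight perturbation. DepthWiseNorms and noisy_qat_weight are hypothetical names, and the exact noise formulation used in the DepthRecurrence submission may differ.

```python
import torch
import torch.nn as nn

class DepthWiseNorms(nn.Module):
    """One pre-norm per depth step. The recurrent block's weights are shared,
    but each depth gets its own norm scale and bias; the bias adds a different
    vector at every depth, acting like a depth embedding."""
    def __init__(self, dim: int, max_depth: int):
        super().__init__()
        # nn.LayerNorm with elementwise affine (the default) has a learnable bias
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(max_depth))

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        return self.norms[depth](x)

def noisy_qat_weight(weight: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Noisy QAT sketch: add uniform noise at the int8 quantization step size
    during training so the weights become robust to rounding at export time."""
    step = weight.abs().max() / (2 ** (bits - 1) - 1)   # symmetric max-abs scale
    noise = (torch.rand_like(weight) - 0.5) * step
    # detach the noise so gradients flow only through the clean weight
    return weight + noise.detach()
```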
A big part of making this competitive is the layer/depth schedule. Training at a lower depth early on is something the UT's weight sharing makes possible, and it saves a lot of time. There could be ways to speed it up even further with early-exit strategies.
All scheduled depths are compiled up front in the warmup/priming stage (an idea I got from the modded-nanogpt speedrun), so we don't hit recompiles when switching. This run uses NUM_LAYER_SCHEDULE=0:2,2000:6; the schedule itself can probably be tuned a lot more, since with limited compute I could only guess what would transfer to full 8xH100 scale, and it doesn't seem optimal.
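A minimal sketch of the schedule handling, assuming NUM_LAYER_SCHEDULE is a comma-separated list of step:depth pairs (which matches the value above). parse_layer_schedule, depth_at, prime_all_depths, and the depth= keyword on the compiled model are hypothetical names for illustration, not the PR's actual code.

```python
def parse_layer_schedule(spec: str) -> list[tuple[int, int]]:
    """Parse "0:2,2000:6" into [(0, 2), (2000, 6)]: from step 0 run 2 depths,
    from step 2000 run 6."""
    pairs = [part.split(":") for part in spec.split(",")]
    return sorted((int(step), int(depth)) for step, depth in pairs)

def depth_at(schedule: list[tuple[int, int]], step: int) -> int:
    """Depth active at a given step: the last entry whose start step is <= step."""
    depth = schedule[0][1]
    for start, d in schedule:
        if step >= start:
            depth = d
    return depth

def prime_all_depths(compiled_model, dummy_batch, schedule):
    """Run one forward/backward at every scheduled depth during warmup so each
    graph is traced once and no recompiles happen when the schedule switches."""
    for _, depth in schedule:
        loss = compiled_model(dummy_batch, depth=depth)  # hypothetical depth kwarg
        loss.backward()
        compiled_model.zero_grad(set_to_none=True)

schedule = parse_layer_schedule("0:2,2000:6")
assert depth_at(schedule, 100) == 2 and depth_at(schedule, 6011) == 6
```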
I removed the UNet-style extra skip connections for simplicity, as I'm not sure they're a good fit for shared weights. Another direction to explore could be re-adding them with two sets of weights, one for encoder layers and one for decoder layers, and repeating both.
It also does not include any of the leaderboard improvements made since the baseline. If I can get more compute I will continue experimenting with it, and I'm confident it could be a good starting point for others later on.
This run trained under the 600s wallclock cap and stopped at step 6011.
Final numbers:
pre-quant val_bpb = 1.2542
final int8+zlib roundtrip val_bpb = 1.25595494
total size = 15,982,324 bytes