
Non-Record Universal Transformer submission. (2x Attention layers, 3 Layer MLP, depth scheduling) #1088

Open

serdardoesml wants to merge 26 commits into openai:main from serdardoesml:main

Conversation

@serdardoesml

This run uses a shared UT-style recurrent block with two attention layers before the MLP, so the model can form circuits like induction heads before each pass through the MLP. I also changed the feedforward to a 3-layer MLP (adding a fully connected layer between the up and down projections), which felt like a cleaner use of parameters than pushing a standard MLP all the way to 16x just to match KV-pair count (the KV-pair view of MLPs is described in https://arxiv.org/pdf/2505.19488v1).
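A minimal sketch of the block structure, assuming standard pre-norm residual wiring (CausalSelfAttention, ThreeLayerMLP, and SharedUTBlock are illustrative names, not the identifiers in train_gpt.py; the activation choice is also an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal causal self-attention; stands in for the real attention module."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class ThreeLayerMLP(nn.Module):
    """Up projection -> added middle FC layer -> down projection."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.mid = nn.Linear(hidden, hidden, bias=False)  # the extra layer
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.mid(F.relu(self.up(x)))))

class SharedUTBlock(nn.Module):
    """The single recurrent block: two attention passes, then the MLP.
    One instance is applied at every depth, so its weights are shared."""
    def __init__(self, dim: int, n_heads: int, mlp_hidden: int):
        super().__init__()
        self.attn1 = CausalSelfAttention(dim, n_heads)
        self.attn2 = CausalSelfAttention(dim, n_heads)
        self.mlp = ThreeLayerMLP(dim, mlp_hidden)

    def forward(self, x, norms):
        n1, n2, n3 = norms  # per-depth pre-norms (see the next sketch)
        x = x + self.attn1(n1(x))  # first attention pass
        x = x + self.attn2(n2(x))  # second pass: room for induction-style circuits
        x = x + self.mlp(n3(x))
        return x
```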

The norms stay independent across depth, and I add a bias to the pre-norms. The bias is important for getting this to work: it acts like a depth embedding, adding a different vector at each depth while the main weights stay shared. For quantization, I reused the noisy QAT idea from the other non-record DepthRecurrence submission. I am not sure how optimal it is here, but it helped a bit on quantized BPB.
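A sketch of the per-depth pre-norm with bias, plus my rough understanding of the noisy QAT trick (DepthNorm and noisy_fake_quant are hypothetical names; the exact noise scheme in the DepthRecurrence submission may differ):

```python
import torch
import torch.nn as nn

class DepthNorm(nn.Module):
    """RMS-style pre-norm with a learned bias; one instance per depth.
    The bias adds a different vector at each depth, acting like a depth
    embedding while the block weights themselves stay shared."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(1e-6).rsqrt()
        return x * rms * self.weight + self.bias

def noisy_fake_quant(w: torch.Tensor) -> torch.Tensor:
    """Noisy QAT: during training, perturb weights by up to half an int8
    quantization step instead of hard rounding, so the model learns to
    tolerate quantization error; straight-through trick keeps gradients."""
    scale = w.abs().max().clamp(min=1e-12) / 127.0
    noise = torch.rand_like(w) - 0.5  # uniform in [-0.5, 0.5) quant steps
    q = (w / scale + noise).clamp(-127, 127) * scale
    return q.detach() + w - w.detach()  # forward: noisy-quantized; backward: identity
```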

A big part of making this competitive is the layer/depth schedule. Training at a lower depth early on is something UT enables, and it can save a lot of time. Early-exit strategies could speed it up even further.

All scheduled depths are compiled up front in the warmup/priming stage (an idea I got from the modded-nanogpt speedrun), so we don't hit recompiles when switching. This run uses NUM_LAYER_SCHEDULE=0:2,2000:6. The schedule itself can probably be tuned a lot more: with limited compute I could only guess what would transfer to full 8xH100 scale, and it doesn't seem optimal.
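A sketch of how the schedule string could be parsed and every depth primed during warmup (parse_layer_schedule, depth_at, prime_all_depths, and the num_layers kwarg are illustrative; the actual plumbing in train_gpt.py may differ):

```python
import torch

def parse_layer_schedule(spec: str):
    """Parse 'step:depth,step:depth' pairs, e.g. '0:2,2000:6' -> [(0, 2), (2000, 6)]."""
    return sorted(tuple(int(v) for v in item.split(":")) for item in spec.split(","))

def depth_at(step: int, schedule) -> int:
    """Depth in effect at a given step: the last entry whose start <= step."""
    depth = schedule[0][1]
    for start, d in schedule:
        if step >= start:
            depth = d
    return depth

def prime_all_depths(model, schedule, vocab_size, batch_size, seq_len):
    """Run one forward/backward at every scheduled depth during warmup so
    torch.compile traces each depth's graph up front; switching depths later
    then hits a cached graph instead of triggering a recompile."""
    for _, depth in schedule:
        x = torch.randint(0, vocab_size, (batch_size, seq_len), device="cuda")
        loss = model(x, num_layers=depth)  # hypothetical num_layers kwarg
        loss.backward()
    model.zero_grad(set_to_none=True)

schedule = parse_layer_schedule("0:2,2000:6")
```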

I removed the UNet-style extra skip connections for simplicity, as I'm not sure they are a good fit for shared weights. Another direction to explore could be re-adding them with two sets of weights, one for encoder layers and one for decoder layers, then repeating both.

It also does not include any of the leaderboard improvements made since the baseline. If I can get more compute I will continue experimenting with it, and I'm confident it could be a good starting point for others later on.

This run trained under the 600s wallclock cap and stopped at step 6011.

Final numbers:
- pre-quant val_bpb = 1.2542
- final int8+zlib roundtrip val_bpb = 1.25595494 (measurement sketched below)
- total size = 15,982,324 bytes
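For reference, the roundtrip number can be reproduced with something like the sketch below (per-tensor symmetric int8 and zlib level 9 are assumptions; the actual script may quantize at a different granularity or skip some tensors):

```python
import io
import zlib
import torch

def int8_zlib_roundtrip(state_dict):
    """Quantize each float tensor to symmetric int8, measure the
    zlib-compressed artifact size, and return the dequantized weights so
    the roundtrip val_bpb is evaluated on exactly what ships."""
    packed, restored = {}, {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):
            packed[name] = (w, None)  # keep non-float tensors as-is
            restored[name] = w
            continue
        scale = w.abs().max().clamp(min=1e-12) / 127.0
        q = (w / scale).round().clamp(-127, 127).to(torch.int8)
        packed[name] = (q, scale)
        restored[name] = q.float() * scale
    buf = io.BytesIO()
    torch.save(packed, buf)
    return len(zlib.compress(buf.getvalue(), level=9)), restored
```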

@serdardoesml
Author

I can re-open with a clean commit history if needed.

@MatoTeziTanka

Community Review — Non-Record Universal Transformer submission. (2x Attention layers, 3 Layer MLP, depth scheduling)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Analysis

PR #1088: records/track_non_record_16mb/2026-03-27_UT_DoubleAttn_3L-MLP/train_gpt.py
Head SHA: e2ae49483e66807be0c52a2126bcef2bd062de2a

Checks performed

1. ILLEGAL n-gram family bug (target XOR'd into hash key, NOT BigramHash)
CLEAR. No n-gram, bigram, BigramHash, or hash-key construction of any kind is present. A full grep for ngram, bigram, BigramHash, xor, and ^-based hash patterns returned zero matches.

2. ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)
CLEAR. val_tokens is accessed exclusively inside eval_val() (lines 212–283), which runs under torch.inference_mode() (line 255) and calls model.eval() (line 254) / model.train() (line 282). No .backward(), opt.step(), or gradient accumulation ever touches val_tokens. The only two calls to eval_val are the mid-training validation at line 1068 and the post-quantization roundtrip check at line 1213. Neither constitutes TTT.

3. LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)
NOT PRESENT — no TTT of any kind. No is_last_chunk, score_first, or scored-region gating found.

4. HOLD scored-region SLOT
NOT PRESENT. No scored-region slot logic detected.

5. Architecture novelty
The submission implements a Universal Transformer variant with two shared attention modules (shared_attn + shared_attn_2, lines 697–698) and one shared MLP (shared_mlp, line 699), applied across all blocks (Block.forward lines 663–671). Each Block has its own per-layer norm and scale parameters. This is a clean pure-neural architecture: standard causal Transformer training on train shards, Muon + Adam optimizers, layer schedule (depth warmup 2→6 layers at step 2000), noisy QAT int8 during training, and final int8+zlib quantization. No external data manipulation, no val-set gradient signals.

Conclusion

This is a straightforward pure-neural submission. No illegal techniques detected.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
