Non-Record Universal Transformer submission. (2x Attention layers, 3 Layer MLP, depth scheduling)#1088
serdardoesml wants to merge 26 commits into openai:main
Conversation
Can re-open with clean commit history if needed.
Community Review — Non-Record Universal Transformer submission. (2x Attention layers, 3 Layer MLP, depth scheduling)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Analysis
PR #1088 — Checks performed:
1. ILLEGAL n-gram family bug (target XOR'd into hash key, NOT BigramHash)
2. ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)
3. LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)
4. HOLD scored-region SLOT
5. Architecture novelty

Conclusion
This is a straightforward pure-neural submission. No illegal techniques detected. Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against the deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
This run uses a shared UT-style recurrent block with two attention layers before the MLP, so the model can form circuits like induction heads before passing through the MLP. I also changed the feedforward to a 3-layer MLP (adding a fully connected layer between the up- and down-projections), which felt like a cleaner use of parameters than pushing a standard MLP all the way to 16x just to match KV-pair count (the KV-pair view of MLPs is described in https://arxiv.org/pdf/2505.19488v1).
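To make the block layout concrete, here is a minimal PyTorch sketch of a weight-shared block with two attention sub-layers followed by a 3-layer MLP. This is an illustration rather than the PR's code: the class names (SharedUTBlock, ThreeLayerMLP), the use of nn.MultiheadAttention, the ReLU activations, and the 4x hidden width are all assumptions, and the per-depth pre-norms discussed below are left out here.

```python
import torch.nn as nn
import torch.nn.functional as F

class ThreeLayerMLP(nn.Module):
    """Feedforward with an extra fully connected layer between the up- and down-projections."""
    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        h = hidden_mult * dim
        self.up = nn.Linear(dim, h, bias=False)
        self.mid = nn.Linear(h, h, bias=False)    # the added middle layer
        self.down = nn.Linear(h, dim, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.mid(F.relu(self.up(x)))))

class SharedUTBlock(nn.Module):
    """One recurrent block whose weights are reused at every depth:
    two attention layers, then the 3-layer MLP."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = ThreeLayerMLP(dim)

    def forward(self, x, attn_mask=None):
        # two attention layers back to back, so circuits like induction heads
        # can form before the MLP is applied
        a1, _ = self.attn1(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = x + a1
        a2, _ = self.attn2(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = x + a2
        return x + self.mlp(x)
```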
The norms stay independent across depth, and I add a bias to pre-norms. The bias is important to get this to work, since it acts like a depth embedding, adding a different vector at each depth while still sharing the main weights. For quantization, I reused the noisy QAT idea from the other non-record DepthRecurrence submission. I am not sure how optimal it is here, but it helped a bit on quantized BPB.
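Below is a rough sketch of the two ideas in that paragraph, assuming a LayerNorm-style pre-norm: one independent norm per depth, whose learnable bias injects a different vector at each depth while the block weights stay shared, plus a simple noisy-QAT weight perturbation. DepthWiseNorms and noisy_qat_weight are hypothetical names, and the exact noise formulation used in the DepthRecurrence submission may differ.

```python
import torch
import torch.nn as nn

class DepthWiseNorms(nn.Module):
    """One pre-norm per depth step. The recurrent block's weights are shared,
    but each depth gets its own norm scale and bias; the bias adds a different
    vector at every depth, acting like a depth embedding."""
    def __init__(self, dim: int, max_depth: int):
        super().__init__()
        # nn.LayerNorm with elementwise affine (the default) has a learnable bias
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(max_depth))

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        return self.norms[depth](x)

def noisy_qat_weight(weight: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Noisy QAT sketch: add uniform noise at the int8 quantization step size
    during training so the weights become robust to rounding at export time."""
    step = weight.abs().max() / (2 ** (bits - 1) - 1)   # symmetric max-abs scale
    noise = (torch.rand_like(weight) - 0.5) * step
    # detach the noise so gradients flow only through the clean weight
    return weight + noise.detach()
```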
A big part of making this competitive is the layer/depth schedule. Training at a lower depth early on is something the UT's weight sharing makes possible, and it saves a lot of time. There could be ways to speed it up even further with early-exit strategies.
All scheduled depths are compiled up front in the warmup/priming stage (an idea I got from the modded-nanogpt speedrun), so we don't hit recompiles when switching. This run uses NUM_LAYER_SCHEDULE=0:2,2000:6; the schedule itself can probably be tuned a lot more, since with limited compute I could only guess what would transfer to full 8xH100 scale, and it doesn't seem optimal.
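A minimal sketch of the schedule handling, assuming NUM_LAYER_SCHEDULE is a comma-separated list of step:depth pairs (which matches the value above). parse_layer_schedule, depth_at, prime_all_depths, and the depth= keyword on the compiled model are hypothetical names for illustration, not the PR's actual code.

```python
def parse_layer_schedule(spec: str) -> list[tuple[int, int]]:
    """Parse "0:2,2000:6" into [(0, 2), (2000, 6)]: from step 0 run 2 depths,
    from step 2000 run 6."""
    pairs = [part.split(":") for part in spec.split(",")]
    return sorted((int(step), int(depth)) for step, depth in pairs)

def depth_at(schedule: list[tuple[int, int]], step: int) -> int:
    """Depth active at a given step: the last entry whose start step is <= step."""
    depth = schedule[0][1]
    for start, d in schedule:
        if step >= start:
            depth = d
    return depth

def prime_all_depths(compiled_model, dummy_batch, schedule):
    """Run one forward/backward at every scheduled depth during warmup so each
    graph is traced once and no recompiles happen when the schedule switches."""
    for _, depth in schedule:
        loss = compiled_model(dummy_batch, depth=depth)  # hypothetical depth kwarg
        loss.backward()
        compiled_model.zero_grad(set_to_none=True)

schedule = parse_layer_schedule("0:2,2000:6")
assert depth_at(schedule, 100) == 2 and depth_at(schedule, 6011) == 6
```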
I removed the UNet-style extra skip connections for simplicity, as I'm not sure they're a good fit for shared weights. Another direction to explore could be re-adding them with two sets of weights, one for encoder layers and one for decoder layers, and repeating both.
It also does not include any of the leaderboard improvements made since the baseline. If I can get more compute I will continue experimenting with it, and I'm confident it could be a good starting point for others later on.
This run trained under the 600s wallclock cap and stopped at step 6011.
Final numbers:
pre-quant val_bpb = 1.2542
final int8+zlib roundtrip val_bpb = 1.25595494
total size = 15,982,324 bytes