
Non-record: Universal Transformer (4h unlimited compute track)#1206

Open
oneKn8 wants to merge 1 commit into openai:main from oneKn8:universal-transformer-4h

Conversation


@oneKn8 oneKn8 commented Apr 1, 2026

Summary

  • Track: Unlimited compute (4 hours on 8xH100)
  • Architecture: Single shared 1024d transformer block iterated 24 times with learned step embeddings (minimal sketch after this list)
  • Status: Awaiting compute access for training results
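
For readers skimming, here is a minimal sketch of the iteration scheme described above, assuming a standard pre-norm block. `SharedBlock`, `UniversalTransformerSketch`, and the hyperparameter defaults are illustrative names and values, not the actual classes in train_gpt.py; causal masking and GQA are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBlock(nn.Module):
    """Stand-in pre-norm block: self-attention + 3x-wide MLP with LeakyReLU(0.5)**2.
    Causal masking and GQA (4 KV heads) are omitted here for brevity."""
    def __init__(self, d_model: int, n_heads: int = 16):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, 3 * d_model)
        self.fc2 = nn.Linear(3 * d_model, d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        h = F.leaky_relu(self.fc1(self.ln2(x)), negative_slope=0.5) ** 2
        return x + self.fc2(h)

class UniversalTransformerSketch(nn.Module):
    """One shared block applied num_iterations times, with a learned
    per-iteration step embedding added to the residual stream each pass."""
    def __init__(self, vocab_size: int = 1024, d_model: int = 1024, num_iterations: int = 24):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.block = SharedBlock(d_model)                # single set of block weights
        self.step_emb = nn.Parameter(torch.zeros(num_iterations, d_model))
        self.norm = nn.LayerNorm(d_model)
        self.num_iterations = num_iterations

    def forward(self, idx):
        x = self.tok_emb(idx)                            # [batch, seq, d_model]
        for step in range(self.num_iterations):
            x = x + self.step_emb[step]                  # lets the block tell iterations apart
            x = self.block(x)                            # same parameters on every pass
        return self.norm(x) @ self.tok_emb.weight.T      # tied output head -> logits
```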

This directly responds to the Requests for PRs item:

Universal transformer - We have lots of depth recurrence submissions, but I'd love to see one 4 hour

Key design decisions

Parameter comparison

Property                 Baseline (9L x 512d)   This UT (1x1024d x 24 iter)
Unique params            ~17M                   ~10M
Effective depth          9                      24
Model width              512                    1024
Compressed size (est.)   ~15.9MB                ~9.3MB
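
As a rough sanity check of the ~10M figure (not the exact accounting in train_gpt.py), assuming the shapes described in the review below — GQA with 16 query heads and 4 KV heads, 3x MLP, tied 1024-token embedding — and ignoring norms and biases:

```python
d, vocab, n_iter, n_heads, n_kv = 1024, 1024, 24, 16, 4
head_dim = d // n_heads

attn = d * d                         # Q projection
attn += 2 * d * (n_kv * head_dim)    # K and V projections (4 KV heads under GQA)
attn += d * d                        # output projection
mlp = 2 * d * (3 * d)                # up + down projections, 3x hidden width
block = attn + mlp                   # one shared block, reused for all 24 iterations

emb = vocab * d                      # tied input/output embedding
step = n_iter * d                    # learned step embeddings, shape [24, 1024]

total = block + emb + step           # norms and biases omitted
print(f"~{total / 1e6:.1f}M unique parameters")   # ~10.0M, in line with the table above
```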

Why this should beat the 4-Hour Baseline (1.2074 BPB)

The current 4-hour unlimited compute entry is just the baseline trained longer. This UT has:

  1. 2.7x effective depth (24 vs 9 iterations)
  2. 2x model width (1024d vs 512d)
  3. Modern activation (LeakyReLU(0.5)^2 vs relu^2)
  4. ~6MB headroom for BigramHash, int6 GPTQ, and other techniques (rough size estimate sketched after this list)
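
On item 4, a hedged illustration of how the compressed-artifact headroom can be estimated: per-tensor symmetric int8 quantization followed by zlib, in the spirit of the int8 + zlib packaging noted in the review below. The function name and exact scheme are illustrative, not the submission's actual export code.

```python
import zlib

import numpy as np
import torch

def estimated_artifact_mb(model: torch.nn.Module, level: int = 9) -> float:
    """Rough size estimate: symmetric per-tensor int8 quantization, then zlib."""
    blobs = []
    for p in model.parameters():
        w = p.detach().float().cpu().numpy()
        scale = float(np.abs(w).max()) / 127.0 + 1e-12       # per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blobs.append(q.tobytes())
        blobs.append(np.float32(scale).tobytes())            # keep scale for dequantization
    return len(zlib.compress(b"".join(blobs), level)) / 2**20

# Hypothetical usage: per the estimates above, ~10M int8 params land around 9-10 MB
# after compression, leaving roughly 6 MB under the 16 MB cap.
# print(f"{estimated_artifact_mb(model):.1f} MB")
```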

Training logs will be added once compute is available.

Test plan

  • Python syntax validation passes
  • Code is under 1500-line limit (1106 lines)
  • All submission files present (train_gpt.py, README.md, submission.json)
  • Training on 8xH100 (pending compute access)
  • val_bpb and artifact size verified
  • 3-seed reproducibility check

Single shared 1024d transformer block iterated 24 times with learned
step embeddings. Weight sharing frees parameter budget for 2x wider
model vs baseline. Targets the unlimited compute track request for a
proper Universal Transformer submission.
@MatoTeziTanka

Community Review — Non-record: Universal Transformer (4h unlimited compute track)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1206 — UniversalTransformer_4h
Head SHA: a19c08b
Files changed:
- records/track_non_record_16mb/2026-03-31_UniversalTransformer_4h/README.md
- records/track_non_record_16mb/2026-03-31_UniversalTransformer_4h/submission.json
- records/track_non_record_16mb/2026-03-31_UniversalTransformer_4h/train_gpt.py

---

### Architecture

UniversalGPT (lines 625–709): a single shared Block iterated num_iterations=24 times (lines 695–698), with learned step_embeddings (one per iteration, shape [24, 1024]) added to the residual stream before each pass so the block can differentiate early vs. late iterations.

- 1024d model, 16 heads, 4 KV heads (GQA), 3× MLP, LeakyReLU(0.5)² activation
- Tied embeddings, vocab_size=1024 (sp1024 tokenizer), seq_len=2048
- 4-hour wallclock budget on 8×H100

### Check results

- ILLEGAL n-gram family bug: NOT PRESENT. No BigramHash, no n-gram hash construction, no XOR of target into hash key. Zero n-gram machinery in file.
- ILLEGAL Pre-Quant TTT: NOT PRESENT. val_tokens is used only in eval_val() (lines 224–280), which runs under torch.inference_mode() with model.eval() — pure read-only evaluation, no gradient computation, no optimizer step, no multi-epoch loop over val data.
- LEGAL score-first TTT (PR #1413 pattern): NOT PRESENT. No TTT of any kind — no test-time adaptation, no is_last_chunk guard, no scored-region slot.
- HOLD scored-region SLOT: NOT PRESENT. No slot mechanism, no scored-region boundary logic.
- CLEAN pure neural: CONFIRMED. The model is a straightforward Universal Transformer: standard cross-entropy next-token prediction on train data only, validated on val data read-only. No auxiliary losses, no hash tables, no retrieval, no TTT.

### Conclusion

This is a legitimate architectural experiment (weight-shared Universal Transformer) with no rule violations. Training uses only train-split tokens for gradient updates. Val tokens are read-only for BPB measurement. Quantization (int8 + zlib) is applied...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

