
Non-record: Universal Transformer (4h unlimited compute track)#1206

Open
oneKn8 wants to merge 1 commit into openai:main from oneKn8:universal-transformer-4h

Conversation


@oneKn8 oneKn8 commented Apr 1, 2026

Summary

  • Track: Unlimited compute (4 hours on 8xH100)
  • Architecture: Single shared 1024d transformer block iterated 24 times with learned step embeddings (minimal sketch after this list)
  • Status: Awaiting compute access for training results
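
For readers skimming, here is a minimal sketch of the iteration scheme described above, assuming a standard pre-norm block. `SharedBlock`, `UniversalTransformerSketch`, and the hyperparameter defaults are illustrative names and values, not the actual classes in train_gpt.py; causal masking and GQA are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBlock(nn.Module):
    """Stand-in pre-norm block: self-attention + 3x-wide MLP with LeakyReLU(0.5)**2.
    Causal masking and GQA (4 KV heads) are omitted here for brevity."""
    def __init__(self, d_model: int, n_heads: int = 16):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, 3 * d_model)
        self.fc2 = nn.Linear(3 * d_model, d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        h = F.leaky_relu(self.fc1(self.ln2(x)), negative_slope=0.5) ** 2
        return x + self.fc2(h)

class UniversalTransformerSketch(nn.Module):
    """One shared block applied num_iterations times, with a learned
    per-iteration step embedding added to the residual stream each pass."""
    def __init__(self, vocab_size: int = 1024, d_model: int = 1024, num_iterations: int = 24):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.block = SharedBlock(d_model)                # single set of block weights
        self.step_emb = nn.Parameter(torch.zeros(num_iterations, d_model))
        self.norm = nn.LayerNorm(d_model)
        self.num_iterations = num_iterations

    def forward(self, idx):
        x = self.tok_emb(idx)                            # [batch, seq, d_model]
        for step in range(self.num_iterations):
            x = x + self.step_emb[step]                  # lets the block tell iterations apart
            x = self.block(x)                            # same parameters on every pass
        return self.norm(x) @ self.tok_emb.weight.T      # tied output head -> logits
```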

This directly responds to the Requests for PRs item:

Universal transformer - We have lots of depth recurrence submissions, but I'd love to see one 4 hour

Key design decisions

Parameter comparison

Property                 Baseline (9L x 512d)   This UT (1x1024d x 24 iter)
Unique params            ~17M                   ~10M
Effective depth          9                      24
Model width              512                    1024
Compressed size (est.)   ~15.9MB                ~9.3MB
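
As a rough sanity check of the ~10M figure (not the exact accounting in train_gpt.py), assuming the shapes described in the review below — GQA with 16 query heads and 4 KV heads, 3x MLP, tied 1024-token embedding — and ignoring norms and biases:

```python
d, vocab, n_iter, n_heads, n_kv = 1024, 1024, 24, 16, 4
head_dim = d // n_heads

attn = d * d                         # Q projection
attn += 2 * d * (n_kv * head_dim)    # K and V projections (4 KV heads under GQA)
attn += d * d                        # output projection
mlp = 2 * d * (3 * d)                # up + down projections, 3x hidden width
block = attn + mlp                   # one shared block, reused for all 24 iterations

emb = vocab * d                      # tied input/output embedding
step = n_iter * d                    # learned step embeddings, shape [24, 1024]

total = block + emb + step           # norms and biases omitted
print(f"~{total / 1e6:.1f}M unique parameters")   # ~10.0M, in line with the table above
```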

Why this should beat the 4-Hour Baseline (1.2074 BPB)

The current 4-hour unlimited compute entry is just the baseline trained longer. This UT has:

  1. 2.7x effective depth (24 vs 9 iterations)
  2. 2x model width (1024d vs 512d)
  3. Modern activation (LeakyReLU(0.5)^2 vs relu^2)
  4. ~6MB headroom for BigramHash, int6 GPTQ, and other techniques (rough size estimate sketched after this list)
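
On item 4, a hedged illustration of how the compressed-artifact headroom can be estimated: per-tensor symmetric int8 quantization followed by zlib, in the spirit of the int8 + zlib packaging noted in the review below. The function name and exact scheme are illustrative, not the submission's actual export code.

```python
import zlib

import numpy as np
import torch

def estimated_artifact_mb(model: torch.nn.Module, level: int = 9) -> float:
    """Rough size estimate: symmetric per-tensor int8 quantization, then zlib."""
    blobs = []
    for p in model.parameters():
        w = p.detach().float().cpu().numpy()
        scale = float(np.abs(w).max()) / 127.0 + 1e-12       # per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blobs.append(q.tobytes())
        blobs.append(np.float32(scale).tobytes())            # keep scale for dequantization
    return len(zlib.compress(b"".join(blobs), level)) / 2**20

# Hypothetical usage: per the estimates above, ~10M int8 params land around 9-10 MB
# after compression, leaving roughly 6 MB under the 16 MB cap.
# print(f"{estimated_artifact_mb(model):.1f} MB")
```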

Training logs will be added once compute is available.

Test plan

  • Python syntax validation passes
  • Code is under 1500-line limit (1106 lines)
  • All submission files present (train_gpt.py, README.md, submission.json)
  • Training on 8xH100 (pending compute access)
  • val_bpb and artifact size verified
  • 3-seed reproducibility check

Single shared 1024d transformer block iterated 24 times with learned
step embeddings. Weight sharing frees parameter budget for 2x wider
model vs baseline. Targets the unlimited compute track request for a
proper Universal Transformer submission.
@MatoTeziTanka

Community Review — Non-record: Universal Transformer (4h unlimited compute track)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1206 — UniversalTransformer_4h
Head SHA: a19c08b
Files changed:
- records/track_non_record_16mb/2026-03-31_UniversalTransformer_4h/README.md
- records/track_non_record_16mb/2026-03-31_UniversalTransformer_4h/submission.json
- records/track_non_record_16mb/2026-03-31_UniversalTransformer_4h/train_gpt.py

---

### Architecture

UniversalGPT (lines 625–709): a single shared Block iterated num_iterations=24 times (lines 695–698), with learned step_embeddings (one per iteration, shape [24, 1024]) added to the residual stream before each pass so the block can differentiate early vs. late iterations.

- 1024d model, 16 heads, 4 KV heads (GQA), 3× MLP, LeakyReLU(0.5)² activation
- Tied embeddings, vocab_size=1024 (sp1024 tokenizer), seq_len=2048
- 4-hour wallclock budget on 8×H100

### Check results

- ILLEGAL n-gram family bug: NOT PRESENT. No BigramHash, no n-gram hash construction, no XOR of target into hash key. Zero n-gram machinery in file.
- ILLEGAL Pre-Quant TTT: NOT PRESENT. val_tokens is used only in eval_val() (lines 224–280), which runs under torch.inference_mode() with model.eval() — pure read-only evaluation, no gradient computation, no optimizer step, no multi-epoch loop over val data.
- LEGAL score-first TTT (PR #1413 pattern): NOT PRESENT. No TTT of any kind — no test-time adaptation, no is_last_chunk guard, no scored-region slot.
- HOLD scored-region SLOT: NOT PRESENT. No slot mechanism, no scored-region boundary logic.
- CLEAN pure neural: CONFIRMED. The model is a straightforward Universal Transformer: standard cross-entropy next-token prediction on train data only, validated on val data read-only. No auxiliary losses, no hash tables, no retrieval, no TTT.

### Conclusion

This is a legitimate architectural experiment (weight-shared Universal Transformer) with no rule violations. Training uses only train-split tokens for gradient updates. Val tokens are read-only for BPB measurement. Quantization (int8 + zlib) is applied...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

