Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088) #1270
**Community Review — Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088)**

**Compliance: LOOKS CLEAN** — legal score-first-per-chunk TTT (PR #1413 pattern)

**Analysis**

- **N-gram Family Bug (ILLEGAL BigramHash):** NOT PRESENT.
- **Pre-Quant TTT (ILLEGAL — multi-epoch on val_tokens before score):** NOT PRESENT. There is no unconditional multi-epoch finetuning on val_tokens before scoring.
- **Legal Score-First TTT (PR #1413 pattern):** PRESENT and correctly implemented. The TTT loop matches the PR #1413 pattern precisely: score chunk N, train on chunk N (to prepare for chunk N+1), skip training after the last scored chunk. A sketch of this discipline follows the list.
- **Scored-Region SLOT (HOLD):** NOT PRESENT. No manipulation of the scored-region slot detected.
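For readers unfamiliar with the pattern, here is a minimal sketch of the score-first-per-chunk loop. All names (`chunks`, `score_chunk`, `train_on_chunk`, `model`) are illustrative placeholders, not the identifiers used in this submission's train_gpt.py:

```python
# A minimal sketch of legal score-first-per-chunk TTT (the PR #1413 pattern).
# score_chunk and train_on_chunk are hypothetical helpers for illustration.

def evaluate_with_ttt(model, chunks):
    loss_sum, total_tokens = 0.0, 0
    for i, chunk in enumerate(chunks):
        # 1. Score chunk i strictly BEFORE the model has trained on it.
        chunk_loss, n_tokens = score_chunk(model, chunk)
        loss_sum += chunk_loss
        total_tokens += n_tokens
        # 2. Train on chunk i only to prepare for chunk i + 1. After the
        #    last scored chunk there is nothing left to score, so training
        #    is skipped, which keeps the run legal.
        if i < len(chunks) - 1:
            train_on_chunk(model, chunk)
    return loss_sum / total_tokens  # aggregate per-token metric
```

The key invariant is the ordering inside the loop body: every token contributes to the score exactly once, while the model is still blind to it.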
**Architecture**

UnifiedAttention with FA3 (flash_attn_interface), UNet-style skip connections.

**Conclusion**

The TTT implementation is compliant with the score-first constraint. No illegal patterns detected. CLEAN.

**Verdict: LOOKS CLEAN** — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk is scored before the model trains on it.

**Recommendation** to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). The TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
Non-Record: Unified Attention + FA3 + 1hr training
val_bpb: 1.1088 | ~15.82 MB | 8×H100 SXM | 1 hour training
Same architecture as our record submission PR #1202 (val_bpb 1.1412, 10-min). This run trains for 1 hour to explore how unified attention scales with unlimited compute.
Beats the current unlimited-compute SOTA (1.1239, 1-bit quantization, 2hr) by 0.015 BPB in half the training time.
Only the schedule changes: ITERATIONS=72000, WARMDOWN_ITERS=10000, QAT_START_FRACTION=0.85. Architecture and eval identical to PR #1202.
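Concretely, the changed constants and the step boundaries they imply. Note that treating warmdown as the final WARMDOWN_ITERS steps is an assumption about the schedule's convention, not something this PR states:

```python
# Schedule constants for this 1-hour run; architecture and eval untouched.
ITERATIONS = 72_000
WARMDOWN_ITERS = 10_000
QAT_START_FRACTION = 0.85

# Derived boundaries (pure arithmetic; the convention that warmdown
# occupies the final WARMDOWN_ITERS steps is assumed, not stated here).
qat_start_step = int(QAT_START_FRACTION * ITERATIONS)  # 61_200
warmdown_start_step = ITERATIONS - WARMDOWN_ITERS      # 62_000
```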
Key finding: unified attention scales with longer training just as standard architectures do. Val_bpb plateaus during the peak-LR phase around step 48K; the 10,000-step warmdown then drives it from 1.223 to 1.133. With 9× more warmdown steps than the 10-min run, the refinement goes much deeper.
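For intuition, a constant-then-linear warmdown multiplier of the kind common in speedrun-style training scripts; this is a sketch of that generic shape, and the exact schedule in train_gpt.py (warmup, LR floor, etc.) may differ:

```python
def lr_multiplier(step: int, iterations: int = 72_000,
                  warmdown_iters: int = 10_000) -> float:
    """Scale on peak LR: 1.0 until warmdown starts, then linear decay to 0."""
    warmdown_start = iterations - warmdown_iters
    if step < warmdown_start:
        return 1.0  # constant peak-LR phase, where val_bpb plateaus (~48K)
    return (iterations - step) / warmdown_iters  # linear warmdown

# e.g. lr_multiplier(48_000) == 1.0, lr_multiplier(67_000) == 0.5
```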
See README for full details.