
Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088) #1270

Open

VirajDeshwal wants to merge 1 commit into openai:main from VirajDeshwal:submission/2026-04-01_UnifiedAttention_FA3_1hour_nonrecord

Conversation

@VirajDeshwal

Non-Record: Unified Attention + FA3 + 1hr training

val_bpb: 1.1088 | ~15.82 MB | 8×H100 SXM | 1 hour training

Same architecture as our record submission PR #1202 (val_bpb 1.1412, 10-min). This run trains for 1 hour to explore how unified attention scales with unlimited compute.

| Run | Steps | Post-TTT bpb | Training time |
| --- | --- | --- | --- |
| Record (PR #1202) | 12,100 | 1.1412 | 10 min |
| This run | 72,000 | 1.1088 | 60 min |

Beats the current unlimited-compute SOTA (1.1239, 1-bit quantization, 2 hr) by 0.015 bpb in half the training time.

Only the schedule changes: ITERATIONS=72000, WARMDOWN_ITERS=10000, QAT_START_FRACTION=0.85. Architecture and eval are identical to PR #1202.
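For context, a minimal sketch of what those three knobs imply for the schedule. The variable names are quoted from this PR, but reading them as environment variables and the derived milestone steps below are illustrative assumptions, not lines from train_gpt.py:

```python
import os

# Schedule knobs for this run; the 10-min record (PR #1202) differs only in these values.
# Env-var parsing is an assumption for illustration, not the actual train_gpt.py code.
ITERATIONS = int(os.environ.get("ITERATIONS", "72000"))              # total optimizer steps
WARMDOWN_ITERS = int(os.environ.get("WARMDOWN_ITERS", "10000"))       # length of final LR decay phase
QAT_START_FRACTION = float(os.environ.get("QAT_START_FRACTION", "0.85"))

# Derived milestones (simple arithmetic on the values above):
warmdown_start = ITERATIONS - WARMDOWN_ITERS             # step 62,000: warmdown begins
qat_start_step = int(QAT_START_FRACTION * ITERATIONS)    # step 61,200: quantization-aware training begins
```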

Key finding: unified attention scales with longer training just as standard architectures do. The model plateaus at peak LR around step 48K, then the 10,000-step warmdown phase drives val_bpb from 1.223 to 1.133. With 9x more warmdown steps than the 10-min run, the refinement goes much deeper.
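A minimal sketch of the learning-rate multiplier this description implies (flat at peak, then a linear warmdown over the final WARMDOWN_ITERS steps); the exact shape in train_gpt.py (warmup, LR floor, etc.) may differ, so treat this as an assumption for illustration:

```python
def lr_multiplier(step: int, total_iters: int = 72_000, warmdown_iters: int = 10_000) -> float:
    """Assumed schedule shape: hold peak LR, then decay linearly to zero over the warmdown."""
    warmdown_start = total_iters - warmdown_iters            # step 62,000 for this run
    if step < warmdown_start:
        return 1.0                                           # plateau at peak LR
    return max((total_iters - step) / warmdown_iters, 0.0)   # linear warmdown over the last 10K steps
```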

See README for full details.

@MatoTeziTanka

Community Review — Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

N-gram Family Bug (ILLEGAL BigramHash)

NOT PRESENT. self.bigram is initialized to None at line 839 and is never assigned a hash-based implementation. The BigramHash illegal pattern (target XOR'd into hash key) does not appear anywhere in the file. The bigram field is dead code in this submission.
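For readers unfamiliar with the flagged pattern, here is a hypothetical contrast between a legal context-only bigram key and the illegal variant the audit searches for (the target byte folded into the hash key); neither snippet appears in this submission, which leaves self.bigram as None:

```python
# Hypothetical illustration of the audited pattern -- not code from this submission.
TABLE_SIZE = 1 << 20

def legal_bigram_key(prev_byte: int, cur_byte: int) -> int:
    # Key built only from already-observed context bytes.
    return ((prev_byte * 31) ^ cur_byte) % TABLE_SIZE

def illegal_bigram_key(prev_byte: int, cur_byte: int, target_byte: int) -> int:
    # ILLEGAL: the byte being predicted is XOR'd into the key,
    # so table lookups leak the answer into the prediction path.
    return ((prev_byte * 31) ^ cur_byte ^ target_byte) % TABLE_SIZE
```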

Pre-Quant TTT (ILLEGAL — multi-epoch on val_tokens before score)

NOT PRESENT. There is no unconditional multi-epoch finetuning on val_tokens before scoring begins.

Legal Score-First TTT (PR #1413 pattern)

PRESENT and correctly implemented. Function eval_val_legal_ttt (line 993) implements the score-first pattern:

  • Lines 1044–1083: For each chunk ci, scoring happens first — model is in eval() mode with torch.no_grad(), computing forward_logits, accumulating loss_sum / token_count / byte_count.
  • Line 1085: is_last_chunk = (ci == num_chunks - 1) guard.
  • Lines 1086–1121: Training only executes if not is_last_chunk and args.legal_ttt_epochs > 0 — the final chunk is never trained on, preventing lookahead into the scored region.

This matches the PR #1413 pattern precisely: score chunk N, train on chunk N (to prepare for chunk N+1), skip training after the last scored chunk.
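A minimal PyTorch sketch of that score-first-per-chunk loop, under stated assumptions: the chunk layout, helper names, and bits-per-byte bookkeeping below paraphrase this review rather than copy eval_val_legal_ttt, so treat it as an illustration of the pattern, not the submission's implementation:

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, optimizer, chunks, legal_ttt_epochs: int = 1) -> float:
    """Score every chunk before any adaptation; never adapt on the last scored chunk.

    `chunks` is assumed to be an ordered list of (inputs, targets, num_bytes) tuples
    from the held-out stream.
    """
    loss_sum, byte_count = 0.0, 0
    num_chunks = len(chunks)

    for ci, (inputs, targets, num_bytes) in enumerate(chunks):
        # 1) Score first: eval mode, no gradients, so the metric reflects a model
        #    that has only ever adapted on *earlier* chunks.
        model.eval()
        with torch.no_grad():
            logits = model(inputs)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
            )
        loss_sum += loss.item()
        byte_count += num_bytes

        # 2) Adapt afterwards, only to prepare for the next chunk; the final scored
        #    chunk is never trained on, so there is no lookahead into the scored region.
        is_last_chunk = (ci == num_chunks - 1)
        if not is_last_chunk and legal_ttt_epochs > 0:
            model.train()
            for _ in range(legal_ttt_epochs):
                train_logits = model(inputs)
                ttt_loss = F.cross_entropy(
                    train_logits.view(-1, train_logits.size(-1)), targets.view(-1)
                )
                optimizer.zero_grad(set_to_none=True)
                ttt_loss.backward()
                optimizer.step()

    # Convert the accumulated nat loss to bits per byte.
    return loss_sum / (byte_count * math.log(2))
```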

Scored-Region SLOT (HOLD)

NOT PRESENT. No manipulation of scored-region slot detected.

Architecture

UnifiedAttention with FA3 (flash_attn_interface), UNet-style skip connections via skip_weights, layer bank weight sharing (unified_bank, output_bank, etc.), smear gating. Legal score-first TTT enabled by default (LEGAL_TTT_ENABLED=1).
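A rough sketch of two of those structural ideas (UNet-style skips scaled by learned skip_weights, and a small layer bank reused across depth). Module names and wiring here are generic placeholders for the pattern, not the actual UnifiedAttention/FA3 code:

```python
import torch
import torch.nn as nn

class BankedUNetStack(nn.Module):
    """Illustrative only: num_layers logical layers that reuse a smaller bank of
    physical blocks, with UNet-style skip connections from the first half of the
    stack into the second half."""

    def __init__(self, dim: int, num_layers: int = 12, bank_size: int = 4):
        super().__init__()
        # Weight sharing: logical layer i uses physical block bank[i % bank_size].
        self.bank = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(bank_size)]
        )
        self.num_layers = num_layers
        # One learned scalar per skip connection (consumed by the second half).
        self.skip_weights = nn.Parameter(torch.ones(num_layers // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i in range(self.num_layers):
            if i < self.num_layers // 2:
                skips.append(x)                                 # encoder half: stash activations
            else:
                j = i - self.num_layers // 2
                x = x + self.skip_weights[j] * skips.pop()      # decoder half: weighted skip
            x = self.bank[i % len(self.bank)](x)
        return x
```

With bank_size=4 and num_layers=12, the stack runs 12 logical layers through only 4 parameter-bearing blocks, which is the kind of weight sharing the unified_bank / output_bank naming suggests.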

Conclusion

The TTT implementation is compliant with the score-first constraint. No illegal patterns detected. CLEAN.

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
