Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088) #1270
**Community Review — Non-Record: Unified Attention + FA3 + 1hr training (val_bpb=1.1088)**

**Compliance: LOOKS CLEAN** — legal score-first-per-chunk TTT (PR #1413 pattern)

**Analysis**

- **N-gram Family Bug (ILLEGAL BigramHash):** NOT PRESENT.
- **Pre-Quant TTT (ILLEGAL — multi-epoch on val_tokens before score):** NOT PRESENT. There is no unconditional multi-epoch finetuning on val_tokens before scoring.
- **Legal Score-First TTT (PR #1413 pattern):** PRESENT and correctly implemented. The TTT loop matches the PR #1413 pattern precisely: score chunk N, train on chunk N (to prepare for chunk N+1), skip training after the last scored chunk. A sketch of this discipline follows the list.
- **Scored-Region SLOT (HOLD):** NOT PRESENT. No manipulation of the scored-region slot detected.
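For readers unfamiliar with the pattern, here is a minimal sketch of the score-first-per-chunk loop. All names (`chunks`, `score_chunk`, `train_on_chunk`, `model`) are illustrative placeholders, not the identifiers used in this submission's train_gpt.py:

```python
# A minimal sketch of legal score-first-per-chunk TTT (the PR #1413 pattern).
# score_chunk and train_on_chunk are hypothetical helpers for illustration.

def evaluate_with_ttt(model, chunks):
    loss_sum, total_tokens = 0.0, 0
    for i, chunk in enumerate(chunks):
        # 1. Score chunk i strictly BEFORE the model has trained on it.
        chunk_loss, n_tokens = score_chunk(model, chunk)
        loss_sum += chunk_loss
        total_tokens += n_tokens
        # 2. Train on chunk i only to prepare for chunk i + 1. After the
        #    last scored chunk there is nothing left to score, so training
        #    is skipped, which keeps the run legal.
        if i < len(chunks) - 1:
            train_on_chunk(model, chunk)
    return loss_sum / total_tokens  # aggregate per-token metric
```

The key invariant is the ordering inside the loop body: every token contributes to the score exactly once, while the model is still blind to it.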
**Architecture**

UnifiedAttention with FA3 (flash_attn_interface), UNet-style skip connections.

**Conclusion**

The TTT implementation is compliant with the score-first constraint. No illegal patterns detected. CLEAN.

**Verdict: LOOKS CLEAN** — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk is scored before the model trains on it.

**Recommendation** to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). The TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
Non-Record: Unified Attention + FA3 + 1hr training
val_bpb: 1.1088 | ~15.82 MB | 8×H100 SXM | 1 hour training
Same architecture as our record submission PR #1202 (val_bpb 1.1412, 10-min). This run trains for 1 hour to explore how unified attention scales with unlimited compute.
Beats the current unlimited-compute SOTA (1.1239, 1-bit quantization, 2hr) by 0.015 BPB in half the training time.
Only the schedule changes: ITERATIONS=72000, WARMDOWN_ITERS=10000, QAT_START_FRACTION=0.85. Architecture and eval identical to PR #1202.
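Concretely, the changed constants and the step boundaries they imply. Note that treating warmdown as the final WARMDOWN_ITERS steps is an assumption about the schedule's convention, not something this PR states:

```python
# Schedule constants for this 1-hour run; architecture and eval untouched.
ITERATIONS = 72_000
WARMDOWN_ITERS = 10_000
QAT_START_FRACTION = 0.85

# Derived boundaries (pure arithmetic; the convention that warmdown
# occupies the final WARMDOWN_ITERS steps is assumed, not stated here).
qat_start_step = int(QAT_START_FRACTION * ITERATIONS)  # 61_200
warmdown_start_step = ITERATIONS - WARMDOWN_ITERS      # 62_000
```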
Key finding: unified attention scales with longer training just as standard architectures do. Val_bpb plateaus during the peak-LR phase around step 48K; the 10,000-step warmdown then drives it from 1.223 to 1.133. With 9× more warmdown steps than the 10-min run, the refinement goes much deeper.
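For intuition, a constant-then-linear warmdown multiplier of the kind common in speedrun-style training scripts; this is a sketch of that generic shape, and the exact schedule in train_gpt.py (warmup, LR floor, etc.) may differ:

```python
def lr_multiplier(step: int, iterations: int = 72_000,
                  warmdown_iters: int = 10_000) -> float:
    """Scale on peak LR: 1.0 until warmdown starts, then linear decay to 0."""
    warmdown_start = iterations - warmdown_iters
    if step < warmdown_start:
        return 1.0  # constant peak-LR phase, where val_bpb plateaus (~48K)
    return (iterations - step) / warmdown_iters  # linear warmdown

# e.g. lr_multiplier(48_000) == 1.0, lr_multiplier(67_000) == 0.5
```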
See README for full details.