
Record: Unified Attention + FA3 + Legal TTT (val_bpb=1.1412, 3-seed)#1202

Open
VirajDeshwal wants to merge 5 commits into openai:main from VirajDeshwal:submission/2026-03-31_UnifiedAttention_FA3_LegalTTT

Conversation


@VirajDeshwal commented on Mar 31, 2026

Unified Attention + FA3 Head-Dim Padding + Legal Score-First TTT

val_bpb: 1.1412 (3-seed mean, std 0.0008) | ~15.97 MB | 8×H100 SXM

| Seed | step_avg | Steps  | Post-TTT bpb | Artifact (bytes) |
|------|----------|--------|--------------|------------------|
| 1337 | 49.6 ms  | 12,088 | 1.1416       | 15,991,687       |
| 42   | 49.6 ms  | 12,109 | 1.1416       | 15,988,916       |
| 2025 | 49.6 ms  | 12,103 | 1.1403       | 15,962,515       |

What's new

  • Unified Attention (Deshwal, 2026): a single W_unified projection replaces the separate Q/K/V projections, cutting attention projection parameters by 67%; the savings are reallocated to the MLP. Q/K/V-like bands form naturally during training (see the first sketch after this list).
  • FA3 Head-Dim Padding: zero-pad head_dim from 44 to 48 for Hopper FA3 compatibility. Mathematically lossless and 1.57× faster than FA2; enables ~12,100 steps in 10 min (see the second sketch after this list).
  • Legal Score-First TTT: SGD on all parameters, 3 epochs per 32K chunk, stride=64 (a sketch accompanies the community review below).
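
A minimal sketch of one reading of the Unified Attention bullet: a single d×d projection whose output serves as queries, keys, and values at once, giving one-third the projection parameters of separate Q/K/V matrices (the claimed ~67% reduction). The class name, the 12-head split of d=528 into head_dim 44, and the use of PyTorch SDPA in place of the FA3 kernel are all assumptions for illustration, not the PR's actual train_gpt.py code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedAttention(nn.Module):
    """Illustrative sketch (NOT the PR's code): one W_unified projection
    shared by Q, K and V.

    Separate d x d projections for Q, K, V cost 3*d^2 parameters; a single
    shared d x d projection costs d^2, i.e. ~67% fewer attention projection
    parameters. Dimensions (d_model=528, head_dim=44) follow the PR
    description; n_heads=12 is inferred from 528 / 44."""

    def __init__(self, d_model: int = 528, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads          # 528 / 12 = 44
        self.w_unified = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        u = self.w_unified(x)                       # the one projection
        u = u.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # The same tensor plays all three roles; training is then free to
        # carve its channels into role-specific "bands" rather than having
        # Q/K/V hard-wired as separate matrices.
        y = F.scaled_dot_product_attention(u, u, u, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(y)
```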
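
The head-dim padding trick is self-contained enough to sketch. FlashAttention kernels typically want head_dim to be a multiple of 8 (44 is not, 48 is), and zero-padding is lossless because zero columns in Q and K contribute nothing to the attention logits, while zero columns in V produce zero output columns that can be sliced off. The sketch below uses PyTorch's scaled_dot_product_attention as a stand-in for the FA3 kernel call; note the explicit scale argument, which must stay at 1/√44 for exact equivalence.

```python
import torch
import torch.nn.functional as F

def padded_attention(q, k, v, target_head_dim: int = 48):
    """Zero-pad head_dim (e.g. 44 -> 48) so Hopper FA3 kernels accept it.
    Sketch only: SDPA stands in for the actual FA3 kernel call.

    q, k, v: (batch, heads, seq, head_dim) tensors."""
    d = q.shape[-1]                      # original head_dim, e.g. 44
    pad = target_head_dim - d            # e.g. 48 - 44 = 4
    if pad > 0:
        q = F.pad(q, (0, pad))           # zero-pad the last dim on the right
        k = F.pad(k, (0, pad))
        v = F.pad(v, (0, pad))
    # Keep the softmax scale at 1/sqrt(44), not 1/sqrt(48), so the padded
    # kernel computes exactly the same attention weights as the unpadded one.
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True, scale=d ** -0.5)
    return y[..., :d]                    # drop the all-zero output columns
```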

Timing

| Phase                      | Time             |
|----------------------------|------------------|
| Training (K=11, d=528, 4H) | 600 s            |
| Quantization + roundtrip   | ~70 s            |
| Legal TTT                  | ~408 s           |
| Total                      | ~18 min (10 + 8) |

@VirajDeshwal force-pushed the submission/2026-03-31_UnifiedAttention_FA3_LegalTTT branch 4 times, most recently from 168cecf to 2882f3a on March 31, 2026 at 22:14
@MatoTeziTanka

Community Review — Record: Unified Attention + FA3 + Legal TTT (val_bpb=1.1412, 3-seed)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

Head SHA: e11ed4b
Submission name: 2026-03-31_UnifiedAttention_FA3_LegalTTT

### N-gram / BigramHash check — CLEAR

No XOR hash, no n-gram family, no BigramHash class present. `self.bigram` at line 839 is set to None and is never assigned any value in the file. References at lines 908–909 and 957–958 are dead-code guards (`if self.bigram is not None:`). No target token is XOR'd into a hash key anywhere. The n-gram bug does not apply.

### Illegal Pre-Quant TTT check — CLEAR

There is no multi-epoch gradient update on val_tokens before scoring. The TTT update loop (lines 1101–1121) is strictly gated by `if not is_last_chunk` (line 1086): the model trains only on chunks whose scored region has already been recorded earlier in the same pass. The final chunk is scored without any update. This is exactly the PR #1413 score-first pattern.

### Legal Score-First TTT check — CONFIRMED

`eval_val_legal_ttt` (line 993) implements the correct protocol:

- Lines 1053–1083: score chunk `ci` under `torch.no_grad()` first; accumulate `loss_sum`, `token_count`, `byte_count`.
- Lines 1085–1086: the `is_last_chunk` guard prevents any gradient update on the final chunk.
- Lines 1101–1121: the update loop runs on chunk `ci` after that chunk's score is already banked, preparing for future chunks.
- Lines 1095–1097: a cosine LR schedule decays across chunks.

The scored-region assignment (lines 1009–1011) maps each window to a chunk by its scored-start token, ensuring no future data is used to update weights before scoring.

### Scored-region SLOT — NOT PRESENT

No scored-region SLOT mechanism detected. Not applicable.

### Architecture

UnifiedAttention +...
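
To make the audited pattern concrete, here is a minimal sketch of a score-first-per-chunk loop under the invariants checked above: each chunk is scored under torch.no_grad() before any optimizer step touches it, the last chunk is never trained on, and the LR decays on a cosine across chunks. Names and hyperparameters are illustrative; the PR's actual eval_val_legal_ttt additionally handles 32K-token chunks, stride-64 windows, and byte counting for bpb.

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt_sketch(model, chunks, base_lr=1e-4, epochs=3):
    """Minimal sketch of the score-first-per-chunk TTT protocol
    (illustrative names/hyperparameters, not the PR's code).

    chunks: list of (inputs, targets) LongTensor pairs.
    Invariant: every chunk is scored under torch.no_grad() BEFORE any
    gradient step uses it, and the final chunk is never trained on."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr)
    loss_sum, token_count = 0.0, 0
    n = len(chunks)
    for ci, (inputs, targets) in enumerate(chunks):
        # 1) Score this chunk first; its loss is banked with no adaptation.
        with torch.no_grad():
            logits = model(inputs)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1), reduction="sum")
        loss_sum += loss.item()
        token_count += targets.numel()
        # 2) Only then adapt on it -- and never on the last scored chunk.
        if ci < n - 1:
            # Cosine decay of the LR across chunks, as noted in the review.
            lr = base_lr * 0.5 * (1 + math.cos(math.pi * ci / max(n - 1, 1)))
            for g in opt.param_groups:
                g["lr"] = lr
            for _ in range(epochs):
                opt.zero_grad(set_to_none=True)
                out = model(inputs)
                F.cross_entropy(out.view(-1, out.size(-1)),
                                targets.view(-1)).backward()
                opt.step()
    # Mean NLL in nats; bpb would be loss_sum / (ln(2) * byte_count).
    return loss_sum / token_count
```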

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

