
Record: Unified Attention + FA3 + Legal TTT (val_bpb=1.1412, 3-seed)#1202

Open
VirajDeshwal wants to merge 5 commits into openai:main from VirajDeshwal:submission/2026-03-31_UnifiedAttention_FA3_LegalTTT

Conversation


@VirajDeshwal commented on Mar 31, 2026

Unified Attention + FA3 Head-Dim Padding + Legal Score-First TTT

val_bpb: 1.1412 (3-seed mean, std 0.0008) | ~15.97 MB | 8×H100 SXM

| Seed | step_avg | Steps  | Post-TTT bpb | Artifact (bytes) |
|------|----------|--------|--------------|------------------|
| 1337 | 49.6 ms  | 12,088 | 1.1416       | 15,991,687       |
| 42   | 49.6 ms  | 12,109 | 1.1416       | 15,988,916       |
| 2025 | 49.6 ms  | 12,103 | 1.1403       | 15,962,515       |

What's new

  • Unified Attention (Deshwal, 2026): a single W_unified projection replaces the separate Q/K/V projections, cutting attention projection parameters by 67%; the savings are reallocated to the MLP. Q/K/V-like bands form naturally during training (see the first sketch after this list).
  • FA3 Head-Dim Padding: zero-pad head_dim from 44 to 48 for Hopper FA3 compatibility. Mathematically lossless and 1.57× faster than FA2; enables ~12,100 steps in 10 min (see the second sketch after this list).
  • Legal Score-First TTT: SGD on all parameters, 3 epochs per 32K chunk, stride=64 (a sketch accompanies the community review below).
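
A minimal sketch of one reading of the Unified Attention bullet: a single d×d projection whose output serves as queries, keys, and values at once, giving one-third the projection parameters of separate Q/K/V matrices (the claimed ~67% reduction). The class name, the 12-head split of d=528 into head_dim 44, and the use of PyTorch SDPA in place of the FA3 kernel are all assumptions for illustration, not the PR's actual train_gpt.py code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedAttention(nn.Module):
    """Illustrative sketch (NOT the PR's code): one W_unified projection
    shared by Q, K and V.

    Separate d x d projections for Q, K, V cost 3*d^2 parameters; a single
    shared d x d projection costs d^2, i.e. ~67% fewer attention projection
    parameters. Dimensions (d_model=528, head_dim=44) follow the PR
    description; n_heads=12 is inferred from 528 / 44."""

    def __init__(self, d_model: int = 528, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads          # 528 / 12 = 44
        self.w_unified = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        u = self.w_unified(x)                       # the one projection
        u = u.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # The same tensor plays all three roles; training is then free to
        # carve its channels into role-specific "bands" rather than having
        # Q/K/V hard-wired as separate matrices.
        y = F.scaled_dot_product_attention(u, u, u, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(y)
```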
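
The head-dim padding trick is self-contained enough to sketch. FlashAttention kernels typically want head_dim to be a multiple of 8 (44 is not, 48 is), and zero-padding is lossless because zero columns in Q and K contribute nothing to the attention logits, while zero columns in V produce zero output columns that can be sliced off. The sketch below uses PyTorch's scaled_dot_product_attention as a stand-in for the FA3 kernel call; note the explicit scale argument, which must stay at 1/√44 for exact equivalence.

```python
import torch
import torch.nn.functional as F

def padded_attention(q, k, v, target_head_dim: int = 48):
    """Zero-pad head_dim (e.g. 44 -> 48) so Hopper FA3 kernels accept it.
    Sketch only: SDPA stands in for the actual FA3 kernel call.

    q, k, v: (batch, heads, seq, head_dim) tensors."""
    d = q.shape[-1]                      # original head_dim, e.g. 44
    pad = target_head_dim - d            # e.g. 48 - 44 = 4
    if pad > 0:
        q = F.pad(q, (0, pad))           # zero-pad the last dim on the right
        k = F.pad(k, (0, pad))
        v = F.pad(v, (0, pad))
    # Keep the softmax scale at 1/sqrt(44), not 1/sqrt(48), so the padded
    # kernel computes exactly the same attention weights as the unpadded one.
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True, scale=d ** -0.5)
    return y[..., :d]                    # drop the all-zero output columns
```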

Timing

| Phase                      | Time             |
|----------------------------|------------------|
| Training (K=11, d=528, 4H) | 600 s            |
| Quantization + roundtrip   | ~70 s            |
| Legal TTT                  | ~408 s           |
| Total                      | ~18 min (10 + 8) |

@VirajDeshwal force-pushed the submission/2026-03-31_UnifiedAttention_FA3_LegalTTT branch 4 times, most recently from 168cecf to 2882f3a on March 31, 2026 at 22:14
@MatoTeziTanka

Community Review — Record: Unified Attention + FA3 + Legal TTT (val_bpb=1.1412, 3-seed)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

Head SHA: e11ed4b
Submission name: 2026-03-31_UnifiedAttention_FA3_LegalTTT

### N-gram / BigramHash check — CLEAR

No XOR hash, no n-gram family, no BigramHash class present. `self.bigram` at line 839 is set to None and is never assigned any value in the file. References at lines 908–909 and 957–958 are dead-code guards (`if self.bigram is not None:`). No target token is XOR'd into a hash key anywhere. The n-gram bug does not apply.

### Illegal Pre-Quant TTT check — CLEAR

There is no multi-epoch gradient update on val_tokens before scoring. The TTT update loop (lines 1101–1121) is strictly gated by `if not is_last_chunk` (line 1086): the model trains only on chunks whose scored region has already been recorded earlier in the same pass. The final chunk is scored without any update. This is exactly the PR #1413 score-first pattern.

### Legal Score-First TTT check — CONFIRMED

`eval_val_legal_ttt` (line 993) implements the correct protocol:

- Lines 1053–1083: score chunk `ci` under `torch.no_grad()` first; accumulate `loss_sum`, `token_count`, `byte_count`.
- Lines 1085–1086: the `is_last_chunk` guard prevents any gradient update on the final chunk.
- Lines 1101–1121: the update loop runs on chunk `ci` after that chunk's score is already banked, preparing for future chunks.
- Lines 1095–1097: a cosine LR schedule decays across chunks.

The scored-region assignment (lines 1009–1011) maps each window to a chunk by its scored-start token, ensuring no future data is used to update weights before scoring.

### Scored-region SLOT — NOT PRESENT

No scored-region SLOT mechanism detected. Not applicable.

### Architecture

UnifiedAttention +...
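
To make the audited pattern concrete, here is a minimal sketch of a score-first-per-chunk loop under the invariants checked above: each chunk is scored under torch.no_grad() before any optimizer step touches it, the last chunk is never trained on, and the LR decays on a cosine across chunks. Names and hyperparameters are illustrative; the PR's actual eval_val_legal_ttt additionally handles 32K-token chunks, stride-64 windows, and byte counting for bpb.

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt_sketch(model, chunks, base_lr=1e-4, epochs=3):
    """Minimal sketch of the score-first-per-chunk TTT protocol
    (illustrative names/hyperparameters, not the PR's code).

    chunks: list of (inputs, targets) LongTensor pairs.
    Invariant: every chunk is scored under torch.no_grad() BEFORE any
    gradient step uses it, and the final chunk is never trained on."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr)
    loss_sum, token_count = 0.0, 0
    n = len(chunks)
    for ci, (inputs, targets) in enumerate(chunks):
        # 1) Score this chunk first; its loss is banked with no adaptation.
        with torch.no_grad():
            logits = model(inputs)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1), reduction="sum")
        loss_sum += loss.item()
        token_count += targets.numel()
        # 2) Only then adapt on it -- and never on the last scored chunk.
        if ci < n - 1:
            # Cosine decay of the LR across chunks, as noted in the review.
            lr = base_lr * 0.5 * (1 + math.cos(math.pi * ci / max(n - 1, 1)))
            for g in opt.param_groups:
                g["lr"] = lr
            for _ in range(epochs):
                opt.zero_grad(set_to_none=True)
                out = model(inputs)
                F.cross_entropy(out.view(-1, out.size(-1)),
                                targets.view(-1)).backward()
                opt.step()
    # Mean NLL in nats; bpb would be loss_sum / (ln(2) * byte_count).
    return loss_sum / token_count
```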

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

