Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094) #1250
ibarrajo wants to merge 1 commit into openai:main from …
Conversation
NEGATIVE RESULT: 8 KV heads (full attention) + LZMA + BigramHash 3072x112. The smaller BigramHash table (3072 rows vs 6144) devastated quality, and full attention is 2.5x slower than GQA (235ms vs 95ms per step), yielding only 2490 training steps. Base val_bpb 1.2195; after TTT 1.2094. Artifact 12.3MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094)

Compliance: LOOKS CLEAN, legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

PR Overview
PR #1250, "Approach K: Fused Triton MLP kernel (#1072) + GQA + LZMA", at …

Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)
CLEAN. The …

Check 2: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)
CLEAN. The …

Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)
PASS.

Check 4: HOLD scored-region SLOT
No scored-region SLOT mechanism found. Not applicable.

Check 5: PURE_NEURAL_CLEAN
Not applicable: the PR uses BigramHash embedding, GQA, and TTT on val_tokens, so …

Summary
The TTT implementation correctly enforces score-before-train ordering with the …

Verdict: LOOKS CLEAN, a legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk is scored under …

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE, pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). The TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
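For reference, the score-first-per-chunk discipline the review certifies can be sketched in a few lines. This is a minimal illustration, not the submission's train_gpt.py code; it assumes a hypothetical `model(chunk)` that returns mean cross-entropy in nats, and that tokens are bytes so dividing by ln(2) gives bits per byte:

```python
import math
import torch

def score_first_ttt(model, optimizer, val_tokens, chunk_size=4096):
    """Test-time training with score-before-train ordering (sketch).

    Every chunk is scored BEFORE any gradient step uses it, so no chunk
    is ever evaluated by weights that already trained on that chunk.
    """
    total_nats, total_tokens = 0.0, 0
    chunks = val_tokens.split(chunk_size)
    for i, chunk in enumerate(chunks):
        # 1) Score this chunk with the current weights.
        with torch.no_grad():
            total_nats += model(chunk).item() * chunk.numel()
        total_tokens += chunk.numel()
        # 2) Only then train on it; skip after the last chunk, mirroring
        #    the is_last_chunk guard (nothing left to score).
        if i < len(chunks) - 1:
            optimizer.zero_grad(set_to_none=True)
            model(chunk).backward()
            optimizer.step()
    return total_nats / total_tokens / math.log(2)  # val_bpb
```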
Summary
Negative result. Replacing GQA with full attention (8 KV heads) and shrinking the BigramHash table to 3072x112 both hurt: TTT improves val_bpb from 1.2195 to 1.2094, but the final number is far off other approaches.
Results
- Base model: val_bpb 1.2195; after TTT: 1.2094
- Step time: 235ms/step with full attention vs 95ms/step with GQA (~2.5x slower), allowing only 2490 steps
- Artifact: 12.3MB (LZMA-compressed), under the 16MB cap
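The 12.3MB figure is the size of the LZMA-compressed artifact. Below is a minimal sketch of packing a checkpoint under the 16MB cap with Python's standard lzma module; the `save_compressed`/`load_compressed` helpers are illustrative, not the submission's actual packaging code:

```python
import io
import lzma
import torch

def save_compressed(model, path: str) -> int:
    """Serialize the state_dict, LZMA-compress it, return compressed size."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    blob = lzma.compress(buf.getvalue(), preset=9)  # max compression preset
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)  # bytes; must stay under the 16MB artifact cap

def load_compressed(model, path: str):
    """Inverse of save_compressed: decompress and restore the weights."""
    with open(path, "rb") as f:
        blob = lzma.decompress(f.read())
    model.load_state_dict(torch.load(io.BytesIO(blob)))
    return model
```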
Key Findings (Negative Result)
- Shrinking the BigramHash table from 6144 to 3072 rows devastated quality (see the sketch below).
- Full attention is ~2.5x slower than GQA (235ms vs 95ms per step), so only 2490 training steps fit in the time budget.
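A hashed bigram embedding of this shape folds every (previous, current) token pair into a fixed number of buckets, so halving the table from 6144 to 3072 rows roughly doubles the collision rate, consistent with the observed quality drop. A minimal PyTorch sketch with an illustrative multiplicative hash; the real hashing scheme in train_gpt.py may differ:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hashed bigram embedding (sketch): each (prev, cur) token pair is
    hashed into one of `num_buckets` rows of an embedding table."""

    def __init__(self, num_buckets: int = 3072, dim: int = 112):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) of token ids.
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = 0  # first position has no predecessor
        # Mix the pair into one bucket id (hash constant is illustrative).
        pair = prev.to(torch.int64) * 1000003 + tokens.to(torch.int64)
        return self.table(pair % self.num_buckets)  # (batch, seq, dim)
```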
Why Non-record
Two compounding issues: (1) the smaller BigramHash table destroys quality, and (2) full attention is too slow to fit enough training steps in the budget. Together they yield val_bpb 1.2094, far worse than any other approach.
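For context on the 2.5x gap: "full attention" here means every query head has its own KV head, while GQA shares one KV head across a group of query heads, shrinking the K/V projections (and, at inference, the KV cache). A minimal sketch covering both cases; head counts and shapes are placeholders, not the actual train_gpt.py configuration:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, n_heads: int, n_kv_heads: int):
    """q: (B, n_heads, T, d); k, v: (B, n_kv_heads, T, d).

    Full attention: n_kv_heads == n_heads (the slow configuration here).
    GQA: n_kv_heads < n_heads; each KV head serves a group of query heads.
    """
    if n_kv_heads < n_heads:
        repeat = n_heads // n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)  # broadcast KV to query groups
        v = v.repeat_interleave(repeat, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```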
Rule Compliance
TTT uses the legal score-first-per-chunk pattern from PR #1413 with an is_last_chunk guard: every chunk of val_tokens is scored before any parameter update touches it. The 12.3MB artifact sits under the 16MB cap; the ≤600s train / ≤600s eval budgets remain to be confirmed by the standard record-track checks.
🤖 Generated with Claude Code