
Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094) #1250

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-k

Conversation

@ibarrajo commented Apr 2, 2026

Summary

  • NEGATIVE RESULT: Full attention (8 KV heads) + LZMA + BigramHash 3072x112
  • Smaller BigramHash (3072 vs 6144 in other approaches) devastated quality
  • Full attention is 2.5x slower than GQA (235ms vs 95ms/step)

Results

Metric            Value
val_bpb (base)    1.2195
val_bpb (TTT)     1.2094
Artifact size     12.3 MB
Step time         235 ms
Total steps       2,490
Training time     ~585 s

Key Findings (Negative Result)

  1. BigramHash size matters enormously: Reducing from 6144 to 3072 rows caused ~0.08 BPB regression. The bigram embedding table is a critical component, not just an auxiliary feature.
  2. Full attention is too slow at this model size: 235 ms/step means only 2,490 steps in 10 minutes vs 6,000+ with GQA. The quality-per-step of full attention doesn't compensate for 2.5x fewer steps (see the sketch after this list).
  3. Artifact underutilized: Only 12.3MB of 16MB budget used — the speed bottleneck prevents using the freed space for a larger model.
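
To make the speed comparison concrete, here is a minimal sketch of the attention-module difference (module and parameter names are assumptions, not the PR's actual code): full attention sets n_kv_head equal to n_head, while GQA shares each KV head across a group of query heads, shrinking the K/V projections and the KV cache.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalAttention(nn.Module):
    """Sketch of the full-attention vs GQA cost difference (names assumed).
    Full attention: n_kv_head == n_head. GQA: n_kv_head < n_head, which
    shrinks the K/V projections by a factor of n_head // n_kv_head."""

    def __init__(self, dim: int, n_head: int, n_kv_head: int):
        super().__init__()
        assert n_head % n_kv_head == 0
        self.n_head, self.n_kv_head = n_head, n_kv_head
        self.hd = dim // n_head
        self.wq = nn.Linear(dim, n_head * self.hd, bias=False)
        self.wk = nn.Linear(dim, n_kv_head * self.hd, bias=False)
        self.wv = nn.Linear(dim, n_kv_head * self.hd, bias=False)
        self.wo = nn.Linear(n_head * self.hd, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_head, self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_head, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_head, self.hd).transpose(1, 2)
        # GQA: each group of query heads shares one KV head.
        if self.n_kv_head < self.n_head:
            rep = self.n_head // self.n_kv_head
            k = k.repeat_interleave(rep, dim=1)
            v = v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))
```

Under this sketch, CausalAttention(dim, 8, 8) corresponds to the 8-KV-head full-attention configuration this PR tried, while a GQA variant with n_kv_head < 8 does strictly less K/V work per step, consistent with the 95 ms vs 235 ms step times reported above.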

Why Non-record

Two compounding issues: (1) smaller BigramHash destroys quality, (2) full attention is too slow to train enough steps. Together they yield 1.2094 — far worse than any other approach.

Rule Compliance

  • Training < 600s on 8xH100
  • Artifact < 16,000,000 bytes (12.3 MB, LZMA-compressed; see the sketch after this list)
  • No val tokens in artifact
  • GPTQ calibration within training budget
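
For reference, a minimal sketch of how an LZMA-compressed artifact of this kind can be produced and size-checked (the filename, serialization format, and compression preset are assumptions, not the PR's actual pipeline):

```python
import io
import lzma
import torch

def write_artifact(model: torch.nn.Module, path: str = "artifact.bin") -> int:
    """Serialize a checkpoint, LZMA-compress it, and enforce the size cap.
    Sketch only: filename and format are assumed; the 16 MB cap is the
    track rule quoted above."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    blob = lzma.compress(buf.getvalue(), preset=9 | lzma.PRESET_EXTREME)
    assert len(blob) < 16_000_000, "artifact must stay under the 16 MB cap"
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)
```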

🤖 Generated with Claude Code

NEGATIVE RESULT: 8 KV heads (full attention) + LZMA + BigramHash
3072x112. Smaller BigramHash (3072 vs 6144) devastated quality.
Full attention is 2.5x slower than GQA (235ms vs 95ms/step),
yielding only 2490 steps. Base 1.2195, TTT 1.2094. Artifact 12.3MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

PR Overview

PR #1250 — "Approach K: Fused Triton MLP kernel (#1072) + GQA + LZMA" at
records/track_10min_16mb/2026-04-01_ApproachK_FusedTritonGQA/train_gpt.py
(head SHA: 637b830)

Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAN. The BigramHash at lines 604–610 XORs t[..., 1:] (current token)
with t[..., :-1] (previous token) — both are input-side tokens. Target tokens
(y) are never included in the hash key. The implementation is a standard
causal bigram context embedding with no look-ahead into the target.
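
For illustration, here is a minimal self-contained sketch of a causal bigram-hash embedding consistent with the description above. Only the XOR of current and previous input tokens is taken from the review; the modulo reduction into the table and the position-0 padding are assumptions.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Sketch of a causal bigram-hash embedding. Table shape follows the
    PR's 3072x112 description; reduction and padding are assumptions."""

    def __init__(self, n_rows: int = 3072, dim: int = 112):
        super().__init__()
        self.n_rows = n_rows
        self.emb = nn.Embedding(n_rows, dim)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, T) input-side token ids only. Targets never enter the
        # key, which is what makes this a legal causal feature.
        prev = torch.cat([t[:, :1], t[:, :-1]], dim=1)  # shift right by one
        key = (t ^ prev) % self.n_rows                  # XOR current w/ previous
        return self.emb(key)                            # (B, T, dim)
```

With only 3072 rows, many distinct bigrams collide into the same embedding row, which is one plausible mechanism for the ~0.08 BPB regression reported in the PR description.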

Check 2: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

CLEAN. The eval_val_sliding_ttt function (line 871) explicitly structures
Phase 1 as inference-mode scoring (lines 947–983) and Phase 2 as training
(lines 984+). The function docstring at line 880 states: "Legal score-first
TTT: score each chunk, then train on it."

Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

PASS. The is_last_chunk guard at lines 985–986 (if not is_last_chunk and ttt_epochs > 0) correctly prevents training on the final chunk before it has been scored. This matches the PR #1413 canonical pattern. Scoring always precedes training within each chunk iteration (lines 942–1022).
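
For readers unfamiliar with the pattern, a minimal sketch of the score-first-per-chunk TTT loop being described (function name and the model's loss API are assumptions; the two-phase ordering and the is_last_chunk guard mirror the checks above):

```python
import torch

def eval_val_sliding_ttt_sketch(model, optimizer, chunks, ttt_epochs=1):
    """Legal score-first TTT sketch (PR #1413 pattern, assumed API).
    Each chunk is scored BEFORE the model may train on it, and the
    final chunk is never trained on."""
    total_loss, total_tokens = 0.0, 0
    for i, (x, y) in enumerate(chunks):
        is_last_chunk = (i == len(chunks) - 1)

        # Phase 1: score the chunk under no_grad -- this is the reported loss.
        model.eval()
        with torch.no_grad():
            loss = model(x, y)          # assumed: returns mean token loss
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()

        # Phase 2: adapt on the chunk only AFTER it has been scored,
        # and never on the last chunk.
        if not is_last_chunk and ttt_epochs > 0:
            model.train()
            for _ in range(ttt_epochs):
                optimizer.zero_grad(set_to_none=True)
                model(x, y).backward()
                optimizer.step()
    return total_loss / total_tokens    # mean loss; convert to bpb outside
```

The key property is that the loss contributing to val_bpb for a chunk is always computed before any optimizer step on that chunk, so no chunk's score can benefit from training on itself.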

Check 4: HOLD scored-region SLOT

No scored-region SLOT mechanism found. Not applicable.

Check 5: PURE_NEURAL_CLEAN

Not applicable — PR uses BigramHash embedding, GQA, and TTT on val_tokens, so
it is not pure neural.

Summary

The TTT implementation correctly enforces score-before-train ordering with the
is_last_chunk guard. The BigramHash XOR uses only input-side tokens. No
illegal patterns detected. The submission is LEGAL_SCORE_FIRST_TTT_CLEAN.

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

