
Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094) #1250

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-k

Conversation

@ibarrajo commented Apr 2, 2026

Summary

  • NEGATIVE RESULT: Full attention (8 KV heads) + LZMA + BigramHash 3072x112
  • Smaller BigramHash (3072 vs 6144 in other approaches) devastated quality
  • Full attention is 2.5x slower than GQA (235ms vs 95ms/step)

Results

Metric            Value
val_bpb (base)    1.2195
val_bpb (TTT)     1.2094
Artifact size     12.3 MB
Step time         235 ms
Total steps       2,490
Training time     ~585 s

Key Findings (Negative Result)

  1. BigramHash size matters enormously: Reducing from 6144 to 3072 rows caused ~0.08 BPB regression. The bigram embedding table is a critical component, not just an auxiliary feature.
  2. Full attention is too slow at this model size: 235 ms/step means only 2,490 steps in 10 minutes vs 6,000+ with GQA. The quality-per-step of full attention doesn't compensate for 2.5x fewer steps (see the sketch after this list).
  3. Artifact underutilized: Only 12.3MB of 16MB budget used — the speed bottleneck prevents using the freed space for a larger model.
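
To make the speed comparison concrete, here is a minimal sketch of the attention-module difference (module and parameter names are assumptions, not the PR's actual code): full attention sets n_kv_head equal to n_head, while GQA shares each KV head across a group of query heads, shrinking the K/V projections and the KV cache.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalAttention(nn.Module):
    """Sketch of the full-attention vs GQA cost difference (names assumed).
    Full attention: n_kv_head == n_head. GQA: n_kv_head < n_head, which
    shrinks the K/V projections by a factor of n_head // n_kv_head."""

    def __init__(self, dim: int, n_head: int, n_kv_head: int):
        super().__init__()
        assert n_head % n_kv_head == 0
        self.n_head, self.n_kv_head = n_head, n_kv_head
        self.hd = dim // n_head
        self.wq = nn.Linear(dim, n_head * self.hd, bias=False)
        self.wk = nn.Linear(dim, n_kv_head * self.hd, bias=False)
        self.wv = nn.Linear(dim, n_kv_head * self.hd, bias=False)
        self.wo = nn.Linear(n_head * self.hd, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_head, self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_head, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_head, self.hd).transpose(1, 2)
        # GQA: each group of query heads shares one KV head.
        if self.n_kv_head < self.n_head:
            rep = self.n_head // self.n_kv_head
            k = k.repeat_interleave(rep, dim=1)
            v = v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))
```

Under this sketch, CausalAttention(dim, 8, 8) corresponds to the 8-KV-head full-attention configuration this PR tried, while a GQA variant with n_kv_head < 8 does strictly less K/V work per step, consistent with the 95 ms vs 235 ms step times reported above.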

Why Non-record

Two compounding issues: (1) smaller BigramHash destroys quality, (2) full attention is too slow to train enough steps. Together they yield 1.2094 — far worse than any other approach.

Rule Compliance

  • Training < 600s on 8xH100
  • Artifact < 16,000,000 bytes (12.3 MB, LZMA-compressed; see the sketch after this list)
  • No val tokens in artifact
  • GPTQ calibration within training budget
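
For reference, a minimal sketch of how an LZMA-compressed artifact of this kind can be produced and size-checked (the filename, serialization format, and compression preset are assumptions, not the PR's actual pipeline):

```python
import io
import lzma
import torch

def write_artifact(model: torch.nn.Module, path: str = "artifact.bin") -> int:
    """Serialize a checkpoint, LZMA-compress it, and enforce the size cap.
    Sketch only: filename and format are assumed; the 16 MB cap is the
    track rule quoted above."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    blob = lzma.compress(buf.getvalue(), preset=9 | lzma.PRESET_EXTREME)
    assert len(blob) < 16_000_000, "artifact must stay under the 16 MB cap"
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)
```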

🤖 Generated with Claude Code

NEGATIVE RESULT: 8 KV heads (full attention) + LZMA + BigramHash
3072x112. Smaller BigramHash (3072 vs 6144) devastated quality.
Full attention is 2.5x slower than GQA (235ms vs 95ms/step),
yielding only 2490 steps. Base 1.2195, TTT 1.2094. Artifact 12.3MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

PR Overview

PR #1250 — "Approach K: Fused Triton MLP kernel (#1072) + GQA + LZMA" at
records/track_10min_16mb/2026-04-01_ApproachK_FusedTritonGQA/train_gpt.py
(head SHA: 637b830)

Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAN. The BigramHash at lines 604–610 XORs t[..., 1:] (current token)
with t[..., :-1] (previous token) — both are input-side tokens. Target tokens
(y) are never included in the hash key. The implementation is a standard
causal bigram context embedding with no look-ahead into the target.
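
For illustration, here is a minimal self-contained sketch of a causal bigram-hash embedding consistent with the description above. Only the XOR of current and previous input tokens is taken from the review; the modulo reduction into the table and the position-0 padding are assumptions.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Sketch of a causal bigram-hash embedding. Table shape follows the
    PR's 3072x112 description; reduction and padding are assumptions."""

    def __init__(self, n_rows: int = 3072, dim: int = 112):
        super().__init__()
        self.n_rows = n_rows
        self.emb = nn.Embedding(n_rows, dim)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, T) input-side token ids only. Targets never enter the
        # key, which is what makes this a legal causal feature.
        prev = torch.cat([t[:, :1], t[:, :-1]], dim=1)  # shift right by one
        key = (t ^ prev) % self.n_rows                  # XOR current w/ previous
        return self.emb(key)                            # (B, T, dim)
```

With only 3072 rows, many distinct bigrams collide into the same embedding row, which is one plausible mechanism for the ~0.08 BPB regression reported in the PR description.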

Check 2: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

CLEAN. The eval_val_sliding_ttt function (line 871) explicitly structures
Phase 1 as inference-mode scoring (lines 947–983) and Phase 2 as training
(lines 984+). The function docstring at line 880 states: "Legal score-first
TTT: score each chunk, then train on it."

Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

PASS. The is_last_chunk guard at lines 985–986 (if not is_last_chunk and ttt_epochs > 0) correctly prevents training on the final chunk before it has been scored. This matches the PR #1413 canonical pattern. Scoring always precedes training within each chunk iteration (lines 942–1022).
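
For readers unfamiliar with the pattern, a minimal sketch of the score-first-per-chunk TTT loop being described (function name and the model's loss API are assumptions; the two-phase ordering and the is_last_chunk guard mirror the checks above):

```python
import torch

def eval_val_sliding_ttt_sketch(model, optimizer, chunks, ttt_epochs=1):
    """Legal score-first TTT sketch (PR #1413 pattern, assumed API).
    Each chunk is scored BEFORE the model may train on it, and the
    final chunk is never trained on."""
    total_loss, total_tokens = 0.0, 0
    for i, (x, y) in enumerate(chunks):
        is_last_chunk = (i == len(chunks) - 1)

        # Phase 1: score the chunk under no_grad -- this is the reported loss.
        model.eval()
        with torch.no_grad():
            loss = model(x, y)          # assumed: returns mean token loss
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()

        # Phase 2: adapt on the chunk only AFTER it has been scored,
        # and never on the last chunk.
        if not is_last_chunk and ttt_epochs > 0:
            model.train()
            for _ in range(ttt_epochs):
                optimizer.zero_grad(set_to_none=True)
                model(x, y).backward()
                optimizer.step()
    return total_loss / total_tokens    # mean loss; convert to bpb outside
```

The key property is that the loss contributing to val_bpb for a chunk is always computed before any optimizer step on that chunk, so no chunk's score can benefit from training on itself.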

Check 4: HOLD scored-region SLOT

No scored-region SLOT mechanism found. Not applicable.

Check 5: PURE_NEURAL_CLEAN

Not applicable — PR uses BigramHash embedding, GQA, and TTT on val_tokens, so
it is not pure neural.

Summary

The TTT implementation correctly enforces score-before-train ordering with the
is_last_chunk guard. The BigramHash XOR uses only input-side tokens. No
illegal patterns detected. The submission is LEGAL_SCORE_FIRST_TTT_CLEAN.

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

