
Record: val_bpb: 1.14020 [tested 3x on 8xh100] #267

Open
andrewgcodes wants to merge 17 commits into openai:main from andrewgcodes:devin/1774040790-causal-ttt-submission

Conversation


andrewgcodes commented Mar 20, 2026

Flagging that this submission performs TTT (test-time training) during validation, but compliantly. @0hq

I believe the following properties make it allowed:

  1. No training before evaluation: each chunk is scored first and its loss recorded; only then does training on that chunk occur.
  2. No re-evaluation: tokens are scored exactly once, so training on chunk N cannot affect the recorded scores for chunks 0..N.
  3. No multiple passes: the validation set is processed in a single sequential pass (32 chunks); a minimal sketch of the loop follows.
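
A minimal sketch of that ordering, with placeholder names (`model`, `val_chunks`, `bpb_from_nll`) rather than the literal train_gpt.py loop; the SGD optimizer with momentum=0.9 matches what the community review below reports:

```python
import torch

# Score-first-per-chunk causal TTT sketch. All names here are placeholders;
# the model is assumed to return an object with a .loss attribute.
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
total_nll, total_tokens = 0.0, 0

for chunk in val_chunks:             # single sequential pass (32 chunks)
    with torch.no_grad():            # 1) score first; the loss is recorded
        nll = model(chunk).loss
    total_nll += nll.item() * chunk.numel()
    total_tokens += chunk.numel()

    opt.zero_grad()                  # 2) only then train on this chunk, so
    model(chunk).loss.backward()     #    scores for chunks 0..N stay fixed
    opt.step()

val_bpb = bpb_from_nll(total_nll / total_tokens)  # each token scored once
```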

@andrewgcodes andrewgcodes changed the title Record: val_bpb: 1.14020 Record: val_bpb: 1.14020 [tested 3x on 8xh100] Mar 20, 2026
romainsantoli-web pushed a commit to romainsantoli-web/parameter-golf that referenced this pull request Mar 21, 2026
…its)

Combines techniques from PR openai#162, openai#180, openai#267, openai#281:
- 11-layer GPT with U-Net skip connections, GQA
- SmearGate + BigramHash(10240)
- Mixed int5/int6 quantization + 3% magnitude pruning
- Causal TTT at eval time
- SWA(frac=0.4), WD=0.042, Z-loss
- Target: sub-1.135 val_bpb

Awaiting RunPod 8xH100 credits for 3-seed validation.
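
The quantization bullet above is terse, so here is a hypothetical sketch of mixed-bit symmetric integer quantization combined with magnitude pruning. The bit widths and the 3% prune fraction come from the commit message; the per-tensor scaling and the prune-then-quantize ordering are assumptions, not the repo's actual code:

```python
import torch

def prune_and_quantize(w: torch.Tensor, bits: int = 6, prune_frac: float = 0.03):
    """Hypothetical sketch: magnitude-prune, then symmetrically quantize."""
    # Magnitude pruning: zero the smallest `prune_frac` of weights by |w|.
    k = int(prune_frac * w.numel())
    if k > 0:
        threshold = w.abs().flatten().kthvalue(k).values
        w = torch.where(w.abs() <= threshold, torch.zeros_like(w), w)
    # Symmetric per-tensor quantization onto a signed `bits`-bit grid.
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale             # int8 container for sub-8-bit values

# "Mixed" would mean choosing bits=5 for some matrices and bits=6 for others.
q, scale = prune_and_quantize(torch.randn(1024, 1024), bits=5)
w_dequant = q.float() * scale
```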
@MatoTeziTanka

Community Review — Record: val_bpb: 1.14020 [tested 3x on 8xh100]

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)


PR #267 — "Record: val_bpb: 1.14020 [tested 3x on 8xh100]"
Head SHA: 7940226
Submission dir: records/track_10min_16mb/2026-03-20_CausalTTT_Int5MLP_BigramHash_SWA


Check 1: N-gram family bug (CLOSE trigger)

CLEAN. BigramHashEmbedding.bigram_hash (line 693–699) computes:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

The hash key for position i is (t[i], t[i-1]) — current token XOR'd with previous token. The target token (t[i+1]) is never in the lookup key. This is standard causal bigram context; no future-token leakage. NOT the CLOSE-triggering bug.
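
A quick, self-contained causality check of that claim; the constants and the modulus come from the snippet above and the commit message's BigramHash(10240), while the shapes and vocab size are made up:

```python
import torch

def bigram_hash(t: torch.Tensor, mod: int = 10240) -> torch.Tensor:
    out = torch.zeros_like(t)
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out

t = torch.randint(0, 50257, (1, 16))
h = bigram_hash(t)
# Perturbing a future token leaves all earlier hash keys untouched:
t2 = t.clone(); t2[0, 10] = (t2[0, 10] + 1) % 50257
assert torch.equal(bigram_hash(t2)[0, :10], h[0, :10])
```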


Check 2: Pre-Quant TTT (CLOSE trigger)

CLEAN. The TTT optimizer is torch.optim.SGD (line 1435), not AdamW. The Pre-Quant TTT CLOSE trigger requires multi-epoch AdamW on val_tokens without score-first. This submission uses SGD with momentum=0.9, post-quantization, after scoring each chunk. Does not meet the CLOSE criteria.
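
For contrast, a purely hypothetical sketch of the pattern that would meet the CLOSE trigger: multiple AdamW epochs over the validation tokens, with scoring only afterwards. All names are placeholders; this is not code from the PR:

```python
import torch

# DISALLOWED pattern (illustrative only): train on val tokens first,
# across multiple epochs, then score data the model has already seen.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for epoch in range(3):                 # multiple passes over the val set
    for chunk in val_chunks:
        opt.zero_grad()
        model(chunk).loss.backward()   # train first ...
        opt.step()
val_bpb = evaluate(model, val_chunks)  # ... score afterwards: CLOSE trigger
```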


Check 3: Legal TTT (CLEAN)

CONFIRMED LEGAL. The causal TTT loop (lines 1446–1529) follows strict score-first-per-chunk ordering: each chunk's loss is computed and recorded toward the BPB total before any optimizer step runs on that chunk's tokens.

Note: The sliding-window clamping logic is novel — specifically how clamped_start/clamped_end partition scored tokens. The TTT implementation itself follows legal score-first discipline, so this is a MERGE recommendation with a note that the BPB accounting math could benefit from maintainer spot-check.
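
Since that clamping math is exactly what is flagged for spot-check, here is a hypothetical illustration of what such a partition typically looks like; `scored_span` and its arguments are invented names, not the PR's actual clamped_start/clamped_end code:

```python
def scored_span(win_start: int, win_end: int, already_scored_upto: int, seq_len: int):
    # Clamp the window's scored range so no token is scored twice and
    # nothing past the end of the sequence is counted.
    clamped_start = max(win_start, already_scored_upto)
    clamped_end = min(win_end, seq_len)
    return clamped_start, clamped_end

# Windows of length 8 sliding by 4 over a 16-token sequence:
scored_upto = 0
for start in range(0, 16, 4):
    lo, hi = scored_span(start, start + 8, scored_upto, 16)
    scored_upto = max(scored_upto, hi)
    # tokens lo..hi-1 are scored exactly once across all windows
```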

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks and a quick look at the sliding-window clamping math.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
