
Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds)#713

Open
hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-25_10L_LoRA_TTT_Record

Conversation

@hypery11

Results

| Seed | Base val_bpb | TTT val_bpb |
|------|--------------|-------------|
| 42   | 1.1476       | 1.1160      |
| 1337 | 1.1540       | 1.1210      |
| 2024 | 1.1504       | 1.1170      |
| Mean | 1.1507       | 1.1180      |
| Std  | 0.0032       | 0.0026      |

- Artifact: 15.75 MB
- Train: 600s on 8xH100 SXM
- TTT eval: ~496s

Method

10-layer transformer (512d, 8/4 GQA, 3x MLP LeakyReLU(0.5)^2) with per-document batched LoRA test-time training.

LoRA rank-8 on Q/V projections + LM head. 64 documents batched in parallel. Per-doc reset, Adam lr=0.01, 256-token chunks, 3 epochs, score on final epoch. Mixed int5/int6 quantization + zstd-22.

See README.md for full details.
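A minimal sketch of the per-document loop described above, assuming a decoder-only model whose Q/V projections and LM head have been wrapped in a hypothetical `LoRALinear` module (the names `model`, `lora_layers`, and `doc_tokens` are illustrative, not this PR's train_gpt.py API). The real submission batches 64 documents in parallel and gates which chunks are trained on; this shows only the single-document shape:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable rank-r update: W x + B (A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # base weights stay frozen
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.empty(base.out_features, rank))
        self.reset()

    def reset(self):
        nn.init.normal_(self.A, std=0.01)    # small random down-projection
        nn.init.zeros_(self.B)               # zero up-projection: starts as a no-op

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B)

def lora_ttt_bits_per_token(model, lora_layers, doc_tokens,
                            chunk_len=256, epochs=3, lr=1e-2):
    """Adapt LoRA params on one document; accumulate loss on the final epoch."""
    for layer in lora_layers:
        layer.reset()                        # per-document reset: no cross-doc state
    params = [p for layer in lora_layers for p in layer.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    loss_sum, tok_count = 0.0, 0
    for epoch in range(epochs):
        for tokens in doc_tokens.split(chunk_len):
            logits = model(tokens[:-1].unsqueeze(0)).squeeze(0)
            loss = F.cross_entropy(logits, tokens[1:])
            if epoch == epochs - 1:          # score only on the final epoch
                loss_sum += loss.item() * (tokens.numel() - 1)
                tok_count += tokens.numel() - 1
            opt.zero_grad()
            loss.backward()                  # (the PR additionally gates training
            opt.step()                       #  via a needs_train flag; omitted here)
    return loss_sum / tok_count / math.log(2)  # mean loss in bits per token
```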

3-seed validation: 1.1160 / 1.1210 / 1.1170 (std 0.0026)
Per-document rank-8 LoRA on Q/V/LM-head, batch-64, 3 epochs.
15.75MB artifact. Train 600s, eval 496s.
@dexhunter (Contributor)

Hi @hypery11 — interesting LoRA TTT approach with per-document batching.

I wanted to flag a potential score-first compliance concern. Looking at lora_ttt_eval() (line 1095), the scoring happens only on the final epoch:

for epoch in range(ttt_epochs):       # 3 epochs
    for ci in range(max_chunks):
        ...
        if epoch == ttt_epochs - 1:   # score only on epoch 3
            # accumulate loss_sum
        if needs_train:               # train on non-last chunks
            loss.backward()
            cur_opt.step()

This means when scoring on epoch 3, the LoRA weights have already been trained on the full document for 2 complete epochs. A token at position t in the document is scored using LoRA weights that were adapted on tokens including t itself (from epochs 1 and 2).

The README rule is: "you are only allowed to test-time train on validation set tokens you've already evaluated your model on."

In the standard score-first TTT pattern (PR #461/#549/#726), each chunk is scored BEFORE the model trains on it, and the score is final — no re-scoring after training. Here, scoring happens after training, which appears to be the adapt-then-score pattern that PR #518 was closed for.
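For concreteness, a schematic of that score-first ordering, mirroring the snippet above (`compute_loss` is a hypothetical helper, not code from this PR or those submissions):

```python
for ci in range(max_chunks):
    loss = compute_loss(chunk[ci])
    loss_sum += loss.item()   # scored BEFORE any update on this chunk; score is final
    loss.backward()           # only now adapt on the chunk just scored
    cur_opt.step()
    cur_opt.zero_grad()
# no second pass: chunks are never re-scored after training
```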

For reference, PR #518 was closed by @valerio-oai because it "trains on the validation set by reporting the score on a doc after its weights have adapted to it."

Would you be able to clarify how this differs from the adapt-then-score pattern?

@MatoTeziTanka

Community Review — Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds)

Compliance flag: Pre-Quant TTT violation


Check 1: N-gram Family Bug (CLOSE trigger: target token in hash key)

CLEAN. BigramHashEmbedding.bigram_hash() builds its key as xor(36313 * t[..., 1:], 27191 * t[..., :-1]) — that is, the hash at position i uses tokens i (current) and i-1 (prior). The embedding output at position i is added to the input representation at position i and passed into the transformer. The target token at position i is t[i+1] — not present in the key. Same analysis holds for TrigramHashEmbedding which keys on t[i], t[i-1], t[i-2]. Neither family leaks the target token into the lookup key. Both are in BigramHash-legal form.
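A sketch of the key construction analyzed above, reconstructed from that description (the table size and the modulo step are assumptions, not the PR's exact BigramHashEmbedding code):

```python
import torch

def bigram_hash(t: torch.Tensor, table_size: int = 2**20) -> torch.Tensor:
    """Key at position i mixes token i (current) with token i-1 (prior).
    The target at position i is t[i+1], which never enters the key."""
    key = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1])
    return key % table_size

tokens = torch.tensor([5, 17, 9, 3])
print(bigram_hash(tokens))  # 3 keys, one per position 1..3
```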

Check 2: Pre-Quant TTT (CLOSE trigger: multi-epoch AdamW on val_tokens without score-first)

VIOLATION — CLOSE.

The test_time_train() function (lines 967–1025) fine-tunes all model parameters (not LoRA-only) on val_tokens for ttt_epochs (default 10) epochs using torch.optim.AdamW. There is no score-first-per-chunk mechanism: the function iterates over the full val sequence in sequential chunks for each epoch, trains unconditionally on every chunk, then calls eval_val_sliding() after all training is complete. This is the canonical Pre-Quant TTT pattern: multi-epoch AdamW sweep over val tokens, no score gating. ttt_enabled defaults to 0 but the PR description states TTT is used and reports TTT val_bpb scores, indicating this path is active for the submitted result.
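Schematically, the flagged pattern has this shape (names and shapes are illustrative, not copied from train_gpt.py): every chunk is trained on for all epochs, and scoring happens only after the full sweep.

```python
opt = torch.optim.AdamW(model.parameters(), lr=lr)  # ALL params, not LoRA-only
for epoch in range(ttt_epochs):                     # default 10 epochs
    for tokens in val_chunks:                       # sequential sweep of val_tokens
        loss = F.cross_entropy(model(tokens[:-1]), tokens[1:])
        opt.zero_grad()
        loss.backward()                             # trains unconditionally, no gating
        opt.step()
bpb = eval_val_sliding(model)                       # scored only after all training
```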

Check 3: Legal TTT (score-first-per-chunk)

The lora_ttt_eval() function (lines 1096–1280) implements a batched LoRA approach with correct score-first ordering: it scores on epoch == ttt_epochs - 1 (final epoch, last chunk), and trains only on needs_train (all chunks except the last). This is structurally sound — each chunk is scored before any subsequent chunk's training update can influence it, and the final chunk is scored without a gradient step. However, this legal TTT path does not redeem the submission because test_time_train() (the full-model AdamW path) is also present and is the mechanism cited in the PR description results table.

Check 4: Scored-Region SLOT

Not applicable — no scored-region manipulation detected.

Check 5: Pure Neural

The architecture is a standard transformer with BigramHash and TrigramHash embedding additions. Both are CLEAN per Check 1. Pure neural — CLEAN.

test_time_train() runs multi-epoch AdamW over the full val_tokens stream before scoring. This violates score-first discipline. The legal LoRA TTT path (lora_ttt_eval) exists and is correctly implemented, but the illegal path is also present and active.

Verdict: CLOSE — Pre-Quant TTT violation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE unless the author removes test_time_train() and resubmits with only the legal LoRA TTT path.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
