
Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds)#713

Open
hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-25_10L_LoRA_TTT_Record

Conversation

@hypery11

Results

| Seed | Base val_bpb | TTT val_bpb |
|------|--------------|-------------|
| 42   | 1.1476       | 1.1160      |
| 1337 | 1.1540       | 1.1210      |
| 2024 | 1.1504       | 1.1170      |
| Mean | 1.1507       | 1.1180      |
| Std  | 0.0032       | 0.0026      |

- Artifact: 15.75 MB
- Train: 600s on 8xH100 SXM
- TTT eval: ~496s

Method

10-layer transformer (512d, 8/4 GQA, 3x MLP LeakyReLU(0.5)^2) with per-document batched LoRA test-time training.

LoRA rank-8 on Q/V projections + LM head. 64 documents batched in parallel. Per-doc reset, Adam lr=0.01, 256-token chunks, 3 epochs, score on final epoch. Mixed int5/int6 quantization + zstd-22.

See README.md for full details.
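A minimal sketch of the per-document loop described above, assuming a decoder-only model whose Q/V projections and LM head have been wrapped in a hypothetical `LoRALinear` module (the names `model`, `lora_layers`, and `doc_tokens` are illustrative, not this PR's train_gpt.py API). The real submission batches 64 documents in parallel and gates which chunks are trained on; this shows only the single-document shape:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable rank-r update: W x + B (A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # base weights stay frozen
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.empty(base.out_features, rank))
        self.reset()

    def reset(self):
        nn.init.normal_(self.A, std=0.01)    # small random down-projection
        nn.init.zeros_(self.B)               # zero up-projection: starts as a no-op

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B)

def lora_ttt_bits_per_token(model, lora_layers, doc_tokens,
                            chunk_len=256, epochs=3, lr=1e-2):
    """Adapt LoRA params on one document; accumulate loss on the final epoch."""
    for layer in lora_layers:
        layer.reset()                        # per-document reset: no cross-doc state
    params = [p for layer in lora_layers for p in layer.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    loss_sum, tok_count = 0.0, 0
    for epoch in range(epochs):
        for tokens in doc_tokens.split(chunk_len):
            logits = model(tokens[:-1].unsqueeze(0)).squeeze(0)
            loss = F.cross_entropy(logits, tokens[1:])
            if epoch == epochs - 1:          # score only on the final epoch
                loss_sum += loss.item() * (tokens.numel() - 1)
                tok_count += tokens.numel() - 1
            opt.zero_grad()
            loss.backward()                  # (the PR additionally gates training
            opt.step()                       #  via a needs_train flag; omitted here)
    return loss_sum / tok_count / math.log(2)  # mean loss in bits per token
```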

3-seed validation: 1.1160 / 1.1210 / 1.1170 (std 0.0026)
Per-document rank-8 LoRA on Q/V/LM-head, batch-64, 3 epochs.
15.75MB artifact. Train 600s, eval 496s.
@dexhunter (Contributor)

Hi @hypery11 — interesting LoRA TTT approach with per-document batching.

I wanted to flag a potential score-first compliance concern. Looking at lora_ttt_eval() (line 1095), the scoring happens only on the final epoch:

for epoch in range(ttt_epochs):       # 3 epochs
    for ci in range(max_chunks):
        ...
        if epoch == ttt_epochs - 1:   # score only on epoch 3
            # accumulate loss_sum
        if needs_train:               # train on non-last chunks
            loss.backward()
            cur_opt.step()

This means when scoring on epoch 3, the LoRA weights have already been trained on the full document for 2 complete epochs. A token at position t in the document is scored using LoRA weights that were adapted on tokens including t itself (from epochs 1 and 2).

The README rule is: "you are only allowed to test-time train on validation set tokens you've already evaluated your model on."

In the standard score-first TTT pattern (PR #461/#549/#726), each chunk is scored BEFORE the model trains on it, and the score is final — no re-scoring after training. Here, scoring happens after training, which appears to be the adapt-then-score pattern that PR #518 was closed for.
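For concreteness, a schematic of that score-first ordering, mirroring the snippet above (`compute_loss` is a hypothetical helper, not code from this PR or those submissions):

```python
for ci in range(max_chunks):
    loss = compute_loss(chunk[ci])
    loss_sum += loss.item()   # scored BEFORE any update on this chunk; score is final
    loss.backward()           # only now adapt on the chunk just scored
    cur_opt.step()
    cur_opt.zero_grad()
# no second pass: chunks are never re-scored after training
```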

For reference, PR #518 was closed by @valerio-oai because it "trains on the validation set by reporting the score on a doc after its weights have adapted to it."

Would you be able to clarify how this differs from the adapt-then-score pattern?

@MatoTeziTanka

Community Review — Record: 10L + Batched LoRA TTT (mean val_bpb=1.1180, 3 seeds)

Compliance flag: Pre-Quant TTT violation


Check 1: N-gram Family Bug (CLOSE trigger: target token in hash key)

CLEAN. BigramHashEmbedding.bigram_hash() builds its key as xor(36313 * t[..., 1:], 27191 * t[..., :-1]) — that is, the hash at position i uses tokens i (current) and i-1 (prior). The embedding output at position i is added to the input representation at position i and passed into the transformer. The target token at position i is t[i+1] — not present in the key. Same analysis holds for TrigramHashEmbedding which keys on t[i], t[i-1], t[i-2]. Neither family leaks the target token into the lookup key. Both are in BigramHash-legal form.
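A sketch of the key construction analyzed above, reconstructed from that description (the table size and the modulo step are assumptions, not the PR's exact BigramHashEmbedding code):

```python
import torch

def bigram_hash(t: torch.Tensor, table_size: int = 2**20) -> torch.Tensor:
    """Key at position i mixes token i (current) with token i-1 (prior).
    The target at position i is t[i+1], which never enters the key."""
    key = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1])
    return key % table_size

tokens = torch.tensor([5, 17, 9, 3])
print(bigram_hash(tokens))  # 3 keys, one per position 1..3
```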

Check 2: Pre-Quant TTT (CLOSE trigger: multi-epoch AdamW on val_tokens without score-first)

VIOLATION — CLOSE.

The test_time_train() function (lines 967–1025) fine-tunes all model parameters (not LoRA-only) on val_tokens for ttt_epochs (default 10) epochs using torch.optim.AdamW. There is no score-first-per-chunk mechanism: the function iterates over the full val sequence in sequential chunks for each epoch, trains unconditionally on every chunk, then calls eval_val_sliding() after all training is complete. This is the canonical Pre-Quant TTT pattern: multi-epoch AdamW sweep over val tokens, no score gating. ttt_enabled defaults to 0 but the PR description states TTT is used and reports TTT val_bpb scores, indicating this path is active for the submitted result.
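Schematically, the flagged pattern has this shape (names and shapes are illustrative, not copied from train_gpt.py): every chunk is trained on for all epochs, and scoring happens only after the full sweep.

```python
opt = torch.optim.AdamW(model.parameters(), lr=lr)  # ALL params, not LoRA-only
for epoch in range(ttt_epochs):                     # default 10 epochs
    for tokens in val_chunks:                       # sequential sweep of val_tokens
        loss = F.cross_entropy(model(tokens[:-1]), tokens[1:])
        opt.zero_grad()
        loss.backward()                             # trains unconditionally, no gating
        opt.step()
bpb = eval_val_sliding(model)                       # scored only after all training
```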

Check 3: Legal TTT (score-first-per-chunk)

The lora_ttt_eval() function (lines 1096–1280) implements a batched LoRA approach with correct score-first ordering: it scores on epoch == ttt_epochs - 1 (final epoch, last chunk), and trains only on needs_train (all chunks except the last). This is structurally sound — each chunk is scored before any subsequent chunk's training update can influence it, and the final chunk is scored without a gradient step. However, this legal TTT path does not redeem the submission because test_time_train() (the full-model AdamW path) is also present and is the mechanism cited in the PR description results table.

Check 4: Scored-Region SLOT

Not applicable — no scored-region manipulation detected.

Check 5: Pure Neural

The architecture is a standard transformer with BigramHash and TrigramHash embedding additions. Both are CLEAN per Check 1. Pure neural — CLEAN.

test_time_train() runs multi-epoch AdamW over the full val_tokens stream before scoring. This violates score-first discipline. The legal LoRA TTT path (lora_ttt_eval) exists and is correctly implemented, but the illegal path is also present and active.

Verdict: CLOSE — Pre-Quant TTT violation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE unless the author removes test_time_train() and resubmits with only the legal LoRA TTT path.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
