[track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ) by ADIITJ · Pull Request #467 · openai/parameter-golf

ADIITJ · 2026-03-22T22:09:26Z

Summary

Document-isolated multi-epoch cosine LoRA TTT at evaluation on top of the SOTA training stack (thwu1, 1.1428 bpb). Zero artifact cost: LoRA adapters initialized fresh at eval time, discarded after each document.

Author: Atharva Date (ADIITJ)
Status: Non-record (pending 3-seed H100 validation)
Expected bpb: 1.05–1.10 (projected)
Base: SOTA thwu1 (1.1428 bpb)

Key changes from single-pass LoRA TTT

Component	This submission	SOTA (thwu1)
Training	identical	—
Quantization	Int5-MLP + Int6-Attn + zstd-22	same
`swa_start_frac`	0.35	0.40
`warmdown_iters`	3500	3000
Final eval	50-epoch cosine LoRA TTT	Sliding window

Multi-Epoch Cosine LoRA TTT

50 epochs per document with cosine LR decay: lr = 0.001 * 0.5 * (1 + cos(π * ep / 50))
Rank-8 LoRA adapters on Q and V projections in all 10 attention layers
Score-first per chunk within each epoch (backward-looking guarantee)
NLL accumulated from the final epoch only
LoRA + optimizer reset between documents (no cross-document leakage)

Motivation

PR #517 (lukacf) showed that cosine LR scheduling on TTT gets 0.978 bpb (vs ~1.085 with no TTT) using full-model fine-tuning. This submission applies the same cosine schedule to rank-8 LoRA adapters, which avoids full-model memory pressure while capturing most of the adaptation benefit.

Compliance

Artifact ≤ 16MB (est. ~14.3MB, identical to SOTA)
No network calls during evaluation
Training ≤ 10 min on 8xH100
Score-first per chunk within every TTT epoch
LoRA reset between documents (no cross-document leakage)
3-seed H100 validation — non-record until complete

…g H100 validation) Combines thwu1's SOTA training stack (10L, Int5-MLP, Int6-Attn, BigramHash, SWA) with document-isolated LoRA TTT at evaluation. LoRA adapters (rank=8) target Q and V in all 10 attention layers, initialized fresh per document at eval time — zero artifact cost. - swa_start_frac=0.35 (vs SOTA 0.40), warmdown_iters=3500 (vs 3000) - Score-first TTT: chunk scored before LoRA step, no information leakage - Expected bpb: 1.137–1.140 (SOTA 1.1428 + TTT delta ~0.003–0.005) - Artifact: ~14.3MB (same quantization as SOTA, LoRA weights not stored) - train_gpt.py: exactly 1500 lines, 64281 bytes, AST-clean Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new track record submission that keeps the existing SOTA training/quantization stack but changes the final evaluation to document-isolated LoRA test-time training (TTT), with the intent of improving val_bpb without increasing artifact size.

Changes:

Introduces a new train_gpt.py in a records folder that implements LoRA TTT during final evaluation (Q/V adapters per layer; score-then-train per chunk).
Updates submission metadata (submission.json) and adds documentation (README.md) describing the method and hyperparameter deltas (SWA + warmdown tweaks).
Adds a local requirements.txt for the record folder (zstandard/sentencepiece/numpy/torch).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/train_gpt.py	Implements the SOTA base model/training/quantization and adds document-isolated LoRA TTT for final eval.
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/submission.json	Declares submission metadata and projected bpb range.
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/requirements.txt	Specifies runtime dependencies for reproducing the submission.
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/README.md	Explains architecture, quantization, and the LoRA TTT evaluation procedure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-22T22:13:23Z

+def _find_docs(all_tokens: Tensor) -> list[tuple[int, int]]:
+    """Return (start, length) for each document delimited by BOS_ID."""
+    bos_pos = (all_tokens == BOS_ID).nonzero(as_tuple=True)[0].tolist()
+    if not bos_pos:
+        # No BOS found: treat entire sequence as one document
+        return [(0, all_tokens.numel())]
+    docs = []
+    for i, start in enumerate(bos_pos):
+        end = bos_pos[i + 1] if i + 1 < len(bos_pos) else all_tokens.numel()
+        length = end - start
+        if length >= 2:
+            docs.append((int(start), int(length)))
+    return docs


_find_docs splits documents at each BOS but excludes the BOS token of the next document from scoring/training. This means most BOS tokens are never evaluated (unlike the standard/ sliding-window eval which scores every token except the first), so the reported val_bpb will not be comparable and can be artificially improved. Adjust document slicing so every token in the validation stream (except the very first) is scored exactly once (e.g., include the next document’s BOS in the previous document, as done in earlier LoRA-TTT implementations), and ensure any pre-BOS prefix tokens are handled consistently.

Multi-epoch cosine LR schedule on rank-8 LoRA adapters per document. 50 epochs, lr=0.001 with cosine decay to ~0. Score-first per chunk within each epoch (backward-looking). NLL accumulated in final epoch only. Expected bpb: ~1.05–1.10 vs single-pass ~1.137. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

MatoTeziTanka · 2026-04-12T13:51:10Z

Community Review — [track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Deterministic AST scan of train_gpt.py found:

N-gram family bug (target-in-key): not found
Pre-Quant TTT on val_tokens: not found
Scored-region SLOT: not found
Custom tokenizer overrides: not found

Standard train → quantize → eval path with no eval-time adaptation.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via deterministic AST classifier cross-checked against competition rules. If this review misread your code, please call it out so I can re-audit manually.

Copilot AI review requested due to automatic review settings March 22, 2026 22:09

Copilot started reviewing on behalf of ADIITJ March 22, 2026 22:09 View session

Copilot AI reviewed Mar 22, 2026

View reviewed changes

ADIITJ changed the title ~~[track_10min_16mb] LoRA TTT + SOTA Int5/Int6 (10L, BigramHash, SWA) — Atharva Date (ADIITJ)~~ [track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ) Mar 23, 2026

0hq mentioned this pull request Mar 25, 2026

Illegal submissions megathread #677

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)#467

[track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)#467
ADIITJ wants to merge 2 commits intoopenai:mainfrom
ADIITJ:adiitj/lora-ttt-int5int6-swa

ADIITJ commented Mar 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 22, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ADIITJ commented Mar 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key changes from single-pass LoRA TTT

Multi-Epoch Cosine LoRA TTT

Motivation

Compliance

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Community Review — [track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ADIITJ commented Mar 22, 2026 •

edited

Loading