Skip to content

[track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)#467

Open
ADIITJ wants to merge 2 commits intoopenai:mainfrom
ADIITJ:adiitj/lora-ttt-int5int6-swa
Open

[track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)#467
ADIITJ wants to merge 2 commits intoopenai:mainfrom
ADIITJ:adiitj/lora-ttt-int5int6-swa

Conversation

@ADIITJ
Copy link
Copy Markdown

@ADIITJ ADIITJ commented Mar 22, 2026

Summary

Document-isolated multi-epoch cosine LoRA TTT at evaluation on top of the SOTA training stack (thwu1, 1.1428 bpb). Zero artifact cost: LoRA adapters initialized fresh at eval time, discarded after each document.

  • Author: Atharva Date (ADIITJ)
  • Status: Non-record (pending 3-seed H100 validation)
  • Expected bpb: 1.05–1.10 (projected)
  • Base: SOTA thwu1 (1.1428 bpb)

Key changes from single-pass LoRA TTT

Component This submission SOTA (thwu1)
Training identical
Quantization Int5-MLP + Int6-Attn + zstd-22 same
swa_start_frac 0.35 0.40
warmdown_iters 3500 3000
Final eval 50-epoch cosine LoRA TTT Sliding window

Multi-Epoch Cosine LoRA TTT

  • 50 epochs per document with cosine LR decay: lr = 0.001 * 0.5 * (1 + cos(π * ep / 50))
  • Rank-8 LoRA adapters on Q and V projections in all 10 attention layers
  • Score-first per chunk within each epoch (backward-looking guarantee)
  • NLL accumulated from the final epoch only
  • LoRA + optimizer reset between documents (no cross-document leakage)

Motivation

PR #517 (lukacf) showed that cosine LR scheduling on TTT gets 0.978 bpb (vs ~1.085 with no TTT) using full-model fine-tuning. This submission applies the same cosine schedule to rank-8 LoRA adapters, which avoids full-model memory pressure while capturing most of the adaptation benefit.

Compliance

  • Artifact ≤ 16MB (est. ~14.3MB, identical to SOTA)
  • No network calls during evaluation
  • Training ≤ 10 min on 8xH100
  • Score-first per chunk within every TTT epoch
  • LoRA reset between documents (no cross-document leakage)
  • 3-seed H100 validation — non-record until complete

…g H100 validation)

Combines thwu1's SOTA training stack (10L, Int5-MLP, Int6-Attn, BigramHash, SWA) with
document-isolated LoRA TTT at evaluation. LoRA adapters (rank=8) target Q and V in all
10 attention layers, initialized fresh per document at eval time — zero artifact cost.

- swa_start_frac=0.35 (vs SOTA 0.40), warmdown_iters=3500 (vs 3000)
- Score-first TTT: chunk scored before LoRA step, no information leakage
- Expected bpb: 1.137–1.140 (SOTA 1.1428 + TTT delta ~0.003–0.005)
- Artifact: ~14.3MB (same quantization as SOTA, LoRA weights not stored)
- train_gpt.py: exactly 1500 lines, 64281 bytes, AST-clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 22, 2026 22:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new track record submission that keeps the existing SOTA training/quantization stack but changes the final evaluation to document-isolated LoRA test-time training (TTT), with the intent of improving val_bpb without increasing artifact size.

Changes:

  • Introduces a new train_gpt.py in a records folder that implements LoRA TTT during final evaluation (Q/V adapters per layer; score-then-train per chunk).
  • Updates submission metadata (submission.json) and adds documentation (README.md) describing the method and hyperparameter deltas (SWA + warmdown tweaks).
  • Adds a local requirements.txt for the record folder (zstandard/sentencepiece/numpy/torch).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/train_gpt.py Implements the SOTA base model/training/quantization and adds document-isolated LoRA TTT for final eval.
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/submission.json Declares submission metadata and projected bpb range.
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/requirements.txt Specifies runtime dependencies for reproducing the submission.
records/track_10min_16mb/2026-03-23_LoRA_TTT_Int5Int6/README.md Explains architecture, quantization, and the LoRA TTT evaluation procedure.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +897 to +909
def _find_docs(all_tokens: Tensor) -> list[tuple[int, int]]:
"""Return (start, length) for each document delimited by BOS_ID."""
bos_pos = (all_tokens == BOS_ID).nonzero(as_tuple=True)[0].tolist()
if not bos_pos:
# No BOS found: treat entire sequence as one document
return [(0, all_tokens.numel())]
docs = []
for i, start in enumerate(bos_pos):
end = bos_pos[i + 1] if i + 1 < len(bos_pos) else all_tokens.numel()
length = end - start
if length >= 2:
docs.append((int(start), int(length)))
return docs
Copy link

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_find_docs splits documents at each BOS but excludes the BOS token of the next document from scoring/training. This means most BOS tokens are never evaluated (unlike the standard/ sliding-window eval which scores every token except the first), so the reported val_bpb will not be comparable and can be artificially improved. Adjust document slicing so every token in the validation stream (except the very first) is scored exactly once (e.g., include the next document’s BOS in the previous document, as done in earlier LoRA-TTT implementations), and ensure any pre-BOS prefix tokens are handled consistently.

Copilot uses AI. Check for mistakes.
Multi-epoch cosine LR schedule on rank-8 LoRA adapters per document.
50 epochs, lr=0.001 with cosine decay to ~0. Score-first per chunk
within each epoch (backward-looking). NLL accumulated in final epoch only.
Expected bpb: ~1.05–1.10 vs single-pass ~1.137.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ADIITJ ADIITJ changed the title [track_10min_16mb] LoRA TTT + SOTA Int5/Int6 (10L, BigramHash, SWA) — Atharva Date (ADIITJ) [track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ) Mar 23, 2026
@MatoTeziTanka
Copy link
Copy Markdown

Community Review — [track_10min_16mb] 50-Epoch Cosine LoRA TTT + SOTA (10L Int5/Int6 BigramHash SWA) — Atharva Date (ADIITJ)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Deterministic AST scan of train_gpt.py found:

  • N-gram family bug (target-in-key): not found
  • Pre-Quant TTT on val_tokens: not found
  • Scored-region SLOT: not found
  • Custom tokenizer overrides: not found

Standard train → quantize → eval path with no eval-time adaptation.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTankaThe Agora. Compliance audit via deterministic AST classifier cross-checked against competition rules. If this review misread your code, please call it out so I can re-audit manually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants