
Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22)#297

Open
davidpuertolas wants to merge 2 commits into openai:main from
davidpuertolas:submission/2026-03-20_STE_QAT_MLP3x_SmearBigram_LoRATTT

Conversation

@davidpuertolas

This record captures Late STE QAT + a dense 9×512 stack (MLP3×, SmearGate, BigramHash, ortho / Overtone-style init, SWA) with full-model SGD test-time training (not LoRA) after sliding-window eval on the dequantized checkpoint.

Method

Training (600s wallclock, 8×H100 SXM)

Muon + AdamW, MLP (hidden 1536), SmearGate, BigramHash, SWA over the second half of warmdown, late STE QAT from ~85% of wallclock with 0.5× LR when QAT activates. Key knobs: matrix_lr=0.025, muon_weight_decay=0.038, train_batch_tokens=786432, train_seq_len=2048, eval_stride=64, etc. (see README.md in this folder).
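The "late STE QAT with 0.5× LR at activation" schedule can be sketched as a fake-quant forward pass with a straight-through estimator, switched on once ~85% of the budget is spent. This is an illustrative sketch, not the submission's code: function names are invented, the int6 range is assumed symmetric per-row, and step fraction stands in for wallclock fraction.

```python
import torch

def ste_fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Per-row symmetric int6 fake-quantization with a straight-through
    estimator: the forward pass sees quantized weights, the backward pass
    treats quantization as identity. Illustrative sketch only."""
    levels = 31  # symmetric int6 codes live in [-31, 31]
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / levels
    w_q = (w / scale).round().clamp(-levels, levels) * scale
    return w + (w_q - w).detach()  # STE: value of w_q, gradient of w

def maybe_enable_qat(step: int, total_steps: int, optimizer, state: dict,
                     late_frac: float = 0.85) -> bool:
    """Activate QAT at late_frac of the budget; halve the LR once, on activation."""
    if not state["qat_on"] and step >= late_frac * total_steps:
        state["qat_on"] = True
        for group in optimizer.param_groups:
            group["lr"] *= 0.5
    return state["qat_on"]
```

Applying `ste_fake_quant_int6` to weight matrices only while `state["qat_on"]` is true gives the "late" behavior: Muon momentum never sees quantization noise during the first ~85% of training.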

Evaluation

  1. Serialize int6 (per-row) + compress with zstd level 22 → submission artifact.
  2. Load, sliding-window validation (eval_stride=64).
  3. SGD TTT (LR 3e-4, momentum 0.95; LoRA TTT off by default).
  4. Report roundtrip metrics (final_int8_zstd_roundtrip_exact in logs when using this script).

Why zstd here

Using zstd-22 instead of zlib on the same quantized blob keeps bytes_total under the 16,000,000-byte cap (decimal MB) for this configuration.

Submission metadata

{
  "track": "10min_16mb",
  "date": "2026-03-20",
  "name": "Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT",
  "author": "David Puertolas Merenciano",
  "github_id": "davidpuertolas",
  "blurb": "Late STE QAT (last 15%, per #76) avoids Muon momentum corruption while closing quant gap. Full-model SGD TTT (per #152) replaces LoRA TTT which hurts with SmearGate (#178). WD=0.038 + LR=0.025 from best validated submissions (#179, #194). Artifact: int6+zstd-22, under 16MB cap.",
  "val_loss": 1.96353693,
  "val_bpb": 1.16292025,
  "bytes_total": 15948643,
  "bytes_code": 64426
}
| Field | Value |
| --- | --- |
| val_bpb | 1.16292025 |
| val_loss | 1.96353693 |
| bytes_total | 15,948,643 (below 16,000,000) |
| bytes_code | 64,426 |
| Seed (logged) | 1337 |
| Wallclock cap | 600s (step=5464 in train.log) |

Compressed artifact (logged): 15,884,217 bytes int6+zstd + 64,426 bytes UTF-8 train_gpt.py = 15,948,643 total.

Command

From repo root, with FineWeb sp1024 data and tokenizer installed:

pip install zstandard

export HF_TOKEN="..."   # if needed for dataset download
python3 data/cached_challenge_fineweb.py --variant sp1024
RUN_ID=late_qat_sgd_ttt_zstd \
SEED=1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
EVAL_STRIDE=64 \
SGD_TTT_ENABLED=1 \
TTT_LORA_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 \
  old/20/03/26-zstandard/train_gpt.py

Single GPU: --nproc_per_node=1. Longer runs: MAX_WALLCLOCK_SECONDS=0 or another value.

Included files

| File | Role |
| --- | --- |
| old/20/03/26-zstandard/train_gpt.py | Training + int6+zstd artifact |
| old/20/03/26-zstandard/train.log | Example log (seed 1337) |
| old/20/03/26-zstandard/README.md | Full write-up |
| old/20/03/26-zstandard/submission.json | Challenge JSON |

rarce added a commit to rarce/parameter-golf that referenced this pull request Mar 21, 2026
Three techniques from the top PRs (openai#265, openai#287, openai#297):

1. XSA (Exclusive Self Attention) on last 3 layers (XSA_LAST_N=3):
   Removes self-value bias via orthogonal projection (arXiv:2603.09078).
   GQA-aware: uses reshape+broadcast instead of repeat_interleave.
   Zero new parameters, ~2ms/step overhead.

2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0):
   Exponential moving average updated every step during warmdown.
   Smoother weight averaging, better generalization/compression.

3. Late QAT (QAT_LATE_FRAC=0.85):
   QAT activates at 85% of wallclock to avoid Muon momentum corruption.
   LR halved when QAT activates (per PR openai#297 finding).

Trimmed comments to stay under 1500-line cap (1457 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davidpuertolas
Author

any feedback? @0hq

@MatoTeziTanka

Community Review — Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22)

Compliance flag: Pre-Quant TTT violation

Head SHA: 6b710c1
File audited: records/track_10min_16mb/2026-03-20_STE_QAT_MLP3x_SmearBigram_LoRATTT/train_gpt.py

Check 1 — N-gram family bug (target token in hash key)

CLEAN. BigramHashEmbedding.bigram_hash hashes position i using tokens[i] and tokens[i-1]:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

This is called with input_ids in GPT.forward: self.bigram(input_ids). Position i uses the current input token and its predecessor, both causal; the target token (tokens[i+1]) never enters the hash key. No n-gram family bug in BigramHash. CLEAN.
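The causality argument can be checked mechanically with a standalone reproduction of the quoted line. Only the hashed expression is taken from the audited source; the modulus value and the position-0 handling below are placeholder assumptions.

```python
import torch

def bigram_hash(t: torch.Tensor, mod: int = 4096) -> torch.Tensor:
    """Reproduction of the quoted hash: position i mixes tokens[i] and
    tokens[i-1]. Position 0 falls back to t[..., 0] % mod here, which is
    a placeholder, not necessarily what the submission does."""
    out = t.clone() % mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out

# Causality check: perturbing a later token must not change earlier hashes.
t1 = torch.tensor([[5, 7, 9, 11]])
t2 = t1.clone()
t2[0, 3] = 999  # change only the last token
h1, h2 = bigram_hash(t1), bigram_hash(t2)
assert torch.equal(h1[:, :3], h2[:, :3])  # earlier positions unaffected
```

Since only positions at or after the perturbed index can change, a would-be target token cannot leak into any earlier hash key.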

Check 2 — Pre-Quant TTT (multi-epoch AdamW on val_tokens without score-first)

CLOSE. The submitted score comes from eval_val_sgd_ttt, which follows the train-then-score pattern:

  1. Full pass over all val_tokens (single epoch, batched): forward + backward + SGD.step() for every sequence, accumulating no scores.
  2. After the training epoch completes, eval_val_sliding is called to produce the reported bpb.

This is structurally identical to the Pre-Quant TTT violation: the model is trained on val_tokens before the reported score is obtained. The fact that it uses SGD (not AdamW) and operates post-quantization (on dequantized weights) does not change the classification — the optimizer sees the val labels before scoring, and the reported bpb reflects those adapted weights. Score-first is not satisfied at any granularity.

The pre-TTT sliding-window score (q_val_bpb, logged as final_int8_zlib_roundtrip) is computed before TTT runs, but it is not the submitted score. The submission JSON value 1.16292025 matches final_sgd_ttt.

Check 3 — Legal TTT (score-first-per-chunk)

The LoRA TTT path (eval_val_ttt_lora) does implement score-first-per-chunk correctly — chunk i is scored before being used for a gradient step. However, this path is disabled by default (TTT_LORA_ENABLED=0) and not what produces the submitted score.

Check 4 — Scored-region SLOT

Not applicable; SGD TTT trains over the full val set uniformly. No scored-region manipulation identified.

This PR's TTT implementation trains on validation tokens before scoring them, which violates the score-first-per-chunk discipline established in PR #1413 and the rulings in Issue #677. The legal pattern requires scoring each chunk under torch.no_grad() before any gradient step.
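For contrast, the legal score-first-per-chunk loop described above can be sketched as follows. The names are hypothetical and this is the discipline in miniature, not the submission's eval_val_ttt_lora code.

```python
import torch

def score_first_per_chunk(model, loss_fn, chunks, optimizer):
    """Score-first-per-chunk TTT: each chunk contributes to the reported
    loss BEFORE the model takes any gradient step on it. Sketch only."""
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1. Score this chunk with weights that have never seen its labels.
        with torch.no_grad():
            loss = loss_fn(model(inputs), targets)
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        # 2. Only now adapt on the already-scored chunk.
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    return total_loss / total_tokens
```

The invariant is that the no_grad score for chunk i is computed before any optimizer step on chunk i; swapping the two stages (full training pass first, sliding-window score after) reproduces the pattern flagged in Check 2.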

Verdict: CLOSE — Pre-Quant TTT violation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: Recommend CLOSE unless the author restructures to score-first-per-chunk (PR #1413 pattern).


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.

