Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22) #297
Conversation
Three techniques from the top PRs (openai#265, openai#287, openai#297):

1. XSA (Exclusive Self Attention) on the last 3 layers (XSA_LAST_N=3): removes self-value bias via orthogonal projection (arXiv:2603.09078). GQA-aware: uses reshape+broadcast instead of repeat_interleave. Zero new parameters, ~2 ms/step overhead.
2. EMA (decay=0.997) replaces SWA (EMA_ENABLED=1, SWA_ENABLED=0): an exponential moving average updated every step during warmdown. Smoother weight averaging, better generalization/compression (see the sketch after this list).
3. Late QAT (QAT_LATE_FRAC=0.85): QAT activates at 85% of wallclock to avoid Muon momentum corruption. LR is halved when QAT activates (per the PR openai#297 finding).

Comments were trimmed to stay under the 1500-line cap (1457 lines).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
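For concreteness, a minimal sketch of the EMA weight averaging described in item 2 (decay 0.997, updated after every optimizer step during warmdown); the helper names and the explicit copy-back step are illustrative assumptions, not the PR's actual code:

```python
import torch

EMA_DECAY = 0.997  # decay reported in the PR description

@torch.no_grad()
def init_ema(model) -> dict:
    """Snapshot the current parameters as the initial EMA state."""
    return {name: p.detach().clone() for name, p in model.named_parameters()}

@torch.no_grad()
def update_ema(ema_state: dict, model, decay: float = EMA_DECAY) -> None:
    """One EMA step, called after every optimizer step during warmdown:
    ema <- decay * ema + (1 - decay) * param."""
    for name, p in model.named_parameters():
        ema_state[name].mul_(decay).add_(p.detach(), alpha=1.0 - decay)

@torch.no_grad()
def load_ema(ema_state: dict, model) -> None:
    """Copy the averaged weights back into the model before eval/quantization."""
    for name, p in model.named_parameters():
        p.copy_(ema_state[name])
```

SWA keeps a uniform average over checkpoints in a window, while the EMA above weights recent steps more heavily, which is the smoother averaging the description refers to.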
Any feedback? @0hq
Community Review — Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT (int6+zstd-22)

Compliance flag: Pre-Quant TTT violation
Head SHA: 6b710c1

Check 1 — N-gram family bug (target token in hash key)

CLEAN. BigramHashEmbedding.bigram_hash hashes position i using tokens[i] and tokens[i-1]; the target (next) token never enters the hash key.

Check 2 — Pre-Quant TTT (multi-epoch AdamW on val_tokens without score-first)

CLOSE. The submitted score comes from eval_val_sgd_ttt, which follows a train-then-score pattern: the model is adapted on val_tokens first and only scored afterwards.

This is structurally identical to the Pre-Quant TTT violation: the model is trained on val_tokens before the reported score is obtained. The fact that it uses SGD (not AdamW) and operates post-quantization (on dequantized weights) does not change the classification — the optimizer sees the val labels before scoring, and the reported bpb reflects those adapted weights. Score-first is not satisfied at any granularity. The pre-TTT sliding-window score (q_val_bpb, logged as final_int8_zlib_roundtrip) is computed before TTT runs, but it is not the submitted score. The submission JSON value 1.16292025 matches final_sgd_ttt.

Check 3 — Legal TTT (score-first-per-chunk)

The LoRA TTT path (eval_val_ttt_lora) does implement score-first-per-chunk correctly — chunk i is scored before being used for a gradient step. However, this path is disabled by default (TTT_LORA_ENABLED=0) and is not what produces the submitted score.

Check 4 — Scored-region SLOT

Not applicable; SGD TTT trains over the full val set uniformly. No scored-region manipulation identified.

This PR's TTT implementation trains on validation tokens before scoring them, which violates the score-first-per-chunk discipline established in PR #1413 and the rulings in Issue #677. The legal pattern requires scoring each chunk under the not-yet-adapted weights before any gradient step is taken on it.

Verdict: CLOSE — Pre-Quant TTT violation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: recommend CLOSE unless the author restructures to score-first-per-chunk (PR #1413 pattern).

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
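For readers unfamiliar with the distinction the review draws, a minimal sketch of the score-first-per-chunk pattern it calls legal, contrasted in a trailing comment with the flagged train-then-score pattern. This is not the PR's code: the chunking, the assumption that the model returns logits, and the SGD hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def score_first_per_chunk_ttt(model, val_chunks, lr: float = 3e-4, momentum: float = 0.95):
    """Legal pattern per the review: chunk i is scored with the current weights
    BEFORE any gradient step is taken on it, so no chunk's score reflects
    training on its own labels."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in val_chunks:            # (B, T) token tensors per chunk
        with torch.no_grad():                     # 1) score the chunk first
            logits = model(inputs)
            nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        total_nll += nll.item() * targets.numel()
        total_tokens += targets.numel()
        logits = model(inputs)                    # 2) only then adapt on it
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    return total_nll / total_tokens               # mean NLL (nats) over scored tokens

# The flagged pattern, by contrast, runs the adaptation loop over val_chunks first
# and computes the reported score afterwards, so the score reflects weights that
# have already seen the validation labels.
```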
This record captures Late STE QAT plus a dense 9×512 stack (MLP 3×, SmearGate, BigramHash, ortho/Overtone-style init, SWA), with full-model SGD test-time training (not LoRA) applied after the sliding-window eval on the dequantized checkpoint.
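A minimal sketch of a bigram-hash embedding of the kind described in the review's Check 1 (position i hashed from tokens[i] and tokens[i-1] into a bucketed embedding table); the bucket count, mixing constants, and the zero-padding of position 0 are assumptions, not the PR's values:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Minimal sketch: embed a hash of (previous token, current token).
    Bucket count and mixing constants are illustrative assumptions."""

    def __init__(self, n_buckets: int = 65536, dim: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, dim)

    def bigram_hash(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) int64. Position i hashes tokens[i] and tokens[i-1];
        # position 0 treats the previous token as 0 (assumption).
        prev = torch.cat([torch.zeros_like(tokens[:, :1]), tokens[:, :-1]], dim=1)
        mixed = tokens * 1000003 + prev * 8191   # simple multiplicative mix (assumption)
        return mixed % self.n_buckets

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.table(self.bigram_hash(tokens))
```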
Method
Training (600s wallclock, 8×H100 SXM)
Muon + AdamW, MLP 3× (hidden 1536), SmearGate, BigramHash, SWA over the second half of warmdown, late STE QAT from ~85% of wallclock with 0.5× LR when QAT activates. Key knobs:
`matrix_lr=0.025`, `muon_weight_decay=0.038`, `train_batch_tokens=786432`, `train_seq_len=2048`, `eval_stride=64`, etc. (see `README.md` in this folder).
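A minimal sketch of the late-activation STE QAT step under these assumptions: symmetric per-tensor int6 fake quantization, a straight-through estimator, activation at `QAT_LATE_FRAC` of the wallclock budget, and the LR halved once at activation. Function names and the per-tensor scaling are illustrative, not the PR's exact implementation:

```python
import torch

QAT_LATE_FRAC = 0.85  # switch QAT on at 85% of the wallclock budget

def fake_quant_int6_ste(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int6 fake quantization with a straight-through estimator:
    the forward pass sees quantized weights, gradients flow through unchanged."""
    qmax = 31.0  # symmetric int6 range (assumption; the exact range may differ)
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()  # STE trick

def maybe_activate_qat(qat_active: bool, elapsed_s: float, budget_s: float, optimizer) -> bool:
    """One-shot activation: once elapsed wallclock crosses the threshold,
    halve the LR (per the PR's reported finding) and keep QAT on."""
    if qat_active or elapsed_s < QAT_LATE_FRAC * budget_s:
        return qat_active
    for group in optimizer.param_groups:
        group["lr"] *= 0.5
    return True
```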
Evaluation

- Sliding-window eval on the dequantized checkpoint (`eval_stride=64`).
- Full-model SGD test-time training (lr `3e-4`, momentum `0.95`; LoRA TTT off by default).
- The int6+zstd roundtrip metric appears as `final_int8_zstd_roundtrip_exact` in logs when using this script.
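A minimal sketch of a strided sliding-window evaluation matching the description above: each window scores only its trailing `stride` targets (the first window scores all of its targets), so tokens are evaluated with long left context. The window length default and the assumption that the model returns logits are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens: torch.Tensor, seq_len: int = 2048, stride: int = 64):
    """Strided sliding-window eval over a 1-D token stream.
    Returns (summed NLL in nats, number of scored tokens)."""
    device = next(model.parameters()).device
    total_nll, total_count = 0.0, 0
    for begin in range(0, tokens.numel() - seq_len, stride):
        window = tokens[begin : begin + seq_len + 1]
        inputs = window[:-1].unsqueeze(0).to(device)
        targets = window[1:].unsqueeze(0).to(device)
        logits = model(inputs)                        # (1, seq_len, vocab_size) assumed
        nll = F.cross_entropy(logits[0], targets[0], reduction="none")
        keep = nll if begin == 0 else nll[-stride:]   # only trailing stride tokens after window 0
        total_nll += keep.sum().item()
        total_count += keep.numel()
    return total_nll, total_count

# bits-per-byte = (total NLL in bits) / (byte length of the scored text);
# the byte count comes from the detokenized validation text (assumption).
```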
Why zstd here

Using zstd-22 instead of zlib on the same quantized blob keeps `bytes_total` under the 16,000,000-byte cap (decimal MB) for this configuration.
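A minimal sketch of the artifact compression and byte accounting, assuming the int6 weights are already packed into a raw byte blob (the packing itself is not shown) and using the `zstandard` package at its maximum level 22:

```python
import zstandard as zstd

SIZE_CAP_BYTES = 16_000_000  # decimal-MB cap for the 16 MB track

def compress_blob(int6_blob: bytes, code_bytes: int) -> dict:
    """Compress the packed int6 weight blob with zstd level 22 and report the
    byte accounting used for bytes_total (compressed artifact + UTF-8 source size)."""
    compressed = zstd.ZstdCompressor(level=22).compress(int6_blob)
    bytes_total = len(compressed) + code_bytes
    return {
        "bytes_artifact": len(compressed),
        "bytes_code": code_bytes,
        "bytes_total": bytes_total,
        "under_cap": bytes_total < SIZE_CAP_BYTES,
    }

def roundtrip_ok(int6_blob: bytes, compressed: bytes) -> bool:
    """Decompression must reproduce exactly the blob that was scored."""
    return zstd.ZstdDecompressor().decompress(compressed) == int6_blob
```

With the sizes logged below (a 15,884,217-byte artifact plus 64,426 bytes of source), `bytes_total` comes to 15,948,643, roughly 51 KB under the 16,000,000-byte cap.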
{ "track": "10min_16mb", "date": "2026-03-20", "name": "Late STE QAT + Int6 MLP3x + SmearGate + BigramHash + OrthoInit + Overtone + SWA + SGD TTT", "author": "David Puertolas Merenciano", "github_id": "davidpuertolas", "blurb": "Late STE QAT (last 15%, per #76) avoids Muon momentum corruption while closing quant gap. Full-model SGD TTT (per #152) replaces LoRA TTT which hurts with SmearGate (#178). WD=0.038 + LR=0.025 from best validated submissions (#179, #194). Artifact: int6+zstd-22, under 16MB cap.", "val_loss": 1.96353693, "val_bpb": 1.16292025, "bytes_total": 15948643, "bytes_code": 64426 }step=5464intrain.log)Compressed artifact (logged): 15,884,217 bytes int6+zstd + 64,426 bytes UTF-8
train_gpt.py= 15,948,643 total.Command
Command

From repo root, with FineWeb sp1024 data and the tokenizer installed. Single GPU: pass `--nproc_per_node=1`. Longer runs: set `MAX_WALLCLOCK_SECONDS=0` or another value.
Included files

- old/20/03/26-zstandard/train_gpt.py
- old/20/03/26-zstandard/train.log
- old/20/03/26-zstandard/README.md
- old/20/03/26-zstandard/submission.json