Add Combined Int6 + QAT + Sliding Window submission by pleasedontddosme · Pull Request #149 · openai/parameter-golf

pleasedontddosme · 2026-03-20T01:43:25Z

Combines best techniques from WarmdownQuantization (#1) and SlidingWindow (#2):

Int6 quant, FP16 tied embeddings, Late-K passthrough
Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
Muon decoupled weight decay, AdamW for embeddings/scalars
Novel: QAT with STE in last 30% of training for near-zero quant penalty
Cosine warmdown schedule, higher Muon momentum warmup

Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2): - Int6 quant, FP16 tied embeddings, Late-K passthrough - Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix - Muon decoupled weight decay, AdamW for embeddings/scalars - Novel: QAT with STE in last 30% of training for near-zero quant penalty - Cosine warmdown schedule, higher Muon momentum warmup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace enable_gqa kwarg (PyTorch 2.5+) with manual repeat_interleave for KV heads, compatible with all versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Key changes based on analysis of first 8xH100 run (1.1888 bpb): - Remove QAT: 14% step overhead was net negative - 10 layers + 2x MLP: faster steps (57ms vs 66ms), more training - eval@2048 with stride=128: biggest bpb win (~0.02), matching leader's approach - Auto-detect enable_gqa (PyTorch 2.5+) with repeat_interleave fallback - Smaller eval batch (64) for 2048-length sequences to avoid OOM Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use version tuple check instead of runtime probe for enable_gqa - Add weights_only=False to torch.load (default changed in 2.8) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adopts thwu1's core: 10L GQA, learned BigramHashEmbedding, SmearGate, mixed Int5/Int6 quant, magnitude pruning, SWA, zstd compression, linear warmdown, train@2048. Novel additions: dual bigram (learned + post-hoc statistical table), compiled eval, PyTorch 2.4 GQA fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Novel innovations over all competitors: - TTT LoRA: per-document rank-8 Q/V/LM-head adapters at eval time - Dual bigram inside TTT: statistical bigram table applied within TTT loop - Label smoothing (0.05): prevents overconfident predictions - Z-loss (1e-4): regularizes logit magnitudes for quantization Refactored GPT._run_blocks() to accept optional LoRA, added forward_ttt_logits() for TTT eval path. Fallback: TTT_ENABLED=0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… artifact - Fix mtp_proj zero-init being overwritten by _init_weights (add _zero_init flag) - Fix FP16 passthrough: keep last TWO layers' c_k in FP16 (blocks.8 + blocks.9) - Add multi-token prediction (t+2 auxiliary loss, weight=0.15) - Add LR cooldown floor at 5% to prevent dead final training steps - Strip mtp_proj from artifact to save ~200KB of 16MB budget - Fix variable name collision (mtp_proj -> mtp_logit_proj) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…e, XSA, EMA; update learning rates and SWA settings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rash) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MatoTeziTanka · 2026-04-12T14:17:33Z

PR #149 Review

Title: Add Combined Int6 + QAT + Sliding Window submission
State: open
Date Reviewed: 2026-04-11

Code Analysis

train_gpt.py Checks

target-in-key pattern: not found
TTT (Temporal Token Tagging): not found
SLOT (Slot MoE): not found
Custom Tokenizer: not found

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation:

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE

Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)

Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

pleasedontddosme and others added 8 commits March 20, 2026 02:41

Fix GQA compatibility with PyTorch 2.4

f2ed21b

Replace enable_gqa kwarg (PyTorch 2.5+) with manual repeat_interleave for KV heads, compatible with all versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix PyTorch 2.8 compat: version-based GQA detection, weights_only=False

d389270

- Use version tuple check instead of runtime probe for enable_gqa - Add weights_only=False to torch.load (default changed in 2.8) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

V5.2 Enhance hyperparameters and optimizer: add Partial RoPE, LN Scal…

edd990b

…e, XSA, EMA; update learning rates and SWA settings

pleasedontddosme marked this pull request as draft March 21, 2026 22:46

pleasedontddosme and others added 3 commits March 21, 2026 23:57

Fix PyTorch <2.5 compat: conditional device_id in init_process_group

406c109

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Allow DIST_BACKEND override (gloo fallback for broken NCCL)

d3f89dc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix torch.compile fullgraph bug on PyTorch 2.4 (mul_softmax_pattern c…

ff7a3a0

…rash) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Combined Int6 + QAT + Sliding Window submission#149

Add Combined Int6 + QAT + Sliding Window submission#149
pleasedontddosme wants to merge 11 commits intoopenai:mainfrom
pleasedontddosme:combined-int6-qat-slidingwindow

pleasedontddosme commented Mar 20, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pleasedontddosme commented Mar 20, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

PR #149 Review

Code Analysis

train_gpt.py Checks

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants