Add Combined Int6 + QAT + Sliding Window submission #149
Draft
pleasedontddosme wants to merge 11 commits into openai:main from
Conversation
Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):
- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in last 30% of training for near-zero quant penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
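A minimal sketch of the quantization-aware-training idea this commit names: fake-quantize weights to Int6 on the forward pass while the straight-through estimator passes gradients through unchanged. The symmetric per-tensor scaling here is an illustrative assumption, not necessarily the PR's exact scheme.

```python
import torch

def fake_quant_int6_ste(w: torch.Tensor) -> torch.Tensor:
    # Symmetric Int6 fake quantization: 63 levels in [-31, 31].
    scale = w.abs().max().clamp(min=1e-8) / 31.0
    w_q = (w / scale).round().clamp(-31, 31) * scale
    # Straight-through estimator: forward evaluates to w_q,
    # but the backward pass sees the identity w.
    return w + (w_q - w).detach()
```

During the last 30% of steps, each linear layer would apply this to its weight before the matmul, so the network adapts to the rounding it will see after post-training quantization.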
Replace enable_gqa kwarg (PyTorch 2.5+) with manual repeat_interleave for KV heads, compatible with all versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
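A sketch of that fallback, assuming the usual (batch, heads, seq, head_dim) layout for attention tensors; the head-count arguments are illustrative:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_head: int, n_kv_head: int):
    # q: (B, n_head, T, D); k, v: (B, n_kv_head, T, D).
    # Older PyTorch lacks enable_gqa in scaled_dot_product_attention,
    # so expand the KV heads manually to match the query heads.
    if n_kv_head != n_head:
        rep = n_head // n_kv_head
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```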
Key changes based on analysis of first 8xH100 run (1.1888 bpb):
- Remove QAT: 14% step overhead was net negative
- 10 layers + 2x MLP: faster steps (57ms vs 66ms), more training
- eval@2048 with stride=128: biggest bpb win (~0.02), matching leader's approach
- Auto-detect enable_gqa (PyTorch 2.5+) with repeat_interleave fallback
- Smaller eval batch (64) for 2048-length sequences to avoid OOM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
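A sketch of strided sliding-window evaluation as described: each 2048-token window advances by stride 128 and only the targets not already covered by an earlier window are scored, so almost every token is predicted with near-full left context. The bookkeeping and function name are assumptions; it returns mean bits per token (dividing total bits by the byte count instead would give bpb).

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def strided_eval_bits(model, tokens, window: int = 2048, stride: int = 128) -> float:
    # tokens: 1-D LongTensor of token ids.
    total_bits, scored = 0.0, 0
    done = 0  # absolute index of the last target position already scored
    for start in range(0, tokens.numel() - 1, stride):
        chunk = tokens[start : start + window + 1]
        if chunk.numel() < 2:
            break
        x, y = chunk[:-1].unsqueeze(0), chunk[1:]
        logits = model(x)[0]  # (T, vocab)
        # Targets in this window sit at absolute positions start+1 .. start+T;
        # score only the ones no previous window has covered.
        first_new = max(0, done - start)
        if first_new >= y.numel():
            continue
        nll = F.cross_entropy(logits[first_new:], y[first_new:], reduction="sum")
        total_bits += nll.item() / math.log(2)
        done = start + y.numel()
        scored += y.numel() - first_new
    return total_bits / scored
```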
- Use version tuple check instead of runtime probe for enable_gqa
- Add weights_only=False to torch.load (default changed in 2.8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
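A sketch of the two guards; the checkpoint filename is illustrative, and weights_only=False should only be used on a trusted file since it unpickles arbitrary objects:

```python
import torch

# Parse "2.5.1+cu121"-style version strings into a comparable tuple.
_TORCH_VERSION = tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:2])
HAS_ENABLE_GQA = _TORCH_VERSION >= (2, 5)

# Newer PyTorch defaults torch.load to weights_only=True; this checkpoint
# stores full Python objects, so opt out explicitly (trusted file only).
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
```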
Adopts thwu1's core: 10L GQA, learned BigramHashEmbedding, SmearGate, mixed Int5/Int6 quant, magnitude pruning, SWA, zstd compression, linear warmdown, train@2048.

Novel additions: dual bigram (learned + post-hoc statistical table), compiled eval, PyTorch 2.4 GQA fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
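A minimal sketch of what a learned bigram hash embedding can look like: hash each (previous, current) token pair into a small table and add the looked-up vector to the usual token embedding. The table size, hash constant, and zero init are illustrative assumptions, not the PR's values.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, n_embd: int, table_size: int = 1 << 16):
        super().__init__()
        self.table_size = table_size
        self.emb = nn.Embedding(table_size, n_embd)
        nn.init.zeros_(self.emb.weight)  # start as a no-op

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids; pair each token with its predecessor.
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # first position has no predecessor
        # Cheap multiplicative hash of the pair into the table.
        h = (prev * 1000003 + idx) % self.table_size
        return self.emb(h)
```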
Novel innovations over all competitors:
- TTT LoRA: per-document rank-8 Q/V/LM-head adapters at eval time
- Dual bigram inside TTT: statistical bigram table applied within TTT loop
- Label smoothing (0.05): prevents overconfident predictions
- Z-loss (1e-4): regularizes logit magnitudes for quantization

Refactored GPT._run_blocks() to accept optional LoRA, added forward_ttt_logits() for the TTT eval path. Fallback: TTT_ENABLED=0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
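A sketch of the combined loss with the coefficients the commit names; the z-loss term penalizes the squared log-partition-function so logit magnitudes stay small and quantize cleanly:

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (N, vocab), targets: (N,)
    ce = F.cross_entropy(logits, targets, label_smoothing=0.05)
    # Z-loss: penalize log(sum(exp(logits)))^2 to bound logit magnitudes.
    z = torch.logsumexp(logits.float(), dim=-1)
    return ce + 1e-4 * (z ** 2).mean()
```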
… artifact
- Fix mtp_proj zero-init being overwritten by _init_weights (add _zero_init flag)
- Fix FP16 passthrough: keep last TWO layers' c_k in FP16 (blocks.8 + blocks.9)
- Add multi-token prediction (t+2 auxiliary loss, weight=0.15)
- Add LR cooldown floor at 5% to prevent dead final training steps
- Strip mtp_proj from artifact to save ~200KB of 16MB budget
- Fix variable name collision (mtp_proj -> mtp_logit_proj)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
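A sketch of the cooldown floor: a linear warmdown clamped at 5% of peak LR so the final steps still move the weights. Everything about the schedule shape other than the 5% floor is an assumption.

```python
def lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.3) -> float:
    # Full LR until warmdown begins, then decay linearly, never below 5%.
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    frac = (total_steps - step) / max(1, total_steps - warmdown_start)
    return max(0.05, frac)
```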
…e, XSA, EMA; update learning rates and SWA settings
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rash)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR #149 Review
Title: Add Combined Int6 + QAT + Sliding Window submission
Code Analysis: train_gpt.py checks
Verdict
Classification: PURE_NEURAL_CLEAN
Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE

Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11). Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.