Skip to content

Add Combined Int6 + QAT + Sliding Window submission#149

Draft
pleasedontddosme wants to merge 11 commits intoopenai:mainfrom
pleasedontddosme:combined-int6-qat-slidingwindow
Draft

Add Combined Int6 + QAT + Sliding Window submission#149
pleasedontddosme wants to merge 11 commits intoopenai:mainfrom
pleasedontddosme:combined-int6-qat-slidingwindow

Conversation

@pleasedontddosme
Copy link
Copy Markdown

Combines best techniques from WarmdownQuantization (#1) and SlidingWindow (#2):

  • Int6 quant, FP16 tied embeddings, Late-K passthrough
  • Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
  • Muon decoupled weight decay, AdamW for embeddings/scalars
  • Novel: QAT with STE in last 30% of training for near-zero quant penalty
  • Cosine warmdown schedule, higher Muon momentum warmup

pleasedontddosme and others added 8 commits March 20, 2026 02:41
Combines best techniques from WarmdownQuantization (openai#1) and SlidingWindow (openai#2):
- Int6 quant, FP16 tied embeddings, Late-K passthrough
- Batched sliding window eval (stride=64), overtone init, phase-transition resid_mix
- Muon decoupled weight decay, AdamW for embeddings/scalars
- Novel: QAT with STE in last 30% of training for near-zero quant penalty
- Cosine warmdown schedule, higher Muon momentum warmup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace enable_gqa kwarg (PyTorch 2.5+) with manual
repeat_interleave for KV heads, compatible with all versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Key changes based on analysis of first 8xH100 run (1.1888 bpb):
- Remove QAT: 14% step overhead was net negative
- 10 layers + 2x MLP: faster steps (57ms vs 66ms), more training
- eval@2048 with stride=128: biggest bpb win (~0.02), matching leader's approach
- Auto-detect enable_gqa (PyTorch 2.5+) with repeat_interleave fallback
- Smaller eval batch (64) for 2048-length sequences to avoid OOM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use version tuple check instead of runtime probe for enable_gqa
- Add weights_only=False to torch.load (default changed in 2.8)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adopts thwu1's core: 10L GQA, learned BigramHashEmbedding, SmearGate,
mixed Int5/Int6 quant, magnitude pruning, SWA, zstd compression,
linear warmdown, train@2048. Novel additions: dual bigram (learned +
post-hoc statistical table), compiled eval, PyTorch 2.4 GQA fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Novel innovations over all competitors:
- TTT LoRA: per-document rank-8 Q/V/LM-head adapters at eval time
- Dual bigram inside TTT: statistical bigram table applied within TTT loop
- Label smoothing (0.05): prevents overconfident predictions
- Z-loss (1e-4): regularizes logit magnitudes for quantization

Refactored GPT._run_blocks() to accept optional LoRA, added
forward_ttt_logits() for TTT eval path. Fallback: TTT_ENABLED=0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… artifact

- Fix mtp_proj zero-init being overwritten by _init_weights (add _zero_init flag)
- Fix FP16 passthrough: keep last TWO layers' c_k in FP16 (blocks.8 + blocks.9)
- Add multi-token prediction (t+2 auxiliary loss, weight=0.15)
- Add LR cooldown floor at 5% to prevent dead final training steps
- Strip mtp_proj from artifact to save ~200KB of 16MB budget
- Fix variable name collision (mtp_proj -> mtp_logit_proj)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e, XSA, EMA; update learning rates and SWA settings
@pleasedontddosme pleasedontddosme marked this pull request as draft March 21, 2026 22:46
pleasedontddosme and others added 3 commits March 21, 2026 23:57
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rash)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MatoTeziTanka
Copy link
Copy Markdown

PR #149 Review

Title: Add Combined Int6 + QAT + Sliding Window submission
State: open
Date Reviewed: 2026-04-11

Code Analysis

train_gpt.py Checks

  • target-in-key pattern: not found
  • TTT (Temporal Token Tagging): not found
  • SLOT (Slot MoE): not found
  • Custom Tokenizer: not found

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation:

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE


Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)


Reviewed by @MatoTeziTankaThe Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants