QAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated) #989

Open

alexanderaperry-arch wants to merge 1 commit into openai:main from alexanderaperry-arch:qat-swa-ablation

Conversation

@alexanderaperry-arch commented Mar 27, 2026

Leaderboard-relevant ablation: SWA and QAT are antagonistic

Systematic 2×2 factorial (QAT on/off × SWA on/off) on the PR #180 stack, 3-seed validated on 8×H100; all runs finish under the 10-minute wallclock and the 16MB artifact cap.

Result

| Config     | QAT | SWA | Mean BPB (3 seeds) | Delta vs control |
|------------|-----|-----|--------------------|------------------|
| no_swa_qat | Yes | No  | 1.14018            | -3.64 mBPB       |
| control    | No  | Yes | 1.14382            | baseline         |
| qat_snap70 | Yes | Yes | 1.14468            | +0.86 mBPB       |
| no_swa     | No  | No  | 1.14486            | +1.04 mBPB       |
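The antagonism can be quantified as the interaction term of the 2×2 factorial, computed directly from the table above (a sanity-check sketch, not code from the PR):

```python
# Interaction term of the 2x2 factorial, from the 3-seed mean BPBs above.
# A nonzero interaction means QAT's effect depends on whether SWA is on.
bpb = {
    ("qat", "swa"): 1.14468,        # qat_snap70
    ("qat", "no_swa"): 1.14018,     # no_swa_qat
    ("no_qat", "swa"): 1.14382,     # control
    ("no_qat", "no_swa"): 1.14486,  # no_swa
}

qat_effect_with_swa = bpb[("qat", "swa")] - bpb[("no_qat", "swa")]
qat_effect_without_swa = bpb[("qat", "no_swa")] - bpb[("no_qat", "no_swa")]

# QAT hurts (+0.86 mBPB) with SWA on, helps (-4.68 mBPB) with SWA off:
interaction_mbpb = (qat_effect_with_swa - qat_effect_without_swa) * 1000
print(round(interaction_mbpb, 2))  # 5.54 mBPB interaction
```

A 5.54 mBPB interaction dwarfs either main effect, which is the statistical signature of the antagonism claimed in the title.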

Why this matters

Every QAT submission in this competition (#117, #139, smeargate_ortho) also used SWA — and every one underperformed non-QAT entries. Our ablation shows why: SWA's checkpoint averaging dilutes the quantization-boundary alignment that QAT works to achieve. Combining them is worse than either alone.
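The dilution mechanism can be illustrated in a few lines (an illustrative sketch with a hypothetical symmetric [-31, 31] grid, not the PR's code): QAT snaps weights onto the quantization grid, but the equal-weight average of two grid-aligned checkpoints generally lands between grid points, so it re-acquires quantization error.

```python
import numpy as np

def quantize(w, scale=1 / 31):
    """Round to a symmetric integer grid with levels in [-31, 31]."""
    return np.clip(np.round(w / scale), -31, 31) * scale

rng = np.random.default_rng(0)
scale = 1 / 31

# Two checkpoints whose weights QAT has already snapped onto the grid.
ckpt_a = quantize(rng.normal(0, 0.3, 1000), scale)
ckpt_b = quantize(rng.normal(0, 0.3, 1000), scale)

swa = 0.5 * (ckpt_a + ckpt_b)  # SWA-style equal-weight average

# Grid-aligned checkpoints re-quantize losslessly; the average does not,
# because half the averaged values sit midway between grid levels.
err_a = np.abs(quantize(ckpt_a, scale) - ckpt_a).mean()
err_swa = np.abs(quantize(swa, scale) - swa).mean()
print(err_a, err_swa)
```

The first error is exactly zero while the second is strictly positive, which is the "dilution" in miniature: every checkpoint SWA folds in pushes the averaged weights off the levels QAT trained them onto.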

The fix is simple: remove SWA when using QAT. This change alone recovers ~3.6 mBPB.

Actionable for competitors

  • If you're using SWA + QAT together, drop SWA
  • QAT alone is 3.5x more effective than SWA alone for quantization quality
  • Training val_bpb is misleading for QAT — post-quantization BPB is the metric that matters
  • QAT weights need ~10% magnitude pruning (vs 3%) to fit under 16MB — they compress worse
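For the last bullet, global magnitude pruning is the standard recipe; a minimal sketch (hypothetical helper, not the PR's actual pruning code) of zeroing the smallest-magnitude ~10% of weights across all layers:

```python
import numpy as np

def magnitude_prune(weights, frac=0.10):
    """Zero the globally smallest-|w| fraction of weights.
    Hypothetical illustration of the ~10% pruning mentioned above."""
    flat = np.abs(np.concatenate([w.ravel() for w in weights]))
    k = int(frac * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    thresh = np.partition(flat, k - 1)[k - 1]  # k-th smallest |w|
    return [np.where(np.abs(w) <= thresh, 0.0, w) for w in weights]

rng = np.random.default_rng(42)
layers = [rng.normal(size=(64, 64)) for _ in range(4)]
pruned = magnitude_prune(layers, frac=0.10)

sparsity = sum((w == 0).sum() for w in pruned) / sum(w.size for w in pruned)
print(round(sparsity, 3))
```

A single global threshold (rather than per-layer thresholds) lets layers with more redundancy absorb more of the pruning budget, which tends to matter when the goal is an artifact-size cap rather than uniform sparsity.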

Open question

Top entries now use EMA instead of SWA. Nagel et al. (2022) propose EMA to stabilize QAT, but our short-horizon results suggest that averaging mechanisms in general may conflict with QAT under tight wallclock constraints. The EMA × QAT interaction remains untested.
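For readers weighing that open question, the two averaging rules differ only in how they weight history (textbook definitions below, not the competition's implementation):

```python
def swa_update(avg, w, n):
    """SWA: equal-weight running mean over the n checkpoints seen so far."""
    return avg + (w - avg) / (n + 1)

def ema_update(avg, w, decay=0.999):
    """EMA: exponentially decayed average; recent weights dominate."""
    return decay * avg + (1 - decay) * w

# SWA gives early checkpoints the same weight as the latest one; EMA
# forgets them geometrically, so a late QAT-aligned checkpoint is
# diluted less by pre-QAT history.
avg_swa, avg_ema = 0.0, 0.0
for n, w in enumerate([1.0, 1.0, 0.0]):
    avg_swa = swa_update(avg_swa, w, n)
    avg_ema = ema_update(avg_ema, w, decay=0.5)
print(avg_swa, avg_ema)
```

If averaging per se is the problem, EMA should hurt QAT too; if the problem is SWA's equal weighting of pre-alignment checkpoints, a fast-decaying EMA might not. The ablation to distinguish these is exactly what the paragraph above calls untested.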

Validation

Non-record research submission. 2x2 factorial ablation of QAT x SWA
interaction on PR openai#180 stack (10L/512d/MLP3x).

Key finding: SWA and QAT are antagonistic. QAT alone (1.14018, 3-seed
mean) beats SWA alone (1.14382) by 3.64 mBPB. Combining them is worse
than either alone. This explains why prior QAT entries underperformed
non-QAT submissions in the competition.

3-seed validation (seeds 42, 1337, 2024), artifact under 16MB limit.
@alexanderaperry-arch alexanderaperry-arch changed the title QAT x SWA Ablation: antagonistic interaction finding QAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated) Mar 27, 2026
aryanbhosale added a commit to aryanbhosale/parameter-golf that referenced this pull request Mar 28, 2026
slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
No SWA with QAT (PR openai#989)
QAT from 50% + range fix [-31,31]
mHC 22-param residual mixing (PR openai#928)
VE128 + no gated_attn + no value_residual (PR openai#549)
LZMA preset 7 compression (PR openai#999)
Muon TTT with NS3 (PR openai#999)
Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
Per-layer TTT LR (PR openai#995)
TTT momentum 0.95 (PR openai#995)
@MatoTeziTanka

Community Review — QAT x SWA Ablation: SWA sabotages QAT (-3.64 mBPB, 3-seed validated)

BPB: 1.1402 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA ae1d2c25a4cf, file records/track_10min_16mb/2026-03-28_QAT_SWA_Ablation/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=10, vocab=1024, code=54948 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
