Non-record: Emergent weight symmetry in QO projections + learnable SymMix #1214
gersh wants to merge 1 commit into openai:main
Conversation
…mMix: Layers 6-8 O projections converge to exact symmetry (W = W^T, sym_energy = 0.999998) during training on 4×H100. Q projections in the same layers reach 99.5%. All other layers remain random (~0.500). Learnable SymMix (W + tanh(beta)*W^T) is loss-neutral (+0.0001 BPB) with betas converging to near zero.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review of "Non-record: Emergent weight symmetry in QO projections + learnable SymMix"

BPB: 0.0001 (cache parse; may be delta/std rather than val_bpb, check PR title) | Compliance: LOOKS CLEAN (pure-neural submission, no TTT/SLOT/n-gram-cache)

What I found in the code (head SHA …): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06 s, dim=512, layers=11, vocab=1024, code=104892 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16 MB artifact cap, ≤600 s train + ≤600 s eval on 8×H100 SXM). No compliance flags from the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or hidden behind a non-standard function name, please flag it and I will re-run the audit manually.

Reviewed by @MatoTeziTanka (The Agora).
Summary
Research finding: During full training of the PR#1019 SOTA architecture on 4×H100, the O projections in layers 6-8 converge to exact symmetry (W = W^T, sym_energy = 0.999998). The Q projections in the same layers reach 99.5% symmetry. All other layers remain random (~0.500).
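For concreteness, here is one way a sym_energy metric could be computed (a minimal sketch; the exact definition is our assumption, chosen so that a random square matrix scores ~0.5 and an exactly symmetric one scores 1.0):

```python
import torch

def sym_energy(w: torch.Tensor) -> float:
    """Fraction of squared Frobenius norm carried by the symmetric part of a square W.

    Roughly 0.5 for a random matrix, 1.0 for W == W.T.
    """
    w_sym = 0.5 * (w + w.t())
    return (w_sym.pow(2).sum() / w.pow(2).sum()).item()
```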
We also test learnable SymMix (`W_eff = W + tanh(beta) * W^T`), which is effectively loss-neutral (+0.0001 BPB), with the learned betas converging to near zero; a minimal sketch of such a layer follows.
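A PyTorch sketch of what a SymMix projection could look like (the module name, the square-matrix assumption, and the zero-init of beta are ours, not the PR's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymMixLinear(nn.Module):
    """Hypothetical SymMix projection: W_eff = W + tanh(beta) * W^T."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.weight)
        # beta starts at 0, so tanh(beta) = 0 and the layer begins as a plain Linear.
        self.beta = nn.Parameter(torch.zeros(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_eff = self.weight + torch.tanh(self.beta) * self.weight.t()
        return F.linear(x, w_eff)
```

With betas learned to near zero, W_eff stays close to W, which is consistent with the reported loss-neutral result.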
Key Results
Symmetry Pattern
Sharp bimodal pattern — L6-8 O projections are symmetric to machine precision. No other weight matrices (KV, MLP, encoder QO) show any symmetry. Cartan decomposition confirms: L6-8 are 99.5-100% symmetric-traceless with <0.5% antisymmetric component. All other layers are 50/50.
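A sketch of the kind of Cartan-style split referenced above (our own formulation: decompose W into antisymmetric, symmetric-traceless, and trace parts; the three energy fractions sum to 1 because the subspaces are orthogonal under the Frobenius inner product):

```python
import torch

def cartan_energy_split(w: torch.Tensor) -> dict:
    """Energy fractions of the antisymmetric, symmetric-traceless, and trace parts of a square W."""
    n = w.shape[0]
    anti = 0.5 * (w - w.t())
    sym = 0.5 * (w + w.t())
    trace_part = torch.eye(n, device=w.device) * (torch.trace(sym) / n)
    sym_traceless = sym - trace_part
    total = w.pow(2).sum()
    return {
        "antisymmetric": (anti.pow(2).sum() / total).item(),
        "symmetric_traceless": (sym_traceless.pow(2).sum() / total).item(),
        "trace": (trace_part.pow(2).sum() / total).item(),
    }
```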
Practical Application
Force-symmetrizing the L6-8 QO projections post-training with `(W + W.T) / 2` before quantization saves ~623 KB of artifact size at a cost of only +0.006 BPB, because these weights are already near-symmetric and the constraint eliminates residual noise that wastes quantization bits. Six lines of post-training code before your quantization step:
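The PR's exact six lines aren't reproduced here; a minimal sketch of the same post-training step, where the module path and attribute names (`blocks[i].attn.q_proj` / `o_proj`) are assumptions about the model's layout:

```python
import torch

SYM_LAYERS = (6, 7, 8)  # decoder layers whose Q/O projections converge to symmetry

@torch.no_grad()
def symmetrize_qo(model) -> None:
    """Replace the Q and O projection weights of the symmetric layers with (W + W.T) / 2."""
    for i in SYM_LAYERS:
        attn = model.blocks[i].attn              # assumed module path
        for name in ("q_proj", "o_proj"):        # assumed attribute names
            w = getattr(attn, name).weight
            w.copy_(0.5 * (w + w.t()))
```

Run this once before quantization/export; since the targeted weights are already near-symmetric, the projection changes them very little.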
This frees ~600 KB in the 16MB artifact budget that can be reinvested into other improvements.
Additional Structural Analysis
We also checked for block-diagonal structure, band structure, circulant patterns, and Cartan decomposition across all weight matrices. No other exploitable structure was found — the bimodal symmetry in L6-8 QO is the one structural feature this model develops during training.
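As an illustration of the kind of check run here, a minimal sketch of a bandedness probe (our own formulation; the block-diagonal and circulant checks follow the same energy-fraction pattern):

```python
import torch

def band_energy(w: torch.Tensor, bandwidth: int) -> float:
    """Fraction of squared Frobenius norm lying within `bandwidth` of the main diagonal."""
    rows = torch.arange(w.shape[0]).unsqueeze(1)
    cols = torch.arange(w.shape[1]).unsqueeze(0)
    mask = (rows - cols).abs() <= bandwidth
    return (w[mask].pow(2).sum() / w.pow(2).sum()).item()
```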
Test plan
A/B comparison toggled via environment variable (`SYMMIX_ENABLED=0` for baseline, `SYMMIX_ENABLED=1` for experiment)

🤖 Generated with Claude Code