
Non-record: Emergent weight symmetry in QO projections + learnable SymMix #1214

Open
gersh wants to merge 1 commit into openai:main from gersh:emergent-qo-symmetry

Conversation


@gersh gersh commented Apr 1, 2026

Summary

Research finding: During full training of the PR#1019 SOTA architecture on 4×H100, the layer 6-8 O projections converge to near-exact symmetry (W ≈ W^T, sym_energy = 0.999998). Q projections in the same layers reach 99.5% symmetry. All other layers remain at the random baseline (~0.500).

We also test a learnable SymMix (W_eff = W + tanh(beta) * W^T), which is effectively loss-neutral (+0.0001 BPB) with the learned betas converging to near-zero.
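For reference, here is a minimal sketch of what the SymMix parameterization could look like as a PyTorch module. The class and attribute names are illustrative only, not the actual train_gpt.py code:

import torch
import torch.nn as nn

class SymMixLinear(nn.Module):
    """Square projection with a learnable symmetry mix: W_eff = W + tanh(beta) * W^T."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.beta = nn.Parameter(torch.zeros(()))  # tanh(0) = 0, so training starts at W_eff = W

    def forward(self, x):
        w_eff = self.weight + torch.tanh(self.beta) * self.weight.T
        return x @ w_eff.T

A beta that stays near zero, as reported above, keeps W_eff close to the plain W, which is consistent with the loss-neutral result.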

Key Results

Metric                 Baseline     SymMix       Delta
val_bpb                1.1687       1.1688       +0.0001
Artifact (int6+lzma)   14,817 KB    14,799 KB    -18 KB

Symmetry Pattern

Layer  Q sym_energy   O sym_energy
L0-5   ~0.500         ~0.500        (random)
L6     0.9955         0.999998      (near-perfect)
L7     0.9955         0.999998      (near-perfect)
L8     0.9955         0.999998      (near-perfect)
L9-10  ~0.500         ~0.500        (random)

Sharp bimodal pattern — L6-8 O projections are symmetric to machine precision. No other weight matrices (KV, MLP, encoder QO) show any symmetry. Cartan decomposition confirms: L6-8 are 99.5-100% symmetric-traceless with <0.5% antisymmetric component. All other layers are 50/50.
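For clarity, a sketch of the diagnostic, assuming sym_energy is the fraction of squared Frobenius norm carried by the symmetric part of W (about 0.5 for a random matrix, 1.0 for an exactly symmetric one). The function name is illustrative, and this is a simplified two-part split rather than the full Cartan decomposition with a separate trace component:

import torch

def sym_energy(w: torch.Tensor) -> float:
    # Split W into symmetric and antisymmetric parts; the cross term vanishes,
    # so the two squared norms sum to ||W||_F^2.
    sym = (w + w.T) / 2
    antisym = (w - w.T) / 2
    total = (sym ** 2).sum() + (antisym ** 2).sum()
    return ((sym ** 2).sum() / total).item()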

Practical Application

Force-symmetrizing L6-8 QO projections post-training with (W + W.T) / 2 before quantization saves ~623 KB artifact size at a cost of only +0.006 BPB, because these weights are already near-symmetric and the constraint eliminates residual noise that wastes quantization bits.

A few lines of post-training code before the quantization step:

# Symmetrize the Q and O projection weights for layers 6-8 in the shared qo_bank
# (Q weights live at index i, O weights at index n + i).
n = model.num_layers
for i in [6, 7, 8]:
    for idx in [i, n + i]:  # Q and O projections for layer i
        w = model.qo_bank.data[idx]
        model.qo_bank.data[idx] = (w + w.T) / 2

This frees ~600 KB in the 16MB artifact budget that can be reinvested into other improvements.
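A rough back-of-the-envelope check on that budget line, assuming dim=512 (per the smoke test below), six affected matrices (L6-8 Q and O), and int6 quantization as in the results table:

dim, n_matrices, bits = 512, 6, 6
raw_kb = n_matrices * dim * dim * bits / 8 / 1024  # ~1152 KB of quantized L6-8 QO weights
# A perfectly symmetric matrix carries roughly half that information, so on the
# order of 576 KB is redundant, the same ballpark as the observed ~623 KB saving.
print(f"raw: {raw_kb:.0f} KB, redundant half: {raw_kb / 2:.0f} KB")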

Additional Structural Analysis

We also checked for block-diagonal structure, band structure, circulant patterns, and Cartan decomposition across all weight matrices. No other exploitable structure was found — the bimodal symmetry in L6-8 QO is the one structural feature this model develops during training.
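As an illustration of the kind of probe involved, here is a hedged sketch of a band-structure check (illustrative only, not the exact analysis script):

import torch

def band_energy(w: torch.Tensor, bandwidth: int) -> float:
    # Fraction of squared Frobenius norm that lies within `bandwidth` of the diagonal;
    # a value near 1.0 would indicate exploitable band structure.
    n = w.shape[0]
    idx = torch.arange(n, device=w.device)
    mask = ((idx[:, None] - idx[None, :]).abs() <= bandwidth).to(w.dtype)
    return (((w * mask) ** 2).sum() / (w ** 2).sum()).item()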

Test plan

  • Script compiles and runs (SYMMIX_ENABLED=0 for baseline, SYMMIX_ENABLED=1 for experiment)
  • Tested on 4×H100 SXM with seed 1337
  • Artifact under 16MB (14,799 KB)
  • Symmetry analysis reproduced on trained checkpoint
  • Post-training force-symmetry tested: -623 KB, +0.006 BPB

🤖 Generated with Claude Code

…mMix

Layers 6-8 O projections converge to exact symmetry (W=W^T, sym_energy=0.999998)
during training on 4xH100. Q projections in same layers reach 99.5%. All other
layers remain random (~0.500). Learnable SymMix (W + tanh(beta)*W^T) is
loss-neutral (+0.0001 BPB) with betas converging to near-zero.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Emergent weight symmetry in QO projections + learnable SymMix

BPB: 0.0001 (cache parse — may be delta/std, not val_bpb; check PR title) | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 3f45f230a29e, file records/track_non_record_16mb/2026-04-01_Emergent_QO_Symmetry_SymMix/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=104892 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka / The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=104892 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

