Non-record: Emergent weight symmetry in QO projections + learnable SymMix #1214
gersh wants to merge 1 commit into openai:main
Conversation
…mMix: Layers 6-8 O projections converge to exact symmetry (W = W^T, sym_energy = 0.999998) during training on 4×H100. Q projections in the same layers reach 99.5%. All other layers remain random (~0.500). Learnable SymMix (W + tanh(beta)*W^T) is loss-neutral (+0.0001 BPB) with betas converging to near zero.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review of "Non-record: Emergent weight symmetry in QO projections + learnable SymMix"

BPB: 0.0001 (cache parse; may be delta/std rather than val_bpb, check PR title) | Compliance: LOOKS CLEAN (pure-neural submission, no TTT/SLOT/n-gram-cache)

What I found in the code (head SHA …): Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06 s, dim=512, layers=11, vocab=1024, code=104892 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16 MB artifact cap, ≤600 s train + ≤600 s eval on 8×H100 SXM). No compliance flags from the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or hidden behind a non-standard function name, please flag it and I will re-run the audit manually.

Reviewed by @MatoTeziTanka (The Agora).
Summary
Research finding: During full training of the PR#1019 SOTA architecture on 4×H100, the O projections in layers 6-8 converge to exact symmetry (W = W^T, sym_energy = 0.999998). The Q projections in the same layers reach 99.5% symmetry. All other layers remain random (~0.500).
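For concreteness, here is one way a sym_energy metric could be computed (a minimal sketch; the exact definition is our assumption, chosen so that a random square matrix scores ~0.5 and an exactly symmetric one scores 1.0):

```python
import torch

def sym_energy(w: torch.Tensor) -> float:
    """Fraction of squared Frobenius norm carried by the symmetric part of a square W.

    Roughly 0.5 for a random matrix, 1.0 for W == W.T.
    """
    w_sym = 0.5 * (w + w.t())
    return (w_sym.pow(2).sum() / w.pow(2).sum()).item()
```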
We also test learnable SymMix (`W_eff = W + tanh(beta) * W^T`), which is effectively loss-neutral (+0.0001 BPB), with the learned betas converging to near zero; a minimal sketch of such a layer follows.
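A PyTorch sketch of what a SymMix projection could look like (the module name, the square-matrix assumption, and the zero-init of beta are ours, not the PR's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymMixLinear(nn.Module):
    """Hypothetical SymMix projection: W_eff = W + tanh(beta) * W^T."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.weight)
        # beta starts at 0, so tanh(beta) = 0 and the layer begins as a plain Linear.
        self.beta = nn.Parameter(torch.zeros(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_eff = self.weight + torch.tanh(self.beta) * self.weight.t()
        return F.linear(x, w_eff)
```

With betas learned to near zero, W_eff stays close to W, which is consistent with the reported loss-neutral result.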
Key Results
Symmetry Pattern
Sharp bimodal pattern — L6-8 O projections are symmetric to machine precision. No other weight matrices (KV, MLP, encoder QO) show any symmetry. Cartan decomposition confirms: L6-8 are 99.5-100% symmetric-traceless with <0.5% antisymmetric component. All other layers are 50/50.
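A sketch of the kind of Cartan-style split referenced above (our own formulation: decompose W into antisymmetric, symmetric-traceless, and trace parts; the three energy fractions sum to 1 because the subspaces are orthogonal under the Frobenius inner product):

```python
import torch

def cartan_energy_split(w: torch.Tensor) -> dict:
    """Energy fractions of the antisymmetric, symmetric-traceless, and trace parts of a square W."""
    n = w.shape[0]
    anti = 0.5 * (w - w.t())
    sym = 0.5 * (w + w.t())
    trace_part = torch.eye(n, device=w.device) * (torch.trace(sym) / n)
    sym_traceless = sym - trace_part
    total = w.pow(2).sum()
    return {
        "antisymmetric": (anti.pow(2).sum() / total).item(),
        "symmetric_traceless": (sym_traceless.pow(2).sum() / total).item(),
        "trace": (trace_part.pow(2).sum() / total).item(),
    }
```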
Practical Application
Force-symmetrizing the L6-8 QO projections post-training with `(W + W.T) / 2` before quantization saves ~623 KB of artifact size at a cost of only +0.006 BPB, because these weights are already near-symmetric and the constraint eliminates residual noise that wastes quantization bits. Six lines of post-training code before your quantization step:
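The PR's exact six lines aren't reproduced here; a minimal sketch of the same post-training step, where the module path and attribute names (`blocks[i].attn.q_proj` / `o_proj`) are assumptions about the model's layout:

```python
import torch

SYM_LAYERS = (6, 7, 8)  # decoder layers whose Q/O projections converge to symmetry

@torch.no_grad()
def symmetrize_qo(model) -> None:
    """Replace the Q and O projection weights of the symmetric layers with (W + W.T) / 2."""
    for i in SYM_LAYERS:
        attn = model.blocks[i].attn              # assumed module path
        for name in ("q_proj", "o_proj"):        # assumed attribute names
            w = getattr(attn, name).weight
            w.copy_(0.5 * (w + w.t()))
```

Run this once before quantization/export; since the targeted weights are already near-symmetric, the projection changes them very little.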
This frees ~600 KB in the 16MB artifact budget that can be reinvested into other improvements.
Additional Structural Analysis
We also checked for block-diagonal structure, band structure, circulant patterns, and Cartan decomposition across all weight matrices. No other exploitable structure was found — the bimodal symmetry in L6-8 QO is the one structural feature this model develops during training.
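As an illustration of the kind of check run here, a minimal sketch of a bandedness probe (our own formulation; the block-diagonal and circulant checks follow the same energy-fraction pattern):

```python
import torch

def band_energy(w: torch.Tensor, bandwidth: int) -> float:
    """Fraction of squared Frobenius norm lying within `bandwidth` of the main diagonal."""
    rows = torch.arange(w.shape[0]).unsqueeze(1)
    cols = torch.arange(w.shape[1]).unsqueeze(0)
    mask = (rows - cols).abs() <= bandwidth
    return (w[mask].pow(2).sum() / w.pow(2).sum()).item()
```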
Test plan
A/B comparison toggled via environment variable (`SYMMIX_ENABLED=0` for baseline, `SYMMIX_ENABLED=1` for experiment)

🤖 Generated with Claude Code