
Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline#908

Open
albertorkive wants to merge 2 commits into openai:main from albertorkive:higher-rank-heads-study

Conversation


@albertorkive commented Mar 26, 2026

Summary

This PR adds a non-record study of higher-rank output heads on a fixed frontier-aligned 11L baseline.

Tested family (minimal code sketches after this list):

  • Factorized heads (low-rank bottleneck: d_model → rank → vocab)
  • Mixture-of-Softmaxes heads (K gated expert projections)
  • Simplex head (softmax bottleneck before output projection)
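
For concreteness, here is a minimal PyTorch sketch of the three families at the study's d_model=512 / vocab=1024 shapes. The class names, the tanh activation in the MoS expert path, and the bias-free projections are illustrative assumptions, not the study's actual trainer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedHead(nn.Module):
    """Low-rank bottleneck: d_model -> rank -> vocab."""
    def __init__(self, d_model=512, vocab=1024, rank=64):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, vocab, bias=False)

    def forward(self, h):                        # h: (B, T, d_model)
        return self.up(self.down(h))             # logits: (B, T, vocab)

class MoSHead(nn.Module):
    """Mixture-of-Softmaxes: K gated expert projections (Yang et al. 2017)."""
    def __init__(self, d_model=512, vocab=1024, K=4, rank=64):
        super().__init__()
        self.K, self.rank = K, rank
        self.gate = nn.Linear(d_model, K, bias=False)
        self.experts = nn.Linear(d_model, K * rank, bias=False)
        self.out = nn.Linear(rank, vocab, bias=False)

    def forward(self, h):                        # h: (B, T, d_model)
        B, T, _ = h.shape
        pi = F.softmax(self.gate(h), dim=-1)     # (B, T, K) mixture weights
        z = torch.tanh(self.experts(h)).view(B, T, self.K, self.rank)
        p = F.softmax(self.out(z), dim=-1)       # (B, T, K, vocab) per-expert distributions
        mix = (pi.unsqueeze(-1) * p).sum(dim=2)  # convex mixture of softmaxes
        return torch.log(mix + 1e-9)             # log-probs in place of raw logits

class SimplexHead(nn.Module):
    """Softmax bottleneck before the output projection."""
    def __init__(self, d_model=512, vocab=1024, bottleneck=128):
        super().__init__()
        self.to_simplex = nn.Linear(d_model, bottleneck, bias=False)
        self.out = nn.Linear(bottleneck, vocab, bias=False)

    def forward(self, h):                          # h: (B, T, d_model)
        s = F.softmax(self.to_simplex(h), dim=-1)  # points on the probability simplex
        return self.out(s)                         # each logit is a convex combination of out-weight entries
```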

Main result:

  • The standard tied head was best
  • Every tested higher-rank variant underperformed, often severely
  • Mixture variants increased artifact size without improving score
  • The simplex head reduced artifact size substantially but collapsed score

This is a clean negative result on a strong baseline and should be useful for anyone considering output-head expressivity as the next frontier lever in this budget regime.

Results

| Run | Variant            | val_bpb | Steps | Artifact |
|-----|--------------------|---------|-------|----------|
| H0  | standard tied head | 1.1734  | 4415  | 16.83 MB |
| H1  | factorized r=64    | 2.4396  | 4451  | 16.73 MB |
| H2  | factorized r=128   | 1.9227  | 4425  | 16.92 MB |
| H3  | MoS K=2 r=64       | 2.6167  | 4428  | 16.57 MB |
| H4  | MoS K=4 r=64       | 2.7112  | 4149  | 17.17 MB |
| H5  | MoS K=4 r=128      | 2.0898  | 4160  | 17.94 MB |
| H6  | simplex 128        | 4.1069  | 4241  | 10.95 MB |

Why These All Fail

The factorized and mixture heads are designed to break the softmax rank bottleneck (Yang et al. 2017) — the idea that a single d_model → vocab projection can't represent the full rank of natural language.

At this scale (d=512, vocab=1024), the bottleneck doesn't bind. The vocabulary is small enough that the tied embedding matrix already has rank min(512, 1024) = 512, which is sufficient. The extra parameters in mixture/factorized heads add noise and artifact size without solving a real capacity problem.
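
A quick numeric illustration of that rank argument (my own sketch, not part of the PR): the logit matrix produced by any single linear projection is H @ W.T, whose rank can never exceed d_model, and a random 1024x512 tied embedding already saturates that bound.

```python
import torch

d_model, vocab, n_ctx = 512, 1024, 2048
H = torch.randn(n_ctx, d_model)          # hidden states for n_ctx contexts
W = torch.randn(vocab, d_model)          # tied embedding / unembedding matrix
logits = H @ W.T                         # (n_ctx, vocab)

print(torch.linalg.matrix_rank(W))       # 512: the embedding is already full rank
print(torch.linalg.matrix_rank(logits))  # <= 512 for any single linear head
```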

The simplex head is a different failure mode: forcing a probability simplex before the output projection destroys the model's ability to produce sharp logit distributions.
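
To make that concrete, a small illustrative experiment (again a sketch, not the PR's code): because the simplex vector is non-negative and sums to one, every logit the head can emit is a convex combination of the output projection's weights, so the achievable logit range, and with it softmax sharpness, is capped by the weight magnitudes rather than by the hidden-state norm.

```python
import torch
import torch.nn.functional as F

bottleneck, vocab = 128, 1024
W_out = torch.randn(bottleneck, vocab)

# Even near-one-hot simplex inputs cannot push a logit past the extremes of W_out.
s = F.softmax(torch.randn(10_000, bottleneck) * 100, dim=-1)
logits = s @ W_out

print(logits.abs().max())   # never exceeds...
print(W_out.abs().max())    # ...the largest weight magnitude (a few units at randn init)
```

A standard tied head's logits scale with the hidden-state norm, so it can sharpen distributions freely; the simplex head can only do so by growing the output weights themselves.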

Setup

  • 11L/512d fixed baseline
  • EMA, XSA4, SmearGate, BigramHash, partial RoPE, LN Scale, VE128
  • seq2048, sliding eval, Late QAT
  • Hopper FA3, compiled training, real quantization/artifact path
  • 8×H100, 600s wallclock

Because this family sweep ran on the full fast path with quantization, there is no separate confirmatory matrix. The family sweep itself is the authoritative result set.

Reproduction

bash records/track_non_record_16mb/2026-03-26_HigherRankHeads_11L_Study/run_higher_rank_heads_study.sh

Prepare a non-record study of higher-rank output heads on a frontier 11L baseline.

- add the study folder under records/track_non_record_16mb
- include the full 7-variant family sweep on the fast frontier-aligned stack
- include raw JSONL, auto-generated summary, and exact per-run logs
- include a self-contained study-local trainer and full-family reproduction runner
- document the negative result: the standard tied head outperformed all tested higher-rank alternatives
- remove internal notes and unsupported throughput claims
- align the README, summary, and reproduction notes with the files actually included
- frame the study as a clean fast-path negative result on a frontier baseline
@MatoTeziTanka

Community Review — Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #908 — HigherRankHeads 11L Study (non-record track). Head SHA: c233e73

### What This PR Is

A non-record study in records/track_non_record_16mb/2026-03-26_HigherRankHeads_11L_Study/. It runs 7 head-architecture variants (standard tied, factorized r=64/128, mixture-softmax K=2/4 r=64/128, simplex 128) on a fixed 11L baseline, all with SKIP_QUANT=0 and USE_TTT=0 (default). No score is submitted — every variant performed worse than the H0 control (best: H0 at 1.1734 BPB).

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAN. get_bigram_hash at line 1715 hashes only input context tokens (x = input_ids). XOR is applied between adjacent context pairs: rand_int_1 * x[..., 1:] XOR rand_int_2 * x[..., :-1]. Target labels (y) are never passed to this function. The call site (line 2004) confirms get_bigram_hash(input_ids, ...), not targets. No BigramHash family bug present.

### Check 2: ILLEGAL pre-quant TTT (multi-epoch val_tokens without score-first)

NOT PRESENT / NOT TRIGGERED. apply_full_weight_ttt exists at line 509 and is a multi-epoch SGD loop over val_tokens (lines 557–587). However, USE_TTT defaults to 0 (line 219), and the run script (run_higher_rank_heads_study.sh) never sets USE_TTT=1. TTT is fully disabled in all 7 submitted runs. The function code itself does contain the pre-quant TTT pattern (train on val before quant at line 2787), but since the flag is off it is not executed.

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

Not applicable — TTT is disabled.

### Check 4: HOLD scored-region SLOT

No scored-region SLOT manipulation detected. The PR is non-record; there is no submitted score artifact.

### ...
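For readers auditing similar submissions, a minimal sketch of the legal hashing pattern described in Check 1 above. This is a hypothetical reconstruction: the constants, bucket count, and signature are assumptions, not the PR's actual train_gpt.py code.

```python
import torch

def get_bigram_hash(input_ids: torch.Tensor,
                    num_buckets: int = 65536,
                    rand_int_1: int = 2654435761,
                    rand_int_2: int = 40503) -> torch.Tensor:
    """Hash adjacent context-token pairs into bucket ids (no target labels involved)."""
    x = input_ids                                            # (B, T) context tokens only
    mixed = (rand_int_1 * x[..., 1:]) ^ (rand_int_2 * x[..., :-1])
    return mixed % num_buckets                               # (B, T-1) bucket ids
```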

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

