
Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline#908

Open
albertorkive wants to merge 2 commits into openai:main from albertorkive:higher-rank-heads-study

Conversation


@albertorkive commented Mar 26, 2026

Summary

This PR adds a non-record study of higher-rank output heads on a fixed frontier-aligned 11L baseline.

Tested family (minimal code sketches after this list):

  • Factorized heads (low-rank bottleneck: d_model → rank → vocab)
  • Mixture-of-Softmaxes heads (K gated expert projections)
  • Simplex head (softmax bottleneck before output projection)
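
For concreteness, here is a minimal PyTorch sketch of the three families at the study's d_model=512 / vocab=1024 shapes. The class names, the tanh activation in the MoS expert path, and the bias-free projections are illustrative assumptions, not the study's actual trainer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedHead(nn.Module):
    """Low-rank bottleneck: d_model -> rank -> vocab."""
    def __init__(self, d_model=512, vocab=1024, rank=64):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, vocab, bias=False)

    def forward(self, h):                        # h: (B, T, d_model)
        return self.up(self.down(h))             # logits: (B, T, vocab)

class MoSHead(nn.Module):
    """Mixture-of-Softmaxes: K gated expert projections (Yang et al. 2017)."""
    def __init__(self, d_model=512, vocab=1024, K=4, rank=64):
        super().__init__()
        self.K, self.rank = K, rank
        self.gate = nn.Linear(d_model, K, bias=False)
        self.experts = nn.Linear(d_model, K * rank, bias=False)
        self.out = nn.Linear(rank, vocab, bias=False)

    def forward(self, h):                        # h: (B, T, d_model)
        B, T, _ = h.shape
        pi = F.softmax(self.gate(h), dim=-1)     # (B, T, K) mixture weights
        z = torch.tanh(self.experts(h)).view(B, T, self.K, self.rank)
        p = F.softmax(self.out(z), dim=-1)       # (B, T, K, vocab) per-expert distributions
        mix = (pi.unsqueeze(-1) * p).sum(dim=2)  # convex mixture of softmaxes
        return torch.log(mix + 1e-9)             # log-probs in place of raw logits

class SimplexHead(nn.Module):
    """Softmax bottleneck before the output projection."""
    def __init__(self, d_model=512, vocab=1024, bottleneck=128):
        super().__init__()
        self.to_simplex = nn.Linear(d_model, bottleneck, bias=False)
        self.out = nn.Linear(bottleneck, vocab, bias=False)

    def forward(self, h):                          # h: (B, T, d_model)
        s = F.softmax(self.to_simplex(h), dim=-1)  # points on the probability simplex
        return self.out(s)                         # each logit is a convex combination of out-weight entries
```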

Main result:

  • The standard tied head was best
  • Every tested higher-rank variant underperformed, often severely
  • Mixture variants increased artifact size without improving score
  • The simplex head reduced artifact size substantially but collapsed score

This is a clean negative result on a strong baseline and should be useful for anyone considering output-head expressivity as the next frontier lever in this budget regime.

Results

| Run | Variant            | val_bpb | Steps | Artifact |
|-----|--------------------|---------|-------|----------|
| H0  | standard tied head | 1.1734  | 4415  | 16.83 MB |
| H1  | factorized r=64    | 2.4396  | 4451  | 16.73 MB |
| H2  | factorized r=128   | 1.9227  | 4425  | 16.92 MB |
| H3  | MoS K=2 r=64       | 2.6167  | 4428  | 16.57 MB |
| H4  | MoS K=4 r=64       | 2.7112  | 4149  | 17.17 MB |
| H5  | MoS K=4 r=128      | 2.0898  | 4160  | 17.94 MB |
| H6  | simplex 128        | 4.1069  | 4241  | 10.95 MB |

Why These All Fail

The factorized and mixture heads are designed to break the softmax rank bottleneck (Yang et al. 2017) — the idea that a single d_model → vocab projection can't represent the full rank of natural language.

At this scale (d=512, vocab=1024), the bottleneck doesn't bind. The vocabulary is small enough that the tied embedding matrix already has rank min(512, 1024) = 512, which is sufficient. The extra parameters in mixture/factorized heads add noise and artifact size without solving a real capacity problem.
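
A quick numeric illustration of that rank argument (my own sketch, not part of the PR): the logit matrix produced by any single linear projection is H @ W.T, whose rank can never exceed d_model, and a random 1024x512 tied embedding already saturates that bound.

```python
import torch

d_model, vocab, n_ctx = 512, 1024, 2048
H = torch.randn(n_ctx, d_model)          # hidden states for n_ctx contexts
W = torch.randn(vocab, d_model)          # tied embedding / unembedding matrix
logits = H @ W.T                         # (n_ctx, vocab)

print(torch.linalg.matrix_rank(W))       # 512: the embedding is already full rank
print(torch.linalg.matrix_rank(logits))  # <= 512 for any single linear head
```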

The simplex head is a different failure mode: forcing a probability simplex before the output projection destroys the model's ability to produce sharp logit distributions.
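
To make that concrete, a small illustrative experiment (again a sketch, not the PR's code): because the simplex vector is non-negative and sums to one, every logit the head can emit is a convex combination of the output projection's weights, so the achievable logit range, and with it softmax sharpness, is capped by the weight magnitudes rather than by the hidden-state norm.

```python
import torch
import torch.nn.functional as F

bottleneck, vocab = 128, 1024
W_out = torch.randn(bottleneck, vocab)

# Even near-one-hot simplex inputs cannot push a logit past the extremes of W_out.
s = F.softmax(torch.randn(10_000, bottleneck) * 100, dim=-1)
logits = s @ W_out

print(logits.abs().max())   # never exceeds...
print(W_out.abs().max())    # ...the largest weight magnitude (a few units at randn init)
```

A standard tied head's logits scale with the hidden-state norm, so it can sharpen distributions freely; the simplex head can only do so by growing the output weights themselves.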

Setup

  • 11L/512d fixed baseline
  • EMA, XSA4, SmearGate, BigramHash, partial RoPE, LN Scale, VE128
  • seq2048, sliding eval, Late QAT
  • Hopper FA3, compiled training, real quantization/artifact path
  • 8×H100, 600s wallclock

Because this family sweep ran on the full fast path with quantization, there is no separate confirmatory matrix. The family sweep itself is the authoritative result set.

Reproduction

bash records/track_non_record_16mb/2026-03-26_HigherRankHeads_11L_Study/run_higher_rank_heads_study.sh

Prepare a non-record study of higher-rank output heads on a frontier 11L baseline.

- add the study folder under records/track_non_record_16mb
- include the full 7-variant family sweep on the fast frontier-aligned stack
- include raw JSONL, auto-generated summary, and exact per-run logs
- include a self-contained study-local trainer and full-family reproduction runner
- document the negative result: the standard tied head outperformed all tested higher-rank alternatives
- remove internal notes and unsupported throughput claims
- align the README, summary, and reproduction notes with the files actually included
- frame the study as a clean fast-path negative result on a frontier baseline
@MatoTeziTanka

Community Review — Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #908 — HigherRankHeads 11L Study (non-record track). Head SHA: c233e73

### What This PR Is

A non-record study in records/track_non_record_16mb/2026-03-26_HigherRankHeads_11L_Study/. It runs 7 head-architecture variants (standard tied, factorized r=64/128, mixture-softmax K=2/4 r=64/128, simplex 128) on a fixed 11L baseline, all with SKIP_QUANT=0 and USE_TTT=0 (default). No score is submitted — every variant performed worse than the H0 control (best: H0 at 1.1734 BPB).

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAN. get_bigram_hash at line 1715 hashes only input context tokens (x = input_ids). XOR is applied between adjacent context pairs: rand_int_1 * x[..., 1:] XOR rand_int_2 * x[..., :-1]. Target labels (y) are never passed to this function. The call site (line 2004) confirms get_bigram_hash(input_ids, ...), not targets. No BigramHash family bug present.

### Check 2: ILLEGAL pre-quant TTT (multi-epoch val_tokens without score-first)

NOT PRESENT / NOT TRIGGERED. apply_full_weight_ttt exists at line 509 and is a multi-epoch SGD loop over val_tokens (lines 557–587). However, USE_TTT defaults to 0 (line 219), and the run script (run_higher_rank_heads_study.sh) never sets USE_TTT=1. TTT is fully disabled in all 7 submitted runs. The function code itself does contain the pre-quant TTT pattern (train on val before quant at line 2787), but since the flag is off it is not executed.

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

Not applicable — TTT is disabled.

### Check 4: HOLD scored-region SLOT

No scored-region SLOT manipulation detected. The PR is non-record; there is no submitted score artifact.

### ...
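For readers auditing similar submissions, a minimal sketch of the legal hashing pattern described in Check 1 above. This is a hypothetical reconstruction: the constants, bucket count, and signature are assumptions, not the PR's actual train_gpt.py code.

```python
import torch

def get_bigram_hash(input_ids: torch.Tensor,
                    num_buckets: int = 65536,
                    rand_int_1: int = 2654435761,
                    rand_int_2: int = 40503) -> torch.Tensor:
    """Hash adjacent context-token pairs into bucket ids (no target labels involved)."""
    x = input_ids                                            # (B, T) context tokens only
    mixed = (rand_int_1 * x[..., 1:]) ^ (rand_int_2 * x[..., :-1])
    return mixed % num_buckets                               # (B, T-1) bucket ids
```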

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

