
Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb) #266

Open
User123331 wants to merge 1 commit into openai:main from User123331:non-record-mos-k2-r64-pilot

Conversation

@User123331

Summary

Non-record submission exploring Mixture of Softmax (Yang et al., 2018) as a technique to break the softmax bottleneck in the baseline 9×512 architecture.

  • Approach: Replace the standard tied-embedding softmax with a K=2 mixture of softmaxes, using low-rank factorization (rank=64) to keep parameter overhead minimal (~99K extra params, ~97KB); a minimal sketch of the head follows this list.
  • Motivation: At vocab=1024 and dim=512, the standard softmax output has rank ≤ 513. MoS K=2 lifts the theoretical rank to ≤ 1026, covering the full vocabulary dimensionality.
  • Hardware: 1× H100 SXM, 10-minute wallclock cap.
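For readers unfamiliar with the technique, here is a minimal PyTorch sketch of what a K=2, rank-64 MoS head over tied embeddings can look like. This is an illustration of the idea, not the code in train_gpt.py; the class name, layout, and tanh nonlinearity are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxHead(nn.Module):
    """Illustrative K-softmax output head with a low-rank context projection.

    dim -> rank -> K*dim produces K context vectors per position; each is scored
    against the (tied) embedding matrix and the K softmaxes are mixed with
    per-position weights. Layout is an assumption, not the PR's exact code.
    """
    def __init__(self, dim=512, vocab=1024, rank=64, k=2):
        super().__init__()
        self.k, self.dim = k, dim
        self.down = nn.Linear(dim, rank, bias=False)     # dim -> rank
        self.up = nn.Linear(rank, k * dim, bias=False)   # rank -> K*dim
        self.prior = nn.Linear(dim, k, bias=False)       # mixture weights
        self.embed = nn.Embedding(vocab, dim)            # tied with the input embedding in practice

    def forward(self, h):                                # h: (B, T, dim)
        B, T, _ = h.shape
        ctx = torch.tanh(self.up(self.down(h))).view(B, T, self.k, self.dim)
        logits = ctx @ self.embed.weight.t()             # (B, T, K, vocab)
        log_probs = F.log_softmax(logits, dim=-1)
        log_prior = F.log_softmax(self.prior(h), dim=-1) # (B, T, K)
        # Mix probabilities, not logits: log sum_k pi_k * softmax_k(...)
        return torch.logsumexp(log_prior.unsqueeze(-1) + log_probs, dim=2)

# MixtureOfSoftmaxHead()(torch.randn(1, 8, 512)).shape -> torch.Size([1, 8, 1024]) log-probs
```

The mixing happens in probability space rather than logit space, which is what lifts the rank bound of the log-probability matrix above d+1.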

Results

| Metric | Value |
| --- | --- |
| Post-quant val_bpb | 1.3932 |
| Pre-quant val_bpb | 1.3921 |
| Quant degradation | +0.0011 bpb |
| Steps completed | 1113 / 20000 |
| Model params | 17,159,240 |
| Artifact (int8+zlib) | 12.8 MB (3.2 MB under cap) |
| Step avg | 539 ms/step |
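The artifact-size and degradation rows come from the int8+zlib roundtrip. As a rough sketch of what such a roundtrip involves (symmetric per-tensor int8 quantization is assumed here; the actual eval harness may quantize and package differently):

```python
import zlib
import numpy as np
import torch

def int8_zlib_roundtrip(state_dict):
    """Quantize each tensor to int8, report the zlib-compressed size,
    and return dequantized weights for a post-quant eval pass."""
    blobs, restored = [], {}
    for name, t in state_dict.items():
        w = t.detach().float().cpu().numpy()
        scale = max(np.abs(w).max() / 127.0, 1e-12)        # symmetric per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blobs.append(q.tobytes())
        restored[name] = torch.from_numpy(q.astype(np.float32) * scale).reshape(t.shape)
    size_mb = len(zlib.compress(b"".join(blobs), level=9)) / 1e6
    return size_mb, restored   # load `restored` back into the model, re-run val_bpb
```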

Training Curve

| Step | Train Loss | Val BPB | Time |
| --- | --- | --- | --- |
| 0 | 6.93 | 4.11 | 0s |
| 100 | 3.27 | | 54s |
| 500 | 2.58 | 1.52 | 271s |
| 1000 | 2.40 | 1.40 | 542s |
| 1113 | | 1.39 | 600s |

Key Takeaways

  1. MoS adds negligible artifact overhead. Low-rank factorization (dim→64→K×dim) keeps the cost at ~97KB, well within the 16MB budget with 3.2MB to spare (see the back-of-envelope parameter count after this list).
  2. Quantization is well-behaved. Only +0.0011 bpb degradation from int8+zlib roundtrip, suggesting MoS weights are quantization-friendly.
  3. Loss was still dropping at wallclock stop. The model reached only 1113 steps on 1×H100 — substantially undertrained relative to the 8×H100 baseline (~20K steps). A longer run or more GPUs would give a fairer comparison.
  4. No TTT/LoRA evaluation was performed — only the standard int8 roundtrip eval. Combining MoS with test-time training is an open question.
  5. The softmax bottleneck (rank ≤ d+1) is theoretically more severe with richer upstream representations (e.g., SmearGate, BigramHash, wider MLP). MoS may yield larger gains when stacked on top of the current SOTA architecture rather than the vanilla baseline.
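The ~99K-parameter / ~97KB figure is consistent with the dim→64→K×dim factorization. A back-of-envelope count, assuming the extra parameters are just the down/up projections plus the K-way prior and that the artifact stores them at 1 byte per parameter (int8); the exact breakdown in train_gpt.py may differ:

```python
dim, rank, k = 512, 64, 2
down  = dim * rank         # 512 * 64        = 32,768
up    = rank * (k * dim)   # 64  * (2 * 512) = 65,536
prior = dim * k            # 512 * 2         =  1,024
extra = down + up + prior  # 99,328 extra params (~99K)
print(extra, extra / 1024) # ~97 KB at 1 byte per parameter (int8)
```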

Included Files

  • train_gpt.py — Full training script with MoS implementation
  • train.log — Complete training output
  • submission.json — Structured metadata
  • README.md — Run details

References

  • Yang, Z. et al. (2018). "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model." ICLR 2018.
  • Godey, N. et al. (2024). "Why do small language models underperform? Studying LM Saturation via the Softmax Bottleneck."

🤖 Generated with Claude Code

First experiment applying Mixture of Softmax (Yang et al., 2018) to
the baseline 9x512 architecture. Uses low-rank factorization (rank=64)
to keep parameter overhead minimal (~99K params, 97KB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MatoTeziTanka commented Apr 11, 2026

Community Review — Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

```
SyntaxError('invalid decimal literal', ('/workspace/bulk_smoke/pr_266/train_gpt.py', 1490, 21, '| 0 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |', 1490, 21))
```

A common pattern behind this class of error in the 2026-04-11 sweep is non-Python text (training logs, nvidia-smi tables) accidentally pasted into the script; the offending line 1490 here looks like an nvidia-smi table row rather than Python source.

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.
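A slightly stricter variant of that check, assuming Python 3.10 and the stdlib py_compile module (doraise=True makes the call raise instead of only printing to stderr, so a pre-submission script exits nonzero on a parse error):

```python
import sys
import py_compile

try:
    py_compile.compile("train_gpt.py", doraise=True)
except py_compile.PyCompileError as e:
    print(e.msg, file=sys.stderr)  # includes the offending line and column
    sys.exit(1)
print("train_gpt.py parses cleanly on Python", sys.version.split()[0])
```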

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError('invalid decimal literal', ('/workspace/bulk_smoke/pr_266/train_gpt.py', 1490, 21, '| 0 NVIDIA H100 80GB HB…. Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.
