Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb) #266
User123331 wants to merge 1 commit into openai:main
Conversation
First experiment applying Mixture of Softmax (Yang et al., 2018) to the baseline 9×512 architecture. Uses a low-rank factorization (rank = 64) to keep the parameter overhead minimal (~99K params, 97KB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
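For readers skimming the diff, here is a minimal sketch of what an MoS head with low-rank per-mixture projections can look like. This is a hypothetical reconstruction, not the code in `train_gpt.py`: the actual gating, factorization, and weight tying may differ (counted naively, this sketch adds ≈132K params at d=512, K=2, r=64 — a 2×512 gate plus two 512×64/64×512 factor pairs — somewhat above the ~99K quoted above, so the PR's factorization is presumably slimmer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSHead(nn.Module):
    """Mixture-of-Softmax head (Yang et al., 2018): mixes K softmaxes in
    probability space so the log-prob matrix is not capped at rank d_model."""

    def __init__(self, d_model: int = 512, vocab_size: int = 50304,
                 k: int = 2, rank: int = 64):
        super().__init__()
        # Mixture-weight gate pi_k(h): k * d_model parameters.
        self.gate = nn.Linear(d_model, k, bias=False)
        # Low-rank maps h -> h_k = tanh(h @ A_k @ B_k): k * 2 * d_model * rank.
        self.A = nn.Parameter(torch.randn(k, d_model, rank) * d_model ** -0.5)
        self.B = nn.Parameter(torch.randn(k, rank, d_model) * rank ** -0.5)
        # Shared vocabulary projection (typically tied to the token embedding).
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:    # h: (B, T, D)
        pi = F.softmax(self.gate(h), dim=-1)                # (B, T, K)
        hk = torch.tanh(torch.einsum("btd,kdr,kre->btke", h, self.A, self.B))
        probs = F.softmax(self.lm_head(hk), dim=-1)         # (B, T, K, V)
        # Mix AFTER the softmax; mixing logits instead would collapse back
        # into a single softmax and reintroduce the rank bottleneck.
        mixed = (pi.unsqueeze(-1) * probs).sum(dim=-2)      # (B, T, V)
        return torch.log(mixed.clamp_min(1e-9))             # log-probs
```

Since the head emits log-probabilities, the training loss switches from `F.cross_entropy` on raw logits to `F.nll_loss(logp.view(-1, logp.size(-1)), targets.view(-1))`.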
Community Review — Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb)

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

```
CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL —
SyntaxError('invalid decimal literal', ('/workspace/bulk_smoke/pr_266/train_gpt.py', 1490, 21, '| 0 NVIDIA H100 80GB HB…
```

The offending text at line 1490, column 21 begins with "| 0 NVIDIA H100 80GB HB…", which matches a pattern I've seen repeatedly for this class of error in the 2026-04-11 sweep: terminal output (here, what looks like an nvidia-smi table row) accidentally committed into the Python source.

Recommendation: Could you run a local parse check on train_gpt.py and remove the stray text? Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.
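One way to reproduce the audit's import step locally, assuming the script sits at `./train_gpt.py` — the actual smoke-test harness isn't shown in this PR, so the invocation below is a stand-in:

```python
# Hypothetical stand-in for the smoke test's import step; parsing alone
# is enough to surface the SyntaxError without executing the script.
import ast
import sys

source = open("train_gpt.py", encoding="utf-8").read()
try:
    ast.parse(source, filename="train_gpt.py")
except SyntaxError as exc:
    # exc.lineno / exc.offset should point at the stray table row (1490, 21).
    print(f"IMPORT_FAIL — {exc!r}", file=sys.stderr)
    sys.exit(1)
print("parse OK — import step should pass")
```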
Summary
Non-record submission exploring Mixture of Softmax (Yang et al., 2018) as a technique to break the softmax bottleneck in the baseline 9×512 architecture.
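For context, the bottleneck can be stated in one line: a single softmax writes every row of the $N \times V$ log-probability matrix as $h_c^{\top} W$ plus a per-row constant, capping its rank at $d+1$. MoS sidesteps this by mixing $K$ softmaxes in probability space (notation as in Yang et al., 2018):

```latex
P(x \mid c) = \sum_{k=1}^{K} \pi_{c,k}\,
  \frac{\exp\!\big(h_{c,k}^{\top} w_x\big)}
       {\sum_{x'} \exp\!\big(h_{c,k}^{\top} w_{x'}\big)},
\qquad
\sum_{k=1}^{K} \pi_{c,k} = 1,
\qquad
h_{c,k} = \tanh(W_k h_c)
```

Because the log of this mixture is not a low-rank function of $h_c$, the rank cap no longer applies; here the $W_k$ are realized through the rank-64 factorization to keep the overhead small.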
Results

Final: 1.3932 bpb after the 10-minute budget on 1×H100 (non-record).
Training Curve
Key Takeaways
Included Files
- `train_gpt.py` — Full training script with the MoS implementation
- `train.log` — Complete training output
- `submission.json` — Structured metadata
- `README.md` — Run details

References

Yang, Z., Dai, Z., Salakhutdinov, R., & Cohen, W. W. (2018). Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. ICLR 2018. arXiv:1711.03953.
🤖 Generated with Claude Code