Record submission : Int6 + MLP 3x + Flash Attention 3 + NorMuon, val_bpb = 1.1532 #173

tamoghnokandar wants to merge 1 commit into openai:main
Conversation
Community Review — Record submission : Int6 + MLP 3x + Flash Attention 3 + NorMuon, val_bpb = 1.1532

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with ModuleNotFoundError: No module named 'kernels'. A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

Recommendation: Could you run

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags have been identified yet, because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'kernels'. Classification via |
This PR builds directly on the prior PR #114 and improves it further by replacing Muon with NorMuon and switching the attention path to FlashAttention 3.
val_bpb = 1.1532 (best seed), 1.1546 mean over 3 seeds.
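For reference, val_bpb and val_loss are related through the dataset's tokens-to-bytes ratio (that ratio is dataset-specific and not stated in this PR). A minimal conversion helper, as a sketch:

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    # mean next-token cross-entropy is in nats per token; convert to
    # bits (divide by ln 2) and rescale from per-token to per-byte
    return mean_loss_nats * (n_tokens / n_bytes) / math.log(2)
```

For example, with exactly one token per byte, a loss of ln 2 nats per token is 1.0 bit per byte.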
On the three completed seeds in this PR:

- val_bpb = 1.1532, val_loss = 1.9471
- val_bpb = 1.1542, val_loss = 1.9488
- val_bpb = 1.1563, val_loss = 1.9524

Mean over seeds 7 / 42 / 1337:
- val_bpb = 1.1546
- val_loss = 1.9495

Artifact size remains within budget at about 15.96 MB. Training still uses the 10-minute wallclock cap on 8x H100, with sliding-window evaluation at stride 256.

What's New
NorMuon replaces Muon
This keeps the same overall optimizer split but swaps the optimizer to NorMuon. In this setup, NorMuon gave a modest but repeatable improvement over the previous Muon-based version.
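For intuition, a NorMuon-style step can be sketched as Muon's Newton-Schulz orthogonalization followed by a neuron-wise (per-output-row) second-moment rescaling. The quintic coefficients below are the ones commonly used for Muon; normuon_like_update is a hypothetical simplification for illustration, not this submission's actual implementation:

```python
import numpy as np

def orthogonalize_ns(G, steps=5):
    # quintic Newton-Schulz iteration (as used by Muon): drives the
    # singular values of G toward 1 while preserving its singular vectors
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def normuon_like_update(G, v, beta2=0.95, eps=1e-8):
    # hypothetical NorMuon-flavored step: orthogonalize the gradient,
    # then normalize each output neuron (row) by a running second moment
    O = orthogonalize_ns(G)
    v = beta2 * v + (1 - beta2) * np.mean(O ** 2, axis=1)
    return O / (np.sqrt(v)[:, None] + eps), v
```

The per-row normalization is the piece that distinguishes the NorMuon idea from plain Muon; everything else is unchanged.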
FlashAttention 3 replaces the prior attention path
The model now uses the FA3 kernel directly for the attention mechanism. This keeps the same architecture and evaluation setup, but improves the training/runtime path on H100s.
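A minimal sketch of such an attention wrapper, assuming the flash_attn_interface import path that recent FlashAttention 3 builds expose (treat the exact module name and return convention as assumptions), with a PyTorch SDPA fallback for CPU or non-FA3 environments:

```python
import torch
import torch.nn.functional as F

try:
    # assumption: FA3 kernels are importable under this module name
    from flash_attn_interface import flash_attn_func
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention(q, k, v, causal=True):
    # q, k, v: (batch, seq, heads, head_dim), FA3's native layout
    if HAS_FA3 and q.is_cuda:
        out = flash_attn_func(q, k, v, causal=causal)
        # some versions return (out, lse); keep only the output tensor
        return out[0] if isinstance(out, tuple) else out
    # fallback via PyTorch SDPA, which expects (batch, heads, seq, head_dim)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return out.transpose(1, 2)
```

Since both paths take the same layout and arguments, the rest of the model does not need to know which kernel is active.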
Multi-seed validation
The previous README highlighted a single best result plus older seed runs. This PR updates the result summary to the actual new 3-seed set for this NorMuon + FA3 variant: seeds 7, 42, and 1337.

Approach
This submission keeps the main structure from the previous PR:

- 1024 -> 1536
- c_k.weight passthrough retained in fp16
- seq_len=2048
- eval_seq_len=2048, stride=256

The key change here is not a new quantization scheme or architecture jump, but a cleaner training/runtime stack:
Training
Training was done on 8x H100 GPUs using Modal; the Modal training script is attached as well.
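The sliding-window evaluation mentioned above (eval_seq_len=2048, stride=256) can be sketched as a strided pass in which each target token is charged exactly once, with up to a full window of left context. Here nll_fn is a hypothetical scorer returning one negative log-likelihood per target position in the window; this is an illustration of the scheme, not the submission's eval code:

```python
import numpy as np

def strided_eval(nll_fn, tokens, seq_len=2048, stride=256):
    # Slide a window of `seq_len` tokens in steps of `stride`. The first
    # window scores every target it covers; each later window scores only
    # the targets not already counted, so no token is double-charged.
    total_nll = 0.0
    n_scored = 0
    prev_end = 1  # targets before this index have already been scored
    for start in range(0, len(tokens), stride):
        end = min(start + seq_len, len(tokens))
        # nll_fn returns one NLL per target position start+1 .. end-1
        window_nll = nll_fn(tokens[start:end])
        keep_from = prev_end - (start + 1)
        total_nll += float(np.sum(window_nll[keep_from:]))
        n_scored += len(window_nll) - keep_from
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / n_scored  # mean NLL in nats per token
```

A larger stride is cheaper but gives later tokens in each window less effective context; stride 256 against a 2048 window keeps most tokens' context close to the full window.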