
Record submission: Int6 + MLP 3x + Flash Attention 3 + NorMuon, val_bpb = 1.1532 #173

Open
tamoghnokandar wants to merge 1 commit into openai:main from tamoghnokandar:main

Conversation

@tamoghnokandar

This PR builds directly on the prior PR #114 and improves it further by replacing Muon with NorMuon and switching the attention path to FlashAttention 3.

val_bpb = 1.1532 (best seed), 1.1546 mean over 3 seeds.

On the three completed seeds in this PR:

  • Seed 7: val_bpb = 1.1532, val_loss = 1.9471
  • Seed 42: val_bpb = 1.1542, val_loss = 1.9488
  • Seed 1337: val_bpb = 1.1563, val_loss = 1.9524

Mean over seeds 7 / 42 / 1337:

  • val_bpb = 1.1546
  • val_loss = 1.9495

Artifact size remains within budget at about 15.96 MB. Training still uses the 10-minute wallclock cap on 8x H100, with sliding-window evaluation at stride 256.

What's New

  1. NorMuon replaces Muon
    This keeps the same overall optimizer split but swaps the optimizer to NorMuon. In this setup, NorMuon gave a modest but repeatable improvement over the previous Muon-based version (a hedged sketch of the update style follows this list).

  2. FlashAttention 3 replaces the prior attention path
    The model now uses the FA3 kernel directly for the attention mechanism. This keeps the same architecture and evaluation setup, but improves the training/runtime path on H100s.

  3. Multi-seed validation
    The previous README highlighted a single best result plus older seed runs. This PR updates the result summary to the new three-seed set for this NorMuon + FA3 variant: seeds 7, 42, and 1337.
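
As a concrete reference for item 1, here is a minimal sketch of a NorMuon-style update: Muon's Newton-Schulz orthogonalization of the momentum followed by a per-row (neuron-wise) adaptive normalization. The function names, buffer layout, and hyperparameters below are illustrative assumptions, not the exact optimizer code in this PR.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix (Muon-style quintic iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)            # bring the norm under 1 before iterating
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def normuon_style_step(p, momentum, row_second_moment,
                       lr=0.02, beta=0.95, beta2=0.95, eps=1e-8):
    """One hedged NorMuon-style step for a 2-D weight `p` (sketch only)."""
    momentum.mul_(beta).add_(p.grad)
    update = newton_schulz_orthogonalize(momentum)
    # Neuron-wise (per output row) second moment, then normalize each row by it.
    row_second_moment.mul_(beta2).add_(update.pow(2).mean(dim=1, keepdim=True),
                                       alpha=1 - beta2)
    update = update / (row_second_moment.sqrt() + eps)
    p.add_(update, alpha=-lr)
```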

Approach

This submission keeps the main structure from the previous PR:

  • Int6 post-training quantization with per-row scaling (sketched after this list)
  • MLP hidden size increased from 1024 -> 1536
  • tied embedding kept in fp16
  • final-layer c_k.weight passthrough retained in fp16
  • train at seq_len=2048
  • sliding-window eval at eval_seq_len=2048, stride=256 (also sketched after this list)
  • GRAD_CLIP_NORM=0.3 stabilizes long-sequence training
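
For the first bullet, a minimal sketch of symmetric Int6 per-row quantization, assuming one fp16 scale per output row; the PR's actual rounding and bit-packing of the artifact may differ.

```python
import torch

def quantize_int6_per_row(w: torch.Tensor):
    """Symmetric 6-bit quantization with one scale per row (sketch, not the PR's packer)."""
    qmax = 31                                               # signed 6-bit range [-31, 31]
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale.half()                                  # a real Int6 artifact would bit-pack q

def dequantize_int6_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale.float()
```

For the stride-256 sliding-window eval: each window gives the model up to eval_seq_len tokens of left context, but only the tokens not covered by earlier windows are scored, so every token is scored exactly once. The loop below is an illustrative sketch; it returns mean NLL in nats per token, and val_bpb would additionally divide the total log-loss in bits by the byte count of the eval text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_val_loss(model, tokens, eval_seq_len=2048, stride=256):
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    n = tokens.numel()
    for begin in range(0, n - 1, stride):
        end = min(begin + eval_seq_len, n - 1)
        n_new = end - prev_end                    # tokens not scored by earlier windows
        inputs = tokens[begin:end].unsqueeze(0)
        targets = tokens[begin + 1:end + 1].unsqueeze(0)
        logits = model(inputs)                    # (1, T, vocab)
        nll_sum += F.cross_entropy(logits[0, -n_new:], targets[0, -n_new:],
                                   reduction="sum").item()
        n_scored += n_new
        prev_end = end
        if end == n - 1:
            break
    return nll_sum / n_scored
```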

The key change here is not a new quantization scheme or architecture jump, but a cleaner training/runtime stack:

  • NorMuon for optimizer updates
  • FlashAttention 3 for the attention kernel (a hedged call sketch follows below)
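
As a reference for the FA3 path, a minimal call sketch is below. It assumes the Hopper FA3 build that exposes flash_attn_interface.flash_attn_func; the exact import path and return signature vary across FA3 releases, so this is illustrative rather than the PR's attention module.

```python
import torch
# Assumption: FA3 (Hopper build) is installed and importable as flash_attn_interface.
from flash_attn_interface import flash_attn_func

def fa3_causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, n_heads, head_dim), fp16/bf16 tensors on an H100."""
    out = flash_attn_func(q, k, v, causal=True)
    if isinstance(out, tuple):        # some FA3 releases also return the softmax LSE
        out = out[0]
    return out                        # (batch, seq_len, n_heads, head_dim)
```

A torch.nn.functional.scaled_dot_product_attention fallback when the FA3 import is unavailable would keep the same module importable on CPU-only machines (see the compliance note below).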

Training

Training was done on 8 H100 GPUs via Modal; the Modal launch script is attached as well. A hedged sketch of that kind of launcher follows.
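
This is a minimal sketch of an 8x H100 Modal launcher, assuming a recent Modal client; the app name, image contents, and timeout are placeholders rather than the attached script.

```python
# launch_modal.py (illustrative; run with `modal run launch_modal.py`)
import modal

app = modal.App("int6-mlp3x-fa3-normuon")

image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install("torch", "numpy")                      # plus FA3 / optimizer deps in practice
    .add_local_file("train_gpt.py", "/root/train_gpt.py")
)

@app.function(gpu="H100:8", image=image, timeout=60 * 60)
def train():
    import subprocess
    # 8-way DDP launch; the 10-minute wallclock cap is enforced inside train_gpt.py.
    subprocess.run(
        ["torchrun", "--standalone", "--nproc_per_node=8", "/root/train_gpt.py"],
        check=True,
    )

@app.local_entrypoint()
def main():
    train.remote()
```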

@MatoTeziTanka

MatoTeziTanka commented Apr 11, 2026

Community Review — Record submission : Int6 + MLP 3x + Flash Attention 3 + NorMuon, val_bpb = 1.1532

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

ModuleNotFoundError: No module named 'kernels'

This matches a few common patterns I've seen for this class of error in the 2026-04-11 sweep.

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.
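
For reference, the usual fix for this class of failure is to guard the GPU-only import so the module still imports on a CPU-only box. A minimal sketch is below; the fallback behaviour is up to the author, and `kernels` is simply the module named in the traceback.

```python
try:
    import kernels  # GPU-only dependency, absent on the CPU audit image
except ModuleNotFoundError:
    # Fall back to a torch-native path (e.g. F.scaled_dot_product_attention),
    # or raise a clear error only when the GPU code path is actually used.
    kernels = None
```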


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL (ModuleNotFoundError: No module named 'kernels'). Classification via the classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

