
Record submission: Int6 + MLP 3x + Flash Attention 3 + NorMuon, val_bpb = 1.1532 #173

Open
tamoghnokandar wants to merge 1 commit into openai:main from tamoghnokandar:main

Conversation

@tamoghnokandar

This PR builds directly on the prior PR #114 and improves it further by replacing Muon with NorMuon and switching the attention path to FlashAttention 3.

val_bpb = 1.1532 (best seed), 1.1546 mean over 3 seeds.

On the three completed seeds in this PR:

  • Seed 7: val_bpb = 1.1532, val_loss = 1.9471
  • Seed 42: val_bpb = 1.1542, val_loss = 1.9488
  • Seed 1337: val_bpb = 1.1563, val_loss = 1.9524

Mean over seeds 7 / 42 / 1337:

  • val_bpb = 1.1546
  • val_loss = 1.9495

Artifact size remains within budget at about 15.96 MB. Training still uses the 10-minute wallclock cap on 8x H100, with sliding-window evaluation at stride 256.

What's New

  1. NorMuon replaces Muon
    This keeps the same overall optimizer split but swaps the optimizer to NorMuon. In this setup, NorMuon gave a modest but repeatable improvement over the previous Muon-based version (a hedged sketch of the update style follows this list).

  2. FlashAttention 3 replaces the prior attention path
    The model now uses the FA3 kernel directly for the attention mechanism. This keeps the same architecture and evaluation setup, but improves the training/runtime path on H100s.

  3. Multi-seed validation
    The previous README highlighted a single best result plus older seed runs. This PR updates the result summary to the new three-seed set for this NorMuon + FA3 variant: seeds 7, 42, and 1337.
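
As a concrete reference for item 1, here is a minimal sketch of a NorMuon-style update: Muon's Newton-Schulz orthogonalization of the momentum followed by a per-row (neuron-wise) adaptive normalization. The function names, buffer layout, and hyperparameters below are illustrative assumptions, not the exact optimizer code in this PR.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix (Muon-style quintic iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)            # bring the norm under 1 before iterating
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def normuon_style_step(p, momentum, row_second_moment,
                       lr=0.02, beta=0.95, beta2=0.95, eps=1e-8):
    """One hedged NorMuon-style step for a 2-D weight `p` (sketch only)."""
    momentum.mul_(beta).add_(p.grad)
    update = newton_schulz_orthogonalize(momentum)
    # Neuron-wise (per output row) second moment, then normalize each row by it.
    row_second_moment.mul_(beta2).add_(update.pow(2).mean(dim=1, keepdim=True),
                                       alpha=1 - beta2)
    update = update / (row_second_moment.sqrt() + eps)
    p.add_(update, alpha=-lr)
```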

Approach

This submission keeps the main structure from the previous PR:

  • Int6 post-training quantization with per-row scaling (sketched after this list)
  • MLP hidden size increased from 1024 -> 1536
  • tied embedding kept in fp16
  • final-layer c_k.weight passthrough retained in fp16
  • train at seq_len=2048
  • sliding-window eval at eval_seq_len=2048, stride=256 (also sketched after this list)
  • GRAD_CLIP_NORM=0.3 stabilizes long-sequence training
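
For the first bullet, a minimal sketch of symmetric Int6 per-row quantization, assuming one fp16 scale per output row; the PR's actual rounding and bit-packing of the artifact may differ.

```python
import torch

def quantize_int6_per_row(w: torch.Tensor):
    """Symmetric 6-bit quantization with one scale per row (sketch, not the PR's packer)."""
    qmax = 31                                               # signed 6-bit range [-31, 31]
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale.half()                                  # a real Int6 artifact would bit-pack q

def dequantize_int6_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale.float()
```

For the stride-256 sliding-window eval: each window gives the model up to eval_seq_len tokens of left context, but only the tokens not covered by earlier windows are scored, so every token is scored exactly once. The loop below is an illustrative sketch; it returns mean NLL in nats per token, and val_bpb would additionally divide the total log-loss in bits by the byte count of the eval text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_val_loss(model, tokens, eval_seq_len=2048, stride=256):
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    n = tokens.numel()
    for begin in range(0, n - 1, stride):
        end = min(begin + eval_seq_len, n - 1)
        n_new = end - prev_end                    # tokens not scored by earlier windows
        inputs = tokens[begin:end].unsqueeze(0)
        targets = tokens[begin + 1:end + 1].unsqueeze(0)
        logits = model(inputs)                    # (1, T, vocab)
        nll_sum += F.cross_entropy(logits[0, -n_new:], targets[0, -n_new:],
                                   reduction="sum").item()
        n_scored += n_new
        prev_end = end
        if end == n - 1:
            break
    return nll_sum / n_scored
```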

The key change here is not a new quantization scheme or architecture jump, but a cleaner training/runtime stack:

  • NorMuon for optimizer updates
  • FlashAttention 3 for the attention kernel (a hedged call sketch follows below)
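
As a reference for the FA3 path, a minimal call sketch is below. It assumes the Hopper FA3 build that exposes flash_attn_interface.flash_attn_func; the exact import path and return signature vary across FA3 releases, so this is illustrative rather than the PR's attention module.

```python
import torch
# Assumption: FA3 (Hopper build) is installed and importable as flash_attn_interface.
from flash_attn_interface import flash_attn_func

def fa3_causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, n_heads, head_dim), fp16/bf16 tensors on an H100."""
    out = flash_attn_func(q, k, v, causal=True)
    if isinstance(out, tuple):        # some FA3 releases also return the softmax LSE
        out = out[0]
    return out                        # (batch, seq_len, n_heads, head_dim)
```

A torch.nn.functional.scaled_dot_product_attention fallback when the FA3 import is unavailable would keep the same module importable on CPU-only machines (see the compliance note below).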

Training

Training was done on 8 H100 GPUs via Modal; the Modal launch script is attached as well. A hedged sketch of that kind of launcher follows.
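
This is a minimal sketch of an 8x H100 Modal launcher, assuming a recent Modal client; the app name, image contents, and timeout are placeholders rather than the attached script.

```python
# launch_modal.py (illustrative; run with `modal run launch_modal.py`)
import modal

app = modal.App("int6-mlp3x-fa3-normuon")

image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install("torch", "numpy")                      # plus FA3 / optimizer deps in practice
    .add_local_file("train_gpt.py", "/root/train_gpt.py")
)

@app.function(gpu="H100:8", image=image, timeout=60 * 60)
def train():
    import subprocess
    # 8-way DDP launch; the 10-minute wallclock cap is enforced inside train_gpt.py.
    subprocess.run(
        ["torchrun", "--standalone", "--nproc_per_node=8", "/root/train_gpt.py"],
        check=True,
    )

@app.local_entrypoint()
def main():
    train.remote()
```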

@MatoTeziTanka

MatoTeziTanka commented Apr 11, 2026

Community Review — Record submission : Int6 + MLP 3x + Flash Attention 3 + NorMuon, val_bpb = 1.1532

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

ModuleNotFoundError: No module named 'kernels'

This matches a few common patterns I've seen for this class of error in the 2026-04-11 sweep.

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.
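
For reference, the usual fix for this class of failure is to guard the GPU-only import so the module still imports on a CPU-only box. A minimal sketch is below; the fallback behaviour is up to the author, and `kernels` is simply the module named in the traceback.

```python
try:
    import kernels  # GPU-only dependency, absent on the CPU audit image
except ModuleNotFoundError:
    # Fall back to a torch-native path (e.g. F.scaled_dot_product_attention),
    # or raise a clear error only when the GPU code path is actually used.
    kernels = None
```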


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL (ModuleNotFoundError: No module named 'kernels'). Classification via the classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

