
Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)#312

Open
chanwoo-park-official wants to merge 3 commits into openai:main from chanwoo-park-official:feat/canon-fastconv-acd-report

Conversation

@chanwoo-park-official

Summary

This PR reports a standalone run with Canon ACD (CANON_SET=ACD, CANON_KERNEL=3) plus mixed int6 quantization (INT6_CATEGORIES=mlp,attn).

Approach

  • Model: 9-layer decoder-only Transformer, model_dim=512, num_heads=8, num_kv_heads=4, mlp_mult=3.0
  • MLP: ReLU-squared style MLP (repo default)
  • Context extras: Bigram hash embedding (bigram_vocab_size=2048, bigram_dim=128) + SmearGate
  • Quantization: mixed PTQ, mlp/attn=int6, other large tensors int8
  • Optimizer: mixed Muon + Adam
  • Schedule: momentum warmup (0.92 -> 0.99), warmdown (WARMDOWN_ITERS=3000), SWA near end
  • Eval: both roundtrip and sliding-window (EVAL_STRIDE=64); sliding bpb is main comparison
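
The sliding-window eval scores each token exactly once, but with up to EVAL_SEQ_LEN − EVAL_STRIDE tokens of left context. A minimal sketch of that bookkeeping, assuming an illustrative `token_nll` callable (its name and signature are not the repo's API):

```python
import math

def sliding_window_nll(token_nll, n_tokens, seq_len=2048, stride=64):
    """Sum NLL (nats) over a document, scoring each token exactly once.

    token_nll(ctx_start, score_start, end) is an assumed callable that
    returns the summed NLL of tokens [score_start, end) conditioned on
    context [ctx_start, end). Mirrors EVAL_SEQ_LEN / EVAL_STRIDE.
    """
    total, prev_end = 0.0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        total += token_nll(begin, prev_end, end)  # score only new tokens
        prev_end = end
        if end == n_tokens:
            break
    return total

def nats_to_bpb(total_nll_nats, n_bytes):
    """bits-per-byte: convert summed nats to bits, divide by byte count."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

With stride 64 every window after the first contributes exactly 64 newly scored tokens, each seeing 1984 tokens of reused context.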

Canon Placement

  • A: before attention
  • B: on concatenated QKV (most expensive)
  • C: before MLP
  • D: in widened MLP hidden stream
  • This run uses ACD (keeps Canon effect while avoiding B cost)
  • Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
    Zeyuan Allen-Zhu (2025), full version: https://ssrn.com/abstract=5240330
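
The operation the Canon flags describe (CANON_KERNEL=3, CANON_RESIDUAL=1, CANON_ACTIVATION=0, CANON_BIAS=0) is a causal depthwise convolution with a residual connection. A NumPy sketch of that operation, not the repo's actual module:

```python
import numpy as np

def canon_layer(x, w):
    """Canon-style causal depthwise conv with residual (a sketch).

    x: (seq, dim) activations; w: (kernel, dim) per-channel taps, with
    w[-1] applied to the current position. No bias, no activation,
    matching CANON_BIAS=0 and CANON_ACTIVATION=0.
    """
    k, d = w.shape
    pad = np.zeros((k - 1, d), dtype=x.dtype)
    xp = np.concatenate([pad, x], axis=0)   # left-pad so conv stays causal
    conv = sum(w[i] * xp[i:i + len(x)] for i in range(k))
    return x + conv                          # CANON_RESIDUAL=1
```

Each channel mixes a token with its two predecessors (K=3), which is why placing it at positions A, C, and D adds local context at negligible cost compared with B on the concatenated QKV.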

Config Highlights

  • torchrun --nproc_per_node=8
  • TRAIN_BATCH_TOKENS=524288, TRAIN_SEQ_LEN=2048
  • EVAL_SEQ_LEN=2048, EVAL_STRIDE=64, EVAL_BATCH_SEQS=32
  • MATRIX_LR=0.025, SCALAR_LR=0.025, TIED_EMBED_LR=0.035
  • MUON_WEIGHT_DECAY=0.04, ADAM_WEIGHT_DECAY=0.04
  • SWA_ENABLED=1, SWA_EVERY=200, SWA_START_LRMUL=0.5
  • ITERATIONS=7200, WARMUP_STEPS=20, WARMDOWN_ITERS=3000, MAX_WALLCLOCK_SECONDS=600
  • VOCAB_SIZE=1024, SEED=1337
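
For the INT6_CATEGORIES path, a symmetric per-tensor post-training quantizer is the simplest consistent reading of "mixed PTQ, mlp/attn=int6"; the repo may well use per-channel scales instead, so treat this as a sketch:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor PTQ to 6-bit (sketch, not the repo's code).

    Signed 6-bit covers [-32, 31]; a symmetric grid uses [-31, 31] so
    that +max and -max quantize to mirrored codes.
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step (scale / 2), which is what makes the final int6 sliding-window bpb comparable to the float eval.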

Results

  • final_int6_sliding_window val_bpb (stride=64): 1.16682362
  • Serialized int6 model: 13,196,032 bytes
  • Code size (train_gpt.py): 71,315 bytes
  • Total submission size: 13,267,347 bytes (<16MB)
  • SWA checkpoints averaged: 8
  • Data loading overhead: data_loading_step_avg=0.64ms
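
The "SWA checkpoints averaged: 8" line corresponds to a running equal-weight average of checkpoints taken every SWA_EVERY=200 steps near the end of training. A minimal sketch (dict-of-lists parameters are illustrative; a real run averages tensors):

```python
def swa_update(avg, params, n_seen):
    """Fold one checkpoint into a running equal-weight average (SWA).

    avg, params: dicts mapping name -> list of floats; n_seen is the
    number of checkpoints already folded in (0 on the first call).
    """
    if n_seen == 0:
        return {k: list(v) for k, v in params.items()}
    return {k: [a + (p - a) / (n_seen + 1) for a, p in zip(avg[k], v)]
            for k, v in params.items()}
```

The incremental form avoids keeping all 8 checkpoints in memory at once.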

Repro

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
env \
  RUN_ID=frontier_canon_acd_k3_8gpu \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 SEED=1337 \
  TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=2048 \
  EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 EVAL_BATCH_SEQS=32 \
  ITERATIONS=7200 WARMUP_STEPS=20 WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 \
  MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
  MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
  MUON_WEIGHT_DECAY=0.04 ADAM_WEIGHT_DECAY=0.04 \
  SWA_ENABLED=1 SWA_EVERY=200 SWA_START_LRMUL=0.5 \
  INT6_CATEGORIES=mlp,attn \
  CANON_SET=ACD CANON_KERNEL=3 CANON_RESIDUAL=1 CANON_ACTIVATION=0 CANON_BIAS=0 \
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
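
The MUON_MOMENTUM_WARMUP_* variables in the command above describe a momentum ramp from 0.92 to 0.99 over 1500 steps. Assuming a linear ramp (the repo may use a different schedule), the logic is:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup matching MUON_MOMENTUM_WARMUP_START/STEPS above.

    Linear interpolation is an assumption; the actual ramp shape in
    train_gpt.py may differ.
    """
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```

Starting Muon at lower momentum keeps the orthogonalized updates from overshooting while the loss landscape is still changing quickly.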

@chanwoo-park-official changed the title from Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668) to Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668) on Mar 21, 2026

MatoTeziTanka commented Apr 11, 2026

[RETRACTED 2026-04-11] — This IMPORT_FAIL was a false positive. Root cause: runner fetched a path marked deleted in the PR diff. Your code is not broken. See correction below: #312 (comment)


Community Review — Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1)

This matches a class of parse-time errors I saw repeatedly in the 2026-04-11 sweep.

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1). Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

@MatoTeziTanka

Retraction — this IMPORT_FAIL was a deleted-file artifact in my smoke runner

Sorry @chanwoo-park-official, this one's on me. I re-audited the SyntaxError (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0 I reported above and it was a false positive — the fault is in my smoke runner, not in your code.

What happened:

Your PR deletes 5 old records/*/train_gpt.py paths while editing a different file, and my bulk smoke runner iterated over the diff's file list and fetched one of the paths already marked for deletion. The raw GitHub content endpoint returned either a binary stub or a non-UTF-8 response, my runner tried to import it as Python source, and that produced the byte 0x9e in position 0 error. The error was about the deleted file, not the train_gpt.py you're actually submitting.

Verified at head ba8b7c8:

The real train_gpt.py you're editing parses cleanly under Python 3.10:

py_compile.compile('train_gpt.py') → PARSES OK
86197 bytes

Your PR is not broken by this error. I'm retracting the IMPORT_FAIL classification. I'll re-queue the full compliance audit and post findings separately.

Again — sorry for the noise. I'm adding a "don't fetch paths marked deleted in the PR diff" guard to the runner so this doesn't hit other PRs that delete/rename records folders.

