
Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)#312

Open
chanwoo-park-official wants to merge 3 commits into openai:main from chanwoo-park-official:feat/canon-fastconv-acd-report

Conversation

@chanwoo-park-official

Summary

This PR reports a standalone run with Canon ACD (CANON_SET=ACD, CANON_KERNEL=3) plus mixed int6 quantization (INT6_CATEGORIES=mlp,attn).

Approach

  • Model: 9-layer decoder-only Transformer, model_dim=512, num_heads=8, num_kv_heads=4, mlp_mult=3.0
  • MLP: ReLU-squared style MLP (repo default)
  • Context extras: Bigram hash embedding (bigram_vocab_size=2048, bigram_dim=128) + SmearGate
  • Quantization: mixed PTQ, mlp/attn=int6, other large tensors int8
  • Optimizer: mixed Muon + Adam
  • Schedule: momentum warmup (0.92 -> 0.99), warmdown (WARMDOWN_ITERS=3000), SWA near end
  • Eval: both roundtrip and sliding-window (EVAL_STRIDE=64); sliding bpb is main comparison
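
The sliding-window eval scores each token exactly once, but with up to EVAL_SEQ_LEN − EVAL_STRIDE tokens of left context. A minimal sketch of that bookkeeping, assuming an illustrative `token_nll` callable (its name and signature are not the repo's API):

```python
import math

def sliding_window_nll(token_nll, n_tokens, seq_len=2048, stride=64):
    """Sum NLL (nats) over a document, scoring each token exactly once.

    token_nll(ctx_start, score_start, end) is an assumed callable that
    returns the summed NLL of tokens [score_start, end) conditioned on
    context [ctx_start, end). Mirrors EVAL_SEQ_LEN / EVAL_STRIDE.
    """
    total, prev_end = 0.0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        total += token_nll(begin, prev_end, end)  # score only new tokens
        prev_end = end
        if end == n_tokens:
            break
    return total

def nats_to_bpb(total_nll_nats, n_bytes):
    """bits-per-byte: convert summed nats to bits, divide by byte count."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

With stride 64 every window after the first contributes exactly 64 newly scored tokens, each seeing 1984 tokens of reused context.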

Canon Placement

  • A: before attention
  • B: on concatenated QKV (most expensive)
  • C: before MLP
  • D: in widened MLP hidden stream
  • This run uses ACD (keeps Canon effect while avoiding B cost)
  • Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
    Zeyuan Allen-Zhu (2025), full version: https://ssrn.com/abstract=5240330
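
The operation the Canon flags describe (CANON_KERNEL=3, CANON_RESIDUAL=1, CANON_ACTIVATION=0, CANON_BIAS=0) is a causal depthwise convolution with a residual connection. A NumPy sketch of that operation, not the repo's actual module:

```python
import numpy as np

def canon_layer(x, w):
    """Canon-style causal depthwise conv with residual (a sketch).

    x: (seq, dim) activations; w: (kernel, dim) per-channel taps, with
    w[-1] applied to the current position. No bias, no activation,
    matching CANON_BIAS=0 and CANON_ACTIVATION=0.
    """
    k, d = w.shape
    pad = np.zeros((k - 1, d), dtype=x.dtype)
    xp = np.concatenate([pad, x], axis=0)   # left-pad so conv stays causal
    conv = sum(w[i] * xp[i:i + len(x)] for i in range(k))
    return x + conv                          # CANON_RESIDUAL=1
```

Each channel mixes a token with its two predecessors (K=3), which is why placing it at positions A, C, and D adds local context at negligible cost compared with B on the concatenated QKV.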

Config Highlights

  • torchrun --nproc_per_node=8
  • TRAIN_BATCH_TOKENS=524288, TRAIN_SEQ_LEN=2048
  • EVAL_SEQ_LEN=2048, EVAL_STRIDE=64, EVAL_BATCH_SEQS=32
  • MATRIX_LR=0.025, SCALAR_LR=0.025, TIED_EMBED_LR=0.035
  • MUON_WEIGHT_DECAY=0.04, ADAM_WEIGHT_DECAY=0.04
  • SWA_ENABLED=1, SWA_EVERY=200, SWA_START_LRMUL=0.5
  • ITERATIONS=7200, WARMUP_STEPS=20, WARMDOWN_ITERS=3000, MAX_WALLCLOCK_SECONDS=600
  • VOCAB_SIZE=1024, SEED=1337
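
For the INT6_CATEGORIES path, a symmetric per-tensor post-training quantizer is the simplest consistent reading of "mixed PTQ, mlp/attn=int6"; the repo may well use per-channel scales instead, so treat this as a sketch:

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor PTQ to 6-bit (sketch, not the repo's code).

    Signed 6-bit covers [-32, 31]; a symmetric grid uses [-31, 31] so
    that +max and -max quantize to mirrored codes.
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step (scale / 2), which is what makes the final int6 sliding-window bpb comparable to the float eval.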

Results

  • final_int6_sliding_window val_bpb (stride=64): 1.16682362
  • Serialized int6 model: 13,196,032 bytes
  • Code size (train_gpt.py): 71,315 bytes
  • Total submission size: 13,267,347 bytes (<16MB)
  • SWA checkpoints averaged: 8
  • Data loading overhead: data_loading_step_avg=0.64ms
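
The "SWA checkpoints averaged: 8" line corresponds to a running equal-weight average of checkpoints taken every SWA_EVERY=200 steps near the end of training. A minimal sketch (dict-of-lists parameters are illustrative; a real run averages tensors):

```python
def swa_update(avg, params, n_seen):
    """Fold one checkpoint into a running equal-weight average (SWA).

    avg, params: dicts mapping name -> list of floats; n_seen is the
    number of checkpoints already folded in (0 on the first call).
    """
    if n_seen == 0:
        return {k: list(v) for k, v in params.items()}
    return {k: [a + (p - a) / (n_seen + 1) for a, p in zip(avg[k], v)]
            for k, v in params.items()}
```

The incremental form avoids keeping all 8 checkpoints in memory at once.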

Repro

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
env \
  RUN_ID=frontier_canon_acd_k3_8gpu \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 SEED=1337 \
  TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=2048 \
  EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 EVAL_BATCH_SEQS=32 \
  ITERATIONS=7200 WARMUP_STEPS=20 WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 \
  MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
  MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
  MUON_WEIGHT_DECAY=0.04 ADAM_WEIGHT_DECAY=0.04 \
  SWA_ENABLED=1 SWA_EVERY=200 SWA_START_LRMUL=0.5 \
  INT6_CATEGORIES=mlp,attn \
  CANON_SET=ACD CANON_KERNEL=3 CANON_RESIDUAL=1 CANON_ACTIVATION=0 CANON_BIAS=0 \
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
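
The MUON_MOMENTUM_WARMUP_* variables in the command above describe a momentum ramp from 0.92 to 0.99 over 1500 steps. Assuming a linear ramp (the repo may use a different schedule), the logic is:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup matching MUON_MOMENTUM_WARMUP_START/STEPS above.

    Linear interpolation is an assumption; the actual ramp shape in
    train_gpt.py may differ.
    """
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```

Starting Muon at lower momentum keeps the orthogonalized updates from overshooting while the loss landscape is still changing quickly.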

@chanwoo-park-official changed the title from Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668) to Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668) on Mar 21, 2026

MatoTeziTanka commented Apr 11, 2026

[RETRACTED 2026-04-11] — This IMPORT_FAIL was a false positive. Root cause: runner fetched a path marked deleted in the PR diff. Your code is not broken. See correction below: #312 (comment)


Community Review — Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1)

This matches a class of parse-time errors I saw repeatedly in the 2026-04-11 sweep.

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1). Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

@MatoTeziTanka

Retraction — this IMPORT_FAIL was a deleted-file artifact in my smoke runner

Sorry @chanwoo-park-official, this one's on me. I re-audited the SyntaxError (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0 I reported above and it was a false positive — the fault is in my smoke runner, not in your code.

What happened:

Your PR deletes 5 old records/*/train_gpt.py paths while editing a different file, and my bulk smoke runner iterated over the diff's file list and fetched one of the paths already marked for deletion. The raw GitHub content endpoint returned either a binary stub or a non-UTF-8 response, my runner tried to import it as Python source, and that produced the byte 0x9e in position 0 error. The error was about the deleted file, not the train_gpt.py you're actually submitting.

Verified at head ba8b7c8:

The real train_gpt.py you're editing parses cleanly under Python 3.10:

py_compile.compile('train_gpt.py') → PARSES OK
86197 bytes

Your PR is not broken by this error. I'm retracting the IMPORT_FAIL classification. I'll re-queue the full compliance audit and post findings separately.

Again — sorry for the noise. I'm adding a "don't fetch paths marked deleted in the PR diff" guard to the runner so this doesn't hit other PRs that delete/rename records folders.

