Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668) #312
**Community Review — Record: Int6 + Canon ACD (K=3) + Muon WD 0.04 + SWA + Sliding Eval (val_bpb=1.1668)**

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with the error quoted below. A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:
Recommendation: Could you run a local parse/import check? Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet, because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1).
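A local pre-check in this spirit could mirror the runner's classification order (UTF-8 decode first, then Python parse). This is my sketch, not the actual smoke runner; `parse_check` is a hypothetical helper:

```python
import ast


def parse_check(path: str) -> str:
    """Classify a source file the way an import-step smoke check would:
    decode as UTF-8 first, then parse as Python. Returns "OK" or a tag."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        # a leading 0x9e byte fails here, before the parser ever runs
        text = raw.decode("utf-8")
    except UnicodeDecodeError as e:
        return f"IMPORT_FAIL (unicode error): {e.reason}"
    try:
        ast.parse(text, filename=path)
    except SyntaxError as e:
        return f"IMPORT_FAIL (syntax error at line {e.lineno})"
    return "OK"
```

A file beginning with byte `0x9e` is classified as a unicode error rather than a genuine syntax error, which matches the message seen in the smoke run.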
**Retraction — this IMPORT_FAIL was a deleted-file artifact in my smoke runner**

Sorry @chanwoo-park-official, this one's on me. I re-audited the failure.

What happened: your PR deletes 5 old records files, and my smoke runner fetched one of those deleted paths anyway; the bytes it got back are what tripped the decode error at the import step. Verified at head that the files actually in the PR parse and import cleanly. The real fault was in my runner, not your code.

Your PR is not broken by this error. I'm retracting the IMPORT_FAIL classification. I'll re-queue the full compliance audit and post findings separately. Again — sorry for the noise. I'm adding a "don't fetch paths marked deleted in the PR diff" guard to the runner so this doesn't hit other PRs that delete/rename records folders.
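The guard described above could be as simple as filtering the diff before the runner fetches anything. A sketch assuming GitHub-style file entries (where `status == "removed"` marks a deletion); the function name and entry shape are assumptions, not the runner's real code:

```python
def paths_to_fetch(diff_entries):
    """Return only the paths that still exist at the PR head.

    Each entry is assumed to look like an item from GitHub's PR "files"
    API: {"filename": ..., "status": "added" | "modified" | "removed"
    | "renamed", ...}.
    """
    keep = []
    for entry in diff_entries:
        if entry.get("status") == "removed":
            # deleted in the PR: fetching this path can only return
            # garbage (or someone else's bytes), so skip it entirely
            continue
        # for "renamed" entries, "filename" is the new path, so keeping
        # it is correct; only the old path disappeared
        keep.append(entry["filename"])
    return keep
```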
## Summary
This PR reports a standalone run with Canon ACD (`CANON_SET=ACD`, `CANON_KERNEL=3`) plus mixed int6 quantization (`INT6_CATEGORIES=mlp,attn`).
## Approach

- Model: `model_dim=512`, `num_heads=8`, `num_kv_heads=4`, `mlp_mult=3.0`
- Bigram embeddings (`bigram_vocab_size=2048`, `bigram_dim=128`) + SmearGate
- Quantization: `mlp`/`attn` = int6, other large tensors int8
- Schedule: momentum ramp (0.92 -> 0.99), warmdown (`WARMDOWN_ITERS=3000`), SWA near the end
- Sliding-window eval (`EVAL_STRIDE=64`); sliding bpb is the main comparison
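As a rough illustration of the mixed int6/int8 scheme above: symmetric per-tensor fake quantization, with the bit width chosen by which category a tensor's name falls into. This is a sketch under assumed semantics of `INT6_CATEGORIES`, not the PR's actual kernels:

```python
import numpy as np


def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantize-dequantize.
    int6 uses integer levels in [-31, 31]; int8 uses [-127, 127]."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale


def quantize_model(params: dict, int6_categories=("mlp", "attn")) -> dict:
    """mlp/attn tensors get int6; every other tensor gets int8."""
    out = {}
    for name, w in params.items():
        bits = 6 if any(cat in name for cat in int6_categories) else 8
        out[name] = fake_quant(w, bits)
    return out
```

The worst-case round-trip error is half a quantization step, i.e. `max|w| / 62` for int6 and `max|w| / 254` for int8, which is why pushing only the mlp/attn tensors down to 6 bits is the cheaper gamble.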
- A: before attention
- B: on concatenated QKV (most expensive)
- C: before MLP
- D: in widened MLP hidden stream

This run uses ACD (keeps the Canon effect while avoiding the B cost).

Reference: Zeyuan Allen-Zhu (2025), full version: https://ssrn.com/abstract=5240330
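If I read the cited paper right, a Canon layer is a short causal depthwise convolution over the sequence, added residually at the placement points listed above. A minimal numpy sketch with `K=3` (matching `CANON_KERNEL=3`); the weight layout and wiring here are illustrative assumptions, not the PR's kernel:

```python
import numpy as np


def canon_layer(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Causal depthwise conv over the sequence, added residually.

    x: (seq, dim) activations; w: (K, dim) per-channel kernel,
    where w[0] weights the current token and w[k] the token k steps back.
    """
    seq, _ = x.shape
    K = w.shape[0]
    out = np.zeros_like(x)
    for k in range(K):
        # token t mixes in token t-k; nothing from the future leaks in
        out[k:] += w[k] * x[: seq - k]
    return x + out
```

Causality is the property that matters for placement: perturbing token t can only change outputs at positions >= t, so the layer can sit before attention (A), before the MLP (C), or inside the MLP hidden stream (D) without breaking autoregressive training.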
## Config Highlights
- `torchrun --nproc_per_node=8`
- `TRAIN_BATCH_TOKENS=524288`, `TRAIN_SEQ_LEN=2048`
- `EVAL_SEQ_LEN=2048`, `EVAL_STRIDE=64`, `EVAL_BATCH_SEQS=32`
- `MATRIX_LR=0.025`, `SCALAR_LR=0.025`, `TIED_EMBED_LR=0.035`
- `MUON_WEIGHT_DECAY=0.04`, `ADAM_WEIGHT_DECAY=0.04`
- `SWA_ENABLED=1`, `SWA_EVERY=200`, `SWA_START_LRMUL=0.5`
- `ITERATIONS=7200`, `WARMUP_STEPS=20`, `WARMDOWN_ITERS=3000`, `MAX_WALLCLOCK_SECONDS=600`
- `VOCAB_SIZE=1024`, `SEED=1337`
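How the sliding eval plays out, as I understand the config: the window advances `EVAL_STRIDE=64` tokens at a time, each token is scored with up to `EVAL_SEQ_LEN=2048` of left context, and bpb divides the total nats by ln(2) times the byte count. A sketch with a hypothetical `nll(start, end, ctx_start)` scoring function (the PR's actual eval loop may batch and mask differently):

```python
import math


def sliding_bpb(nll, n_tokens, n_bytes, seq_len=2048, stride=64):
    """Sliding-window bits-per-byte.

    nll(start, end, ctx_start) -> total nats for tokens [start, end),
    conditioned on context starting at ctx_start (hypothetical interface).
    Each token is scored exactly once, with up to seq_len of left context.
    """
    total_nats = 0.0
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - seq_len)   # window never exceeds seq_len
        total_nats += nll(pos, end, ctx_start)
        pos = end
    return total_nats / (n_bytes * math.log(2))
```

A small stride is expensive (each token's window overlaps its neighbors' almost entirely) but gives every token near-maximal context, which is why the sliding bpb is the headline comparison number rather than the chunked one.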
## Results

- `final_int6_sliding_window` val_bpb (stride=64): 1.16682362
- File size (`train_gpt.py`): 71,315 bytes
- `data_loading_step_avg=0.64ms`

## Repro