diff --git a/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/README.md b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/README.md
new file mode 100644
index 0000000000..f934967aa3
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/README.md
@@ -0,0 +1,127 @@
+# Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + CaseOps Tokenizer — val_bpb 1.07462
+
+**val_bpb = 1.07462** (3-seed mean, std 0.00043) | **8×H100 SXM** | max artifact 15,991,629 bytes
+
+## 3-Seed Results
+
+| Seed | Pre-quant EMA | Quantized | Sliding (Track A) | **TTT (Track B)** | Artifact bytes |
+|------|---------------|-----------|-------------------|-------------------|----------------|
+| 42 | 1.08393 | 1.09482 | 1.07605 | **1.07447** | 15,991,629 |
+| 314 | 1.08467 | 1.09589 | 1.07711 | **1.07521** | 15,990,248 |
+| 999 | 1.08384 | 1.09437 | 1.07552 | **1.07418** | 15,989,091 |
+| **Mean** | **1.08415** | **1.09503** | **1.07623** | **1.07462** | — |
+| **Std** | 0.00037 | 0.00064 | 0.00066 | 0.00043 | — |
+
+Merged SOTA (PR #1493 @bigbag): **1.08100 BPB**. Delta: **−0.00638 BPB = −0.01402 nats/token**.
+
+### Statistical significance
+
+- Our mean: 1.07462, SE = 0.00025 (std / √3)
+- Merged SOTA: 1.0810, SE = 0.00012 (from reported std 0.0002 / √3)
+- Combined SE: 0.00028
+- **z ≈ 22.8, p ≪ 0.0001** ✓ clears p<0.01 threshold
+
+## Contribution
+
+This submission combines two previously separate directions:
+
+1. **Legal score-first TTT** (merged, PR #1493 @bigbag): 3-layer depth recurrence + parallel residuals + QK-Gain 5.25 + SGD TTT with score-before-update ordering. Fully compliant with Issue #1017 Conditions 1–4.
+
+2. **Lossless CaseOps tokenizer with byte sidecar** (pending, PR #1729 @romeerp): bijective case-folding tokenization (TITLE / ALLCAPS / CAPNEXT reserved tokens) plus a companion `fineweb_val_bytes_*.bin` sidecar that reports original UTF-8 byte counts per token. This enables honest BPB accounting against raw bytes even when the tokenizer inserts control symbols.
+
+**What's novel in this submission:**
+- Integrated #1729's CaseOps tokenizer onto #1493's merged legal-TTT stack via a ~25-line byte-sidecar patch to `ValidationData` and the three eval functions (`eval_val`, `eval_val_sliding`, `eval_val_ttt`).
+- **Deliberately excluded** the pre-quant TTT component of PR #1735/#1738, which has been community-flagged (see PR #1416 review by @MatoTeziTanka and @dexhunter) as a Condition-3 violation: multi-epoch training on val_tokens without score-first discipline.
+- Fixed `load_validation_tokens` to exclude `_bytes_*.bin` files from the glob match (prevents double-counting the token stream).
+
+## Compliance (Issue #1017 — all 4 conditions)
+
+- **C1 (Strict causal)**: `flash_attn_3_func(..., causal=True)`; sliding-window eval uses the strict prefix only; the byte sidecar is pre-computed data (shipped as `fineweb_val_bytes_*.bin`), not runtime state derived from val tokens.
+- **C2 (Full normalized distribution)**: standard softmax, normalized over the full 8192-token vocabulary; logit softcap `30·tanh(x/30)` applied uniformly to all logits, independent of x_t.
+- **C3 (Score-before-update)**: Legal score-first TTT from PR #1493 unchanged. Per chunk: `base_model.eval(); with torch.no_grad(): loss_sum += scored_nll.sum()` → THEN `base_model.train(); loss.backward(); optimizer.step()`. Updates only affect subsequent chunks; see the first sketch under "Implementation sketches" below.
+- **C4 (Single pass)**: Each window scored exactly once; no rescoring, no second pass.
+
+Byte-sidecar accounting (from PR #1729) is pre-computed ahead of time and shipped alongside the val tokens; it is **reference data, not runtime state from val tokens**, so it does not affect C1 or C3.
+
+Additional compliance (no banned mechanisms):
+- No SLOT (any variant), no ETLB, no n-gram cache, no pre-quant TTT
+- No validation tokens used for adaptation before being scored
+- Tokenizer transform is **fully reversible** (see PR #1729 `lossless_caps.py`)
+- BPB computed against **original UTF-8 bytes** via the sidecar, not transformed token length
+
+## Budget
+
+- **Training**: 588 s (wallclock-capped at 600 s) on 8×H100 SXM
+- **Evaluation**: ~497 s per seed for the scored tracks (quantized eval: 9 s + sliding: 106 s + TTT: ~380 s). The diagnostic pre-quant EMA eval (~7 s) and GPTQ packing (~13 s, reserved inside the 600 s training wallclock per the logs) are not counted in this total.
+- **Artifact**: max 15,991,629 bytes (LZMA-packed code: 16,831 bytes + Brotli-compressed quantized model: 15,974,798 bytes on seed 42; a packing sketch appears under "Implementation sketches" below). Under the 16,000,000-byte decimal limit on all 3 seeds.
+
+## Architecture
+
+Inherited from PR #1493:
+- 11L × 512d × 8H / 4KV, MLP 4×, LeakyReLU(0.5)², Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap 30
+- Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10] (loops layers 3-5, activated at frac=0.35 of the training wallclock; step 2024 in these runs)
+- Parallel residuals from layer 7: attention and MLP operate on the same pre-residual input
+- Skip gates (sigmoid-gated U-Net connections)
+- MuonEq-R optimizer (row-normalized Muon, Newton-Schulz 5 steps), AdamW for embeddings/scalars
+- Full-Hessian GPTQ with SDClip: k=12.85 (int6 matrices), k=20.0 (int8 embeddings)
+- Byte-shuffle + Brotli-11 compression
+- EMA decay 0.9965, weight decay (Muon 0.095, embed 0.085, Adam 0.02), warmdown frac 0.72
+
+Changes from #1493:
+- **Tokenizer**: `fineweb_8192_bpe_lossless_caps_caseops_v1_reserved` from `romeerp/parameter-golf-caseops-v1` (instead of the default SP8192)
+- **Byte counting**: reads the `fineweb_val_bytes_*.bin` sidecar when present and uses it for BPB computation instead of LUT-based accounting (~25-line patch; see the sidecar sketch under "Implementation sketches" below)
+- **Data-loading filter**: `load_validation_tokens` now excludes `_bytes_` filenames from the glob match
+
+## Reproduction
+
+### 1. Install
+
+```bash
+pip install torch flash-attn sentencepiece brotli huggingface-hub numpy tqdm
+```
+
+### 2. Download CaseOps data + tokenizer
+
+```bash
+cd parameter-golf
+# Uses PR #1729's modified downloader which accepts suffixed variant names;
+# or apply the one-line patch to data/cached_challenge_fineweb.py:
+#   if name.startswith("sp"): return f"fineweb10B_{name}"
+MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
+  python3 data/cached_challenge_fineweb.py \
+  --variant sp8192_lossless_caps_caseops_v1_reserved \
+  --train-shards 80
+```
+
+### 3. Rename / symlink to expected paths
+
+```bash
+cd data/datasets
+mv fineweb10B_sp8192_lossless_caps_caseops_v1_reserved fineweb10B_sp8192
+cd ../tokenizers
+mv fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model fineweb_8192_bpe.model
+mv fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.vocab fineweb_8192_bpe.vocab
+```
+
+### 4. Run
+
+```bash
+SEED=42 TTT_ENABLED=1 torchrun --standalone --nproc_per_node=8 \
+  records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_gpt.py
+```
+
+Repeat with SEED=314 and SEED=999 for 3-seed validation.
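+
+## Implementation sketches
+
+The sketches below are illustrative reconstructions for reviewers, not the packed implementation inside `train_gpt.py`; names such as `model`, `chunks`, and `chunk_bytes` are placeholders. First, the Condition-3 score-before-update ordering referenced in the compliance section: each chunk is scored under `no_grad` before any SGD update derived from it, so updates can only influence later chunks.
+
+```python
+import math
+
+import torch
+import torch.nn.functional as F
+
+def ttt_eval(model, chunks, chunk_bytes, lr=0.005, momentum=0.9, epochs=3):
+    """chunks: list of 1-D LongTensors; chunk_bytes: original UTF-8 bytes per chunk."""
+    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
+    nll_sum, byte_sum = 0.0, 0
+    for tokens, n_bytes in zip(chunks, chunk_bytes):
+        x = tokens[:-1].unsqueeze(0)  # (1, T) inputs
+        y = tokens[1:].unsqueeze(0)   # (1, T) next-token targets
+        # C3 step 1: SCORE first, with gradients disabled.
+        model.eval()
+        with torch.no_grad():
+            nll_sum += F.cross_entropy(model(x).flatten(0, 1), y.flatten(),
+                                       reduction="sum").item()
+        byte_sum += n_bytes
+        # C3 step 2: THEN update (ttt_epochs=3 in the logs). Gradients from
+        # this chunk only influence the scoring of subsequent chunks, and C4
+        # holds: each chunk is scored exactly once, before any update.
+        model.train()
+        for _ in range(epochs):
+            loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
+            opt.zero_grad()
+            loss.backward()
+            opt.step()
+    return nll_sum / (byte_sum * math.log(2))  # bits per byte
+```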
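+
+Second, the byte-sidecar BPB accounting together with the `load_validation_tokens` glob fix. The on-disk sidecar layout used here is an assumption (one uint16 byte count per validation token, aligned 1:1 with the token stream, any shard header ignored); PR #1729 is authoritative.
+
+```python
+import glob
+import math
+
+import numpy as np
+
+VAL_GLOB = "data/datasets/fineweb10B_sp8192/fineweb_val_*.bin"
+
+def load_validation_streams():
+    # A plain glob on fineweb_val_*.bin also matches fineweb_val_bytes_*.bin,
+    # double-counting the token stream; hence the `_bytes_` exclusion.
+    token_files = sorted(f for f in glob.glob(VAL_GLOB) if "_bytes_" not in f)
+    byte_files = [f.replace("fineweb_val_", "fineweb_val_bytes_") for f in token_files]
+    # ASSUMED layout: uint16 tokens, uint16 per-token UTF-8 byte counts.
+    tokens = np.concatenate([np.fromfile(f, dtype=np.uint16) for f in token_files])
+    byte_counts = np.concatenate([np.fromfile(f, dtype=np.uint16) for f in byte_files])
+    assert tokens.shape == byte_counts.shape  # one byte count per token
+    return tokens, byte_counts
+
+def bpb(nll_sum_nats, byte_counts):
+    # BPB against ORIGINAL UTF-8 bytes: case-folding control tokens inflate
+    # the token count but contribute zero raw bytes, so they cannot game BPB.
+    return nll_sum_nats / (byte_counts.sum() * math.log(2))
+```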
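+
+Finally, how a self-extracting stub like the checked-in `train_gpt.py` can be produced. The record ships its code LZMA-packed behind an `exec(lzma.decompress(base64.b85decode(...)))` one-liner; the packer below is reconstructed from that stub and is not the submission's actual tooling (`train_gpt_full.py` is a hypothetical input path).
+
+```python
+import base64
+import lzma
+
+FILTERS = [{"id": lzma.FILTER_LZMA2}]
+
+def pack(src="train_gpt_full.py", dst="train_gpt.py"):
+    # FORMAT_RAW drops the .xz container framing; every byte of the packed
+    # code counts against the 16,000,000-byte artifact limit.
+    blob = lzma.compress(open(src, "rb").read(),
+                         format=lzma.FORMAT_RAW, filters=FILTERS)
+    payload = base64.b85encode(blob).decode()  # the b85 alphabet has no quotes
+    stub = ("import lzma as L,base64 as B\n"
+            'exec(L.decompress(B.b85decode("%s"),'
+            'format=L.FORMAT_RAW,filters=[{"id":L.FILTER_LZMA2}]))\n' % payload)
+    with open(dst, "w") as f:
+        f.write(stub)
+```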
+ +## Attribution + +- **@bigbag** — PR #1493 (merged SOTA): the entire base stack — SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + legal score-first TTT +- **@romeerp** — PR #1729 (pending): CaseOps lossless-case tokenizer and byte sidecar design. This submission adopts the tokenizer + sidecar components only. +- **@clarkkev** — PR #1394: SP8192 base stack, GPTQ SDClip, int6 matrices / int8 embeddings, MuonEq-R, SP8192 tokenizer +- **@dexhunter** — PR #1331, #1437, #1413: 3-layer depth recurrence, QK-Gain variants +- **@Robby955** — PR #1412: parallel residuals (Hessian-aware SDClip lineage) +- **@msisovic** — PR #1204: mini depth recurrence precursor +- **@abaybektursun** — PR #549, #1019: legal score-first TTT precedent, GPTQ-XSA lineage +- **@Christopher-Lee-McClendon** — PR #461: LoRA TTT framework +- **@stukenov** — PR #1364 (pending): pre-quant AdamW TTT concept (deliberately not adopted, see compliance discussion in PR #1416) +- **@X-Abhishek-X** — PR #1445: hyperparameter tuning +- **@MatoTeziTanka**, **@dexhunter** — PR #1416 review: the compliance analysis that guided our decision to exclude pre-quant TTT from this submission diff --git a/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/submission.json b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/submission.json new file mode 100644 index 0000000000..b944b7d0ce --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/submission.json @@ -0,0 +1,53 @@ +{ + "name": "SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + CaseOps Tokenizer", + "val_bpb": 1.07462, + "val_bpb_std": 0.00043, + "n_seeds": 3, + "seeds": [42, 314, 999], + "per_seed_val_bpb": { + "42": 1.07447, + "314": 1.07521, + "999": 1.07418 + }, + "per_seed_artifact_bytes": { + "42": 15991629, + "314": 15990248, + "999": 15989091 + }, + "hardware": "8xH100 80GB SXM", + "training_time_seconds": 588, + "eval_time_seconds": 497, + "artifact_bytes_max": 15991629, + "track": "B", + "compliance": { + "issue_1017_condition_1": true, + "issue_1017_condition_2": true, + "issue_1017_condition_3": true, + "issue_1017_condition_4": true, + "size_16mb_decimal": true, + "train_under_600s": true, + "eval_under_600s": true, + "no_slot": true, + "no_pre_quant_ttt": true, + "no_ngram_cache": true, + "no_etlb": true + }, + "delta_vs_merged_sota": { + "prior_sota_pr": 1493, + "prior_sota_author": "bigbag", + "prior_sota_bpb": 1.0810, + "delta_bpb": -0.00638, + "delta_nats_per_token": -0.01402, + "z_statistic": 22.8, + "p_value": "<<0.0001" + }, + "attribution": [ + {"pr": 1493, "author": "bigbag", "contribution": "merged SOTA base stack (SP8192 + 3L recurrence + parallel residuals + QK-Gain 5.25 + legal TTT)"}, + {"pr": 1729, "author": "romeerp", "contribution": "CaseOps lossless tokenizer + byte sidecar design (pending)"}, + {"pr": 1394, "author": "clarkkev", "contribution": "SP8192 base stack, GPTQ SDClip, MuonEq-R"}, + {"pr": 1331, "author": "dexhunter", "contribution": "3-layer depth recurrence"}, + {"pr": 1412, "author": "Robby955", "contribution": "parallel residuals"}, + {"pr": 549, "author": "abaybektursun", "contribution": "legal score-first TTT precedent"}, + {"pr": 461, "author": "Christopher-Lee-McClendon", "contribution": "LoRA TTT framework"} + ] +} diff --git a/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_gpt.py 
b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_gpt.py new file mode 100644 index 0000000000..84014e70f0 --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_gpt.py @@ -0,0 +1,2 @@ +import lzma as L,base64 as B +exec(L.decompress(B.b85decode(";K7JAGhF~Qn@VT6Qap3bt~@<3h>ok~)Km^%c^ys%R{D_%yAk9-_tV7^coUOo3$w>`(`ci)t`2F7>r>Ltx>>S2CRw|7ov>Wn1e~_!RLQ=%V9g?)G3yPsu%SBy!lj1PaC-x%dDmCDOZ^r^!)+WWz}ejKXTJ#^U6Ra!};QocHHXQC+4UM!QQ!-N5Xd|%~a(9)bTYIO+>B~8~@lqmri%^qEkQUy074Rh6w7V_#^s9J-3BNA`G;qyR$LYcI?e+loZVWi~B$n=TKFp{%SeHYp{oNWh;U@Ahk8M2$OU%K8B$lb*dRQXd-GR_@*KAZdRdwSd#v=LSq1v@Puul=a7WXDmh1^kBj}Y2XlER!D2E{&{%lV(hz$#n5%+%sk&Q}>{y0xpRgiQQBJeVV0hy8UD3ntyo@(Pv+K7^zVRDt4bah(r8kfsZThb+H1)~K-lIr4`|V#-2R>G7pP*N!fwWd&Dq8C)y=NrG_U_Oz6Q?+@ok1?(VJ5?ZT~&}C4Ks38WRB>3i=I!}H-8qq=&yKJ;tbpwwn~lAseD^q1C*u5T;lKQtF;?zv@u0f36%6SXU~txi3v5iSPK*`fNE9531KaQDL`zTPF$MX4U(-3sY-&?>QJe)giBQzpor7H)AZ#4=Hn#`AoAL7tT){&bw(fgz|eQRt`#6-<>;m*+&$!nf|od6&lVKYYHuOoNgZU_L>E@!O%__mlt=);Hwdc43+CM?sh5y+my3XSVYMO8F1pXuq$fvTU<$mpDjr>Lm){DeV)>4AKAhA?jxjH<-3yYQ#5qz+4c`Utifny+Ydmr4?c_z60#9@FU+U1&O$Lfg$WrX7gCj50O1t`1A`k04LVr;^*~{|@(TS5>#TAjL(B`umc8bVA$bS|F?^2A7E}z7IIgZlY(8Ex#K+nLh0vzlKK=74U!g+sX4T?e3_^_7XB1A(HB{pYd{vHYcak_P3DZ2LAB20wAP+C_9p7R|0}wA=p~JFi&xD8H}n(LxCc5rcmwF`!s(tSf_j;nCb+0O(0-dTaxV&s(l>Rv&TP5r4xs>o?F9)~Ad6SB`_5VX9Z5`vf@6r1rKmSVcOVPed21~9(>t;966%Mn?V2uHv7=e1lncNw;)kfHr!^SIaT0b9AirdARXp{fLfurWc(%Xp+I5jAgQ1IdJ6>HPG}7&?U-r5)sJsfMqd(qM^-y3TSd(gmE3{lE9-R(ZJS992NcTRjzMvj+dWmvpK7>LTN6fw3^&P>;pY&p*UIlp-$aXQGmcUP!_d4>jjJ+M@!HGBMn({DKmg}IH`Vzcv7IY%sUr;BS>vM9641f2`dD~0tS1YF72>oTNvKXBP;_yP8S>lh^P#1taE(D<(UE2w413dkd&5Jz+HHHHd&E{8ve~jDHDys+8is7i~+gbWN;~8S^ivD)#UTgS)z^J{AVsu6HIGNslu}jH(v3<1dzNneURgv2$J6>qcD}t0MS`w=eUa@b4d-G%{ud$NmKtlYnIGzDx$ffgT>mlJ%0egncFtBQz=3>R6j&BV`c94;1W61W67=3-9eYv=3#)-4P`563JvaY5Nop0Yy3}9%%#plm}fW`+-+^gJCIu*KZ(Aix0~Pc-@~Nmc0_wE2y>lK&nOM+N6NM&-B6jJc0qs5OB!6aztH-n2FY#-=xPs7A=)JK!?(;h6Tm%krNYu^Q(j-Zj6Ene!`=hrZk}TI`QK5mgYE&aJ*?yxq7Pg5e6XaUNXO3c$*CTkHm&;h9a0H3_gCT;q49(hAR4>CTWp0ggH2YXYzves?NU$4JqGnCd+h7}n#2$QM3HDU0!j^RfZ~IJELy(S3br?CjY_xV{(ino$wS=qd%4EP4tN2$C@z;v&jRjd1}2pG)$D5cwdwe6#F)Ky8?UQDE#ud#^?k%!ZyQ#>>yHXexeu%gfKCkFWdJ80A8Ca%{HL|2T2LKo}XqpyY04xvX5}P-H`L0%0M}lP1LkmHR#fYh_;@kP^OJWP@${)wBFPy`2EC!1ikRaXy-Ql@vgz;gFVp;9P*aXBbBLEqVvzY7;0YLStK)w_W&N@55~eEQuY?z_AkApKud!Du6ow{uG>*u1MrxcpdGK*c9b0%Sa1mgGu;4KXws^yZcaTL{X#;3fRJGM(X!2tj#`tbk;JVv1Nr!@{rn*v;UHZvP62$vU@1C-dl9D^i+1Lq#G@7$-5+wYw(z414z4(IrC^Qz>u3!rjBT`kWNp4Q}w5}4dPTxwcYZe8Lg$uHBNeU#jwpLCxxj8Iw}L6;l1QGldfn+1nW1#4eB@Lj$s(4;6(fd3jY*mpxb)6tyx6*FCZmTMN4+x@UB*hbPdTYnN^>GK9iyu;NMj+K1q(Qf;Q&CLUMor#Y|ykl?XcChc)ym%GU8rfN597Xo^?xGlOOrS?x_4)DUxjdq`a;J^NiVCYv!ktPXLO#}vVUkGf6OF6ag(4TaN(9r36yS+%ef2cJ`6<@q;V_twUV1cF(|(}cZ(OmC5o5rA1(DRW8xyXqX4BhCU`+y8FO9<4t2{EUA=#c*>3JN(QpHP9ieR45zQ&-5bZSa0Y2@k)UUw}kP&JK>GxNrU{^LuLD*}+%&&h7qPgxf4A7bV_q;`aD81|SqlcJ_a71_St%aRkx0?KAfjLZ3o+V|!G?|eWtb)dCOD4go=(`q3i^4wsBw81mMXMY3v2c^xWaKN~Z$mu_nZ()0rm9`8P&PrYsuMX!X%%&g(dA#;DnRKRk@Q2qV9pQ(g1c##bJI;v+PbsZhQbkg>?bz-V!TPp7d}`k5;qIKx^+=V)t&(cL0bmF*+a&Agg#^GDCawKdIKW|yg=*)23y*z=$~n!6f}iqFuQ-ZRvdgYbZGSsh;%Y~Ln-g2`&a+f5(FqWrCXDdPWbmt0Z=KRaV(V8MutJ_CkT6xg{qLj;R1wW3ad0WS7ai83qYs!AG$MvEGa1k+V40@0P8YWGEgwKT3w@JPysFE=Kx;2~sC!1RgHpop7HDe3WhoEWs+2$F`Yy`QztGho8+eN7bu6ErK#M!GjgZOlJ|uB~Yat?I~^FOQrW>8&E@u(_=Hy|jYD+k``qweyO0r^>=uNUy1S`Pcx5gkqH-;5`}d#sczSElB;ReItH*e3awWPc=pc5~ygl_*hQ_I{3o-4+*3PiN`Q|(O}MOO#f2iX{-gK870sY-(p=ixEcHntKqtaq>!n2vU_A)_eq-nrjmnDCEYr(_R}jGpIxi~wZx@U++SW@)UV?IYVbWq8o^Gavt8f+jHmoEi$^*xl-B5*2m$rM&M%&BOamDkeeA$RJ3L8UpD37?NC
)RTObkZhm$Tb=~Xjq-Sf6}GE0x`3Vq=qLlbQrRpf4A@_`mK9ibyj3;94xDLYEh%Bw%Nhr{dtlbrOKRJ^l}>cz=I8n7J}fLgW(OaKia`Peb`L=R=R`!O3Me1WMP?sBTMmC7dSW?kMCyWtc$Do&&u6)p()j)ZU7^*a0ubNJ+2EcsKj`y4*c#BAzMa2PL+@R0@}U%)^O)|4XC!Fm!b7K{~&W?!N^|w&jp1u~CaQy{u95E%|RTHD_wY^us))#3ii#*MK<2ftKtS_+TP3e_gf=OfsvJet0l&o_RBb)nYZ&8s-5FdI7B|?7Qk|uQNqxV0M~}p;v;yY%7FLZIb&}lt=^YD0b@94_L40VPFc2>=>H;5cH`(^?Q@1Q2mC(Rl>SDH1glgII0HZINZjxgYa5nxk}C=8moz@NiWSL?%o#eQ<3jl&#`S^{HHYRWcEkrczx_+wS2NIvw9eb`U?y;VZ*~EDh!V5_JQC^Nfdtj$9#^UP)35VWw+&aMFS@(#r(FJ_SGu-8V&HWUCJ}P-M!>iUICM0{wGrJRz-8OA0&BsJVRBU^=+TK~A*VtePgI75s9{WJ@z1q!<#pr<=l&0YtB7>!8opnvs#I-CsyubB7$U6)PQKc{T&mKu+)nuS2wHOlkl;4!LBQE>WXd-3nzJZ0ZJmfjb$vV@majcIl)4D|y$aI|i8I{vHvdz*Gcv#l5T2X<>N{^faZ!t8OgkwN40NvTSh({)`OXAB?(1@H_~nKs(&Hdv>H;C+GM-=LKT^*FKH0R@3KI9rZm=y+og8M-Ea4)sN(I$IM&eUaV;(QMHof_+XPNfB`@Ci8K>!dkrD&0X>n7Hn&sYj6`Gu87rr8J?1X#I2oY?(B(-ck?A=ZVd_hfu=93V11w%;TO4u7B|!O`a4a5x1~`pZla%ed1_hzPZZGuAYMCpP9XOL-7h=f^;?2Zx38l>=?i_6n#&Dm0(vPiyng4kHRVf!1~qY0wpW)^bNy0gC2q0~>fau3ir@vSgjQGOoq7Va&YK1T?vu!=>z+79QzJ&?utIffv$nUT*ElkCk0+I){2|fU-gJ=QLv{JiuzNUykJu|MYw}k+rS{`e)|l#~Cn_agIpI*>nfpIP>bmtPo{lAOPv!XtHTJki^>OuihVX?S@1)tfmDg{$dr{7n@vOe@*dV*DL~rpIyBmEU<(l%DFR9&vGBOME|tL<~QO>HGfckF=xYWqP5#R4zbLPPoR%Ak)Uyf&jM2BegO35rhI-81Wnv%LS+<9*jlqy+UJq|EFYnV>W``*tsPS|NL9#adEI^EhI2E$+UM!jQuHGq9eSuLE_iX=VLYE#>pUS@m#gpPU!5nL6!oV8B-aputxtu!pSQV`c<5xCpC?I@bq1F-Gs0(1Tk62y5n7>!e@hYbhKUR6>X~HQtZlKx~DCUkl-N3}Z}$29k~hw=8_r8XD{&prI>B-1_v2`qrsKzzk`3*5TU%WB6JPrf?E_Oh#X~T0^ofp*6LcouAVZdtW8pshZER(rO`NZU5+6But*$kY#-{M|v^fWS;I+4v7WqC!`Li7S^q8)qzDd^6!zDgaH4dAAMASIX64AQqqlv`D>gI&%7QdcWxuYtBq}bFsX~>!5MBalgg8BT6&Cq-iaKOGP7&T9rE^xmITEa9xECA@nw3lCa*4dzMArc%Ou6hs^RzXf&Y40V-f@R$8+7+&bO&7+JWsNun9+WB2-eK}DJkHdp&TZJqmJ(&QEu?=VTT&ZTMIAUAQ55n&(tYqpk?huz(lPT={1qcH;rmq3H4}#q36U-5X+1Jy4N;IMytc%#IbAR;JMiq-PG-XdR4mCa(c(D$$;Y%o@z%#qZ{-Fe?7c0O9m_lTHw!5??KhNM*!pA?qh|VgY73*v-e$t&WO+qe@BYx-sS-JNk`3?gdjAtJ_qz+#%F<50w@X3knacs8-D_jtdhQ)r<$iwEJyI&iE0GIipejywwswZy;Ds`nb9_vNkp-C*{6&RmmiQ!n7`jK70N|}usj1L%EhUV7ZUZ7rnj&jmo*ZwvUZLVfHW^7mFPMoHBtJ?f+_~N&Z>-eY$;LPGcuvX8g&K&i>MD)vyYt1PC+Eyc3lyW=d10~P|eg%7vnQn?~#roNUS47m7fg)W^h>_;BPM@OagUNS1XgpmPOj3hyU_%YO73u6Tqszv4Db;!$hM(|N5koTiEkvd*ZV<3ehCS!fjU-+U^RENs`k%@Hk`T}_v`I_`N!`+G@zBcQ5>J%T({yzeD`TCgor?vAoQcMlTa_J!=k;u_n|6ll8Qw7pbt73Vpe3aG8}t4{LHN^Cgi?-dEP75R&Z{`H4hKe10HB#{eLlGFe=$nEYJM#~A_Pd0R3Rue^LXm?EY8mrF<^##wpByxB3;y@jT*vf?8aq0?k+DfM2tqAMI?f8aF?)UjIt;jI$671<$;Q>D^J=}^J?t4Bz|9;V9|wVFOG(|WOY?u(|%MD{$H}1s2CrQWh(EMY=;DJn+^;eE}Z>x0*Qo^|M&ua7Y>jDS;q;R&E%c#91@)e;1hP3)R^PQ6dKSeXf`+UHx{KQ<1%0{I?s(IHF*aR{S<$BIYz^bQTd9W0)a_|?>$EC7Q06@rI^#vA4T==XPjk>=}@B7~t0)SS(l)Ie|k3a85Zz$jv(}P(=bl4v`b0$tn?}VOEfWO}@3VX7CiERar80Kd)u5em!&<$S9r-uNdGg17Shy9N)S?Q^l&l_)BA@geiWht=LUQTXouqW*j9+TkbWs77<2rb|NO9BTue6V7Pz;apuh}Bj>G++mKVo}=6j@J=k`#y$FnFkdWjN16J|IL1;{AGg%>_%E)X-LK2XCgf~1*}23`0%-}RCtJxG@F5Mt`_!W~LOER{L6Rmu;E_SOVd}bGP)X5pg*j_mL)h#`tctWP6c0eR?FA^!}y$7b(HV#`oI!#fmdbZb94LptK>v@6PV~AA^ftS{YV1x@3JbhSLOax2Sw^z+Z^QkYnDTxi?&@Vq@F(%%ptY&t$YCWnIAufPv}A_`K9G&sfgb9Y(}f1n4}WH*(uNQGvTWpb_p%_V@fsww5du0aur8eQV%s=m&h|JQvi>uu|%xgaXB1R1pSkFarIBs>zEEhFMMz!BTqgzX6O;+4SF{Bj04jcb`+89j7%#kCR30kAlNtun^aVbkTW2(>z_sGu8+rnGl*kVk@eVD{Da%u?L^n<>42Ib5~ZqKZl-?&wRM{+jzMgdoo@AI|PH5KmT&6WXRK|6~rf66;06)`u>mNqn_-@_>>L!Z^Rj4X(;9YR7xag5R02fU5~d23x&}k{xs3^1S(oU#;VOk=*&S{s}tMf>~`&kJF^s_9JiWwxo+)`@fgoY6qL1km0bJXSAyGF2m7&zq9F*eVq`TWOW;RUu@jD%Qkn3lS!c{@A6-LPaC@-J>MI!L+@Qfxx|hY|oFtb@c9L)Cl{iV)PlCxvVq(A2hyv!-PD1W`*m!^iVQ&=gO0}_bMN940*#$XfwgIzs!*%f-fS3KH90hgzG>d!LL5HjgC05Gx48)5lrh_XB|HTf=ErP>l>y%DR21&)Z$aF3L}T`h_fi&GlBdt=g`>d?25=P^o{s03J8}Ybp-h-GnUsX~kLcWX2P5_`N=v6{r3jvb=~A(~B`(_A5Up^;yHm&c*mW#O
I!aw{#w~Jr%Dj5cGZTD8>J#b*VS~Mk;AgQ?`V9aIo-Hukt?STkCwOlKGViz;kjA*T777|IF|g%UaIZek+RAaii@GRecPahB#KxXf%J{hUnsf}_2U&f&(Qjm~qWsS=!oDpZuUx7h+KE~KUrwM0zeGqY)5MOhtqIq&hKH+q$3{A6qESkAuF{BVbiJGlxrQ^o0}h`Q7dNq9888W1jT5zUm{G$amhbVw)5c_ANn7Mmy2cyA+Z-Z^ZvUyJA+HNWPdk=D`TCWTefwdZ#|P@Hv_hKxNq^nV1r309iuCqZokU{mZalQjfD)bFt!9sWg0HWYBdm^FO9@45Jk6$akCiw9!C0|VF?dPB6z`NSL(e>A2XHXK4C>G*a4XdC7hEKGtIYVq~}~VSMJMj`dG3_xUnM(O#{o`=;#Lm{TQs0aoOj^3W|Em$#;9u^r|uu6udbL5tZSr3(?O#;)mqeI@Niu?ju*-PiZ4(6nSuzm4W^bH|Kz$Z0x7%c~aPO0VS>0(yofDYrZ2J=N@Wh|P=37EC-BlarTW5qQ|rA!$Iur<3aB@!Fq^-IvVG=ro|3ZZ)BMWhUz`=IM;uS;qR=OXR8+u{Cg$$mQs$zq>&Q@CoSbXwHT2UA$}&(R7PyY>_xWB|JDb|Af0Z~(PY)TGngB{Fir4VF}xtn)rhsq|Y*oiVh1s;BjAJAG!!3vj4AIm9O_8ZC9~1gc%RonS`Xe!b=UEB(fmpO2tco*UP7aFG*a4;=fa7^m8$z#e{iW`Q5t@CMP=Hfqyeqw^yQwJaxjx>%?e!zzQZ8uwPetENFeU+K|X@o05(J`ju4Xb`vs2AwCKJ{Qlj4CP;mJU&3<*`B3=3=!oL0e6vI~DX;?`dIM|<&FHQ!8Di;Er4o(NF~YJx+4i5l;fG3g8|~H8Uz$?Gjro8R7Cc^IemAmJm3DNYLzWfS{1TjhLA8Fv!mbq-kZcgx$SVo`3xo$!jObLstw;XcR*(`#86w@z24w10=~&oYl}OxQE*74mQ$yREN9>NlouksTR@&MX4jboB6E$TX9?6tJxUbqMU=r*7~rxYqSo3R^RQ>7t8Yyw_~<7T^1+(+(C~i3nYU0mQnljTGQml%H0@%)_rJ(`W~L2>wcqW1625pKm0ShwCO_jo9@NX=uKt1ze`XTY>$%JjF0}1KqN)4tZ)P23%KbOjbQ#ZxBNycm3fvX&K)&R?_Et$u8snKBuEvD{Tx$|rRr>9*Xi6bVLJDVQxxcXwas6zslXW8Oz^Af$P-n=ewwZ^9jw6QuZ4J86vs9~9T*McoFjCOU9VCgM!<#et(GF9}64v|nf__06sDoGp8x1rogc@H?NzQ=lw^4uM{yk2n4R(zcVe_*j4ufd|9;V)sld_^Vo1Q_8#KP7v2FKa=NVEqktnmeRN*e_^{T!lwJ(C6d2y<4BNU*Z5Kcd=wTF7+hC3(--8WXJ2B)WA=BboM+oWR=&5L8M<=2R!YoxHqaU0PX35^hsdyt@Uk!?wD+rqCmusPeYuUFo+IFM$eQ$dv^xSt#Yv0Frv-oGX?!>Q3NgW2cngtf~DwH<4?w{=Fq+UFnEo;x9Cv0kl0D2B7zC{dIrLH1t@f+DeiW=&m{wXqfj>3$$Ct*roM)nPPDe?7}7QC*EFjtfMU$}#@W*!YqT`RqfIgY-3(DXw>uUC4d=3%w?XpnpKG8JmIN0JbFQ;(6Pv4fu2eGq?}XGpczaB7n0hkP1=1Mgw~uhln%iyO$UH~LU{WQ#L$y+su)lQ4yo2|_!1Z4N^DaHq0-$$+1ehb}j_FYpzH=TwnM<{K+<3wFQ!5#RZ_j@J^7MPCxq;mA(3U#3|3eql}_QVWY7c4-VrGNennkWSNQ}yWCnsOvqKyWCjKJ81#v+M(%)%nCuD@F>VaeD^Tii+W%q)OrziX?lnFn500<1p5TDK`sfF-sjm@xqBeU*k9!^G$OZG><7aYc4rKmZa%VYkYV{1mCG!99dDy6YS9aqqYsv%C#%TtTf@&(VWAZ!}zcS!JHp)0|%?%WQHhipNQ8Zp1~_}t(fqro8B(*ZEdW5xJq0q;Y`@L71kS`t9P%ClrRWpE$L8V`E_SD0vyW4MY>_)94K^ze1Qq6cM;O6A7Z%=(Sgk0cVQ6Q+DNpd=JH{a|%RzC{ha>k?jl8P41fV~O_2;~(iB#M5}2OD0)H3wkNmDTO_zAN}nBlBE3mhvSU63h$^{_$#f^@~oNPZIyGCe?(gV@R|X;zZ{7W!nCCUI=>~Wamw3>(fn`|rJGDdPc>`sBpIC43vWAZ=kgHUBD~Uir+H}_|Q}N%sv<(HIjlKYLD8+=5!d1J0=K_910i$i>T*wRGFPcyrdT+J%;wXBnXyq8TEynn&%1F!D&L@2gnd82xt29{Q5II@i7`-co?*I}0`3_^p=Xv~|=2C`7uvfTyCi~XKmp}_g-_|1igFPVeN(BxK2{bkQiCq~Dnn&_FjV0hYmL8_5EAf|(x2ADcAm&;C83)JEL|4?KrmG(*)Y01|-Mf*J_-Au+8p-*_CULV6(oxnGsD>*h^1U9Nlm)ZA;>&I6zB>0%?pAYlO#+4pbQuWV6d?ui*c7^=FgGWzxK2-nYui%Dd7&Iu)ZNa5bHY6-)urYYazpa;=suM9f|3$_g@VV9oX28ZCY3rf>Z6j*#lr*D4vz(nz^ZcXgZE|(JZ;jY5oVOgbtd#>e=+i!2*J{to}qNJ7EVgXU}i1b0>M3VS(%|GrIt>1eJuU!sQrjqYMrs?-OYWyev1WmB(gWc6s3$O)myTuoP^%VT<*{*01R7_s6rjE=V=F#teoz)fKY&mZ-YVOZmgB8WwTqt^Ah0@=TN3Ps~ax}aUyf-OoFt@XuMJP{&YPyi@Rg^so-l%I9#a@=!(A*dyD65n7VUaF?m1!qtegj4?PesKvjZ4(xAFPyA-DcVX;dI2YsQ&Pz?>#vvI~?-0@DHi`07;qFI9bGLh;kRQ3L`xVVz>f!q(@onE(aidk@jedui&x~)De$U0=TM?J;=PB17mp5!=gJc4V}|;{B${(4NfS6J_hAslh{C@g*Smy@F}{5Zh7D~Q$u=D-rc!jCQg9aD+vCd}ke9Y|DW#MG2Yp^Hp=>#73oKd~kRJScM>8S}K?KdpUk)M#pfh&z=QaLhMw27Fxfl!m@Or?duDA~;q2M^iRQ$(yOWP7)duq#hV|m<&2;QRfpFJ*Ws2_H+i>Pqb>@3*lUwTKAb(B-?EA4^{!>{)91aK6LiuOAKJ8;2%0h2P9k)WQxe$2Q;Tg5W;DJr)@ieXa_!yKFbwg4BMKsS))8T+DT(qmUjzZ>97jC;nRXrq(jT8gN-_Aw2yJF2VSTE36QuFk)CV0`;<+RUIFyAbr#=tjrafw}sFl23uJQ15MGl(d&t}#k2QoWT8Lc6-$vJmoEN%ziD`#7WoF%Gz_D~E4jJ=fVYBR8B!up<$YxtKU%6-Il=Er47t!beg~OwzjVmCtkAxzC+wqELvpJSbcx8-ck5Vbw|>N$w-GfAeceQ&s~IVpg>4no;@WTkC0nHkI%oRl)5{6VnxqvY+cws`L(1A2?OJLsF)kr=CI6V}1Mu*d69*T$^yRXvK#eb9w5ujrpG^Nx%B2)YUYUM+(kgL*#=5uxs^dCsjUXTHG*#0c_v{
UioqL&qN$`t-`MVRzF}5as!TkL^e6c=y(o9%2Ys9%7Crjf~ho}G"),format=L.FORMAT_RAW,filters=[{"id":L.FILTER_LZMA2}])) diff --git a/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed314.log b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed314.log new file mode 100644 index 0000000000..63316e46f7 --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed314.log @@ -0,0 +1,155 @@ + + | | + |_| +For detailed documentation and guides, please visit: + + +W0420 21:05:51.032000 54175 torch/distributed/run.py:803] +W0420 21:05:51.032000 54175 torch/distributed/run.py:803] ***************************************** +W0420 21:05:51.032000 54175 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0420 21:05:51.032000 54175 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/verify_packed_seed314.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: verify_packed_seed314 + scalar_lr: 0.02 + seed: 314 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +val_bpb:byte_sidecar:enabled +train_shards: 80 +val_tokens: 47851520 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 
+loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0104 val_bpb: 4.1174 +1/20000 train_loss: 9.0107 train_time: 0.0m tok/s: 8307024 +2/20000 train_loss: 12.8039 train_time: 0.0m tok/s: 8136732 +3/20000 train_loss: 10.0986 train_time: 0.0m tok/s: 8050733 +4/20000 train_loss: 8.5017 train_time: 0.0m tok/s: 7999251 +5/20000 train_loss: 7.7783 train_time: 0.0m tok/s: 7973593 +500/20000 train_loss: 2.8875 train_time: 0.8m tok/s: 7746690 +1000/20000 train_loss: 2.8043 train_time: 1.7m tok/s: 7736461 +1500/20000 train_loss: 2.6530 train_time: 2.5m tok/s: 7733885 +2000/20000 train_loss: 2.6387 train_time: 3.4m tok/s: 7731902 +layer_loop:enabled step:2024 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 2.6585 train_time: 4.6m tok/s: 7099903 +3000/20000 train_loss: 2.5481 train_time: 5.9m tok/s: 6712525 +3500/20000 train_loss: 2.3856 train_time: 7.1m tok/s: 6462037 +4000/20000 train_loss: 2.3971 train_time: 8.3m tok/s: 6281621 +4000/20000 val_loss: 2.4325 val_bpb: 1.1116 +4500/20000 train_loss: 2.4205 train_time: 9.6m tok/s: 6152973 +4587/20000 val_loss: 2.3762 val_bpb: 1.0858 +stopping_early: wallclock_cap train_time: 588112ms step: 4587/20000 +peak memory allocated: 39046 MiB reserved: 39070 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.37366384 val_bpb:1.08467175 eval_time:7307ms +Serialized model: 135431033 bytes +Code size: 16831 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.8s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15973417 bytes +Total submission size quantized+brotli: 15990248 bytes +quantized val_loss:2.39822255 val_bpb:1.09589412 eval_time:9123ms +quantized_sliding_window val_loss:2.35710844 val_bpb:1.07710658 eval_time:105520ms +ttt:start chunks=1461 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.35295280 val_bpb:1.07520761 eval_time:388701ms diff --git a/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed42.log b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed42.log new file mode 100644 index 0000000000..25d8900f4f --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed42.log @@ -0,0 +1,155 @@ + + | | + |_| +For detailed documentation and guides, please visit: + + +W0420 20:36:59.046000 53090 torch/distributed/run.py:803] +W0420 20:36:59.046000 53090 torch/distributed/run.py:803] ***************************************** +W0420 20:36:59.046000 53090 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0420 20:36:59.046000 53090 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/verify_packed_seed42.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: verify_packed_seed42 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +val_bpb:byte_sidecar:enabled +train_shards: 80 +val_tokens: 47851520 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0140 val_bpb: 4.1191 +1/20000 train_loss: 9.0146 train_time: 0.0m tok/s: 8276000 +2/20000 train_loss: 12.7300 train_time: 0.0m tok/s: 8146748 +3/20000 train_loss: 10.0544 train_time: 0.0m tok/s: 8048113 +4/20000 train_loss: 8.5213 train_time: 0.0m tok/s: 8000476 +5/20000 train_loss: 7.8293 train_time: 0.0m tok/s: 7974877 +500/20000 train_loss: 2.8887 train_time: 0.8m tok/s: 7745494 +1000/20000 train_loss: 2.8033 train_time: 1.7m tok/s: 7733799 +1500/20000 train_loss: 2.6493 train_time: 2.5m tok/s: 7730998 +2000/20000 train_loss: 2.6391 train_time: 3.4m tok/s: 7732636 +layer_loop:enabled step:2024 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 2.6597 train_time: 4.6m tok/s: 7102535 +3000/20000 train_loss: 2.5468 train_time: 5.9m tok/s: 6716793 +3500/20000 train_loss: 2.3805 
train_time: 7.1m tok/s: 6466089 +4000/20000 train_loss: 2.3981 train_time: 8.3m tok/s: 6286606 +4000/20000 val_loss: 2.4311 val_bpb: 1.1109 +4500/20000 train_loss: 2.4176 train_time: 9.6m tok/s: 6157331 +4589/20000 val_loss: 2.3746 val_bpb: 1.0851 +stopping_early: wallclock_cap train_time: 588012ms step: 4589/20000 +peak memory allocated: 39046 MiB reserved: 39070 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.37204232 val_bpb:1.08393078 eval_time:7341ms +Serialized model: 135431033 bytes +Code size: 16831 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.7s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15974798 bytes +Total submission size quantized+brotli: 15991629 bytes +quantized val_loss:2.39586613 val_bpb:1.09481733 eval_time:9127ms +quantized_sliding_window val_loss:2.35479239 val_bpb:1.07604823 eval_time:105607ms +ttt:start chunks=1461 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.35132821 val_bpb:1.07446524 eval_time:377435ms diff --git a/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed999.log b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed999.log new file mode 100644 index 0000000000..fb920569d7 --- /dev/null +++ b/records/track_10min_16mb/2026-04-20_SP8192_3LayerRecur_ParResid_QK525_LegalTTT_CaseOps/train_seed999.log @@ -0,0 +1,155 @@ + + | | + |_| +For detailed documentation and guides, please visit: + + +W0420 21:27:31.637000 55411 torch/distributed/run.py:803] +W0420 21:27:31.637000 55411 torch/distributed/run.py:803] ***************************************** +W0420 21:27:31.637000 55411 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0420 21:27:31.637000 55411 torch/distributed/run.py:803] ***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.9965 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.35 + etlb_clip: 3.0 + etlb_enabled: False + etlb_lr: 0.05 + etlb_steps: 5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/verify_packed_seed999.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_residual_start: 7 + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: verify_packed_seed999 + scalar_lr: 0.02 + seed: 999 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.72 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +val_bpb:byte_sidecar:enabled +train_shards: 80 +val_tokens: 47851520 +model_params:35944536 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0053 val_bpb: 4.1151 +1/20000 train_loss: 9.0060 train_time: 0.0m tok/s: 8291152 +2/20000 train_loss: 12.9437 train_time: 0.0m tok/s: 8189033 +3/20000 train_loss: 10.1240 train_time: 0.0m tok/s: 8080950 +4/20000 train_loss: 8.4957 train_time: 0.0m tok/s: 8030178 +5/20000 train_loss: 7.7384 train_time: 0.0m tok/s: 7994008 +500/20000 train_loss: 2.8871 train_time: 0.8m tok/s: 7740138 +1000/20000 train_loss: 2.8040 train_time: 1.7m tok/s: 7726430 +1500/20000 train_loss: 2.6536 train_time: 2.5m tok/s: 7731038 +2000/20000 train_loss: 2.6375 train_time: 3.4m tok/s: 7733322 +layer_loop:enabled step:2024 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 2.6599 train_time: 4.6m tok/s: 7104045 +3000/20000 train_loss: 2.5457 train_time: 5.9m tok/s: 6718982 +3500/20000 train_loss: 
2.3846 train_time: 7.1m tok/s: 6469005 +4000/20000 train_loss: 2.3991 train_time: 8.3m tok/s: 6289641 +4000/20000 val_loss: 2.4310 val_bpb: 1.1109 +4500/20000 train_loss: 2.4146 train_time: 9.6m tok/s: 6161004 +4592/20000 val_loss: 2.3745 val_bpb: 1.0850 +stopping_early: wallclock_cap train_time: 588086ms step: 4592/20000 +peak memory allocated: 39046 MiB reserved: 39070 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.37184229 val_bpb:1.08383937 eval_time:7364ms +Serialized model: 135431033 bytes +Code size: 16831 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 12.7s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15972260 bytes +Total submission size quantized+brotli: 15989091 bytes +quantized val_loss:2.39488502 val_bpb:1.09436900 eval_time:9179ms +quantized_sliding_window val_loss:2.35362888 val_bpb:1.07551655 eval_time:105230ms +ttt:start chunks=1461 ttt_lr=0.005 ttt_epochs=3 +quantized_ttt val_loss:2.35069962 val_bpb:1.07417799 eval_time:374530ms