diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/README.md b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/README.md
new file mode 100644
index 0000000000..6b14ade61e
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/README.md
@@ -0,0 +1,40 @@
+# Record: Improved Parallel Residuals
+
+**val_bpb: 1.07578747** (3-seed mean, std 0.0007) | **2.77887078 nats** | **~15.98 MB** | 8xH100 SXM, 600s | Legal TTT
+
+This submission starts from [PR #1523](https://github.com/openai/parameter-golf/pull/1523). Most of the newer submissions moved away from my fuller parallel-residual formulation and settled on a simpler GPT-J-style split-lane decoder. This version keeps the strong parts of that newer baseline and reintroduces the useful parts of my parallel residual implementation.
+
+The key architectural change relative to PR #1523 is in the decoder after the split point. Attention and MLP read from different lanes, but neither sublayer writes back immediately. Instead, both outputs are accumulated into the two lanes together at the end of the block:
+
+```python
+next_lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out
+next_lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out
+```
+
+That keeps the GPT-J-style parallel-in-time update, while restoring the richer learned routing between the attention and MLP lanes. The other important part is that decoder U-Net skips are still written only into `lane0`, which preserves the cheaper and more stable skip path from the newer baseline. Attention reads the mixed `lane0/x0` path, while MLP reads raw `lane1`. Final output uses the mean of the two lanes.
+
+In practice, that is pretty much the only modeling change here versus PR #1523, together with moving `PARALLEL_RESIDUAL_START` from the baseline's `7` to `8`. I ablated that start-layer change separately on top of the plain PR #1523 baseline, without my fuller parallel residual routing changes, and it gave a mild regression on its own. The other notable requirement is that I needed the CUTLASS EVT path to recover the full throughput. In this iteration the CUDA/C++ source is inlined into the training script itself and built against a standard `/opt/cutlass` checkout rather than shipping a separate prebuilt `.so`.
+
+## Results (8xH100 80GB SXM, 600s)
+
+| Seed | Steps | ms/step | Post-EMA BPB | Legal TTT BPB | val_loss (nats) | Artifact |
+|------|-------|---------|--------------|----------------|-----------------|----------|
+| 1337 | 4,655 | 126.13 | 1.0830 | **1.0751** | 2.7770 | 15,983,095 |
+| 2024 | 4,689 | 125.20 | 1.0843 | **1.0765** | 2.7806 | 15,987,382 |
+| 42 | 4,696 | 125.04 | 1.0837 | **1.0759** | 2.7790 | 15,982,563 |
+| **Mean** | **4680.00** | **125.46** | **1.0837** | **1.07578747** | **2.77887078** | **15984347** |
+
+## Reproducibility
+
+```bash
+pip install brotli sentencepiece
+git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass
+cd /opt/cutlass
+git checkout 08185b9c3e90510ee2b656662ed0d53b06d28157
+cd -
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
+for SEED in 1337 2024 42; do
+  SEED=$SEED TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 PARALLEL_RESIDUAL_START=8 GPTQ_RESERVE_SECONDS=13 \
+    torchrun --standalone --nproc_per_node=8 train_gpt.py
+done
+```
diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/requirements.txt b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/requirements.txt
new file mode 100644
index 0000000000..7d206a3220
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/requirements.txt
@@ -0,0 +1,10 @@
+numpy
+tqdm
+huggingface-hub
+kernels
+setuptools
+typing-extensions==4.15.0
+datasets
+tiktoken
+sentencepiece
+brotli
\ No newline at end of file
diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/submission.json b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/submission.json
new file mode 100644
index 0000000000..3364009c77
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/submission.json
@@ -0,0 +1,56 @@
+{
+  "author": "Marko Sisovic",
+  "github_id": "msisovic",
+  "name": "Parallel Residuals",
+  "blurb": "Built from PR #1523. Restores fuller parallel residual routing on top of the newer GPT-J-style split-lane baseline by writing attention and MLP outputs into both lanes together at block end, while keeping decoder skips on lane0 only. Includes inline CUTLASS EVT fusion for reproducible throughput. Exact 3-seed legal-TTT mean: 1.07578747 BPB / 2.77887078 nats.",
+  "date": "2026-04-11",
+  "track": "10min_16mb",
+  "val_loss": 2.77887078,
+  "val_bpb": 1.07578747,
+  "val_loss_std": 0.00180154,
+  "val_bpb_std": 0.00069743,
+  "seeds": [
+    1337,
+    2024,
+    42
+  ],
+  "seed_results": {
+    "1337": {
+      "val_loss": 2.77699288,
+      "val_bpb": 1.07506048,
+      "post_ema_val_loss": 2.7975248,
+      "post_ema_val_bpb": 1.08300903,
+      "artifact_bytes": 15983095,
+      "steps": 4655,
+      "step_avg_ms": 126.13
+    },
+    "2024": {
+      "val_loss": 2.78058475,
+      "val_bpb": 1.07645101,
+      "post_ema_val_loss": 2.80083877,
+      "post_ema_val_bpb": 1.08429197,
+      "artifact_bytes": 15987382,
+      "steps": 4689,
+      "step_avg_ms": 125.2
+    },
+    "42": {
+      "val_loss": 2.7790347,
+      "val_bpb": 1.07585093,
+      "post_ema_val_loss": 2.79919043,
+      "post_ema_val_bpb": 1.08365385,
+      "artifact_bytes": 15982563,
+      "steps": 4696,
+      "step_avg_ms": 125.04
+    }
+  },
+  "baseline_pr": 1523,
+  "artifact_bytes_mean": 15984346.67,
+  "artifact_bytes_max": 15987382,
+  "bytes_total": 15987382,
+  "code_bytes": 26056,
+  "train_steps_mean": 4680,
+  "step_avg_ms_mean": 125.46,
+  "hardware": "8xH100 80GB SXM",
+  "evaluation": 
"legal_ttt_exact", + "technique_summary": "Parallel residual routing + GPT-J-style parallel-in-time lane update + lane0-only decoder skips + inline CUTLASS EVT fusion + legal TTT" +} diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_gpt.py b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_gpt.py new file mode 100644 index 0000000000..cee4f99e7b --- /dev/null +++ b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_gpt.py @@ -0,0 +1,5 @@ +import lzma as L,base64 as B,linecache as C +S=L.decompress(B.b85decode('{Wp48S^xk9=GL@E0stWa761SMbT8$j;cuBxlU)Ebn@VT6Qap3bt~@<3h>ok~)Km_aAcM1$ZA=RNsbc*hJl~=gx5@S`BHD+&>Xp;7x}-gi>D%6AW8T@rR19$rwtiM2o~|wgmtU5bqA34TA)eOf!1NbI=@)-W>;uDJn~M=1;V0Bh5)gdOhxi#JHVJ{qqF$n+Wk>KYHbERxFBz+YYLeXCIlw+g_%-8M_)k{Lx6$#APN{gz|uAcxc5*eTrvj6d9QlJ52-;<K_3qo$qHmPs%6$yx8m4KJ-{tCtBy#h!iJvc$V_#tO6JmUyS44dIsu^fEBO4Mlb+Qd}d?aPYea88WZ?Z|c{Y-SnAHSHDy6OYu{M38UKEc(8I+jm}&&IebB5VV%xjN(yqdDvWvaO?+N6}T+tBtr-;ul>}-!vl?I_u>7FWD?nxubtB5GeSj|u=!=7l>s*S2HGK*vQ8B65^Q3q_uAVo#X_?U%~Zq=!~bMCCeH5*F-tc=9^&X~#9~bb4=PtLKUrFv7|)@us~B$jh9&>Lu`mqT~-RlPsm(u_m*YQz-;fnJN(XvfY9Fyo`h|F9Jwv?OYXE>+N92>&_wGcC^^Wt9n@}bJf-`Nw6oENOJvLURw_Zyd0mZIlm@sd}K)f<8SZ*+mK=a1W5~+RYyFa=o`nL+tVh>ZB1sOZoUq)BGHO&-=Z@gfQ?PwLpT$s9}DZ{+lPM%?1x0bgvJufMi7$^1l&xlZRv$;n8`SP7AehZDp){tyD3M-EOHbr+!z53T*WyJ}1aYX6PMj7q+=Ee`(%q5~(uu|cxt)oro~861@Jh00u{`XQ2(aHnpcc_Bhs}`_rP}*EItJzO=5d#;2%m;$6C@~^vfjH}+r&kfbG%UFT^6-9+bw1Q4Olb%$T-Q`%O6pr=@Urb4>XsSh~b7kQ?T;hse`m@&db=5xo>9q@f!i|3JSfTjaD#OCDQO0&`*z@99ZWl35{7z?gWUD)%J6&dO!9!-No0}7-riPZMeby#mi==)Aq}8>Ifr=~|mARq;i0$nq@6;AFq{ZtLdj-0=?ijeK4~G%krxq#dJ_@Nq7qhoi{zQnZEv8}hu8>HOx_vK(dAGco9<}|P_!mMwcA5z!W&#sKejYf#yGM@%MXbT))(tnfuf_mfYG!m{*%Muc@p*i{>e@v{S_SOLBW}68`VUXNQ4{{;%myGVL=(Qc1&Bw@&l;e?DCUgi=vDgFNKI`^Q$_DC}#K|D4jPO&V)fr?485WE|!$V4OJ~$q{vTf<$WNH1;5*0{#;@}>^{odcwrQZHX9;??@(=a&4ehsoemOVP`(4aN0G+pa>7+d9~Pk!qp@G0*r4>1C&{)|mU2E~Q|NiSGt`w%j_(uzPEyGz-`aH)fZM=}B!+QOsPd~5ZiQ5ZD|(i@Hd&|;ejVY5
6Kl;zFyYuR)b6ymQ|gSL8B1dbJL>E6v*Jub_xgqsBvExhPe722#YNKCJ*-!PV8tThbtAmwlvQ8(`(VALk&*iXagVnr5#h6Xf{Anw}>2RtDk&duRQ=xKfecv{=@zF{)Xxo^%G*$qmjgCK>Bn6f%U=%(Z80!&q9<;NOa31oukDuD;EM<0lvd`cJ|Xc2KY>gdQwmvyE0ZB=iB^CC|0EUj#6omVsn#l-9v}&uH3ImRE!pws)5%pyw||u%SeSa!Vp+Iuh`zZVsM94eG5;h_}1EN(O2Ye^3y<%;odwp)3h|<z`-OT1zT?+{D4N2W?qz69|Lub35k+!x~T)v|5sg<{66bHmNWf@%@d&|-t~HbyhIlq!b8;{Z)}Ht=QZKh7zL9N!%$7w#Awis99mkDf1z4uRke@7yT32;RAk5J|$zxhVRtZHJMSqVxVN82>{~QCYmee_5%ChDK%%WQo-!2Z3JKc)>jRJl#xw;@P8#*R%j*$!`TLK(XXJFsK&nQ~X3_^Qzs-uJt?RJya9;Kj6x)6JVAjx_|C^{@vkH=KT0lXiULTazB?gRs1muX3TF+bPfMY@C(dur)T#|LfS=snNJQpR>PmyKx*EIb~>i?9F$KW9_lR;0;c@K+*R|m77kO%k)8lm5laZpZk3j@n)IQv?DDLzdG6g*{Yl}T6mD!o>r>e=zQt*4Snt=!^g;u#EW&Op)7Xp=xOB4A7;G5HM24URPN~p&nx#2G0Ylw58o9#n~_&S++l092y+AdBM~4}lzT*JBKU9#M)!OYBZ4m3{9TpUC=Yv;V~*vdvn|mHroc|O)EyNb^#;Al|7??)E<{@bN$Xt|>-0q6A&n9pdTaz5F-3F_VPoU<`X=sO0g-Ov&T{fRN04LGQdgzU3ghFvTz~W~;>=lR11R3+Xn?kc+4JM?6wjYh?rsDq);=_!GjQu2$BUh8X~Vq{zP|FE>U>9O+;~~_IP-3?=j(`%=#pY>pt{&0P*^pRf*Y8v4TCKnjYeIH3T2EUf~Rs9%c?o5*R|C@k7zE(S9THA%Ou9b19B8GM!W6{gG+Qja-Tj05zEy6vC|M4Ya~U*T_0Pd3lOPI4Ao+p{+oh=KIq%+-Q^q6%i^*4_mKR>A79Zj~fA>jmbs0Z{{@91?+J0+am$t#lrb9+U5h$Ra(%=$_h!7Z^iF-PYbs<4@^v_ui-`L0z5WaDj;Ouf|rxOo6gD@}_j9s77{^EfKAH=j?&(8hmWlv%p9#j2h=~X0bxE*4_2gv@T=7#yxR+tvwEMct+@vYBs5s(N@e(>>Ts6qAjs;V4blxZOoOqo6aDu)(?a2{LEiLe^LOcw@r&1V(5%d|OkDOWA_OIp%Lcur;t>(>GXR=&Io`KP5Hgqie5eiRbZIzxYzy2)OKjW>&+^DRd5-dw$yA#>{UOwJj;4Uf!5%I>Qz;$|6sdvj0rA;rrxCa0Z)dd)+PLYZ$`-A(U57=%WfISE*QxDL}TUUE110$PA=fQ?E+cd*l!T0vV-ggWZ0W0mqAAzFD<-zpc2eSie<<-wJH*tQtTZcvK9fx}et#{6`(LmvrnM)gISC8l{JNHNl9oEb<8zTHl5?&1QeBwimUim`YVL?#a6(a$lJ=S_lUmlpJE8`sOVBUlfmd$_%H8dv!0Eb`0Dh5e?f_I2zHorS|S;)-Bdq*DU}279qGU|4faX9o%_qlt-XSziUiEXH=ylovN&3Sa>$67E8ABwk~t$n1a|)IQo&@+2yE+b+R8*SK|^*v{$CmCWRAfN#c4Xi)*_WIItOM=k-z<>W9`OvQR1x`tnxeW5_A}fVc~WDJN?f(wRh;LFD_W>{5vI^uZkz{B6|EWNdwUOSQT#CuDH)zlgzL0p2Wpvd-r;EGu&NC6oV{XSU`G}T!Pz-(Yzb~XxqGwWFg*bX2S+?_E{iaAbJz>qSSQcFdC3GkiaU`T%l+fUPO!d7D@oTae9QTd>{t*7nwAbAFKpcDQO#>s#In}~KhMRWhJ{?NH5r8Em2{g_cm@m7hnzl6$8YZKBTBJq`I0u%84?z}($&lAclxgn
38#eCIm1)rsHw(bt)6d~KE6|n%|1KgB3a*7lRKm2K1#4O@G8q<)ztKkid4(7i|b|XXKVD<7riSUw!R2u&U4y|Ptdi@UfU7Sm5<-a08L3cRwrxac5wrBK4fmMM@-@qmG1qLJNOf?>!0M{zKRp6G1t(HA4-F6>|8+pamSK_^zpW@rS~knTMt*lIcwKhxR|S4FsClr3t-fnuNrJ8)%TmPp@-U{`0s8$afJ+t=zu9-}X1S5l2SDE`694t2Ot3)tND@qfBgJiQVwC=%W30XuAh(xc0ps0A`u;f$qbhgkIV0FMtNAJOFw%x#0Ate$LjJN@MT8x235=_I|O8=e1{YFQVTRS#Gu5;SRyhfu^0h%!ke7Rr7;#j8yc76)g5rPM1FJnT#oO|AF|=qWPMR_uB(2C8Ssw%L%|VtD_QFW|GM|7o%JXVv?=P~%%j^RSvKu%cMm1&7IIYS9F*m0aP8pj0pW|;P=@g`NhvBkr4PzPPfJxi$dEySLo&;iqpP>RPzg2nx7>d4W12%4Xl&FK7_j2n=S4)f{XtjEi2B%NpW)MjQrp!#m*bBdKP$G;oh_cHpbK?B01=s56k~E8#FKj!N&u3954gk(%1j1#PTZZNKatha2|w^3OQJk$59ha$*Rh@qb>Tvhdo`{u<7<3?Z`ml_9_aO^`|OGC7LJSKUeQwU|UN#drH-N0W)M)PPt7s^!-OZSFoWbgF+li`W0NWgM}`of3h1rl7~dtJQUAwROW+>qH|2QKEPsy5`G4Pd&buAnrc~IS-jul>~QjP(KG2a7IKC_%v{YbM-oILHWPtWIeqRnexFZWoBKrYWxeK%ndp4$bHCn1!_*EiC9_j`E#NROI?I4@Gep?sN1EHh>~&!NVlhMsEsr}V$!=VliQ#jYCkVbM3fha(%r55j6$hl^3ZZqf0gBuSabWMZ{mo6oYEL5#^tI;vZUP18M#6Be{I%$Rzu5Q@oTGlTXOw>V^Gp7=X~HKn{<0i8ph4n$Y6ZbhF!gjf}W;b40N1JVVz?M$m(5@G&Ty~8tN9q?Y5e=eXpD<3M7C=^={WBW;a@?qnNv5XCVFGtqLb#JQtPFw-*Nvb2GUcpaOtqnUMy()Z?vK%t%C5IDncMGie*#B#6ImMc?7^rsRl0-;se!gl*#xJyipdHh~0ZP{w=RK|R}_T2P8KykX4-#&B&Lyp!YqSH;@OE!yl$7*>3=&mthvq=S+`*e0lY>)JD#)jaAEN<7NI^&cIHI`VZqLr|lMZ+FpD*Od$AZ9Xm7R)jK_`IfoA5Y}f$B{_q(=?R-G#tb&bgFpjy%{3G?%t=lo?Nv^%P6SFVoCXkx^AT5b?hc>yu3Voz)6PQD{PzT*5uwro1xO049c)(y>NTK`e-wdP(}t02!%HNMAxem;yRH#d02&4S1iu_pgXBR->VZT;DHfgZRdAgVNX3OH4pJim$q$JI%Hizmv=Zk!4*pv*6PKP?TZhOT$;-+Rc&;U8c>Xk#`0P)ne<2<;!wRV{Fesy(bDp2iDRhbM;hDD(`F>6Qp6EURTw8+erV1)SaqM!MoFizzhvt10Vhskv33xpbGLqz%XEgK^$T$0^Cnv)09n`6$4s*S!!3i2h=MSaO}~n8Pr+oEWNMeWD9}H&3LPCLvUl7`$9;9%)f#PPr>Vv}*~M%o`elF%65o6yr?mSnA#+{&1Xd#!Vx2?4UX4g%)K*+WuVDh16sA$qmHs;6?Bo*wSRu-679DNb4|b1vS_bAC0j`yfDa=u`=I7P%f*tOmBEyR1ZLQ8zDGtFG=Q`l663d_MRPGo^+&?HB<`Js9M~%TZq`jmNaPEN7!_MO{frBc%$TWW>iMmXO1Rv+Z#Ch4U8q2Y4Qt3JKR=yg>wYJ_rn9$cun%X^Z^?LWxWPwCs3sX4z;;)n)WdDWnpF7`J0M6NmMfV51Tb49U&7xT1FOUB%FFbBJ`!s;!tQ*C1=X|E7p0j6hVMAo6^so5UTZ1-*q0M1!lhwcB~e;;7bTP0wM_v}{dVGMkcvQ!M|UzX7KfzZG4kkzLI?zZ?7_Y3b}+R
OxocvVW}mK4G=BNh}DxulXF642+XOJ_{^o~0r(2#t&rCBO~YR&;92w31z%hgv^%N_oS3fM8{co`ZwcWJ40^Y#X%T?Bq5JT)O3H*@48XLm`EuS1+V?`vM-9_#6c%R3wr7o6tWgI`7s2vvC<74cF((30o!;*M@1%U3Zo}i3SZEZ`w<2u}%O?KP^`!p_nboa;qiw%)<$~GNJVwP}bs#tHf{O9#d9;Z{ok%40W#+%@L>|ie*6!ghDL&FHcOP*`DZN*cmx&kO<@L@4+&h@ST#O0f%?16~aAok4(x=~|PsB&qTlNh7dzN)J7S^QktZ{lm_@H@%4aAhQpXWVnns_iR5WH6Usv3~hQ95;VOqFxMdtSpI-~!LRL;U=S#HAZJzuLwjr?M!=orb+D%I=pu7HsFl41gz`hqTb3GW<~HKw}etG?~&Evv^5piRl=9-p8QV3d17JmwL=+G>jgHSICjkPRCV+=&;qbQq=ZSk$Q`Exrq+{bHBFEKltIHV>_FjKFAh!90GCwgm*n7S{q;H$O>Pn`qcc(h{nKIXBs)K7rWl#JeM>U0)LDQFHJ-To0B>j1GNTE%*J*=|S$H5$#;3?w{5NI7yjq$$dL&M%^lq)7A>Jp?rz`<6}SZnys^?8VgIWSv9mg-bt7{ODSWDfY=B~wzw!;VY_~FRN=KBurn=@GxJ`^FlZ;g!Yaf2!)afDu#b(W^hV%LS(oD^LwA6!@u)O3%TP6EcMC`$CW&$mhjT1R}Z-+>O46>6zLW3OJ6WMX=NW#HsMas7ws)Jk4GK%fcA=^sC&o2yUP9rv^drQ%qDNM{zFT0@L4?Y@1sV>hx1j3TTb&#rkKa#Rvs9$a{1OLol<^kM%#rK{zXXX!=>r>c&P_(0Pq%Yh<_14cT+_ch*XqfFZBeJ311Dqw^MOMg>1()hVWy^{=eUfgp$XdX~p*UdC|P`#KaO~PO>UCOY97>(RGHM9*P`#X`k_fxGAcDV5O|J0q4@Vh9G4_ef!p?+pyItVH>7@N@P{Qp-~2`OsT|=$I5aA6I*MbUB*tkzaaroYW~SS?pZfND|vMqic;O8sqCeIr`q&()`T@Lx+~!4nO)d-&HGC~WWr{k6h8PP+0WpRdW3tI6^5G?qhYn8NjIbm$&^-ZDgf_6i(H@g=<%&;IPgDP{h)l(qd7f$3P^jE>tJUit%|w5%Ts+pTv>AGF)EXN%SIjKw3rJ!P!bYiaU-+d8vggfm#U6QB@bK)eW`j79@arrBhiEp?!~;E-a`5(L9&i}P0%}Q>vv4Duyh|VG^aFLVlfX7nXLQtR6w_|fk-B-9pe8zvL?<%kduHo<+bj(u~-I}G5H2pPP`m==NcD5c`Kkq=NmccC;d{iCQ^?t&hfmM7&ENkc%NxtY5l|G0%c!lHx~`r!Y^jN#-v%p@;*`r@Z!A?%-{MFAkW;CC`6gO40dHFoE-Yexsal`eS;(+su|9ILSU@PzQ5K`*hAjx2wZKxp^4WGaTadt4H8bABU~-CoB{E`sD&JVM#2hTGmZ&RcDnll;!f&)vx+P%iC_ycENO!h91>wqnoG0Bi2T{3>$Gck$MAB(C4-rQs*u&qXJTMm{s9B@a9gzM0YPl>(T&&Cd)B^huO3rHTo_^p~)q&_eyjso6Jj8D`CgLMil;g&muqfid3LJOMykvr9%X7UrsUP_Q4=H;f4rOy~r0On>$lMs&aDSs~*`~+GO2t&5-rHeL=bA>IB@!Fw@JRMML65PnaZ|wzcNWCu$@Pq&=`nYiDQA(0z^koiJ644=1oR8s$_H*JFnwa!4C||Ozh&?rUe7r?Vgca7$ta&wlGb>?3IeBMpu*KbDvmzc1W>yc}LYW9wNI+G+fFQ_yTz_~FLRG8faq-}0M#fEGk6g28OWR5o#RCFg;z57DK}Cu^+TYlm+%JUU`v(jRM6W_wU$w{*rBnBJs_X8b`&Dc^^t$QWJlzEBei6X@SsG-JFNj$~A2UH(?}5Aq5WyQ={lFXG0sdJs0hq*~j}I~z&iRr=O9f5P-Ro?t1aApO0F~2WQ~Xzh!t@D7
9;eaU+E$T2leJ=q$=N`&+2gW~qNInad5=q?+=q>;?NqOR>N$sb!TqI~xv;T89w-aVdW1W3zAx&nTT&ml{QOP$4;a+(biTnZ(Ioz&_NXwM690O{F^v;uz%u$AUk#FJr71-;yWL#nX10kapinHC(V&R6tQKJPCgR*fy$!7pT1pq%=ef9)?!(SW^>T#PuVxLb1%)TF*9gSZG~(csZS!%T72!9@klNujpS~|B_Nur1(+ki#amR+>eQIjoF)sJd<8r$7JF3L1F_`Iw%<)oe-{P02r}Bg&!~0YPOUo-6&`3XxR{R~15Bz{mUUv#WR|8)vj5l%}m&Y`C@sqW=^s={X&Z^*Bg-oj-(H0n;Q6Vf{ENs}s=UB<6B2~SW{gq)dy&7~S)O_#NCK3h#;Q&G}WpDk7rr`F@g=F)90G!=fd*>1~)k`rm-UIDi}cAtDpEgEjnMNH%*=406g(mwZXHl^9$u)#n*zvslCOciE44JSg=?>x{#7Bx5_zLQ3<D(a-$R?u?LsRqdEnmhA_v@4v1?&!tAxzs=&GG#13}0%A5d8^-@8=B_d2Qdx6oaNM%lHIPv37;r4Vc{tT+QQVRi8h>Ecl@KhL8x`Q|=UbIwJ`vlg(O>*J`;`U@3Yt&a-xa;lQM&wUt0TbHt)5twWT09!lB;C0*K45Khn&LOmd{qLaYtGZJiDlG(-nON}ygiAPHm>2+DBWGh#*23F4f*Rn~#u;vjZMk2D>ORUGS^5K$;K;UGqe8aki;!s8Uv=3u1EOb5xv=)`8vJG1GK|~v*jS|cQ3J3LCoG=UgWp^;1qYYHIkx%~v?5}$Ze$^{}m4jC*&fu^-6wjl>XIQGANEHi1;dJAXg?=iruJ_HnZCI2pL$-g(gIL|02v=U1)qb6nb(YI`o$V}8sJ3L-^}vKv>hq;dS~H`|VaptX6zL?-Zd@Y#yWK}KJ26Y`VLoK|moKmUJetCYBnmhQ@|Q$C$=-3L%g=;yb7eT0Q&*__b2%;WN}k3ZN`9&WdV1vP1>KPEcZ{eed}i4`S2cZH$15Pp=$XZr-AMc`ok!IG50v`H2W+=`>F7a0kqt5udlQn-Jk0GNqjO=E{q!JGx({CEME#UdEJc`=KXVz+NTqH=ZP|t3r4MBu=n-igW!!{-WW}pX9~PZoM$j4YI9mAL0Y@0F%R{B(Rn^G1+2^vIrMAm}qrHShpaheEscfw8d3>{Z$g((PNJ)R8WFneU7H59)xgGFTM`%-`>R8K)K$1tsc7Jo7@nGyF6JhCvlBvU-i;?U15xH@lLPHA*50>H9_<=nKIT0P7ap;&+8+0TqTu6-_lr#fBV3?prEhGt8ZGGp6K?H&LG1Fu`vvsraIh+-Y7CYANn5qW>5(U?lAETSbRYLggFeKtpXpl5q&C0{iKDcT}9Gs5c7xvLWOG(NCLMU3mmkz8MZH42_ShWjh4jA5OR#D;On{tegN6o16N1fgQi_rB$m?+Z+@6Lz+_OLj6^N);o1r>kB{=ai_j*lDD3E&c!EztCXpj#OIB1;y&Dr7Sz;FJoxza%8Tn!`2L;LvSsb9e71h;2!eQ4df|zWxs8JjCbB4gs7}qKXNf3(o`LPyogWGJ5Qg)im_QMA0H@~tD6b8@*!u)#{jkGfRFMTVk+}N9t<-nKV#6Arpu4g{iF>~)pd#S#yCV0@GP(ZB9qKL}z0w=wP+dbZn3}HqYgEzG-qA6-t%sh*yqBF+2glT>pAczlQ1l$`ZNI#W+YYGnu{n~x(Gd@L3KCfQbU7|K0!Gz-eQM`TQ7+LF_kks(2B-po_wsqYJ-NGNT!ONhy0b2ZhB>7*C4ba@pcD#YlJc+x}~%>XVM!V^T^mQRWMVtaq@?42<>W2N9c@xFplq7_6Ytr%MYYJnGIHNMG;@4oRi1S^oHn5Hf=-#75zW7r}~{8+NSd3q8$7(dq>MrMTU3GiW=OBDpcTSls2!JUndbzdyLMk3){!niWNsO5Swbo*Pq>$ZP9x^T$>JR-~8>z3po-ixXst$F)cxH0T6IQm!=hjdYyYa{4+vGFc2Ai`f+b
-~cU6G#u_JfOpy=v0?bLrz$3wY}FEJY{$y)_K%9g35O$y01?QQs^uIQ!T&cdLkzvMG}^NFL4)2rXm+CuD_nYRQ34-yI#}&JPx+Ml+~b2LYAejOfVR|3D#kjY3sto*i0ZrBxGt`+j<)<=7xGqQ-H~n2$0~S*xCEua_RQHz6pTCWI>VRRlcYvq&r8Ke@i5+!mQOUG#|(NcON8@yQ6+9Uz}@dA-Iu>?Z{BP0%P^(V{-qgEf=?{TVz`&9Dr$Wy|;t{le8v=X?vkET5XL-#TeF^zXE$uI-AC*9Ek?{2V_=*oM@}kgk=b-ahsl&uA^wM;(3Oz+&Xac1l)oVF_l^t1mia8(<9%5~1PjC78-Sc5Yb{*pZZauDx7IU~I2x(2B&GZ*&jZVv(uh<)t(Bry|DhcVk#9z~s&fBM9j6VVLuEN>emN;s1KN&+S&)h9cfPA3PE~ka;n5ZN}cpU{Wa7kozJC9+8I2*8K?Uk7v6g))A&n`#g~9;QJF#5=R>}xYdUx-OVxrp2;Q}Vu4{gcpb8(|B$!0WX!3mEWIn&3JZl1&^`4B5@OJU5?jaUhB8HbBa!!B;6#76ki@?1P&*~puSI>^7`MZ)(UN4w8rwu|k1R6@$v2t^&)6;}7w7qc;Az;Yc4|*H!)U_kLdM%Wb#sBb-Ka?jY!o?P#(_dp)aj7v2Z6All@-><%<7khsQH-~tX=!Z$0XiJ14mkqTngCCf?Y_w^c;B|-kfsHSA)aRb);ddUYirO{`vK1?P-@3UgU_+)GDlLZFeWTE;XT1O!@mj6pF-^P}e7LZj~zK7wpx(t+b1ODd`aBCxX?4Y^{4Wr?JlURc8zAD%c%5J8iK%=IL5W*GY%EXJ!fg&G*eSn(dOco50-9<$?O6@-M&X?YjxTjs3tc80QyDv%}1}FeM)(_y^NU3LZuFm|j%qF6pE^HeS(X6yds-_ZtEfr)Vr&9lQF-R7M@+l$4)#@Dr0K=SM8z=B*3;D4;-|#BlWFn5(Fd3nsUK`>@xLI>pkD&zU`Obur??#+FEb{q?aSb!#2WoXIG`la|Hobc?@yT8TNimHYdU)1daMZZ}IRyIspxURB-0l_L?GsX6rOuYUsOA}z-a8;K`=;B{(jJZPg5u)*bk+4|H9m5_ejfMdR9KQS{vG*hW;bOLO=AJy@2WvUnR==3w55*{Y)q0e?9?-ebyAnUkIft}InYqjtyxb76JL_)$0+%d@k6?`8F4!=V;3)};H19QY-n)Qr0VWgwRb`m6gwGl2Xf8S9JFM1L|=kZSdaV9|;USH#*yzP*j_K;q^U8XlJbWP2a44#@iM01z+=+w=+39V?8G`i%9E4EV*)iFceOnBVHH$E5l{5b1@3`xo8E@uH}6yus4n+2&9fSKvTiZS9M{`C}XY^Zt1O979#rkRM7FGn<(m-87Kj0zm(aHv0gQWGaCmH@pr>!ym6h#4?8mR*SKlxxFg@RrRr+$ziZutSqC|iM4`OPCdf7&ch#PDEz8zk5+vuFl#qNDvAa1qz(z!)KH%CsaIcU9wTuZY52uFpN^8%GiC~#zkrkqeRp`q%mg@WvGr?f6b8ZXGxE=r%7=olDN!T_i(c1%LrC)lLfe&vOK+vf@`)+B5&+K>13%_!lJM`U6h=$S%0}%byCA3U{EcJ%hZLfp!*vQd1#?X?KU(onz{rghp_twPd}tGhGZV$a1$Y3=~3Jd{6(T$1p&6hxipMl3j_v9tse-GFjeovZg)0UOVW!aR3U(c;L7Q5T~yL^YLv{635DAP1>rV>Q^R8u+sPnYsu7C42QhJN**%QV@ta<8O4SYgxwbU_gbeqGRg@y09Vp#)yPYO5C}ZG;iS=4i904Qk&evlp6oJ7tszNMY1g2ZY5olO#u5k%@5`rL)=>HOBUyO}vhWkG&?&20Gy(FNfhC_U5j4^*DemdfbcfVDEy}o)cBdqb+K%_&B?ee=#PIQDmuB02<9uSWW{2QIv8#f{M}W)id?e5We0@=N}sM!R$5OAF#Ktpc(T27_ojwCdp^T!R&lzIk~c0E*+EGX*#jN
pwHs8#i?hbSC)k6vik+d%XjLqn*#K&Zj4gf#ix}iP>-5<1AEk0{0bVo|Gb3ENup81lVT7of)@zT(l`Xv0^?2I15ET+TLW|T}02c*k6HA;a7nkFfB%CEihoKYn-pwXG5LrA}th*R3c`VAwftG8=Xp_s(0KIp(h1dif!u6r90Yf%DFSG{N-08Qm$VWT_|rsypnZ^Ju6G{PJSw$(BG+z}wbFu9T_GQXdF;yaic<4{a2zGWclGb>lRPQjM^2eIevk40LQ9XtGFZpJjsU@h465)MaZxdT4uW@sST1KJGzU3*U64fi|CKhR`SF?#jJR%gc(b4PCyZi%7IP=VZq_qoz)U)M|V!=mlo-@{ljMzUO4n>d+5{Y(O=Vb1{RY^l2|B`F!zc1N2qCz%?`(NLncn2)9opO2t0QtIPVDcx(X6#Is5T})Nb7#RA^D9yx+sdEzbni4@WC?@BR554PZ686INRvHPVARhE8oz>ZiO&E{^nC%!`PPdrPeWPfL&|Qy2Z5N}~#HX+l7FF{k(mqy@)D4j`e82Pc;K5UXi9PKa^VR|QX{2!UFLf*6UA;_6U2OaW;isDkG*9jGeVCAnmjEUYlS}e2f|rhE%Obw0&iLH$12e*E%I|6>w+;w;yw9jZ0%8uAri?^@5;zck_bMrta(fTuaGQgKkZf}YX2=s$)G~ntSW>utq@jAD4)CXWOv;ibcyf*pcOGp3v(yH#YCcZPqS2C;H^b*!cn@*CF;yDoZC7*j(&;UK*eDY-Pm|y3_Ufn6rHI{OZQw8q74CU?o?K*WC~t7X<3E6U$_Pnd`;}jt;yb!)=sVyd0$W`+Ro~>~+B0SGE_9o#N~@5YSu`JM6t2n7Vlb}BYhtm*d-BD>xpuQJtk*OaZn1}9Joyi7iz6NG`{^METE8YUsJJ$BXCby6@6C2MAY`BunoZH+|fRu)5KQ>%}e*nJdr64tHu%b=_Jw6?^(e7_5Ft6tcF&2M5w|Gs>A!_v^wQ!|ieJ{6IoJrIqn2`EBGJFu(&(A3?V90Bz*zq%_#xVF83*^RP2u{0cBSAC1M$aFWLkR_xj+wBkyr}8c-xQ*@-kTt`#%Wl2E+2W49qYPVNk36q)!Sf=5p)|=YvYz1Mf3t#`6{O0#4oLv|exMWA49fM`0?X1YGO1=)z_Hea)#@jJ$B7YExX^T|RUIAX8nNt0{4JBB@z8qGES~EaF*J1PUWT^z_SS#=Iyw>&5U50WZ1lvtk&Iggw}&94%gp-d0%B(>u!fy0v3eWA<(j@bD*`?2PM`)@=}K8{UnGgLi?r~Q>HNjP%^Wqd?{Ui50uZrN>TZ|U)w~8RHF|Vi!cMA~cd+9q78nU@CuYIj!2)i4lqZ1bqz?&^(-#02;Lf(JWB*Yldv9!)Sc*vxojypw2hVHQjXftR+xv1BUvH0H3v8))-M@$4xA}rI~?s5;lqBWB?7|uyIL2;rbgaBCj#jX>+0H7GGJl#?p?cogNEDqtg#`=Nc=~{#%6$vkL(ikxJTKGrJ)OH6RK;?107VvY6mG^%Zo_9h=lB1qOgNtINa+~7M}Afr^Y4ww3Q6qCMtaCK?kI+Y?&ufUIe6|#QG1cq4&;DdDTX&-J>V%ny2R&sQ_q)O%Do5ll!9G4oay2a6;bl4d@(P9A>{kiioY?g^-}af}UkFqTE1Gzcybbqf?L%=^OfiyJ(e!~020-Pc%3u8$fYa;Y?d?vOx{P_TZAisQJF4Jcr3k*c!rF8-^H2Z(0;PNpy`%_3b!0c5DxmBjnDy024-zn@>Oz2BZrUKN!xO=LXTmOhm~}^^9_vEU=+r6WdKdA_58|WZWLAuz1F0De-Jw&Y;A}xj!G&%g7*^1R&gS$}zF(xU*na*l8O34ZXMu|Wpp4xQmzf^Cn2@)@w4YV~vCN0875SbCd1B+2z8p#$hl-^0n?`N1Y40`h=`_*B{e^&a2Bq_L1$mVQ_rPLOmkfv)gs^NA4~a(coPR#X|jyA1JRgkXDrPBzLA7bc*&-9YuEo0J&T2w{jRKNsn2o*PI4YH=RnC79eo;z=l0!Yb
t%Lq`dSPy5vS$C`k4vc-lntMBPbHGA^=m~)zAr2MntB7Cv8H0vWECMS<_!_81Bw-53COGLGV5vP*ru_ywtY<&P)Q5vtgiW_VOqDLM#JVn8Md*1+f+FSNPT>zVb+}mJbeiaQ7MMgHlj*zcjcCxvdMTW5}GH4vHv;jE*a>$UgtVmQ9-YUNL24;`vVho$+#hYuH&#B630-oiQ$IMhBaoEXZy#$BQ_#UGInxPWOZq(?*&u{&Z?TN&bYr;r;=ew;ffp83*ec_{W-4s(Wi3OVTA8WSQ$h+L8{?kTNpQ<(nm7EsH5$+a+14#X&kO5|(3R}OJym=k*@7>Lv5kQL_;1CH4qG27+_^XnHB#B`YnPpXDTj~^UEveiMQIEP1K>u)HxJN?hYOEq{u4t^cCU8`jMRrsUL4huZLAg4nZ&e?=3XaD>B>VU$dKPh$tzh*C(@4(lm?86Kg1r1CVE3AjAWFRf!Qzhwh=^*hai|brC)Sn+O{=-$52)do|!cRHG{8eVxTr`d|Db=x+9#pcBlMikaeTYzeBYV^EZ$#^J8{nzvz}c-n!!gC8YF@Dh!%}T5tBimpIG53At%?wIQ_y3XKH{(;wu^&Kj++2Vwb1uY#ZD9{T_REueZ;Yb&WDOzkd#?hxN&m$INRjpFvyB(b7wybmNH>g@6YY8}$JV?}B!<9%SaiTL%*+UFQ3qVJ7qtnDG-$!K$1?(*REcP1cHu)9d{1;Af-h)S)+#z5U#P;^ED)(coMRFjCXx`Poj65#r=M!JlMBmTSkeAYV$gJv)s!W2AC=PD0srEuG+m6WhXoLf|F7g3{lbEjsN8g!^N;?=37?4xdu{QFli?_z^FFGekNk4!xW=^-T?<>Z}zhPi7vu3QS97UiY`V3`8F6lE2hPEdv&ia6iWNo1X|^18?O+-@m{7jFdqoWiR*IJh3PrafYA5=S)f)k2HLn3~j|5^tOXYJEwVH_Bb`>ZPM_OL&?_=E-aVe_8Gz{%|8RHu5tIWHTrQl-O1u4*(BtkU-iv)%0=pXRJ`)!)bdV1JLV6Hfhe}x_a1r97}VXoptL;ehAb6ZQc9tG?s#k7}EUeE~)h8hbRp*wc-gtBv{9g(LW8JWK1Y=&3RATs3^EQ8zevd=Nej@}u;lBS$QLgY^S3u}U!+?FZLEYd^P;pJNxe-m?!5Fe_a;aDertxNg0D#X8|GM{NBNco26#!T^jzJd6sGFpE=a&Bp?ack#&11$&6FgqrLu2Z!TWf=i>gZ)D{opIy$C;|z9P9-Ji#KJlkTlXBTbcs0{z_w?J$#RsWVf3&7~8#B4<%ztYL)qVI9r-M#H3jRIAfWiL~(aUe05=aaLbGoHuli#!X3~)yj5W`+f-*n#;@EsH86CNt<=$*RQoTkL-2PUsDsO}(M`O#_i8cm6(mG04UKyPK`U||83WBM)`0aPA4k~}l#91C}>{KG-D9aEFhQl8T*B!_<;mq=iikon&*r9ANdUFIY+~{TJH<8W9FyuItvx5V>`^&HnRx4;;<_V$aZOrs3L@V5hiWaO+eA%R=t=v*)bCA5Cx{W8j6fSWzm9zV19qvkZ$J7$omYa3iXdSk^L6GnvRJxzrIi&DzDr~U$p6;iemvKHN5S5kyDQNZH7Bj-1E>*cO{5-2<4I1_U#Pw(v@hQ#0O(=)+u}?N_2^OnQzr*0nCTqo`2xx#L`g5xO*q<3RP55m{2i}%xb)hfnTNOi~y2;lg)@RHTkYp3o8XLD(=nN4Fu~g^d(>I26#{*EmYftKT(AE-wQs?!N{6gPy-F~PmbQC=uo&~mr7oIINS6HzHQUfZuAxa{Qv!-Fvd)erG(s<=(R)+D5?zDuMZBxe{gD0_mZI22Ji9vKeh#j%`nzGeAC?r^J=5@CLyRZ$AX@+H!yTTW5mK<@EL_#e##}FOBCt##run9`*2xAu-Fm_TcK_GtQ-~-V-18>)LCVQFlkC>zy^_i_hc8Z!x-_~qMPZuC{%<*W(oTi3mf=b?=4X{f}F|*Kx)jubTt1JR=f~Q>S==O0PMWlt=Czq~8a3ELm6
_nHhkOZx<%27b#i#&Ma~)KLaUb_7Vd<-)ilC>!P98K^!8@kLvU4#$=Go589@b)3Myd#Z*nqRmC5uxyTCGThORbW0&~uZ_tCI0$T6D9bGROZ#PLFO9Th{71uYNFQ|Vj?c|yHUvnDoIK$g*0rkL!|B`pC5tz|e6*RX5aT}kp_8r@fOJb;$&mR+cku~I!Q@2qZ`Zj1DB#WUYgdg+fsA_$E!HY{O<0m_hhX;)DE&j2=6FlQ8nk2F3KU=SW&{Xev#ED`=tDooF~2sC@@MJ_zC^PL8+@gI;Vs?sGIP2`DQhgDR|Rv~#Z)y?bV+c-!|G?Y%lJi*dE4Ky8(Z(jZ2n$4Fq=bxYKq1}&J^r{cv+xly*NLni!-P~oS*1!B7q>8i}?X)63cmC5FNgQ;{=9~st=1Pj&*W-kQE`P3B`T2`RCNcMOn~>5gL3TK3V2e&u(ow*BAAXGu=bdAIec+{Za3X)PcKfG}8BOGgb-$`0A9(6h8KAiK+mtC>h(E}rhgW~^ln?2#fCc~njjtCA+`j?e00FO|0h-?h+_tJ%vBYQl0ssI200dcD')).decode() +F=__file__+'.__decompressed__.py' +C.cache[F]=(len(S),None,S.splitlines(True),F) +exec(compile(S,F,'exec')) diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed1337.log b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed1337.log new file mode 100644 index 0000000000..3d5baf90c8 --- /dev/null +++ b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed1337.log @@ -0,0 +1,341 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.095 + embedding_dim: 512 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 13.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + hash_embed_enabled: True + hash_embed_size: 16384 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/b6fd16b5-d9a6-4661-96cf-445a6ffef7f4.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + 
mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_freeze_lane0: False + parallel_identity_init: True + parallel_mlp_read_mix: False + parallel_residual: True + parallel_residual_start: 8 + parallel_skip_lane0_only: True + parallel_start_layer: 8 + parallel_start_layer_is_physical: True + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: b6fd16b5-d9a6-4661-96cf-445a6ffef7f4 + scalar_lr: 0.02 + seed: 1337 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_adamw_wd: 0.0 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.01 + ttt_momentum: 0.9 + ttt_optimizer: sgd + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Mon Apr 13 11:44:16 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | 
++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 34C P0 120W / 700W | 1521MiB / 81559MiB | 6% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 32C P0 116W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 30C P0 116W / 700W | 1521MiB / 81559MiB | 6% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 35C P0 117W / 700W | 1521MiB / 81559MiB | 1% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 36C P0 124W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 32C P0 120W / 700W | 1521MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 33C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 
00000000:DB:00.0 Off | 0 | +| N/A 33C P0 122W / 700W | 1521MiB / 81559MiB | 7% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 80 +val_tokens: 40540160 +model_params:35944602 +parallel_residual:active=1 start_layer=8 start_mode=physical final_lane=mean freeze_lane0=0 identity_init=1 skip_lane0_only=1 mlp_read_mix=0 +gptq:reserving 13s, effective=587000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0095 val_bpb: 3.4878 +1/20000 train_loss: 9.0103 train_time: 0.0m tok/s: 17735903 +2/20000 train_loss: 12.3894 train_time: 0.0m tok/s: 13028102 +3/20000 train_loss: 11.1305 train_time: 0.0m tok/s: 10836833 +4/20000 train_loss: 9.5982 train_time: 0.0m tok/s: 9849736 +5/20000 train_loss: 8.4230 train_time: 0.0m tok/s: 9379024 +500/20000 train_loss: 3.3704 train_time: 0.8m tok/s: 7928215 +1000/20000 train_loss: 3.2783 train_time: 1.7m tok/s: 7895951 +1500/20000 train_loss: 3.1811 train_time: 2.5m tok/s: 7892432 +2000/20000 train_loss: 3.0970 train_time: 3.3m tok/s: 7894464 +layer_loop:enabled 
step:2062 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.1380 train_time: 4.5m tok/s: 7306914 +3000/20000 train_loss: 2.9156 train_time: 5.7m tok/s: 6904348 +3500/20000 train_loss: 2.9500 train_time: 6.9m tok/s: 6628924 +4000/20000 train_loss: 2.8255 train_time: 8.2m tok/s: 6409397 +4000/20000 val_loss: 2.8820 val_bpb: 1.1157 +4500/20000 train_loss: 2.8371 train_time: 9.4m tok/s: 6269239 +4655/20000 val_loss: 2.7972 val_bpb: 1.0829 +stopping_early: wallclock_cap train_time: 587146ms step: 4655/20000 +peak memory allocated: 39956 MiB reserved: 40024 MiB +ema:applying EMA weights +parallel_residual:converged active=1 start_layer=8 start_mode=physical final_lane=mean freeze_lane0=0 identity_init=1 skip_lane0_only=1 mlp_read_mix=0 used_layers=3 +parallel_residual layer:14 physical:8 attn_resid:3.1704 attn_to_attn:-0.3315 attn_to_mlp:0.4781 mlp_resid:0.4289 mlp_to_attn:0.0542 mlp_to_mlp:0.6187 +parallel_residual layer:15 physical:9 attn_resid:0.7102 attn_to_attn:-0.0444 attn_to_mlp:0.4133 mlp_resid:0.4643 mlp_to_attn:0.2463 mlp_to_mlp:0.5511 +parallel_residual layer:16 physical:10 attn_resid:-0.0355 attn_to_attn:0.1421 attn_to_mlp:0.1421 mlp_resid:0.5234 mlp_to_attn:0.5763 mlp_to_mlp:0.5763 +pre-quantization post-ema val_loss:2.79752480 val_bpb:1.08300903 eval_time:7681ms +Serialized model: 135409136 bytes +Code size: 26056 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 12.6s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights +Serialized model quantized+brotli: 15957039 bytes +Total submission size quantized+brotli: 15983095 bytes +quantized val_loss:2.82356094 val_bpb:1.09308843 eval_time:27085ms +quantized_sliding_window val_loss:2.78006195 val_bpb:1.07624861 eval_time:124354ms +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.01 ttt_epochs=3 freeze_blocks=0 optimizer=sgd hash_embed=True +ttt_sliding:params unfrozen=44333210 frozen=0 + ttt_chunk [1/1238] bpb=1.109868 time=46.1s + ttt_chunk [11/1238] bpb=1.065648 time=69.9s + ttt_chunk [21/1238] bpb=1.103491 time=72.6s + ttt_chunk [31/1238] bpb=1.096879 time=75.2s + ttt_chunk [41/1238] bpb=1.090247 time=77.8s + ttt_chunk [51/1238] bpb=1.083461 time=80.4s + ttt_chunk [61/1238] bpb=1.075405 time=83.0s + ttt_chunk [71/1238] bpb=1.082693 time=85.6s + ttt_chunk [81/1238] bpb=1.076152 time=88.2s + ttt_chunk [91/1238] bpb=1.072645 time=90.8s + ttt_chunk [101/1238] bpb=1.072376 time=93.4s + ttt_chunk [111/1238] bpb=1.070690 time=96.0s + ttt_chunk [121/1238] bpb=1.073934 time=98.6s + ttt_chunk [131/1238] bpb=1.077842 time=101.2s + ttt_chunk [141/1238] bpb=1.078321 time=103.8s + ttt_chunk [151/1238] bpb=1.077984 time=106.4s + ttt_chunk [161/1238] bpb=1.078495 time=109.1s + ttt_chunk [171/1238] bpb=1.078429 time=111.7s + ttt_chunk [181/1238] bpb=1.076917 time=114.3s + ttt_chunk [191/1238] bpb=1.076716 time=116.9s + ttt_chunk [201/1238] bpb=1.074328 time=119.5s + ttt_chunk [211/1238] bpb=1.078804 time=122.1s + ttt_chunk [221/1238] bpb=1.079185 time=124.7s + ttt_chunk [231/1238] bpb=1.080803 
time=127.3s + ttt_chunk [241/1238] bpb=1.079029 time=129.9s + ttt_chunk [251/1238] bpb=1.078971 time=132.6s + ttt_chunk [261/1238] bpb=1.080074 time=135.2s + ttt_chunk [271/1238] bpb=1.080493 time=137.8s + ttt_chunk [281/1238] bpb=1.079800 time=140.4s + ttt_chunk [291/1238] bpb=1.080983 time=143.0s + ttt_chunk [301/1238] bpb=1.081213 time=145.6s + ttt_chunk [311/1238] bpb=1.080099 time=148.2s + ttt_chunk [321/1238] bpb=1.079912 time=150.8s + ttt_chunk [331/1238] bpb=1.080163 time=153.4s + ttt_chunk [341/1238] bpb=1.079260 time=156.0s + ttt_chunk [351/1238] bpb=1.080042 time=158.6s + ttt_chunk [361/1238] bpb=1.078971 time=161.2s + ttt_chunk [371/1238] bpb=1.077436 time=163.8s + ttt_chunk [381/1238] bpb=1.077842 time=166.4s + ttt_chunk [391/1238] bpb=1.077522 time=169.1s + ttt_chunk [401/1238] bpb=1.077569 time=171.7s + ttt_chunk [411/1238] bpb=1.078097 time=174.3s + ttt_chunk [421/1238] bpb=1.077538 time=176.9s + ttt_chunk [431/1238] bpb=1.077744 time=179.5s + ttt_chunk [441/1238] bpb=1.077810 time=182.1s + ttt_chunk [451/1238] bpb=1.078952 time=184.7s + ttt_chunk [461/1238] bpb=1.077190 time=187.3s + ttt_chunk [471/1238] bpb=1.077153 time=189.9s + ttt_chunk [481/1238] bpb=1.077282 time=192.5s + ttt_chunk [491/1238] bpb=1.077757 time=195.1s + ttt_chunk [501/1238] bpb=1.077374 time=197.7s + ttt_chunk [511/1238] bpb=1.076982 time=200.4s + ttt_chunk [521/1238] bpb=1.076492 time=203.0s + ttt_chunk [531/1238] bpb=1.076437 time=205.6s + ttt_chunk [541/1238] bpb=1.076479 time=208.2s + ttt_chunk [551/1238] bpb=1.075997 time=210.8s + ttt_chunk [561/1238] bpb=1.075291 time=213.4s + ttt_chunk [571/1238] bpb=1.074725 time=216.0s + ttt_chunk [581/1238] bpb=1.075066 time=218.5s + ttt_chunk [591/1238] bpb=1.075298 time=221.1s + ttt_chunk [601/1238] bpb=1.075208 time=223.7s + ttt_chunk [611/1238] bpb=1.075792 time=226.3s + ttt_chunk [621/1238] bpb=1.076628 time=228.9s + ttt_chunk [631/1238] bpb=1.076696 time=231.5s + ttt_chunk [641/1238] bpb=1.077134 time=234.1s + ttt_chunk 
[651/1238] bpb=1.077457 time=236.7s + ttt_chunk [661/1238] bpb=1.076773 time=239.3s + ttt_chunk [671/1238] bpb=1.076521 time=242.0s + ttt_chunk [681/1238] bpb=1.077832 time=244.6s + ttt_chunk [691/1238] bpb=1.078005 time=247.2s + ttt_chunk [701/1238] bpb=1.077795 time=249.8s + ttt_chunk [711/1238] bpb=1.078455 time=252.4s + ttt_chunk [721/1238] bpb=1.078744 time=255.0s + ttt_chunk [731/1238] bpb=1.078093 time=257.6s + ttt_chunk [741/1238] bpb=1.077734 time=260.2s + ttt_chunk [751/1238] bpb=1.076827 time=262.8s + ttt_chunk [761/1238] bpb=1.076235 time=265.4s + ttt_chunk [771/1238] bpb=1.075215 time=268.0s + ttt_chunk [781/1238] bpb=1.075190 time=270.6s + ttt_chunk [791/1238] bpb=1.075527 time=273.2s + ttt_chunk [801/1238] bpb=1.075818 time=275.8s + ttt_chunk [811/1238] bpb=1.075323 time=278.4s + ttt_chunk [821/1238] bpb=1.074078 time=281.0s + ttt_chunk [831/1238] bpb=1.073717 time=283.6s + ttt_chunk [841/1238] bpb=1.073216 time=286.2s + ttt_chunk [851/1238] bpb=1.072909 time=288.8s + ttt_chunk [861/1238] bpb=1.072560 time=291.4s + ttt_chunk [871/1238] bpb=1.072461 time=294.0s + ttt_chunk [881/1238] bpb=1.071996 time=296.6s + ttt_chunk [891/1238] bpb=1.071468 time=299.2s + ttt_chunk [901/1238] bpb=1.071842 time=301.9s + ttt_chunk [911/1238] bpb=1.071512 time=304.5s + ttt_chunk [921/1238] bpb=1.071749 time=307.1s + ttt_chunk [931/1238] bpb=1.072421 time=309.7s + ttt_chunk [941/1238] bpb=1.072784 time=312.3s + ttt_chunk [951/1238] bpb=1.072695 time=314.9s + ttt_chunk [961/1238] bpb=1.073519 time=317.5s + ttt_chunk [971/1238] bpb=1.073921 time=320.1s + ttt_chunk [981/1238] bpb=1.074262 time=322.7s + ttt_chunk [991/1238] bpb=1.074013 time=325.3s + ttt_chunk [1001/1238] bpb=1.074036 time=327.9s + ttt_chunk [1011/1238] bpb=1.074358 time=330.5s + ttt_chunk [1021/1238] bpb=1.075051 time=333.1s + ttt_chunk [1031/1238] bpb=1.075517 time=335.7s + ttt_chunk [1041/1238] bpb=1.075968 time=338.3s + ttt_chunk [1051/1238] bpb=1.075896 time=340.9s + ttt_chunk [1061/1238] bpb=1.075903 
time=343.5s + ttt_chunk [1071/1238] bpb=1.076055 time=346.2s + ttt_chunk [1081/1238] bpb=1.075941 time=348.8s + ttt_chunk [1091/1238] bpb=1.076137 time=351.4s + ttt_chunk [1101/1238] bpb=1.076664 time=354.0s + ttt_chunk [1111/1238] bpb=1.076947 time=356.5s + ttt_chunk [1121/1238] bpb=1.077128 time=359.1s + ttt_chunk [1131/1238] bpb=1.076797 time=361.7s + ttt_chunk [1141/1238] bpb=1.076455 time=364.3s + ttt_chunk [1151/1238] bpb=1.076491 time=366.9s + ttt_chunk [1161/1238] bpb=1.076608 time=369.5s + ttt_chunk [1171/1238] bpb=1.076376 time=372.2s + ttt_chunk [1181/1238] bpb=1.075908 time=374.7s + ttt_chunk [1191/1238] bpb=1.076027 time=377.4s + ttt_chunk [1201/1238] bpb=1.076100 time=380.0s + ttt_chunk [1211/1238] bpb=1.075784 time=382.6s + ttt_chunk [1221/1238] bpb=1.075310 time=385.2s + ttt_chunk [1231/1238] bpb=1.074950 time=387.8s + ttt_chunk [1238/1238] bpb=1.074962 time=410.6s +ttt_sliding:done val_loss=2.776993 val_bpb=1.07506048 elapsed=411.0s +legal_ttt_exact val_loss:2.77699288 val_bpb:1.07506048 eval_time:411228ms diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed2024.log b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed2024.log new file mode 100644 index 0000000000..3500eeddf5 --- /dev/null +++ b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed2024.log @@ -0,0 +1,341 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.095 + embedding_dim: 512 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 13.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + 
hash_embed_enabled: True + hash_embed_size: 16384 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/0a592961-0323-4fa9-9085-b8c51491d0c1.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_freeze_lane0: False + parallel_identity_init: True + parallel_mlp_read_mix: False + parallel_residual: True + parallel_residual_start: 8 + parallel_skip_lane0_only: True + parallel_start_layer: 8 + parallel_start_layer_is_physical: True + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 0a592961-0323-4fa9-9085-b8c51491d0c1 + scalar_lr: 0.02 + seed: 2024 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_adamw_wd: 0.0 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.01 + ttt_momentum: 0.9 + ttt_optimizer: sgd + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 
+==================================================================================================== +Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Mon Apr 13 12:09:29 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 41C P0 125W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 35C P0 117W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 33C P0 116W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 41C P0 120W / 700W | 1521MiB / 81559MiB | 5% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 43C P0 126W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 35C P0 121W / 700W | 1521MiB / 81559MiB | 0% 
Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 41C P0 124W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 35C P0 122W / 700W | 1521MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 80 +val_tokens: 40540160 +model_params:35944602 +parallel_residual:active=1 start_layer=8 start_mode=physical final_lane=mean freeze_lane0=0 identity_init=1 skip_lane0_only=1 mlp_read_mix=0 +gptq:reserving 13s, effective=587000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0090 val_bpb: 3.4877 +1/20000 train_loss: 9.0109 train_time: 0.0m tok/s: 18182981 +2/20000 train_loss: 12.4767 train_time: 0.0m tok/s: 13429688 +3/20000 train_loss: 11.1279 train_time: 0.0m tok/s: 10997355 
+4/20000 train_loss: 9.5668 train_time: 0.0m tok/s: 10044980 +5/20000 train_loss: 8.3497 train_time: 0.0m tok/s: 9522507 +500/20000 train_loss: 3.3738 train_time: 0.8m tok/s: 7916508 +1000/20000 train_loss: 3.2776 train_time: 1.7m tok/s: 7890867 +1500/20000 train_loss: 3.1795 train_time: 2.5m tok/s: 7888743 +2000/20000 train_loss: 3.0936 train_time: 3.3m tok/s: 7896489 +layer_loop:enabled step:2063 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.1400 train_time: 4.5m tok/s: 7311724 +3000/20000 train_loss: 2.9178 train_time: 5.7m tok/s: 6898369 +3500/20000 train_loss: 2.9592 train_time: 6.9m tok/s: 6641928 +4000/20000 train_loss: 2.8291 train_time: 8.1m tok/s: 6462013 +4000/20000 val_loss: 2.8886 val_bpb: 1.1183 +4500/20000 train_loss: 2.8435 train_time: 9.3m tok/s: 6329498 +4689/20000 val_loss: 2.8006 val_bpb: 1.0842 +stopping_early: wallclock_cap train_time: 587076ms step: 4689/20000 +peak memory allocated: 39948 MiB reserved: 40026 MiB +ema:applying EMA weights +parallel_residual:converged active=1 start_layer=8 start_mode=physical final_lane=mean freeze_lane0=0 identity_init=1 skip_lane0_only=1 mlp_read_mix=0 used_layers=3 +parallel_residual layer:14 physical:8 attn_resid:2.5872 attn_to_attn:0.1020 attn_to_mlp:0.4441 mlp_resid:0.4081 mlp_to_attn:-0.2585 mlp_to_mlp:0.7502 +parallel_residual layer:15 physical:9 attn_resid:0.4502 attn_to_attn:1.8640 attn_to_mlp:0.0699 mlp_resid:0.4645 mlp_to_attn:-0.0016 mlp_to_mlp:0.6330 +parallel_residual layer:16 physical:10 attn_resid:0.0053 attn_to_attn:0.3073 attn_to_mlp:0.3073 mlp_resid:0.4566 mlp_to_attn:0.6322 mlp_to_mlp:0.6322 +pre-quantization post-ema val_loss:2.80083877 val_bpb:1.08429197 eval_time:6208ms +Serialized model: 135409136 bytes +Code size: 26056 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 12.6s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights +Serialized model quantized+brotli: 15961326 bytes +Total submission size quantized+brotli: 15987382 bytes +quantized val_loss:2.82850988 val_bpb:1.09500432 eval_time:8828ms +quantized_sliding_window val_loss:2.78454271 val_bpb:1.07798326 eval_time:93041ms +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.01 ttt_epochs=3 freeze_blocks=0 optimizer=sgd hash_embed=True +ttt_sliding:params unfrozen=44333210 frozen=0 + ttt_chunk [1/1238] bpb=1.113671 time=5.7s + ttt_chunk [11/1238] bpb=1.067660 time=10.7s + ttt_chunk [21/1238] bpb=1.104832 time=13.3s + ttt_chunk [31/1238] bpb=1.098409 time=16.0s + ttt_chunk [41/1238] bpb=1.091948 time=18.6s + ttt_chunk [51/1238] bpb=1.085536 time=21.2s + ttt_chunk [61/1238] bpb=1.077030 time=23.8s + ttt_chunk [71/1238] bpb=1.084317 time=26.4s + ttt_chunk [81/1238] bpb=1.077793 time=29.0s + ttt_chunk [91/1238] bpb=1.074202 time=31.7s + ttt_chunk [101/1238] bpb=1.074067 time=34.3s + ttt_chunk [111/1238] bpb=1.072248 time=36.9s + ttt_chunk [121/1238] bpb=1.075215 time=39.6s + ttt_chunk [131/1238] bpb=1.078956 time=42.2s + ttt_chunk [141/1238] bpb=1.079661 time=44.8s + ttt_chunk [151/1238] bpb=1.079453 time=47.4s + ttt_chunk [161/1238] bpb=1.080014 time=50.1s + ttt_chunk [171/1238] bpb=1.079843 time=52.7s + ttt_chunk [181/1238] bpb=1.078349 time=55.3s + ttt_chunk [191/1238] bpb=1.078085 time=57.9s + ttt_chunk [201/1238] bpb=1.075674 time=60.5s + ttt_chunk [211/1238] bpb=1.080127 time=63.1s + ttt_chunk [221/1238] bpb=1.080553 time=65.7s + ttt_chunk [231/1238] bpb=1.082200 time=68.3s + ttt_chunk 
[241/1238] bpb=1.080377 time=70.9s + ttt_chunk [251/1238] bpb=1.080391 time=73.6s + ttt_chunk [261/1238] bpb=1.081387 time=76.2s + ttt_chunk [271/1238] bpb=1.081850 time=78.8s + ttt_chunk [281/1238] bpb=1.081200 time=81.4s + ttt_chunk [291/1238] bpb=1.082350 time=84.1s + ttt_chunk [301/1238] bpb=1.082549 time=86.7s + ttt_chunk [311/1238] bpb=1.081478 time=89.3s + ttt_chunk [321/1238] bpb=1.081305 time=92.0s + ttt_chunk [331/1238] bpb=1.081611 time=94.6s + ttt_chunk [341/1238] bpb=1.080727 time=97.2s + ttt_chunk [351/1238] bpb=1.081484 time=99.8s + ttt_chunk [361/1238] bpb=1.080398 time=102.5s + ttt_chunk [371/1238] bpb=1.078852 time=105.1s + ttt_chunk [381/1238] bpb=1.079260 time=107.7s + ttt_chunk [391/1238] bpb=1.078935 time=110.3s + ttt_chunk [401/1238] bpb=1.079043 time=112.9s + ttt_chunk [411/1238] bpb=1.079624 time=115.6s + ttt_chunk [421/1238] bpb=1.079127 time=118.2s + ttt_chunk [431/1238] bpb=1.079297 time=120.8s + ttt_chunk [441/1238] bpb=1.079313 time=123.4s + ttt_chunk [451/1238] bpb=1.080436 time=126.0s + ttt_chunk [461/1238] bpb=1.078659 time=128.6s + ttt_chunk [471/1238] bpb=1.078659 time=131.2s + ttt_chunk [481/1238] bpb=1.078798 time=133.9s + ttt_chunk [491/1238] bpb=1.079215 time=136.5s + ttt_chunk [501/1238] bpb=1.078816 time=139.1s + ttt_chunk [511/1238] bpb=1.078416 time=141.7s + ttt_chunk [521/1238] bpb=1.077904 time=144.4s + ttt_chunk [531/1238] bpb=1.077884 time=147.0s + ttt_chunk [541/1238] bpb=1.077967 time=149.6s + ttt_chunk [551/1238] bpb=1.077525 time=152.2s + ttt_chunk [561/1238] bpb=1.076837 time=154.8s + ttt_chunk [571/1238] bpb=1.076262 time=157.5s + ttt_chunk [581/1238] bpb=1.076590 time=160.1s + ttt_chunk [591/1238] bpb=1.076775 time=162.7s + ttt_chunk [601/1238] bpb=1.076666 time=165.3s + ttt_chunk [611/1238] bpb=1.077228 time=167.9s + ttt_chunk [621/1238] bpb=1.078100 time=170.5s + ttt_chunk [631/1238] bpb=1.078173 time=173.1s + ttt_chunk [641/1238] bpb=1.078604 time=175.7s + ttt_chunk [651/1238] bpb=1.078911 time=178.4s + 
ttt_chunk [661/1238] bpb=1.078251 time=181.0s + ttt_chunk [671/1238] bpb=1.078000 time=183.6s + ttt_chunk [681/1238] bpb=1.079286 time=186.2s + ttt_chunk [691/1238] bpb=1.079463 time=188.8s + ttt_chunk [701/1238] bpb=1.079259 time=191.4s + ttt_chunk [711/1238] bpb=1.079969 time=194.0s + ttt_chunk [721/1238] bpb=1.080271 time=196.6s + ttt_chunk [731/1238] bpb=1.079616 time=199.2s + ttt_chunk [741/1238] bpb=1.079278 time=201.9s + ttt_chunk [751/1238] bpb=1.078332 time=204.5s + ttt_chunk [761/1238] bpb=1.077758 time=207.1s + ttt_chunk [771/1238] bpb=1.076740 time=209.7s + ttt_chunk [781/1238] bpb=1.076715 time=212.3s + ttt_chunk [791/1238] bpb=1.077053 time=214.9s + ttt_chunk [801/1238] bpb=1.077339 time=217.5s + ttt_chunk [811/1238] bpb=1.076833 time=220.1s + ttt_chunk [821/1238] bpb=1.075621 time=222.7s + ttt_chunk [831/1238] bpb=1.075278 time=225.3s + ttt_chunk [841/1238] bpb=1.074779 time=227.9s + ttt_chunk [851/1238] bpb=1.074497 time=230.5s + ttt_chunk [861/1238] bpb=1.074137 time=233.2s + ttt_chunk [871/1238] bpb=1.074015 time=235.8s + ttt_chunk [881/1238] bpb=1.073528 time=238.4s + ttt_chunk [891/1238] bpb=1.072987 time=241.0s + ttt_chunk [901/1238] bpb=1.073335 time=243.6s + ttt_chunk [911/1238] bpb=1.073019 time=246.2s + ttt_chunk [921/1238] bpb=1.073275 time=248.9s + ttt_chunk [931/1238] bpb=1.073921 time=251.5s + ttt_chunk [941/1238] bpb=1.074291 time=254.1s + ttt_chunk [951/1238] bpb=1.074200 time=256.7s + ttt_chunk [961/1238] bpb=1.075001 time=259.3s + ttt_chunk [971/1238] bpb=1.075361 time=261.9s + ttt_chunk [981/1238] bpb=1.075708 time=264.5s + ttt_chunk [991/1238] bpb=1.075484 time=267.1s + ttt_chunk [1001/1238] bpb=1.075498 time=269.7s + ttt_chunk [1011/1238] bpb=1.075815 time=272.3s + ttt_chunk [1021/1238] bpb=1.076517 time=274.9s + ttt_chunk [1031/1238] bpb=1.076956 time=277.5s + ttt_chunk [1041/1238] bpb=1.077409 time=280.2s + ttt_chunk [1051/1238] bpb=1.077337 time=282.8s + ttt_chunk [1061/1238] bpb=1.077340 time=285.4s + ttt_chunk [1071/1238] 
bpb=1.077492 time=288.0s + ttt_chunk [1081/1238] bpb=1.077377 time=290.6s + ttt_chunk [1091/1238] bpb=1.077572 time=293.2s + ttt_chunk [1101/1238] bpb=1.078118 time=295.8s + ttt_chunk [1111/1238] bpb=1.078403 time=298.4s + ttt_chunk [1121/1238] bpb=1.078554 time=301.0s + ttt_chunk [1131/1238] bpb=1.078210 time=303.6s + ttt_chunk [1141/1238] bpb=1.077872 time=306.3s + ttt_chunk [1151/1238] bpb=1.077874 time=308.9s + ttt_chunk [1161/1238] bpb=1.078011 time=311.5s + ttt_chunk [1171/1238] bpb=1.077783 time=314.1s + ttt_chunk [1181/1238] bpb=1.077300 time=316.8s + ttt_chunk [1191/1238] bpb=1.077412 time=319.4s + ttt_chunk [1201/1238] bpb=1.077467 time=322.0s + ttt_chunk [1211/1238] bpb=1.077132 time=324.6s + ttt_chunk [1221/1238] bpb=1.076670 time=327.2s + ttt_chunk [1231/1238] bpb=1.076299 time=329.8s + ttt_chunk [1238/1238] bpb=1.076298 time=334.1s +ttt_sliding:done val_loss=2.780585 val_bpb=1.07645101 elapsed=334.3s +legal_ttt_exact val_loss:2.78058475 val_bpb:1.07645101 eval_time:334528ms diff --git a/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed42.log b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed42.log new file mode 100644 index 0000000000..2dfaaa32c8 --- /dev/null +++ b/records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT/train_seed42.log @@ -0,0 +1,341 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.095 + embedding_dim: 512 + enable_looping_at: 0.35 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 13.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + hash_embed_enabled: True + 
hash_embed_size: 16384 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/43eccbc6-7ef3-4852-9005-bacdac8697ba.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 3 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.022 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.97 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.095 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_final_lane: mean + parallel_freeze_lane0: False + parallel_identity_init: True + parallel_mlp_read_mix: False + parallel_residual: True + parallel_residual_start: 8 + parallel_skip_lane0_only: True + parallel_start_layer: 8 + parallel_start_layer_is_physical: True + qk_gain_init: 5.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 43eccbc6-7ef3-4852-9005-bacdac8697ba + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_adamw_wd: 0.0 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_lr: 0.01 + ttt_momentum: 0.9 + ttt_optimizer: sgd + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 
3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0] +Running PyTorch 2.9.1+cu128 +Mon Apr 13 12:30:10 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 | +| N/A 43C P0 127W / 700W | 1521MiB / 81559MiB | 8% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 | +| N/A 36C P0 118W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 | +| N/A 34C P0 118W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 | +| N/A 43C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 | +| N/A 45C P0 129W / 700W | 1521MiB / 81559MiB | 8% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 | +| N/A 36C P0 123W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | 
++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 | +| N/A 43C P0 126W / 700W | 1521MiB / 81559MiB | 6% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 | +| N/A 36C P0 122W / 700W | 1521MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| No running processes found | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 80 +val_tokens: 40540160 +model_params:35944602 +parallel_residual:active=1 start_layer=8 start_mode=physical final_lane=mean freeze_lane0=0 identity_init=1 skip_lane0_only=1 mlp_read_mix=0 +gptq:reserving 13s, effective=587000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0078 val_bpb: 3.4872 +1/20000 train_loss: 9.0109 train_time: 0.0m tok/s: 18055177 +2/20000 train_loss: 12.5061 train_time: 0.0m tok/s: 13405816 +3/20000 train_loss: 11.1935 train_time: 0.0m tok/s: 11008215 +4/20000 train_loss: 9.6281 
train_time: 0.0m tok/s: 10052134 +5/20000 train_loss: 8.4165 train_time: 0.0m tok/s: 9528019 +500/20000 train_loss: 3.3674 train_time: 0.8m tok/s: 7910914 +1000/20000 train_loss: 3.2730 train_time: 1.7m tok/s: 7894041 +1500/20000 train_loss: 3.1793 train_time: 2.5m tok/s: 7901963 +2000/20000 train_loss: 3.0909 train_time: 3.3m tok/s: 7910104 +layer_loop:enabled step:2067 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] +2500/20000 train_loss: 3.1417 train_time: 4.5m tok/s: 7329117 +3000/20000 train_loss: 2.9197 train_time: 5.7m tok/s: 6913008 +3500/20000 train_loss: 2.9540 train_time: 6.9m tok/s: 6654722 +4000/20000 train_loss: 2.8315 train_time: 8.1m tok/s: 6473318 +4000/20000 val_loss: 2.8881 val_bpb: 1.1181 +4500/20000 train_loss: 2.8466 train_time: 9.3m tok/s: 6338880 +4696/20000 val_loss: 2.7991 val_bpb: 1.0836 +stopping_early: wallclock_cap train_time: 587208ms step: 4696/20000 +peak memory allocated: 39948 MiB reserved: 40026 MiB +ema:applying EMA weights +parallel_residual:converged active=1 start_layer=8 start_mode=physical final_lane=mean freeze_lane0=0 identity_init=1 skip_lane0_only=1 mlp_read_mix=0 used_layers=3 +parallel_residual layer:14 physical:8 attn_resid:2.8125 attn_to_attn:-0.0048 attn_to_mlp:0.4626 mlp_resid:0.3770 mlp_to_attn:-0.1043 mlp_to_mlp:0.6888 +parallel_residual layer:15 physical:9 attn_resid:1.1397 attn_to_attn:0.4638 attn_to_mlp:0.3701 mlp_resid:0.4308 mlp_to_attn:0.1887 mlp_to_mlp:0.5600 +parallel_residual layer:16 physical:10 attn_resid:-0.0137 attn_to_attn:0.2337 attn_to_mlp:0.2337 mlp_resid:0.4919 mlp_to_attn:0.5891 mlp_to_mlp:0.5891 +pre-quantization post-ema val_loss:2.79919043 val_bpb:1.08365385 eval_time:6192ms +Serialized model: 135409136 bytes +Code size: 26056 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 12.6s
+Quantized weights:
+  gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+  gptq (int8): tok_emb.weight
+  passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, parallel_post_lambdas, parallel_resid_lambdas, skip_gates, skip_weights
+Serialized model quantized+brotli: 15956507 bytes
+Total submission size quantized+brotli: 15982563 bytes
+quantized val_loss:2.82637730 val_bpb:1.09417873 eval_time:8823ms
+quantized_sliding_window val_loss:2.78259059 val_bpb:1.07722753 eval_time:92863ms
+ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.01 ttt_epochs=3 freeze_blocks=0 optimizer=sgd hash_embed=True
+ttt_sliding:params unfrozen=44333210 frozen=0
+  ttt_chunk [1/1238] bpb=1.112730 time=5.6s
+  ttt_chunk [11/1238] bpb=1.066597 time=10.4s
+  ttt_chunk [21/1238] bpb=1.104212 time=13.0s
+  ttt_chunk [31/1238] bpb=1.098173 time=15.6s
+  ttt_chunk [41/1238] bpb=1.090976 time=18.2s
+  ttt_chunk [51/1238] bpb=1.084479 time=20.8s
+  ttt_chunk [61/1238] bpb=1.076436 time=23.5s
+  ttt_chunk [71/1238] bpb=1.083681 time=26.1s
+  ttt_chunk [81/1238] bpb=1.076911 time=28.7s
+  ttt_chunk [91/1238] bpb=1.073309 time=31.4s
+  ttt_chunk [101/1238] bpb=1.073143 time=34.0s
+  ttt_chunk [111/1238] bpb=1.071610 time=36.6s
+  ttt_chunk [121/1238] bpb=1.074633 time=39.2s
+  ttt_chunk [131/1238] bpb=1.078358 time=41.8s
+  ttt_chunk [141/1238] bpb=1.078877 time=44.5s
+  ttt_chunk [151/1238] bpb=1.078669 time=47.1s
+  ttt_chunk [161/1238] bpb=1.079066 time=49.7s
+  ttt_chunk [171/1238] bpb=1.078982 time=52.3s
+  ttt_chunk [181/1238] bpb=1.077497 time=54.9s
+  ttt_chunk [191/1238] bpb=1.077373 time=57.6s
+  ttt_chunk [201/1238] bpb=1.075053 time=60.2s
+  ttt_chunk [211/1238] bpb=1.079496 time=62.8s
+  ttt_chunk [221/1238] bpb=1.079906 time=65.4s
+  ttt_chunk [231/1238] bpb=1.081542 time=68.1s
+  ttt_chunk [241/1238] bpb=1.079759 time=70.7s
+  ttt_chunk [251/1238] bpb=1.079731 time=73.3s
+  ttt_chunk [261/1238] bpb=1.080843 time=75.9s
+  ttt_chunk [271/1238] bpb=1.081258 time=78.6s
+  ttt_chunk [281/1238] bpb=1.080592 time=81.2s
+  ttt_chunk [291/1238] bpb=1.081748 time=83.8s
+  ttt_chunk [301/1238] bpb=1.081986 time=86.4s
+  ttt_chunk [311/1238] bpb=1.080832 time=89.1s
+  ttt_chunk [321/1238] bpb=1.080647 time=91.7s
+  ttt_chunk [331/1238] bpb=1.080938 time=94.3s
+  ttt_chunk [341/1238] bpb=1.080035 time=96.9s
+  ttt_chunk [351/1238] bpb=1.080744 time=99.5s
+  ttt_chunk [361/1238] bpb=1.079687 time=102.1s
+  ttt_chunk [371/1238] bpb=1.078132 time=104.7s
+  ttt_chunk [381/1238] bpb=1.078492 time=107.4s
+  ttt_chunk [391/1238] bpb=1.078142 time=110.0s
+  ttt_chunk [401/1238] bpb=1.078219 time=112.6s
+  ttt_chunk [411/1238] bpb=1.078771 time=115.2s
+  ttt_chunk [421/1238] bpb=1.078264 time=117.8s
+  ttt_chunk [431/1238] bpb=1.078467 time=120.5s
+  ttt_chunk [441/1238] bpb=1.078514 time=123.1s
+  ttt_chunk [451/1238] bpb=1.079688 time=125.7s
+  ttt_chunk [461/1238] bpb=1.077897 time=128.4s
+  ttt_chunk [471/1238] bpb=1.077916 time=131.0s
+  ttt_chunk [481/1238] bpb=1.078084 time=133.6s
+  ttt_chunk [491/1238] bpb=1.078531 time=136.2s
+  ttt_chunk [501/1238] bpb=1.078123 time=138.9s
+  ttt_chunk [511/1238] bpb=1.077738 time=141.5s
+  ttt_chunk [521/1238] bpb=1.077240 time=144.1s
+  ttt_chunk [531/1238] bpb=1.077179 time=146.8s
+  ttt_chunk [541/1238] bpb=1.077247 time=149.4s
+  ttt_chunk [551/1238] bpb=1.076788 time=152.1s
+  ttt_chunk [561/1238] bpb=1.076089 time=154.7s
+  ttt_chunk [571/1238] bpb=1.075516 time=157.3s
+  ttt_chunk [581/1238] bpb=1.075851 time=159.9s
+  ttt_chunk [591/1238] bpb=1.076067 time=162.5s
+  ttt_chunk [601/1238] bpb=1.075992 time=165.1s
+  ttt_chunk [611/1238] bpb=1.076539 time=167.8s
+  ttt_chunk [621/1238] bpb=1.077376 time=170.4s
+  ttt_chunk [631/1238] bpb=1.077420 time=173.1s
+  ttt_chunk [641/1238] bpb=1.077862 time=175.7s
+  ttt_chunk [651/1238] bpb=1.078183 time=178.3s
+  ttt_chunk [661/1238] bpb=1.077495 time=181.0s
+  ttt_chunk [671/1238] bpb=1.077278 time=183.6s
+  ttt_chunk [681/1238] bpb=1.078556 time=186.2s
+  ttt_chunk [691/1238] bpb=1.078729 time=188.9s
+  ttt_chunk [701/1238] bpb=1.078521 time=191.5s
+  ttt_chunk [711/1238] bpb=1.079204 time=194.1s
+  ttt_chunk [721/1238] bpb=1.079506 time=196.7s
+  ttt_chunk [731/1238] bpb=1.078848 time=199.4s
+  ttt_chunk [741/1238] bpb=1.078526 time=202.0s
+  ttt_chunk [751/1238] bpb=1.077607 time=204.6s
+  ttt_chunk [761/1238] bpb=1.077033 time=207.3s
+  ttt_chunk [771/1238] bpb=1.076006 time=209.9s
+  ttt_chunk [781/1238] bpb=1.075983 time=212.5s
+  ttt_chunk [791/1238] bpb=1.076323 time=215.2s
+  ttt_chunk [801/1238] bpb=1.076600 time=217.8s
+  ttt_chunk [811/1238] bpb=1.076080 time=220.4s
+  ttt_chunk [821/1238] bpb=1.074853 time=223.0s
+  ttt_chunk [831/1238] bpb=1.074505 time=225.7s
+  ttt_chunk [841/1238] bpb=1.074017 time=228.3s
+  ttt_chunk [851/1238] bpb=1.073703 time=231.0s
+  ttt_chunk [861/1238] bpb=1.073352 time=233.6s
+  ttt_chunk [871/1238] bpb=1.073256 time=236.2s
+  ttt_chunk [881/1238] bpb=1.072765 time=238.9s
+  ttt_chunk [891/1238] bpb=1.072244 time=241.5s
+  ttt_chunk [901/1238] bpb=1.072603 time=244.1s
+  ttt_chunk [911/1238] bpb=1.072301 time=246.8s
+  ttt_chunk [921/1238] bpb=1.072583 time=249.4s
+  ttt_chunk [931/1238] bpb=1.073260 time=252.0s
+  ttt_chunk [941/1238] bpb=1.073628 time=254.6s
+  ttt_chunk [951/1238] bpb=1.073555 time=257.2s
+  ttt_chunk [961/1238] bpb=1.074344 time=259.9s
+  ttt_chunk [971/1238] bpb=1.074735 time=262.5s
+  ttt_chunk [981/1238] bpb=1.075070 time=265.1s
+  ttt_chunk [991/1238] bpb=1.074826 time=267.7s
+  ttt_chunk [1001/1238] bpb=1.074841 time=270.4s
+  ttt_chunk [1011/1238] bpb=1.075159 time=273.0s
+  ttt_chunk [1021/1238] bpb=1.075853 time=275.6s
+  ttt_chunk [1031/1238] bpb=1.076309 time=278.3s
+  ttt_chunk [1041/1238] bpb=1.076760 time=280.9s
+  ttt_chunk [1051/1238] bpb=1.076685 time=283.5s
+  ttt_chunk [1061/1238] bpb=1.076680 time=286.1s
+  ttt_chunk [1071/1238] bpb=1.076835 time=288.8s
+  ttt_chunk [1081/1238] bpb=1.076736 time=291.4s
+  ttt_chunk [1091/1238] bpb=1.076905 time=294.0s
+  ttt_chunk [1101/1238] bpb=1.077447 time=296.6s
+  ttt_chunk [1111/1238] bpb=1.077718 time=299.2s
+  ttt_chunk [1121/1238] bpb=1.077871 time=301.9s
+  ttt_chunk [1131/1238] bpb=1.077494 time=304.5s
+  ttt_chunk [1141/1238] bpb=1.077154 time=307.1s
+  ttt_chunk [1151/1238] bpb=1.077193 time=309.8s
+  ttt_chunk [1161/1238] bpb=1.077309 time=312.4s
+  ttt_chunk [1171/1238] bpb=1.077070 time=315.0s
+  ttt_chunk [1181/1238] bpb=1.076584 time=317.6s
+  ttt_chunk [1191/1238] bpb=1.076728 time=320.3s
+  ttt_chunk [1201/1238] bpb=1.076814 time=322.9s
+  ttt_chunk [1211/1238] bpb=1.076490 time=325.6s
+  ttt_chunk [1221/1238] bpb=1.076037 time=328.2s
+  ttt_chunk [1231/1238] bpb=1.075666 time=330.8s
+  ttt_chunk [1238/1238] bpb=1.075660 time=335.0s
+ttt_sliding:done val_loss=2.779035 val_bpb=1.07585093 elapsed=335.2s
+legal_ttt_exact val_loss:2.77903470 val_bpb:1.07585093 eval_time:335424ms