Hi,
I made what I think is a small improvement to the record (commit link) by developing on 1xH100, rebasing on PR #159, and then shelling out for an 8xH100 SXM bare-metal machine (Ubuntu 22, nvidia-docker) on Prime Intellect to start collecting data. All of my results were roughly 0.7 seconds slower than I expected. I have 4 runs, two with my change and two at #159; at first I thought my change just didn't improve run-time on 8xH100, until I thought to revert the code and run the original record version 😅.
Any advice on how to successfully replicate your results as a baseline, so I can show improvement? I want external validation 🙈
(I also want to ask questions such as: why do we even need that sa_lambda[0] multiply at all, instead of just scaling the weights at init time? The unsatisfying answer is that doing so degrades the results. Maybe the interplay between sa_lambda[0] being optimized by AdamW vs QKVO by Muon matters, maybe it's the computation graph difference, I really don't know.)
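To make the question concrete, here's a minimal sketch of the two alternatives I'm comparing. The name `sa_lambda` follows the record's convention; everything else (`AttnOut`, `init_scale`, the module layout) is a toy stand-in of my own, not the actual speedrun code:

```python
import torch
import torch.nn as nn

class AttnOut(nn.Module):
    """Toy illustration of the two options (not the real record code)."""

    def __init__(self, dim: int, fold_scale_into_init: bool = False, init_scale: float = 0.5):
        super().__init__()
        self.o_proj = nn.Linear(dim, dim, bias=False)  # would be optimized by Muon
        if fold_scale_into_init:
            # Alternative: bake the scale into the weights once, at init,
            # and drop the runtime multiply entirely.
            with torch.no_grad():
                self.o_proj.weight.mul_(init_scale)
            self.sa_lambda = None
        else:
            # Current approach: a learnable scalar (optimized by AdamW)
            # multiplied into the forward pass on every step.
            self.sa_lambda = nn.Parameter(torch.tensor([init_scale]))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        y = self.o_proj(attn_out)
        if self.sa_lambda is not None:
            y = self.sa_lambda[0] * y
        return y
```

In the folded version the scale is no longer a separate parameter, so it stops being an AdamW-trained degree of freedom and instead gets absorbed into whatever Muon does to the projection weights, which is exactly the interplay I suspect matters.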
These are the baseline logs:
0d29bb84-83d1-446c-8aaf-c4deeb2df2d3.txt - 135780ms
f06653aa-edab-4ff4-bbc0-71a722af0a2b.txt - 135687ms
These are the modified version logs:
51b93080-55c3-4ad7-b27a-9784b8b71143.txt - 135088ms
07d52694-1cbd-49c5-9244-7a7883500fa5.txt - 135145ms