Hi,
I made what I think is a small improvement to the record (commit link) by developing on 1xH100, rebasing on PR #159, and then shelling out for an 8xH100 SXM bare-metal machine (Ubuntu 22, nvidia-docker) on Prime Intellect to start collecting data. All of my results were roughly 0.7 seconds slower than I expected. I have 4 runs, two with my change and two at #159; at first I thought my change just didn't improve run-time on 8xH100, until I thought to revert the code and run the original record version 😅.
Any advice on how to successfully replicate your results as a baseline, so I can show improvement? I want external validation 🙈
(I also want to ask questions such as: why do we even need that sa_lambda[0] multiply at all, instead of just scaling the weights at init time? The unsatisfying answer is that doing so degrades the results. Maybe the interplay between sa_lambda[0] being optimized by AdamW vs QKVO by Muon matters, maybe it's the computation graph difference, I really don't know.)
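To make the question concrete, here's a minimal sketch of the two alternatives I'm comparing. The name `sa_lambda` follows the record's convention; everything else (`AttnOut`, `init_scale`, the module layout) is a toy stand-in of my own, not the actual speedrun code:

```python
import torch
import torch.nn as nn

class AttnOut(nn.Module):
    """Toy illustration of the two options (not the real record code)."""

    def __init__(self, dim: int, fold_scale_into_init: bool = False, init_scale: float = 0.5):
        super().__init__()
        self.o_proj = nn.Linear(dim, dim, bias=False)  # would be optimized by Muon
        if fold_scale_into_init:
            # Alternative: bake the scale into the weights once, at init,
            # and drop the runtime multiply entirely.
            with torch.no_grad():
                self.o_proj.weight.mul_(init_scale)
            self.sa_lambda = None
        else:
            # Current approach: a learnable scalar (optimized by AdamW)
            # multiplied into the forward pass on every step.
            self.sa_lambda = nn.Parameter(torch.tensor([init_scale]))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        y = self.o_proj(attn_out)
        if self.sa_lambda is not None:
            y = self.sa_lambda[0] * y
        return y
```

In the folded version the scale is no longer a separate parameter, so it stops being an AdamW-trained degree of freedom and instead gets absorbed into whatever Muon does to the projection weights, which is exactly the interplay I suspect matters.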
These are the baseline logs:
0d29bb84-83d1-446c-8aaf-c4deeb2df2d3.txt - 135780ms
f06653aa-edab-4ff4-bbc0-71a722af0a2b.txt - 135687ms
These are the modified version logs:
51b93080-55c3-4ad7-b27a-9784b8b71143.txt - 135088ms
07d52694-1cbd-49c5-9244-7a7883500fa5.txt - 135145ms