Hard time replicating on 8xH100 #160

@shenberg

Description

Hi,

I made what I think is a small improvement to the record (commit link) by developing on 1xH100, rebased on PR #159, and shelled out for an 8xH100 SXM bare-metal machine (Ubuntu 22, nvidia-docker) on Prime Intellect to start collecting data. All of my results were roughly 0.7 seconds slower than I expected. I have 4 runs, two with my change and two at #159; at first I thought my change just didn't improve run-time on 8xH100, until I thought to revert the code and run the original record version 😅.
Any advice on how to successfully replicate your results as a baseline, so I can show improvement? I want external validation 🙈
(I'd also like to ask questions such as: why do we even need that sa_lambda[0] multiply at all, instead of just scaling the weights at init time? The unsatisfying answer is that it degrades the results. Maybe the interplay between sa_lambda[0] being optimized by AdamW vs. QKVO by Muon matters, maybe it's the computation-graph difference; I really don't know.)
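For concreteness, here is a minimal numpy sketch of the question above. The shapes and names (`W`, `lam`) are hypothetical stand-ins, not the actual record code: the forward pass is identical whether the scalar is applied at runtime or folded into the weights at init, but the two parameterizations expose different parameters (a scalar vs. a matrix) to the optimizer, which is one way the AdamW-vs-Muon split could matter.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # stand-in for an attention projection matrix
x = rng.standard_normal(4)
lam = 0.5                         # stand-in for sa_lambda[0]

# Parameterization A: learnable scalar applied at runtime (the record's form)
y_scaled = lam * (W @ x)

# Parameterization B: scale folded into the weights at init
y_folded = (lam * W) @ x

# The forward passes agree exactly...
assert np.allclose(y_scaled, y_folded)

# ...but the gradients land on different parameters. With loss L = sum(y):
g = np.ones(4)                    # dL/dy
dlam = g @ (W @ x)                # scalar gradient (would go to AdamW)
dW_A = lam * np.outer(g, x)       # weight gradient in A (would go to Muon)
dW_B = np.outer(g, x)             # weight gradient in B: no separate scalar,
                                  # and the lam factor is absorbed into W itself
```

So even though A and B compute the same function at init, their update dynamics differ: in A the scalar gets its own adaptive learning rate while the weight gradient is shrunk by `lam`, whereas in B everything flows through one matrix parameter.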

These are the baseline logs:

0d29bb84-83d1-446c-8aaf-c4deeb2df2d3.txt - 135780ms

f06653aa-edab-4ff4-bbc0-71a722af0a2b.txt - 135687ms

These are the modified version logs:

51b93080-55c3-4ad7-b27a-9784b8b71143.txt - 135088ms

07d52694-1cbd-49c5-9244-7a7883500fa5.txt - 135145ms
