New WR: sparse bigram gradient comms (-0.6 seconds) #221
Merged
ClassicLarry merged 3 commits into KellerJordan:master on Feb 16, 2026
Conversation
Contributor
FYI, in case it's helpful for catching things up: I integrated the PRs prior to yours into your code and re-ran it for four runs.
Collaborator
Merging at 91.0 (-0.3s) based on re-timing. (This one undoes the 0.3s gain from moving the bigram to GPU.) I will do a merge afterwards to clean up the merge conflicts.
The title timing is a mistake; it's closer to -0.75 seconds.
This is an update & cleanup of #219. No ML changes.
This PR contains three changes:
The sparse comms implementation saves a lot of bandwidth: even at the largest batch size, the amount of communication (for the scatter) is about the same as for the embedding and lm_head layers. The cost is extra compute to reconstitute the rank-local gradient, which is hidden by overlapping it with other communication.
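As a rough illustration of the idea (not this PR's actual implementation, which uses a scatter/gather schedule), the bigram embedding gradient is nonzero only for the rows a rank's batch touched, so each rank can exchange just those (index, value) rows and rebuild the dense gradient locally. The function name and the padded all_gather below are illustrative assumptions; it assumes an initialized process group and CUDA tensors.

```python
import torch
import torch.distributed as dist

def exchange_sparse_grad(dense_grad: torch.Tensor, local_rows: torch.Tensor) -> None:
    """Sum a row-sparse gradient across ranks while only communicating touched rows.

    dense_grad: [vocab, dim] rank-local gradient (nonzero only at local_rows).
    local_rows: 1D int64 tensor of row indices that are nonzero on this rank.
    """
    world = dist.get_world_size()
    device = dense_grad.device

    # Each rank may touch a different number of rows, so share the counts first
    # and pad everything to the maximum (simple, not bandwidth-optimal).
    count = torch.tensor([local_rows.numel()], device=device)
    counts = [torch.empty_like(count) for _ in range(world)]
    dist.all_gather(counts, count)
    max_n = int(torch.cat(counts).max())

    idx_buf = torch.full((max_n,), -1, dtype=torch.int64, device=device)
    idx_buf[: local_rows.numel()] = local_rows
    val_buf = torch.zeros(max_n, dense_grad.size(1), dtype=dense_grad.dtype, device=device)
    val_buf[: local_rows.numel()] = dense_grad[local_rows]

    all_idx = [torch.empty_like(idx_buf) for _ in range(world)]
    all_val = [torch.empty_like(val_buf) for _ in range(world)]
    dist.all_gather(all_idx, idx_buf)
    dist.all_gather(all_val, val_buf)

    # Reconstitute the full dense gradient locally from everyone's sparse rows.
    # This is the extra compute that gets hidden behind other communication.
    dense_grad.zero_()
    for idx, val in zip(all_idx, all_val):
        keep = idx >= 0
        dense_grad.index_add_(0, idx[keep], val[keep])
```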
Moving the bigram index calculation directly into a pinned tensor saved a lot of time in the forward pass, as `.to(device, non_blocking=True)` was very slow. Since I need the index on the CPU as part of the sparse communication scheme, this is unfortunately mutually exclusive with #216, though I think much of the gain in #216 is already folded into moving to a pinned tensor.
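For the pinned-tensor point, here is a minimal sketch of the pattern, assuming a hypothetical `bigram_index_to_gpu` helper; the indexing formula and buffer size are illustrative, not the PR's code. The relevant detail is that `.to(device, non_blocking=True)` can only be truly asynchronous when the source CPU tensor is pinned; from pageable memory it tends to block and run much slower.

```python
import torch

VOCAB_SIZE = 50304  # illustrative vocab size
# Reusable page-locked host buffer; writing the index straight into it avoids
# an extra pageable-to-pinned staging copy on every step.
pinned_idx = torch.empty(64 * 1024, dtype=torch.int64, pin_memory=True)

def bigram_index_to_gpu(tokens_cpu: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Hypothetical helper: build a bigram id per position and copy it to the GPU."""
    n = tokens_cpu.numel()
    out = pinned_idx[:n]
    torch.mul(tokens_cpu, VOCAB_SIZE, out=out)  # write directly into pinned memory
    out[1:] += tokens_cpu[:-1]                  # combine each token with its predecessor
    # The CPU copy (out) is also what the sparse communication scheme needs later,
    # so computing the index on the CPU serves double duty.
    return out.to(device, non_blocking=True)    # async H2D copy from pinned memory
```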
Ablation: a build with only changes 2 & 3 was about 500ms faster than baseline (with `_sparse_comms_active()` hard-coded to `return False`).

The scatter-gather order is a bit under-explored: using a profiler, it's easy to see that the overlap is not perfect despite there being enough transfers queued. I found a configuration that improved timing by another 100ms (and there should be about another 200ms on top of that), but for some reason loss increased.
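A sketch of the scheduling aspect (the issue order and names here are assumptions, not the PR's actual schedule): collectives launched with `async_op=True` return handles that can be awaited much later, so the order in which the dense reductions and the sparse index/value transfers are queued determines how well they overlap with each other and with compute.

```python
import torch
import torch.distributed as dist

def overlapped_grad_comms(dense_grads, sparse_idx, sparse_val):
    """Queue all transfers up front and wait as late as possible."""
    world = dist.get_world_size()
    handles = []

    # Dense gradients: ordinary all-reduce, launched but not awaited yet.
    for g in dense_grads:
        handles.append(dist.all_reduce(g, async_op=True))

    # Sparse bigram gradient: gather indices and values from every rank.
    idx_out = [torch.empty_like(sparse_idx) for _ in range(world)]
    val_out = [torch.empty_like(sparse_val) for _ in range(world)]
    handles.append(dist.all_gather(idx_out, sparse_idx, async_op=True))
    handles.append(dist.all_gather(val_out, sparse_val, async_op=True))

    # ...unrelated compute can run here while the transfers are in flight...

    for h in handles:
        h.wait()
    return idx_out, val_out
```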
Will upload the logs in a bit. See #219 for more details of the sparse communication algorithm until I move them into the readme here.