
New WR: sparse bigram gradient comms (-0.6 seconds)#221

Merged
ClassicLarry merged 3 commits into KellerJordan:master from shenberg:sparse_bigram_scatter
Feb 16, 2026
Conversation

@shenberg
Contributor

shenberg commented Feb 6, 2026

The timing in the title is a mistake; it's closer to -0.75 seconds.

This is an update & cleanup of #219. No ML changes.

This PR contains three changes:

  1. Sparse comms for bigram embeddings
  2. Moved the bigram index calculation to write directly into a pinned tensor
  3. Changed scatter/gather order a bit to improve overlap

The sparse comms implementation saves substantial bandwidth: even at the largest batch size, the scatter communicates about as much data as the embedding and lm_head layers. The cost is extra compute to reconstitute the rank-local gradient, which is hidden by overlap with other communication.
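The core idea can be sketched in plain Python (this is an illustrative sketch, not the PR's actual implementation; `pack_sparse_grad` and `unpack_sparse_grad` are hypothetical names): instead of communicating a dense `[vocab, dim]` gradient, each rank sends only the embedding rows that were touched this step, and the dense gradient is rebuilt on arrival.

```python
# Hypothetical sketch of sparse gradient comms for a bigram embedding table.
# Only (touched indices, their gradient rows) go over the wire; the receiver
# reconstitutes the dense rank-local gradient, trading compute for bandwidth.

def pack_sparse_grad(dense_grad, touched_indices):
    """Keep only the rows of the dense gradient that were actually updated."""
    idx = sorted(set(touched_indices))          # unique row ids, sorted
    rows = [list(dense_grad[i]) for i in idx]   # their gradient rows
    return idx, rows                            # this pair is what gets sent

def unpack_sparse_grad(idx, rows, vocab_size, dim):
    """Rebuild a dense gradient; untouched rows have zero gradient."""
    dense = [[0.0] * dim for _ in range(vocab_size)]
    for i, row in zip(idx, rows):
        dense[i] = list(row)
    return dense
```

The reconstitution step is pure extra compute, which is why it can be overlapped with the remaining communication instead of extending the critical path.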

Writing the bigram index calculation directly into a pinned tensor saved a lot of time in the forward pass, since .to(device, non_blocking=True) from pageable memory was very slow. Because I need the index on the CPU as part of the sparse communication scheme, this is unfortunately mutually exclusive with #216, though I think much of the gain from #216 is already captured by the move to a pinned tensor.
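The pattern looks roughly like this (a minimal sketch with made-up shapes and names, not the PR's code; the bigram-id formula here is illustrative). The point is that `non_blocking=True` only gives a truly asynchronous host-to-device copy when the source tensor is page-locked; from pageable memory it degrades to a slow staged copy.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
pin = torch.cuda.is_available()  # pinning requires a CUDA runtime

seq_len, vocab = 8, 50304
tokens = torch.randint(vocab, (seq_len + 1,))

# Preallocated pinned CPU buffer, written in place every step.
bigram_idx_cpu = torch.empty(seq_len, dtype=torch.long, pin_memory=pin)
torch.mul(tokens[:-1], vocab, out=bigram_idx_cpu)  # write directly into buffer
bigram_idx_cpu.add_(tokens[1:])                    # bigram id = prev * vocab + cur

# The CPU copy stays around for the sparse-comms bookkeeping; the device copy
# can now overlap with other work because the source is pinned.
bigram_idx = bigram_idx_cpu.to(device, non_blocking=True)
```

Keeping the index on the host is what makes this incompatible with moving the whole calculation to the GPU, as #216 does.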

Ablation: a build with only changes 2 and 3 (with _sparse_comms_active() hard-coded to return False) was about 500 ms faster than baseline.

The scatter/gather order is still somewhat under-explored: in a profiler it's easy to see that overlap is not perfect even though enough transfers are queued. I found an ordering that improved timing by another 100 ms (with roughly 200 ms more still available on top of that), but for some reason it increased the loss.

Will upload the logs in a bit. See #219 for more details on the sparse communication algorithm until I move them into the README.

@chrisjmccormick
Contributor

FYI, in case it's helpful for catching things up: I integrated the PRs prior to yours into your code and re-ran it four times.
The updated code and logs are in a subfolder of my PR #233.

@ClassicLarry
Collaborator

Merging at 91.0 (-0.3 s) based on re-timing. (This PR undoes the 0.3 s gain from moving the bigram calculation to the GPU.)

I will do a merge afterwards to clean up the merge conflicts.

@ClassicLarry ClassicLarry merged commit b893aed into KellerJordan:master Feb 16, 2026