Add Sequence Parallelism via Ulysses #45
Conversation
```
# XXX: this was incorrect for GAS
return self.config.epochs * len(self.train_dataloader)  # // self.config.gradient_accumulation_steps
```
@sfc-gh-mwyatt, please confirm that my correction is kosher and then I will remove the comments. This PR makes partial progress on reporting and accounting with GAS>1: enough to make the loss, counters, and wandb reporting correct, but it will need more work to complete and smooth out.
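To spell out the accounting being corrected, here is a minimal standalone sketch (not ArcticTraining's actual API): the training loop consumes every micro-batch, while optimizer steps only happen every GAS micro-batches, so the iteration count must not be divided by GAS.

```
def total_microbatch_iters(epochs: int, batches_per_epoch: int) -> int:
    # the training loop consumes every micro-batch, so this must NOT be
    # divided by gradient_accumulation_steps
    return epochs * batches_per_epoch


def total_optimizer_steps(epochs: int, batches_per_epoch: int, gas: int) -> int:
    # optimizer / lr-scheduler steps only happen at GAS boundaries
    return (epochs * batches_per_epoch) // gas


# e.g. 2 epochs x 100 micro-batches per epoch with GAS=4:
assert total_microbatch_iters(2, 100) == 200
assert total_optimizer_steps(2, 100, 4) == 50
```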
I wonder if it's because of SP? But then the original doesn't take DP into account either, so perhaps it needs more work?
With SP>1, `len(self.train_dataloader)` is `sp_world_size * len(original_train_dataloader)`.
The training loop needs to make all the iterations; it can't make GAS times fewer iterations. It's only the accounting that should skip reporting iterations that don't fall on a GAS boundary.
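For context, a rough sketch of the loop shape described here (hypothetical names, not the trainer's actual code): every micro-batch is iterated, and only the optimizer step and the reporting are gated on the GAS boundary.

```
def train_epoch(model, dataloader, gas, sp_world_size=1):
    # every micro-batch is iterated; nothing is skipped in the loop itself
    for i, micro_batch in enumerate(dataloader, start=1):
        loss = model.train_step(micro_batch)  # hypothetical fwd/bwd on this micro-batch
        if i % gas != 0:
            continue  # not a GAS boundary: no optimizer step, no reporting
        model.optimizer_step()  # hypothetical optimizer/lr-scheduler step
        # with SP>1, len(dataloader) is sp_world_size times the original, so
        # sample/step counters should be normalized by sp_world_size when reporting
        global_step = i // gas
        print(f"step={global_step} loss={loss:.4f}")
```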
This is the Deepspeed counterpart of snowflakedb/ArcticTraining#45 - as the new feature(s) require changes on both sides.

For PR reviewers:

Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it

Features:
- [x] add support for delaying grad addition via `param.ds_grad_is_ready` flag (used when performing tiled compute in an autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP (it's much easier to understand before trying to grok `SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is needed since deepspeed's dist is not up to date with `torch.distributed.nn`)

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
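Of the items in the checklist above, the differentiable `_DimZeroAllToAll` is the smallest building block. A minimal sketch of that pattern follows (this is not the code from the PR; it assumes `torch.distributed` is already initialized and dim 0 divides evenly across the group): the backward of an all-to-all is simply another all-to-all applied to the gradients.

```
import torch
import torch.distributed as dist


class DimZeroAllToAll(torch.autograd.Function):
    """Differentiable all-to-all over dim 0: backward is another all-to-all."""

    @staticmethod
    def forward(ctx, group, x):
        ctx.group = group
        out = torch.empty_like(x)
        # each rank scatters its dim-0 chunks and gathers one chunk from every rank
        dist.all_to_all_single(out, x.contiguous(), group=group)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        grad_input = torch.empty_like(grad_output)
        # route the gradients back the way the activations came
        dist.all_to_all_single(grad_input, grad_output.contiguous(), group=ctx.group)
        return None, grad_input
```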
sfc-gh-mwyatt left a comment
Overall this looks good; however, I'm confused about the SFTTrainer loss implementation. In the INTEGRATIONS doc you show that the step function should call a different loss function when SP is enabled, but this is not reflected in the code.
in `step`:

```
if self.config.sequence_parallel_size == 1:
    # this is the original code
    loss = self.loss(batch)
    self.model.backward(loss)
    ...
else:
    # sp will do backward inside sp_fwd_bwd_loss
    # the returned loss is already averaged across ranks
    loss = self.sp_fwd_bwd_loss(batch)
```
This does not seem to be reflected in the code. Am I missing something?
The doc is outdated: it was written for the original implementation. As I mentioned in the Slack post, only the .py code is ready for review.
I was planning to work on the doc, but got pulled into working on plots. I will get to it now.
I have just rewritten them to reflect reality.
Once reviewed, I think we should move INTEGRATION.md to the deepspeed repo, since that's where the components are. What do you think?
Heads up: this doc has moved to deepspeedai/DeepSpeed#7331.
```
logger:
  level: WARNING
  # level: INFO
```
Should we remove these (presumably) debug comments before merging?
I removed most of these already; I left these ones thinking it would make for a good final version, giving the user an easy way to switch between options they are likely to want. These aren't debug leftovers.
For example, I also left:
`#attn_implementation: sdpa`
and the datasets.
But I can remove them if you feel the user is better off not seeing other options they might want to quickly turn on/off.
@sfc-gh-mwyatt, docs are now ready for review.
This PR implements/ports Sequence Parallelism via DeepSpeed Ulysses.
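For readers new to Ulysses, here is a compact sketch of the core idea (simplified; this is not the `UlyssesSPAttentionHF` code, and the real implementation wraps the all-to-all in a differentiable autograd function like the `_DimZeroAllToAll` mentioned above): each rank starts with a slice of the sequence for all attention heads, an all-to-all regroups the data so each rank holds the full sequence for a subset of heads, ordinary attention runs locally, and an inverse all-to-all restores the sequence sharding.

```
import torch
import torch.distributed as dist


def seq_all_to_all(x, group, scatter_dim, gather_dim):
    # exchange chunks so the tensor becomes sharded along scatter_dim and
    # gathered (whole) along gather_dim across the SP group
    sp = dist.get_world_size(group=group)
    inputs = [t.contiguous() for t in x.chunk(sp, dim=scatter_dim)]
    outputs = [torch.empty_like(inputs[0]) for _ in range(sp)]
    dist.all_to_all(outputs, inputs, group=group)
    return torch.cat(outputs, dim=gather_dim)


def ulysses_attention(q, k, v, attn_fn, sp_group):
    # q, k, v arrive as [seq/sp, batch, heads, head_dim] on each rank
    # 1) trade sequence sharding for head sharding -> [seq, batch, heads/sp, head_dim]
    q, k, v = (seq_all_to_all(t, sp_group, scatter_dim=2, gather_dim=0) for t in (q, k, v))
    # 2) ordinary attention over the full sequence with the local subset of heads
    out = attn_fn(q, k, v)
    # 3) invert the exchange: back to [seq/sp, batch, heads, head_dim]
    return seq_all_to_all(out, sp_group, scatter_dim=0, gather_dim=2)
```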
For PR reviewers:
Readiness status:
Related:
- WIP notes: https://docs.google.com/document/d/1_0McUfwOVhLUJf80KM-GEuemX_l6ab2X5T0XL4LuR70/edit?tab=t.0#heading=h.bw4a5a2iioao
- Ulysses integration in Meg-DS PR: https://github.com/snowflakedb/Megatron-DeepSpeed/commit/a2a476e11c3cee81cb551630590bf716ea2f3b8c
Dependencies: