Add Sequence Parallelism via Ulysses #45
Conversation
```
# XXX: this was incorrect for GAS
return self.config.epochs * len(self.train_dataloader)  # // self.config.gradient_accumulation_steps
```
@sfc-gh-mwyatt, please confirm that my correction is kosher and then I will remove the comments. This PR makes partial progress on reporting and accounting with GAS>1: enough to make the loss, counters, and wandb reporting correct, but it will need more work to complete and smooth out.
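To spell out the accounting being corrected, here is a minimal standalone sketch (not ArcticTraining's actual API): the training loop consumes every micro-batch, while optimizer steps only happen every GAS micro-batches, so the iteration count must not be divided by GAS.

```
def total_microbatch_iters(epochs: int, batches_per_epoch: int) -> int:
    # the training loop consumes every micro-batch, so this must NOT be
    # divided by gradient_accumulation_steps
    return epochs * batches_per_epoch


def total_optimizer_steps(epochs: int, batches_per_epoch: int, gas: int) -> int:
    # optimizer / lr-scheduler steps only happen at GAS boundaries
    return (epochs * batches_per_epoch) // gas


# e.g. 2 epochs x 100 micro-batches per epoch with GAS=4:
assert total_microbatch_iters(2, 100) == 200
assert total_optimizer_steps(2, 100, 4) == 50
```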
I wonder if it's because of SP? But then the original doesn't take DP into account either, so perhaps it needs more work?
With SP>1, `len(self.train_dataloader)` is `sp_world_size * len(original_train_dataloader)`.
The training loop needs to make all the iterations; it can't make GAS times fewer iterations. It's only the accounting that should skip reporting iterations that don't fall on a GAS boundary.
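For context, a rough sketch of the loop shape described here (hypothetical names, not the trainer's actual code): every micro-batch is iterated, and only the optimizer step and the reporting are gated on the GAS boundary.

```
def train_epoch(model, dataloader, gas, sp_world_size=1):
    # every micro-batch is iterated; nothing is skipped in the loop itself
    for i, micro_batch in enumerate(dataloader, start=1):
        loss = model.train_step(micro_batch)  # hypothetical fwd/bwd on this micro-batch
        if i % gas != 0:
            continue  # not a GAS boundary: no optimizer step, no reporting
        model.optimizer_step()  # hypothetical optimizer/lr-scheduler step
        # with SP>1, len(dataloader) is sp_world_size times the original, so
        # sample/step counters should be normalized by sp_world_size when reporting
        global_step = i // gas
        print(f"step={global_step} loss={loss:.4f}")
```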
This is the Deepspeed counterpart of snowflakedb/ArcticTraining#45 - as the new feature(s) require changes on both sides.

For PR reviewers:

Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it

Features:
- [x] add support for delaying grad addition via `param.ds_grad_is_ready` flag (used when performing tiled compute in an autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP (it's much easier to understand before trying to grok `SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is needed since deepspeed's dist is not up to date with `torch.distributed.nn`)

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
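Of the items in the checklist above, the differentiable `_DimZeroAllToAll` is the smallest building block. A minimal sketch of that pattern follows (this is not the code from the PR; it assumes `torch.distributed` is already initialized and dim 0 divides evenly across the group): the backward of an all-to-all is simply another all-to-all applied to the gradients.

```
import torch
import torch.distributed as dist


class DimZeroAllToAll(torch.autograd.Function):
    """Differentiable all-to-all over dim 0: backward is another all-to-all."""

    @staticmethod
    def forward(ctx, group, x):
        ctx.group = group
        out = torch.empty_like(x)
        # each rank scatters its dim-0 chunks and gathers one chunk from every rank
        dist.all_to_all_single(out, x.contiguous(), group=group)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        grad_input = torch.empty_like(grad_output)
        # route the gradients back the way the activations came
        dist.all_to_all_single(grad_input, grad_output.contiguous(), group=ctx.group)
        return None, grad_input
```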
sfc-gh-mwyatt left a comment
Overall this looks good; however, I'm confused about the SFTTrainer loss implementation. In the INTEGRATIONS doc you show that the step function should call a different loss function when SP is enabled, but this is not reflected in the code.
in `step`:

```
if self.config.sequence_parallel_size == 1:
    # this is the original code
    loss = self.loss(batch)
    self.model.backward(loss)
    ...
else:
    # sp will do backward inside sp_fwd_bwd_loss
    # the returned loss is already averaged across ranks
    loss = self.sp_fwd_bwd_loss(batch)
```
This does not seem to be reflected in the code. Am I missing something?
The doc is outdated: it was written for the original implementation. As I mentioned in the Slack post, only the .py code is ready for review.
I was planning to work on the doc, but got pulled into working on plots. I will get to it now.
I have just rewritten them to reflect reality.
Once reviewed, I think we should move INTEGRATION.md to the deepspeed repo, since that's where the components are. What do you think?
Heads up: this doc has moved to deepspeedai/DeepSpeed#7331.
```
logger:
  level: WARNING
  # level: INFO
```
Should we remove these (presumably) debug comments before merging?
I removed most of these already; I left these ones thinking it would make for a good final version, giving the user an easy way to switch between options they are likely to want. These aren't debug leftovers.
For example, I also left:
`#attn_implementation: sdpa`
and the datasets.
But I can remove them if you feel the user is better off not seeing other options they might want to quickly turn on/off.
@sfc-gh-mwyatt, docs are now ready for review.
This PR implements/ports Sequence Parallelism via DeepSpeed Ulysses.
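For readers new to Ulysses, here is a compact sketch of the core idea (simplified; this is not the `UlyssesSPAttentionHF` code, and the real implementation wraps the all-to-all in a differentiable autograd function like the `_DimZeroAllToAll` mentioned above): each rank starts with a slice of the sequence for all attention heads, an all-to-all regroups the data so each rank holds the full sequence for a subset of heads, ordinary attention runs locally, and an inverse all-to-all restores the sequence sharding.

```
import torch
import torch.distributed as dist


def seq_all_to_all(x, group, scatter_dim, gather_dim):
    # exchange chunks so the tensor becomes sharded along scatter_dim and
    # gathered (whole) along gather_dim across the SP group
    sp = dist.get_world_size(group=group)
    inputs = [t.contiguous() for t in x.chunk(sp, dim=scatter_dim)]
    outputs = [torch.empty_like(inputs[0]) for _ in range(sp)]
    dist.all_to_all(outputs, inputs, group=group)
    return torch.cat(outputs, dim=gather_dim)


def ulysses_attention(q, k, v, attn_fn, sp_group):
    # q, k, v arrive as [seq/sp, batch, heads, head_dim] on each rank
    # 1) trade sequence sharding for head sharding -> [seq, batch, heads/sp, head_dim]
    q, k, v = (seq_all_to_all(t, sp_group, scatter_dim=2, gather_dim=0) for t in (q, k, v))
    # 2) ordinary attention over the full sequence with the local subset of heads
    out = attn_fn(q, k, v)
    # 3) invert the exchange: back to [seq/sp, batch, heads, head_dim]
    return seq_all_to_all(out, sp_group, scatter_dim=0, gather_dim=2)
```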
For PR reviewers:
Readiness status:
Related:
- WIP notes: https://docs.google.com/document/d/1_0McUfwOVhLUJf80KM-GEuemX_l6ab2X5T0XL4LuR70/edit?tab=t.0#heading=h.bw4a5a2iioao
- Ulysses integration in Meg-DS PR: https://github.com/snowflakedb/Megatron-DeepSpeed/commit/a2a476e11c3cee81cb551630590bf716ea2f3b8c
Dependencies: