
Conversation

@stas00
Contributor

@stas00 stas00 commented Mar 7, 2025

I'm porting DeepSpeed Ulysses sequence parallelism from Megatron-DeepSpeed to the HF transformers ecosystem so that everybody can use this SP implementation. I'm integrating it into ArcticTraining first, and once everything is in good shape we can integrate it into HF Accelerate and make it available to many frameworks/users.

One of the nuances of the SP implementation is that each rank computes the loss over its shard of the sequence and then the losses/grads are merged together; this allows for sequence lengths of 1M tokens and more.

The problem emerges when the loss is computed.

With unsharded-seqlen logits we end up with (labels shifted left):

input_ids: [1 2 3 4 5 6 7    8   ]
labels   : [1 2 3 4 5 6 7    8   ]
shiftedl : [2 3 4 5 6 7 8 -100]

With sharded-seqlen logits (each GPU processes half the seqlen in this example) we lose label 5 once shifted:

input_ids: [1 2 3    4] [5 6 7    8]
labels   : [1 2 3    4] [5 6 7    8]
shiftedl : [2 3 4 -100] [6 7 8 -100]

So we either need the ForCausalLMLoss API to let the user provide the token to use in place of -100 (5 in this case), or, a much simpler solution, just let the user do the shifting. This PR proposes the latter.
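To make that concrete, here is a minimal sketch (not part of this PR) of how a sequence-parallel caller could do the shifting itself before sharding, so that no label is lost at shard boundaries; shard_pre_shifted_labels, sp_rank and sp_world_size are hypothetical names:

import torch

def shard_pre_shifted_labels(labels: torch.Tensor, sp_rank: int, sp_world_size: int) -> torch.Tensor:
    # labels: [batch, seqlen] over the full, unsharded sequence
    shift_labels = torch.full_like(labels, -100)   # -100 == ignore_index
    shift_labels[:, :-1] = labels[:, 1:]           # shift left; the last position is ignored
    shard_len = labels.shape[1] // sp_world_size   # assumes seqlen divides evenly across ranks
    start = sp_rank * shard_len
    return shift_labels[:, start : start + shard_len]

With the [1 .. 8] example above and 2 ranks, rank 0 gets [2 3 4 5] and rank 1 gets [6 7 8 -100], so label 5 is preserved.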

An alternative API change would be to pass the already-shifted labels via labels=shift_labels (so no new argument for the labels themselves) and add a flag are_labels_shifted=False.

So, the two options:

# option (1)
def ForCausalLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100,
    shift_labels=None, **kwargs
): ...

# option (2)
def ForCausalLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100,
    are_labels_shifted=False, **kwargs
): ...

Whatever works for you is good for me. The PR currently implements option (1).
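For reference, a minimal sketch of how option (1) would be called, assuming shift_labels takes precedence when both it and labels are given:

loss = ForCausalLMLoss(
    logits=logits,
    labels=None,                           # skip the internal shifting
    vocab_size=model.config.vocab_size,
    num_items_in_batch=num_items_in_batch,
    shift_labels=shift_labels,             # already shifted (and sharded) by the caller
)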

@stas00 stas00 marked this pull request as ready for review March 7, 2025 19:40
@Rocketknight1
Member

Related to #36609. This PR seems simple enough that we could probably accept it as-is, but cc @ArthurZucker @Cyrilvallez if you think it's okay with our core philosophy.

@sfc-gh-sbekman
Contributor

Also, please note that at the end of the OP I have an alternative proposal that isn't in the PR, which might be neater. Not sure.

@muellerzr
Contributor

@muellerzr muellerzr left a comment

Taking in shift_labels is the option I also agree with; it solves a few headaches @ArthurZucker and I ran into when it came to what other models take.

@stas00
Contributor Author

stas00 commented Mar 12, 2025

Thank you, Zach! So we just need to decide which API is neater:

# option (1)
def ForCausalLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100,
    shift_labels=None, **kwargs
): ...

# option (2)
def ForCausalLMLoss(
    logits, labels, vocab_size: int, num_items_in_batch: int = None, ignore_index: int = -100,
    are_labels_shifted=False, **kwargs
): ...

@sfc-gh-sbekman
Contributor

sfc-gh-sbekman commented Mar 18, 2025

Could someone please hit the merge button? Or are we waiting for someone else to review?

@ArthurZucker
Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM, thanks for that! Eager to see the full feature merged!

@ArthurZucker ArthurZucker merged commit 8f64b17 into huggingface:main Mar 20, 2025
21 checks passed
@ArthurZucker
Collaborator

Sorry @stas00 for the delay!

@sfc-gh-sbekman
Contributor

Thank you so much, Arthur!

@Triang-jyed-driung

This issue is not resolved. Take Qwen2ForCausalLM, for example:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/modeling_qwen2.py#L842
If one only passes shifted labels, this function will simply return loss=None.
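For context, the gating in the model head looks roughly like this (paraphrased), which is why the loss is skipped when labels is None:

loss = None
if labels is not None:
    loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)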

@stas00
Contributor Author

stas00 commented Apr 5, 2025

I think you're implying that the model head is also shift_labels-aware, but that's not the case.

This PR only changed ForCausalLMLoss, which only works if you pass the batch without labels to the model(**batch) call and then manually calculate the loss using the shifted labels.

Before:

loss = model(**batch).loss

After:

batch, labels = remove_labels(batch)       # pop the labels so the model skips loss computation
outputs = model(**batch)
[...]
shift_labels = do_shift_labels(labels)     # user-side shifting (pseudocode helper)
loss = model.loss_function(logits=outputs.logits, ..., labels=None, ..., shift_labels=shift_labels)

Of course, it's possible to make the Before case support shift_labels, but that would be a much bigger change.

@Triang-jyed-driung

Actually, with v4.50.3, I found that passing model(input_ids=x, labels='anything except None', shift_labels=y) will return the correct loss with shifted labels.
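For example, a minimal sketch of that call pattern (relying on shift_labels flowing through **kwargs):

outputs = model(input_ids=x, labels=x, shift_labels=y)   # any non-None labels triggers the loss branch
loss = outputs.loss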

@stas00
Contributor Author

stas00 commented Apr 6, 2025

Heh, yes, nice! I can see how this would work, because shift_labels gets passed via kwargs. But this is unintended behaviour. I'd suggest bringing it up in a separate issue so it can be made intended.

If accepted, the PR could be:

if labels is not None or kwargs.get("shift_labels", None) is not None:
    loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)

Then labels=None would work and shift_labels would become part of the official API for calling the model head (and would be tested). The tricky part is finding all the models that use the ForCausalLMLoss loss and applying the above change; we would also need to document that the behavior differs for models using this loss function, since other models will not do anything with the shift_labels arg.
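If that were accepted, the intended call would presumably become something like this (hypothetical; batch_without_labels is just the batch with the labels key removed):

outputs = model(**batch_without_labels, shift_labels=shift_labels)   # labels stays None
loss = outputs.loss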

@stas00 stas00 deleted the loss-pre-shifted-labels branch April 6, 2025 16:14
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* [ForCausalLMLoss] allow users to pass shifted labels

Signed-off-by: Stas Bekman <[email protected]>

* style

Signed-off-by: Stas Bekman <[email protected]>

---------

Signed-off-by: Stas Bekman <[email protected]>
