
Conversation

@ydshieh
Collaborator

@ydshieh ydshieh commented May 7, 2025

What does this PR do?

cc @stas00

Follow-up of #36607; see this comment.

The following code snippet gives

on main:

tensor(7.0538, grad_fn=)
None

and with this PR:

tensor(7.0538, grad_fn=)
tensor(7.0538, grad_fn=)

This is particularly necessary for context parallelism to run correctly.

Once the changes are approved, I will update all other places and add documentation.

import torch
from transformers import AutoModelForCausalLM

repo_id = "meta-llama/Llama-3.2-1B"
token = "YOUR_HF_TOKEN"

model = AutoModelForCausalLM.from_pretrained(repo_id, token=token)
input_ids = torch.ones(size=(1, 16), dtype=torch.int64)
labels = input_ids.clone()
# pre-shift the labels: position i holds the target for token i, and the last
# position is padded with -100 so it is ignored by the loss
shift_labels = torch.nn.functional.pad(labels[..., 1:], (0, 1), value=-100)
outputs_with_labels = model(input_ids, labels=labels)
outputs_with_shift_labels = model(input_ids, shift_labels=shift_labels)

print(outputs_with_labels.loss)
print(outputs_with_shift_labels.loss)
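
As a side note on the context-parallel point above: the labels must be shifted before the sequence is sharded across ranks, because the target for the last token of a shard lives on the next rank, so shifting per shard inside the loss would be wrong at the shard boundaries. A hypothetical illustration (not part of this PR):

import torch

# full, unsharded labels on one rank (batch size 1, sequence length 16)
labels = torch.arange(16).unsqueeze(0)

# shift once, globally: position i now holds the target for token i,
# and the final position is padded with -100 so the loss ignores it
shift_labels = torch.nn.functional.pad(labels[..., 1:], (0, 1), value=-100)

# shard the sequence dimension across (say) 2 context-parallel ranks; each rank can
# then pass its own label shard as shift_labels and get a correct local loss
label_shards = shift_labels.chunk(2, dim=-1)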

@github-actions github-actions bot marked this pull request as draft May 7, 2025 09:25
@github-actions
Contributor

github-actions bot commented May 7, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ydshieh ydshieh marked this pull request as ready for review May 7, 2025 09:54
@ydshieh ydshieh requested a review from ArthurZucker May 7, 2025 09:55

loss = None
-if labels is not None:
+if labels is not None or kwargs.get("shift_labels", None) is not None:
Collaborator

why not leave it to the loss function?

Collaborator Author

We can, but it means the loss has to be computed outside the model's forward; the user would have to do

from transformers import ForCausalLMLoss

shift_labels = .... 
loss = ForCausalLMLoss(logits=logits, shift_labels=shift_labels, vocab_size=model.config.vocab_size)

It's not too much work for users, but it would be nice if we could make it easier for them (they only need to take care of preparing shift_labels).

Enabling this API means people can use the same workflow with labels and with shift_labels (i.e. pass them to model.forward and get the loss from the outputs).

Collaborator Author

@ydshieh ydshieh May 7, 2025

Also, we can't just pass shift_labels without passing labels

loss = ForCausalLMLoss(logits=logits, shift_labels=shift_labels, vocab_size=model.config.vocab_size)

as labels is a required positional argument. We have to do

loss = ForCausalLMLoss(logits=logits, labels=None, shift_labels=shift_labels, vocab_size=model.config.vocab_size)

This kind of detail could be hidden from the user if we handle it in the modeling code.

Having to pass labels (no matter what values it contains) when we mean to use shift_labels is kind of confusing.

It could work with model.forward if we pass

model(input_ids=x, labels=labels (or 'anything except None'), shift_labels=shift_labels)

but that is also confusing.

Contributor

@stas00 stas00 May 7, 2025

yeah, someone reported using a side-effect hack - passing labels=non_None_garbage, shift_labels=real_data and getting the model() to compute the loss ;)

so you know people want this feature ;)

honestly the only reason I didn't propose it is because I didn't want to do it for 200+ files ;) So I'm grateful to @ydshieh for taking the lead on this.

Collaborator

what I meant @ydshieh is that if we do this we need to update all models!

@ydshieh ydshieh changed the title update update loss computation in modeling code May 7, 2025
"""

num_items_in_batch: Optional[int]
shift_labels: Optional[torch.Tensor]
Collaborator Author

This is somewhat awkward, as shift_labels is specific to ForCausalLMLoss. Not sure if we want to expose it here or within

class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...

which is in many modeling files.
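
For concreteness, a minimal sketch of the two placements being weighed here (illustrative only, not the actual transformers source):

from typing import Optional, TypedDict

import torch

class LossKwargs(TypedDict, total=False):
    # option 1: expose shift_labels on the shared loss kwargs, next to num_items_in_batch
    num_items_in_batch: Optional[int]
    shift_labels: Optional[torch.Tensor]

class FlashAttentionKwargs(TypedDict, total=False):
    pass  # attention-related kwargs omitted in this sketch

class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs):
    # option 2 would move the shift_labels annotation from LossKwargs into this class,
    # keeping it specific to the causal LM models that define KwargsForCausalLM
    pass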

@stas00
Contributor

stas00 commented May 7, 2025

FWIW, Liger-Kernel has just implemented this feature: linkedin/Liger-Kernel#683
It works in 0.5.9. They had to implement it, otherwise the fused cross-entropy wouldn't be possible with shift_labels, and a huge performance improvement and memory saving would be lost.

@ydshieh ydshieh marked this pull request as draft May 9, 2025 14:58
@ydshieh ydshieh marked this pull request as ready for review May 9, 2025 14:58
@ydshieh
Collaborator Author

ydshieh commented May 15, 2025

@bot /style

1 similar comment
@ydshieh
Collaborator Author

ydshieh commented May 15, 2025

@bot /style

@github-actions
Contributor

Style fixes have been applied. View the workflow run here.

Collaborator

@ArthurZucker ArthurZucker left a comment

Happy to have this merged, just needs:

  • either all models updated with make fix-copies
  • or putting this in the loss_function instead (not sure if possible)


loss = None
-if labels is not None:
+if labels is not None or kwargs.get("shift_labels", None) is not None:
Collaborator

what I meant @ydshieh is that if we do this we need to update all models!

@ydshieh
Collaborator Author

ydshieh commented May 20, 2025

put this in the loss_function instead (not sure if possible)

This is unfortunately impossible: in the current modeling code, the loss is only computed if labels is passed

if labels is not None:
    loss = ...

and if we want to use shift_labels, it will only be taken into account if we also pass labels, which is strange behavior.

I will go with

either all models updated with make fix-copies

🤞

@ydshieh
Collaborator Author

ydshieh commented May 31, 2025

Ready for a review 🙏

Mostly, change

if labels is not None:

to

if labels is not None or kwargs.get("shift_labels", None) is not None:

for CausalLM (or some ForConditionalGeneration) models.
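
To make the per-model edit concrete, here is a rough sketch (assuming a generic Llama-style ForCausalLM forward; not the exact diff) of the loss block after the change:

loss = None
if labels is not None or kwargs.get("shift_labels", None) is not None:
    # self.loss_function resolves to ForCausalLMLoss for causal LM models; when shift_labels
    # arrives via **kwargs it is used directly and the internal label shifting is skipped,
    # so labels may legitimately be None here
    loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)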

@ydshieh
Collaborator Author

ydshieh commented May 31, 2025

I would love to change shift_labels to shifted_labels (also in def ForCausalLMLoss), as shift_labels sounds like a bool argument. Just a nit, however.

@stas00
Contributor

stas00 commented May 31, 2025

It looks like I introduced it here 2 months ago, replicating the long-time pre-existing internal variable name:
8f64b17

I agree that shifted_labels is a better name, but shift_labels has already been adopted in the wild, at least by Liger-Kernel and DeepSpeed. Liger-Kernel has already made a public release, and DeepSpeed's is imminent. I don't know if others have started using it.

@ydshieh
Collaborator Author

ydshieh commented May 31, 2025

Yeah, I understand. No big deal; I guess people focused on making AI go brrrr would never care whether there is an 'ed' or not 😅

@ydshieh ydshieh requested a review from ArthurZucker June 2, 2025 09:03
Collaborator

@ArthurZucker ArthurZucker left a comment

Not a big fan of this, but IDK if we have a better solution. WDYT about a flag saying they are already shifted? (It's just that kwargs are not meant to be used here, but I mean... yeah.)

@ydshieh
Collaborator Author

ydshieh commented Jun 2, 2025

IIRC, @ArthurZucker, you mean we allow labels to already be shifted labels, and we introduce a flag to indicate this, in

def ForCausalLMLoss(

right? Technically it's doable, but the new flag (say, is_shifted) would overlap with the role of shift_labels, which is not ideal.

One possibility is to allow shift_labels to be Optional[Union[torch.Tensor, bool]], and then do the following inside ForCausalLMLoss

    if shift_labels is True and labels is not None:
        shift_labels = labels
    ....

It could work and it would avoid changes in the many modeling files, but you see the downside (the same argument having 2 possible types and behaving differently).
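
For reference, a hypothetical sketch of that Union[torch.Tensor, bool] alternative (illustrative only; the real ForCausalLMLoss also handles num_items_in_batch and other kwargs):

from typing import Optional, Union

import torch

def ForCausalLMLoss(
    logits: torch.Tensor,
    labels: Optional[torch.Tensor],
    vocab_size: int,
    shift_labels: Optional[Union[torch.Tensor, bool]] = None,
    ignore_index: int = -100,
    **kwargs,
) -> torch.Tensor:
    if shift_labels is True and labels is not None:
        # the caller declares that labels are already shifted: reuse them as-is
        shift_labels = labels
    elif shift_labels is None and labels is not None:
        # default path: shift left by one and pad the last position with the ignore index
        shift_labels = torch.nn.functional.pad(labels[..., 1:], (0, 1), value=ignore_index)
    return torch.nn.functional.cross_entropy(
        logits.float().view(-1, vocab_size), shift_labels.view(-1), ignore_index=ignore_index
    )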

@stas00 Any comment here?

@ydshieh
Collaborator Author

ydshieh commented Jun 2, 2025

@ArthurZucker @stas00

This is probably a better and cleaner solution:

https://github.com/huggingface/transformers/pull/38533/files

Let me know your opinions

@stas00
Contributor

stas00 commented Jun 2, 2025

@ydshieh, the last one is a smooth solution!
