Fix XGLM loss computation (PyTorch and TensorFlow) #35878
Conversation
Hi @damianoamatruda, yes, the original code is incorrect! However, a simpler fix would be to change the label padding value to `-100`, which the cross-entropy loss ignores by default.
Hi @Rocketknight1, thank you for the clear explanation! I've updated the PR to shift only the labels, as previously done, and replaced the padding token with the mask value `-100`. I've also updated the PyTorch test to match the changes introduced in the newly merged PR #35659.
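For illustration, a minimal sketch of the scheme described above (not the PR's actual diff), relying on the fact that PyTorch's `F.cross_entropy` ignores targets equal to `-100` by default:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 1, 5, 10
logits = torch.randn(batch, seq_len, vocab)
labels = torch.tensor([[4, 7, 2, 9, 3]])

# Shift only the labels: position i is scored against token i + 1, and the
# trailing slot is filled with -100 instead of the pad token, so that
# F.cross_entropy (ignore_index=-100 by default) skips it entirely.
shift_labels = labels.new_full(labels.shape, -100)
shift_labels[:, :-1] = labels[:, 1:]

loss = F.cross_entropy(logits.view(-1, vocab), shift_labels.view(-1))
```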
Rocketknight1
left a comment
Yes, LGTM now! cc @ArthurZucker for core maintainer review
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: xglm
This comment contains run-slow, running the specified jobs: ['models/xglm'] ...
Hi @damianoamatruda, I'm seeing some failures in the slow tests for XGLM, can you take a look? You can check the CI logs, or to run the slow tests locally you can do something like `RUN_SLOW=1 pytest tests/models/xglm`.
Hi @Rocketknight1, I took a look at the errors, which were related to XGLM and similar models but weren't connected to the loss computation, and fixed them. Now, however, with the latest rebase there are failing tests that aren't related to XGLM. Can you do something about it?
Hi @damianoamatruda, I'm not sure exactly what's causing that! It's likely those tests were just flaky on a past commit - can you try rebasing again? If they still won't go away, then I'll see if we can actually fix or skip them on `main`.
@Rocketknight1, I rebased and the test still fails.
Yeah, that test is a problem on `main`.
Tests are finally green! Pinging @Cyrilvallez for core maintainer review
Great!
Cyrilvallez
left a comment
Hey! All LGTM concerning the loss part!
However, I must say that I am skeptical concerning the change in set/get embeddings. It looks like we are changing the input type of these functions (the layer vs. the underlying layer data), which may be breaking for existing code. Moreover, the failing test explicitly states that it is expected to fail (and it was never fixed).
TL;DR: I'd rather we revert the part on embeddings and keep the loss part 🤗
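For reference, a sketch of the convention at stake, using a tiny randomly initialized XGLM (the config values here are arbitrary, chosen only to keep the model small):

```python
from torch import nn
from transformers import XGLMConfig, XGLMForCausalLM

config = XGLMConfig(vocab_size=128, d_model=32, ffn_dim=64, num_layers=2, attention_heads=2)
model = XGLMForCausalLM(config)

# The accessors deal in the embedding *layer* (an nn.Module), not its
# underlying weight tensor; changing either one to accept or return the
# raw tensor instead would break callers relying on this contract.
emb = model.get_input_embeddings()   # nn.Embedding
weight = emb.weight                  # torch.Tensor (the layer's data)
model.set_input_embeddings(nn.Embedding(weight.size(0), weight.size(1)))
```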
Hi @Cyrilvallez, done! The tests now pass without requiring the commits for the embeddings. How did you fix/disable the failing tests? Thank you for the review.
Thanks for reverting! The failing tests were slow tests triggered by GitHub Actions; they are not run by the usual CI, which is why you cannot see them now!
Is there anything else to do, or is everything okay?
ArthurZucker
left a comment
Let's go! @Cyrilvallez has a conference this week hahah sorry 🤗
BTW, you need to resolve conflicts (probably no changes needed on the non-TF modeling side, no?)
Cyrilvallez
left a comment
Hey @damianoamatruda! Indeed, very sorry, I had a lot going on this week! As you can see, in the meantime XGLM got the loss refactor incorporated, which automatically fixed the issue at hand in PyTorch. The change to the PyTorch modeling file should not be needed anymore. Very happy to add the test and the change to the TensorFlow file though!
This updates the expected output string of `test_xglm_sample` for torch 2.0 to the correct one and removes the one for torch 1.13.1 + cu116 (transformers moved to torch 2.0 with PR #35358).
@ArthurZucker, @Cyrilvallez, no problem, I understand your commitments 🤗
Refactor #35875 moved the loss computation for XGLM into a dedicated function.
Cyrilvallez
left a comment
Oh indeed, I did not notice that #35875 created a dedicated function to ensure BC, but the BC behavior was wrong!
All right, LGTM! Thanks a lot for the fix!! 🤗
Thank you all, it's been a pleasure for me! 🤗
What does this PR do?
This PR fixes the loss computation for XGLM in both PyTorch and TensorFlow implementations.
The labels were shifted by one and the padding token was appended, causing artificial loss contributions, inconsistencies between non-padded and right-padded sequences, and a potential bias toward predicting padding tokens.
The updated implementations ignore the last logit and do not append the padding token to the labels, aligning with the behavior of GPT-2 and other models.
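A schematic contrast of the two behaviors (illustrative shapes and `pad_token_id`, not the actual modeling code):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab, pad_token_id = 1, 5, 10, 1
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(2, vocab, (batch, seq_len))

# Before: shift the labels and append the pad token, so the final logit is
# also trained to predict padding, an artificial loss contribution.
old_labels = torch.cat([labels[:, 1:], labels.new_full((batch, 1), pad_token_id)], dim=1)
old_loss = F.cross_entropy(logits.view(-1, vocab), old_labels.view(-1))

# After (GPT-2 style): drop the last logit and shift the labels, so only
# real next-token predictions contribute to the loss.
new_loss = F.cross_entropy(logits[:, :-1, :].reshape(-1, vocab), labels[:, 1:].reshape(-1))
```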
This computation dates back to #22540, where it was ported from the PyTorch implementation to the TensorFlow one for consistency. In this PR I've reverted the TensorFlow implementation to its previous, valid behavior and updated the PyTorch implementation to match it.
I've also added XGLM tests to ensure that the losses of non-padded and padded inputs match.
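The idea behind the check, sketched with a tiny randomly initialized model (config values arbitrary; this is not the PR's actual test):

```python
import torch
from transformers import XGLMConfig, XGLMForCausalLM

config = XGLMConfig(vocab_size=128, d_model=32, ffn_dim=64, num_layers=2, attention_heads=2)
model = XGLMForCausalLM(config).eval()

input_ids = torch.tensor([[2, 10, 11, 12]])
loss = model(input_ids=input_ids, labels=input_ids).loss

# Right-pad the same sequence, mask the padding out of the attention, and
# set the padded label positions to -100 so they are ignored by the loss.
pad = config.pad_token_id
padded_ids = torch.cat([input_ids, input_ids.new_full((1, 2), pad)], dim=1)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])
padded_labels = padded_ids.masked_fill(attention_mask == 0, -100)
padded_loss = model(input_ids=padded_ids, attention_mask=attention_mask, labels=padded_labels).loss

torch.testing.assert_close(loss, padded_loss)  # should hold (up to float tolerance) after the fix
```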
This bug was discovered in a joint project while collaborating with @mdrpanwar and @ayushkumartarun.
Who can review?
@Rocketknight1 @gante @ArthurZucker