[generate] Fix encoder decoder models attention mask #36018
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
zucchini-nlp left a comment:
Thanks for noticing it's broken, should've triggered some slow tests before merging!
I agree that we'd need to pass both attention masks for encoder-decoder models if available. And the attention mask in prepare_inputs should come from the decoder for further manipulations on top.
Overall LGTM, but it would be cleaner if we could pass the encoder mask from within kwargs in step 7, imo. CC @gante for one more opinion on generation.
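For readers following the discussion, here is a purely illustrative sketch of the bookkeeping being described: for encoder-decoder models two masks live in `model_kwargs`, and the one that grows step by step during generation is the decoder's. The helper names below are assumptions for illustration only; they do not mirror the actual code in `generation/utils.py`.

```python
# Illustrative only: keeping the two masks apart for encoder-decoder models.
# Helper names are hypothetical, not the utils.py implementation.
import torch


def split_masks(model_kwargs: dict, is_encoder_decoder: bool):
    if is_encoder_decoder:
        # The encoder mask stays fixed for the whole generation.
        encoder_mask = model_kwargs.get("attention_mask")
        # The decoder mask is the one extended by one position per step.
        decoder_mask = model_kwargs.get("decoder_attention_mask")
        return encoder_mask, decoder_mask
    # Decoder-only models have a single, growing mask.
    return None, model_kwargs.get("attention_mask")


def extend_decoder_mask(decoder_mask: torch.Tensor) -> torch.Tensor:
    # Append a 1 for the token generated at this step.
    ones = decoder_mask.new_ones((decoder_mask.shape[0], 1))
    return torch.cat([decoder_mask, ones], dim=-1)
```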
agree that having a … (quoted snippet: transformers/src/transformers/generation/utils.py, lines 1957 to 1959 at 9d2056f)
LGTM, thank you for the fix 🙏
P.S.: can we somehow add a fast test to ensure this doesn't happen again? This is a pretty basic feature and should be covered by fast tests.
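One possible shape for such a fast test, sketched against a tiny randomly initialized T5. The config values and the padding-invariance assertion are assumptions about how it could be written, not the test that actually landed in the repo.

```python
# Sketch of a possible fast test (an assumption, not the actual repo test): if
# generate() really uses the encoder attention mask, a right-padded batch and the
# unpadded sequence should produce the same greedy output.
import torch
from transformers import T5Config, T5ForConditionalGeneration


def test_encoder_attention_mask_is_used_in_generate():
    torch.manual_seed(0)
    config = T5Config(
        vocab_size=99, d_model=32, d_ff=37, num_layers=2, num_heads=4, d_kv=8,
        pad_token_id=0, eos_token_id=1, decoder_start_token_id=0,
    )
    model = T5ForConditionalGeneration(config).eval()

    input_ids = torch.tensor([[5, 6, 7, 8]])
    padded_ids = torch.tensor([[5, 6, 7, 8, 0, 0]])   # same sequence, right-padded
    padded_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # mask hides the padding

    out = model.generate(input_ids=input_ids, max_new_tokens=5, do_sample=False)
    out_padded = model.generate(
        input_ids=padded_ids, attention_mask=padded_mask, max_new_tokens=5, do_sample=False
    )
    # If the encoder mask were silently dropped, the padded batch would attend to
    # the pad tokens and the two generations could diverge.
    assert out.tolist() == out_padded.tolist()
```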
@zucchini-nlp As @eustlb wrote, historically the naming convention is for the …
@eustlb I'm going to rerun the tests until CI gets green, and then merge this PR -- not having this fix is blocking me :D
What does this PR do?
#35852 introduced a bug where the attention mask is lost for encoder-decoder models.
This PR implements a naïve fix.
cf. #36020
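For context, a quick user-level sketch of the path the regression affects; the checkpoint and prompts below are illustrative assumptions.

```python
# Illustrative usage (checkpoint and prompts are placeholders): with a padded
# batch, the encoder `attention_mask` returned by the tokenizer must survive all
# the way into generate(), which is the path the regression broke for
# encoder-decoder models.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer(
    ["translate English to German: Hello", "translate English to German: How are you today?"],
    padding=True,
    return_tensors="pt",
)

out = model.generate(**inputs, max_new_tokens=20)  # passes input_ids + attention_mask
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```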