Fix: remove the redundant snippet of _whole_word_mask #36759
|
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the |
Rocketknight1
left a comment
Yes, this is correct! There should be no duplicates anywhere in cand_indexes and so no way for this block to be relevant.
I suspect this code hasn't been maintained in a while - can you check the rest of the function to confirm it's bug-free as well, and then ping me whenever you're happy for me to merge?
|
I've reviewed the rest of the function, and I believe most of it is correct. However, I also identified a couple of potential issues:
Overall, the function looks mostly correct. However, I've noted a few code snippets that might deviate from the intended behavior regarding special token handling and the deterministic masking calculation. @Rocketknight1 What do you think? Would you like me to create a separate PR to address these potential inconsistencies? |
|
@HuangBugWei, those are all good points! This collator seems to be specifically designed for |
0501079 to
9073008
Compare
What does this PR do?
I noticed that cand_indexes is constructed using the following logic:
Since i is directly obtained from enumerate(input_tokens), it is guaranteed to be unique across iterations. As a result, there will be no duplicate elements in the flattened cand_indexes.
Given this, the following check appears to be redundant, as it will never be triggered:
I suggest removing this redundant check to simplify the logic and improve efficiency and readability.
Let me know if there’s any edge case I might have missed!
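To make the uniqueness argument easy to check, here is a minimal sketch. It is not the library's actual code: build_cand_indexes and count_overlap_hits are hypothetical helper names, and the "[CLS]"/"[SEP]" special tokens and the "##" continuation prefix are assumptions based on BERT-style WordPiece tokenization. The sketch groups positions taken from enumerate(input_tokens) and then counts how often an "already covered" check would fire.

```python
def build_cand_indexes(input_tokens):
    """Group token positions into whole words (hypothetical reconstruction)."""
    cand_indexes = []
    for i, token in enumerate(input_tokens):
        # Assumption: these special tokens are never masking candidates.
        if token in ("[CLS]", "[SEP]"):
            continue
        # Assumption: a "##" piece continues the previous word group.
        if cand_indexes and token.startswith("##"):
            cand_indexes[-1].append(i)
        else:
            cand_indexes.append([i])
    return cand_indexes


def count_overlap_hits(cand_indexes):
    """Count groups that contain an index already seen in an earlier group."""
    covered = set()
    hits = 0
    for index_set in cand_indexes:
        if any(index in covered for index in index_set):
            hits += 1
            continue
        covered.update(index_set)
    return hits


tokens = ["[CLS]", "un", "##believ", "##able", "results", "[SEP]"]
groups = build_cand_indexes(tokens)
flat = [i for group in groups for i in group]

print(groups)                      # [[1, 2, 3], [4]]
print(len(flat) == len(set(flat)))  # True: every position appears once
print(count_overlap_hits(groups))   # 0: the overlap check never fires
```

Because each position i is appended to exactly one group, the flattened list cannot contain duplicates, so under these assumptions an overlap check of this shape is dead code.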