skip_special_tokens has different behavior between slow and fast tokenizer #23250
I'd like to confirm my understanding of the concept, since PR 23312 is in progress: in 🤗 Transformers, for both slow and fast tokenizers, there are only two types of tokens. Please let me know if there are any misunderstandings.
Hey! Thanks for reporting this!
Now, about the core of the issue: you have a good grasp of what is going on, good job! 🤗 And thanks for taking the time to dig in. T5 is a bit of a special case because it uses a hack in its slow tokenizer. The core issue is that the slow tokenizer does not register the added token as a special token, so it is not skipped when decoding.
One thing to note is that some of the added tokens can also be special tokens.
Thanks for your reply. So for the example above, which behavior is expected, the slow or the fast tokenizer's?
In this case, the fast tokenizer's behavior is the expected one. It will be addressed in the linked PR; this is mostly due to the fact that, in the slow tokenizer, the added token was not properly added to the list of special tokens.
The PR will be merged this week!
System Info

`transformers` version: 4.26.1

Who can help?

@ArthurZucker

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Hi, recently I found a subtle difference between the slow tokenizer and the fast tokenizer. Here is an example:
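A minimal sketch of the kind of setup described here, assuming a `t5-base` checkpoint and the token added via `add_tokens(..., special_tokens=True)` (both are assumptions on my side):

```python
from transformers import AutoTokenizer

# Load both implementations of the same checkpoint (t5-base is an assumption).
slow = AutoTokenizer.from_pretrained("t5-base", use_fast=False)
fast = AutoTokenizer.from_pretrained("t5-base", use_fast=True)

# Register the same token in both tokenizers, marking it as special.
slow.add_tokens(["ஐ"], special_tokens=True)
fast.add_tokens(["ஐ"], special_tokens=True)

text = "ஐ hello world"

# With skip_special_tokens=True the fast tokenizer drops "ஐ" from the output,
# while the slow tokenizer keeps it.
print(slow.decode(slow.encode(text), skip_special_tokens=True))
print(fast.decode(fast.encode(text), skip_special_tokens=True))
```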
Here is some more information about the issue. I'm not a native English speaker; I hope this is clear.
When a token is added as a special token, the fast tokenizer calls `tokenizers.Tokenizer.add_special_tokens(tokens)`, so the token `ஐ` is added to the vocabulary, viewed as a "special token", and never processed by `tokenizer.model`. The slow tokenizer treats `ஐ` as a "normal token", so it will not be skipped. By the way, I read the related source code: when `skip_special_tokens=True`, the slow tokenizer only skips the ids in `self.all_special_ids`, but `ஐ` is not stored there; it is stored in `self.added_tokens_encoder`.

I read some 🤗 official documents and struggled to figure out the meaning of the so-called "special token", and I realize it is a subtle concept. Here is my thought: tokens can be divided into these categories:

- Special tokens such as `bos_token`, `eos_token`, ..., `additional_special_tokens`: the main purpose of these tokens is in the encode post-processing pipeline. When these tokens appear in the input text, the slow tokenizer in most cases also includes them in `self.unique_no_split_tokens`, so they will not be split; I don't know how the fast tokenizer handles this case.
- User-added tokens: in both the slow and fast case, these user-added tokens will never be split.
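A small sketch of what this looks like in practice, assuming the same `slow` tokenizer as in the example above:

```python
# The added token is recorded in added_tokens_encoder...
print(slow.added_tokens_encoder)          # e.g. {'ஐ': <new id>}

# ...but its id is not part of all_special_ids, which is the only list
# that decode(skip_special_tokens=True) consults in the slow tokenizer.
token_id = slow.convert_tokens_to_ids("ஐ")
print(token_id in slow.all_special_ids)   # False in 4.26.1, so it is not skipped
```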
Please let me know if there are any misunderstandings.
Several weeks ago, I submitted issue 23001 about the `return_overflowing_tokens` behavior, which was considered a specific feature of the fast tokenizer, so it's a feature, not a bug. In general, I want to know whether the differences between the slow and fast tokenizers should be viewed as features or as bugs.

Expected behavior
The slow tokenizer should behave the same as the fast tokenizer.