Reporting a failing API design
This is mostly to help me record some of the biggest issues with the current API for adding tokens.
This is linked to #23909. Here is a simple snippet:
>>> from transformers import AddedToken, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base", use_fast=False)
>>> new_toks = [
...     AddedToken("[ABC]", normalized=False),
...     AddedToken("[DEF]", normalized=False),
...     AddedToken("GHI IHG", normalized=False),
... ]
>>> tokenizer.add_tokens(new_toks)
>>> tokenizer.add_tokens([AddedToken("[SAMPLE]", normalized=True)], special_tokens=True)
>>> print(tokenizer.added_tokens_encoder)
>>> print(tokenizer.all_special_ids)

This will show that the newly added token ([SAMPLE]) is not part of all_special_ids.
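To make the mismatch explicit, a quick check along these lines (a sketch; the exact id depends on the vocabulary):

>>> sample_id = tokenizer.convert_tokens_to_ids("[SAMPLE]")
>>> sample_id in tokenizer.all_special_ids  # added with special_tokens=True, yet not registered as special
False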
However, all_special_ids is used when decoding, to check whether the token should be skipped or not:

for token in filtered_tokens:
    if skip_special_tokens and token in self.all_special_ids:
        continue
    if token in self.added_tokens_encoder:
        if current_sub_text:
            sub_texts.append(self.convert_tokens_to_string(current_sub_text))
            current_sub_text = []
        sub_texts.append(token)
    else:
        current_sub_text.append(token)

Thus:
>>> encoded = tokenizer.encode("[ABC] [DEF][SAMPLE]", add_special_tokens=False)
>>> tokenizer.decode(encoded, skip_special_tokens=True)
"[ABC] [DEF][SAMPLE]"

However, the token is in added_tokens_encoder but not in additional_special_tokens.
Now imagine you want spaces_between_special_tokens: this will add spaces between all added tokens, because the check is whether a token is part of tokenizer.added_tokens_encoder (see the sketch after the example below).
>>> encoded = tokenizer.encode("[ABC] [DEF][SAMPLE]", add_special_tokens=False)
>>> tokenizer.decode(encoded, spaces_between_special_tokens=True)
"[ABC] [DEF] [SAMPLE]"
>>> tokenizer.decode(encoded, spaces_between_special_tokens=False)
"[ABC][DEF][SAMPLE]"