
[Tokenizer] skip_special_tokens not working as expected  #24316

@ArthurZucker

Description


Reporting a failing API design

This is mostly to help me record some of the biggest issues with the current API for adding tokens.

This is linked to #23909. Here is a simple snippet:

>>> from transformers import AutoTokenizer, AddedToken
>>> tokenizer = AutoTokenizer.from_pretrained("t5-base", use_fast=False)
>>> new_toks = [
...     AddedToken("[ABC]", normalized=False),
...     AddedToken("[DEF]", normalized=False),
...     AddedToken("GHI IHG", normalized=False),
... ]
>>> tokenizer.add_tokens(new_toks)
>>> tokenizer.add_tokens([AddedToken("[SAMPLE]", normalized=True)], special_tokens=True)
>>> print(tokenizer.added_tokens_encoder)
>>> print(tokenizer.all_special_ids)

Running this shows that the newly added token ([SAMPLE]) is not part of all_special_ids. However, all_special_ids is exactly what decoding uses to decide whether a token should be skipped:

        for token in filtered_tokens:
            if skip_special_tokens and token in self.all_special_ids:
                continue
            if token in self.added_tokens_encoder:
                if current_sub_text:
                    sub_texts.append(self.convert_tokens_to_string(current_sub_text))
                    current_sub_text = []
                sub_texts.append(token)
            else:
                current_sub_text.append(token)

Thus, even with skip_special_tokens=True, [SAMPLE] is not skipped:

>>> encoded = tokenizer.encode("[ABC] [DEF][SAMPLE]", add_special_tokens=False)
>>> tokenizer.decode(encoded, skip_special_tokens=True)
"[ABC] [DEF][SAMPLE]"

This is because the token lives in added_tokens_encoder but not in additional_special_tokens, so it never ends up in all_special_ids.
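
To make the mismatch concrete, here is a quick check run after the snippet above (the attribute names are the slow tokenizer's public ones; the True/False results are what I observe, so treat this as a sketch of the state rather than a spec):

>>> "[SAMPLE]" in tokenizer.added_tokens_encoder
True
>>> "[SAMPLE]" in tokenizer.additional_special_tokens
False
>>> tokenizer.convert_tokens_to_ids("[SAMPLE]") in tokenizer.all_special_ids
False

For comparison, adding the token through tokenizer.add_special_tokens({"additional_special_tokens": [...]}) does register it in additional_special_tokens and all_special_ids, which is why the two entry points behave differently.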

Now imagine you want spaces_between_special_tokens. That option adds spaces between all added tokens, because the decoding loop only checks whether a token is part of tokenizer.added_tokens_encoder:

>>> encoded = tokenizer.encode("[ABC] [DEF][SAMPLE]", add_special_tokens=False)
>>> tokenizer.decode(encoded, spaces_between_special_tokens=True)
"[ABC] [DEF] [SAMPLE]"
>>> tokenizer.decode(encoded, spaces_between_special_tokens=False)
"[ABC][DEF][SAMPLE]"
