Description
System Info
platform==Ubuntu18.04
python==3.10
transformers==4.29.2
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
`</s>` is the special eos token of LLaMATokenizer(Fast), so it is expected to be recognized as a single token when encoding text. However, the two tokenizers behave differently:
```python
>>> import transformers
>>> t1 = transformers.AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)
>>> t2 = transformers.AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)
>>> text = "I love you.</s>"
>>> t1(text)
{'input_ids': [1, 306, 5360, 366, 21106, 29879, 29958], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
>>> t2(text)
{'input_ids': [1, 306, 5360, 366, 29889, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Also, LLaMATokenizerFast returns `token_type_ids` but LLaMATokenizer does not.
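Until the fast tokenizer handles the special token, one common workaround is to avoid embedding the literal `</s>` string in the text and instead append the eos id to the encoded ids. A minimal sketch, assuming a checkpoint that adds BOS but not EOS by default (as the LLaMA tokenizers do); `encode_with_eos` is a hypothetical helper, not a transformers API:

```python
def encode_with_eos(tokenizer, text):
    # Encode the plain text; for LLaMA this prepends bos (id 1) but not eos.
    ids = tokenizer(text)["input_ids"]
    # Close the sequence with the eos id explicitly instead of relying on the
    # tokenizer to recognize a literal "</s>" string in the input.
    return ids + [tokenizer.eos_token_id]
```

With the slow tokenizer above, `encode_with_eos(t2, "I love you.")` should match the `[1, 306, 5360, 366, 29889, 2]` output, and the same call works with the fast tokenizer since no special-token string ever reaches the subword model.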
Expected behavior
LLaMATokenizerFast should be consistent with LLaMATokenizer.