Description
Hi all, I used the tokenizer to process data for a LLaMA model (already converted to HF format) and set:
tokenizer = AutoTokenizer.from_pretrained(llama_model_id, model_max_length=1024,
                                          padding_side='right', trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "</s>",
        "unk_token": "</s>",
    }
)
tokenizer.pad_token = tokenizer.eos_token
When tokenizing a piece of text that ends with the eos_token:
tokenizer(['ASSISTANT: Hello!</s>']) # there is no space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 829, 29879, 29958]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The eos_token `</s>` is encoded to 829, 29879, 29958, which means `</s>` is regarded as the three pieces `</`, `s`, and `>`.
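For reference, one quick sanity check (a sketch, assuming the same tokenizer object loaded above) is to map those ids back to vocabulary pieces:

# Map the ids from the output above back to vocabulary pieces
print(tokenizer.convert_ids_to_tokens([829, 29879, 29958]))
# expected to show '</', 's', '>' rather than the single </s> token
print(tokenizer.eos_token_id)  # 2 in the standard LLaMA vocabulary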
tokenizer(['ASSISTANT: Hello! </s>']) # there is a space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 2]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}
This time, `</s>` is encoded correctly (token id is 2).
Given the description above, does this mean we should add a space between the text and the eos_token? However, I find that many popular projects such as Alpaca concatenate text with the eos_token without a space.
I previously thought the tokenizer encoded text in a greedy style, so the eos_token would be encoded correctly with or without a space. However, the experiments above do not seem to support that. Could anyone help me see whether I have misunderstood something? Thanks.
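For what it's worth, one workaround I have seen (a sketch, assuming the tokenizer set up above; not taken from Alpaca or any other specific project) is to append the eos token id to the tokenized ids instead of concatenating the string "</s>" to the text, which sidesteps the whitespace question entirely:

# Append eos_token_id after tokenization rather than appending "</s>" to the raw text
text = 'ASSISTANT: Hello!'
encoded = tokenizer(text)
input_ids = encoded['input_ids'] + [tokenizer.eos_token_id]
attention_mask = encoded['attention_mask'] + [1]
print(input_ids)  # ends with 2 (the eos id) regardless of whitespace

Depending on the transformers version, the LLaMA tokenizer also accepts an add_eos_token flag at load time, which may achieve the same effect.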