
[Bug]? how does the tokenizer encode the special tokens? #1263

Closed
@vpegasus

Description


Hi all, I used the tokenizer to process data for a LLaMA model (already converted to HF format) with the following setup:

from transformers import AutoTokenizer

# llama_model_id: path or id of the LLaMA checkpoint converted to HF format
tokenizer = AutoTokenizer.from_pretrained(
    llama_model_id, model_max_length=1024, padding_side='right',
    trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "</s>",
        "unk_token": "</s>",
    })
tokenizer.pad_token = tokenizer.eos_token

When tokenizing a piece of text that ends with the eos_token:

tokenizer(['ASSISTANT: Hello!</s>']) # there is no space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 829, 29879, 29958]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

The eos_token </s> is encoded as 829, 29879, 29958, which means </s> is split into three pieces: </, s, and >.

tokenizer(['ASSISTANT: Hello! </s>'])  # there is a space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 2]],
  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]],
  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}

This time, </s> is encoded correctly (token id 2).

Given the above, does this mean we should add a space between the text and the eos_token? However, I see that many popular projects, such as Alpaca, concatenate the text with the eos_token without a space.

I previously thought the tokenizer encodes text greedily, so the eos_token would be encoded correctly with or without a space. However, the experiments above do not seem to support that.

Could anyone help me understand whether I have misunderstood something? Thanks.
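For reference, one possible workaround that does not depend on how "</s>" is parsed out of the text is to encode the text without the marker and append the eos id explicitly. This is only a sketch: encode_with_eos and its parameters are hypothetical names, not part of transformers, and the encode callable stands in for whatever tokenizer call is used.

```python
from typing import Callable, List


def encode_with_eos(
    encode: Callable[[str], List[int]],
    text: str,
    eos_id: int,
    eos_text: str = "</s>",
) -> List[int]:
    """Encode `text` and make sure the result ends with exactly one EOS id.

    `encode` is any function mapping a string to token ids (hypothetical
    stand-in for a real tokenizer call).
    """
    # Drop a literal trailing EOS marker so it is never split into pieces.
    if text.endswith(eos_text):
        text = text[: -len(eos_text)]
    ids = list(encode(text))
    # Append the EOS id unless the tokenizer already added it.
    if not ids or ids[-1] != eos_id:
        ids.append(eos_id)
    return ids


def toy_encode(text: str) -> List[int]:
    """Toy encoder used only for demonstration: one id per character."""
    return [ord(c) for c in text]


with_marker = encode_with_eos(toy_encode, "Hi</s>", eos_id=2)
without_marker = encode_with_eos(toy_encode, "Hi", eos_id=2)
print(with_marker, without_marker)
```

With the tokenizer from this issue, the call would presumably look like encode_with_eos(lambda t: tokenizer(t)["input_ids"], "ASSISTANT: Hello!</s>", tokenizer.eos_token_id). It sidesteps the question of whether a space is needed, but it does not explain why the two inputs above tokenize differently.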
