Description
Hi all, I used the tokenizer to process data for a LLaMA model (already converted to HF format) and set:
tokenizer = AutoTokenizer.from_pretrained(llama_model_id, model_max_length=1024,
                                          padding_side='right', trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "</s>",
        "unk_token": "</s>",
    }
)
tokenizer.pad_token = tokenizer.eos_token
When tokenizing a piece of text that ends with the eos_token:
tokenizer(['ASSISTANT: Hello!</s>']) # there is no space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 829, 29879, 29958]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The eos_token `</s>` is encoded to 829, 29879, 29958, which means `</s>` is regarded as the three pieces `</`, `s`, and `>`.
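For reference, one quick sanity check (a sketch, assuming the same tokenizer object loaded above) is to map those ids back to vocabulary pieces:

# Map the ids from the output above back to vocabulary pieces
print(tokenizer.convert_ids_to_tokens([829, 29879, 29958]))
# expected to show '</', 's', '>' rather than the single </s> token
print(tokenizer.eos_token_id)  # 2 in the standard LLaMA vocabulary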
tokenizer(['ASSISTANT: Hello! </s>']) # there is a space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 2]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}
This time, `</s>` is encoded correctly (token id is 2).
Given the description above, does this mean we should add a space between the text and the eos_token? However, I find that many popular projects such as Alpaca concatenate text with the eos_token without a space.
I previously thought the tokenizer encoded text in a greedy style, so the eos_token would be encoded correctly with or without a space. However, the experiments above do not seem to support that. Could anyone help me see whether I have misunderstood something? Thanks.
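For what it's worth, one workaround I have seen (a sketch, assuming the tokenizer set up above; not taken from Alpaca or any other specific project) is to append the eos token id to the tokenized ids instead of concatenating the string "</s>" to the text, which sidesteps the whitespace question entirely:

# Append eos_token_id after tokenization rather than appending "</s>" to the raw text
text = 'ASSISTANT: Hello!'
encoded = tokenizer(text)
input_ids = encoded['input_ids'] + [tokenizer.eos_token_id]
attention_mask = encoded['attention_mask'] + [1]
print(input_ids)  # ends with 2 (the eos id) regardless of whitespace

Depending on the transformers version, the LLaMA tokenizer also accepts an add_eos_token flag at load time, which may achieve the same effect.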