The entry of `\n` in `vocab.txt` is causing token index shifting

It seems the `\n` entry is shifting the token indices after line 10295 in `vocab.txt`.

```
$ less -N vocab.txt
...
  10294 ##錄
  10295 
  10296 
  10297 する
```

Fortunately, I did not find any performance degradation in downstream tasks caused by this index shift, but I got a message from the following line while executing `save_pretrained()`.
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357
```
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
```
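Here is a minimal sketch of how the gap could arise. It assumes the vocabulary is read line by line, the trailing newline is stripped, and each token is mapped to its line index. That is a simplification and not the actual `transformers` loader. The helper names `load_vocab_lines` and `find_missing_indices` and the toy vocabulary are hypothetical.

```python
from collections import OrderedDict


def load_vocab_lines(lines):
    """Map each token to its 0-based line index; a repeated token keeps the last index."""
    vocab = OrderedDict()
    for index, line in enumerate(lines):
        token = line.rstrip("\n")
        vocab[token] = index
    return vocab


def find_missing_indices(vocab):
    """Return the indices that no token maps to."""
    used = set(vocab.values())
    if not used:
        return []
    return [i for i in range(max(used) + 1) if i not in used]


# Toy vocabulary: a "\n" token written as a raw newline yields two empty lines.
lines = ["[PAD]\n", "##a\n", "\n", "\n", "b\n"]
vocab = load_vocab_lines(lines)
missing = find_missing_indices(vocab)
print(missing)  # [2]
```

In this toy case both empty lines collapse into the single token `""`, which keeps only the later index, so one index is left unused. That would be consistent with the "not consecutive" message above.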

I think line 10295 in `vocab.txt` should instead hold some non-existent word, such as `!!!DIFECTED!!!`.
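As a rough sketch of that suggestion, the first empty line could be replaced with a placeholder so that every line is a distinct token and all later indices stay where they are. The helper `fill_first_empty_line` and the toy vocabulary are hypothetical, and whether this is the right fix for the released file is an assumption.

```python
def fill_first_empty_line(lines, placeholder):
    """Return a copy of lines with the first empty entry replaced by placeholder."""
    fixed = list(lines)
    for i, line in enumerate(fixed):
        if line.rstrip("\n") == "":
            fixed[i] = placeholder + "\n"
            break
    return fixed


# Toy vocabulary with two consecutive empty lines, as in the excerpt above.
lines = ["[PAD]\n", "##a\n", "\n", "\n", "b\n"]
fixed = fill_first_empty_line(lines, "!!!DIFECTED!!!")
tokens = [line.rstrip("\n") for line in fixed]
print(len(tokens) == len(set(tokens)))  # True: every line is now a distinct token
```

The number of lines does not change, so tokens after the placeholder keep their current indices.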

Also see #57.
