It seems that \n is shifting the token indices after line 10295 in vocab.txt.
$ less -N vocab.txt
...
10294 ##錄
10295
10296
10297 する
Fortunately, I did not find any performance degradation in downstream tasks caused by this index shift, but I got an error message while executing save_pretrained().
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
I think line 10295 in vocab.txt should be some non-existent word such as !!!DIFECTED!!!.
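The following is a minimal sketch of how I think the shift and the "not consecutive" message arise. It is not the library's code: `load_vocab_lines` and `missing_ids` are hypothetical helpers, the four-token vocabulary is a toy example, and I am assuming the loader assigns each line its 0-based line number as the token id and strips the trailing newline.

```python
from collections import OrderedDict


def load_vocab_lines(lines):
    """Toy loader (assumption): token -> 0-based line index, newline stripped."""
    vocab = OrderedDict()
    for index, line in enumerate(lines):
        # Two empty lines collapse into one "" key; the later index wins.
        vocab[line.rstrip("\n")] = index
    return vocab


def missing_ids(vocab):
    """Return ids inside the vocab's range that no token maps to."""
    used = set(vocab.values())
    return [i for i in range(max(used) + 1) if i not in used]


# Toy vocabulary: a literal "\n" token written one-token-per-line
# occupies two empty lines in the file instead of one.
tokens = ["[PAD]", "##錄", "\n", "する"]
file_text = "\n".join(tokens) + "\n"
lines = file_text.splitlines(keepends=True)

vocab = load_vocab_lines(lines)

print(len(tokens), len(lines))   # the file has one more line than there are tokens
print(vocab["する"])              # id after loading, shifted by one from tokens.index("する")
print(missing_ids(vocab))        # the gap that makes the ids non-consecutive
```

In this toy case the token after the newline ends up with id 4 instead of 3, and id 2 is left unused, which matches the symptom above: every token after the blank lines is shifted by one, and saving the vocabulary finds a hole in the index sequence.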
Also see #57.