The entry of \n in vocab.txt is causing token index shifting #64

@hiroshi-matsuda-rit

It seems that the \n entry in vocab.txt shifts the token indices after line 10295.

$ less -N vocab.txt
...
  10294 ##錄
  10295 
  10296 
  10297 する
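A minimal sketch for locating such entries programmatically rather than by eye. It assumes vocab.txt holds one token per line; the function name `find_vocab_problems` is hypothetical and not part of any library.

```python
from collections import defaultdict


def find_vocab_problems(lines):
    """Return (blank_lines, duplicates) with 1-based line numbers, as shown by `less -N`.

    `lines` is a list of raw lines, e.g. from `open(path, encoding="utf-8").readlines()`.
    """
    blank_lines = []
    seen = defaultdict(list)
    for lineno, raw in enumerate(lines, start=1):
        token = raw.rstrip("\n")
        if token.strip() == "":
            # An empty or whitespace-only token, like lines 10295 and 10296 above.
            blank_lines.append(lineno)
        seen[token].append(lineno)
    # Any token that appears on more than one line.
    duplicates = {tok: nums for tok, nums in seen.items() if len(nums) > 1}
    return blank_lines, duplicates


# Example usage (the path is an assumption):
# with open("vocab.txt", encoding="utf-8") as f:
#     blank_lines, duplicates = find_vocab_problems(f.readlines())
```

Two consecutive blank lines show up both as blank entries and as a duplicate of the empty token.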

Fortunately, I did not find any performance degradation in downstream tasks caused by this index shift, but I got the following error message when executing save_pretrained().
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357

Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
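One plausible mechanism behind this message, sketched under an assumption: the loader strips the trailing newline from each line and stores the tokens in a dict keyed by token, with the line index as the value. Two blank lines then collapse into a single empty-string key and leave a gap in the indices. The function names below are hypothetical, and this is a simplified illustration rather than the library's code.

```python
def load_vocab_sketch(lines):
    """Map token -> 0-based line index; a repeated token keeps only its last index."""
    vocab = {}
    for index, raw in enumerate(lines):
        vocab[raw.rstrip("\n")] = index
    return vocab


def first_index_gap(vocab):
    """Return (expected, found) at the first non-consecutive index, or None if there is none."""
    expected = 0
    for _token, index in sorted(vocab.items(), key=lambda kv: kv[1]):
        if index != expected:
            return expected, index
        expected += 1
    return None


# Two blank lines in a row, as in the listing from `less -N`.
lines = ["##a\n", "\n", "\n", "b\n"]
vocab = load_vocab_sketch(lines)
gap = first_index_gap(vocab)  # index 1 is missing because "" was overwritten with 2
```

With four lines but only three distinct tokens, the dict has one entry fewer than the file has lines, which is what a consecutiveness check would flag.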

I think line 10295 in vocab.txt should be some non-existent word such as !!!DIFECTED!!!.
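A minimal sketch of that kind of fix, assuming the goal is to keep every line position (and therefore every existing token index) unchanged while making all entries distinct. Each earlier occurrence of a repeated token is replaced with a placeholder, and the last occurrence is left alone. The function name `patch_duplicate_entries` and the numeric suffix used on collisions are hypothetical choices.

```python
def patch_duplicate_entries(lines, placeholder="!!!DIFECTED!!!"):
    """Return tokens with earlier duplicates replaced by unique placeholders.

    The number of entries is preserved, so no other token changes its index.
    """
    tokens = [raw.rstrip("\n") for raw in lines]
    # Index of the last occurrence of each token; that occurrence is kept as-is.
    last_index = {tok: i for i, tok in enumerate(tokens)}
    used = set(tokens)
    patched = []
    for i, tok in enumerate(tokens):
        if last_index[tok] != i:
            # Earlier duplicate: pick a placeholder that is not already in use.
            candidate, n = placeholder, 0
            while candidate in used:
                n += 1
                candidate = f"{placeholder}{n}"
            used.add(candidate)
            tok = candidate
        patched.append(tok)
    return patched


# Example usage (paths are assumptions):
# with open("vocab.txt", encoding="utf-8") as f:
#     fixed = patch_duplicate_entries(f.readlines())
# with open("vocab.fixed.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(fixed) + "\n")
```

Whether the remaining empty-string entry should also be replaced depends on how the model was trained, which this sketch does not attempt to decide.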

Also see #57.
