
Conversation

@Spico197

Move `super().__init__()` to the end of `__init__` to fix a tokenizer loading bug.
Otherwise, the following error is encountered:

```
Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
Traceback (most recent call last):
  File "/data/zhutong/OpenBA-Enc/debug.py", line 3, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 731, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2039, in from_pretrained
    return cls._from_pretrained(
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2250, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/zhutong/.cache/huggingface/modules/transformers_modules/OpenBA/OpenBA-LM/330fa80ff08f9646728dc14b568b60ea3d9d8144/tokenization_openba.py", line 52, in __init__
    super().__init__(
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 366, in __init__
    self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 454, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/home/zhutong/.cache/huggingface/modules/transformers_modules/OpenBA/OpenBA-LM/330fa80ff08f9646728dc14b568b60ea3d9d8144/tokenization_openba.py", line 95, in get_vocab
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
  File "/home/zhutong/.cache/huggingface/modules/transformers_modules/OpenBA/OpenBA-LM/330fa80ff08f9646728dc14b568b60ea3d9d8144/tokenization_openba.py", line 92, in vocab_size
    return self.sp_model.get_piece_size() + self._extra_ids
AttributeError: 'OpenBATokenizer' object has no attribute 'sp_model'
```
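For reference, here is a minimal sketch of the reordering this PR applies. It is abridged from `tokenization_openba.py`; the constructor signature and the SentencePiece loading details are assumptions based on typical T5-style tokenizers, while `vocab_size` and `get_vocab` follow the traceback above.

```python
import sentencepiece as spm
from transformers import PreTrainedTokenizer


class OpenBATokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, extra_ids=100, **kwargs):
        # Set up everything that get_vocab()/vocab_size depend on *before*
        # calling super().__init__(): transformers >= 4.34 invokes
        # self._add_tokens() -> self.get_vocab() inside that call.
        self.vocab_file = vocab_file
        self._extra_ids = extra_ids
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)
        # The fix: super().__init__() is now the last statement.
        super().__init__(extra_ids=extra_ids, **kwargs)

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size() + self._extra_ids

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab
```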

@Spico197 (Author)

The version hosted on the Hugging Face Hub should also be updated, so that loading the tokenizer directly from the Hub does not hit this bug.

@Spico197 changed the title from "fix initialization bug" to "fix tokenizer initialization bug" on Sep 22, 2023
@Spico197 (Author)

Found a similar issue at huggingface/transformers#26340.
This is caused by a breaking change in transformers (huggingface/transformers#23909).
It can be worked around by downgrading transformers from the latest main branch to 4.33.2 (e.g. `pip install transformers==4.33.2`).
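If downgrading is not convenient, a fail-fast guard before loading the tokenizer can at least surface the incompatibility clearly. This is only an illustrative sketch, not part of this PR:

```python
import transformers
from packaging import version

# transformers >= 4.34 registers special tokens inside
# PreTrainedTokenizer.__init__, which triggers the sp_model bug above.
if version.parse(transformers.__version__) >= version.parse("4.34.0.dev0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is incompatible with this "
        "tokenizer; please install transformers==4.33.2"
    )
```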

@Spico197 (Author)

Newer versions of `PreTrainedTokenizer` in transformers automatically add special tokens during `__init__`, at which point `sp_model` has not yet been initialized under the current setup. I suggest making some changes to support future versions.

[screenshot of the relevant `PreTrainedTokenizer` source in transformers]
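To make the failure mode concrete, here is a self-contained toy reproduction of the ordering hazard (simplified stand-ins, not the real transformers classes):

```python
class Base:
    def __init__(self):
        # Mirrors transformers >= 4.34: the base __init__ calls back
        # into the subclass's get_vocab() right away.
        self.get_vocab()


class BadTokenizer(Base):
    def __init__(self):
        super().__init__()         # sp_model does not exist yet
        self.sp_model = "loaded"

    def get_vocab(self):
        return self.sp_model


class GoodTokenizer(Base):
    def __init__(self):
        self.sp_model = "loaded"   # the fix: initialize state first
        super().__init__()         # now the callback is safe

    def get_vocab(self):
        return self.sp_model


GoodTokenizer()  # works
BadTokenizer()   # AttributeError: 'BadTokenizer' object has no attribute 'sp_model'
```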

@Spico197 changed the title from "fix tokenizer initialization bug" to "fix tokenizer initialization bug with the latest version (4.34.0.dev0) of transformers" on Sep 22, 2023
@ZetangForward (Member)

We have specified the required transformers version and fixed this problem. Thanks for your suggestion!
