
Conversation

@Spico197

Move `super().__init__()` to the end of `__init__` to fix a tokenizer loading bug.
Otherwise, the following error is encountered:

```
Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
Traceback (most recent call last):
  File "/data/zhutong/OpenBA-Enc/debug.py", line 3, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 731, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2039, in from_pretrained
    return cls._from_pretrained(
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2250, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/zhutong/.cache/huggingface/modules/transformers_modules/OpenBA/OpenBA-LM/330fa80ff08f9646728dc14b568b60ea3d9d8144/tokenization_openba.py", line 52, in __init__
    super().__init__(
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 366, in __init__
    self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
  File "/home/zhutong/miniconda3/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 454, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/home/zhutong/.cache/huggingface/modules/transformers_modules/OpenBA/OpenBA-LM/330fa80ff08f9646728dc14b568b60ea3d9d8144/tokenization_openba.py", line 95, in get_vocab
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
  File "/home/zhutong/.cache/huggingface/modules/transformers_modules/OpenBA/OpenBA-LM/330fa80ff08f9646728dc14b568b60ea3d9d8144/tokenization_openba.py", line 92, in vocab_size
    return self.sp_model.get_piece_size() + self._extra_ids
AttributeError: 'OpenBATokenizer' object has no attribute 'sp_model'
```
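For reference, here is a minimal sketch of the reordering this PR applies. It is abridged from `tokenization_openba.py`; the constructor signature and the SentencePiece loading details are assumptions based on typical T5-style tokenizers, while `vocab_size` and `get_vocab` follow the traceback above.

```python
import sentencepiece as spm
from transformers import PreTrainedTokenizer


class OpenBATokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, extra_ids=100, **kwargs):
        # Set up everything that get_vocab()/vocab_size depend on *before*
        # calling super().__init__(): transformers >= 4.34 invokes
        # self._add_tokens() -> self.get_vocab() inside that call.
        self.vocab_file = vocab_file
        self._extra_ids = extra_ids
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)
        # The fix: super().__init__() is now the last statement.
        super().__init__(extra_ids=extra_ids, **kwargs)

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size() + self._extra_ids

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab
```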

@Spico197 (Author)

The version hosted on the Hugging Face Hub should also be updated, so that loading the tokenizer directly from the Hub does not hit this bug.

@Spico197 changed the title from "fix initialization bug" to "fix tokenizer initialization bug" on Sep 22, 2023
@Spico197 (Author)

Found a similar issue at huggingface/transformers#26340.
This is caused by a breaking change in transformers (huggingface/transformers#23909).
It can be worked around by downgrading transformers from the latest main branch to 4.33.2 (e.g. `pip install transformers==4.33.2`).
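If downgrading is not convenient, a fail-fast guard before loading the tokenizer can at least surface the incompatibility clearly. This is only an illustrative sketch, not part of this PR:

```python
import transformers
from packaging import version

# transformers >= 4.34 registers special tokens inside
# PreTrainedTokenizer.__init__, which triggers the sp_model bug above.
if version.parse(transformers.__version__) >= version.parse("4.34.0.dev0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is incompatible with this "
        "tokenizer; please install transformers==4.33.2"
    )
```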

@Spico197 (Author)

Newer versions of `PreTrainedTokenizer` in transformers automatically add special tokens during `__init__`, at which point `sp_model` has not yet been initialized under the current setup. I suggest making some changes to support future versions.

[screenshot of the relevant `PreTrainedTokenizer` source in transformers]
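To make the failure mode concrete, here is a self-contained toy reproduction of the ordering hazard (simplified stand-ins, not the real transformers classes):

```python
class Base:
    def __init__(self):
        # Mirrors transformers >= 4.34: the base __init__ calls back
        # into the subclass's get_vocab() right away.
        self.get_vocab()


class BadTokenizer(Base):
    def __init__(self):
        super().__init__()         # sp_model does not exist yet
        self.sp_model = "loaded"

    def get_vocab(self):
        return self.sp_model


class GoodTokenizer(Base):
    def __init__(self):
        self.sp_model = "loaded"   # the fix: initialize state first
        super().__init__()         # now the callback is safe

    def get_vocab(self):
        return self.sp_model


GoodTokenizer()  # works
BadTokenizer()   # AttributeError: 'BadTokenizer' object has no attribute 'sp_model'
```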

@Spico197 changed the title from "fix tokenizer initialization bug" to "fix tokenizer initialization bug with the latest version (4.34.0.dev0) of transformers" on Sep 22, 2023
@ZetangForward (Member)

We have specified the required transformers version and fixed this problem. Thanks for your suggestion!
