sudachitra and other custom tokenizers are no longer compatible with transformers 4.34 and later #66

@mingboiz

System Info

transformers version: 4.34.0
Platform: linux
Python version: 3.9.18
sudachitra version: 0.1.8
sudachipy version: 0.6.7
sudachi-core version: 20230927

Upstream changes in transformers from PR huggingface/transformers#23909 cause an error when
running the example at https://huggingface.co/megagonlabs/transformers-ud-japanese-electra-base-discriminator.
The same happens with other custom tokenizers: huggingface/transformers#26777

from sudachitra import ElectraSudachipyTokenizer
tokenizer = ElectraSudachipyTokenizer.from_pretrained("megagonlabs/transformers-ud-japanese-electra-base-discriminator")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2045, in from_pretrained
    return cls._from_pretrained(
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/lib64/python3.9/site-packages/sudachitra/tokenization_bert_sudachipy.py", line 155, in __init__
    super().__init__(
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils.py", line 366, in __init__
    self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
  File "/home/lib64/python3.9/site-packages/transformers/tokenization_utils.py", line 462, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/home/lib64/python3.9/site-packages/sudachitra/tokenization_bert_sudachipy.py", line 218, in get_vocab
    return dict(self.vocab, **self.added_tokens_encoder)
AttributeError: 'ElectraSudachipyTokenizer' object has no attribute 'vocab'
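
Reading the traceback, the base constructor appears to call get_vocab() (through _add_tokens) before the subclass has assigned self.vocab. The sketch below reproduces that ordering problem with stand-in classes. BaseTokenizer, BrokenTokenizer and FixedTokenizer are hypothetical names for illustration, not the real transformers or sudachitra classes. It also shows one possible fix, assigning the vocabulary before calling super().__init__(); whether that is the right change for sudachitra is an assumption.

```python
class BaseTokenizer:
    """Stand-in for the base tokenizer class; NOT the real transformers code."""

    def __init__(self):
        self.added_tokens_encoder = {}
        # Mirrors the traceback: the base constructor reads the vocabulary
        # (via _add_tokens -> get_vocab) while it is still running.
        self._add_tokens(["[PAD]"])

    def _add_tokens(self, tokens):
        current_vocab = self.get_vocab().copy()
        for token in tokens:
            if token not in current_vocab:
                idx = len(current_vocab)
                self.added_tokens_encoder[token] = idx
                current_vocab[token] = idx

    def get_vocab(self):
        raise NotImplementedError


class BrokenTokenizer(BaseTokenizer):
    """Sets self.vocab AFTER the base constructor, as the failing code seems to."""

    def __init__(self, vocab):
        super().__init__()        # get_vocab() runs here, self.vocab is missing
        self.vocab = dict(vocab)

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)


class FixedTokenizer(BaseTokenizer):
    """Sets self.vocab BEFORE the base constructor (assumed fix)."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)  # available by the time get_vocab() is called
        super().__init__()

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)


if __name__ == "__main__":
    try:
        BrokenTokenizer({"a": 0})
    except AttributeError as err:
        print("broken:", err)
    print("fixed:", FixedTokenizer({"a": 0}).get_vocab())
```

If the real classes behave like this sketch, moving the vocabulary loading above the super().__init__() call in tokenization_bert_sudachipy.py would be the first thing to try.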

If that is OK, I would like to contribute by submitting a PR that fixes this issue.
