System Info
- transformers version: 4.34.0
- Platform: macOS-13.5-arm64-arm-64bit
- Python version: 3.10.12
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.4.0
- Accelerate version: 0.20.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
In [1]: import transformers
In [2]: t0tt = transformers.AutoTokenizer.from_pretrained('bigscience/T0pp')
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
In [3]: t0tt.add_special_tokens({'bos_token': '[NEWSPECIAL]'})
Out[3]: 1
In [4]: t0tt.save_pretrained('saved-tokenizer')
Out[4]:
('saved-tokenizer/tokenizer_config.json',
'saved-tokenizer/special_tokens_map.json',
'saved-tokenizer/spiece.model',
'saved-tokenizer/added_tokens.json',
'saved-tokenizer/tokenizer.json')
In [5]: loaded_t0tt = transformers.AutoTokenizer.from_pretrained('saved-tokenizer')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
In [6]: t0tt.bos_token
Out[6]: '[NEWSPECIAL]'
In [7]: loaded_t0tt.bos_token
Using bos_token, but it is not set yet.
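To narrow down whether the token is dropped at save time or at load time, here is a minimal diagnostic sketch (assuming the saved-tokenizer directory produced by the reproduction above); whether bos_token appears in these files is what I would expect to check, not a claim about where the bug lives:

import json

# Check whether '[NEWSPECIAL]' was written to disk by save_pretrained
with open('saved-tokenizer/special_tokens_map.json') as f:
    print(json.load(f).get('bos_token'))

with open('saved-tokenizer/tokenizer_config.json') as f:
    print(json.load(f).get('bos_token'))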
Expected behavior
Expected that an added special token (bos_token in the reproduction above) persists when saving and then reloading the tokenizer.
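In script form, the expected round trip looks like the sketch below (same 'bigscience/T0pp' checkpoint and 'saved-tokenizer' directory as above); the final assertion expresses the expected behavior and currently fails:

import transformers

tok = transformers.AutoTokenizer.from_pretrained('bigscience/T0pp')
tok.add_special_tokens({'bos_token': '[NEWSPECIAL]'})
tok.save_pretrained('saved-tokenizer')

reloaded = transformers.AutoTokenizer.from_pretrained('saved-tokenizer')
# Expected: the added token survives the save/reload round trip
assert reloaded.bos_token == '[NEWSPECIAL]'  # currently bos_token is not set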