System Info
- transformers version: 4.28.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0+cu118 (False)
- Tensorflow version (GPU?): 2.12.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.6.8 (cpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I'm studying chapter 6 of the NLP course and found that `return_overflowing_tokens` behaves differently with the slow tokenizer than with the fast tokenizer. Is this a feature or a bug?
```python
from transformers import DistilBertTokenizer, DistilBertTokenizerFast

model_checkpoint = "distilbert-base-cased-distilled-squad"
slow_tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)
fast_tokenizer = DistilBertTokenizerFast.from_pretrained(model_checkpoint)

sentence = "This sentence is not too long but we are going to split it anyway."
# Split the sentence into overlapping chunks of at most 6 tokens with a stride of 2.
inputs = fast_tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(inputs["input_ids"])
```
Then I got the output:

```
[[101, 1188, 5650, 1110, 1136, 1315, 1263, 102], [101, 1315, 1263, 1133, 1195, 1132, 1280, 102], [101, 1132, 1280, 1106, 3325, 1122, 4050, 102], [101, 1122, 4050, 119, 102]]
```
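To make the overlap easier to see, decoding each chunk shows the sliding windows produced by `stride=2` (the last two tokens of one chunk are repeated at the start of the next):

```python
# Decode each chunk returned by the fast tokenizer to make the stride-2 overlap visible.
for ids in inputs["input_ids"]:
    print(fast_tokenizer.decode(ids))
```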
but when I replaced `fast_tokenizer` with `slow_tokenizer`, I got:

```
[101, 1188, 5650, 1110, 1136, 1315, 1263, 102]
```
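For completeness, here is a small diagnostic I used to look at everything the slow tokenizer returns; I'm assuming the clipped tokens might be exposed under a separate key rather than as extra rows, but I haven't confirmed where they are supposed to end up:

```python
# Print every key/value the slow tokenizer returns, to see whether the
# overflowing tokens are reported somewhere other than input_ids.
slow_inputs = slow_tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
for key, value in slow_inputs.items():
    print(key, value)
```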
Expected behavior
The slow tokenizer should behave the same way as the fast tokenizer.