
return_overflowing_tokens has different behavior between slow tokenizer and fast tokenizer #23001

Closed
@BuxianChen

Description


System Info

  • transformers version: 4.28.1
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.0+cu118 (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.8 (cpu)
  • Jax version: 0.4.8
  • JaxLib version: 0.4.7
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@Arthur

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm working through chapter 6 of the NLP course and found that return_overflowing_tokens behaves differently between the slow tokenizer and the fast tokenizer. Is this a feature or a bug?

from transformers import DistilBertTokenizer, DistilBertTokenizerFast

model_checkpoint = "distilbert-base-cased-distilled-squad"
slow_tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)
fast_tokenizer = DistilBertTokenizerFast.from_pretrained(model_checkpoint)
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = fast_tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(inputs["input_ids"])

Then I got the output

[[101, 1188, 5650, 1110, 1136, 1315, 1263, 102], [101, 1315, 1263, 1133, 1195, 1132, 1280, 102], [101, 1132, 1280, 1106, 3325, 1122, 4050, 102], [101, 1122, 4050, 119, 102]]
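For reference, the fast tokenizer's output above looks like a sliding window over the content tokens, where consecutive windows overlap by `stride` tokens and each window is wrapped in [CLS]/[SEP]. Here is a minimal pure-Python sketch of that chunking (the helper `chunk_with_stride` is hypothetical, not a transformers API; special-token handling is simplified):

```python
# [CLS] and [SEP] IDs in the distilbert-base-cased vocabulary.
CLS, SEP = 101, 102

def chunk_with_stride(token_ids, window, stride):
    """Split token_ids into overlapping windows of size `window`;
    consecutive windows share `stride` tokens, mimicking what
    return_overflowing_tokens produces with the fast tokenizer."""
    step = window - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        piece = token_ids[start:start + window]
        chunks.append([CLS] + piece + [SEP])
        if start + window >= len(token_ids):
            break
    return chunks

# Content token IDs of the example sentence (specials stripped).
content = [1188, 5650, 1110, 1136, 1315, 1263, 1133, 1195,
           1132, 1280, 1106, 3325, 1122, 4050, 119]

for chunk in chunk_with_stride(content, 6, 2):
    print(chunk)
```

With `window=6` and `stride=2` this reproduces the four sequences printed above, including the shorter final chunk.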

but when I replace fast_tokenizer with slow_tokenizer, I get

[101, 1188, 5650, 1110, 1136, 1315, 1263, 102]

Expected behavior

The slow tokenizer should behave the same as the fast tokenizer.

Metadata



    Labels

Core: Tokenization (internals of the library; tokenization)
