Skip to content

[tokenizers] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer#35593

Merged
tomaarsen merged 4 commits into
huggingface:mainfrom
tomaarsen:tokenization/add_prefix_space_patch
Jan 9, 2025
Merged

[tokenizers] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer#35593
tomaarsen merged 4 commits into
huggingface:mainfrom
tomaarsen:tokenization/add_prefix_space_patch

Conversation

@tomaarsen

Copy link
Copy Markdown
Member

Hello!

Pull Request overview

  • Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer

What does this PR do?

This is an extrapolation of #35537, which was a fix to ensure that if AutoTokenizer.from_pretrained("...", add_prefix_space=True), then the underlying ByteLevel pre-tokenizer (at backend_tokenizer.pre_tokenizer) is also updated with add_prefix_space=True.

The PR is as simple as moving the patch, which looks like the following snippet, to the end of the PreTrainedTokenizerFast.__init__ method.

        pre_tok_state = json.loads(self.backend_tokenizer.pre_tokenizer.__getstate__())
        if pre_tok_state.get("add_prefix_space", add_prefix_space) != add_prefix_space:
            pre_tok_class = getattr(pre_tokenizers, pre_tok_state.pop("type"))
            pre_tok_state["add_prefix_space"] = add_prefix_space
            self.backend_tokenizer.pre_tokenizer = pre_tok_class(**pre_tok_state)

        self.add_prefix_space = add_prefix_space

The if-statement should only be True if the pre-tokenizer accepts "add_prefix_space" as an argument, and I assume that it's safer to override the pre_tokenizer with a new one than updating the add_prefix_space attribute on the pre-tokenizer, as it's all Rust-based behind the scenes.

I also added a test that should trigger for all tokenizers where the Rust-based tokenizer should be tested.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Discussed with @ArthurZucker outside of GitHub.

Who can review?

@ArthurZucker

  • Tom Aarsen

…okenizer

in PreTrainedTokenizerFast, rather than relying on subclasses to take care of this.
@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Kinda long due 🤗

@KoichiYasuoka

KoichiYasuoka commented Jan 10, 2025

Copy link
Copy Markdown
Contributor

Thank you @tomaarsen and I agree that this PR well works when the pre_tokenizer is simply ByteLevel. But I suspect that it does not work when the pre_tokenizer is a Sequence including ByteLevel...

@ArthurZucker

Copy link
Copy Markdown
Collaborator

This was already not working before!
But wait 1 week, we are merging stuff in tokenizers to just iterate of the Sequence! FYI @McPatate 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants