[`tokenizers`] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer by tomaarsen · Pull Request #35593 · huggingface/transformers

tomaarsen · 2025-01-09T14:41:48Z

Hello!

Pull Request overview

Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer

What does this PR do?

This is an extension of #35537, a fix ensuring that when a tokenizer is loaded with AutoTokenizer.from_pretrained("...", add_prefix_space=True), the underlying ByteLevel pre-tokenizer (at backend_tokenizer.pre_tokenizer) is also updated with add_prefix_space=True.

The PR simply moves the patch, shown in the following snippet, to the end of the PreTrainedTokenizerFast.__init__ method.

        pre_tok_state = json.loads(self.backend_tokenizer.pre_tokenizer.__getstate__())
        if pre_tok_state.get("add_prefix_space", add_prefix_space) != add_prefix_space:
            pre_tok_class = getattr(pre_tokenizers, pre_tok_state.pop("type"))
            pre_tok_state["add_prefix_space"] = add_prefix_space
            self.backend_tokenizer.pre_tokenizer = pre_tok_class(**pre_tok_state)

        self.add_prefix_space = add_prefix_space

The if-statement should only be True if the pre-tokenizer's serialized state contains "add_prefix_space" (i.e. the pre-tokenizer accepts it as an argument) and the stored value differs from the requested one. I assume that it's safer to replace the pre_tokenizer with a new one than to update the add_prefix_space attribute on the existing pre-tokenizer, as it's all Rust-based behind the scenes.
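To make the guard easier to reason about, here is a minimal sketch of the same check applied to plain JSON strings, so it runs without the tokenizers library. The helper name rebuilt_pre_tokenizer_args is hypothetical, and the two example states are written by hand for illustration; they are assumed to resemble what pre_tokenizer.__getstate__() returns, not copied from a real tokenizer.

```python
import json


def rebuilt_pre_tokenizer_args(serialized_state, add_prefix_space):
    """Mirror the guard from the patch on a serialized pre-tokenizer state.

    Returns (type_name, kwargs) when a rebuild would be triggered,
    or None when the pre-tokenizer would be left untouched.
    Hypothetical helper, for illustration only.
    """
    state = json.loads(serialized_state)
    # If the key is missing, .get() falls back to the requested value,
    # so the comparison is equal and nothing is rebuilt.
    if state.get("add_prefix_space", add_prefix_space) == add_prefix_space:
        return None
    type_name = state.pop("type")
    state["add_prefix_space"] = add_prefix_space
    return type_name, state


# Hand-written example states (assumed layout).
byte_level_state = json.dumps(
    {"type": "ByteLevel", "add_prefix_space": False, "trim_offsets": True, "use_regex": True}
)
whitespace_state = json.dumps({"type": "Whitespace"})

print(rebuilt_pre_tokenizer_args(byte_level_state, True))   # rebuild requested
print(rebuilt_pre_tokenizer_args(byte_level_state, False))  # value already matches
print(rebuilt_pre_tokenizer_args(whitespace_state, True))   # key not present
```

Only the first call returns rebuild arguments; the other two return None, which matches the intent that pre-tokenizers without an add_prefix_space argument are never replaced.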

I also added a test that should run for all tokenizers for which the Rust-based tokenizer is tested.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Discussed with @ArthurZucker outside of GitHub.

Who can review?

@ArthurZucker

Tom Aarsen

https://github.com/huggingface/tokenizers/blob/862d1a346a99183017b1eb5ad1aa3133b466784f/bindings/python/src/pre_tokenizers.rs#L672 raises the exception. It is triggered by the RoFormer tests, as RoFormerTokenizerFast uses a custom PreTokenizer.

HuggingFaceDocBuilderDev · 2025-01-09T15:38:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Thanks! Kinda long overdue 🤗

KoichiYasuoka · 2025-01-10T01:35:54Z

Thank you @tomaarsen. I agree that this PR works well when the pre_tokenizer is simply ByteLevel, but I suspect that it does not work when the pre_tokenizer is a Sequence that includes ByteLevel...

ArthurZucker · 2025-01-10T09:46:11Z

This was already not working before!
But wait 1 week, we are merging stuff in tokenizers to just iterate over the Sequence! FYI @McPatate 😉
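The Sequence case discussed above can be illustrated on plain dictionaries. This is only a sketch: the "pretokenizers" key and the member states are an assumed serialization layout written by hand, set_add_prefix_space is a hypothetical helper, and it is not the change being merged in tokenizers. It shows that the top-level check from the patch sees no "add_prefix_space" key on a Sequence, and what walking the members might look like.

```python
def set_add_prefix_space(state, add_prefix_space):
    """Return a copy of a pre-tokenizer state with add_prefix_space updated
    wherever the key already exists, descending into Sequence members.
    Hypothetical helper, for illustration only."""
    updated = dict(state)
    if "add_prefix_space" in updated:
        updated["add_prefix_space"] = add_prefix_space
    if updated.get("type") == "Sequence":
        updated["pretokenizers"] = [
            set_add_prefix_space(member, add_prefix_space)
            for member in updated.get("pretokenizers", [])
        ]
    return updated


# Hand-written example state (assumed layout).
sequence_state = {
    "type": "Sequence",
    "pretokenizers": [
        {"type": "Digits", "individual_digits": False},
        {"type": "ByteLevel", "add_prefix_space": False, "trim_offsets": True, "use_regex": True},
    ],
}

requested = True

# The check from the patch only looks at the top level. A Sequence has no
# "add_prefix_space" key there, so no mismatch is detected.
top_level_mismatch = sequence_state.get("add_prefix_space", requested) != requested
print(top_level_mismatch)

# Walking the members reaches the nested ByteLevel entry.
nested = set_add_prefix_space(sequence_state, requested)
print(nested["pretokenizers"][1]["add_prefix_space"])
```

The helper returns a new dictionary rather than mutating its input, in the same spirit as the patch, which builds a new pre-tokenizer instead of modifying the existing one.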

Ensure that add_prefix_space is propagated to backend_tokenizer.pre_t…

e8738e5

…okenizer in PreTrainedTokenizerFast, rather than relying on subclasses to take care of this.

tomaarsen requested review from ArthurZucker and Rocketknight1 as code owners January 9, 2025 14:41

tomaarsen added 2 commits January 9, 2025 15:52

Simplify setting self.add_prefix_space, ensure pre_tok exists

8df9846

Propagate add_prefix_space in T5TokenizerFast to superclass

9888d79

ArthurZucker approved these changes Jan 9, 2025

tomaarsen merged commit 32e0db8 into huggingface:main Jan 9, 2025

tomaarsen mentioned this pull request Jan 9, 2025

[modernbert] Add ModernBertTokenizerFast to allow for 'add_prefix_space' #35537

Closed

5 tasks

tomaarsen merged 4 commits into huggingface:main from tomaarsen:tokenization/add_prefix_space_patch

4 participants
