Skip to content

Conversation

@itazap
Copy link
Collaborator

@itazap itazap commented Mar 5, 2025

fix for #36438

Reason for the issue: that a list of tokens is expected, not a dict of tokens / ids, which is what tiktoken's encoding_name_for_model provides. This converter function was designed for more 'behind the scenes' use I guess, but we can make this fix to support it better for direct use.

We don't explicitly test convert_slow_tokenizer, so we can test equivalence for gpt2 (since it's common between tiktoken and transformers)

@github-actions github-actions bot marked this pull request as draft March 5, 2025 18:21
@github-actions
Copy link
Contributor

github-actions bot commented Mar 5, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@itazap itazap changed the title 36438 tiktoken convert fix tiktoken convert to pass AddedToken to Tokenizer Mar 5, 2025
@itazap itazap requested a review from ArthurZucker March 5, 2025 18:23
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, can we check why the fast tests were not failing before (no gguf on the docker?)

@itazap itazap force-pushed the 36438_tiktoken_convert branch from 78fc6c5 to 84b892d Compare March 8, 2025 17:38
@itazap itazap force-pushed the 36438_tiktoken_convert branch from 7cfaf74 to d7f52c5 Compare March 8, 2025 17:48
@itazap itazap requested a review from ArthurZucker March 8, 2025 17:50
@itazap itazap marked this pull request as ready for review March 11, 2025 20:29
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging

@ArthurZucker ArthurZucker merged commit 3f03c37 into main Mar 20, 2025
24 checks passed
@ArthurZucker ArthurZucker deleted the 36438_tiktoken_convert branch March 20, 2025 10:26
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* pass AddedToken to Tokenizer

* ruff

* handle dict for special tokens

* option: test tokenizer from tiktoken same as fast

* ruff

* ruff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants