Add GPT 4.1 to Tiktoken Tokenizer #7450

Closed
fanyang-mono opened this issue Apr 30, 2025 · 4 comments · Fixed by #7453
Labels: enhancement, Tokenizers

Comments

@fanyang-mono
Member

I would like to see GPT 4.1 added to the Tiktoken tokenizer.

fanyang-mono added the enhancement label Apr 30, 2025
dotnet-policy-service (bot) added the untriaged label Apr 30, 2025
@stephentoub
Member

@tarekgh, it looks like OpenAI stated yesterday that gpt-4.1 uses the same tokenizer as gpt-4o.
https://community.openai.com/t/whats-the-tokenization-algorithm-gpt-4-1-uses/1245758/2

@tarekgh
Member

tarekgh commented Apr 30, 2025

@stephentoub I was waiting for OpenAI to officially add it to their tokenizer library openai/tiktoken#395.

tarekgh added this to the ML.NET 5.0 milestone Apr 30, 2025
tarekgh removed the untriaged label Apr 30, 2025
tarekgh self-assigned this Apr 30, 2025
@fanyang-mono
Member Author

@tarekgh Thanks for adding the support. When will this be available to consume?

@tarekgh
Member

tarekgh commented May 6, 2025

@fanyang-mono The change will be included in our next preview (hopefully in 10 days or so). You don't have to be blocked on that, though, because you can already create the tokenizer using another compatible model name, for example:

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

Or you can create it from the encoding name:

Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("o200k_base");

Both approaches should create a tokenizer that can be used with the gpt-4.1 model. Let me know if you have any questions.
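
As a rough illustration of the workaround above, the following sketch creates the tokenizer both ways and checks that they produce the same token IDs for a sample string. It assumes a project that references the Microsoft.ML.Tokenizers package and, depending on the package version, the separate data package that ships the o200k_base vocabulary. The sample text and variable names are made up for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML.Tokenizers;

// Assumption: the o200k_base vocabulary data is available to the library
// (in recent versions it ships in a separate data package).
Tokenizer byModel = TiktokenTokenizer.CreateForModel("gpt-4o");
Tokenizer byEncoding = TiktokenTokenizer.CreateForEncoding("o200k_base");

// Hypothetical sample input; any text destined for a gpt-4.1 prompt would do.
string text = "Count the tokens in this prompt.";

// Encode the same text with both tokenizers.
var idsFromModel = byModel.EncodeToIds(text);
var idsFromEncoding = byEncoding.EncodeToIds(text);

// Both tokenizers use the same encoding, so the ID sequences should match.
bool sameIds = idsFromModel.SequenceEqual(idsFromEncoding);
Console.WriteLine($"Same token IDs: {sameIds}");
Console.WriteLine($"Token count: {byModel.CountTokens(text)}");
```

Once a release that recognizes the gpt-4.1 model name is available, the first call could presumably be switched to that name without changing the rest of the code.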
