Python hugging face tokenizer #8354

Merged
jackzhxng merged 8 commits into main from jz/python-tokenizer
Feb 13, 2025
Conversation

jackzhxng
Contributor

@jackzhxng jackzhxng commented Feb 11, 2025

Summary

Add a Python tokenizer for the Hugging Face tokenizer format (.json files)

Test plan

  • Existing CI runner tests for regression on TikToken / SentencePiece
  • Generated coherent output on Qwen2.5 with HuggingFace tokenizer.json

pytorch-bot bot commented Feb 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8354

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs

As of commit b0990da with merge base d99970b (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the "CLA Signed" label Feb 11, 2025
@jackzhxng added the "release notes: examples" label Feb 12, 2025
@jackzhxng marked this pull request as ready for review February 12, 2025 19:59
Comment on lines 78 to 82
parser.add_argument(
    "--tokenizer_config_path",
    type=str,
    default=None,
)
Contributor

Please add a help string to clarify how to use this argument.

Contributor

Also, should this be mutually exclusive with tokenizer_path? We should make sure only one is passed in.

Contributor Author

@jackzhxng jackzhxng Feb 12, 2025

The tokenizer.json is always the source of truth; the config only supplies metadata. For example, it tells you what the eos_token is, which isn't in the tokenizer.json itself.
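
For illustration, the split described in this comment could look roughly like the sketch below. It is not the code merged in this PR. The helper names (read_eos_token, token_to_id) and the tiny in-memory JSON documents are hypothetical, and the sketch assumes a tokenizer.json whose model.vocab is a token-to-id mapping (as in BPE-style files) and a config whose eos_token is either a plain string or an object with a "content" field.

```python
import json
from typing import Optional


def read_eos_token(config: dict) -> Optional[str]:
    """Return the EOS token string named by the config metadata, if any.

    Assumption: eos_token is either a plain string or an object that
    carries the token text under "content".
    """
    eos = config.get("eos_token")
    if isinstance(eos, dict):
        return eos.get("content")
    return eos


def token_to_id(tokenizer_json: dict, token: str) -> Optional[int]:
    """Resolve a token string to its id using only tokenizer.json data.

    Assumption: special tokens are listed under "added_tokens" and the
    regular vocabulary is a dict under model.vocab.
    """
    for added in tokenizer_json.get("added_tokens", []):
        if added.get("content") == token:
            return added.get("id")
    return tokenizer_json.get("model", {}).get("vocab", {}).get(token)


# Tiny in-memory stand-ins for tokenizer.json and the tokenizer config.
tokenizer_json = json.loads(
    '{"added_tokens": [{"id": 3, "content": "<|endoftext|>"}],'
    ' "model": {"vocab": {"hello": 0, "world": 1}}}'
)
tokenizer_config = json.loads('{"eos_token": "<|endoftext|>"}')

# The config names the EOS token; tokenizer.json supplies its id.
eos_token = read_eos_token(tokenizer_config)
eos_id = token_to_id(tokenizer_json, eos_token)
```

The config only says which token plays the EOS role; the id still comes from tokenizer.json, which is why the two files are complementary rather than alternatives.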

Contributor

Can you add this info to the help string?
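
One possible shape for the requested help string is sketched below. The wording is illustrative rather than the text that was merged, build_parser is a hypothetical helper, and the --tokenizer_path flag is assumed from the reviewer's mention of tokenizer_path. The two arguments are kept independent because, per the author's reply, the config complements the tokenizer file instead of replacing it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with both tokenizer arguments documented."""
    parser = argparse.ArgumentParser(description="Tokenizer argument sketch")
    # Assumed flag: the main tokenizer artifact, e.g. a tokenizer.json file.
    parser.add_argument(
        "--tokenizer_path",
        type=str,
        default=None,
        help="Path to the tokenizer file, e.g. a Hugging Face tokenizer.json.",
    )
    # Optional companion file; not mutually exclusive with --tokenizer_path.
    parser.add_argument(
        "--tokenizer_config_path",
        type=str,
        default=None,
        help=(
            "Optional path to the tokenizer config that accompanies a "
            "Hugging Face tokenizer.json. The tokenizer.json remains the "
            "source of truth; the config only supplies metadata such as "
            "the eos_token, which is not stored in tokenizer.json."
        ),
    )
    return parser


# Usage: both flags may be passed together, or the config may be omitted.
args = build_parser().parse_args(
    ["--tokenizer_path", "tokenizer.json",
     "--tokenizer_config_path", "tokenizer_config.json"]
)
```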

@jackzhxng merged commit 9ba5494 into main Feb 13, 2025
43 of 46 checks passed
@jackzhxng deleted the jz/python-tokenizer branch February 13, 2025 21:38
@jackzhxng mentioned this pull request Feb 20, 2025
Labels
  • CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
  • release notes: examples: Changes to any of our example LLM integrations, such as Llama3 and Llava

3 participants