Python hugging face tokenizer #8354

jackzhxng · 2025-02-11T00:52:22Z

Summary

Add a Python tokenizer for the Hugging Face tokenizer format (*.json files)

Test plan

Existing CI runner tests for regression on TikToken / SentencePiece
Generated coherent output on Qwen2.5 with HuggingFace tokenizer.json

pytorch-bot · 2025-02-11T00:52:25Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/8354

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs

As of commit b0990da with merge base d99970b:

NEW FAILURE - The following job has failed:

pull / test-static-llama-qnn-linux / linux-job (gh)
RuntimeError: Command docker exec -t 62db70bc9174142cba9cbfebda593b0caec6ac5aeee6141a8d4dfa2e794ab36c /exec failed with exit code 127

CANCELLED JOBS - The following jobs were cancelled. Please retry:

pull / test-eval_llama-mmlu-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-llava-runner-linux / linux-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

larryliu0820 · 2025-02-12T22:50:25Z

examples/models/llama/runner/eager.py

+    parser.add_argument(
+        "--tokenizer_config_path",
+        type=str,
+        default=None,
+    )


Please add help to clarify how to use this argument

Also, should this be mutually exclusive with tokenizer_path? We should make sure only one is passed in.

The tokenizer.json is always the source of truth. The config only supplies metadata: for example, it tells you what the eos_token is, which the tokenizer.json itself does not.

Can you add this info to the help string?
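A minimal sketch of what the requested help strings could look like, not the code that was merged. It assumes the config is a JSON file (named tokenizer_config.json here for illustration) and that eos_token is stored either as a plain string or as an object with a "content" field. The function names build_parser and read_eos_token are hypothetical.

```python
import argparse
import json
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing help text for the two tokenizer arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tokenizer_path",
        type=str,
        default=None,
        help="Path to the tokenizer file, e.g. a Hugging Face tokenizer.json. "
        "This file is the source of truth for tokenization.",
    )
    parser.add_argument(
        "--tokenizer_config_path",
        type=str,
        default=None,
        help="Optional path to the tokenizer config. It supplies metadata only, "
        "such as the eos_token, which tokenizer.json does not record. "
        "Pass it alongside --tokenizer_path, not instead of it.",
    )
    return parser


def read_eos_token(config_path: str) -> Optional[str]:
    """Return the eos_token from a tokenizer config file, or None if absent."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    eos = config.get("eos_token")
    # Assumption: the token may be stored as an object with a "content" field.
    if isinstance(eos, dict):
        return eos.get("content")
    return eos
```

Because the config complements tokenizer.json rather than replacing it, this sketch keeps the two arguments independent instead of placing them in a mutually exclusive group.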

extension/llm/tokenizer/utils.py

jackzhxng added 2 commits February 10, 2025 15:04

hf_tokenizer.py generated

8cbe6cd

Add hugging face tokenizer

88b3394

facebook-github-bot added the CLA Signed label Feb 11, 2025

Qwen runs with HF tokenizer

70fd1fe

jackzhxng added the release notes: examples label Feb 12, 2025

jackzhxng added 2 commits February 12, 2025 11:20

Fix encode, remove generated python tokenizer

5834d14

Comment / lint

421288b

jackzhxng marked this pull request as ready for review February 12, 2025 19:59

jackzhxng force-pushed the jz/python-tokenizer branch from 9ebd327 to f1573f2 February 12, 2025 20:00

Move into class

e3352fa

jackzhxng force-pushed the jz/python-tokenizer branch from f1573f2 to e3352fa February 12, 2025 20:02

jackzhxng requested review from larryliu0820 and iseeyuan February 12, 2025 20:04

Lint

01ed13c

jackzhxng force-pushed the jz/python-tokenizer branch from 911b9d9 to 01ed13c February 12, 2025 22:49

larryliu0820 reviewed Feb 12, 2025

extension/llm/tokenizer/utils.py (resolved)

Mengwei pr rev

b0990da

jackzhxng force-pushed the jz/python-tokenizer branch from 237ebcb to b0990da February 13, 2025 00:52

jackzhxng requested a review from larryliu0820 February 13, 2025 00:52

larryliu0820 approved these changes Feb 13, 2025

jackzhxng merged commit 9ba5494 into main Feb 13, 2025
43 of 46 checks passed

jackzhxng deleted the jz/python-tokenizer branch February 13, 2025 21:38

jackzhxng mentioned this pull request Feb 20, 2025

Add qwen 2.5 #8355

Merged
