Add pcre2 as re2 fallback #50


Merged: 1 commit merged into main on Apr 21, 2025

Conversation

@jackzhxng (Contributor) commented Apr 15, 2025

Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this PR).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.
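For context on why a second regex engine is needed at all: RE2 guarantees linear-time matching by excluding lookaround assertions, so a pattern containing a negative lookbehind cannot be compiled by RE2 and must fall back to a backtracking engine such as PCRE2. A minimal Python sketch of the semantics involved (the pattern below is illustrative, not one of the actual HuggingFace pretokenizer patterns; Python's `re` supports fixed-width lookbehind like PCRE2 does):

```python
import re

# Negative lookbehind: match "cat" only when it is NOT immediately
# preceded by a digit. Linear-time engines like RE2 reject this
# construct at compile time; backtracking engines like PCRE2 accept it.
pattern = re.compile(r"(?<!\d)cat")

starts = [m.start() for m in pattern.finditer("cat 1cat 2cat cat")]
# The "cat" at index 0 and the final one at index 14 match;
# the occurrences directly after "1" and "2" are rejected.
```

A fallback wrapper along the lines of this PR would first try the primary engine and, only when compilation fails on such a construct, re-compile the same pattern with the lookbehind-capable engine.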

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2025
Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
@jackzhxng jackzhxng changed the base branch from jz/regex-2 to main April 19, 2025 02:01
@facebook-github-bot (Contributor):

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D73295314

jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng


@jackzhxng jackzhxng merged commit 9378e21 into main Apr 21, 2025
5 of 7 checks passed
jackzhxng added a commit to pytorch/executorch that referenced this pull request Apr 30, 2025
### Summary
Use the HuggingFace tokenizer from https://github.com/pytorch-labs/tokenizers in
the Llama runner.

Results on Qwen2.5 with `extension/llm/tokenizers` checked out to
pytorch-labs/tokenizers#50:
```
Once upon a time,  there was a little girl named Lily. She was very happy. She had a big garden in the back of her house. She planted many flowers in it. They were red, yellow and blue. They were very pretty. Lily loved them very much. One day, she was watering them. Suddenly, she heard a noise. It was a noise in the tree. She looked up. There was a big bird in the tree. It was eating one of Lily's flowers. Lily was very angry. She ran to the tree. "Hello!" she said to the bird. "What are you doing in my
I 00:00:08.624959 executorch:runner.cpp:294] RSS after finishing text generation: 2147.121094 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":4,"generated_tokens":123,"model_load_start_ms":1744936315023,"model_load_end_ms":1744936318524,"inference_start_ms":1744936318524,"inference_end_ms":1744936323646,"prompt_eval_end_ms":1744936318580,"first_token_ms":1744936318580,"aggregate_sampling_time_ms":274877907025,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:08.625019 executorch:stats.h:106]       Prompt Tokens: 4    Generated Tokens: 123
I 00:00:08.625021 executorch:stats.h:112]       Model Load Time:                3.501000 (seconds)
I 00:00:08.625023 executorch:stats.h:119]       Total inference time:           5.122000 (seconds)               Rate:  24.014057 (tokens/second)
I 00:00:08.625033 executorch:stats.h:129]               Prompt evaluation:      0.056000 (seconds)               Rate:  71.428571 (tokens/second)
I 00:00:08.625038 executorch:stats.h:138]               Generated 123 tokens:   5.066000 (seconds)               Rate:  24.279510 (tokens/second)
I 00:00:08.625045 executorch:stats.h:149]       Time to first generated token:  0.056000 (seconds)
I 00:00:08.625047 executorch:stats.h:155]       Sampling time over 127 tokens:  274877907.025000 (seconds)
```

### Test plan
Build llama runner locally (note the inclusion of
`-DSUPPORT_REGEX_LOOKAHEAD=ON`):
```
cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DSUPPORT_REGEX_LOOKAHEAD=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release
```

Run on Qwen2.5:
```
cmake-out/examples/models/llama/llama_main --model_path=qwen2_5.pte --tokenizer_path ~/hf/models--Qwen--Qwen2.5-1.5B/snapshots/8faed761d45a263340a0528343f099c05c9a4323/tokenizer.json --prompt="Once upon a time" --temperature 0
```
Labels: CLA Signed, fb-exported