Add pcre2 as re2 fallback #50


Merged: 1 commit merged into main on Apr 21, 2025

Conversation

@jackzhxng (Contributor) commented Apr 15, 2025

Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this PR).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.
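For context on why a second regex engine is needed at all: RE2 guarantees linear-time matching by excluding lookaround assertions, so a pattern containing a negative lookbehind cannot be compiled by RE2 and must fall back to a backtracking engine such as PCRE2. A minimal Python sketch of the semantics involved (the pattern below is illustrative, not one of the actual HuggingFace pretokenizer patterns; Python's `re` supports fixed-width lookbehind like PCRE2 does):

```python
import re

# Negative lookbehind: match "cat" only when it is NOT immediately
# preceded by a digit. Linear-time engines like RE2 reject this
# construct at compile time; backtracking engines like PCRE2 accept it.
pattern = re.compile(r"(?<!\d)cat")

starts = [m.start() for m in pattern.finditer("cat 1cat 2cat cat")]
# The "cat" at index 0 and the final one at index 14 match;
# the occurrences directly after "1" and "2" are rejected.
```

A fallback wrapper along the lines of this PR would first try the primary engine and, only when compilation fails on such a construct, re-compile the same pattern with the lookbehind-capable engine.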

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2025
Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
@jackzhxng jackzhxng changed the base branch from jz/regex-2 to main April 19, 2025 02:01
@facebook-github-bot (Contributor):

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D73295314

jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng


@jackzhxng jackzhxng merged commit 9378e21 into main Apr 21, 2025
5 of 7 checks passed
jackzhxng added a commit to pytorch/executorch that referenced this pull request Apr 30, 2025
### Summary
Use the HuggingFace tokenizer from https://github.com/pytorch-labs/tokenizers in
the Llama runner.

Results on Qwen2.5 with `extension/llm/tokenizers` checked out to
pytorch-labs/tokenizers#50:
```
Once upon a time,  there was a little girl named Lily. She was very happy. She had a big garden in the back of her house. She planted many flowers in it. They were red, yellow and blue. They were very pretty. Lily loved them very much. One day, she was watering them. Suddenly, she heard a noise. It was a noise in the tree. She looked up. There was a big bird in the tree. It was eating one of Lily's flowers. Lily was very angry. She ran to the tree. "Hello!" she said to the bird. "What are you doing in my
I 00:00:08.624959 executorch:runner.cpp:294] RSS after finishing text generation: 2147.121094 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":4,"generated_tokens":123,"model_load_start_ms":1744936315023,"model_load_end_ms":1744936318524,"inference_start_ms":1744936318524,"inference_end_ms":1744936323646,"prompt_eval_end_ms":1744936318580,"first_token_ms":1744936318580,"aggregate_sampling_time_ms":274877907025,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:08.625019 executorch:stats.h:106]       Prompt Tokens: 4    Generated Tokens: 123
I 00:00:08.625021 executorch:stats.h:112]       Model Load Time:                3.501000 (seconds)
I 00:00:08.625023 executorch:stats.h:119]       Total inference time:           5.122000 (seconds)               Rate:  24.014057 (tokens/second)
I 00:00:08.625033 executorch:stats.h:129]               Prompt evaluation:      0.056000 (seconds)               Rate:  71.428571 (tokens/second)
I 00:00:08.625038 executorch:stats.h:138]               Generated 123 tokens:   5.066000 (seconds)               Rate:  24.279510 (tokens/second)
I 00:00:08.625045 executorch:stats.h:149]       Time to first generated token:  0.056000 (seconds)
I 00:00:08.625047 executorch:stats.h:155]       Sampling time over 127 tokens:  274877907.025000 (seconds)
```

### Test plan
Build llama runner locally (note the inclusion of
`-DSUPPORT_REGEX_LOOKAHEAD=ON`):
```
cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DSUPPORT_REGEX_LOOKAHEAD=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release
```

Run on Qwen2.5:
```
cmake-out/examples/models/llama/llama_main --model_path=qwen2_5.pte --tokenizer_path ~/hf/models--Qwen--Qwen2.5-1.5B/snapshots/8faed761d45a263340a0528343f099c05c9a4323/tokenizer.json --prompt="Once upon a time" --temperature 0
```
Labels: CLA Signed, fb-exported