Skip to content

Decouple tokenizers from Re2 and use IRegex interface #49

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 19, 2025

Conversation

jackzhxng
Copy link
Contributor

@jackzhxng jackzhxng commented Apr 15, 2025

🧱 Stack:

Testing

Pass CI

cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 15, 2025
@jackzhxng jackzhxng changed the base branch from main to jz/regex-1 April 16, 2025 17:01
@jackzhxng jackzhxng force-pushed the jz/regex-1 branch 8 times, most recently from 100430f to c2d4de8 Compare April 16, 2025 20:22
@jackzhxng jackzhxng changed the base branch from jz/regex-1 to main April 18, 2025 05:54
@facebook-github-bot
Copy link
Contributor

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Apr 18, 2025
Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D73238728

Summary:
🧱 Stack:
- [ ] #45
- [ ] #48
- [x] #49
- [ ] #50

### Testing
Pass CI
```
cmake -DTOKENIZERS_BUILD_TEST=ON -DCMAKE_BUILD_TYPE=Debug . -Bbuild && cmake --build build -j9 --config Debug
(cd build && ctest)
```


Differential Revision: D73238728

Pulled By: jackzhxng
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D73238728

@facebook-github-bot facebook-github-bot merged commit f52b18b into main Apr 19, 2025
4 of 6 checks passed
jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng
jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng
jackzhxng added a commit that referenced this pull request Apr 21, 2025
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng
split_with_allowed_special_token_(
re2::StringPiece& input,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use std::string_view?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot. fb-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants