Decouple tokenizers from Re2 and use IRegex interface #49
@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request was exported from Phabricator. Differential Revision: D73238728
Summary:
Adds pcre2 to handle the negative lookbehinds in HuggingFace tokenizers.

Performance stays about the same from test runs [before](https://github.com/pytorch-labs/tokenizers/actions/runs/14480863330/job/40617329721#step:14:758) (run on last commit on main) and [after](https://github.com/pytorch-labs/tokenizers/actions/runs/14526152504/job/40757962551#step:14:901) (this pr).

Tokenizer library size (from `ls -lh build/libtokenizers.a`): `13M` (on main) -> `15M`. This most likely comes from adding the `pcre2` lib.

🧱 Stack:
- [ ] #45
- [ ] #48
- [ ] #49
- [x] #50

Pull Request resolved: #50

Differential Revision: D73295314

Pulled By: jackzhxng
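For readers following the stack, here is a rough sketch of what decoupling the tokenizers from a concrete regex library could look like. It is not the code from this PR: the `Match` struct, the `find_all` method, the `StdRegex` backend, and the `create_regex` factory are all hypothetical names, and `std::regex` stands in for RE2 or pcre2 only so the example builds with the standard library alone.

```cpp
#include <cstddef>
#include <memory>
#include <regex>
#include <string>
#include <vector>

// A match reported as a half-open byte range [start, end) into the input.
// (Hypothetical type, chosen for this sketch.)
struct Match {
  std::size_t start;
  std::size_t end;
};

// Hypothetical interface: tokenizer code depends on this abstraction,
// not on any particular regex library.
class IRegex {
 public:
  virtual ~IRegex() = default;
  virtual std::vector<Match> find_all(const std::string& text) const = 0;
};

// Stand-in backend built on std::regex. A real build could provide
// RE2-backed and pcre2-backed classes behind the same interface.
class StdRegex final : public IRegex {
 public:
  explicit StdRegex(const std::string& pattern) : regex_(pattern) {}

  std::vector<Match> find_all(const std::string& text) const override {
    std::vector<Match> matches;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), regex_);
         it != end; ++it) {
      const auto start = static_cast<std::size_t>(it->position(0));
      const auto length = static_cast<std::size_t>(it->length(0));
      matches.push_back({start, start + length});
    }
    return matches;
  }

 private:
  std::regex regex_;
};

// Hypothetical factory: the one place that knows which backend is in use.
std::unique_ptr<IRegex> create_regex(const std::string& pattern) {
  return std::make_unique<StdRegex>(pattern);
}
```

With a shape like this, a factory could pick a different backend for patterns the default engine rejects (such as those with lookbehinds) without touching the call sites.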
split_with_allowed_special_token_(
    re2::StringPiece& input,
Why not use `std::string_view`?
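To illustrate the suggestion, a minimal sketch of a splitter that takes `std::string_view&` instead of `re2::StringPiece&`. This is not the PR's implementation: `split_at_special_token` and its return type are hypothetical, and it uses a plain substring search rather than a regex, assuming C++17.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: finds the earliest occurrence of any special token in
// `input`, returns the text before it together with the token, and advances
// `input` past the token. If no token is found, returns the whole remaining
// input and leaves `input` empty.
std::pair<std::string_view, std::optional<std::string_view>>
split_at_special_token(std::string_view& input,
                       const std::vector<std::string>& special_tokens) {
  std::size_t best_pos = std::string_view::npos;
  std::size_t best_len = 0;
  for (const std::string& token : special_tokens) {
    if (token.empty()) continue;  // an empty token would match everywhere
    const std::size_t pos = input.find(token);
    if (pos != std::string_view::npos && pos < best_pos) {
      best_pos = pos;
      best_len = token.size();
    }
  }

  if (best_pos == std::string_view::npos) {
    const std::string_view rest = input;
    input.remove_prefix(input.size());
    return {rest, std::nullopt};
  }

  // Both views point into the caller's buffer, so nothing is copied.
  const std::string_view prefix = input.substr(0, best_pos);
  const std::string_view token = input.substr(best_pos, best_len);
  input.remove_prefix(best_pos + best_len);
  return {prefix, token};
}
```

Like `re2::StringPiece`, `std::string_view` is a non-owning pointer-and-length pair, so the in/out parameter pattern carries over via `remove_prefix`, and the signature no longer mentions RE2.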
Testing
Pass CI