Add python bindings #98
Open
This pull request introduces Python bindings for the PyTorch Tokenizers library. It includes changes to support Python bindings in the build system, integration of `pybind11`, and updates to the Python package for distribution. Additionally, it modifies the tokenizer classes and adds testing configurations for the new bindings.

Python Bindings Integration:
- `CMakeLists.txt`: Introduced the `TOKENIZERS_BUILD_PYTHON` option and the logic to build Python bindings using `pybind11`. This includes creating the `pytorch_tokenizers_cpp` extension module and linking it with the tokenizers library.
- `src/python_bindings.cpp` file: Implemented Python bindings for tokenizers using `pybind11`. This includes binding classes like `Tokenizer`, `HFTokenizer`, `Tiktoken`, `Llama2cTokenizer`, and `SPTokenizer`.

Python Package Updates:
- `setup.py` for Python bindings: Added support for building the Python extension module using CMake and `pybind11`. This includes defining a custom `CMakeBuild` class for handling the build process.
- `pytorch_tokenizers/__init__.py`: Updated the package to include the new C++ tokenizer bindings and removed older Python implementations. Added error handling for failed imports.

Testing Enhancements:
- `pytest.ini` configuration: Configured Pytest for the project, including test discovery rules, ignored directories, and markers for different test types.
- `targets.bzl` target for testing the Python bindings (`test_python_bindings.py`).

Tokenizer Class Changes:
- `Tiktoken` class: Introduced new constructors so that pybind11 can bind `init()` to them (pybind11 doesn't support `std::unique_ptr<std::vector<std::string>>` arguments).

Build System Changes:
- `targets.bzl` target for building the Python bindings, including dependencies on tokenizer modules and `pybind11`.
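
For reviewers unfamiliar with CMake-driven Python builds, here is a rough sketch of the kind of command assembly a custom `CMakeBuild` class (as mentioned under Python Package Updates) typically performs. It is not the code from this PR: the helper names `configure_command`, `build_command`, and `run_cmake_build` are hypothetical, and the only flag taken from the description is `TOKENIZERS_BUILD_PYTHON`.

```python
import subprocess
from pathlib import Path


def configure_command(source_dir, output_dir, build_type="Release"):
    """Return the CMake configure command as an argument list (hypothetical helper)."""
    return [
        "cmake",
        str(source_dir),
        # Ask CMake to place the built extension where the packaging tool expects it.
        f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={output_dir}",
        f"-DCMAKE_BUILD_TYPE={build_type}",
        # Option named in the PR description; enables the pybind11 target.
        "-DTOKENIZERS_BUILD_PYTHON=ON",
    ]


def build_command(build_type="Release"):
    """Return the CMake build command for the current build directory."""
    return ["cmake", "--build", ".", "--config", build_type]


def run_cmake_build(source_dir, output_dir, build_temp, build_type="Release"):
    """Configure and build inside build_temp; raises CalledProcessError on failure."""
    build_temp = Path(build_temp)
    build_temp.mkdir(parents=True, exist_ok=True)
    subprocess.check_call(
        configure_command(source_dir, output_dir, build_type), cwd=build_temp
    )
    subprocess.check_call(build_command(build_type), cwd=build_temp)
```

In a setuptools `build_ext` subclass, `build_extension(self, ext)` would presumably call something like `run_cmake_build`, passing `self.build_temp` as the build directory and the parent of `self.get_ext_fullpath(ext.name)` as the output directory.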
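
The "error handling for failed imports" noted for `pytorch_tokenizers/__init__.py` could look roughly like the sketch below. This is an illustration under assumptions, not the PR's code: the function name `load_cpp_tokenizers` is hypothetical, and the top-level import path `pytorch_tokenizers_cpp` is assumed from the extension module name given in the description (the real package may import it from a different location).

```python
import importlib


def load_cpp_tokenizers(module_name="pytorch_tokenizers_cpp"):
    """Import the compiled extension, or raise an ImportError with build guidance."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that points at the likely cause: the
        # extension was never built or is not on the import path.
        raise ImportError(
            f"Failed to import the C++ tokenizer bindings ({module_name!r}). "
            "Make sure the extension module was built, for example with "
            "TOKENIZERS_BUILD_PYTHON enabled."
        ) from exc
```

Chaining with `from exc` keeps the original loader error in the traceback, which helps distinguish a missing module from a missing shared-library dependency.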