Add python bindings #98

Open: wants to merge 8 commits into main

larryliu0820
Contributor

This pull request introduces Python bindings for the PyTorch Tokenizers library. It includes changes to support Python bindings in the build system, integration of pybind11, and updates to the Python package for distribution. Additionally, it modifies the tokenizer classes and adds testing configurations for the new bindings.

Python Bindings Integration:

  • Added Python bindings option in CMakeLists.txt: Introduced the TOKENIZERS_BUILD_PYTHON option and the logic to build Python bindings using pybind11. This includes creating the pytorch_tokenizers_cpp extension module and linking it with the tokenizers library.
  • New src/python_bindings.cpp file: Implemented Python bindings for tokenizers using pybind11. This includes binding classes like Tokenizer, HFTokenizer, Tiktoken, Llama2cTokenizer, and SPTokenizer.

Python Package Updates:

  • Updated setup.py for Python bindings: Added support for building the Python extension module using CMake and pybind11. This includes defining a custom CMakeBuild class for handling the build process.
  • Modified pytorch_tokenizers/__init__.py: Updated the package to include the new C++ tokenizer bindings and removed older Python implementations. Added error handling for failed imports.
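
The import error handling mentioned for pytorch_tokenizers/__init__.py could look roughly like the sketch below. This is not the code from this PR: the helper names load_cpp_bindings and require_bindings are hypothetical, and importing pytorch_tokenizers_cpp as a top-level module is an assumption based on the extension module name given above.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def load_cpp_bindings(
    module_name: str = "pytorch_tokenizers_cpp",  # assumed import path
) -> Tuple[Optional[ModuleType], Optional[str]]:
    """Try to import the compiled extension (hypothetical helper).

    Returns (module, None) on success, or (None, message) if the import fails,
    so that importing the package itself never raises.
    """
    try:
        return importlib.import_module(module_name), None
    except ImportError as exc:
        message = f"Could not import C++ tokenizer bindings '{module_name}': {exc}"
        return None, message


# Attempt the import once, when the package is first loaded.
_bindings, _import_error = load_cpp_bindings()


def require_bindings() -> ModuleType:
    """Return the extension module, or raise a descriptive ImportError."""
    if _bindings is None:
        raise ImportError(_import_error)
    return _bindings
```

Deferring the failure to require_bindings() keeps `import pytorch_tokenizers` usable on machines where the extension was not built, and the error only surfaces when a C++ tokenizer is actually requested.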

Testing Enhancements:

  • Added pytest.ini configuration: Configured Pytest for the project, including test discovery rules, ignored directories, and markers for different test types.
  • Defined Python tests in targets.bzl: Introduced a targets.bzl target for testing the Python bindings (test_python_bindings.py).
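
As an illustration of what a smoke test for the bindings might check, here is a minimal sketch. It is not the contents of test_python_bindings.py: it assumes the extension is importable as pytorch_tokenizers_cpp and that the bound classes are exposed under the C++ names listed in this description. It uses unittest so that it depends only on the standard library; pytest also collects unittest.TestCase classes.

```python
import importlib
import unittest

# Class names taken from the PR description; their Python-side names are assumed.
EXPECTED_CLASSES = (
    "Tokenizer",
    "HFTokenizer",
    "Tiktoken",
    "Llama2cTokenizer",
    "SPTokenizer",
)

try:
    bindings = importlib.import_module("pytorch_tokenizers_cpp")  # assumed path
except ImportError:
    bindings = None


@unittest.skipIf(bindings is None, "C++ tokenizer bindings are not built")
class BindingsSmokeTest(unittest.TestCase):
    def test_expected_classes_are_exposed(self):
        # Only checks that each name exists; it does not call any tokenizer API.
        for name in EXPECTED_CLASSES:
            with self.subTest(binding=name):
                self.assertTrue(hasattr(bindings, name))


def run_smoke_tests() -> unittest.TestResult:
    """Run the smoke test programmatically (hypothetical helper)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BindingsSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When the extension has not been built, the test is reported as skipped rather than failed, which keeps a pure-Python checkout green.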

Tokenizer Class Changes:

  • Added constructors to Tiktoken class: Introduced new constructors so that pybind11's init() can bind them (pybind11 doesn't support std::unique_ptr<std::vector<std::string>>).

Build System Changes:

  • Added Bazel target for Python bindings: Defined a targets.bzl target for building the Python bindings, including dependencies on tokenizer modules and pybind11.

@facebook-github-bot added the CLA Signed label Jul 8, 2025
@larryliu0820 requested a review from jackzhxng July 8, 2025 06:03
2 participants