[Experiment] Switching from pybind11 to nanobind for function call overhead improvements #3
base: main
Conversation
@TkTech Did you see the instructions I included here? https://github.com/wjakob/nanobind/blob/master/src/nb_combined.cpp. This should allow you to compile with essentially any other kind of build system, though some work will be needed to replicate all the bells and whistles of what nanobind's CMake tooling provides out of the box. Out of curiosity, what's the relative speedup over the previous pybind11-based version?
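For reference, a hedged sketch of what the `nb_combined.cpp` route could look like with plain setuptools instead of CMake. The paths (`extern/nanobind`, `src/can_ada.cpp`) and the module name are assumptions, not the project's actual layout:

```python
# Hypothetical sketch: compiling nanobind's amalgamated nb_combined.cpp
# directly into the extension with plain setuptools, avoiding the CMake
# build dependency. Paths and names here are assumptions.
import os
from setuptools import Extension

NANOBIND_DIR = os.environ.get("NANOBIND_DIR", "extern/nanobind")

ext = Extension(
    "can_ada",
    sources=[
        "src/can_ada.cpp",  # the binding code (assumed filename)
        os.path.join(NANOBIND_DIR, "src", "nb_combined.cpp"),  # amalgamated nanobind runtime
    ],
    include_dirs=[
        os.path.join(NANOBIND_DIR, "include"),
        os.path.join(NANOBIND_DIR, "ext", "robin_map", "include"),  # nanobind's bundled hash map
    ],
    extra_compile_args=["-std=c++17", "-O2"],  # nanobind requires C++17
)

# This would then be passed to setup(ext_modules=[ext]) in setup.py.
```

Note that the size optimizations and platform-specific link flags that nanobind's CMake tooling applies would need to be replicated by hand.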
@wjakob That's fantastic; I'll give it a full read this weekend and try it out. The relative speedup is 30-33%.
Amazing.
Any update on this? I'd like to see a switch to nanobind. I was going to implement a Cython version, but if there is a nanobind version there is no need, since nanobind is pretty much as fast as Cython. Let me know if I can help!
I think the latest Cython, with its recent improvements, should give just as much of a speedup as nanobind, and it wouldn't require CMake as a hard dependency. I'm not 100% sure, because we would need to test this in practice. I would love to see updates on this as well. :D
Hm, I built the package locally with nanobind based on this PR, plus the missing changes pulled from latest main, and ran the benchmarks. This is what I got.

This benchmark was run with Python 3.12.6 on an Ubuntu 24.04.2 aarch64 machine:

============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.6, pytest-8.4.0, pluggy-1.6.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/eon/GitHub/can_ada
configfile: pyproject.toml
plugins: benchmark-5.1.0
collected 16 items
tests/test_benchmark.py .... [ 25%]
tests/test_idna.py .. [ 37%]
tests/test_misc.py . [ 43%]
tests/test_parsing.py ... [ 62%]
tests/test_search.py ...... [100%]
------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_can_ada_parse 74.2095 (1.0) 77.3226 (1.0) 76.2115 (1.0) 0.9218 (1.0) 76.4414 (1.0) 0.5105 (1.0) 4;2 13.1214 (1.0) 13 1
test_ada_python_parse 244.2262 (3.29) 252.6127 (3.27) 247.8758 (3.25) 3.4034 (3.69) 246.5964 (3.23) 5.2646 (10.31) 2;0 4.0343 (0.31) 5 1
test_yarl_parse 392.9242 (5.29) 404.8161 (5.24) 398.3661 (5.23) 4.5408 (4.93) 399.1666 (5.22) 6.1517 (12.05) 2;0 2.5103 (0.19) 5 1
test_urllib_parse 518.3281 (6.98) 526.5334 (6.81) 524.5661 (6.88) 3.5079 (3.81) 525.9881 (6.88) 2.6860 (5.26) 1;1 1.9063 (0.15) 5 1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
============================================================================================== 16 passed in 10.43s =============================================================================================

This benchmark was run with Python 3.13.3 on an Arch Linux x64 machine:

============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.13.3, pytest-8.4.0, pluggy-1.6.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /media/Data/GitHub/can_ada
configfile: pyproject.toml
plugins: benchmark-5.1.0
collected 16 items
tests/test_benchmark.py .... [ 25%]
tests/test_idna.py .. [ 37%]
tests/test_misc.py . [ 43%]
tests/test_parsing.py ... [ 62%]
tests/test_search.py ...... [100%]
------------------------------------------------------------------------------------- benchmark: 4 tests ------------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_can_ada_parse 44.0162 (1.0) 49.4297 (1.0) 44.9846 (1.0) 1.4984 (1.11) 44.3686 (1.0) 0.8665 (1.0) 3;3 22.2298 (1.0) 20 1
test_ada_python_parse 139.0202 (3.16) 146.0516 (2.95) 141.0748 (3.14) 2.8255 (2.09) 139.7587 (3.15) 3.8139 (4.40) 2;0 7.0884 (0.32) 7 1
test_yarl_parse 267.3834 (6.07) 273.6444 (5.54) 269.1717 (5.98) 2.6015 (1.93) 268.1252 (6.04) 2.8348 (3.27) 1;0 3.7151 (0.17) 5 1
test_urllib_parse 307.3849 (6.98) 310.5946 (6.28) 309.1844 (6.87) 1.3497 (1.0) 308.8693 (6.96) 2.1906 (2.53) 2;0 3.2343 (0.15) 5 1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
============================================================================================== 16 passed in 7.42s ===============================================================================================

though I had to make some changes in
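The numbers above come from pytest-benchmark. As a rough illustration of the same kind of per-call measurement using only the standard library (with `urllib.parse` as the stand-in, since `can_ada` may not be installed), a sketch might look like:

```python
# Rough stdlib-only sketch of a parse benchmark, illustrating what the
# test_*_parse benchmarks above measure. urllib.parse stands in for
# can_ada here; swapping in can_ada.parse is left as an assumption.
import timeit
import urllib.parse

URLS = ["https://example.com/path?q=1#frag"] * 1000

def parse_all(urls):
    # Parse every URL once per benchmark round.
    return [urllib.parse.urlparse(u) for u in urls]

repeats = 10
elapsed = timeit.timeit(lambda: parse_all(URLS), number=repeats)
per_call_us = elapsed / (repeats * len(URLS)) * 1e6
print(f"urllib.parse.urlparse: {per_call_us:.2f} us/call")
```

Unlike pytest-benchmark, this does no warmup or outlier handling, so treat the numbers as ballpark only.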
Switching from pybind11 to nanobind offers some performance improvements with minimal code changes. Our new benchmarks are:
I'm routinely seeing 6-7x better performance over urllib, and significantly improved performance when actually using the results (i.e. accessing `result.pathname`) due to lowered attribute access overhead.

However, this introduces CMake as a build-time dependency and reduces the available targets (CPython 3.8+, PyPy > 3.8). I have not yet found a way to eliminate CMake as a dependency. I don't really mind if we only target newer versions of Python.
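The attribute access overhead can be made concrete with a small timing sketch. This uses urllib's `ParseResult` as a stand-in; treating can_ada's URL object the same way is an assumption:

```python
# Sketch of per-attribute-access cost, using urllib's ParseResult as a
# stand-in for can_ada's URL object (an assumption). For pybind11 and
# nanobind objects, the binding layer adds to each read's cost, which is
# where nanobind's lower overhead shows up.
import timeit
import urllib.parse

result = urllib.parse.urlparse("https://example.com/some/path?q=1")

# Repeated attribute reads pay the descriptor-lookup cost every time.
n = 200_000
per_access = timeit.timeit(lambda: result.path, number=n) / n

# Hoisting the attribute into a local variable pays that cost once.
path = result.path
per_local = timeit.timeit(lambda: path, number=n) / n

print(f"attribute read: {per_access * 1e9:.1f} ns, local read: {per_local * 1e9:.1f} ns")
```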
@lemire @wjakob