Calling a C++ function is roughly 2 times slower than calling a native Python function #2005
Comments
OK, but maybe changing std::vector to small_vector is still low-hanging fruit here. It could make things slightly better. It is not an uncommon use case to call a getter/setter or an arithmetic operation on an exposed type inside nested loops, and the number of arguments doesn't go above 3 in many cases.
The main devs can weigh in here, but I'll guess that adding Boost as a dependency won't fly.
I didn't mean adding a Boost dependency; I thought that maybe extracting small_vector from Boost wouldn't be a big problem.
Perhaps a PR would help move that discussion forward?
(deleted -- this was the wrong example; I can't find the one I wanted at the moment)
Actually, extracting small_vector from Boost is a big problem. After having a closer look at the code, I don't think it is such a good idea anymore. It would be better to use std::array there, because the number of arguments is known at compile time. However, it can't be applied directly at the moment; the dispatcher code has to be slightly refactored first, something like adding virtual methods to function_record and implementing them in a template function_record_impl<size_t NumberOfArgument> (rough sketch below). I don't mind doing it, but I don't want these efforts to be in vain. If only someone could confirm that this would be a useful change :)
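A rough sketch of that idea, to make the proposal concrete. This is hypothetical and heavily simplified, not pybind11's actual internals; the stub type and method names are illustrative only:

```cpp
#include <array>
#include <cstddef>

struct PyObjectStub;  // stand-in for PyObject* in this sketch

// The dispatcher would work against this virtual interface, staying generic.
struct function_record_base {
    virtual ~function_record_base() = default;
    virtual PyObjectStub* dispatch(PyObjectStub* args, PyObjectStub* kwargs) = 0;
};

// The derived template fixes the argument-buffer size at compile time,
// so no heap allocation (new/delete) is needed on every call.
template <std::size_t NumArgs>
struct function_record_impl : function_record_base {
    PyObjectStub* dispatch(PyObjectStub* args, PyObjectStub* kwargs) override {
        // Fixed-size, stack-allocated buffer instead of a std::vector.
        std::array<PyObjectStub*, NumArgs> converted_args{};
        // ... argument conversion and the actual call would go here ...
        (void)args; (void)kwargs; (void)converted_args;
        return nullptr;
    }
};
```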
Where do you get this from? I see a lot of run-time code...
It is also unclear what this means. Do you mean for your cases, or in general? Clearly it isn't true in general.
pybind11's dispatch code is clearly not very optimized, and contributions here are welcome as long as they don't complicate what is already a fairly complex implementation. Note that if this overhead is a significant problem for you, you're probably not using pybind11 in the most optimal way: to wrap expensive functionality implemented on the C++ side, such that function call overheads are irrelevant.
I made a PR with a draft implementation of the idea. I agree that, ideally, the expensive implementation should be in C++, thereby making function call overhead negligible. But in my case we have a domain model with a lot of classes that we just want to expose to Python, so that other devs who don't know C++ can use it to build something useful. Many of the exposed methods are simple getters and setters, and calling them introduces significant overhead.
@tarasko, I am experiencing a similar issue, in that any function call from Python to C++ seems to be quite expensive. Have you found any solution to this?
Commenting here too. I'm working on porting a significant portion of pure-Python functions to C++, with the hope that compiled code should run faster than interpreted code. Micro-benchmarking the simple functions I need to port reveals that the equivalent C++ function is noticeably slower. I'm wondering if porting pure-Python code to C++ and using pybind11 purely for performance is a common use case.
I am porting tens of thousands of lines of C++ to Python using pybind11, leaving the performance-critical items in C++, as well as some complex (data acquisition) hardware-related code that will be abandoned in the near future. From my experience so far, I would say that you want to minimize crossing between C++ and Python if you want performance. That is, if you have a loop in Python that runs 1 million times and calls a few C++ functions each time, it will be faster to write a C++ function that runs the loop 1 million times and calls those functions (assuming each loop iteration completes in a few microseconds); see the sketch below. Before deciding to use pybind11, I evaluated other options, and I believe Cython and PyPy looked like they may be more appropriate for frequent (tens of microseconds) language crossings. Just some food for thought... good luck!
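A minimal sketch of that pattern. The module and function names here are illustrative, not from the thread; the point is that binding a batched version lets Python cross the boundary once instead of once per iteration:

```cpp
#include <pybind11/pybind11.h>

double step(double x) { return x * 1.0001 + 1.0; }  // stand-in for a cheap C++ function

// Runs the whole loop on the C++ side: one Python -> C++ crossing in total.
double run(int iterations) {
    double x = 0.0;
    for (int i = 0; i < iterations; ++i)
        x = step(x);  // plain C++ call, no Python dispatch overhead
    return x;
}

PYBIND11_MODULE(hotloop, m) {
    m.def("step", &step);  // calling this in a Python loop pays dispatch cost per iteration
    m.def("run", &run);    // called once, amortizing the overhead across all iterations
}
```

From Python, `hotloop.run(1_000_000)` crosses the boundary once, while a Python loop calling `hotloop.step` a million times pays the dispatch overhead on every call.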
@everyone here: a 2x overhead for crossing the Python -> C++ -> Python boundary is actually quite reasonable. Even with considerable optimization work on pybind11, I find it unlikely that we could squeeze more than 20-30% of extra performance out of the dispatch mechanism without unacceptable effects elsewhere (e.g. binary size in PR #2013). So bumping this thread is not particularly helpful unless you have an amazing idea on how to improve performance by a larger margin and are willing to spend the time to implement it. If your code relies this much on calling functions that themselves execute very quickly, you should ask yourself whether Python is a good language to use in general. Something that compiles or JIT-compiles into more efficient code (e.g. Julia) would likely lead to significant speedups. I will close this issue.
@wjakob, I wonder: is it possible to completely bypass pybind11's dispatching logic and just use the raw Python C API to define functions, but expose them to Python using pybind11?
You don't need pybind11 for that -- that's what the Python C API does, and that's how pybind11 does its work. Once you start wanting more and begin writing wrappers around the Python C API to make it nicer and much more convenient to use, you are basically rewriting pybind11. I'll echo what @wjakob said: if you are micro-benchmarking tiny functions, then all you end up measuring is the overhead of crossing the Python<->C++ boundary. Writing a C++ function that calls some other tiny function 1 million times is going to be massively faster than looping in Python and calling a bound method from Python 1 million times. Often pushing your code "one more step" into C++ (such as by moving the loop into C++ in this example) can get you big performance gains.
Issue description
I tried to build a simple example with a function that adds two integers. I was surprised that calling such a function is roughly 2 times slower than calling its pure-Python counterpart. Measuring it with valgrind, I realized that the dispatcher incurs quite a lot of additional cost; a significant portion of it comes from the new/delete calls performed by std::vectors on every call, but the dispatch logic itself is also complex and not cheap. I tried to replace std::vector with boost::container::small_vector and it helped, but it's still not on par with the pure-Python implementation.
Any ideas how this could be improved? Cython also generates code that is faster than pybind11's and is on par with pure Python.
Reproducible example code
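The original snippet was not preserved in this copy of the thread. A minimal reproduction consistent with the description above (module and function names are illustrative) might look like:

```cpp
#include <pybind11/pybind11.h>

// The C++ side: a trivial function that adds two integers.
int add(int a, int b) { return a + b; }

PYBIND11_MODULE(example, m) {
    m.def("add", &add, "Add two integers");
}

// Benchmarked from Python against a pure-Python counterpart, e.g.:
//   python -m timeit -s "import example" "example.add(1, 2)"
//   python -m timeit -s "py_add = lambda a, b: a + b" "py_add(1, 2)"
```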