
Calling a C++ function is roughly 2 times slower than calling a native Python function #2005


Closed
tarasko opened this issue Nov 23, 2019 · 16 comments

Comments

@tarasko

tarasko commented Nov 23, 2019

Issue description

I tried to build a simple example with a function that adds two integers. I was surprised that calling such a function is roughly 2 times slower than calling its pure-Python counterpart. Measuring it with valgrind, I realized that the dispatcher incurs quite a lot of additional cost; a significant portion of it comes from the new/delete calls performed by std::vector on every call, but the dispatch logic itself is also complex and not cheap. I tried replacing std::vector with boost::container::small_vector and it helped, but it is still not on par with the pure-Python implementation:

struct function_call {
...
    /// Arguments passed to the function:
    boost::container::small_vector<handle, 2> args;

    /// The `convert` value the arguments should be loaded with
    boost::container::small_vector<bool, 2> args_convert;

Any ideas on how this could be improved? Cython also generates code that is faster than pybind11 and is on par with pure Python.

Reproducible example code

#include <pybind11/pybind11.h>

namespace py = pybind11;

__attribute__((noinline)) int simple(int a, int b) { return a + b; }

PYBIND11_MODULE(example_plugin, m) {
    m.doc() = "pybind11 example plugin"; // optional module docstring

    m.def("simple", &simple);
}
[ 50%] Building CXX object CMakeFiles/example_plugin.dir/main.cpp.o
/usr/bin/g++  -Dexample_plugin_EXPORTS -I/home/taras/example_plugin/pybind11/include -I/usr/include/python3.7m  -O2 -g -DNDEBUG -fPIC -fvisibility=hidden   -std=c++17 -flto -fno-fat-lto-objects -o CMakeFiles/example_plugin.dir/main.cpp.o -c /home/taras/example_plugin/main.cpp
[100%] Linking CXX shared module example_plugin.cpython-37m-x86_64-linux-gnu.so
/usr/bin/cmake -E cmake_link_script CMakeFiles/example_plugin.dir/link.txt --verbose=1
/usr/bin/g++ -fPIC -O2 -g -DNDEBUG  -shared  -o example_plugin.cpython-37m-x86_64-linux-gnu.so CMakeFiles/example_plugin.dir/main.cpp.o -flto 
# using std::vector
from example_plugin import simple as simple_cpp
%timeit simple_cpp(42, 94)
496 ns ± 20.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# using boost::container::small_vector
from example_plugin import simple as simple_cpp
%timeit simple_cpp(42, 94)
382 ns ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
def simple_pure_python(a, b):
    return a+b

%timeit simple_pure_python(42, 94)
260 ns ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@molpopgen

This is a duplicate of #1227 and #1825

Cython does something very different from pybind11. Your example also must be slower than pure Python. As you note, you are microbenchmarking new/delete, too.

@tarasko
Author

tarasko commented Nov 23, 2019

OK, but changing std::vector to a small_vector may still be low-hanging fruit here; it could make things slightly better. It is not an uncommon use case to call a getter/setter or an arithmetic operation on an exposed type inside nested loops. The number of arguments doesn't go above 3 in a lot of cases.
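
For illustration, here is a hypothetical binding of the shape I mean (the Point class and its members are made up, not taken from our real code):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical exposed type; every bound getter/setter or method call goes
// through the pybind11 dispatcher, so tight Python loops pay the full overhead.
struct Point {
    double x = 0.0, y = 0.0;
    double norm2() const { return x * x + y * y; }
};

PYBIND11_MODULE(geom, m) {
    py::class_<Point>(m, "Point")
        .def(py::init<>())
        .def_readwrite("x", &Point::x)
        .def_readwrite("y", &Point::y)
        .def("norm2", &Point::norm2);
}
```

A Python loop that touches p.x, p.y and p.norm2() millions of times then spends most of its time in the dispatcher rather than in the bound code.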

@molpopgen

The main devs can weigh in here, but I'll guess that adding Boost as a dependency won't fly.

@tarasko
Author

tarasko commented Nov 23, 2019

I didn't mean adding a Boost dependency; I thought that maybe extracting small_vector from Boost wouldn't be a big problem.

@molpopgen

molpopgen commented Nov 23, 2019 via email

@molpopgen

molpopgen commented Nov 23, 2019

(deleted -- this was the wrong example. can't find the one I wanted at the moment)

@tarasko
Author

tarasko commented Nov 23, 2019

Actually, extracting small_vector from Boost is a big problem. After having a closer look at the code, I don't think it is such a good idea anymore. It would be better to use std::array there because the number of arguments is known at compile time. However, that can't be applied directly at the moment; the dispatcher code would have to be slightly refactored first, for example by adding virtual methods to function_record and implementing them in a template function_record_impl<size_t NumberOfArgument>.
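
A rough sketch of what I have in mind (names and structure are illustrative only, not the actual pybind11 internals; in the real code the per-call argument storage lives in function_call, so the arrays would presumably have to move there or be produced by the record):

```cpp
#include <array>
#include <cstddef>

// Stand-in for pybind11::handle, just for this sketch.
struct handle {};

// Non-templated base so the dispatcher code can stay generic.
struct function_record {
    virtual ~function_record() = default;
    virtual handle *args() = 0;            // per-call argument slots
    virtual bool *args_convert() = 0;      // per-argument "convert" flags
    virtual std::size_t nargs() const = 0;
};

// One instantiation per arity: fixed-size std::array storage instead of
// std::vector, so no heap allocation on the call path.
template <std::size_t NumberOfArgument>
struct function_record_impl : function_record {
    std::array<handle, NumberOfArgument> args_storage{};
    std::array<bool, NumberOfArgument> args_convert_storage{};

    handle *args() override { return args_storage.data(); }
    bool *args_convert() override { return args_convert_storage.data(); }
    std::size_t nargs() const override { return NumberOfArgument; }
};
```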

I don't mind doing it, but I don't want the effort to be in vain. It would help if someone could confirm that this would be a useful change :)

@molpopgen

It would be better to use std::array there because the number of arguments is known at compile time.

Where do you get this from? I see a lot of run time code...

The number of arguments doesn't go above 3 in a lot of cases

It's also unclear what this means. Do you mean for your cases, or in general? Clearly it isn't true in general.

@wjakob
Member

wjakob commented Nov 25, 2019

pybind11's dispatch code is clearly not very optimized, and contributions here are welcome as long as they don't complicate what is already a fairly complex implementation. Note that if this overhead is a significant problem for you, you're probably not using pybind11 in the most optimal way -- to wrap expensive functionality implemented on the C++ side (such that function call overheads are irrelevant).

@tarasko
Author

tarasko commented Nov 30, 2019

I made a PR with a draft implementation of the idea.
#2013

I agree that, ideally, the expensive functionality should be in C++, thereby making function-call overhead negligible. But in my case we have a domain model with a lot of classes that we just want to expose to Python so that other devs who don't know C++ can use them to build something useful. A lot of the exposed methods are simple getters and setters, and calling them introduces significant overhead.
We are currently using Boost.Python, and I was looking at pybind11 in the hope of finding a better option.

@m4ce

m4ce commented Feb 24, 2020

@tarasko, I am experiencing a similar issue, in that any function call from Python to C++ seems to be quite expensive. Have you found a solution to this?

@mengdilin

Commenting here too. I'm working on porting a significant portion of pure-Python functions to C++, in the hope that compiled code will run faster than interpreted code. Micro-benchmarking the simple functions I need to port reveals that the equivalent C++ function is noticeably slower. I'm wondering whether porting pure-Python code to C++ and using pybind11 purely for performance is a common use case of pybind11.

@carlsonmark

I am porting tens of thousands of lines of C++ to Python using pybind11, leaving the performance-critical items in C++, as well as some complex (data acquisition) hardware-related code that will be abandoned in the near future.

From my experience so far, I would say that you want to minimize crossing between C++ and Python if you want performance.

That is, if you have a loop in Python that runs 1 million times and calls a few C++ functions on each iteration, it will be faster to write a C++ function that runs the loop 1 million times and calls those functions directly (assuming each loop iteration completes in a few microseconds).
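
As a hedged sketch of what I mean, reusing the simple() function from the top of this thread (the sum_simple name and signature are made up for illustration):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

int simple(int a, int b) { return a + b; }

// Runs the hot loop entirely on the C++ side, so the Python/C++ boundary is
// crossed once per call to sum_simple instead of once per iteration.
long long sum_simple(int n, int a, int b) {
    long long total = 0;
    for (int i = 0; i < n; ++i)
        total += simple(a, b);
    return total;
}

PYBIND11_MODULE(example_plugin, m) {
    m.def("simple", &simple);
    m.def("sum_simple", &sum_simple);
}
```

Calling example_plugin.sum_simple(1_000_000, 42, 94) once from Python should be far cheaper than calling simple 1 million times in a Python loop.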

Before deciding to use pybind11, I evaluated other options, and I believe Cython and PyPy looked like they might be more appropriate for frequent (tens-of-microseconds) language crossings.

Just some food for thought... good luck!

@wjakob
Member

wjakob commented Feb 27, 2020

@everyone here: A 2x overhead for crossing the Python -> C++ -> Python boundary is actually quite reasonable. Even with considerable optimization work on pybind11, I find it unlikely that we could squeeze more than 20-30% of extra performance out of the dispatch mechanism without unacceptable effects elsewhere (e.g. binary size in PR #2013). So bumping this thread is not particularly helpful, unless you have an amazing idea on how to improve performance by a larger margin and are willing to spend the time to implement it.

If your code relies this much on calling functions that themselves execute very quickly, you should ask yourself if Python is a good language to use in general. Something that compiles or JIT-compiles into more efficient code (e.g. Julia) would likely lead to significant speedups.

I will close this issue.

@wjakob wjakob closed this as completed Feb 27, 2020
@AndreiPashkin

@wjakob, I wonder - is it possible to completely bypass Pybind11's dispatching logic and just use raw Python C API to define functions but expose them to Python using Pybind11?

@jagerman
Member

@wjakob, I wonder - is it possible to completely bypass Pybind11's dispatching logic and just use raw Python C API to define functions but expose them to Python using Pybind11?

You don't need pybind11 for that -- that's what the Python C API does, and that's how pybind11 does its work. Once you start wanting more, and start writing wrappers around the Python C API to make it nicer and much more convenient to use, you are basically rewriting pybind11.

I'll echo what @wjakob said: if you are microbenchmarking tiny functions then all you end up measuring is the overhead of crossing the Python<->C++ boundary. Writing a C++ function that calls some other tiny function 1 million times is going to be massively faster than looping in Python and calling a bound method from Python 1 million times. Often pushing your code "one more step" into C++ (such as by moving the loop into C++ in this example) can get you big performance gains.
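
For the record, here is a rough, untested sketch of what mixing the two could look like: a plain CPython function object attached to a pybind11 module, so no pybind11 dispatch happens on the call path. This relies on module::add_object plus the raw C API; treat it as a sketch rather than a recommendation.

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Plain CPython-style function: no pybind11 dispatcher involved at call time.
static PyObject *raw_simple(PyObject * /*self*/, PyObject *args) {
    int a, b;
    if (!PyArg_ParseTuple(args, "ii", &a, &b))
        return nullptr;
    return PyLong_FromLong(static_cast<long>(a) + b);
}

static PyMethodDef raw_simple_def = {
    "raw_simple", raw_simple, METH_VARARGS, "Add two ints via the raw C API"};

PYBIND11_MODULE(example_plugin, m) {
    // Normal pybind11 binding, for comparison.
    m.def("simple", [](int a, int b) { return a + b; });

    // Attach a raw PyCFunction object to the module; calls to it bypass the
    // pybind11 dispatch machinery entirely.
    m.add_object("raw_simple",
                 py::reinterpret_steal<py::object>(
                     PyCFunction_NewEx(&raw_simple_def, nullptr, nullptr)));
}
```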
