
Calling a C++ function is roughly 2 times slower than calling a native Python function #2005


Closed
tarasko opened this issue Nov 23, 2019 · 16 comments

Comments

@tarasko

tarasko commented Nov 23, 2019

Issue description

I tried to build a simple example with a function that adds two integers. I was surprised that calling such a function is roughly 2 times slower than calling its pure-Python counterpart. Measuring it with valgrind, I realized that the dispatcher incurs quite a lot of additional cost; a significant portion of it comes from the new/delete calls performed by std::vector on every call, but the dispatch logic itself is also complex and not cheap. I tried replacing std::vector with boost::container::small_vector and it helped, but it is still not on par with the pure-Python implementation:

struct function_call {
...
    /// Arguments passed to the function:
    boost::container::small_vector<handle, 2> args;

    /// The `convert` value the arguments should be loaded with
    boost::container::small_vector<bool, 2> args_convert;

Any ideas on how this could be improved? Cython also generates code that is faster than pybind11 and is on par with pure Python.

Reproducible example code

#include <pybind11/pybind11.h>

namespace py = pybind11;

__attribute__((noinline)) int simple(int a, int b) { return a + b; }

PYBIND11_MODULE(example_plugin, m) {
    m.doc() = "pybind11 example plugin"; // optional module docstring

    m.def("simple", &simple);
}
[ 50%] Building CXX object CMakeFiles/example_plugin.dir/main.cpp.o
/usr/bin/g++  -Dexample_plugin_EXPORTS -I/home/taras/example_plugin/pybind11/include -I/usr/include/python3.7m  -O2 -g -DNDEBUG -fPIC -fvisibility=hidden   -std=c++17 -flto -fno-fat-lto-objects -o CMakeFiles/example_plugin.dir/main.cpp.o -c /home/taras/example_plugin/main.cpp
[100%] Linking CXX shared module example_plugin.cpython-37m-x86_64-linux-gnu.so
/usr/bin/cmake -E cmake_link_script CMakeFiles/example_plugin.dir/link.txt --verbose=1
/usr/bin/g++ -fPIC -O2 -g -DNDEBUG  -shared  -o example_plugin.cpython-37m-x86_64-linux-gnu.so CMakeFiles/example_plugin.dir/main.cpp.o -flto 
# using std::vector
from example_plugin import simple as simple_cpp
%timeit simple_cpp(42, 94)
496 ns ± 20.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# using boost::container::small_vector
from example_plugin import simple as simple_cpp
%timeit simple_cpp(42, 94)
382 ns ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
def simple_pure_python(a, b):
    return a+b

%timeit simple_pure_python(42, 94)
260 ns ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@molpopgen

This is a duplicate of #1227 and #1825

Cython does something very different from pybind11. Your example also must be slower than pure Python. As you note, you are microbenchmarking new/delete, too.

@tarasko
Author

tarasko commented Nov 23, 2019

OK, but changing std::vector to a small_vector may still be low-hanging fruit here; it could make things slightly better. It is not an uncommon use case to call a getter/setter or an arithmetic operation on an exposed type inside nested loops. The number of arguments doesn't go above 3 in a lot of cases.
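
For illustration, here is a hypothetical binding of the shape I mean (the Point class and its members are made up, not taken from our real code):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical exposed type; every bound getter/setter or method call goes
// through the pybind11 dispatcher, so tight Python loops pay the full overhead.
struct Point {
    double x = 0.0, y = 0.0;
    double norm2() const { return x * x + y * y; }
};

PYBIND11_MODULE(geom, m) {
    py::class_<Point>(m, "Point")
        .def(py::init<>())
        .def_readwrite("x", &Point::x)
        .def_readwrite("y", &Point::y)
        .def("norm2", &Point::norm2);
}
```

A Python loop that touches p.x, p.y and p.norm2() millions of times then spends most of its time in the dispatcher rather than in the bound code.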

@molpopgen

The main devs can weigh in here, but I'll guess that adding Boost as a dependency won't fly.

@tarasko
Author

tarasko commented Nov 23, 2019

I didn't mean adding a Boost dependency; I thought that maybe extracting small_vector from Boost wouldn't be a big problem.

@molpopgen

molpopgen commented Nov 23, 2019 via email

@molpopgen

molpopgen commented Nov 23, 2019

(deleted -- this was the wrong example. can't find the one I wanted at the moment)

@tarasko
Author

tarasko commented Nov 23, 2019

Actually, extracting small_vector from Boost is a big problem. After having a closer look at the code, I don't think it is such a good idea anymore. It would be better to use std::array there because the number of arguments is known at compile time. However, that can't be applied directly at the moment; the dispatcher code would have to be slightly refactored first, for example by adding virtual methods to function_record and implementing them in a template function_record_impl<size_t NumberOfArgument>.
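
A rough sketch of what I have in mind (names and structure are illustrative only, not the actual pybind11 internals; in the real code the per-call argument storage lives in function_call, so the arrays would presumably have to move there or be produced by the record):

```cpp
#include <array>
#include <cstddef>

// Stand-in for pybind11::handle, just for this sketch.
struct handle {};

// Non-templated base so the dispatcher code can stay generic.
struct function_record {
    virtual ~function_record() = default;
    virtual handle *args() = 0;            // per-call argument slots
    virtual bool *args_convert() = 0;      // per-argument "convert" flags
    virtual std::size_t nargs() const = 0;
};

// One instantiation per arity: fixed-size std::array storage instead of
// std::vector, so no heap allocation on the call path.
template <std::size_t NumberOfArgument>
struct function_record_impl : function_record {
    std::array<handle, NumberOfArgument> args_storage{};
    std::array<bool, NumberOfArgument> args_convert_storage{};

    handle *args() override { return args_storage.data(); }
    bool *args_convert() override { return args_convert_storage.data(); }
    std::size_t nargs() const override { return NumberOfArgument; }
};
```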

I don't mind doing it, but I don't want the effort to be in vain. It would help if someone could confirm that this would be a useful change :)

@molpopgen

It would be better to use std::array there because the number of arguments is known at compile time.

Where do you get this from? I see a lot of run time code...

The number of arguments doesn't go above 3 in a lot of cases

It's also unclear what this means. Do you mean for your cases, or in general? Clearly it isn't true in general.

@wjakob
Member

wjakob commented Nov 25, 2019

pybind11's dispatch code is clearly not very optimized, and contributions here are welcome as long as they don't complicate what is already a fairly complex implementation. Note that if this overhead is a significant problem for you, you're probably not using pybind11 in the most optimal way -- to wrap expensive functionality implemented on the C++ side (such that function call overheads are irrelevant).

@tarasko
Author

tarasko commented Nov 30, 2019

I made a PR with a draft implementation of the idea.
#2013

I agree that, ideally, the expensive functionality should be in C++, thereby making function-call overhead negligible. But in my case we have a domain model with a lot of classes that we just want to expose to Python so that other devs who don't know C++ can use them to build something useful. A lot of the exposed methods are simple getters and setters, and calling them introduces significant overhead.
We are currently using Boost.Python, and I was looking at pybind11 in the hope of finding a better option.

@m4ce

m4ce commented Feb 24, 2020

@tarasko, I am experiencing a similar issue, in that any function call from Python to C++ seems to be quite expensive. Have you found a solution to this?

@mengdilin

Commenting here too. I'm working on porting a significant portion of pure-Python functions to C++, in the hope that compiled code will run faster than interpreted code. Micro-benchmarking the simple functions I need to port reveals that the equivalent C++ function is noticeably slower. I'm wondering whether porting pure-Python code to C++ and using pybind11 purely for performance is a common use case of pybind11.

@carlsonmark

I am porting tens of thousands of lines of C++ to Python using pybind11, leaving the performance-critical items in C++, as well as some complex (data acquisition) hardware-related code that will be abandoned in the near future.

From my experience so far, I would say that you want to minimize crossing between C++ and Python if you want performance.

That is, if you have a loop in Python that runs 1 million times and calls a few C++ functions on each iteration, it will be faster to write a C++ function that runs the loop 1 million times and calls those functions directly (assuming each loop iteration completes in a few microseconds).
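
As a hedged sketch of what I mean, reusing the simple() function from the top of this thread (the sum_simple name and signature are made up for illustration):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

int simple(int a, int b) { return a + b; }

// Runs the hot loop entirely on the C++ side, so the Python/C++ boundary is
// crossed once per call to sum_simple instead of once per iteration.
long long sum_simple(int n, int a, int b) {
    long long total = 0;
    for (int i = 0; i < n; ++i)
        total += simple(a, b);
    return total;
}

PYBIND11_MODULE(example_plugin, m) {
    m.def("simple", &simple);
    m.def("sum_simple", &sum_simple);
}
```

Calling example_plugin.sum_simple(1_000_000, 42, 94) once from Python should be far cheaper than calling simple 1 million times in a Python loop.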

Before deciding to use pybind11, I evaluated other options, and I believe Cython and PyPy looked like they might be more appropriate for frequent (tens-of-microseconds) language crossings.

Just some food for thought... good luck!

@wjakob
Member

wjakob commented Feb 27, 2020

@everyone here: A 2x overhead for crossing the Python -> C++ -> Python boundary is actually quite reasonable. Even with considerable optimization work on pybind11, I find it unlikely that we could squeeze more than 20-30% of extra performance out of the dispatch mechanism without unacceptable effects elsewhere (e.g. binary size in PR #2013). So bumping this thread is not particularly helpful, unless you have an amazing idea on how to improve performance by a larger margin and are willing to spend the time to implement it.

If your code relies this much on calling functions that themselves execute very quickly, you should ask yourself if Python is a good language to use in general. Something that compiles or JIT-compiles into more efficient code (e.g. Julia) would likely lead to significant speedups.

I will close this issue.

@wjakob wjakob closed this as completed Feb 27, 2020
@AndreiPashkin

@wjakob, I wonder - is it possible to completely bypass Pybind11's dispatching logic and just use raw Python C API to define functions but expose them to Python using Pybind11?

@jagerman
Member

@wjakob, I wonder - is it possible to completely bypass Pybind11's dispatching logic and just use raw Python C API to define functions but expose them to Python using Pybind11?

You don't need pybind11 for that -- that's what the Python C API does, and that's how pybind11 does its work. Once you start wanting more, and start writing wrappers around the Python C API to make it nicer and much more convenient to use, you are basically rewriting pybind11.

I'll echo what @wjakob said: if you are microbenchmarking tiny functions then all you end up measuring is the overhead of crossing the Python<->C++ boundary. Writing a C++ function that calls some other tiny function 1 million times is going to be massively faster than looping in Python and calling a bound method from Python 1 million times. Often pushing your code "one more step" into C++ (such as by moving the loop into C++ in this example) can get you big performance gains.
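
For the record, here is a rough, untested sketch of what mixing the two could look like: a plain CPython function object attached to a pybind11 module, so no pybind11 dispatch happens on the call path. This relies on module::add_object plus the raw C API; treat it as a sketch rather than a recommendation.

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Plain CPython-style function: no pybind11 dispatcher involved at call time.
static PyObject *raw_simple(PyObject * /*self*/, PyObject *args) {
    int a, b;
    if (!PyArg_ParseTuple(args, "ii", &a, &b))
        return nullptr;
    return PyLong_FromLong(static_cast<long>(a) + b);
}

static PyMethodDef raw_simple_def = {
    "raw_simple", raw_simple, METH_VARARGS, "Add two ints via the raw C API"};

PYBIND11_MODULE(example_plugin, m) {
    // Normal pybind11 binding, for comparison.
    m.def("simple", [](int a, int b) { return a + b; });

    // Attach a raw PyCFunction object to the module; calls to it bypass the
    // pybind11 dispatch machinery entirely.
    m.add_object("raw_simple",
                 py::reinterpret_steal<py::object>(
                     PyCFunction_NewEx(&raw_simple_def, nullptr, nullptr)));
}
```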
