
optimization of strided array vectorize #671


Closed
nbecker opened this issue Feb 15, 2017 · 12 comments

@nbecker

nbecker commented Feb 15, 2017

I'm comparing the performance of py::vectorize with that of nd::Array under different test conditions. I find that py::vectorize is a factor of 2 slower than nd::Array on this test for a 1-d array with stride 2.

In the following test code, uncommenting the first block overloads the function for the 1-d non-contiguous case using nd::Array; leaving it commented out means the py::vectorize version is used. In the nd::Array case, the profile looks like:

     340  67.3%  67.3%      421  83.4% pybind11::cpp_function::initialize::{lambda#3}::_FUN [clone .lto_priv.280]
      81  16.0%  83.4%       81  16.0% __ieee754_atan2_avx
      36   7.1%  90.5%       36   7.1% __brk

While for py::vectorize:

     608  50.5%  50.5%      608  50.5% _aligned_strided_to_contig_size16
     167  13.9%  64.4%      170  14.1% pybind11::detail::vectorize_helper::run [clone .isra.536] [clone .lto_priv.257]
     123  10.2%  74.6%      123  10.2% __ieee754_atan2_avx
     116   9.6%  84.3%      279  23.2% pybind11::detail::vectorize_helper::run [clone .isra.533] [clone .lto_priv.258]
     104   8.6%  92.9%      104   8.6% __brk
  // m.def("rect_to_polar", [](nd::Array<complex_t,1,0> const& a) {
  //     nd::Vector<size_t,1> shape = a.getShape();
  //     nd::Array<double,1,1> r (shape[0]);
  //     nd::Array<double,1,1> theta (shape[0]);
  //     for (int i = 0; i < shape[0]; ++i) {
  //       r[i] = std::abs(a[i]);
  //       theta[i] = std::arg(a[i]);
  //     }
  //     return py::make_tuple (r, theta);
  //   });

  m.def("rect_to_polar", [](py::array a) {
      if (py::isinstance<py::array_t<complex_t>>(a))
        return py::make_tuple (py::vectorize([](complex_t x) { return std::abs(x); })(a),
                               py::vectorize([](complex_t x) { return std::arg(x); })(a)
                               );
      else
        throw py::type_error("rect_to_polar unhandled type");
    });

The Python test code:

import numpy as np
from timeit import timeit
w = np.ones(1000000, dtype=complex)
#print (timeit('rect_to_polar(w)', 'from __main__ import rect_to_polar, w', number=1000))
print (timeit('rect_to_polar(w[::2])', 'from numpy_fncs import rect_to_polar; from __main__ import w', number=1000))
#print (timeit('rect_to_polar(w.reshape(w.shape[0]//2, 2))', 'from numpy_fncs import rect_to_polar; from __main__ import w', number=1000))

run as:

[nbecker@nbecker2 pybind11_test]$ python3 -m yep -- test_numpy_fncs.py 
11.713223681988893
PROFILE: interrupts/evictions/bytes = 1203/109/41896
[nbecker@nbecker2 pybind11_test]$ pprof --text /usr/bin/python3 test_numpy_fncs.py.prof > test_numpy_fncs.test2.pyarray.profile
Using local file /usr/bin/python3.
Using local file test_numpy_fncs.py.prof.

Thanks for your help,
Neal

@dean0x7d
Member

The performance hit that you're seeing is due to py::vectorize creating a copy of the non-unit-stride array (it converts it to unit stride before executing the function). It should be possible to resolve this by taking non-unit strides into account natively, but I'm not too familiar with py::vectorize (I don't use it myself), so I'd rather not muck around in the code. But if anyone wants to take a look, I've simplified the test code somewhat (see below), which might help.
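To make the copy concrete, here is a minimal sketch (not code from this thread; it only uses the documented py::array_t flags): requesting c_style | forcecast triggers the same convert-to-contiguous step, so a non-unit-stride input like w[::2] is copied into a fresh buffer before the function body ever runs.

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;

// c_style | forcecast asks pybind11 for a C-contiguous buffer; a
// strided input is silently copied to satisfy the request, which is
// the extra pass visible in the profile above.
double sum_contig(py::array_t<double, py::array::c_style | py::array::forcecast> a) {
    auto r = a.unchecked<1>();  // direct 1-d access, no bounds checks
    double s = 0;
    for (py::ssize_t i = 0; i < r.shape(0); ++i)
        s += r(i);
    return s;
}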

@nbecker If you need a quick workaround, you could try using Eigen (example below) or maybe xtensor-python, which has a similar pyvectorize function and may not have the strides limitation.

#include <pybind11/pybind11.h>
#include <pybind11/eigen.h>
#include <complex>
namespace py = pybind11;
using complex_t = std::complex<double>;

PYBIND11_PLUGIN(cmake_example) {
    py::module m("cmake_example");

    m.def("rect_to_polar1", [](py::EigenDRef<Eigen::ArrayXcd const> a) {
        return py::make_tuple(abs(a), arg(a));
    });

    m.def("rect_to_polar2", [](py::array_t<complex_t> a) {
        return py::make_tuple(py::vectorize([](complex_t x) { return std::abs(x); })(a),
                              py::vectorize([](complex_t x) { return std::arg(x); })(a));
    });

    return m.ptr();
}
The Python test script:

from cmake_example import rect_to_polar1, rect_to_polar2
import numpy as np
from timeit import timeit

w = np.ones(1000000, dtype=complex)
w_half = np.ones(w.size // 2, dtype=complex)
n = 200

print("rect_to_polar1:")
print("  unit:", timeit(lambda: rect_to_polar1(w_half), number=n))
print("  non-unit:", timeit(lambda: rect_to_polar1(w[::2]), number=n))

print("rect_to_polar2:")
print("  unit:", timeit(lambda: rect_to_polar2(w_half), number=n))
print("  non-unit:", timeit(lambda: rect_to_polar2(w[::2]), number=n))

@jagerman jagerman added this to the v2.1 milestone Mar 14, 2017
@jagerman
Member

I'll look at this; I think it should be a fairly easy fix.

@jagerman jagerman self-assigned this Mar 14, 2017
@nbecker
Author

nbecker commented Mar 14, 2017

That would be great. I'd love to make use of py::array_t and vectorize, since they allow dimension-independent code, but right now the performance on some tests lags by a factor of 2 compared with explicit loops and dimension-dependent code (via nd::Array).

@jagerman
Member

@nbecker - can you give #730 a try and see if that resolves the performance hit? If so, I'll merge it for 2.1.

@nbecker
Author

nbecker commented Mar 15, 2017

I think it's improved, but not perfect. Here are some test results:

 python3 ./test_mag_sqr.py

dtype: <class 'float'>
mag_sqr_ufunc(u) 1.285100394001347 (1000000,)
mag_sqr_ufunc(u[::2]) 1.308188828988932 (500000,)
mag_sqr_ufunc(u.reshape(u.shape[0]//2,2)) 1.2440469749999465 (500000, 2)
dtype: <class 'float'>
ms(u) 1.1893465800094418 (1000000,)
ms(u[::2]) 0.856014018994756 (500000,)
ms(u.reshape(u.shape[0]//2,2)) 1.4603546269936487 (500000, 2)
dtype: <class 'complex'>
mag_sqr_ufunc(u) 1.6713096910098102 (1000000,)
mag_sqr_ufunc(u[::2]) 1.600571528004366 (500000,)
mag_sqr_ufunc(u.reshape(u.shape[0]//2,2)) 1.6170859679987188 (500000, 2)
dtype: <class 'complex'>
ms(u) 1.6698526199907064 (1000000,)
ms(u[::2]) 1.2591366249980638 (500000,)
ms(u.reshape(u.shape[0]//2,2)) 1.8911341290076962 (500000, 2)

Tests are with 'float' and 'complex' arrays. 'mag_sqr_ufunc' is the pybind11 version; 'ms' uses nd::Array with boost::python.

The test prints the time, and also the resulting shape to confirm it is computing a result of the correct shape.

The biggest discrepancy is that ms(u[::2]) runs much faster than ms(u), but mag_sqr_ufunc(u[::2]) shows no speedup compared to mag_sqr_ufunc(u), even though it is computing only half as much.

@jagerman
Member

This is mainly an issue of the overhead of the non-trivial path (i.e. when the input is anything other than C-order contiguous). For example, with ::4 you will see roughly a 50% decrease in time versus ::2. I'll play around with it and see if I can improve performance.
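For intuition about that overhead, here is a small illustrative sketch (not code from this thread): the trivial path is a plain loop over a contiguous buffer that the compiler can auto-vectorize, while the non-trivial path must do byte-stride arithmetic per element, so a roughly fixed per-element cost dominates and halving the element count does not halve the time.

#include <cstddef>
#include <cstring>

// Trivial path: contiguous doubles; easily auto-vectorized.
void sq_contig(const double* in, double* out, std::ptrdiff_t n) {
    for (std::ptrdiff_t i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
}

// Non-trivial path: per-element byte strides, as a NumPy view like
// w[::2] requires. The bookkeeping usually defeats auto-vectorization,
// which is why ::2 is not twice as fast as the contiguous case.
void sq_strided(const char* in, char* out, std::ptrdiff_t n,
                std::ptrdiff_t in_stride, std::ptrdiff_t out_stride) {
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        double x;
        std::memcpy(&x, in + i * in_stride, sizeof x);
        x *= x;
        std::memcpy(out + i * out_stride, &x, sizeof x);
    }
}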

@jagerman
Member

I tried having a go at rewriting it, but I didn't improve things (at least, not with compiler optimizations turned on). For a simple test (below), compiled with -O2 -mavx, I see some pretty reasonable improvements:

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

double squared(double x) {
    return x*x;
}

PYBIND11_PLUGIN(numpy_fncs) {
    py::module m("numpy_fncs");

    m.def("sq", py::vectorize(squared));
    return m.ptr();
}

Test code:

from numpy_fncs import sq
from timeit import timeit
import numpy as np
wc = np.ones((500000, 10), dtype="float64", order="C")
wf = np.ones((500000, 10), dtype="float64", order="F")
for w in ["wc", "wf"]:
    for s in ["", "[::2]", "[::4]", "[::8]"]:
        run = "sq({}{})".format(w, s)
        print(run, end=": ", flush=True)
        print(timeit(run, "from __main__ import wc, wf; from numpy_fncs import sq", number=500))

Compiling with gcc using -Os -mavx gives me:

sq(wc): 9.02930161000404
sq(wc[::2]): 3.965398814994842
sq(wc[::4]): 2.263082914010738
sq(wc[::8]): 2.1717588890023762
sq(wf): 12.047965466001187
sq(wf[::2]): 4.055329613009235
sq(wf[::4]): 2.2905197710060747
sq(wf[::8]): 1.4324185820005368

showing a decent improvement/scaling.

@nbecker
Author

nbecker commented Mar 16, 2017

When you say "showing a decent improvement", which two cases are you comparing? With and without #730?

@jagerman
Member

They are both with #730. The wc case is in C storage order; wf is Fortran order. Without #730, everything except the first wc case would make a copy.
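As a side note, whether an input will take the trivial path can be checked from the C++ side via the array flags. A small sketch (describe is just a hypothetical helper name, not part of pybind11):

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <string>
namespace py = pybind11;

// Hypothetical helper: report which path an incoming array takes.
std::string describe(const py::array& a) {
    if (a.flags() & py::array::c_style)
        return "C-contiguous: trivial path";
    if (a.flags() & py::array::f_style)
        return "Fortran-contiguous: copied without #730";
    return "strided: copied without #730";
}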

@jagerman
Member

jagerman commented Mar 16, 2017

Under current master, where everything except wc copies into a new array, the same test gives:

sq(wc): 9.841329707000114
sq(wc[::2]): 7.97703917599938
sq(wc[::4]): 2.9569860170004176
sq(wc[::8]): 1.5508598509995863
sq(wf): 15.514638511999692
sq(wf[::2]): 7.989528872000847
sq(wf[::4]): 2.96592605800015
sq(wf[::8]): 2.037063783999656

Edit: almost the same test; I amended it to:

from numpy_fncs import sq
from timeit import timeit
for w in ["wc", "wf"]:
    for s in ["", "[::2]", "[::4]", "[::8]"]:
        run = "sq({}{})".format(w, s)
        print(run, end=": ", flush=True)
        print(timeit(run, """from numpy_fncs import sq
import numpy as np
wc = np.ones((500000, 10), dtype="float64", order="C")
wf = np.ones((500000, 10), dtype="float64", order="F")""",
            number=500))

@jagerman jagerman modified the milestones: v2.2, v2.1 Mar 18, 2017
@jagerman
Member

I think we'll merge #730 for the imminent 2.1 release because it gets us at least partway there, but I'll leave this open for now to consider further optimizations for 2.2.

@jagerman jagerman reopened this Mar 21, 2017
@jagerman
Member

I'm going to close this bug, at least for now, as I'm not sure what more we can do. If someone wants to take another stab at further optimizations, they are of course more than welcome.
