
optimization of strided array vectorize #671


Closed
nbecker opened this issue Feb 15, 2017 · 12 comments

@nbecker

nbecker commented Feb 15, 2017

I'm comparing the performance of py::vectorize with that of nd::Array under different test conditions. I find that py::vectorize is a factor of 2 slower than nd::Array on this test for a 1-d array with stride 2.

In the following test code, uncommenting the first block overloads the function for the 1-d non-contiguous case using nd::Array; leaving it commented out means the py::vectorize version is used. In the nd::Array case, the profile looks like:

     340  67.3%  67.3%      421  83.4% pybind11::cpp_function::initialize::{lambda#3}::_FUN [clone .lto_priv.280]
      81  16.0%  83.4%       81  16.0% __ieee754_atan2_avx
      36   7.1%  90.5%       36   7.1% __brk

While for py::vectorize:

     608  50.5%  50.5%      608  50.5% _aligned_strided_to_contig_size16
     167  13.9%  64.4%      170  14.1% pybind11::detail::vectorize_helper::run [clone .isra.536] [clone .lto_priv.257]
     123  10.2%  74.6%      123  10.2% __ieee754_atan2_avx
     116   9.6%  84.3%      279  23.2% pybind11::detail::vectorize_helper::run [clone .isra.533] [clone .lto_priv.258]
     104   8.6%  92.9%      104   8.6% __brk
  // m.def("rect_to_polar", [](nd::Array<complex_t,1,0> const& a) {
  //     nd::Vector<size_t,1> shape = a.getShape();
  //     nd::Array<double,1,1> r (shape[0]);
  //     nd::Array<double,1,1> theta (shape[0]);
  //     for (int i = 0; i < shape[0]; ++i) {
  //       r[i] = std::abs(a[i]);
  //       theta[i] = std::arg(a[i]);
  //     }
  //     return py::make_tuple (r, theta);
  //   });

  m.def("rect_to_polar", [](py::array a) {
      if (py::isinstance<py::array_t<complex_t>>(a))
        return py::make_tuple (py::vectorize([](complex_t x) { return std::abs(x); })(a),
                               py::vectorize([](complex_t x) { return std::arg(x); })(a)
                               );
      else
        throw py::type_error("rect_to_polar unhandled type");
    });

The Python test code:

import numpy as np
from timeit import timeit
w = np.ones(1000000, dtype=complex)
#print (timeit('rect_to_polar(w)', 'from __main__ import rect_to_polar, w', number=1000))
print (timeit('rect_to_polar(w[::2])', 'from numpy_fncs import rect_to_polar; from __main__ import w', number=1000))
#print (timeit('rect_to_polar(w.reshape(w.shape[0]//2, 2))', 'from numpy_fncs import rect_to_polar; from __main__ import w', number=1000))

run as:

[nbecker@nbecker2 pybind11_test]$ python3 -m yep -- test_numpy_fncs.py 
11.713223681988893
PROFILE: interrupts/evictions/bytes = 1203/109/41896
[nbecker@nbecker2 pybind11_test]$ pprof --text /usr/bin/python3 test_numpy_fncs.py.prof > test_numpy_fncs.test2.pyarray.profile
Using local file /usr/bin/python3.
Using local file test_numpy_fncs.py.prof.

Thanks for your help,
Neal

@dean0x7d
Member

The performance hit that you're seeing is due to py::vectorize creating a copy of the non-unit-stride array (it converts it to unit stride before executing the function). It should be possible to resolve this by taking non-unit strides into account natively, but I'm not too familiar with py::vectorize (I don't use it myself), so I'd rather not muck around in the code. But if anyone wants to take a look, I've simplified the test code somewhat (see below), which might help.
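To make the copy concrete, here is a minimal sketch (not code from this thread; it only uses the documented py::array_t flags): requesting c_style | forcecast triggers the same convert-to-contiguous step, so a non-unit-stride input like w[::2] is copied into a fresh buffer before the function body ever runs.

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;

// c_style | forcecast asks pybind11 for a C-contiguous buffer; a
// strided input is silently copied to satisfy the request, which is
// the extra pass visible in the profile above.
double sum_contig(py::array_t<double, py::array::c_style | py::array::forcecast> a) {
    auto r = a.unchecked<1>();  // direct 1-d access, no bounds checks
    double s = 0;
    for (py::ssize_t i = 0; i < r.shape(0); ++i)
        s += r(i);
    return s;
}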

@nbecker If you need a quick workaround, you could try using Eigen (example below) or maybe xtensor-python, which has a similar pyvectorize function and may not have the strides limitation.

#include <pybind11/pybind11.h>
#include <pybind11/eigen.h>
#include <complex>
namespace py = pybind11;
using complex_t = std::complex<double>;

PYBIND11_PLUGIN(cmake_example) {
    py::module m("cmake_example");

    m.def("rect_to_polar1", [](py::EigenDRef<Eigen::ArrayXcd const> a) {
        return py::make_tuple(abs(a), arg(a));
    });

    m.def("rect_to_polar2", [](py::array_t<complex_t> a) {
        return py::make_tuple(py::vectorize([](complex_t x) { return std::abs(x); })(a),
                              py::vectorize([](complex_t x) { return std::arg(x); })(a));
    });

    return m.ptr();
}
The Python test script:

from cmake_example import rect_to_polar1, rect_to_polar2
import numpy as np
from timeit import timeit

w = np.ones(1000000, dtype=complex)
w_half = np.ones(w.size // 2, dtype=complex)
n = 200

print("rect_to_polar1:")
print("  unit:", timeit(lambda: rect_to_polar1(w_half), number=n))
print("  non-unit:", timeit(lambda: rect_to_polar1(w[::2]), number=n))

print("rect_to_polar2:")
print("  unit:", timeit(lambda: rect_to_polar2(w_half), number=n))
print("  non-unit:", timeit(lambda: rect_to_polar2(w[::2]), number=n))

@jagerman jagerman added this to the v2.1 milestone Mar 14, 2017
@jagerman
Member

I'll look at this; I think it should be a fairly easy fix.

@jagerman jagerman self-assigned this Mar 14, 2017
@nbecker
Author

nbecker commented Mar 14, 2017

That would be great. I'd love to make use of py::array_t and vectorize, since they allow dimension-independent code, but right now the performance on some tests lags by a factor of 2 compared with explicit loops and dimension-dependent code (via nd::Array).

@jagerman
Member

@nbecker - can you give #730 a try and see if that resolves the performance hit? If so, I'll merge it for 2.1.

@nbecker
Author

nbecker commented Mar 15, 2017

I think it's improved, but not perfect. Here are some test results:

 python3 ./test_mag_sqr.py

dtype: <class 'float'>
mag_sqr_ufunc(u) 1.285100394001347 (1000000,)
mag_sqr_ufunc(u[::2]) 1.308188828988932 (500000,)
mag_sqr_ufunc(u.reshape(u.shape[0]//2,2)) 1.2440469749999465 (500000, 2)
dtype: <class 'float'>
ms(u) 1.1893465800094418 (1000000,)
ms(u[::2]) 0.856014018994756 (500000,)
ms(u.reshape(u.shape[0]//2,2)) 1.4603546269936487 (500000, 2)
dtype: <class 'complex'>
mag_sqr_ufunc(u) 1.6713096910098102 (1000000,)
mag_sqr_ufunc(u[::2]) 1.600571528004366 (500000,)
mag_sqr_ufunc(u.reshape(u.shape[0]//2,2)) 1.6170859679987188 (500000, 2)
dtype: <class 'complex'>
ms(u) 1.6698526199907064 (1000000,)
ms(u[::2]) 1.2591366249980638 (500000,)
ms(u.reshape(u.shape[0]//2,2)) 1.8911341290076962 (500000, 2)

Tests are with 'float' and 'complex' arrays. 'mag_sqr_ufunc' is the pybind11 version; 'ms' uses nd::Array with boost::python.

The test prints the time, and also the resulting shape to confirm it is computing a result of the correct shape.

The biggest discrepancy is that ms(u[::2]) runs much faster than ms(u), but mag_sqr_ufunc(u[::2]) shows no speedup compared to mag_sqr_ufunc(u), even though it is computing only half as much.

@jagerman
Member

This is mainly an issue of the overhead of the non-trivial path (i.e. when the input is anything other than C-order contiguous). For example, with ::4 you will see roughly a 50% decrease in time versus ::2. I'll play around with it and see if I can improve performance.
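For intuition about that overhead, here is a small illustrative sketch (not code from this thread): the trivial path is a plain loop over a contiguous buffer that the compiler can auto-vectorize, while the non-trivial path must do byte-stride arithmetic per element, so a roughly fixed per-element cost dominates and halving the element count does not halve the time.

#include <cstddef>
#include <cstring>

// Trivial path: contiguous doubles; easily auto-vectorized.
void sq_contig(const double* in, double* out, std::ptrdiff_t n) {
    for (std::ptrdiff_t i = 0; i < n; ++i)
        out[i] = in[i] * in[i];
}

// Non-trivial path: per-element byte strides, as a NumPy view like
// w[::2] requires. The bookkeeping usually defeats auto-vectorization,
// which is why ::2 is not twice as fast as the contiguous case.
void sq_strided(const char* in, char* out, std::ptrdiff_t n,
                std::ptrdiff_t in_stride, std::ptrdiff_t out_stride) {
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        double x;
        std::memcpy(&x, in + i * in_stride, sizeof x);
        x *= x;
        std::memcpy(out + i * out_stride, &x, sizeof x);
    }
}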

@jagerman
Member

I tried having a go at rewriting it, but I didn't improve things (at least, not with compiler optimizations turned on). For a simple test (below), compiled with -O2 -mavx, I see some pretty reasonable improvements:

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

double squared(double x) {
    return x*x;
}

PYBIND11_PLUGIN(numpy_fncs) {
    py::module m("numpy_fncs");

    m.def("sq", py::vectorize(squared));
    return m.ptr();
}

Test code:

from numpy_fncs import sq
from timeit import timeit
import numpy as np
wc = np.ones((500000, 10), dtype="float64", order="C")
wf = np.ones((500000, 10), dtype="float64", order="F")
for w in ["wc", "wf"]:
    for s in ["", "[::2]", "[::4]", "[::8]"]:
        run = "sq({}{})".format(w, s)
        print(run, end=": ", flush=True)
        print(timeit(run, "from __main__ import wc, wf; from numpy_fncs import sq", number=500))

Compiling with gcc using -Os -mavx gives me:

sq(wc): 9.02930161000404
sq(wc[::2]): 3.965398814994842
sq(wc[::4]): 2.263082914010738
sq(wc[::8]): 2.1717588890023762
sq(wf): 12.047965466001187
sq(wf[::2]): 4.055329613009235
sq(wf[::4]): 2.2905197710060747
sq(wf[::8]): 1.4324185820005368

showing a decent improvement/scaling.

@nbecker
Author

nbecker commented Mar 16, 2017

When you say "showing a decent improvement", which two cases are you comparing? With and without #730?

@jagerman
Member

They are both with #730. The wc case is in C storage order; wf is Fortran order. Without #730, everything except the first wc case would make a copy.
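As a side note, whether an input will take the trivial path can be checked from the C++ side via the array flags. A small sketch (describe is just a hypothetical helper name, not part of pybind11):

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <string>
namespace py = pybind11;

// Hypothetical helper: report which path an incoming array takes.
std::string describe(const py::array& a) {
    if (a.flags() & py::array::c_style)
        return "C-contiguous: trivial path";
    if (a.flags() & py::array::f_style)
        return "Fortran-contiguous: copied without #730";
    return "strided: copied without #730";
}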

@jagerman
Member

jagerman commented Mar 16, 2017

Under current master, where everything except wc copies into a new array, the same test gives:

sq(wc): 9.841329707000114
sq(wc[::2]): 7.97703917599938
sq(wc[::4]): 2.9569860170004176
sq(wc[::8]): 1.5508598509995863
sq(wf): 15.514638511999692
sq(wf[::2]): 7.989528872000847
sq(wf[::4]): 2.96592605800015
sq(wf[::8]): 2.037063783999656

Edit: almost the same test; I amended it to:

from numpy_fncs import sq
from timeit import timeit
for w in ["wc", "wf"]:
    for s in ["", "[::2]", "[::4]", "[::8]"]:
        run = "sq({}{})".format(w, s)
        print(run, end=": ", flush=True)
        print(timeit(run, """from numpy_fncs import sq
import numpy as np
wc = np.ones((500000, 10), dtype="float64", order="C")
wf = np.ones((500000, 10), dtype="float64", order="F")""",
            number=500))

@jagerman jagerman modified the milestones: v2.2, v2.1 Mar 18, 2017
@jagerman
Member

I think we'll merge #730 for the imminent 2.1 release because it gets us at least partway there, but I'll leave this open for now to consider further optimizations for 2.2.

@jagerman jagerman reopened this Mar 21, 2017
@jagerman
Member

I'm going to close this bug, at least for now, as I'm not sure what more we can do. If someone wants to take another stab at further optimizations, they are of course more than welcome.
