optimization of strided array vectorize #671
The performance hit that you're seeing is due to the overhead of py::vectorize's non-trivial code path, which is taken whenever the input is not C-order contiguous.

@nbecker If you need a quick workaround, you could try using Eigen (example below) or maybe xtensor-python, which has a similar interface:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/eigen.h>

#include <complex>

namespace py = pybind11;
using complex_t = std::complex<double>;

PYBIND11_PLUGIN(cmake_example) {
    py::module m("cmake_example");

    // Eigen-based version: operates on the whole array expression at once.
    m.def("rect_to_polar1", [](py::EigenDRef<Eigen::ArrayXcd const> a) {
        return py::make_tuple(abs(a), arg(a));
    });

    // py::vectorize-based version: applies the scalar functions elementwise.
    m.def("rect_to_polar2", [](py::array_t<complex_t> a) {
        return py::make_tuple(py::vectorize([](complex_t x) { return std::abs(x); })(a),
                              py::vectorize([](complex_t x) { return std::arg(x); })(a));
    });

    return m.ptr();
}
```

```python
from cmake_example import rect_to_polar1, rect_to_polar2
import numpy as np
from timeit import timeit

w = np.ones(1000000, dtype=complex)
w_half = np.ones(w.size // 2, dtype=complex)
n = 200

print("rect_to_polar1:")
print("      unit:", timeit(lambda: rect_to_polar1(w_half), number=n))
print("  non-unit:", timeit(lambda: rect_to_polar1(w[::2]), number=n))
print("rect_to_polar2:")
print("      unit:", timeit(lambda: rect_to_polar2(w_half), number=n))
print("  non-unit:", timeit(lambda: rect_to_polar2(w[::2]), number=n))
```
I'll look at this; I think it should be a fairly easy fix.
That would be great. I'd love to make use of py::array_t and vectorize, since they allow for dimension-independent code, but right now the performance on some tests lags by a factor of 2 compared with using explicit loops and dimension-dependent code (via nd::Array).
I think it's improved, but not perfect. Here are some test results:

The tests are with 'float' and 'complex' arrays. 'mag_sqr_ufunc' is the pybind11 version; 'ms' uses nd::Array with boost::python. The test prints the time, and also the resulting shape to confirm it's computing the correct result. The biggest discrepancy is that the ms[u::2] test runs much faster than ms[u], but the mag_sqr_ufunc[u::2] test shows no speedup compared to mag_sqr_ufunc[u], even though it's computing only half as much.
This is mainly an issue of the overhead of the non-trivial path (i.e. when the input is something other than C-order contiguous), e.g. if you use a strided slice such as w[::2].
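To illustrate the workaround this implies, here is a minimal Python-side sketch (sq is a hypothetical stand-in for any py::vectorize'd binding, implemented as a NumPy lambda so the snippet runs on its own): handing the binding a C-contiguous copy keeps the call on the trivial path, at the cost of the copy itself.

```python
# Sketch: force the trivial (fast) path by making the input C-contiguous first.
# `sq` is a hypothetical stand-in for a pybind11-vectorized extension function.
import numpy as np

sq = lambda x: x * x  # stand-in for the extension function

w = np.ones(1000000)
strided = w[::2]                        # non-contiguous view -> non-trivial path
contig = np.ascontiguousarray(strided)  # C-contiguous copy -> trivial path

print(strided.flags['C_CONTIGUOUS'])  # False
print(contig.flags['C_CONTIGUOUS'])   # True
res = sq(contig)
```

Whether the extra copy is a net win depends on the array size and how many times the copy is reused.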
I tried having a go at rewriting it, but I didn't improve things (at least, not when compiler optimization was turned on). For a simple test (below), compiled with optimization enabled:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

double squared(double x) {
    return x * x;
}

PYBIND11_PLUGIN(numpy_fncs) {
    py::module m("numpy_fncs");
    m.def("sq", py::vectorize(squared));
    return m.ptr();
}
```

Test code:

```python
from numpy_fncs import sq
from timeit import timeit
import numpy as np

wc = np.ones((500000, 10), dtype="float64", order="C")
wf = np.ones((500000, 10), dtype="float64", order="F")

for w in ["wc", "wf"]:
    for s in ["", "[::2]", "[::4]", "[::8]"]:
        run = "sq({}{})".format(w, s)
        print(run, end=": ", flush=True)
        print(timeit(run, "from __main__ import wc, wf; from numpy_fncs import sq",
                     number=500))
```

Compiling using gcc with optimizations enabled, the timings showed a decent improvement/scaling.
When you say "showing a decent improvement", what are the 2 cases you are comparing? With and without #730?
Under current master (where everything except the C-order contiguous case takes the non-trivial path), versus with #730 applied.

Edit: almost the same test; I amended it to:

```python
from numpy_fncs import sq
from timeit import timeit

for w in ["wc", "wf"]:
    for s in ["", "[::2]", "[::4]", "[::8]"]:
        run = "sq({}{})".format(w, s)
        print(run, end=": ", flush=True)
        print(timeit(run, """from numpy_fncs import sq
import numpy as np
wc = np.ones((500000, 10), dtype="float64", order="C")
wf = np.ones((500000, 10), dtype="float64", order="F")""",
                     number=500))
```
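For reference, a small check (not from the thread) of which of these inputs are actually C-order contiguous, and therefore which calls can take the trivial path:

```python
# Only the unsliced C-order array is C-contiguous; the Fortran-order array
# and every strided slice fall onto the non-trivial path.
import numpy as np

wc = np.ones((500000, 10), dtype="float64", order="C")
wf = np.ones((500000, 10), dtype="float64", order="F")

for name, a in [("wc", wc), ("wc[::2]", wc[::2]), ("wf", wf), ("wf[::2]", wf[::2])]:
    print(name, a.flags['C_CONTIGUOUS'])
# wc True, wc[::2] False, wf False, wf[::2] False
```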
I think we'll merge #730 for the imminent 2.1 release because it gets us at least partway there, but I'll leave this open for now to consider further optimizations for 2.2.
I'm going to close this bug, at least for now, as I'm not sure what more we can do. If someone wants to take another stab at further optimizations, they are of course more than welcome.
I'm comparing the performance of py::vectorize with that of nd::Array under different test conditions. I find that py::vectorize is a factor of 2 slower than nd::Array on this test, for a 1-d array with stride 2.
In the following test code, when the comments are removed, a function overload for the 1D non-contiguous case uses nd::Array; with the comments left in, py::vectorize is used. In the nd::Array case, the profile looks like:
While for py::vectorize:
The Python test code:
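A minimal sketch of what this benchmark could look like (the name mag_sqr_ufunc comes from the comments above; the NumPy stand-in and the array size are assumptions, used so the snippet is self-contained):

```python
# Sketch: time |x|^2 on a contiguous 1-d array vs. a stride-2 slice.
# mag_sqr_ufunc would normally be the pybind11 py::vectorize'd extension function.
import numpy as np
from timeit import timeit

mag_sqr_ufunc = lambda x: (x * x.conj()).real  # stand-in for the extension function

u = np.ones(1000000, dtype=complex)
n = 200
print("contiguous:", timeit(lambda: mag_sqr_ufunc(u), number=n))
print("stride 2:  ", timeit(lambda: mag_sqr_ufunc(u[::2]), number=n))
```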
run as:
Thanks for your help,
Neal