Skip to content

Make access unsafe #599

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

SylvainCorlay
Copy link
Member

This is the removal of the bound checks I am proposing in #584 .

@dean0x7d
Copy link
Member

dean0x7d commented Jan 9, 2017

Why not use byte_offset_unsafe?

As far as I know, py::array was never meant to be a high-performance container, but just a convenient C++ interface for numpy.ndarray. When performance is required there are two options:

  1. Take the raw data, dims, strides pointers from py::array and give it to a map/view of a proper high-performance container.

  2. Access the data using iterators instead of indexing. This isn't very developed in py::array but it seems to be direction that xtensor-python is going in.

I think the main issue with making the default py::array unsafe is that it's surprising behavior compared to its Python counterpart. py::array isn't a C++ std container. It's not even a real container. It's a wrapper around an existing Python container which has it's own established behavior. By making the element access unsafe, it goes from a reliable exception in Python to undefined behavior in C++.

Keeping the Python behavior makes it consistent and easy to explain: "It looks like a numpy.ndarray and behaves like a numpy.ndarray". If py::array starts deviating from that, then there are additional caveats which need to be explained. This makes for a frustrating learning curve and hidden gotchas.

@SylvainCorlay
Copy link
Member Author

(From the original issue)

I agree that when exposing a feature to Python, it must be written in such a way that regardless of the values passed to it, it cannot crash the interpreter, but these accessors are not directly exposed to the Python users. Instead, I think that one of the main reasons people would right c++ extensions with pybind is because they want speed, and they might be disappointed with this behavior. #500 already improved the situation, but as we discussed there, I would prefer to completely remove the bound checking.

@wjakob
Copy link
Member

wjakob commented Jan 9, 2017

An alternative proposal would be to have two functions of each kind: data(...), mutable_data(...) , etc: checks only done in DEBUG mode. data_safe(...), mutable_data_safe(...) , etc: checks always done.

@wjakob
Copy link
Member

wjakob commented Jan 9, 2017

cc @aldanor

@SylvainCorlay
Copy link
Member Author

SylvainCorlay commented Jan 9, 2017

Safe vs unsafe access could be determined with by tag dispatching through a policy.

@wjakob
Copy link
Member

wjakob commented Jan 9, 2017

Can you elaborate what this would look like?

@SylvainCorlay
Copy link
Member Author

SylvainCorlay commented Jan 9, 2017

Arrays would have an additional template parameter whose type determines whether things are checked.

https://en.wikipedia.org/wiki/Policy-based_design

@dean0x7d
Copy link
Member

dean0x7d commented Jan 9, 2017

But the main argument for unsafe access is that people using pybind (and not pure python) use it because they want speed.

Sure, but even with unsafe access for (int i = 0; i < size; ++i) loops aren't really guaranteed to be faster than what numpy already does. Using proper high-performance C++ libraries/containers with views to array.data/.shape/.strides is almost always going to give better results than accessing element by element.

My main worry is that py::array ends up being somewhere in limbo: neither consistent with numpy nor truly high performance (since expression templates and SIMD intrinsics are out of scope of pybind11).

But I may have misinterpreted the purpose of py::array. It's not py::ndarray, so perhaps it doesn't need to follow numpy.ndarray conventions?

@SylvainCorlay
Copy link
Member Author

Sure, but even with unsafe access for (int i = 0; i < size; ++i) loops aren't really guaranteed to be faster than what numpy already does. Using proper high-performance C++ libraries/containers with views to array.data/.shape/.strides is almost always going to give better results than accessing element by element.

Sure, but looping is still something that people will want to do reasonably fast. Checking attributes at every iteration is a good way to kill performance.

@dean0x7d
Copy link
Member

Sure, but looping is still something that people will want to do reasonably fast. Checking attributes at every iteration is a good way to kill performance.

Perhaps I'm looking at this too much from the C++ side, where we have the performance and want to extend to Python for the flexibility. So I'm mostly looking at py::array as means to get some information about a Python array or package up some data to send to Python. On the other hand, people starting out in Python are extending to C++ for the performance and looking to py::array as the closest tool. So yeah, it being slow would be a disappointment from that point of view.

In that case, the interface should also get an overhaul. Currently, the main way to access elements is .at() and .mutable_at(). We expect from std::vector that .at() is checked, so it would be better to introduce an operator() overload for indexing which would also combine the interface for reading and writing. The writeable check should then also be removed. Both functions also have dimension checks which should also be removed. This PR currently only removes the bounds check in byte_offset() which is an internal function and only part of the bottleneck for a full .at() / .mutable_at() access.

So I guess my main point is: if it's going to be unsafe, make it consistently unsafe so that there's no false sense of security. That's also required for full c-array-level access speed. (Although I'm still not convinced that unsafe-by-default should be the purpose of py::array.)

@wjakob
Copy link
Member

wjakob commented Jan 13, 2017

@dean0x7d: In that case, what do you propose for the read-only version of operator()?

@SylvainCorlay
Copy link
Member Author

SylvainCorlay commented Jan 13, 2017

Regarding the writable check, what exactly is the semantic of it in numpy?

@aldanor
Copy link
Member

aldanor commented Jan 13, 2017

Regarding the writable check, what exactly is the semantic of it in numpy?

>>> import numpy as np
>>> a = np.array([0])
>>> a.flags.writeable = False
>>> a[0] = 1
ValueError: assignment destination is read-only

@SylvainCorlay
Copy link
Member Author

Yeah since it is determined at runtime, this cannot really play well with the const semantics of c++

@aldanor
Copy link
Member

aldanor commented Jan 13, 2017

Sure, but looping is still something that people will want to do reasonably fast. Checking attributes at every iteration is a good way to kill performance.

But computing byte offset from scratch on every iteration is not optimal regardless. If you have a tight loop, the chances are you'll fetch the data pointer once anyway and then increment the strides manually in the loops to avoid needless multiplications and stride fetching? NumPy C API has some macros (PyArray_ITER_*) to simplify iteration, maybe instead of removing bound checks we could provide iteration primitives in the same fashion?

@SylvainCorlay
Copy link
Member Author

SylvainCorlay commented Jan 13, 2017

Sure, but looping is still something that people will want to do reasonably fast. Checking attributes at every iteration is a good way to kill performance.

But computing byte offset from scratch on every iteration is not optimal regardless. If you have a tight loop, the chances are you'll fetch a pointer once anyway and then increment the strides manually in the loops to avoid needless multiplications and stride fetching? NumPy C API has some macros (PyArray_ITER_*) to simplify iteration, maybe instead of removing bound checks we could provide iteration primitives?

I agree, xtensor has efficient iterator pairs that allow people to iterate more efficiently on arrays, but we should expect people to write quick-&-easy things with loops over indices.

@aldanor
Copy link
Member

aldanor commented Jan 13, 2017

Just a random thought, another option: alongside the existing methods, add operator[](...) that would be unchecked:

arr[{i, j, k}]

(that is, if it's possible to do this so it's inlined to the same thing as at() without the bound check)

@SylvainCorlay
Copy link
Member Author

In xtensor, we have different semantics for () and [].

() takes a compile-time determined number of arguments
[] takes a single xindex argument which may have a length that is computed at runtime.

@wjakob
Copy link
Member

wjakob commented Jan 13, 2017

Apologies in advance for the following rant: I think that repeatedly bringing xtensor into the discussions here at pybind11 is not conductive. If there is some clever technical insight that can be ported to this project, then that's all fine with me. However, there is really no reason why the API/style of this project should somehow be bound to that of xtensor (rather than NumPy, which we're trying to wrap)

@SylvainCorlay
Copy link
Member Author

SylvainCorlay commented Jan 13, 2017

Apologies in advance for the following rant: I think that repeatedly bringing xtensor into the discussions here at pybind11 is not conductive. If there is some clever technical insight that can be ported to this project, then that's all fine with me. However, there is really no reason why the API/style of this project should somehow be bound to that of xtensor (rather than NumPy, which we're trying to wrap)

I apologize if you found that mentioning xtensor in the discussions on numpy bindings was not appropriate. I thought that bringing this perspective in the discussion would help since we are working on numpy-style semantics in c++.

I deleted the comment where I detail the iterators that we use which was not so much related to the discussion.

@JohanMabille
Copy link
Contributor

JohanMabille commented Jan 19, 2017

Actually, I wonder if someone would mix checked and unchecked access in their code. If that's not the case, maybe a policy-based design as mentioned by @SylvainCorlay could help. The default behavior would be the actual one, so it would remain consistent with the one of Python, and would avoid a frustrating learning curve as @dean0x7d pointed it out. And people that need pure performance could have the behavior they want with a simple template parameter change.

Besides, that would keep the interface simple and intuitive, and would avoid a counter intuitive change in py::array behavior.

Would that be an acceptable change for you guys ? If so, I can start working on it.

@dean0x7d
Copy link
Member

dean0x7d commented Jan 19, 2017

I think policy-based would be a nice way to go. Using operator() would also make Python -> C++ pretty uniformly [] -> (). E.g. accessing .shape is also [] -> ():

Python

n = a.shape[1]
x = a[3, 3]

C++

auto n = a.shape(1)
auto x = a(3, 3)

operator[] with some special indexing objects sounds good too, but operator[](int) should explicitly be a compile-time error (with a nice message if practical) to avoid the headscratcher which is a[1, 2, 3] in C and C++. I assume this would be a common (silent) error when adapting code from Python to C++.

Regarding operator(), the unsafe version can simply return a reference: T& array_t<T>:: operator()(...). The safe version would need to use a proxy type to distinguish reading and writing for the mutability check: Proxy<T> array_t<T>::operator()(...). But the resulting syntax would be the same.

@SylvainCorlay
Copy link
Member Author

SylvainCorlay commented Jan 19, 2017

The safe version would need to use a proxy type to distinguish reading and writing for the mutability check: Proxy array_t::operator()(...). But the resulting syntax would be the same.

I like the idea of the safe version check for the mutability for the writable flag.

@JohanMabille JohanMabille mentioned this pull request Jan 25, 2017
@wjakob
Copy link
Member

wjakob commented Feb 26, 2017

I'm wondering what the status is with regards to this? It looks like there was a useful discussion about a potential API, but that never quite materialized in a concrete PR. Is there still a plan to implement a form of unchecked access?

@dean0x7d
Copy link
Member

I think the main discussion has moved to #617 where there are 2 active proposals. It would be best to continue over there and close this one.

@dean0x7d dean0x7d closed this Feb 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants