Skip to content

Unnecessary copy when indexing to obtain a 0d array #2622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
danielwe opened this issue Dec 19, 2018 · 4 comments
Closed

Unnecessary copy when indexing to obtain a 0d array #2622

danielwe opened this issue Dec 19, 2018 · 4 comments

Comments

@danielwe
Copy link
Contributor

Code Sample

>>> import numpy as np
>>> import xarray as xr
>>> da = xr.DataArray(np.arange(3))
>>> da
<xarray.DataArray (dim_0: 3)>
array([0, 1, 2])
Dimensions without coordinates: dim_0
>>> da[0].values.fill(99)
>>> da
<xarray.DataArray (dim_0: 3)>
array([0, 1, 2])
Dimensions without coordinates: dim_0

Problem description

Indexing into xarray objects creates a view of the underlying data if possible. A surprising exception is when all dimensions are indexed out and the resulting object is 0d. Xarray insists on returning a 0d array rather than a scalar, which suggests (at least to me) that this is also a view whenever possible; however, it is always a copy, and modifying it will never affect the original array.

(The example above is a little contrived, since one could always call da[0] = 99. In my actual use case I am indexing into a Dataset in a way that creates views for all variables except the one that happens to collapse to 0d, and thus I'm unable to use the indexed Dataset to modify that variable in the original Dataset.)

The copy happens because, internally, the 0d array is created by retrieving a scalar from the underlying numpy array and then wrapping a new array around it. However, in numpy a 0d view can be created directly by indexing with Ellipsis/..., as follows:

>>> import numpy as np
>>> arr = np.arange(3)
>>> arr[0, ...]
array(0)

Thus, a fix that solves my immediate issues and passes all current tests is to modify the following method:

xarray/xarray/core/indexing.py

Lines 1154 to 1163 in 778ffc4

def _indexing_array_and_key(self, key):
if isinstance(key, OuterIndexer):
array = self.array
key = _outer_to_numpy_indexer(key, self.array.shape)
elif isinstance(key, VectorizedIndexer):
array = nputils.NumpyVIndexAdapter(self.array)
key = key.tuple
elif isinstance(key, BasicIndexer):
array = self.array
key = key.tuple

to always append an ellipsis for basic and outer indexing:

    def _indexing_array_and_key(self, key):
        if isinstance(key, OuterIndexer):
            array = self.array
>           key = _outer_to_numpy_indexer(key, self.array.shape) + (Ellipsis,)
        elif isinstance(key, VectorizedIndexer):
            array = nputils.NumpyVIndexAdapter(self.array)
            key = key.tuple
        elif isinstance(key, BasicIndexer):
            array = self.array
>           key = key.tuple + (Ellipsis,)

I'm not familiar enough with all the indexing variants in xarray to know if this covers all cases of 0d arrays that are currently copies but could be views. If someone wants to share some insight (e.g., some more advanced test cases), I could try and put together a pull request.

Expected Output

>>> da[0].values.fill(99)
>>> da
<xarray.DataArray (dim_0: 3)>
array([99, 1, 2])
Dimensions without coordinates: dim_0

Output of xr.show_versions()

/home/daniel/local/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. from ._conv import register_converters as _register_converters

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-42-lowlatency
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.11.0
pandas: 0.23.0
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.2
h5py: 2.7.1
Nio: None
zarr: None
cftime: 1.0.0b1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: None
seaborn: 0.8.1
setuptools: 39.1.0
pip: 10.0.1
conda: 4.5.12
pytest: 3.5.1
IPython: 6.4.0
sphinx: 1.7.4

@fujiisoup
Copy link
Member

It sounds a reasonable improvement, as far as I think of.

Is any possible use case in your mind that will be broken if we return a view for 0 dimensional array rather than a copy?

I'm not familiar enough with all the indexing variants in xarray to know if this covers all cases of 0d arrays that are currently copies but could be views.

I think it would be sufficient to change the basic indexing part.

@danielwe
Copy link
Contributor Author

Is any possible use case in your mind that will be broken if we return a view for 0 dimensional array rather than a copy?

Not that I can think of. The chance of accidentally modifying data due to the change is very slim; the change will only affect statements that convey an explicit intent to do an in-place modification, but currently fail to do so. For example, say you have a dataset ds with a variable "foo" that becomes 0d when using a basic index index; then, the following statements will work the same as before:

  • ds[index]["foo"] = 99 will leave the original ds["foo"] unchanged (it only replaces "foo" in the temporary ds[index]),
  • ds["foo"][index] = 99 will modify ds["foo"] in-place.

The change affects statements such as the following:

  • ds[index]["foo"][...] = 99,
  • ds[index]["foo"].values[...] = 99,
  • ds[index]["foo"].values.fill(99),

These statements currently modify ds["foo"] in all cases except when ds[index]["foo"] is 0d. After the change they will succeed also in the 0d case.

However, keep in mind that I'm not a seasoned xarray user. Perhaps I'm overlooking some subtleties.

I think it would be sufficient to change the basic indexing part.

You're right, I just realized outer indexing never returns a view after all, and vectorized indexing certainly doesn't. My comment was mainly about the different array wrappers for lazy indexing, copy on write, etc., but upon closer inspection of the code, it looks like modifying the NumpyIndexingAdapter class is sufficient to change the behavior in all the relevant cases.

@shoyer
Copy link
Member

shoyer commented Dec 20, 2018

upon closer inspection of the code, it looks like modifying the NumpyIndexingAdapter class is sufficient to change the behavior in all the relevant cases.

+1 for modifying NumpyIndexingAdapter to fix this. It would absolutely be cleaner to always return NumPy arrays directly from indexing rather than rewrapping scalars. I actually didn't know that this "append ellipsis" trick worked, but I see now that's documented in the detailed notes in numpy's indexing docs.

@fujiisoup
Copy link
Member

@danielwe, do you mind to send a fix of it?
Let's just leave the other lazy indexing classes later for the future improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants