
Advanced indexing #172


Merged

alimanfoo merged 68 commits from advanced-indexing-20171028 into master on Nov 13, 2017

Conversation

alimanfoo
Member

@alimanfoo alimanfoo commented Oct 31, 2017

This PR adds support for indexing Zarr arrays with Boolean or integer arrays. Resolves #78. Also adds support for selecting fields from structured arrays (resolves #112). Also resolves #89, resolves #93.

TODO:

  • Improve error messages.
  • Docstrings.
  • API Docs.
  • Add section on advanced indexing to tutorial.
  • Increase test coverage.
  • Fix failing PY36 doctest.

@alimanfoo
Member Author

alimanfoo commented Oct 31, 2017

Some examples of usage, and some performance benchmarking, are in this notebook.

Regarding the API, I have added support for orthogonal indexing (a.k.a. outer indexing) via __getitem__ and __setitem__. N.B., if there is more than one bool or int array used in the indexing selection, then this differs from numpy fancy indexing. There are at least two possible options here:

(1) Keep this as-is, i.e., implement orthogonal indexing via __getitem__ and __setitem__. Pros: API and code simplicity. Cons: different behaviour from numpy for some operations.

(2) Keep behaviour of __getitem__ and __setitem__ consistent with numpy fancy indexing by disallowing more than one indexing array. Add a different accessor that supports full orthogonal indexing (e.g., called 'iloc' following pandas naming, or 'oindex' following naming proposed for a new numpy orthogonal indexing accessor). Pros: consistency with numpy; leaves open possibility for implementing more complete fancy indexing support in future (although I don't think I'll ever have the brain-power to do that). Cons: more complex API and code.
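
To make the difference concrete, here is a minimal NumPy sketch (not part of the PR) contrasting the two behaviours when two integer arrays are used:

import numpy as np

a = np.arange(12).reshape(3, 4)

# numpy fancy indexing pairs the two arrays element-wise,
# selecting the points (0, 1) and (2, 3):
a[[0, 2], [1, 3]]           # -> array([ 1, 11])

# orthogonal (outer) indexing takes the cross-product of rows [0, 2]
# and columns [1, 3], which numpy expresses via np.ix_:
a[np.ix_([0, 2], [1, 3])]   # -> array([[ 1,  3],
                            #           [ 9, 11]])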

Regarding performance, results from some simple benchmarks look quite promising. Performance obviously depends on how many items are being selected, i.e., how dense or sparse the selection is. For relatively dense selections (~50% of items), indexing with a boolean array is within a factor of 2 of the speed of the same operation on a plain numpy array, which seems decent given that zarr has to do the extra work of managing and decompressing chunks. For relatively sparse selections (~0.01% of items) we are about 10 times slower than numpy, but almost all the time is spent in Array._decode_chunk, which is where decompression happens, so I think this proves the overhead from processing the array selection is minimal compared with the time required for decompressing chunks, even when using a very fast compressor (Blosc with LZ4, multithreaded).

I also did a quick performance comparison with h5py, which isn't really fair as h5py was using a slower compressor (gzip level 1). However, FWIW, with the sparse boolean array zarr is ~4X faster than h5py, and with the dense boolean array h5py's performance is pathological, taking longer than 1 minute to complete, so zarr wins big there, taking <1 second.

Comments on API and implementation very welcome. cc @shoyer, @mrocklin, @jakirkham, @FrancescAlted.

@shoyer
Contributor

shoyer commented Oct 31, 2017

Very cool to see this!

Keep behaviour of __getitem__ and __setitem__ consistent with numpy fancy indexing by disallowing more than one indexing array.

Watch out: NumPy considers even scalars to be indexing arrays:

In [15]: x = np.zeros((1, 2, 3))

In [16]: x[0, :, [0, 1, 2]].shape
Out[16]: (3, 2)

(This is my favorite NumPy indexing edge case.)

I don't really have an opinion here on (1) vs (2), as long as it is clearly documented and you don't try to do both outer/orthogonal and vectorized/broadcasting indexing in the same API. NetCDF4-Python only does outer indexing and that works fine for it. I would be just as happy to use a special .vindex[] indexer for vectorized indexing if anyone ever bothers to add it.

leaves open possibility for implementing more complete fancy indexing support in future (although I don't think I'll ever have the brain-power to do that)

This might actually be easier than you think. @mrocklin wrote a version of this for dask that might be a good reference point:
https://github.com/dask/dask/blob/7113a3c9bf335f2fe58989760af7b671d940e92f/dask/array/core.py#L3024

@FrancescAlted

Good work! Maybe I'm looking at the benchmarks incorrectly, but I only see zarr being 4x faster (not 10x) than h5py:

%time zc[ix_sparse_bool]
CPU times: user 472 ms, sys: 88 ms, total: 560 ms
Wall time: 262 ms

vs

%time hc[ix_sparse_bool]
CPU times: user 1.1 s, sys: 0 ns, total: 1.1 s
Wall time: 1.1 s

For what it's worth, I think zarr might benefit from the forthcoming introduction of dictionary support for zstd inside Blosc2. The nice thing about dictionaries is that you can make your data blocks ridiculously small (apparently as small as 1 KB), but still get good compression ratios and, more importantly, very fast decompression speed. This should reduce the latency quite a bit when you have to decompress a whole block to get just 1 (or a few) values out of it.

@alimanfoo
Member Author

alimanfoo commented Oct 31, 2017 via email

@alimanfoo
Member Author

Good work! Maybe I'm looking at the benchmarks incorrectly, but I only see zarr being 4x faster (not 10x) than h5py

Sorry, yes, my mistake, 4X faster.

For what it's worth, I think zarr might benefit from the forthcoming introduction of dictionary support for zstd inside Blosc2. The nice thing about dictionaries is that you can make your data blocks ridiculously small (apparently as small as 1 KB), but still get good compression ratios and, more importantly, very fast decompression speed. This should reduce the latency quite a bit when you have to decompress a whole block to get just 1 (or a few) values out of it.

Very interesting, thanks!

@alimanfoo
Member Author

@mrocklin regarding API, how would/should this play with da.from_array(fancy=True/False)? If fancy=True, what does dask assume about the API?

@mrocklin
Contributor

I believe that setting fancy=True means that the underlying data store supports fancy indexing (which, as I understand it, zarr now does), and so dask.array should feel comfortable sending complex slicing arguments down to the underlying store. I think we had to implement this because h5py didn't support some things that numpy did.

It has been a while since then though, so I may be misremembering things. The relevant docstring is here:

    fancy : bool, optional
        If ``x`` doesn't support fancy indexing (e.g. indexing with lists or
        arrays) then set to False. Default is True.

It sounds like you do support these things, so presumably people loading dask arrays from zarr arrays should set fancy=True, which is the default.

@alimanfoo
Member Author

Thanks @mrocklin. Currently in this PR zarr does not implement fancy indexing the same as numpy, but rather implements orthogonal indexing. So I was concerned dask may get unexpected results if fancy=True assumes numpy fancy indexing, depending on what indexes are passed through.

Actually, just looking at @shoyer's favourite edge case, it looks like dask __getitem__ behaviour does something different from numpy fancy indexing anyway, before even worrying about zarr interaction. E.g.:

In [17]: x = np.arange(6).reshape(1, 2, 3)

In [18]: d = da.from_array(x, chunks=(1, 2, 3))

In [19]: x[0, :, [0, 1, 2]]
Out[19]: 
array([[0, 3],
       [1, 4],
       [2, 5]])

In [20]: d[0, :, [0, 1, 2]].compute()
Out[20]: 
array([[0, 1, 2],
       [3, 4, 5]])

So I guess there are a couple of separate questions:

(a) It looks like dask.array __getitem__ behaviour currently does not match np.ndarray __getitem__ behaviour for some edge cases, should it?

(b) What exactly does fancy=True mean in terms of expected behaviour of __getitem__ on the wrapped array? I.e., if dask.from_array is given fancy=True and the wrapped array implements orthogonal indexing via __getitem__, could dask ever pass through a combination of indexes that would produce different results for numpy fancy indexing versus orthogonal indexing (e.g., two 1D integer arrays, or @shoyer's edge case)?

@alimanfoo
Member Author

cc @benjeffery

@alimanfoo
Member Author

I think I prefer option (2): allow only slices and/or ints in __getitem__/__setitem__; implement orthogonal indexing via .oindex[] in this PR; maybe implement point selection via .vindex[] in future PR. Seems like a safer thing to do, less potential for confusion and bugs. When using zarr with dask, can stick with fancy=False for now, and down the line figure out if worth finding a way that dask could make use of .oindex[] and/or .vindex[] if available on the wrapped array.
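
For illustration, a minimal sketch of that conservative wiring (not from the PR, and assuming the da.from_array docstring quoted above):

import numpy as np
import dask.array as da
import zarr

z = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8')

# while zarr's __getitem__ only supports basic selections, tell dask
# not to push list/array indexes down to the underlying store
d = da.from_array(z, chunks=z.chunks, fancy=False)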

@shoyer
Contributor

shoyer commented Oct 31, 2017

Just to note, the way we currently disambiguate vectorized/orthogonal indexing internally in xarray is that we use dedicated classes to store each type of indexer:
https://github.com/pydata/xarray/blob/17956ea5de2cf5029992e8f83460fcc878e3d024/xarray/core/indexing.py#L280-L303

This way, indexing can go through the same code paths but still dispatch to appropriate backend specific methods (e.g., dask vs numpy vs netCDF4 vs zarr).
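
A hypothetical sketch of that pattern as it might apply here (class and function names are illustrative, not xarray's actual internals):

class OuterIndexer:
    """Marks a selection to be interpreted orthogonally."""
    def __init__(self, *key):
        self.key = key

class VectorizedIndexer:
    """Marks a selection to be interpreted as point-wise coordinates."""
    def __init__(self, *key):
        self.key = key

def select(zarr_array, indexer):
    # dispatch on the indexer class rather than guessing semantics
    if isinstance(indexer, OuterIndexer):
        return zarr_array.oindex[indexer.key]
    elif isinstance(indexer, VectorizedIndexer):
        return zarr_array.vindex[indexer.key]
    raise TypeError('unsupported indexer: %r' % (indexer,))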

@alimanfoo alimanfoo added this to the v2.2 milestone Oct 31, 2017
@mrocklin
Contributor

mrocklin commented Nov 1, 2017

Actually, just looking at @shoyer's favourite edge case, it looks like dask __getitem__ behaviour does something different from numpy fancy indexing anyway, before even worrying about zarr interaction. E.g.:

Yes, I think that this came up when we were hammering out slicing. I think that it was intentionally decided to deviate from NumPy's behavior. I wouldn't be surprised if @shoyer was the one to make this call actually. My memory here is a bit hazy.

@shoyer
Contributor

shoyer commented Nov 1, 2017 via email

@mrocklin
Contributor

mrocklin commented Nov 1, 2017 via email

@alimanfoo
Member Author

alimanfoo commented Nov 1, 2017 via email

@alimanfoo
Member Author

A better name for "get_point_selection_bool" could be "get_mask_selection", and "get_point_selection_int" might be better named "get_coordinate_selection".

@alimanfoo
Member Author

alimanfoo commented Nov 6, 2017

I've pushed some new work on this, here's a synopsis.

Vectorized (inner) indexing

I've added support for vectorized indexing using coordinate arrays (a.k.a. point selection), which actually wasn't too hard to do. This functionality is available via the get/set_coordinate_selection() methods and also, for convenience, via .vindex[].

Vectorized indexing using a Boolean mask array is also supported, via the get/set_mask_selection() methods and .vindex[]. This just calls np.nonzero() internally to construct coordinate arrays, and the rest is done via coordinate selection.

More complicated vectorized indexing scenarios, e.g., mixing coordinate or mask arrays with slices, are currently not supported.

The indexing coordinates do not have to be sorted in any particular order. Zarr shuffles the coordinates so they are grouped by their corresponding chunk, so that each chunk is processed once only.
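
A small usage sketch of the coordinate and mask selection API described above (output values assume the array contents shown):

import numpy as np
import zarr

z = zarr.zeros((100, 100), chunks=(10, 10), dtype='i8')
z[:] = np.arange(100 * 100).reshape(100, 100)

# coordinate (point) selection: picks the items at (0, 2) and (5, 7)
z.get_coordinate_selection(([0, 5], [2, 7]))  # -> array([  2, 507])
z.vindex[[0, 5], [2, 7]]                      # same, via the accessor

# mask selection: a Boolean array with the same shape as the array
mask = np.zeros(z.shape, dtype=bool)
mask[0, 2] = mask[5, 7] = True
z.vindex[mask]                                # -> array([  2, 507])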

Orthogonal (outer) indexing

Orthogonal indexing is supported via get/set_orthogonal_selection() and for convenience via .oindex[]. Any mix of int, slice with step >= 1, 1D int array and 1D bool array is supported.

Integer arrays do not need to be sorted. Zarr shuffles the index values so they are grouped by their corresponding chunk, so that each chunk is processed once only.
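
A corresponding sketch for orthogonal selection (continuing with the z array from the previous sketch):

# rows [0, 5] crossed with columns [2, 7] -> shape (2, 2)
z.get_orthogonal_selection(([0, 5], [2, 7]))

# mix a 1D int array with a slice via the accessor -> shape (2, 6)
z.oindex[[0, 5], 2:8]

# 1D Boolean arrays per dimension also work
rows = np.zeros(100, dtype=bool)
rows[[0, 5]] = True
z.oindex[rows, [2, 7]]                        # -> shape (2, 2)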

Slice with step > 1

Slice with step > 1 is now supported in __getitem__ for 1D arrays and in .oindex[] for multi-dimensional arrays. Internally the slice is converted to an int array via np.arange and then processed via the orthogonal selection machinery.
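
For example (again a sketch, continuing from the arrays above):

z1 = zarr.array(np.arange(100), chunks=10)
z1[10:90:7]              # step > 1 via __getitem__ on a 1D array
z.oindex[0:100:10, ::5]  # and via .oindex on a multi-dimensional array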

Open questions

What functionality should be available via __getitem__?

For 1D arrays there is no ambiguity on how to process advanced selections, i.e., there is no difference between vectorized versus orthogonal indexing. So for convenience, if a Zarr array is 1D, currently __getitem__ supports any of int, slice with step >= 1, 1D int array, 1D bool array.

For multi-dimensional arrays things are more complex, and so currently I restrict __getitem__ to only handle basic selections, i.e., any combination of int and slice with step == 1. To do advanced indexing the user must use one of .oindex[] or .vindex[] or the corresponding selection methods.

If anyone feels this isn't a good way to go, happy to discuss.

Benchmarks

Performance seems reasonable in all cases, can't see any obvious ways to improve. More examples and some benchmarking data are in this notebook.

Contributor

@shoyer shoyer left a comment

This is looking very nice! Fancy indexing support might give zarr a decisive edge over HDF5 :).

zarr/core.py Outdated

        elif len(self._shape) == 1:
            # safe to do "fancy" indexing, no ambiguity
            return self.get_orthogonal_selection(selection)

Contributor

You can do vectorized indexing on 1D arrays, too, e.g.,

In [22]: a = np.arange(4)

In [23]: a[a.reshape(2, 2)]
Out[23]:
array([[0, 1],
       [2, 3]])

More generally, I agree that it's unambiguous for 1D, but given the focus of zarr on N-dimensions I would be reluctant to add this shortcut. The special case feels like more trouble than it's worth.


Member Author

Thanks, yep, I think you're probably right. I've added support for vectorized indexing with multi-dimensional coordinate arrays, but have limited __getitem__ to basic selections only.

zarr/core.py Outdated
        if isinstance(out, np.ndarray) and \
                not self._filters and \
                ((self._order == 'C' and dest.flags.c_contiguous) or
                 (self._order == 'F' and dest.flags.f_contiguous)):

Contributor

Note: PEP 8 suggests using extra parentheses rather than explicit \ for line continuation. I think it looks a little cleaner, too.

zarr/indexing.py Outdated


def is_integer(x):
    return isinstance(x, numbers.Integral)

Contributor

Make sure this catches numpy's signed and unsigned integer types -- missing those has led to issues in dask and xarray.


Member Author

I've checked this, looks OK:

In [5]: for t in int, np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64:
   ...:     print(t, isinstance(t(42), numbers.Integral))
   ...:     
<class 'int'> True
<class 'numpy.int8'> True
<class 'numpy.int16'> True
<class 'numpy.int32'> True
<class 'numpy.int64'> True
<class 'numpy.uint8'> True
<class 'numpy.uint16'> True
<class 'numpy.uint32'> True
<class 'numpy.uint64'> True

zarr/indexing.py Outdated


def slice_to_range(s):
    return range(s.start, s.stop, 1 if s.step is None else s.step)

Contributor

Use slice.indices() instead to get start/stop/step (this is especially important for tricky cases like negative steps). You'll also need the size of the array dimension.
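
For instance (plain Python, just to illustrate the suggestion):

s = slice(None, None, -2)
s.indices(10)                # -> (9, -1, -2): normalized start, stop, step
list(range(*s.indices(10)))  # -> [9, 7, 5, 3, 1]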


Member Author

Ah, didn't know about that, nice.

zarr/indexing.py Outdated

def oindex(a, selection):
    """Implementation of orthogonal indexing with slices and ints."""
    drop_axes = tuple([i for i, s in enumerate(selection) if isinstance(s, int)])

Contributor

Again, be careful assuming that all integer selections are native Python ints.

zarr/indexing.py Outdated
    # validation
    if not is_coordinate_selection(selection, array):
        # TODO refactor error messages for consistency
        raise IndexError('invalid coordinate selection')

Contributor

It would be good to add an informative error message here about slices, because assuredly somebody is going to try that. (For what it's worth, I agree that it's a good choice not to support them!)


Member Author

Agreed, I'll do some work on error messages when implementation has settled.

zarr/indexing.py Outdated
    for dim_sel, dim_len in zip(selection, array.shape):

        # check number of dimensions, only support indexing with 1d array
        if len(dim_sel.shape) > 1:

Contributor

I'm not sure I'm reading this right, but does this mean you only support vectorized indexing with 1D arrays?

Vectorized indexing with >1D arrays should be pretty easy and can be quite useful. You just need to flatten the indices after broadcasting and unflatten the result.
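
A minimal NumPy sketch of that flatten/unflatten trick (not from the PR):

import numpy as np

a = np.arange(16).reshape(4, 4)
ix0 = np.array([[0, 1], [2, 3]])   # multi-dimensional coordinate arrays
ix1 = np.array([[1, 0], [3, 2]])

b0, b1 = np.broadcast_arrays(ix0, ix1)
flat = a[b0.ravel(), b1.ravel()]   # select with the 1D machinery
result = flat.reshape(b0.shape)    # unflatten to the broadcast shape
assert np.array_equal(result, a[ix0, ix1])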


Member Author

Thank you for this tip, I finally get coordinate indexing (at least without slices)! I've added support for multi-dimensional coordinate arrays.

assert_array_equal(a[0], z[0])
assert_array_equal(a[-1], z[-1])
assert_array_equal(a[:, 0], z[:, 0])
assert_array_equal(a[:, -1], z[:, -1])
eq(a[0, 0], z[0, 0])
eq(a[-1, -1], z[-1, -1])


Contributor

I would strongly recommend adding some short-form cases for vectorized indexing (i.e., with .vindex). You have partial test coverage for this already, but there are so many indexing edge cases that it's a good idea to write them in the most succinct way possible.


Member Author

I've added some tests to cover these cases. Still a bit more coverage needed.

slice(50, 150, 1),
slice(50, 150, 10),
slice(50, 150, 100),
]

Contributor

What about negative steps? At the least, those should give an appropriate error.


Member Author

Negative steps are supported, I've added tests to confirm.

@alimanfoo
Member Author

alimanfoo commented Nov 7, 2017

Thank you @shoyer for the hugely useful feedback. Here's a summary of latest pushes:

Support has been added for coordinate indexing with multi-dimensional arrays.

For arrays with a structured dtype, all get/set_..._selection() methods now support a fields argument to allow selecting data for specific fields (xref #112). I've also tentatively implemented h5py-style support for fields within __getitem__, although that deviates from the numpy API so I'm not 100% sure it's a good idea.
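
A sketch of what that looks like (field names here are illustrative):

import numpy as np
import zarr

data = np.array([(1, 2.0), (3, 4.0)], dtype=[('foo', 'i8'), ('bar', 'f8')])
z = zarr.array(data, chunks=2)

z.get_basic_selection(slice(None), fields='foo')  # -> array([1, 3])
z['foo']                                          # h5py-style field access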

I've also simplified __getitem__ so it only supports basic selections (a mix of int and contiguous slice). I think that's a decent position for now; it would be possible to add support via __getitem__ for more advanced scenarios, but it's not straightforward to work through all the different options and how they should be dispatched to the appropriate selection methods.

Examples and benchmarks notebook has been updated for the above changes.

@alimanfoo
Member Author

Just to mention, I've reworked the implementation of slices with step > 1; these are now supported via __getitem__ as well as the orthogonal selection methods, and are approx 10X faster and use less memory (no longer implemented via np.arange). The downside is that slices with step < 0 are not supported, but I think that's a reasonable compromise. Updated benchmarks here.

Test coverage is also back up, and I think I'm done with the main implementation work, so will work on docs and improving error messages before merging.

@FrancescAlted

Nice job. After having a look at your benchmarks, I see that _chunk_getitem is usually the most time-consuming function (cumtime-wise) in your profiles, so I am wondering if that could be improved somehow. I see that your chunk sizes are typically between 256 KB and 1 MB, but the benchmark page does not show the block size for each chunk, which is the important parameter when you try to get a handful of values out of a chunk (only a block or a few need to be decompressed). You can get the block size by using the blosc_get_blocksize() call, and you can explicitly set it using blosc_set_blocksize() (if you don't call it, an automatic block size is used). You may want to add support for these functions in zarr and try a smaller block size to see how it would affect your current figures.

Also, I see that np.argsort() sometimes shows up first in time usage. I am wondering if you could make use of a handy keysort that I wrote many years ago. keysort() takes two arrays as arguments, sorting the first one in place and also the second, following the order of the first, in one shot. This requires fewer temporaries and hence is quite a bit more efficient than an np.argsort followed by an indexing operation. It has been in production in PyTables for years, so it should be safe enough.
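
For reference, here is what keysort achieves, expressed with the numpy idiom it would replace (keysort does this in one pass, in place, without the intermediate permutation array):

import numpy as np

def argsort_equivalent(keys, values):
    # the pattern keysort() avoids: argsort plus two fancy-indexing passes
    order = np.argsort(keys)
    return keys[order], values[order]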

@alimanfoo
Member Author

alimanfoo commented Nov 9, 2017 via email

@FrancescAlted

(Btw when I was running the benchmarks yesterday I could actually hear my
computer audibly fizzing, only Blosc can make it do that :-)

Ha ha, this probably has to do with the SIMD support in Blosc's shuffle/bitshuffle, which makes CPUs consume quite a bit more energy. Add multithreading to the equation and, yeah, I can imagine you could fry something on top of your CPU while you are at it :)

@alimanfoo alimanfoo self-assigned this Nov 9, 2017
@alimanfoo alimanfoo added the enhancement New features or improvements label Nov 9, 2017
@alimanfoo alimanfoo force-pushed the advanced-indexing-20171028 branch from 6a38007 to f39bc40 Compare November 9, 2017 17:47
@alimanfoo
Member Author

OK, I think I am done here. There is a new tutorial section on advanced indexing. Error messages have been improved. I'll let the dust settle for a few days.

@alimanfoo alimanfoo mentioned this pull request Nov 10, 2017
@alimanfoo alimanfoo force-pushed the advanced-indexing-20171028 branch from 03176cf to 4e19759 Compare November 11, 2017 00:30
@alimanfoo
Member Author

After rebasing, I hit this unicode weirdness on Windows:

>>> import numpy as np
>>> v = np.array('xxx', dtype='U3')[()]
>>> v
'xxx'
>>> a = np.empty(10, dtype='U3')
>>> a[:] = v
>>> a[0] == v
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

I think this is related to the default locale of cp1252 on my Windows VM, but I don't really understand what's happening. In any case, I've pushed a simple workaround.

@alimanfoo
Member Author

Alright, merging.

@alimanfoo alimanfoo merged commit 3f66393 into master Nov 13, 2017
@alimanfoo alimanfoo deleted the advanced-indexing-20171028 branch November 13, 2017 01:39
@alimanfoo alimanfoo added the release notes done Automatically applied to PRs which have release notes. label Nov 20, 2017