
Advanced indexing #172


Merged

alimanfoo merged 68 commits from advanced-indexing-20171028 into master on Nov 13, 2017

Conversation

alimanfoo
Member

@alimanfoo alimanfoo commented Oct 31, 2017

This PR adds support for indexing Zarr arrays with Boolean or integer arrays. Resolves #78. Also adds support for selecting fields from structured arrays (resolves #112). Also resolves #89, resolves #93.

TODO:

  • Improve error messages.
  • Docstrings.
  • API Docs.
  • Add section on advanced indexing to tutorial.
  • Increase test coverage.
  • Fix failing PY36 doctest.

@alimanfoo
Member Author

alimanfoo commented Oct 31, 2017

Some examples of usage, and some performance benchmarking, are in this notebook.

Regarding the API, I have added support for orthogonal indexing (a.k.a. outer indexing) via __getitem__ and __setitem__. N.B., if there is more than one bool or int array used in the indexing selection, then this differs from numpy fancy indexing. There are at least two possible options here:

(1) Keep this as-is, i.e., implement orthogonal indexing via __getitem__ and __setitem__. Pros: API and code simplicity. Cons: different behaviour from numpy for some operations.

(2) Keep behaviour of __getitem__ and __setitem__ consistent with numpy fancy indexing by disallowing more than one indexing array. Add a different accessor that supports full orthogonal indexing (e.g., called 'iloc' following pandas naming, or 'oindex' following naming proposed for a new numpy orthogonal indexing accessor). Pros: consistency with numpy; leaves open possibility for implementing more complete fancy indexing support in future (although I don't think I'll ever have the brain-power to do that). Cons: more complex API and code.
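
To make the difference concrete, here is a minimal NumPy sketch (not part of the PR) contrasting the two behaviours when two integer arrays are used:

import numpy as np

a = np.arange(12).reshape(3, 4)

# numpy fancy indexing pairs the two arrays element-wise,
# selecting the points (0, 1) and (2, 3):
a[[0, 2], [1, 3]]           # -> array([ 1, 11])

# orthogonal (outer) indexing takes the cross-product of rows [0, 2]
# and columns [1, 3], which numpy expresses via np.ix_:
a[np.ix_([0, 2], [1, 3])]   # -> array([[ 1,  3],
                            #           [ 9, 11]])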

Regarding performance, results from some simple benchmarks look quite promising. Performance obviously depends on how many items are being selected, i.e., how dense or sparse the selection is. For relatively dense selections (~50% of items), indexing with a boolean array is within a factor of 2 of the speed of the same operation on a plain numpy array, which seems decent given that zarr has to do the extra work of managing and decompressing chunks. For relatively sparse selections (~0.01% of items) we are about 10 times slower than numpy, but almost all the time is spent in Array._decode_chunk, which is where decompression happens, so I think this proves the overhead from processing the array selection is minimal compared with the time required for decompressing chunks, even when using a very fast compressor (Blosc with LZ4, multithreaded).

I also did a quick performance comparison with h5py, which isn't really fair as h5py was using a slower compressor (gzip level 1). However, FWIW, with the sparse boolean array zarr is ~4X faster than h5py, and with the dense boolean array h5py's performance is pathological, taking longer than 1 minute to complete, so zarr wins big there, taking <1 second.

Comments on API and implementation very welcome. cc @shoyer, @mrocklin, @jakirkham, @FrancescAlted.

@shoyer
Contributor

shoyer commented Oct 31, 2017

Very cool to see this!

Keep behaviour of __getitem__ and __setitem__ consistent with numpy fancy indexing by disallowing more than one indexing array.

Watch out: NumPy considers even scalars to be indexing arrays:

In [15]: x = np.zeros((1, 2, 3))

In [16]: x[0, :, [0, 1, 2]].shape
Out[16]: (3, 2)

(This is my favorite NumPy indexing edge case.)

I don't really have an opinion here on (1) vs (2), as long as it is clearly documented and you don't try to do both outer/orthogonal and vectorized/broadcasting indexing in the same API. NetCDF4-Python only does outer indexing and that works fine for it. I would be just as happy to use a special .vindex[] indexer for vectorized indexing if anyone ever bothers to add it.

leaves open possibility for implementing more complete fancy indexing support in future (although I don't think I'll ever have the brain-power to do that)

This might actually be easier than you think. @mrocklin wrote a version of this for dask that might be a good reference point:
https://github.com/dask/dask/blob/7113a3c9bf335f2fe58989760af7b671d940e92f/dask/array/core.py#L3024

@FrancescAlted

Good work! Maybe I'm looking at the benchmarks incorrectly, but I only see zarr being 4x faster (not 10x) than h5py:

%time zc[ix_sparse_bool]
CPU times: user 472 ms, sys: 88 ms, total: 560 ms
Wall time: 262 ms

vs

%time hc[ix_sparse_bool]
CPU times: user 1.1 s, sys: 0 ns, total: 1.1 s
Wall time: 1.1 s

For what it's worth, I think zarr might benefit from the forthcoming introduction of dictionary support for zstd inside Blosc2. The nice thing about dictionaries is that you can make your data blocks ridiculously small (apparently as small as 1 KB), but still get good compression ratios and, more importantly, very fast decompression speed. This should reduce the latency quite a bit when you have to decompress a whole block to get just 1 (or a few) values out of it.

@alimanfoo
Member Author

alimanfoo commented Oct 31, 2017 via email

@alimanfoo
Member Author

Good work! Maybe I'm looking at the benchmarks incorrectly, but I only see zarr being 4x faster (not 10x) than h5py

Sorry, yes, my mistake, 4X faster.

For what it's worth, I think zarr might benefit from the forthcoming introduction of dictionary support for zstd inside Blosc2. The nice thing about dictionaries is that you can make your data blocks ridiculously small (apparently as small as 1 KB), but still get good compression ratios and, more importantly, very fast decompression speed. This should reduce the latency quite a bit when you have to decompress a whole block to get just 1 (or a few) values out of it.

Very interesting, thanks!

@alimanfoo
Member Author

@mrocklin regarding API, how would/should this play with da.from_array(fancy=True/False)? If fancy=True, what does dask assume about the API?

@mrocklin
Contributor

I believe that setting fancy=True means that the underlying data store supports fancy indexing (which, as I understand it, zarr now does), and so dask.array should feel comfortable sending complex slicing arguments down to the underlying store. I think we had to implement this because h5py didn't support some things that numpy did.

It has been a while since then though, so I may be misremembering things. The relevant docstring is here:

    fancy : bool, optional
        If ``x`` doesn't support fancy indexing (e.g. indexing with lists or
        arrays) then set to False. Default is True.

It sounds like you do support these things, so presumably people loading dask arrays from zarr arrays should set fancy=True, which is the default.

@alimanfoo
Member Author

Thanks @mrocklin. Currently in this PR zarr does not implement fancy indexing the same as numpy, but rather implements orthogonal indexing. So I was concerned dask may get unexpected results if fancy=True assumes numpy fancy indexing, depending on what indexes are passed through.

Actually, just looking at @shoyer's favourite edge case, it looks like dask __getitem__ behaviour does something different from numpy fancy indexing anyway, before even worrying about zarr interaction. E.g.:

In [17]: x = np.arange(6).reshape(1, 2, 3)

In [18]: d = da.from_array(x, chunks=(1, 2, 3))

In [19]: x[0, :, [0, 1, 2]]
Out[19]: 
array([[0, 3],
       [1, 4],
       [2, 5]])

In [20]: d[0, :, [0, 1, 2]].compute()
Out[20]: 
array([[0, 1, 2],
       [3, 4, 5]])

So I guess there are a couple of separate questions:

(a) It looks like dask.array __getitem__ behaviour currently does not match np.ndarray __getitem__ behaviour for some edge cases, should it?

(b) What exactly does fancy=True mean in terms of expected behaviour of __getitem__ on the wrapped array? I.e., if dask.from_array is given fancy=True and the wrapped array implements orthogonal indexing via __getitem__, could dask ever pass through a combination of indexes that would produce different results for numpy fancy indexing versus orthogonal indexing (e.g., two 1D integer arrays, or @shoyer's edge case)?

@alimanfoo
Member Author

cc @benjeffery

@alimanfoo
Member Author

I think I prefer option (2): allow only slices and/or ints in __getitem__/__setitem__; implement orthogonal indexing via .oindex[] in this PR; maybe implement point selection via .vindex[] in future PR. Seems like a safer thing to do, less potential for confusion and bugs. When using zarr with dask, can stick with fancy=False for now, and down the line figure out if worth finding a way that dask could make use of .oindex[] and/or .vindex[] if available on the wrapped array.
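
For illustration, a minimal sketch of that conservative wiring (not from the PR, and assuming the da.from_array docstring quoted above):

import numpy as np
import dask.array as da
import zarr

z = zarr.zeros((100, 100), chunks=(10, 10), dtype='f8')

# while zarr's __getitem__ only supports basic selections, tell dask
# not to push list/array indexes down to the underlying store
d = da.from_array(z, chunks=z.chunks, fancy=False)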

@shoyer
Contributor

shoyer commented Oct 31, 2017

Just to note, the way we currently disambiguate vectorized/orthogonal indexing internally in xarray is that we use dedicated classes to store each type of indexer:
https://github.com/pydata/xarray/blob/17956ea5de2cf5029992e8f83460fcc878e3d024/xarray/core/indexing.py#L280-L303

This way, indexing can go through the same code paths but still dispatch to appropriate backend specific methods (e.g., dask vs numpy vs netCDF4 vs zarr).
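
A hypothetical sketch of that pattern as it might apply here (class and function names are illustrative, not xarray's actual internals):

class OuterIndexer:
    """Marks a selection to be interpreted orthogonally."""
    def __init__(self, *key):
        self.key = key

class VectorizedIndexer:
    """Marks a selection to be interpreted as point-wise coordinates."""
    def __init__(self, *key):
        self.key = key

def select(zarr_array, indexer):
    # dispatch on the indexer class rather than guessing semantics
    if isinstance(indexer, OuterIndexer):
        return zarr_array.oindex[indexer.key]
    elif isinstance(indexer, VectorizedIndexer):
        return zarr_array.vindex[indexer.key]
    raise TypeError('unsupported indexer: %r' % (indexer,))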

@alimanfoo alimanfoo added this to the v2.2 milestone Oct 31, 2017
@mrocklin
Contributor

mrocklin commented Nov 1, 2017

Actually, just looking at @shoyer's favourite edge case, it looks like dask __getitem__ behaviour does something different from numpy fancy indexing anyway, before even worrying about zarr interaction. E.g.:

Yes, I think that this came up when we were hammering out slicing. I think that it was intentionally decided to deviate from NumPy's behavior. I wouldn't be surprised if @shoyer was the one to make this call actually. My memory here is a bit hazy.

@shoyer
Contributor

shoyer commented Nov 1, 2017 via email

@mrocklin
Contributor

mrocklin commented Nov 1, 2017 via email

@alimanfoo
Member Author

alimanfoo commented Nov 1, 2017 via email

@alimanfoo
Member Author

A better name for "get_point_selection_bool" could be "get_mask_selection", and "get_point_selection_int" might be better named "get_coordinate_selection".

@alimanfoo
Member Author

alimanfoo commented Nov 6, 2017

I've pushed some new work on this, here's a synopsis.

Vectorized (inner) indexing

I've added support for vectorized indexing using coordinate arrays (a.k.a. point selection), which actually wasn't too hard to do. This functionality is available via the get/set_coordinate_selection() methods and also, for convenience, via .vindex[].

Vectorized indexing using a Boolean mask array is also supported, via the get/set_mask_selection() methods and .vindex[]. This just calls np.nonzero() internally to construct coordinate arrays, and the rest is done via coordinate selection.

More complicated vectorized indexing scenarios, e.g., mixing coordinate or mask arrays with slices, are currently not supported.

The indexing coordinates do not have to be sorted in any particular order. Zarr shuffles the coordinates so they are grouped by their corresponding chunk, so that each chunk is processed once only.
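
A small usage sketch of the coordinate and mask selection API described above (output values assume the array contents shown):

import numpy as np
import zarr

z = zarr.zeros((100, 100), chunks=(10, 10), dtype='i8')
z[:] = np.arange(100 * 100).reshape(100, 100)

# coordinate (point) selection: picks the items at (0, 2) and (5, 7)
z.get_coordinate_selection(([0, 5], [2, 7]))  # -> array([  2, 507])
z.vindex[[0, 5], [2, 7]]                      # same, via the accessor

# mask selection: a Boolean array with the same shape as the array
mask = np.zeros(z.shape, dtype=bool)
mask[0, 2] = mask[5, 7] = True
z.vindex[mask]                                # -> array([  2, 507])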

Orthogonal (outer) indexing

Orthogonal indexing is supported via get/set_orthogonal_selection() and for convenience via .oindex[]. Any mix of int, slice with step >= 1, 1D int array and 1D bool array is supported.

Integer arrays do not need to be sorted. Zarr shuffles the index values so they are grouped by their corresponding chunk, so that each chunk is processed once only.
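
A corresponding sketch for orthogonal selection (continuing with the z array from the previous sketch):

# rows [0, 5] crossed with columns [2, 7] -> shape (2, 2)
z.get_orthogonal_selection(([0, 5], [2, 7]))

# mix a 1D int array with a slice via the accessor -> shape (2, 6)
z.oindex[[0, 5], 2:8]

# 1D Boolean arrays per dimension also work
rows = np.zeros(100, dtype=bool)
rows[[0, 5]] = True
z.oindex[rows, [2, 7]]                        # -> shape (2, 2)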

Slice with step > 1

Slice with step > 1 is now supported in __getitem__ for 1D arrays and in .oindex[] for multi-dimensional arrays. Internally the slice is converted to an int array via np.arange and then processed via the orthogonal selection machinery.
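
For example (again a sketch, continuing from the arrays above):

z1 = zarr.array(np.arange(100), chunks=10)
z1[10:90:7]              # step > 1 via __getitem__ on a 1D array
z.oindex[0:100:10, ::5]  # and via .oindex on a multi-dimensional array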

Open questions

What functionality should be available via __getitem__?

For 1D arrays there is no ambiguity on how to process advanced selections, i.e., there is no difference between vectorized versus orthogonal indexing. So for convenience, if a Zarr array is 1D, currently __getitem__ supports any of int, slice with step >= 1, 1D int array, 1D bool array.

For multi-dimensional arrays things are more complex, and so currently I restrict __getitem__ to only handle basic selections, i.e., any combination of int and slice with step == 1. To do advanced indexing the user must use one of .oindex[] or .vindex[] or the corresponding selection methods.

If anyone feels this isn't a good way to go, happy to discuss.

Benchmarks

Performance seems reasonable in all cases, can't see any obvious ways to improve. More examples and some benchmarking data are in this notebook.

Contributor

@shoyer shoyer left a comment

This is looking very nice! Fancy indexing support might give zarr a decisive edge over HDF5 :).

zarr/core.py Outdated

        elif len(self._shape) == 1:
            # safe to do "fancy" indexing, no ambiguity
            return self.get_orthogonal_selection(selection)

Contributor

You can do vectorized indexing on 1D arrays, too, e.g.,

In [22]: a = np.arange(4)

In [23]: a[a.reshape(2, 2)]
Out[23]:
array([[0, 1],
       [2, 3]])

More generally, I agree that it's unambiguous for 1D, but given the focus of zarr on N-dimensions I would be reluctant to add this shortcut. The special case feels like more trouble than it's worth.


Member Author

Thanks, yep, I think you're probably right. I've added support for vectorized indexing with multi-dimensional coordinate arrays, but have limited __getitem__ to basic selections only.

zarr/core.py Outdated
        if isinstance(out, np.ndarray) and \
                not self._filters and \
                ((self._order == 'C' and dest.flags.c_contiguous) or
                 (self._order == 'F' and dest.flags.f_contiguous)):

Contributor

Note: PEP 8 suggests using extra parentheses rather than explicit \ for line continuation. I think it looks a little cleaner, too.

zarr/indexing.py Outdated


def is_integer(x):
    return isinstance(x, numbers.Integral)

Contributor

Make sure this catches numpy's signed and unsigned integer types -- missing those has led to issues in dask and xarray.


Member Author

I've checked this, looks OK:

In [5]: for t in int, np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64:
   ...:     print(t, isinstance(t(42), numbers.Integral))
   ...:     
<class 'int'> True
<class 'numpy.int8'> True
<class 'numpy.int16'> True
<class 'numpy.int32'> True
<class 'numpy.int64'> True
<class 'numpy.uint8'> True
<class 'numpy.uint16'> True
<class 'numpy.uint32'> True
<class 'numpy.uint64'> True

zarr/indexing.py Outdated


def slice_to_range(s):
    return range(s.start, s.stop, 1 if s.step is None else s.step)

Contributor

Use slice.indices() instead to get start/stop/step (this is especially important for tricky cases like negative steps). You'll also need the size of the array dimension.
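
For instance (plain Python, just to illustrate the suggestion):

s = slice(None, None, -2)
s.indices(10)                # -> (9, -1, -2): normalized start, stop, step
list(range(*s.indices(10)))  # -> [9, 7, 5, 3, 1]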


Member Author

Ah, didn't know about that, nice.

zarr/indexing.py Outdated

def oindex(a, selection):
    """Implementation of orthogonal indexing with slices and ints."""
    drop_axes = tuple([i for i, s in enumerate(selection) if isinstance(s, int)])

Contributor

Again, be careful assuming that all integer selections are native Python ints.

zarr/indexing.py Outdated
    # validation
    if not is_coordinate_selection(selection, array):
        # TODO refactor error messages for consistency
        raise IndexError('invalid coordinate selection')

Contributor

It would be good to add an informative error message here about slices, because assuredly somebody is going to try that. (For what it's worth, I agree that it's a good choice not to support them!)


Member Author

Agreed, I'll do some work on error messages when implementation has settled.

zarr/indexing.py Outdated
    for dim_sel, dim_len in zip(selection, array.shape):

        # check number of dimensions, only support indexing with 1d array
        if len(dim_sel.shape) > 1:

Contributor

I'm not sure I'm reading this right, but does this mean you only support vectorized indexing with 1D arrays?

Vectorized indexing with >1D arrays should be pretty easy and can be quite useful. You just need to flatten the indices after broadcasting and unflatten the result.
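
A minimal NumPy sketch of that flatten/unflatten trick (not from the PR):

import numpy as np

a = np.arange(16).reshape(4, 4)
ix0 = np.array([[0, 1], [2, 3]])   # multi-dimensional coordinate arrays
ix1 = np.array([[1, 0], [3, 2]])

b0, b1 = np.broadcast_arrays(ix0, ix1)
flat = a[b0.ravel(), b1.ravel()]   # select with the 1D machinery
result = flat.reshape(b0.shape)    # unflatten to the broadcast shape
assert np.array_equal(result, a[ix0, ix1])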


Member Author

Thank you for this tip, I finally get coordinate indexing (at least without slices)! I've added support for multi-dimensional coordinate arrays.

assert_array_equal(a[0], z[0])
assert_array_equal(a[-1], z[-1])
assert_array_equal(a[:, 0], z[:, 0])
assert_array_equal(a[:, -1], z[:, -1])
eq(a[0, 0], z[0, 0])
eq(a[-1, -1], z[-1, -1])


Contributor

I would strongly recommend adding some short-form cases for vectorized indexing (i.e., with .vindex). You have partial test coverage for this already, but there are so many indexing edge cases that it's a good idea to write them in the most succinct way possible.


Member Author

I've added some tests to cover these cases. Still a bit more coverage needed.

slice(50, 150, 1),
slice(50, 150, 10),
slice(50, 150, 100),
]

Contributor

What about negative steps? At the least, those should give an appropriate error.


Member Author

Negative steps are supported, I've added tests to confirm.

@alimanfoo
Member Author

alimanfoo commented Nov 7, 2017

Thank you @shoyer for the hugely useful feedback. Here's a summary of latest pushes:

Support has been added for coordinate indexing with multi-dimensional arrays.

For arrays with a structured dtype, all get/set_..._selection() methods now support a fields argument to allow selecting data for specific fields (xref #112). I've also tentatively implemented h5py-style support for fields within __getitem__, although that deviates from the numpy API so I'm not 100% sure it's a good idea.
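
A sketch of what that looks like (field names here are illustrative):

import numpy as np
import zarr

data = np.array([(1, 2.0), (3, 4.0)], dtype=[('foo', 'i8'), ('bar', 'f8')])
z = zarr.array(data, chunks=2)

z.get_basic_selection(slice(None), fields='foo')  # -> array([1, 3])
z['foo']                                          # h5py-style field access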

I've also simplified __getitem__ so it only supports basic selections (a mix of int and contiguous slice). I think that's a decent position for now; it would be possible to add support via __getitem__ for more advanced scenarios, but it's not straightforward to work through all the different options and how they should be dispatched to the appropriate selection methods.

Examples and benchmarks notebook has been updated for the above changes.

@alimanfoo
Member Author

Just to mention, I've reworked the implementation of slices with step > 1; these are now supported via __getitem__ as well as the orthogonal selection methods, and are approx 10X faster and use less memory (no longer implemented via np.arange). The downside is that slices with step < 0 are not supported, but I think that's a reasonable compromise. Updated benchmarks here.

Test coverage is also back up, and I think I'm done with the main implementation work, so will work on docs and improving error messages before merging.

@FrancescAlted

Nice job. After having a look at your benchmarks, I see that _chunk_getitem is usually the most time-consuming function (cumtime-wise) in your profiles, so I am wondering if that could be improved somehow. I see that your chunk sizes are typically between 256 KB and 1 MB, but the benchmark page does not show the block size for each chunk, which is the important parameter when you try to get a handful of values out of a chunk (only a block or a few need to be decompressed). You can get the block size by using the blosc_get_blocksize() call, and you can explicitly set it using blosc_set_blocksize() (if you don't call it, an automatic block size is used). You may want to add support for these functions in zarr and try a smaller block size to see how it would affect your current figures.

Also, I see that np.argsort() sometimes shows up first in time usage. I am wondering if you could make use of a handy keysort that I wrote many years ago. keysort() takes two arrays as arguments, sorting the first one in place and also the second, following the order of the first, in one shot. This requires fewer temporaries and hence is quite a bit more efficient than an np.argsort followed by an indexing operation. It has been in production in PyTables for years, so it should be safe enough.
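
For reference, here is what keysort achieves, expressed with the numpy idiom it would replace (keysort does this in one pass, in place, without the intermediate permutation array):

import numpy as np

def argsort_equivalent(keys, values):
    # the pattern keysort() avoids: argsort plus two fancy-indexing passes
    order = np.argsort(keys)
    return keys[order], values[order]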

@alimanfoo
Member Author

alimanfoo commented Nov 9, 2017 via email

@FrancescAlted

(Btw when I was running the benchmarks yesterday I could actually hear my
computer audibly fizzing, only Blosc can make it do that :-)

Ha ha, this probably has to do with the SIMD support in Blosc's shuffle/bitshuffle, which makes CPUs consume quite a bit more energy. Add multithreading to the equation and, yeah, I can imagine you could fry something on top of your CPU while you are at it :)

@alimanfoo alimanfoo self-assigned this Nov 9, 2017
@alimanfoo alimanfoo added the enhancement New features or improvements label Nov 9, 2017
@alimanfoo alimanfoo force-pushed the advanced-indexing-20171028 branch from 6a38007 to f39bc40 Compare November 9, 2017 17:47
@alimanfoo
Member Author

OK, I think I am done here. There is a new tutorial section on advanced indexing. Error messages have been improved. I'll let the dust settle for a few days.

@alimanfoo alimanfoo mentioned this pull request Nov 10, 2017
@alimanfoo alimanfoo force-pushed the advanced-indexing-20171028 branch from 03176cf to 4e19759 Compare November 11, 2017 00:30
@alimanfoo
Member Author

After rebasing, I hit this unicode weirdness on Windows:

>>> import numpy as np
>>> v = np.array('xxx', dtype='U3')[()]
>>> v
'xxx'
>>> a = np.empty(10, dtype='U3')
>>> a[:] = v
>>> a[0] == v
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)

I think this is related to the default locale of cp1252 on my Windows VM, but I don't really understand what's happening. In any case, I've pushed a simple workaround.

@alimanfoo
Member Author

Alright, merging.

@alimanfoo alimanfoo merged commit 3f66393 into master Nov 13, 2017
@alimanfoo alimanfoo deleted the advanced-indexing-20171028 branch November 13, 2017 01:39
@alimanfoo alimanfoo added the release notes done Automatically applied to PRs which have release notes. label Nov 20, 2017