
adds partial_decompress capabilities #584


Conversation

andrewfulton9
Contributor

Draft for adding the capability to partially decompress chunks when they have been compressed for zarr. Not finished yet. I have added an indexer for getting the within-chunk (start, nitems) pairs. I still need to finish adding logic to core.Array._chunk_getitem to use the indexer, and add tests. I also need to add a test for an example I found where I can get the PartialChunkIterator to fail. That test isn't added yet because I need to figure out the expected value: it fails with the below example array and slice tuple because it calculates the start value wrong when a middle dimension is 1. I also need to add tests that check whether the indexer works for 'F' order.

arr = np.arange(2, 100002).reshape((10, 1, 10000))
buff = codec.encode(arr)

slices = (slice(5, 8, 1), slice(2, 4, 1), slice(0, 5, 1))
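For what it's worth, the expected value for a case like this can be cross-checked against plain NumPy: with a middle dimension of size 1, slice(2, 4, 1) selects nothing, so the overall selection is empty. A sketch using the array and slices above:

```python
import numpy as np

# The failing example from the description. Note that the middle dimension
# has size 1, so arr[5:8, 2:4, 0:5] is actually an empty selection.
arr = np.arange(2, 100002).reshape((10, 1, 10000))
selection = (slice(5, 8, 1), slice(2, 4, 1), slice(0, 5, 1))

# slice.indices() clamps each slice to the dimension length, which gives an
# iterator-independent way to derive the expected per-dimension extents.
lengths = [len(range(*s.indices(d))) for s, d in zip(selection, arr.shape)]
print(lengths)              # [3, 0, 5] -> zero items selected overall
print(arr[selection].size)  # 0
```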

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@pep8speaks

pep8speaks commented Aug 11, 2020

Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1612:13: E116 unexpected indentation (comment)
Line 1791:82: W291 trailing whitespace
Line 1797:5: E129 visually indented line with same indent as next logical line
Line 1797:101: E501 line too long (106 > 100 characters)

Line 836:1: E302 expected 2 blank lines, found 1

Line 1326:8: E231 missing whitespace after ','
Line 1336:5: E124 closing bracket does not match visual indentation
Line 1341:5: E124 closing bracket does not match visual indentation
Line 1348:1: W391 blank line at end of file

Comment last updated at 2020-10-23 17:34:07 UTC

Comment on lines 1295 to 1298
((slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)),
[(5200, 200, (slice(0, 1, 1), slice(0, 2, 1))),
(6200, 200, (slice(1, 2, 1), slice(0, 2, 1))),
(7200, 200, (slice(2, 3, 1), slice(0, 2, 1)))]),
Contributor

So, trying to understand this: does

arr[5:8, 2:4, 0:100]

mean we need to read 200 items at positions 5200, 6200 and 7200?

Contributor Author

Yeah, really it's taking 1 item from the first dimension and 2 runs of 100 items from the second and third dimensions. So I wrote the code so that it takes 200 items, since they are consecutive in the compressed buffer anyway. The data just has to be reshaped before it is put into the chunk output array.
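To spell out the arithmetic behind those numbers (a sanity check against NumPy, assuming C order and a chunk of shape (100, 10, 100) as in the test case):

```python
import numpy as np

# For arr[5:8, 2:4, 0:100] in C order, [2:4, 0:100] is 2 * 100 = 200
# consecutive items, and each run starts at the flat offset of (i, 2, 0).
shape = (100, 10, 100)
starts = [int(np.ravel_multi_index((i, 2, 0), shape)) for i in range(5, 8)]
print(starts)  # [5200, 6200, 7200]
```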

(6200.0, 5.0, (slice(1, 2, 1), slice(0, 1, 1), slice(0, 5, 1))),
(6300.0, 5.0, (slice(1, 2, 1), slice(1, 2, 1), slice(0, 5, 1))),
(7200.0, 5.0, (slice(2, 3, 1), slice(0, 1, 1), slice(0, 5, 1))),
(7300.0, 5.0, (slice(2, 3, 1), slice(1, 2, 1), slice(0, 5, 1)))]),
Contributor

Floats are weird; the output should only contain ints, no?

Contributor Author

Yeah, I think the np.prod call used to generate the nitems is making that a float. I'll update that
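Consistent with that guess: np.prod over an empty shape tuple returns the float 1.0, which would then propagate through the start/nitems arithmetic. A minimal check:

```python
import numpy as np

# np.prod of an empty sequence returns a float64 scalar, not a Python int.
nitems = np.prod(())
print(nitems)             # 1.0
print(int(np.prod(())))   # 1 -- an explicit cast keeps the output integral
```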

if slice_nitems == dim_shape:
self.selection.pop()
else:
break
Contributor

If I understand this block of code correctly: here we seem to be looking for dimensions where we select the whole thing. We start from the end of selection and shape, and while there is no selection, or the selection covers 100% of the dimension, we pop.

Is that correct?

Contributor Author

That's right. I do this to maximize nitems and minimize the number of partial decompressions called. This logic is what gets to the 200 nitems in the test example above.
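The pop-while-full loop described above can be sketched as a standalone helper (hypothetical code, not the PR's implementation): trailing dimensions that are selected in full are collapsed into the per-read item count, so each remaining outer index maps to one contiguous run.

```python
def collapse_trailing(selection, shape):
    """Drop trailing dimensions whose slice covers the whole dimension,
    accumulating their sizes into a contiguous per-read item count."""
    selection, shape = list(selection), list(shape)
    nitems = 1
    while selection:
        sl, dim = selection[-1], shape[-1]
        if len(range(*sl.indices(dim))) == dim:
            nitems *= dim
            selection.pop()
            shape.pop()
        else:
            break
    return selection, nitems

# The test example above: the last dimension (0:100 of 100) collapses, and the
# remaining step-1 slice 2:4 then extends each read to 2 * 100 = 200 items.
sel, nitems = collapse_trailing(
    (slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)), (100, 10, 100))
print(sel, nitems)  # [slice(5, 8, 1), slice(2, 4, 1)] 100
```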

zarr/indexing.py Outdated
Comment on lines 863 to 865
for i, x in enumerate(range(*sl.indices(len(self.arr)))):
dim_out_slices.append(slice(i, i+1, 1))
dim_chunk_loc_slices.append(slice(x, x+1, 1))
Contributor

IIUC, here you are computing each individual 1-element-wide selection across each axis, both in the origin chunk coordinate system and in the output version, right?

Would/could this be simpler if the step of sl is 1? Because in that case you have a contiguous sl, right? Or did I misunderstand?

Not a request to change; it can be an optimisation for later.

Contributor Author

That's right, except for the slice of the last dimension when that slice has a step of 1. If that slice has a step of more than one, then it is included here, though if that's the last dimension of the chunk it would be really inefficient, since nitems would be 1. If the slice of the last dimension has a step of one, then it covers more than one element and the output slice is calculated by stop - start.
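The distinction in the reply above can be sketched as follows (hypothetical helper, not the PR's code): a step-1 slice on the last dimension yields one contiguous read of stop - start items, while any larger step degenerates into 1-item reads.

```python
def last_dim_reads(sl, dim):
    """Return (offset, nitems) pairs for a slice over the last dimension."""
    start, stop, step = sl.indices(dim)
    if step == 1:
        return [(start, stop - start)]                     # one contiguous read
    return [(i, 1) for i in range(start, stop, step)]      # many 1-item reads

print(last_dim_reads(slice(0, 100, 1), 100))  # [(0, 100)]
print(last_dim_reads(slice(0, 10, 3), 100))   # [(0, 1), (3, 1), (6, 1), (9, 1)]
```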

andrewfulton9 and others added 2 commits August 11, 2020 15:48
Co-authored-by: Matthias Bussonnier <[email protected]>
Co-authored-by: Matthias Bussonnier <[email protected]>
zarr/indexing.py Outdated
chunk_loc_slices.append([last_dim_slice])

self.out_slices = itertools.product(*out_slices)
self.chunk_loc_slices = itertools.product(*chunk_loc_slices)
Contributor

And for myself: out_slices are the slices to assign elements in the output array, chunk_loc_slices the indices of where to read in the original chunk.

Contributor Author

That's right.

@Carreau
Contributor

Carreau commented Aug 11, 2020

You may want to also test with negative indexing, and negative steps just in case.
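One way to build such cases: slice.indices() normalizes negative starts, stops, and steps against the dimension length, giving a plain-Python reference for what the iterator should see.

```python
dim = 10
cases = [slice(-3, None, 1), slice(None, None, -1), slice(-8, -2, 2)]
# slice.indices(dim) returns a normalized (start, stop, step) for that length.
normalized = [sl.indices(dim) for sl in cases]
print(normalized)  # [(7, 10, 1), (9, -1, -1), (2, 8, 2)]
```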


@pytest.mark.parametrize('selection, arr, expected', [
((slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)),
np.arange(2, 100_002).reshape((100, 10, 100)),
Contributor

Python 3.5 won't like those... we can argue to drop Python 3.5 I think

Contributor Author

The underscores in the numbers, you mean? I can just take them out. I didn't realize they weren't supported in all the Python 3 versions (underscore digit separators were added in 3.6).

Co-authored-by: Matthias Bussonnier <[email protected]>
@joshmoore
Member

Hi @andrewfulton9. Master is now conflicting with this branch. Do you have a moment to take a look at the conflicts?

@andrewfulton9
Contributor Author

Hey @joshmoore. I think this PR should actually be closed. I used the work here for the partial read capabilities that were merged into master in PR #667.
