adds partial_decompress capabilities #584
Conversation
Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-10-23 17:34:07 UTC
zarr/tests/test_indexing.py
Outdated
((slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)),
 [(5200, 200, (slice(0, 1, 1), slice(0, 2, 1))),
  (6200, 200, (slice(1, 2, 1), slice(0, 2, 1))),
  (7200, 200, (slice(2, 3, 1), slice(0, 2, 1)))]),
So, trying to understand this: does
arr[5:8, 2:4, 0:100]
mean we need to read 200 items at positions 5200, 6200 and 7200?
Yeah. Really, it's taking 1 item from the first dimension and 2 items of 100 from the second dimension. I wrote the code so that it takes 200 items, since they are consecutive in the compressed buffer anyway. The data just has to be reshaped before it is put into the chunk output array.
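To make the numbers concrete, here is a small worked sketch of that arithmetic, assuming a C-ordered chunk of shape (100, 10, 100) (the shape used in the test case in this thread):

```python
# Worked arithmetic for arr[5:8, 2:4, 0:100] on a C-ordered chunk of
# shape (100, 10, 100): the selected 2 x 100 block in the last two
# axes is contiguous in the buffer, so each i in 5..7 becomes one
# read of 200 items starting at the flat offset of element (i, 2, 0).
shape = (100, 10, 100)
reads = [(i * shape[1] * shape[2] + 2 * shape[2],  # flat C-order start
          2 * shape[2])                            # nitems: 2 rows of 100
         for i in range(5, 8)]
print(reads)  # [(5200, 200), (6200, 200), (7200, 200)]
```

These are exactly the (start, nitems) pairs in the parametrized test above.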
zarr/tests/test_indexing.py
Outdated
(6200.0, 5.0, (slice(1, 2, 1), slice(0, 1, 1), slice(0, 5, 1))),
(6300.0, 5.0, (slice(1, 2, 1), slice(1, 2, 1), slice(0, 5, 1))),
(7200.0, 5.0, (slice(2, 3, 1), slice(0, 1, 1), slice(0, 5, 1))),
(7300.0, 5.0, (slice(2, 3, 1), slice(1, 2, 1), slice(0, 5, 1)))]),
Floats are weird; shouldn't the output contain only ints?
Yeah, I think the np.prod call used to generate nitems is making that a float. I'll update that.
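A small illustration of one way np.prod can introduce floats: over an empty sequence it defaults to the float 1.0, so values derived from it can silently become floats unless an integer dtype or cast is forced. (The exact call site in the PR isn't shown here; this is just the general pitfall.)

```python
import numpy as np

# np.prod over an empty sequence defaults to the float 1.0; forcing
# an integer dtype (or wrapping in int()) keeps the result integral.
print(np.prod(()))                  # 1.0, a float
print(np.prod((), dtype=np.intp))   # 1, an integer

trailing_extents = (2, 100)              # hypothetical collapsed dims
nitems = int(np.prod(trailing_extents))  # 200, guaranteed to be an int
```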
if slice_nitems == dim_shape:
    self.selection.pop()
else:
    break
If I understand this block of code correctly: here we seem to be looking for dimensions where we select the whole thing. We start from the end of the selection and shape, and while the slice covers 100% of the dimension, we pop.
Is that correct?
That's right. I do this to maximize nitems and minimize the number of partial decompressions called. This logic is what gets to the 200 nitems in the test example above.
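Under that reading, the pop loop can be sketched as a standalone helper (hypothetical name, not zarr's actual code, and assuming step-1 slices):

```python
def fold_trailing(selection, shape):
    """Starting from the last axis, while a step-1 slice spans its
    whole dimension, absorb that dimension into nitems and drop the
    slice, so one contiguous read covers the folded axes."""
    sel = list(selection)
    nitems = 1
    while sel:
        dim_shape = shape[len(sel) - 1]
        sl = sel[-1]
        slice_nitems = sl.stop - sl.start  # assumes step == 1
        if slice_nitems == dim_shape:      # slice covers the whole axis
            nitems *= dim_shape
            sel.pop()
        else:
            break
    return sel, nitems

sel, nitems = fold_trailing(
    (slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)), (100, 10, 100))
# The last axis (0:100 over extent 100) folds; the 2:4 slice does not.
print(sel, nitems)  # [slice(5, 8, 1), slice(2, 4, 1)] 100
```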
zarr/indexing.py
Outdated
for i, x in enumerate(range(*sl.indices(len(self.arr)))):
    dim_out_slices.append(slice(i, i+1, 1))
    dim_chunk_loc_slices.append(slice(x, x+1, 1))
IIUC, here you are computing each individual 1-element-wide selection across each axis, both in the origin chunk coordinate system and in the output coordinate system, right?
Would/could this be simpler if the step of sl is 1? Because in that case you have a contiguous slice, right? Or did I misunderstand.
Not a request to change; it can be an optimisation for later.
That's right, except for the slice of the last dimension when that slice has a step of 1. If that slice has a step of more than one, then it is included here, though if that's the last dimension of the chunk it would be really inefficient, since nitems would be 1. If the slice of the last dimension has a step of one, then it covers more than one item, and the output slice is calculated from stop - start.
Co-authored-by: Matthias Bussonnier <[email protected]>
zarr/indexing.py
Outdated
chunk_loc_slices.append([last_dim_slice])

self.out_slices = itertools.product(*out_slices)
self.chunk_loc_slices = itertools.product(*chunk_loc_slices)
And for myself: out_slices are the slices used to assign elements in the output array, and chunk_loc_slices the indices of where to read in the original chunk.
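A minimal illustration of how those per-axis slice lists combine (made-up slice lists, not zarr's internals): the Cartesian product over each list yields one multi-dimensional selection tuple per contiguous read, and the two products stay in lockstep.

```python
import itertools

# Two 1-element slices on axis 0, one folded slice on axis 1.
out_slices = [[slice(0, 1, 1), slice(1, 2, 1)], [slice(0, 2, 1)]]
chunk_loc_slices = [[slice(5, 6, 1), slice(6, 7, 1)], [slice(2, 4, 1)]]

pairs = list(zip(itertools.product(*out_slices),
                 itertools.product(*chunk_loc_slices)))
# Two reads: output[0:1, 0:2] <- chunk[5:6, 2:4]
#            output[1:2, 0:2] <- chunk[6:7, 2:4]
print(len(pairs))  # 2
```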
That's right.
You may want to also test with negative indexing, and negative steps, just in case.
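For reference, slice.indices is what normalizes negative indices and steps against an axis length, so negative-index tests would mostly exercise that normalization:

```python
# slice.indices(length) resolves negative starts, stops and steps to
# concrete (start, stop, step) triples for an axis of that length.
length = 10
print(slice(-3, None).indices(length))        # (7, 10, 1),  i.e. arr[-3:]
print(slice(None, None, -2).indices(length))  # (9, -1, -2), i.e. arr[::-2]
```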
@pytest.mark.parametrize('selection, arr, expected', [
    ((slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)),
     np.arange(2, 100_002).reshape((100, 10, 100)),
Python 3.5 won't like those... we could argue to drop Python 3.5, I think.
The underscores in the number, you mean? I can just take them out. I didn't realize they weren't supported in all Python 3 versions.
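For context: underscore digit separators come from PEP 515 and were added in Python 3.6, so 100_002 is a SyntaxError on 3.5; the plain literal is equivalent and parses everywhere.

```python
# PEP 515 (Python 3.6+): underscores are purely visual grouping, so
# the two spellings denote the same int. On 3.5 only the plain form
# parses.
assert 100_002 == 100002
```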
Co-authored-by: Matthias Bussonnier <[email protected]>
Hi @andrewfulton9. Master is now conflicting with this branch. Do you have a moment to take a look at the conflicts?
Hey @joshmoore. I think this PR should actually be closed. I used the work here for the partial read capabilities that were merged into master in PR #667.
Draft adding the capability to partially decompress chunks when they have been compressed for zarr. Not finished yet. I have added an indexer for getting the within-chunk start and nitems pairs. Still need to finish adding logic to core.Array._chunk_getitem to use the indexer, and to add tests. Also need to add a test for an example I found where I can get the PartialChunkIterator to fail. That test isn't added yet because I need to figure out the expected value. It fails with the below example array and slice tuple because it calculates the start value wrong if a middle dimension is 1. Also need to add tests that check whether the indexer works for 'F' order.
TODO: