
adds partial_decompress capabilities #584


Conversation

andrewfulton9
Contributor

Draft for adding the capability to partially decompress chunks when they have been compressed for zarr. Not finished yet. I have added an indexer for getting the within-chunk (start, nitems) pairs. I still need to finish adding logic to core.Array._chunk_getitem to use the indexer, and add tests. I also need to add a test for an example I found where I can get the PartialChunkIterator to fail. That test isn't added yet because I need to figure out the expected value: it fails with the below example array and slice tuple because it calculates the start value wrong when a middle dimension is 1. I also need to add tests that check whether the indexer works for 'F' order.

arr = np.arange(2, 100002).reshape((10, 1, 10000))
buff = codec.encode(arr)

slices = (slice(5, 8, 1), slice(2, 4, 1), slice(0, 5, 1))
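For what it's worth, the expected value for a case like this can be cross-checked against plain NumPy: with a middle dimension of size 1, slice(2, 4, 1) selects nothing, so the overall selection is empty. A sketch using the array and slices above:

```python
import numpy as np

# The failing example from the description. Note that the middle dimension
# has size 1, so arr[5:8, 2:4, 0:5] is actually an empty selection.
arr = np.arange(2, 100002).reshape((10, 1, 10000))
selection = (slice(5, 8, 1), slice(2, 4, 1), slice(0, 5, 1))

# slice.indices() clamps each slice to the dimension length, which gives an
# iterator-independent way to derive the expected per-dimension extents.
lengths = [len(range(*s.indices(d))) for s, d in zip(selection, arr.shape)]
print(lengths)              # [3, 0, 5] -> zero items selected overall
print(arr[selection].size)  # 0
```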

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@pep8speaks

pep8speaks commented Aug 11, 2020

Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1612:13: E116 unexpected indentation (comment)
Line 1791:82: W291 trailing whitespace
Line 1797:5: E129 visually indented line with same indent as next logical line
Line 1797:101: E501 line too long (106 > 100 characters)

Line 836:1: E302 expected 2 blank lines, found 1

Line 1326:8: E231 missing whitespace after ','
Line 1336:5: E124 closing bracket does not match visual indentation
Line 1341:5: E124 closing bracket does not match visual indentation
Line 1348:1: W391 blank line at end of file

Comment last updated at 2020-10-23 17:34:07 UTC

Comment on lines 1295 to 1298
((slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)),
[(5200, 200, (slice(0, 1, 1), slice(0, 2, 1))),
(6200, 200, (slice(1, 2, 1), slice(0, 2, 1))),
(7200, 200, (slice(2, 3, 1), slice(0, 2, 1)))]),
Contributor

So, trying to understand this: does

arr[5:8, 2:4, 0:100]

mean we need to read 200 items at positions 5200, 6200 and 7200?

Contributor Author

Yeah, really it's taking 1 item from the first dimension and 2 runs of 100 items from the second and third dimensions. So I wrote the code so that it takes 200 items, since they are consecutive in the compressed buffer anyway. The data just has to be reshaped before it is put into the chunk output array.
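To spell out the arithmetic behind those numbers (a sanity check against NumPy, assuming C order and a chunk of shape (100, 10, 100) as in the test case):

```python
import numpy as np

# For arr[5:8, 2:4, 0:100] in C order, [2:4, 0:100] is 2 * 100 = 200
# consecutive items, and each run starts at the flat offset of (i, 2, 0).
shape = (100, 10, 100)
starts = [int(np.ravel_multi_index((i, 2, 0), shape)) for i in range(5, 8)]
print(starts)  # [5200, 6200, 7200]
```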

(6200.0, 5.0, (slice(1, 2, 1), slice(0, 1, 1), slice(0, 5, 1))),
(6300.0, 5.0, (slice(1, 2, 1), slice(1, 2, 1), slice(0, 5, 1))),
(7200.0, 5.0, (slice(2, 3, 1), slice(0, 1, 1), slice(0, 5, 1))),
(7300.0, 5.0, (slice(2, 3, 1), slice(1, 2, 1), slice(0, 5, 1)))]),
Contributor

Floats are weird; the output should only contain ints, no?

Contributor Author

Yeah, I think the np.prod call used to generate the nitems is making that a float. I'll update that
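Consistent with that guess: np.prod over an empty shape tuple returns the float 1.0, which would then propagate through the start/nitems arithmetic. A minimal check:

```python
import numpy as np

# np.prod of an empty sequence returns a float64 scalar, not a Python int.
nitems = np.prod(())
print(nitems)             # 1.0
print(int(np.prod(())))   # 1 -- an explicit cast keeps the output integral
```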

if slice_nitems == dim_shape:
self.selection.pop()
else:
break
Contributor

If I understand this block of code correctly: here we seem to be looking for dimensions where we select the whole thing. We start from the end of selection and shape, and while there is no selection, or the selection covers 100% of the dimension, we pop.

Is that correct?

Contributor Author

That's right. I do this to maximize nitems and minimize the number of partial decompressions called. This logic is what gets to the 200 nitems in the test example above.
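The pop-while-full loop described above can be sketched as a standalone helper (hypothetical code, not the PR's implementation): trailing dimensions that are selected in full are collapsed into the per-read item count, so each remaining outer index maps to one contiguous run.

```python
def collapse_trailing(selection, shape):
    """Drop trailing dimensions whose slice covers the whole dimension,
    accumulating their sizes into a contiguous per-read item count."""
    selection, shape = list(selection), list(shape)
    nitems = 1
    while selection:
        sl, dim = selection[-1], shape[-1]
        if len(range(*sl.indices(dim))) == dim:
            nitems *= dim
            selection.pop()
            shape.pop()
        else:
            break
    return selection, nitems

# The test example above: the last dimension (0:100 of 100) collapses, and the
# remaining step-1 slice 2:4 then extends each read to 2 * 100 = 200 items.
sel, nitems = collapse_trailing(
    (slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)), (100, 10, 100))
print(sel, nitems)  # [slice(5, 8, 1), slice(2, 4, 1)] 100
```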

zarr/indexing.py Outdated
Comment on lines 863 to 865
for i, x in enumerate(range(*sl.indices(len(self.arr)))):
dim_out_slices.append(slice(i, i+1, 1))
dim_chunk_loc_slices.append(slice(x, x+1, 1))
Contributor

IIUC, here you are computing each individual 1-element-wide selection across each axis, both in the origin chunk coordinate system and in the output version, right?

Would/could this be simpler if the step of sl is 1? Because in that case you have a contiguous sl, right? Or did I misunderstand?

Not a request to change; it can be an optimisation for later.

Contributor Author

That's right, except for the slice of the last dimension when that slice has a step of 1. If that slice has a step of more than one, then it is included here, though if that's the last dimension of the chunk it would be really inefficient, since nitems would be 1. If the slice of the last dimension has a step of one, then it covers more than one element and the output slice is calculated by stop - start.
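The distinction in the reply above can be sketched as follows (hypothetical helper, not the PR's code): a step-1 slice on the last dimension yields one contiguous read of stop - start items, while any larger step degenerates into 1-item reads.

```python
def last_dim_reads(sl, dim):
    """Return (offset, nitems) pairs for a slice over the last dimension."""
    start, stop, step = sl.indices(dim)
    if step == 1:
        return [(start, stop - start)]                     # one contiguous read
    return [(i, 1) for i in range(start, stop, step)]      # many 1-item reads

print(last_dim_reads(slice(0, 100, 1), 100))  # [(0, 100)]
print(last_dim_reads(slice(0, 10, 3), 100))   # [(0, 1), (3, 1), (6, 1), (9, 1)]
```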

andrewfulton9 and others added 2 commits August 11, 2020 15:48
Co-authored-by: Matthias Bussonnier <[email protected]>
Co-authored-by: Matthias Bussonnier <[email protected]>
zarr/indexing.py Outdated
chunk_loc_slices.append([last_dim_slice])

self.out_slices = itertools.product(*out_slices)
self.chunk_loc_slices = itertools.product(*chunk_loc_slices)
Contributor

And for myself: out_slices are the slices to assign elements in the output array, chunk_loc_slices the indices of where to read in the original chunk.

Contributor Author

That's right.

@Carreau
Contributor

Carreau commented Aug 11, 2020

You may want to also test with negative indexing, and negative steps just in case.
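One way to build such cases: slice.indices() normalizes negative starts, stops, and steps against the dimension length, giving a plain-Python reference for what the iterator should see.

```python
dim = 10
cases = [slice(-3, None, 1), slice(None, None, -1), slice(-8, -2, 2)]
# slice.indices(dim) returns a normalized (start, stop, step) for that length.
normalized = [sl.indices(dim) for sl in cases]
print(normalized)  # [(7, 10, 1), (9, -1, -1), (2, 8, 2)]
```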


@pytest.mark.parametrize('selection, arr, expected', [
((slice(5, 8, 1), slice(2, 4, 1), slice(0, 100, 1)),
np.arange(2, 100_002).reshape((100, 10, 100)),
Contributor

Python 3.5 won't like those... we can argue to drop Python 3.5 I think

Contributor Author

The underscores in the numbers, you mean? I can just take them out. I didn't realize they weren't supported in all the Python 3 versions (underscore digit separators were added in 3.6).

Co-authored-by: Matthias Bussonnier <[email protected]>
@joshmoore
Member

Hi @andrewfulton9. Master is now conflicting with this branch. Do you have a moment to take a look at the conflicts?

@andrewfulton9
Contributor Author

Hey @joshmoore. I think this PR should actually be closed. I used the work here for the partial read capabilities that were merged into master in PR #667.
