Optimize setitem with chunk equal to fill_value, round 2 #738
Conversation
Matches how these lines are written in `_set_basic_selection_zd`.
Add a simple check to see whether the chunk being written is equal to the fill value. If so, simply delete the key-value pair instead of storing a chunk that only contains the fill value. The Array will behave the same externally, but this cuts down on the space required to store the Array. It also ensures that copying one Array to another won't dramatically affect the storage size.
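The core idea fits in a few lines. A minimal sketch, assuming a dict-like chunk store and an `encode` step for compression (`store`, `ckey`, and `encode` are illustrative names, not the PR's actual code):

```python
import numpy as np

def maybe_store_chunk(store, ckey, chunk, fill_value, encode):
    # An all-fill chunk carries no information: reads of a missing key
    # already return fill_value, so deleting the key is equivalent to
    # storing the chunk, and much cheaper.
    # (NaN fill values need a separate isnan check, as discussed below.)
    if fill_value is not None and np.all(chunk == fill_value):
        store.pop(ckey, None)  # tolerate keys that were never written
    else:
        store[ckey] = encode(chunk)
```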
Codecov Report

```
@@            Coverage Diff            @@
##           master     #738    +/-   ##
=========================================
  Coverage   99.94%   99.94%
=========================================
  Files          31       31
  Lines       10871    11068    +197
=========================================
+ Hits        10865    11062    +197
  Misses          6        6
```
Oh, and I should add -- I considered a different logic for the "is the chunk empty" check. Instead of comparing Python objects (the chunk and the fill value), we could compress the chunk and compare the resulting bytes to the (cached) result of compressing a chunk-sized array of fill values.
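A rough sketch of that alternative, assuming a numcodecs-style codec with an `encode` method; the cache keyed on `repr(compressor)` is an assumption of the sketch, not the PR's code:

```python
import numpy as np

_fill_bytes_cache = {}

def compressed_fill_chunk(compressor, shape, dtype, fill_value):
    # Cache the bytes produced by compressing an all-fill chunk, keyed
    # by codec config plus chunk geometry (repr keeps the key hashable).
    key = (repr(compressor), shape, str(dtype), fill_value)
    if key not in _fill_bytes_cache:
        _fill_bytes_cache[key] = compressor.encode(
            np.full(shape, fill_value, dtype=dtype)
        )
    return _fill_bytes_cache[key]

def chunk_is_empty(compressor, cdata, shape, dtype, fill_value):
    # Compare compressed bytes rather than decoded values. This is only
    # sound if the compressor is deterministic for identical input,
    # which is one reason the element-wise comparison was preferred.
    return bytes(cdata) == bytes(
        compressed_fill_chunk(compressor, shape, dtype, fill_value)
    )
```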
Ping @jakirkham, would love to get your thoughts on this. I've been using this feature at my workplace and it's been extremely useful for managing large, sparse volumes. I think the broader community would benefit from it.
@zarr-developers/core-devs I'm 👍 on this change, and I would go even further and make the new behaviour (`write_empty_chunks=False`) the default.

@d-v-b I've made a minor comment on the code that should be an easy refactor. It would also be good if you could ensure that all the lines are covered, though I don't know what could exercise some of them. Also, could you update this to match the latest master? Thanks!
Some new changes inspired by @jni's comments. Still need to improve test coverage.
FWIW, I can just report from my experience with the Julia Zarr implementation. In Zarr.jl, not writing empty chunks is already the default. Regarding performance implications, from some small tests that I have done, it looked like the time spent performing the empty-chunk check was negligible.

Having this controlled by a keyword argument would have the disadvantage that users of downstream packages like xarray would not directly benefit from this change; probably some option propagation through xarray would be necessary.
In `zarr/core.py` (outdated diff):

```python
    # object dtype: element-wise comparison yields a ragged result,
    # e.g. [np.array([True, True]), True, True], so flatten before all()
    is_empty = all(flatten(np.equal(chunk, self.fill_value, dtype='object')))
else:
    is_empty = np.all(chunk == self._fill_value)
```
Does this work properly if `fill_value` is NaN?
Nope! Do you have any advice for making that work?
Maybe something like this?
```python
# current
is_empty = np.all(chunk == self._fill_value)
# suggested (NaN-aware)
is_empty = np.all(np.isnan(chunk) if np.isnan(self._fill_value) else chunk == self._fill_value)
```
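For context, the reason the plain equality check fails: NaN compares unequal to everything, including itself.

```python
import numpy as np

chunk = np.full((2, 2), np.nan)

# Plain equality never matches NaN, so an all-NaN chunk looks non-empty.
print(np.all(chunk == np.nan))   # False
# An explicit isnan test is needed when the fill value is NaN.
print(np.all(np.isnan(chunk)))   # True
```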
"np.nan
as a fill value" is now supported: https://github.com/d-v-b/zarr-python/blob/opt_setitem_fill_value/zarr/util.py#L664
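A rough sketch of what such a NaN-tolerant comparison has to cover (not the actual `zarr.util.all_equal` implementation linked above):

```python
import numpy as np

def all_equal(value, array):
    # Does every element of `array` equal `value`, treating NaN == NaN?
    if value is None:
        return False
    if isinstance(value, float) and np.isnan(value):
        # NaN != NaN, so equality can't be used for a NaN fill value.
        return array.dtype.kind == 'f' and bool(np.all(np.isnan(array)))
    return bool(np.all(array == value))
```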
I found a potential performance issue at the key-deletion step. The deletion routine in this PR does not use bulk delete APIs (e.g., via the `delitems` method on fsspec's `FSMap`) for removing keys. I think this could be very inefficient.
That does sound potentially problematic!
I think this is ready to go. I decided to keep the name `write_empty_chunks`.
@d-v-b thanks for the update. By my reading of the conversation, everyone agrees that `write_empty_chunks=False` should be the default, although, if you feel that this will get the PR in sooner, 🤷 that can indeed come in a separate PR. I do agree that this PR is now wholly uncontroversial.

Regarding `sparse_write`, imho this is similar to a `scipy.sparse.bsr_matrix` (block sparse row), so the analogy holds. But again, I don't feel strongly about this.

One minor suggestion (take it or leave it) is to make `write_empty_chunks` a keyword-only argument. This makes it easier to deprecate behaviour in the future.
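For reference, that only requires a `*` in the signature; a trimmed, hypothetical sketch of the constructor:

```python
class Array:
    # '*' makes write_empty_chunks keyword-only: Array(store, True) is a
    # TypeError, so call sites stay explicit and easier to deprecate.
    def __init__(self, store, *, write_empty_chunks=True):
        self.store = store
        self._write_empty_chunks = write_empty_chunks
```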
Thanks for the review, @jni! Very much looking forward to getting this in. I've added a few questions; none of them are really show-stoppers. Knowing that others are waiting on this, I'm inclined to get it in, and then we can make the small adjustments in follow-ups if necessary.
```diff
@@ -86,9 +86,10 @@ def create_array(self, read_only=False, **kwargs):
     kwargs.setdefault('compressor', Zlib(level=1))
     cache_metadata = kwargs.pop('cache_metadata', True)
     cache_attrs = kwargs.pop('cache_attrs', True)
+    write_empty_chunks = kwargs.pop('write_empty_chunks', True)
```
As a side note, we may want to capture defaults like this `True` somewhere as a constant so it's easy to switch in one place.
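Something along these lines would do it; the constant name and its location are hypothetical:

```python
# e.g. in a shared module, imported wherever the default is needed
DEFAULT_WRITE_EMPTY_CHUNKS = True

# call sites then reference the constant instead of repeating the literal:
write_empty_chunks = kwargs.pop('write_empty_chunks', DEFAULT_WRITE_EMPTY_CHUNKS)
```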
Yes. Changing this required a lot of copy+paste that made me feel icky. I'm sure there's a cleaner way to do this without adding a lot of indirection in the test suite.
```python
        self.chunk_store.setitems(to_store)

    def _chunk_delitems(self, ckeys):
        if hasattr(self.store, "delitems"):
```
@grlee77: does this work with the `BaseStore` refactoring?
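For context, the shape under discussion is roughly this; the fallback branch and its error handling are illustrative, not the PR's exact code:

```python
def _chunk_delitems(self, ckeys):
    if hasattr(self.store, "delitems"):
        # Bulk path: e.g. fsspec's FSMap.delitems can batch deletions
        # into far fewer round trips than one request per key.
        self.store.delitems(ckeys)
    else:
        # Fallback: delete keys one at a time; tolerate missing keys so
        # deleting an already-absent chunk is a no-op.
        for ckey in ckeys:
            try:
                del self.store[ckey]
            except KeyError:
                pass
```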
Yeah, it might be worth creating an RC for these newer features to get them in the hands of users and iterate on anything that needs improvement.
In #738, we accidentally moved the 2.7 versionadded tag of partial_decompress to the new write_empty_chunks argument. This commit restores it to its rightful place and adds a 2.11 versionadded tag to write_empty_chunks.
I chickened out and pushed `write_empty_chunks=True` as the default.
https://pypi.org/project/zarr/2.11.0a1/ is out for testing, @d-v-b & @jni. Not sure if you have any candidates you'd like to ping.
🐔
I can confirm that I can paint labels into zarr arrays in napari with 2.11.0a1! 🎉 So much fun. 😃 I'll let @d-v-b confirm the rest.
Continuation of #367

From the original description:

While working with sparse datasets, it would be great if we could avoid creating objects in storage for empty chunks. This PR implements this functionality, which is controlled via the boolean property `Array._write_empty_chunks`.

In this PR, the logic for detecting "empty" chunks is implemented in `Array._process_for_setitem()`, a function that prepares chunk data for writing. I added two levels of conditional execution to this function, the first checking if `self._write_empty_chunks` is `False`, and the second dispatching to a suitable routine for checking if the elements of a chunk are all equal to `fill_value`. Object arrays required special handling here.

If `_write_empty_chunks` is `False` and the chunk is empty, `_process_for_setitem()` returns `None` instead of returning compressed bytes, and I added a corresponding check for `None`-ness in the downstream code.

The routines I added have performance implications. Every chunk write now requires evaluating two new conditionals: the check for `_write_empty_chunks`, and the null check for the output of `_process_for_setitem`. More conditionals are evaluated for arrays with `_write_empty_chunks` set to `False`, and there is added computation required for checking each chunk for emptiness. I would love to avoid these conditionals, especially for the default case when we are not checking chunks for emptiness, so if anyone sees a way to do that, chime in :) (A sketch of this control flow is included below, after the TODO.)

TODO:
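A minimal sketch of the control flow described above; `_chunk_is_empty`, `_encode_chunk`, and the call-site names are illustrative, not the PR's exact code:

```python
def _process_for_setitem(self, ckey, chunk):
    # Returning None is the sentinel that tells the caller to delete
    # the key instead of writing compressed bytes.
    if not self._write_empty_chunks and self._chunk_is_empty(chunk):
        return None
    return self._encode_chunk(chunk)

def _chunk_setitem(self, ckey, chunk):
    cdata = self._process_for_setitem(ckey, chunk)
    if cdata is None:
        # Empty chunk: remove any previously stored value for this key.
        self._chunk_delitems([ckey])
    else:
        self.chunk_store[ckey] = cdata
```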