Skip to content

Optimize setitem with chunk equal to fill_value, round 2 #738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 68 commits into from
Oct 19, 2021

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented May 11, 2021

Continuation of #367
From the original description:

Fixes #366

If a chunk only contains the fill_value, delete the key-value pair for that chunk instead of storing the array. Should cutdown on storage space require for the Array. Also should improve the performance of copying one Array to another.

While working with sparse datasets, it would great if we could avoid creating objects in storage for empty chunks. This PR implements this functionality, which is controlled via the boolean property Array._write_empty_chunks.

In this PR, the logic for detecting "empty" chunks is implemented in Array._process_for_setitem(), a function that prepares chunk data for writing. I added two levels of conditional execution to this function, the first checking if self._write_empty_chunks is False, and the second dispatching to a suitable routine for checking if the elements of a chunk are all equal to fill_value. Object arrays required a special handling here.

If _write_empty_chunks is False and the chunk is empty, _process_for_setitem() returns None instead of returning compressed bytes, and I added a corresponding check for None-ness in the downstream code.

The routines I added have performance implications. Every chunk write now requires evaluating two new conditionals -- the check for _write_empty_chunks, and the null check for the output of _process_for_setitem. More conditionals are evaluated for arrays with _write_empty_chunks set to False, and there is added computation required for checking each chunk for emptiness. I would love to avoid these conditionals, especially for the default case when we are not checking chunks for emptiness, so if anyone sees a way to do that, chime in :)

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

jakirkham and others added 10 commits December 16, 2018 01:29
Matches how these lines are written in `_set_basic_selection_zd`.
Add a simple check to see if the key-value pair is just being set with a
chunk equal to the fill value. If so, simply delete the key-value pair
instead of storing a chunk that only contains the fill value. The Array
will behave the same externally. However this will cutdown on the space
require to store the Array. Also will make sure that copying one Array
to another Array won't dramatically effect the storage size.
@codecov
Copy link

codecov bot commented May 11, 2021

Codecov Report

Merging #738 (3dd1afd) into master (d1dc987) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 3dd1afd differs from pull request most recent head 4a4adb1. Consider uploading reports for the commit 4a4adb1 to get more accurate results

@@           Coverage Diff            @@
##           master     #738    +/-   ##
========================================
  Coverage   99.94%   99.94%            
========================================
  Files          31       31            
  Lines       10871    11068   +197     
========================================
+ Hits        10865    11062   +197     
  Misses          6        6            
Impacted Files Coverage Δ
zarr/creation.py 100.00% <ø> (ø)
zarr/core.py 100.00% <100.00%> (ø)
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)
zarr/tests/test_sync.py 100.00% <100.00%> (ø)
zarr/tests/test_util.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)
zarr/meta.py 100.00% <0.00%> (ø)
... and 12 more

@d-v-b
Copy link
Contributor Author

d-v-b commented May 12, 2021

oh and I should add -- I considered a different logic for the "is the chunk empty" check. instead of comparing python objects (the chunk and the fill value), we could compress the chunk and compare the resulting bytes to the (cached) result of compressing a chunk-sized array of fill_values. The latter approach would use more memory but would avoid the special casing for object arrays.

@pep8speaks
Copy link

pep8speaks commented May 12, 2021

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-19 19:59:41 UTC

@d-v-b
Copy link
Contributor Author

d-v-b commented May 24, 2021

ping @jakirkham , would love to get your thoughts on this

I've been using this feature for my workplace and it's been extremely useful for managing large, sparse volumes. I think the broader community would benefit from it.

@jni
Copy link
Contributor

jni commented May 25, 2021

@zarr-developers/core-devs I'm 👍 on this change, and I would go even further and make the new default behaviour write_empty_chunks to be False. It will indeed change the output of the library but imho it is an implementation detail and conforms to the spec (afaik! 😅), so I think it's worth pulling the band-aid off. My own personal expectation was that zarr already did this!

@d-v-b I've made a minor comment on the code that should be an easy refactor. It would also be good if you could ensure that all the lines are covered, though I don't know what could exercise the except Exception clause.

Also, could you update this to match the latest master? Thanks!

@d-v-b
Copy link
Contributor Author

d-v-b commented May 25, 2021

Some new changes inspired by @jni's comments

  • _process_for_setitem no longer returns an encoded chunk. chunk encoding happens in the functions that perform storage.
  • new function _chunk_isempty, which checks if a chunk is empty. returns True if a chunk is empty, False otherwise
  • new function _chunk_delitem, which attempts to delete a chunk key from the store. returns True if the delete was successful or on KeyError, returns False on any other exception.

Still need to improve test coverage..

@meggart
Copy link
Member

meggart commented May 26, 2021

FWIW, I can just report from my experiences in the Julia Zarr implementation. In Zarr.jl, not writing empty chunks is already the default. Regarding performance implications, from some small tests that I have done it looked like the time spent for performing the isempty check was negligible compared to time spent in compression or even writing to disk, so I would be in favor of changing the default behavior in python as well.

Having this controlled by a keyword argument would have the disadvantage that users of downstream packages like xarray would not directly benefit from this change but probably some option propagation through xarray would be necessary.

zarr/core.py Outdated
# [np.array([True,True]), True, True]
is_empty = all(flatten(np.equal(chunk, self.fill_value, dtype='object')))
else:
is_empty = np.all(chunk == self._fill_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work properly if fill_value is NaN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! Do you have any advice for making that work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like this?

Suggested change
is_empty = np.all(chunk == self._fill_value)
is_empty = np.all(chunk == self._fill_value if np.isnan(self._fill_value) else np.isnan(chunk))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer
Copy link
Contributor

shoyer commented May 26, 2021

I agree with @jni and @meggart -- it is not clear to me why you would not want to do this. I think this could be the new default behavior.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 26, 2021

I found a potential performance issue at the key deletion step. The deletion routine in this PR does not use bulk delete APIs (e.g., via the delitems method on FSSpec.FSMap) for removing keys. I think this could be very inefficient.

@shoyer
Copy link
Contributor

shoyer commented May 26, 2021 via email

@d-v-b
Copy link
Contributor Author

d-v-b commented Oct 7, 2021

I think this is ready to go. I decided to keep the name write_empty_chunks because I felt that sparse_write ran the risk of making people think about sparse matrices or sparse array storage, which is another topic entirely. I also decided to set write_empty_chunks=True by default, because this minimizes any surprise people might experience with the new release. Maybe it's worth thinking about how to gracefully change this default to False if people find that desirable.

Copy link
Contributor

@jni jni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d-v-b thanks for the update. By my reading of the conversation, everyone agrees that write_empty_chunks=False should be the default, although, if you feel that this will get the PR in sooner, 🤷 that can indeed come in a separate PR. I do agree that this PR is now wholly uncontroversial.

Regarding sparse_write, imho this is similar to a scipy.sparse.bsr_matrix (block sparse row), so the analogy holds. But again, I don't feel strongly about this.

One minor suggestion (take it or leave it) is to make write_empty_chunks a keyword-only argument. This makes it easier to deprecate behaviour in the future.

Copy link
Member

@joshmoore joshmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, @jni! Very looking forward to getting this in. I've added a few questions none of them are really show stoppers. Knowing that others are waiting on this, I'm inclined to get it in and then we can have the small adjustments in follow ups if necessary.

@@ -86,9 +86,10 @@ def create_array(self, read_only=False, **kwargs):
kwargs.setdefault('compressor', Zlib(level=1))
cache_metadata = kwargs.pop('cache_metadata', True)
cache_attrs = kwargs.pop('cache_attrs', True)
write_empty_chunks = kwargs.pop('write_empty_chunks', True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, we may want to capture defaults like this True somewhere as a constant so it's easy to switch it in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. changing this required a lot of copy+paste that made me feel icky. I'm sure there's a cleaner way to do this without adding a lot of indirection in the test suite

self.chunk_store.setitems(to_store)

def _chunk_delitems(self, ckeys):
if hasattr(self.store, "delitems"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grlee77: does this work with the BaseStore refactoring?

@joshmoore
Copy link
Member

Pushed a changelog entry, @d-v-b. Please feel free to open a new PR with adjustments.

Leaving @grlee77 to evalute the impact on BaseStore post hoc. Targeting this for 2.11rc1 along with #725. Some of @grlee77's PR (BaseStore and metadata handling) are other candidates.

@joshmoore joshmoore merged commit 831e687 into zarr-developers:master Oct 19, 2021
@jakirkham
Copy link
Member

Yeah it might be worth creating an RC for these newer features to get them in the hands of users and iterate on anything that needs improvement

jni added a commit to jni/zarr-python that referenced this pull request Oct 20, 2021
In zarr-developers#738, we accidentally moved the 2.7 versionadded tag of
partial_decompress to the new write_empty_chunks argument. This commit
restores it to its rightful place and adds a 2.11 versionadded tag to
write_empty_chunks.
joshmoore pushed a commit that referenced this pull request Oct 20, 2021
In #738, we accidentally moved the 2.7 versionadded tag of
partial_decompress to the new write_empty_chunks argument. This commit
restores it to its rightful place and adds a 2.11 versionadded tag to
write_empty_chunks.
@joshmoore
Copy link
Member

I chickened out and pushed v2.11.0a1 ;) (all the more recent instable periods began with an alpha version)

@joshmoore
Copy link
Member

https://pypi.org/project/zarr/2.11.0a1/ is out for testing, @d-v-b & @jni. Not sure if you have any candidates you'd like to ping.

@jni
Copy link
Contributor

jni commented Oct 21, 2021

I chickened out and pushed v2.11.0a1

🐔

is out for testing

I can confirm that I can paint labels into zarr arrays in napari with 2.11.0a1! 🎉 So much fun. 😃

I'll let @d-v-b confirm that write_empty_chunks=False works well for him so you can merge #853 and release .a2, @joshmoore! 😝 😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize storing fill_value to Array
8 participants