Optimize setitem with chunk equal to fill_value, round 2 #738

d-v-b · 2021-05-11T20:56:50Z

Continuation of #367
From the original description:

Fixes #366

If a chunk contains only the fill_value, delete the key-value pair for that chunk instead of storing the array. This should cut down on the storage space required for the Array. It should also improve the performance of copying one Array to another.

While working with sparse datasets, it would be great if we could avoid creating objects in storage for empty chunks. This PR implements this functionality, which is controlled via the boolean property Array._write_empty_chunks.

In this PR, the logic for detecting "empty" chunks is implemented in Array._process_for_setitem(), a function that prepares chunk data for writing. I added two levels of conditional execution to this function: the first checks whether self._write_empty_chunks is False, and the second dispatches to a suitable routine for checking whether the elements of a chunk are all equal to fill_value. Object arrays required special handling here.

If _write_empty_chunks is False and the chunk is empty, _process_for_setitem() returns None instead of returning compressed bytes, and I added a corresponding check for None-ness in the downstream code.
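The control flow described above can be sketched in a few lines. This is only a toy model under stated assumptions, not the code in zarr/core.py: the store is a plain dict, chunks are lists of floats, and the helper names (`encode_chunk`, `process_for_setitem`, `chunk_setitem`) are made up for illustration.

```python
import zlib
from array import array


def encode_chunk(values):
    """Pack floats as float64 and compress them (a stand-in for the real codec pipeline)."""
    return zlib.compress(array("d", values).tobytes())


def process_for_setitem(values, fill_value, write_empty_chunks):
    """Return encoded bytes, or None to signal that the chunk is empty and should not be stored."""
    if not write_empty_chunks and all(v == fill_value for v in values):
        return None
    return encode_chunk(values)


def chunk_setitem(store, key, values, fill_value=0.0, write_empty_chunks=True):
    """Write one chunk, handling the None sentinel downstream of the processing step."""
    cdata = process_for_setitem(values, fill_value, write_empty_chunks)
    if cdata is None:
        # Empty chunk: drop any stale object instead of writing a new one.
        store.pop(key, None)
    else:
        store[key] = cdata


store = {}
chunk_setitem(store, "0.0", [1.0, 2.0], write_empty_chunks=False)
chunk_setitem(store, "0.1", [0.0, 0.0], write_empty_chunks=False)
print(sorted(store))  # ['0.0']

# Overwriting an existing chunk with fill values removes its key.
chunk_setitem(store, "0.0", [0.0, 0.0], write_empty_chunks=False)
print(sorted(store))  # []
```

With `write_empty_chunks=True` the emptiness test is skipped entirely, which mirrors the goal of keeping the default path cheap.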

The routines I added have performance implications. Every chunk write now requires evaluating two new conditionals -- the check for _write_empty_chunks, and the null check for the output of _process_for_setitem. More conditionals are evaluated for arrays with _write_empty_chunks set to False, and there is added computation required for checking each chunk for emptiness. I would love to avoid these conditionals, especially for the default case when we are not checking chunks for emptiness, so if anyone sees a way to do that, chime in :)

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

Matches how these lines are written in `_set_basic_selection_zd`.

Add a simple check to see if the key-value pair is just being set with a chunk equal to the fill value. If so, simply delete the key-value pair instead of storing a chunk that only contains the fill value. The Array will behave the same externally. However, this will cut down on the space required to store the Array. It will also make sure that copying one Array to another Array won't dramatically affect the storage size.

codecov · 2021-05-11T21:03:29Z

Codecov Report

Merging #738 (3dd1afd) into master (d1dc987) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 3dd1afd differs from pull request most recent head 4a4adb1. Consider uploading reports for the commit 4a4adb1 to get more accurate results

@@           Coverage Diff            @@
##           master     #738    +/-   ##
========================================
  Coverage   99.94%   99.94%            
========================================
  Files          31       31            
  Lines       10871    11068   +197     
========================================
+ Hits        10865    11062   +197     
  Misses          6        6

Impacted Files	Coverage Δ
zarr/creation.py	`100.00% <ø> (ø)`
zarr/core.py	`100.00% <100.00%> (ø)`
zarr/storage.py	`100.00% <100.00%> (ø)`
zarr/tests/test_core.py	`100.00% <100.00%> (ø)`
zarr/tests/test_storage.py	`100.00% <100.00%> (ø)`
zarr/tests/test_sync.py	`100.00% <100.00%> (ø)`
zarr/tests/test_util.py	`100.00% <100.00%> (ø)`
zarr/util.py	`100.00% <100.00%> (ø)`
zarr/meta.py	`100.00% <0.00%> (ø)`
... and 12 more

d-v-b · 2021-05-12T14:38:19Z

oh and I should add -- I considered a different logic for the "is the chunk empty" check. instead of comparing python objects (the chunk and the fill value), we could compress the chunk and compare the resulting bytes to the (cached) result of compressing a chunk-sized array of fill_values. The latter approach would use more memory but would avoid the special casing for object arrays.
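A rough sketch of that alternative, for comparison. It is not what the PR implements; it assumes a deterministic codec (zlib here), float64 chunks stored as lists, and hypothetical helper names.

```python
import zlib
from array import array
from functools import lru_cache


def encode(values):
    """Serialize floats as float64 and compress them with zlib."""
    return zlib.compress(array("d", values).tobytes())


@lru_cache(maxsize=None)
def encoded_fill_chunk(fill_value, n_items):
    """Compressed bytes of a chunk made only of fill_value, cached per (fill_value, size)."""
    return encode([fill_value] * n_items)


def is_empty_by_bytes(values, fill_value):
    """Compare compressed bytes instead of comparing Python objects."""
    return encode(values) == encoded_fill_chunk(fill_value, len(values))


print(is_empty_by_bytes([0.0] * 8, 0.0))        # True
print(is_empty_by_bytes([0.0, 0.0, 1.0], 0.0))  # False
```

One consequence of comparing bytes is that equality becomes bit-pattern equality: NaN values with identical bits compare equal, while `-0.0` and `0.0` do not.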

pep8speaks · 2021-05-12T20:34:19Z

Hello @d-v-b! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-19 19:59:41 UTC

d-v-b · 2021-05-24T14:15:50Z

ping @jakirkham , would love to get your thoughts on this

I've been using this feature for my workplace and it's been extremely useful for managing large, sparse volumes. I think the broader community would benefit from it.

jni · 2021-05-25T00:45:47Z

@zarr-developers/core-devs I'm 👍 on this change, and I would go even further and make the new default behaviour write_empty_chunks to be False. It will indeed change the output of the library but imho it is an implementation detail and conforms to the spec (afaik! 😅), so I think it's worth pulling the band-aid off. My own personal expectation was that zarr already did this!

@d-v-b I've made a minor comment on the code that should be an easy refactor. It would also be good if you could ensure that all the lines are covered, though I don't know what could exercise the except Exception clause.

Also, could you update this to match the latest master? Thanks!

d-v-b · 2021-05-25T18:18:34Z

Some new changes inspired by @jni's comments

_process_for_setitem no longer returns an encoded chunk. chunk encoding happens in the functions that perform storage.
new function _chunk_isempty, which checks if a chunk is empty. returns True if a chunk is empty, False otherwise
new function _chunk_delitem, which attempts to delete a chunk key from the store. returns True if the delete was successful or on KeyError, returns False on any other exception.
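The return contract described for _chunk_delitem can be illustrated as follows. This is a minimal sketch of the stated behaviour, assuming a mapping-like store; the `ReadOnlyStore` class is hypothetical and exists only to trigger the failure branch.

```python
def chunk_delitem(store, key):
    """Try to delete key from store.

    Returns True if the key was deleted or was already absent (KeyError),
    and False if the store raised any other exception.
    """
    try:
        del store[key]
    except KeyError:
        # Nothing stored under this key, so the chunk is already "deleted".
        return True
    except Exception:
        return False
    return True


class ReadOnlyStore(dict):
    """Hypothetical store that refuses deletions."""

    def __delitem__(self, key):
        raise PermissionError("store is read-only")


store = {"0.0": b"data"}
print(chunk_delitem(store, "0.0"))  # True, key removed
print(chunk_delitem(store, "0.0"))  # True, key was already missing
print(chunk_delitem(ReadOnlyStore({"0.0": b"data"}), "0.0"))  # False
```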

Still need to improve test coverage..

meggart · 2021-05-26T10:17:04Z

FWIW, I can just report from my experiences in the Julia Zarr implementation. In Zarr.jl, not writing empty chunks is already the default. Regarding performance implications, from some small tests that I have done it looked like the time spent for performing the isempty check was negligible compared to time spent in compression or even writing to disk, so I would be in favor of changing the default behavior in python as well.

Having this controlled by a keyword argument would have the disadvantage that users of downstream packages like xarray would not directly benefit from this change but probably some option propagation through xarray would be necessary.

shoyer · 2021-05-26T17:00:28Z

zarr/core.py

+            # [np.array([True,True]), True, True]
+            is_empty = all(flatten(np.equal(chunk, self.fill_value, dtype='object')))
+        else:
+            is_empty = np.all(chunk == self._fill_value)


does this work properly if fill_value is NaN?

d-v-b · 2021-05-26

Nope! Do you have any advice for making that work?

jakirkham · 2021-08-18

Maybe something like this?

Suggested change

-            is_empty = np.all(chunk == self._fill_value)
+            is_empty = np.all(chunk == self._fill_value if np.isnan(self._fill_value) else np.isnan(chunk))

d-v-b · 2021-10-07

"np.nan as a fill value" is now supported: https://github.com/d-v-b/zarr-python/blob/opt_setitem_fill_value/zarr/util.py#L664

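The NaN question comes down to the fact that NaN never compares equal to itself, so a plain equality test reports a NaN-filled chunk as non-empty. The sketch below shows the branching in NumPy-free form, with chunks as lists of floats; it is an illustration, not the helper linked above. Note that the one-line suggestion in this thread takes the `np.isnan(chunk)` branch when the fill value is not NaN, so its two branches appear to be swapped.

```python
import math


def all_equal_to_fill(values, fill_value):
    """True if every element equals fill_value, treating NaN as equal to NaN."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        # NaN != NaN, so test for NaN-ness instead of equality.
        return all(isinstance(v, float) and math.isnan(v) for v in values)
    return all(v == fill_value for v in values)


nan = float("nan")
print(all(v == nan for v in [nan, nan]))    # False: the naive check misses NaN chunks
print(all_equal_to_fill([nan, nan], nan))   # True
print(all_equal_to_fill([nan, 1.0], nan))   # False
print(all_equal_to_fill([0.0, 0.0], 0.0))   # True
```

With NumPy arrays the NaN branch would be `np.isnan(chunk).all()`, and `np.array_equal` accepts an `equal_nan=True` argument in NumPy 1.19 and later.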
shoyer · 2021-05-26T17:05:05Z

I agree with @jni and @meggart -- it is not clear to me why you would not want to do this. I think this could be the new default behavior.

d-v-b · 2021-05-26T17:33:22Z

I found a potential performance issue at the key deletion step. The deletion routine in this PR does not use bulk delete APIs (e.g., via the delitems method on FSSpec.FSMap) for removing keys. I think this could be very inefficient.

shoyer · 2021-05-26T18:00:44Z

That does sound potentially problematic!


d-v-b · 2021-10-07T23:30:15Z

I think this is ready to go. I decided to keep the name write_empty_chunks because I felt that sparse_write ran the risk of making people think about sparse matrices or sparse array storage, which is another topic entirely. I also decided to set write_empty_chunks=True by default, because this minimizes any surprise people might experience with the new release. Maybe it's worth thinking about how to gracefully change this default to False if people find that desirable.

jni

@d-v-b thanks for the update. By my reading of the conversation, everyone agrees that write_empty_chunks=False should be the default, although, if you feel that this will get the PR in sooner, 🤷 that can indeed come in a separate PR. I do agree that this PR is now wholly uncontroversial.

Regarding sparse_write, imho this is similar to a scipy.sparse.bsr_matrix (block sparse row), so the analogy holds. But again, I don't feel strongly about this.

One minor suggestion (take it or leave it) is to make write_empty_chunks a keyword-only argument. This makes it easier to deprecate behaviour in the future.
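To illustrate why a keyword-only argument helps here: callers must name the option, so its position can never be relied on, and a sentinel default makes it possible to warn only those who did not choose a value. The function below is a hypothetical stand-in, not zarr's creation API, and the warning text is invented for the sketch.

```python
import warnings

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit True/False


def create_array(shape, *, write_empty_chunks=_UNSET):
    """Hypothetical creator; the bare * makes write_empty_chunks keyword-only."""
    if write_empty_chunks is _UNSET:
        warnings.warn(
            "the default for write_empty_chunks may change; pass it explicitly",
            FutureWarning,
            stacklevel=2,
        )
        write_empty_chunks = True
    return {"shape": shape, "write_empty_chunks": write_empty_chunks}


arr = create_array((100, 100), write_empty_chunks=False)
print(arr["write_empty_chunks"])  # False

# create_array((100, 100), False) would raise TypeError,
# because the option cannot be passed positionally.
```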

joshmoore

Thanks for the review, @jni! Very much looking forward to getting this in. I've added a few questions, none of which are really show stoppers. Knowing that others are waiting on this, I'm inclined to get it in and then we can make the small adjustments in follow-ups if necessary.

joshmoore · 2021-10-08T06:14:33Z

zarr/tests/test_core.py

@@ -86,9 +86,10 @@ def create_array(self, read_only=False, **kwargs):
        kwargs.setdefault('compressor', Zlib(level=1))
        cache_metadata = kwargs.pop('cache_metadata', True)
        cache_attrs = kwargs.pop('cache_attrs', True)
+        write_empty_chunks = kwargs.pop('write_empty_chunks', True)


As a side note, we may want to capture defaults like this True somewhere as a constant so it's easy to switch it in one place.

d-v-b · 2021-10-08

yes. changing this required a lot of copy+paste that made me feel icky. I'm sure there's a cleaner way to do this without adding a lot of indirection in the test suite
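One possible shape for that idea, sketched with hypothetical names (`DEFAULT_WRITE_EMPTY_CHUNKS` and `pop_array_options` are not in the PR): keep the default in one constant and have each test helper pull its options through a shared function.

```python
# Hypothetical module-level constant; flipping the default means editing one line.
DEFAULT_WRITE_EMPTY_CHUNKS = True


def pop_array_options(kwargs):
    """Remove array-level options from kwargs and return them with shared defaults."""
    return {
        "cache_metadata": kwargs.pop("cache_metadata", True),
        "cache_attrs": kwargs.pop("cache_attrs", True),
        "write_empty_chunks": kwargs.pop(
            "write_empty_chunks", DEFAULT_WRITE_EMPTY_CHUNKS
        ),
    }


kwargs = {"compressor": None, "write_empty_chunks": False}
options = pop_array_options(kwargs)
print(options["write_empty_chunks"])  # False
print(kwargs)                         # {'compressor': None}
```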

joshmoore · 2021-10-08T06:19:21Z

zarr/core.py

+        self.chunk_store.setitems(to_store)
+
+    def _chunk_delitems(self, ckeys):
+        if hasattr(self.store, "delitems"):


@grlee77: does this work with the BaseStore refactoring?
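The `hasattr(self.store, "delitems")` check in the diff above is a duck-typing dispatch: use a bulk delete when the store offers one, otherwise fall back to deleting keys one at a time. A minimal sketch of that pattern follows, assuming mapping-like stores; `BulkStore` is a hypothetical class that counts bulk calls, and the body is not copied from the PR.

```python
class BulkStore(dict):
    """Hypothetical store with a bulk delete method that counts its calls."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bulk_calls = 0

    def delitems(self, keys):
        self.bulk_calls += 1
        for key in keys:
            self.pop(key, None)


def chunk_delitems(store, keys):
    """Delete many chunk keys, preferring a single bulk call when available."""
    keys = list(keys)
    if hasattr(store, "delitems"):
        store.delitems(keys)
    else:
        for key in keys:
            try:
                del store[key]
            except KeyError:
                pass  # a missing chunk is already in the desired state


bulk = BulkStore({"0.0": b"a", "0.1": b"b", "1.0": b"c"})
chunk_delitems(bulk, ["0.0", "0.1", "9.9"])
print(sorted(bulk), bulk.bulk_calls)  # ['1.0'] 1

plain = {"0.0": b"a"}
chunk_delitems(plain, ["0.0", "9.9"])
print(plain)  # {}
```

For remote stores the difference matters because each individual delete may be a separate network request, which is the inefficiency raised earlier in the thread.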

joshmoore · 2021-10-19T20:01:55Z

Pushed a changelog entry, @d-v-b. Please feel free to open a new PR with adjustments.

Leaving @grlee77 to evaluate the impact on BaseStore post hoc. Targeting this for 2.11rc1 along with #725. Some of @grlee77's PRs (BaseStore and metadata handling) are other candidates.

jakirkham · 2021-10-19T20:04:05Z

Yeah it might be worth creating an RC for these newer features to get them in the hands of users and iterate on anything that needs improvement

In #738, we accidentally moved the 2.7 versionadded tag of partial_decompress to the new write_empty_chunks argument. This commit restores it to its rightful place and adds a 2.11 versionadded tag to write_empty_chunks.

joshmoore · 2021-10-20T09:25:38Z

I chickened out and pushed v2.11.0a1 ;) (all the more recent unstable periods began with an alpha version)

joshmoore · 2021-10-20T09:33:08Z

https://pypi.org/project/zarr/2.11.0a1/ is out for testing, @d-v-b & @jni. Not sure if you have any candidates you'd like to ping.

jni · 2021-10-21T00:32:53Z

> I chickened out and pushed v2.11.0a1

🐔

> is out for testing

I can confirm that I can paint labels into zarr arrays in napari with 2.11.0a1! 🎉 So much fun. 😃

I'll let @d-v-b confirm that write_empty_chunks=False works well for him so you can merge #853 and release .a2, @joshmoore! 😝 😂

jakirkham and others added 10 commits December 16, 2018 01:29

Consolidate encode/store in _chunk_setitem_nosync

8153810

Matches how these lines are written in `_set_basic_selection_zd`.

don't set cdata in _process_for_setitem

750d696

set empty chunk write behavior via array constructor

6ac2349

add rudimentary tests, np.equal -> np.array_equal

eb36713

add test for chunk deletion

053ad4c

add flattening function

3375bf0

add kwarg for empty writes to array creators

30c3a30

fix check for chunk equality to fill value

d2fc396

flake8

e4e4012

d-v-b added 2 commits May 12, 2021 14:58

add None check to setitems

bd27b9a

add write_empty_chunks to output of __getstate__

814d009

d-v-b added 3 commits May 12, 2021 16:47

flake8

769f5a6

add partial decompress to __get_state__

cd56b35

Merge branch 'master' into opt_setitem_fill_value

bcbaac4

jni reviewed May 25, 2021


d-v-b added 3 commits May 25, 2021 11:19

Merge branch 'master' into opt_setitem_fill_value

9096f2c

functionalize emptiness checks and key deletion

044a9b8

flake8

74e0852

shoyer reviewed May 26, 2021


d-v-b added 8 commits October 4, 2021 11:15

correctly handle merge from upstream master

1c29fe8

don't use os.path.join for constructing a chunk key; instead use _chunk_key method

710b875

complete removal of os.path.join calls

8a06884

add coverage exemption to type error branch in all_equal

7f859c3

remove unreachable conditionals in n5 tests

0a7a3cc

instantiate ReadOnlyError

1a0f41c

add explcit delitems and setitems calls to readonly fsstore tests

94d5d0a

Update docstrings

a918f1d

jni approved these changes Oct 8, 2021


joshmoore reviewed Oct 8, 2021


d-v-b and others added 3 commits October 11, 2021 10:14

Update requirements_dev_optional

3dd1afd

Merge 'origin/master' into pr-738

2165164

Add changelog

4a4adb1

joshmoore merged commit 831e687 into zarr-developers:master Oct 19, 2021

jni mentioned this pull request Oct 20, 2021

Fix versionadded tag in zarr.core.Array docstring #852

Merged

jni mentioned this pull request Oct 20, 2021

Set write_empty_chunks to default to False #853

Merged

6 tasks

jakirkham mentioned this pull request Dec 1, 2021

Blogpost for Zarr 2.11 #901

Closed

joshmoore mentioned this pull request Dec 23, 2021

Allow disabling filling of missing chunks #489

Open

7 tasks

d-v-b mentioned this pull request Jan 8, 2022

Skip writing fill-value-only chunks pangeo-data/rechunker#94

Closed

croth1 mentioned this pull request Oct 23, 2022

[WIP] Attempt to continue LRU cache for decoded chunks #1214

Closed

6 tasks

d-v-b mentioned this pull request Oct 22, 2024

Feat/write empty chunks #2429

Merged

6 tasks

d-v-b mentioned this pull request Jun 19, 2025

How to prevent Zarr from returning NaN for missing chunks? #486

Open

