Skip to content

WIP, RFC: Optimize setitem with chunk equal to fill_value #367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

WIP, RFC: Optimize setitem with chunk equal to fill_value #367

wants to merge 2 commits into from

Conversation

jakirkham
Copy link
Member

Fixes #366

If a chunk only contains the fill_value, delete the key-value pair for that chunk instead of storing the array. Should cutdown on storage space require for the Array. Also should improve the performance of copying one Array to another.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

Matches how these lines are written in `_set_basic_selection_zd`.
Add a simple check to see if the key-value pair is just being set with a
chunk equal to the fill value. If so, simply delete the key-value pair
instead of storing a chunk that only contains the fill value. The Array
will behave the same externally. However this will cutdown on the space
require to store the Array. Also will make sure that copying one Array
to another Array won't dramatically effect the storage size.
@rabernat
Copy link
Contributor

I have a very large, relatively sparse (lots of empty chunks) array. I would really like to avoid writing all those empty chunks to disk.

So 👍 to this!

@@ -1476,6 +1476,17 @@ def _set_basic_selection_zd(self, selection, value, fields=None):
else:
chunk[selection] = value

# clear chunk if it only contains the fill value
if np.all(np.equal(chunk, self._fill_value)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could sometimes be an expensive operation. So it should be optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large are typical chunk sizes for you?

Copy link
Contributor

@d-v-b d-v-b Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regardless of chunk size, in many cases the user knows in advance whether this check will ever be useful. For dense imaging data, this conditional will almost never evaluate to true; for sparse data, this conditional will often evaluate to true. So the user should have some discretion over whether this check occurs.

@jakirkham
Copy link
Member Author

Thanks! Yeah that was the idea behind it.

There was funkiness with ZipStore as deletion is not supported there.

Can give this another look 🙂

@joshmoore
Copy link
Member

I actually assumed recently that this was the default.

@jakirkham
Copy link
Member Author

Nope, but I think we discussed recently how we might get this to work for ZipStore and other non-deletable stores by just making the value b''. So I think we may be able to generalize this a bit

@d-v-b
Copy link
Contributor

d-v-b commented Apr 25, 2021

I was also bitten by thinking that was the default! Would love to see this merged.

@d-v-b
Copy link
Contributor

d-v-b commented May 10, 2021

@jakirkham I would be happy to move this PR forward, but I don't think I have push access to your fork. I made some changes (added a kwarg to the Array constructor, and added tests) here: https://github.com/d-v-b/zarr-python/tree/opt_setitem_fill_value

@jakirkham
Copy link
Member Author

Sounds good. In that case feel free to send a PR to zarr-developers. Am going to close this out.

@jakirkham jakirkham closed this May 11, 2021
@jakirkham jakirkham deleted the opt_setitem_fill_value branch May 11, 2021 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize storing fill_value to Array
4 participants