Optimize setitem with chunk equal to fill_value, round 2 #738


Merged: 68 commits, Oct 19, 2021

Changes from all commits (68 commits)
8153810
Consolidate encode/store in _chunk_setitem_nosync
jakirkham Dec 16, 2018
c84839e
Clear key-value pair if chunk is just fill value
jakirkham Dec 16, 2018
750d696
don't set cdata in _process_for_setitem
d-v-b May 10, 2021
6ac2349
set empty chunk write behavior via array constructor
d-v-b May 10, 2021
eb36713
add rudimentary tests, np.equal -> np.array_equal
d-v-b May 10, 2021
053ad4c
add test for chunk deletion
d-v-b May 10, 2021
3375bf0
add flattening function
d-v-b May 11, 2021
30c3a30
add kwarg for empty writes to array creators
d-v-b May 11, 2021
d2fc396
fix check for chunk equality to fill value
d-v-b May 11, 2021
e4e4012
flake8
d-v-b May 11, 2021
bd27b9a
add None check to setitems
d-v-b May 12, 2021
814d009
add write_empty_chunks to output of __getstate__
d-v-b May 12, 2021
769f5a6
flake8
d-v-b May 12, 2021
cd56b35
add partial decompress to __get_state__
d-v-b May 12, 2021
bcbaac4
Merge branch 'master' into opt_setitem_fill_value
d-v-b May 14, 2021
9096f2c
Merge branch 'master' into opt_setitem_fill_value
d-v-b May 25, 2021
044a9b8
functionalize emptiness checks and key deletion
d-v-b May 25, 2021
74e0852
flake8
d-v-b May 25, 2021
b2ec5ad
Merge branch 'master' into opt_setitem_fill_value
d-v-b May 27, 2021
62a55ab
add path for delitems, and add some failing tests
d-v-b May 27, 2021
dbc32fd
flake8
d-v-b May 27, 2021
cd28aff
add delitems method to FSStore, and correspondingly change zarr.Array…
d-v-b May 27, 2021
160c2dc
add nested + empty writes test
d-v-b May 27, 2021
b7fe1fe
Merge branch 'master' into opt_setitem_fill_value
d-v-b Jun 18, 2021
af715fe
set write_empty_chunks to True by default
d-v-b Jul 14, 2021
e17993a
Merge branch 'opt_setitem_fill_value' of https://github.com/d-v-b/zar…
d-v-b Jul 14, 2021
7c9a041
Merge branch 'master' into opt_setitem_fill_value
d-v-b Jul 14, 2021
59328f0
rename chunk_is_empty method and clean up logic in _chunk_setitem
d-v-b Jul 14, 2021
72488a8
rename test
d-v-b Jul 14, 2021
7489ae9
add test for flatten
d-v-b Jul 14, 2021
10199b3
initial support for using delitems api in chunk_setitems
d-v-b Jul 14, 2021
88b4811
flake8
d-v-b Jul 15, 2021
3c69719
strip path separator that was screwing up store.listdir
d-v-b Jul 15, 2021
8aa93fa
change tests to check for empty writing behavior
d-v-b Jul 15, 2021
0dae9da
bump fsspec and s3fs versions
d-v-b Jul 15, 2021
40c3f14
delitems: only attempt to delete keys that exist
d-v-b Jul 15, 2021
7dde846
don't pass empty collections to self.map.delitems
d-v-b Aug 17, 2021
7a45fd2
flake8
d-v-b Aug 17, 2021
99f59ef
use main branch of fsspec until a new release is cut
d-v-b Aug 19, 2021
a6ba3c7
add support for checking if a chunk is all nans in _chunk_is_empty
d-v-b Aug 20, 2021
6f8b6c4
docstring tweaks
d-v-b Aug 20, 2021
054399e
Merge branch 'master' into opt_setitem_fill_value
d-v-b Aug 20, 2021
7025d19
clean up empty_write tests
d-v-b Aug 20, 2021
bbabe5c
Merge branch 'opt_setitem_fill_value' of https://github.com/d-v-b/zar…
d-v-b Aug 20, 2021
3abcbc3
fix hexdigests for FSStore + empty writes, and remove redundant nchun…
d-v-b Aug 20, 2021
c3b4455
Merge branch 'master' into opt_setitem_fill_value
joshmoore Sep 8, 2021
ea3356c
Merge branch 'master' of https://github.com/zarr-developers/zarr-pyth…
d-v-b Oct 2, 2021
b81c14a
resolve merge conflicts in favor of master
d-v-b Oct 2, 2021
05716a3
set write_empty_chunks to True by default; put chunk emptiness checki…
d-v-b Oct 4, 2021
23bfc1e
update requirements_dev_optional
d-v-b Oct 4, 2021
b921b34
remove np.typing import
d-v-b Oct 4, 2021
48c38c7
use public attribute in test_nchunks_initialized
d-v-b Oct 4, 2021
7e4fbad
remove duplicated logic in _chunk_setitems, instead using _chunk_deli…
d-v-b Oct 4, 2021
b063f52
expand 0d tests and nchunks_initialized tests to hit more parts of th…
d-v-b Oct 4, 2021
020475b
remove return type annotation for all_equal that was breaking CI
d-v-b Oct 4, 2021
a81ac83
refactor write_empty_chunks tests by expanding the create_array logic…
d-v-b Oct 4, 2021
f8d8415
merge upstream changes to requirements_dev_optional
d-v-b Oct 4, 2021
1c29fe8
correctly handle merge from upstream master
d-v-b Oct 4, 2021
710b875
don't use os.path.join for constructing a chunk key; instead use _chu…
d-v-b Oct 4, 2021
8a06884
complete removal of os.path.join calls
d-v-b Oct 4, 2021
7f859c3
add coverage exemption to type error branch in all_equal
d-v-b Oct 6, 2021
0a7a3cc
remove unreachable conditionals in n5 tests
d-v-b Oct 7, 2021
1a0f41c
instantiate ReadOnlyError
d-v-b Oct 7, 2021
94d5d0a
add explicit delitems and setitems calls to readonly fsstore tests
d-v-b Oct 7, 2021
a918f1d
Update docstrings
d-v-b Oct 7, 2021
3dd1afd
Update requirements_dev_optional
d-v-b Oct 11, 2021
2165164
Merge 'origin/master' into pr-738
joshmoore Oct 19, 2021
4a4adb1
Add changelog
joshmoore Oct 19, 2021
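
The behavior this PR implements hinges on an emptiness test comparing each chunk against the array's fill value; the PR adds an `all_equal` helper to `zarr.util` for this, including handling for all-NaN chunks (see the Aug 20 commit). The sketch below is illustrative only: the function name and exact semantics are assumptions, not zarr's actual implementation.

```python
import numpy as np

def chunk_is_empty(fill_value, chunk):
    """Return True if every element of `chunk` equals `fill_value`.

    NaN never compares equal to itself, so a NaN fill value is handled
    by checking that every element is NaN instead.
    """
    if fill_value is None:
        return False
    if isinstance(fill_value, float) and np.isnan(fill_value):
        return bool(np.all(np.isnan(chunk)))
    return bool(np.all(np.asarray(chunk) == fill_value))

print(chunk_is_empty(0, np.zeros((4, 4))))         # True
print(chunk_is_empty(0, np.arange(4)))             # False
print(chunk_is_empty(np.nan, np.full(3, np.nan)))  # True
```

With `write_empty_chunks=False`, chunks for which this test returns True are deleted from the store rather than written, which is where the sparser-storage benefit comes from.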
3 changes: 3 additions & 0 deletions docs/release.rst
@@ -12,6 +12,9 @@ Enhancements
* array indexing with [] (getitem and setitem) now supports fancy indexing.
By :user:`Juan Nunez-Iglesias <jni>`; :issue:`725`.

* write_empty_chunks=False deletes chunks consisting of only fill_value.
By :user:`Davis Bennett <d-v-b>`; :issue:`738`.

.. _release_2.10.2:

2.10.2
85 changes: 72 additions & 13 deletions zarr/core.py
@@ -32,6 +32,7 @@
from zarr.meta import decode_array_metadata, encode_array_metadata
from zarr.storage import array_meta_key, attrs_key, getsize, listdir
from zarr.util import (
all_equal,
InfoReporter,
check_array_shape,
human_readable_size,
@@ -75,6 +76,14 @@ class Array:
If True, and the chunk_store is an FSStore, and the compression used
is Blosc, chunks will be partially read and decompressed when
reading data from the array, where possible.
write_empty_chunks : bool, optional
If True (default), all chunks will be stored regardless of their
contents. If False, each chunk is compared to the array's fill
value prior to storing. If a chunk is uniformly equal to the fill
value, then that chunk is not stored, and the store entry for
that chunk's key is deleted. This setting enables sparser storage,
as only chunks with non-fill-value data are stored, at the expense
of overhead associated with checking the data of each chunk.

.. versionadded:: 2.7

@@ -107,6 +116,7 @@ class Array:
info
vindex
oindex
write_empty_chunks

Methods
-------
@@ -139,6 +149,7 @@ def __init__(
cache_metadata=True,
cache_attrs=True,
partial_decompress=False,
write_empty_chunks=True,
):
# N.B., expect at this point store is fully initialized with all
# configuration metadata fully specified and normalized
@@ -155,6 +166,7 @@
self._cache_metadata = cache_metadata
self._is_view = False
self._partial_decompress = partial_decompress
self._write_empty_chunks = write_empty_chunks

# initialize metadata
self._load_metadata()
@@ -455,6 +467,13 @@ def vindex(self):
:func:`set_mask_selection` for documentation and examples."""
return self._vindex

@property
def write_empty_chunks(self) -> bool:
"""A Boolean, True if chunks composed of the array's fill value
will be stored. If False, such chunks will not be stored.
"""
return self._write_empty_chunks

def __eq__(self, other):
return (
isinstance(other, Array) and
@@ -1626,9 +1645,18 @@ def _set_basic_selection_zd(self, selection, value, fields=None):
else:
chunk[selection] = value

# encode and store
cdata = self._encode_chunk(chunk)
self.chunk_store[ckey] = cdata
# remove chunk if write_empty_chunks is false and it only contains the fill value
if (not self.write_empty_chunks) and all_equal(self.fill_value, chunk):
try:
del self.chunk_store[ckey]
return
except Exception: # pragma: no cover
# deleting failed, fallback to overwriting
pass
else:
# encode and store
cdata = self._encode_chunk(chunk)
self.chunk_store[ckey] = cdata
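
The delete-or-store branch above can be exercised against a plain dict acting as a chunk store. This is a hypothetical sketch: the function name and the trivial byte encoding are illustrative stand-ins, not zarr's actual classes or encoding pipeline.

```python
import numpy as np

def store_chunk(store, ckey, chunk, fill_value, write_empty_chunks):
    """Store an encoded chunk, or delete its key when the chunk is
    uniformly equal to fill_value and empty writes are disabled."""
    if not write_empty_chunks and np.all(np.asarray(chunk) == fill_value):
        # mirror the try/del fallback above: a missing key is not an error
        store.pop(ckey, None)
        return
    # stand-in for self._encode_chunk: raw bytes of the array buffer
    store[ckey] = bytes(np.ascontiguousarray(chunk))

store = {"0.0": b"stale"}
store_chunk(store, "0.0", np.zeros(4, dtype="u1"), 0, write_empty_chunks=False)
print("0.0" in store)  # False: the all-fill chunk was removed from the store
store_chunk(store, "0.1", np.arange(4, dtype="u1"), 0, write_empty_chunks=False)
print(store["0.1"])    # b'\x00\x01\x02\x03'
```

Note that deleting the stale key, rather than merely skipping the write, is what keeps the store consistent when a previously non-empty chunk is overwritten with fill values.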

def _set_basic_selection_nd(self, selection, value, fields=None):
# implementation of __setitem__ for array with at least one dimension
@@ -1896,11 +1924,38 @@ def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
out[out_select] = fill_value

def _chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields=None):
ckeys = [self._chunk_key(co) for co in lchunk_coords]
cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
for key, sel, val in zip(ckeys, lchunk_selection, values)]
values = {k: v for k, v in zip(ckeys, cdatas)}
self.chunk_store.setitems(values)
ckeys = map(self._chunk_key, lchunk_coords)
cdatas = {key: self._process_for_setitem(key, sel, val, fields=fields)
for key, sel, val in zip(ckeys, lchunk_selection, values)}
to_store = {}
if not self.write_empty_chunks:
empty_chunks = {k: v for k, v in cdatas.items() if all_equal(self.fill_value, v)}
self._chunk_delitems(empty_chunks.keys())
nonempty_keys = cdatas.keys() - empty_chunks.keys()
to_store = {k: self._encode_chunk(cdatas[k]) for k in nonempty_keys}
else:
to_store = {k: self._encode_chunk(v) for k, v in cdatas.items()}
self.chunk_store.setitems(to_store)

def _chunk_delitems(self, ckeys):
if hasattr(self.store, "delitems"):
Review comment (Member): @grlee77: does this work with the BaseStore refactoring?

self.store.delitems(ckeys)
else: # pragma: no cover
# exempting this branch from coverage as there are no extant stores
# that will trigger this condition, but it's possible that they
# will be developed in the future.
tuple(map(self._chunk_delitem, ckeys))
return None

def _chunk_delitem(self, ckey):
"""
Attempt to delete the value associated with ckey.
"""
try:
del self.chunk_store[ckey]
return
except KeyError:
return

def _chunk_setitem(self, chunk_coords, chunk_selection, value, fields=None):
"""Replace part or whole of a chunk.
@@ -1931,8 +1986,12 @@ def _chunk_setitem(self, chunk_coords, chunk_selection, value, fields=None):
def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=None):
ckey = self._chunk_key(chunk_coords)
cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
# store
self.chunk_store[ckey] = cdata

# attempt to delete chunk if it only contains the fill value
if (not self.write_empty_chunks) and all_equal(self.fill_value, cdata):
self._chunk_delitem(ckey)
else:
self.chunk_store[ckey] = self._encode_chunk(cdata)

def _process_for_setitem(self, ckey, chunk_selection, value, fields=None):
if is_total_slice(chunk_selection, self._chunks) and not fields:
@@ -1988,8 +2047,7 @@ def _process_for_setitem(self, ckey, chunk_selection, value, fields=None):
else:
chunk[chunk_selection] = value

# encode chunk
return self._encode_chunk(chunk)
return chunk

def _chunk_key(self, chunk_coords):
return self._key_prefix + self._dimension_separator.join(map(str, chunk_coords))
@@ -2209,7 +2267,8 @@ def hexdigest(self, hashname="sha1"):

def __getstate__(self):
return (self._store, self._path, self._read_only, self._chunk_store,
self._synchronizer, self._cache_metadata, self._attrs.cache)
self._synchronizer, self._cache_metadata, self._attrs.cache,
self._partial_decompress, self._write_empty_chunks)

def __setstate__(self, state):
self.__init__(*state)
27 changes: 22 additions & 5 deletions zarr/creation.py
@@ -21,7 +21,7 @@ def create(shape, chunks=True, dtype=None, compressor='default',
fill_value=0, order='C', store=None, synchronizer=None,
overwrite=False, path=None, chunk_store=None, filters=None,
cache_metadata=True, cache_attrs=True, read_only=False,
object_codec=None, dimension_separator=None, **kwargs):
object_codec=None, dimension_separator=None, write_empty_chunks=True, **kwargs):
"""Create an array.

Parameters
@@ -71,6 +71,15 @@ def create(shape, chunks=True, dtype=None, compressor='default',
dimension_separator : {'.', '/'}, optional
Separator placed between the dimensions of a chunk.
.. versionadded:: 2.8
write_empty_chunks : bool, optional
If True (default), all chunks will be stored regardless of their
contents. If False, each chunk is compared to the array's fill
value prior to storing. If a chunk is uniformly equal to the fill
value, then that chunk is not stored, and the store entry for
that chunk's key is deleted. This setting enables sparser storage,
as only chunks with non-fill-value data are stored, at the expense
of overhead associated with checking the data of each chunk.


Returns
-------
@@ -142,7 +151,8 @@ def create(shape, chunks=True, dtype=None, compressor='default',

# instantiate array
z = Array(store, path=path, chunk_store=chunk_store, synchronizer=synchronizer,
cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only)
cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only,
write_empty_chunks=write_empty_chunks)

return z

@@ -400,6 +410,7 @@ def open_array(
chunk_store=None,
storage_options=None,
partial_decompress=False,
write_empty_chunks=True,
**kwargs
):
"""Open an array using file-mode-like semantics.
@@ -454,8 +465,14 @@ def open_array(
If True, and the chunk_store is an FSStore, and the compression used
is Blosc, chunks will be partially read and decompressed when
reading data from the array, where possible.

.. versionadded:: 2.7
write_empty_chunks : bool, optional
If True (default), all chunks will be stored regardless of their
contents. If False, each chunk is compared to the array's fill
value prior to storing. If a chunk is uniformly equal to the fill
value, then that chunk is not stored, and the store entry for
that chunk's key is deleted. This setting enables sparser storage,
as only chunks with non-fill-value data are stored, at the expense
of overhead associated with checking the data of each chunk.

Returns
-------
@@ -545,7 +562,7 @@ def open_array(
# instantiate array
z = Array(store, read_only=read_only, synchronizer=synchronizer,
cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path,
chunk_store=chunk_store)
chunk_store=chunk_store, write_empty_chunks=write_empty_chunks)

return z

9 changes: 9 additions & 0 deletions zarr/storage.py
@@ -1154,6 +1154,15 @@ def __delitem__(self, key):
else:
del self.map[key]

def delitems(self, keys):
Review comment (Member): What's the implication that this is defined on FSStore?

Reply (Contributor Author): You mean, what if a store doesn't have this method? For stores without delitems, there's a fallback to sequential calls to _chunk_delitem.

if self.mode == 'r':
raise ReadOnlyError()
# only remove the keys that exist in the store
nkeys = [self._normalize_key(key) for key in keys if key in self]
# rm errors if you pass an empty collection
if len(nkeys) > 0:
self.map.delitems(nkeys)

def __contains__(self, key):
key = self._normalize_key(key)
return key in self.map
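
The interplay between `FSStore.delitems` and the `_chunk_delitems` fallback in zarr/core.py can be sketched with dict-backed stand-ins. The class and function names below are illustrative; zarr's real stores are MutableMapping subclasses with more machinery.

```python
class BatchStore(dict):
    """Dict-backed store exposing a batched delitems, like FSStore."""
    def delitems(self, keys):
        # only remove keys that exist; batched deletes often error on misses
        for k in [k for k in keys if k in self]:
            del self[k]

def chunk_delitems(store, ckeys):
    """Prefer the store's batched delitems; otherwise delete one by one."""
    if hasattr(store, "delitems"):
        store.delitems(ckeys)
    else:
        for k in ckeys:
            store.pop(k, None)  # a missing key is ignored, like _chunk_delitem

batched = BatchStore({"0": b"a", "1": b"b"})
plain = {"0": b"a", "1": b"b"}
chunk_delitems(batched, ["0", "2"])  # "2" does not exist; no error raised
chunk_delitems(plain, ["0", "2"])
print(sorted(batched), sorted(plain))  # ['1'] ['1']
```

Dispatching on `hasattr(store, "delitems")` lets stores with a bulk-delete backend (e.g. fsspec's mapper) remove many chunk keys in one call, while plain mapping stores keep working unchanged.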