Optimize setitem with chunk equal to fill_value, round 2 #738

Merged
merged 68 commits on Oct 19, 2021
Changes from 18 commits
Commits
8153810
Consolidate encode/store in _chunk_setitem_nosync
jakirkham Dec 16, 2018
c84839e
Clear key-value pair if chunk is just fill value
jakirkham Dec 16, 2018
750d696
don't set cdata in _process_for_setitem
d-v-b May 10, 2021
6ac2349
set empty chunk write behavior via array constructor
d-v-b May 10, 2021
eb36713
add rudimentary tests, np.equal -> np.array_equal
d-v-b May 10, 2021
053ad4c
add test for chunk deletion
d-v-b May 10, 2021
3375bf0
add flattening function
d-v-b May 11, 2021
30c3a30
add kwarg for empty writes to array creators
d-v-b May 11, 2021
d2fc396
fix check for chunk equality to fill value
d-v-b May 11, 2021
e4e4012
flake8
d-v-b May 11, 2021
bd27b9a
add None check to setitems
d-v-b May 12, 2021
814d009
add write_empty_chunks to output of __getstate__
d-v-b May 12, 2021
769f5a6
flake8
d-v-b May 12, 2021
cd56b35
add partial decompress to __getstate__
d-v-b May 12, 2021
bcbaac4
Merge branch 'master' into opt_setitem_fill_value
d-v-b May 14, 2021
9096f2c
Merge branch 'master' into opt_setitem_fill_value
d-v-b May 25, 2021
044a9b8
functionalize emptiness checks and key deletion
d-v-b May 25, 2021
74e0852
flake8
d-v-b May 25, 2021
b2ec5ad
Merge branch 'master' into opt_setitem_fill_value
d-v-b May 27, 2021
62a55ab
add path for delitems, and add some failing tests
d-v-b May 27, 2021
dbc32fd
flake8
d-v-b May 27, 2021
cd28aff
add delitems method to FSStore, and correspondingly change zarr.Array…
d-v-b May 27, 2021
160c2dc
add nested + empty writes test
d-v-b May 27, 2021
b7fe1fe
Merge branch 'master' into opt_setitem_fill_value
d-v-b Jun 18, 2021
af715fe
set write_empty_chunks to True by default
d-v-b Jul 14, 2021
e17993a
Merge branch 'opt_setitem_fill_value' of https://github.com/d-v-b/zar…
d-v-b Jul 14, 2021
7c9a041
Merge branch 'master' into opt_setitem_fill_value
d-v-b Jul 14, 2021
59328f0
rename chunk_is_empty method and clean up logic in _chunk_setitem
d-v-b Jul 14, 2021
72488a8
rename test
d-v-b Jul 14, 2021
7489ae9
add test for flatten
d-v-b Jul 14, 2021
10199b3
initial support for using delitems api in chunk_setitems
d-v-b Jul 14, 2021
88b4811
flake8
d-v-b Jul 15, 2021
3c69719
strip path separator that was screwing up store.listdir
d-v-b Jul 15, 2021
8aa93fa
change tests to check for empty writing behavior
d-v-b Jul 15, 2021
0dae9da
bump fsspec and s3fs versions
d-v-b Jul 15, 2021
40c3f14
delitems: only attempt to delete keys that exist
d-v-b Jul 15, 2021
7dde846
don't pass empty collections to self.map.delitems
d-v-b Aug 17, 2021
7a45fd2
flake8
d-v-b Aug 17, 2021
99f59ef
use main branch of fsspec until a new release is cut
d-v-b Aug 19, 2021
a6ba3c7
add support for checking if a chunk is all nans in _chunk_is_empty
d-v-b Aug 20, 2021
6f8b6c4
docstring tweaks
d-v-b Aug 20, 2021
054399e
Merge branch 'master' into opt_setitem_fill_value
d-v-b Aug 20, 2021
7025d19
clean up empty_write tests
d-v-b Aug 20, 2021
bbabe5c
Merge branch 'opt_setitem_fill_value' of https://github.com/d-v-b/zar…
d-v-b Aug 20, 2021
3abcbc3
fix hexdigests for FSStore + empty writes, and remove redundant nchun…
d-v-b Aug 20, 2021
c3b4455
Merge branch 'master' into opt_setitem_fill_value
joshmoore Sep 8, 2021
ea3356c
Merge branch 'master' of https://github.com/zarr-developers/zarr-pyth…
d-v-b Oct 2, 2021
b81c14a
resolve merge conflicts in favor of master
d-v-b Oct 2, 2021
05716a3
set write_empty_chunks to True by default; put chunk emptiness checki…
d-v-b Oct 4, 2021
23bfc1e
update requirements_dev_optional
d-v-b Oct 4, 2021
b921b34
remove np.typing import
d-v-b Oct 4, 2021
48c38c7
use public attribute in test_nchunks_initialized
d-v-b Oct 4, 2021
7e4fbad
remove duplicated logic in _chunk_setitems, instead using _chunk_deli…
d-v-b Oct 4, 2021
b063f52
expand 0d tests and nchunks_initialized tests to hit more parts of th…
d-v-b Oct 4, 2021
020475b
remove return type annotation for all_equal that was breaking CI
d-v-b Oct 4, 2021
a81ac83
refactor write_empty_chunks tests by expanding the create_array logic…
d-v-b Oct 4, 2021
f8d8415
merge upstream changes to requirements_dev_optional
d-v-b Oct 4, 2021
1c29fe8
correctly handle merge from upstream master
d-v-b Oct 4, 2021
710b875
don't use os.path.join for constructing a chunk key; instead use _chu…
d-v-b Oct 4, 2021
8a06884
complete removal of os.path.join calls
d-v-b Oct 4, 2021
7f859c3
add coverage exemption to type error branch in all_equal
d-v-b Oct 6, 2021
0a7a3cc
remove unreachable conditionals in n5 tests
d-v-b Oct 7, 2021
1a0f41c
instantiate ReadOnlyError
d-v-b Oct 7, 2021
94d5d0a
add explicit delitems and setitems calls to readonly fsstore tests
d-v-b Oct 7, 2021
a918f1d
Update docstrings
d-v-b Oct 7, 2021
3dd1afd
Update requirements_dev_optional
d-v-b Oct 11, 2021
2165164
Merge 'origin/master' into pr-738
joshmoore Oct 19, 2021
4a4adb1
Add changelog
joshmoore Oct 19, 2021
66 changes: 61 additions & 5 deletions zarr/core.py
@@ -33,6 +33,7 @@
from zarr.util import (
InfoReporter,
check_array_shape,
flatten,
human_readable_size,
is_total_slice,
nolock,
@@ -74,6 +75,12 @@ class Array:
If True and while the chunk_store is a FSStore and the compression used
is Blosc, when getting data from the array chunks will be partially
read and decompressed when possible.
write_empty_chunks : bool, optional
Determines chunk writing behavior for chunks filled with `fill_value` ("empty" chunks).
If True (default), all chunks will be written regardless of their contents.
If False, empty chunks will not be written, and the `store` entry for
the chunk key of an empty chunk will be deleted. Note that setting this option to False
will incur additional overhead per chunk write.

.. versionadded:: 2.7
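
A minimal usage sketch of the behavior described above (editorial illustration; the array parameters and the default in-memory store are assumptions, not taken from this PR). Writing a chunk whose contents equal fill_value removes that chunk's key from the store when write_empty_chunks=False:

import zarr

# array of 10 chunks backed by the default in-memory store
z = zarr.create(shape=(100,), chunks=(10,), dtype='i4', fill_value=0,
                write_empty_chunks=False)

z[:10] = 42
print('0' in z.chunk_store)  # True: the first chunk now holds non-fill data

z[:10] = 0
print('0' in z.chunk_store)  # False: the chunk equals fill_value, so its key was deleted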

@@ -138,6 +145,7 @@ def __init__(
cache_metadata=True,
cache_attrs=True,
partial_decompress=False,
write_empty_chunks=True,
):
# N.B., expect at this point store is fully initialized with all
# configuration metadata fully specified and normalized
@@ -154,6 +162,7 @@
self._cache_metadata = cache_metadata
self._is_view = False
self._partial_decompress = partial_decompress
self._write_empty_chunks = write_empty_chunks

# initialize metadata
self._load_metadata()
@@ -1586,6 +1595,17 @@ def _set_basic_selection_zd(self, selection, value, fields=None):
else:
chunk[selection] = value

# clear chunk if it only contains the fill value
if self._chunk_isempty(chunk):
try:
del self.chunk_store[ckey]
return
except KeyError:
return
except Exception:
# deleting failed, fallback to overwriting
pass

# encode and store
cdata = self._encode_chunk(chunk)
self.chunk_store[ckey] = cdata
@@ -1859,9 +1879,38 @@ def _chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields=None):
ckeys = [self._chunk_key(co) for co in lchunk_coords]
cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
for key, sel, val in zip(ckeys, lchunk_selection, values)]
values = {k: v for k, v in zip(ckeys, cdatas)}
values = {}
if not self._write_empty_chunks:
for ckey, cdata in zip(ckeys, cdatas):
if self._chunk_isempty(cdata) and not self._chunk_delitem(ckey):
values[ckey] = self._encode_chunk(cdata)
else:
values = dict(zip(ckeys, map(self._encode_chunk, cdatas)))
self.chunk_store.setitems(values)

def _chunk_isempty(self, chunk):
if self.dtype == 'object':
# we have to flatten the result of np.equal to handle outputs like
# [np.array([True,True]), True, True]
is_empty = all(flatten(np.equal(chunk, self.fill_value, dtype='object')))
else:
is_empty = np.all(chunk == self._fill_value)

Review thread on the line above:

Contributor:
does this work properly if fill_value is NaN?

Contributor Author:
Nope! Do you have any advice for making that work?

Member:
Maybe something like this?

Suggested change
is_empty = np.all(chunk == self._fill_value)
is_empty = np.all(chunk == self._fill_value if np.isnan(self._fill_value) else np.isnan(chunk))

return is_empty
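
On the NaN question raised in the review thread above: NaN compares unequal to itself, so a plain chunk == fill_value test can never report a chunk as empty when the fill value is NaN. A later commit in this PR ("add support for checking if a chunk is all nans in _chunk_is_empty") addresses this; the helper below is only an illustrative sketch of the idea, not the merged code:

import numpy as np

def chunk_all_fill(chunk: np.ndarray, fill_value) -> bool:
    # illustrative only: treat a NaN fill value as matching NaN elements
    if fill_value is None:
        return False
    if np.issubdtype(chunk.dtype, np.floating) and np.isnan(fill_value):
        # NaN != NaN, so test for NaN directly instead of using equality
        return bool(np.all(np.isnan(chunk)))
    return bool(np.all(chunk == fill_value))

Object-dtype chunks would still go through the separate flatten(np.equal(...)) path shown above.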

def _chunk_delitem(self, ckey):
"""
Attempt to delete the value associated with ckey.
Returns True if deletion succeeds or KeyError is raised.
Returns False if any other exception is raised.
"""
try:
del self.chunk_store[ckey]
return True
except KeyError:
return True
except Exception:
return False

def _chunk_setitem(self, chunk_coords, chunk_selection, value, fields=None):
"""Replace part or whole of a chunk.

@@ -1889,10 +1938,17 @@ def _chunk_setitem(self, chunk_coords, chunk_selection, value, fields=None):
fields=fields)

def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=None):
do_store = True
ckey = self._chunk_key(chunk_coords)
cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)

# clear chunk if it only contains the fill value
if (not self._write_empty_chunks) and self._chunk_isempty(cdata):
do_store = not self._chunk_delitem(ckey)

# store
self.chunk_store[ckey] = cdata
if do_store:
self.chunk_store[ckey] = self._encode_chunk(cdata)

def _process_for_setitem(self, ckey, chunk_selection, value, fields=None):
if is_total_slice(chunk_selection, self._chunks) and not fields:
@@ -1948,8 +2004,7 @@ def _process_for_setitem(self, ckey, chunk_selection, value, fields=None):
else:
chunk[chunk_selection] = value

# encode chunk
return self._encode_chunk(chunk)
return chunk

def _chunk_key(self, chunk_coords):
return self._key_prefix + '.'.join(map(str, chunk_coords))
@@ -2169,7 +2224,8 @@ def hexdigest(self, hashname="sha1"):

def __getstate__(self):
return (self._store, self._path, self._read_only, self._chunk_store,
self._synchronizer, self._cache_metadata, self._attrs.cache)
self._synchronizer, self._cache_metadata, self._attrs.cache,
self._partial_decompress, self._write_empty_chunks)

def __setstate__(self, state):
self.__init__(*state)
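
Adding _partial_decompress and _write_empty_chunks to __getstate__ means the empty-chunk setting survives pickling, since __setstate__ simply re-runs __init__ with the saved state. A minimal sketch (the directory path is an assumption):

import pickle
import zarr
from zarr.storage import DirectoryStore

store = DirectoryStore('pickle_example.zarr')
z = zarr.create(shape=(100,), chunks=(10,), fill_value=0, store=store,
                overwrite=True, write_empty_chunks=False)

z2 = pickle.loads(pickle.dumps(z))
# the unpickled array keeps the empty-chunk behavior of the original
assert z2._write_empty_chunks is False
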
22 changes: 17 additions & 5 deletions zarr/creation.py
@@ -20,7 +20,7 @@ def create(shape, chunks=True, dtype=None, compressor='default',
fill_value=0, order='C', store=None, synchronizer=None,
overwrite=False, path=None, chunk_store=None, filters=None,
cache_metadata=True, cache_attrs=True, read_only=False,
object_codec=None, dimension_separator=None, **kwargs):
object_codec=None, dimension_separator=None, write_empty_chunks=True, **kwargs):
"""Create an array.

Parameters
@@ -70,6 +70,12 @@ def create(shape, chunks=True, dtype=None, compressor='default',
dimension_separator : {'.', '/'}, optional
Separator placed between the dimensions of a chunk.
.. versionadded:: 2.8
write_empty_chunks : bool, optional
Determines chunk writing behavior for chunks filled with `fill_value` ("empty" chunks).
If True (default), all chunks will be written regardless of their contents.
If False, empty chunks will not be written, and the `store` entry for
the chunk key of an empty chunk will be deleted. Note that setting this option to False
will incur additional overhead per chunk write.

Returns
-------
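
A short usage sketch of the new keyword, loosely mirroring the test_nchunks_initialized test added for write_empty_chunks=False elsewhere in this PR (shapes and values are illustrative assumptions):

import zarr

z = zarr.create(shape=100, chunks=10, fill_value=0, write_empty_chunks=False)

z[:] = 42
print(z.nchunks_initialized)  # 10: every chunk holds non-fill data and is stored

z[:] = 0
print(z.nchunks_initialized)  # 0: chunks equal to fill_value were deleted

z[0] = 42
print(z.nchunks_initialized)  # 1: only the first chunk is stored again
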
@@ -140,7 +146,8 @@ def create(shape, chunks=True, dtype=None, compressor='default',

# instantiate array
z = Array(store, path=path, chunk_store=chunk_store, synchronizer=synchronizer,
cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only)
cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only,
write_empty_chunks=write_empty_chunks)

return z

@@ -396,6 +403,7 @@ def open_array(
chunk_store=None,
storage_options=None,
partial_decompress=False,
write_empty_chunks=True,
**kwargs
):
"""Open an array using file-mode-like semantics.
@@ -450,8 +458,12 @@
If True and while the chunk_store is a FSStore and the compression used
is Blosc, when getting data from the array chunks will be partially
read and decompressed when possible.

.. versionadded:: 2.7
write_empty_chunks : bool, optional
Determines chunk writing behavior for chunks filled with `fill_value` ("empty" chunks).
If True (default), all chunks will be written regardless of their contents.
If False, empty chunks will not be written, and the `store` entry for
the chunk key of an empty chunk will be deleted. Note that setting this option to False
will incur additional overhead per chunk write.

Returns
-------
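
Because write_empty_chunks is a runtime argument rather than stored array metadata, it must be passed again each time an existing array is opened. A hedged sketch (the path and mode here are assumptions):

import zarr

# open (or create) an array at a local path; empty-chunk pruning only applies
# to writes made through a handle opened with write_empty_chunks=False
z = zarr.open_array('example.zarr', mode='a', shape=(100,), chunks=(10,),
                    fill_value=0, write_empty_chunks=False)

z[:] = z.fill_value  # chunks that now consist entirely of fill_value are dropped from the store
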
@@ -541,7 +553,7 @@ def open_array(
# instantiate array
z = Array(store, read_only=read_only, synchronizer=synchronizer,
cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path,
chunk_store=chunk_store)
chunk_store=chunk_store, write_empty_chunks=write_empty_chunks)

return z

31 changes: 31 additions & 0 deletions zarr/tests/test_core.py
@@ -1597,6 +1597,37 @@ def test_nbytes_stored(self):
assert -1 == z.nbytes_stored


class TestArrayWithoutEmptyWrites(TestArray):

@staticmethod
def create_array(read_only=False, **kwargs):
path = mkdtemp()
atexit.register(shutil.rmtree, path)
store = DirectoryStore(path)
cache_metadata = kwargs.pop('cache_metadata', True)
cache_attrs = kwargs.pop('cache_attrs', True)
kwargs.setdefault('compressor', Zlib(1))
init_array(store, **kwargs)
return Array(store, read_only=read_only, cache_metadata=cache_metadata,
cache_attrs=cache_attrs, write_empty_chunks=False)

def test_nchunks_initialized(self):
for fill_value in -1, 0, 1, 10:
z = self.create_array(shape=100, chunks=10, fill_value=fill_value)
assert 0 == z.nchunks_initialized
# manually put something into the store to confuse matters
z.store['foo'] = b'bar'
assert 0 == z.nchunks_initialized
z[:] = 42
assert 10 == z.nchunks_initialized
z[:] = fill_value
assert 0 == z.nchunks_initialized
z[0] = 42
assert 1 == z.nchunks_initialized
if hasattr(z.store, 'close'):
z.store.close()


class TestArrayWithDirectoryStore(TestArray):

@staticmethod
9 changes: 9 additions & 0 deletions zarr/util.py
@@ -9,13 +9,22 @@
import numpy as np
from asciitree import BoxStyle, LeftAligned
from asciitree.traversal import Traversal
from collections.abc import Iterable
from numcodecs.compat import ensure_ndarray, ensure_text
from numcodecs.registry import codec_registry
from numcodecs.blosc import cbuffer_sizes, cbuffer_metainfo

from typing import Any, Callable, Dict, Optional, Tuple, Union


def flatten(arg: Iterable) -> Iterable:
for element in arg:
if isinstance(element, Iterable) and not isinstance(element, (str, bytes)):
yield from flatten(element)
else:
yield element
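
A quick illustration of why flatten is needed for the object-dtype emptiness check in zarr/core.py: np.equal over a ragged object array can return a mix of booleans and boolean arrays, which the built-in all() cannot reduce directly. A hedged example with illustrative values:

import numpy as np
from zarr.util import flatten

chunk = np.empty(3, dtype=object)
chunk[0] = np.array([0, 0])  # ragged element
chunk[1] = 0
chunk[2] = 0

eq = np.equal(chunk, 0, dtype=object)  # e.g. [array([True, True]), True, True]
print(all(flatten(eq)))                # True: every nested element equals the fill value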


# codecs to use for object dtype convenience API
object_codecs = {
str.__name__: 'vlen-utf8',