Set write_empty_chunks to default to False #853

Merged
4 changes: 2 additions & 2 deletions .github/workflows/minimal.yml
@@ -25,11 +25,11 @@ jobs:
         run: |
           conda activate minimal
           python -m pip install .
-          pytest -svx
+          pytest -svx --timeout=300
       - name: Fixture generation
         shell: "bash -l {0}"
         run: |
           conda activate minimal
           rm -rf fixture/
-          pytest -svx zarr/tests/test_dim_separator.py zarr/tests/test_storage.py
+          pytest -svx --timeout=300 zarr/tests/test_dim_separator.py zarr/tests/test_storage.py
# This simulates fixture-less tests in conda and debian packaging
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -73,7 +73,7 @@ jobs:
           conda activate zarr-env
           mkdir ~/blob_emulator
           azurite -l ~/blob_emulator --debug debug.log 2>&1 > stdouterr.log &
-          pytest --cov=zarr --cov-config=.coveragerc --doctest-plus --cov-report xml --cov=./
+          pytest --cov=zarr --cov-config=.coveragerc --doctest-plus --cov-report xml --cov=./ --timeout=300
       - uses: codecov/codecov-action@v1
         with:
           #token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
3 changes: 3 additions & 0 deletions docs/release.rst
@@ -9,6 +9,9 @@ Unreleased
Enhancements
~~~~~~~~~~~~

* ``write_empty_chunks`` now defaults to False.
By :user:`Juan Nunez-Iglesias <jni>`; :issue:`853`.

* Allow to assign array ``fill_values`` and update metadata accordingly. :issue:`662`

* array indexing with [] (getitem and setitem) now supports fancy indexing.
69 changes: 69 additions & 0 deletions docs/tutorial.rst
@@ -1302,6 +1302,75 @@
the structure of the data, the compression algorithm used, and which compression
bytes within chunks of an array may improve the compression ratio, depending on
the structure of the data, the compression algorithm used, and which compression
filters (e.g., byte-shuffle) have been applied.

.. _tutorial_chunks_empty_chunks:

Empty chunks
~~~~~~~~~~~~

As of version 2.11, it is possible to configure how Zarr handles the storage of
chunks that are "empty" (i.e., every element in the chunk is equal to the array's fill value).
When creating an array with ``write_empty_chunks=False`` (the default),
Zarr will check whether a chunk is empty before compression and storage. If a chunk is empty,
then Zarr does not store it, and instead deletes the chunk from storage
if the chunk had been previously stored.

This optimization prevents storing redundant objects and can speed up reads, but
it adds computation during writes, since the contents of each chunk must be
compared to the fill value. Whether the trade-off pays off depends on the data:
if you know that your chunks will almost always contain non-fill-value data,
the emptiness check buys you nothing. In that case, create the array with
``write_empty_chunks=True`` to instruct Zarr to write every chunk without
checking for emptiness.
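Conceptually, the emptiness check is just an elementwise comparison against the fill value. The sketch below (plain numpy; a simplification of Zarr's internal ``all_equal`` helper, which handles additional cases such as object dtypes) shows what "empty" means here:

```python
import numpy as np

def chunk_is_empty(chunk, fill_value):
    """Return True when every element of ``chunk`` equals ``fill_value``.

    Simplified illustration of the check performed before storing a
    chunk when write_empty_chunks=False; not Zarr's actual code.
    """
    if fill_value is None:
        return False  # no fill value, so no chunk counts as empty
    if isinstance(fill_value, float) and np.isnan(fill_value):
        # NaN != NaN, so an all-NaN fill value needs its own check
        return bool(np.all(np.isnan(chunk)))
    return bool(np.all(chunk == fill_value))

chunk = np.zeros((8192,), dtype='uint8')
print(chunk_is_empty(chunk, 0))   # True
chunk[0] = 1
print(chunk_is_empty(chunk, 0))   # False
```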

The following example illustrates the effect of the ``write_empty_chunks`` flag on
the time required to write an array with different values::

>>> import zarr
>>> import numpy as np
>>> import time
>>> from tempfile import TemporaryDirectory
>>> def timed_write(write_empty_chunks):
... """
... Measure the time required and number of objects created when writing
... to a Zarr array with random ints or fill value.
... """
... chunks = (8192,)
... shape = (chunks[0] * 1024,)
... data = np.random.randint(0, 255, shape)
... dtype = 'uint8'
...
... with TemporaryDirectory() as store:
... arr = zarr.open(store,
... shape=shape,
... chunks=chunks,
... dtype=dtype,
... write_empty_chunks=write_empty_chunks,
... fill_value=0,
... mode='w')
... # initialize all chunks
... arr[:] = 100
... result = []
... for value in (data, arr.fill_value):
... start = time.time()
... arr[:] = value
... elapsed = time.time() - start
... result.append((elapsed, arr.nchunks_initialized))
...
... return result
>>> for write_empty_chunks in (True, False):
... full, empty = timed_write(write_empty_chunks)
... print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')

write_empty_chunks=True:
Random Data: 0.1252s, 1024 objects stored
Empty Data: 0.1060s, 1024 objects stored


write_empty_chunks=False:
Random Data: 0.1359s, 1024 objects stored
Empty Data: 0.0301s, 0 objects stored

In this example, writing random data is slightly slower with ``write_empty_chunks=True``,
but writing empty data is substantially faster and generates far fewer objects in storage.
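The deletion behaviour can be sketched with a toy dict standing in for a chunk store (this is an illustration, not Zarr's actual store interface):

```python
import numpy as np

def set_chunk(store, key, chunk, fill_value, write_empty_chunks):
    """Toy model of storing one chunk: with write_empty_chunks=False,
    overwriting a stored chunk with the fill value removes its key
    instead of writing a new object."""
    if not write_empty_chunks and np.all(chunk == fill_value):
        store.pop(key, None)  # chunk is "empty": drop any stored object
    else:
        store[key] = chunk.tobytes()

store = {}  # stands in for a Zarr chunk store (key -> bytes)
set_chunk(store, '0', np.full(4, 7, dtype='uint8'), 0, False)
print(sorted(store))   # ['0']
set_chunk(store, '0', np.zeros(4, dtype='uint8'), 0, False)
print(sorted(store))   # []
```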

.. _tutorial_rechunking:

Changing chunk shapes (rechunking)
3 changes: 2 additions & 1 deletion environment.yml
@@ -8,6 +8,7 @@ dependencies:
- pip
- pip:
     - asciitree
-    - fasteners
+    - fasteners == 0.16.3
     - pytest
+    - pytest-timeout
     - setuptools_scm
2 changes: 1 addition & 1 deletion requirements_dev_minimal.txt
@@ -1,6 +1,6 @@
# library requirements
asciitree==0.3.3
-fasteners==0.17.3
+fasteners==0.16.3
numcodecs==0.9.1
msgpack-python==0.5.6
setuptools-scm==6.4.2
2 changes: 1 addition & 1 deletion setup.py
@@ -9,7 +9,7 @@
dependencies = [
'asciitree',
'numpy>=1.7',
-    'fasteners',
+    'fasteners==0.16.3',
'numcodecs>=0.6.4',
]

1 change: 1 addition & 0 deletions tox.ini
@@ -10,6 +10,7 @@ envlist = py37-npy{117,latest}, py38, py39, docs
install_command = pip install --no-binary=numcodecs {opts} {packages}
setenv =
PYTHONHASHSEED = 42
PYTEST_TIMEOUT = {env:PYTEST_TIMEOUT:300}
passenv =
ZARR_TEST_ABS
ZARR_TEST_MONGO
2 changes: 1 addition & 1 deletion windows_conda_dev.txt
@@ -1,5 +1,5 @@
coverage
-fasteners
+fasteners==0.16.3
flake8
monotonic
msgpack-python
10 changes: 10 additions & 0 deletions zarr/_storage/absstore.py
@@ -17,26 +17,36 @@ class ABSStore(Store):
----------
container : string
The name of the ABS container to use.

.. deprecated::
Use ``client`` instead.

prefix : string
Location of the "directory" to use as the root of the storage hierarchy
within the container.

account_name : string
The Azure blob storage account name.

.. deprecated:: 2.8.3
Use ``client`` instead.

account_key : string
The Azure blob storage account access key.

.. deprecated:: 2.8.3
Use ``client`` instead.

blob_service_kwargs : dictionary
Extra arguments to be passed into the azure blob client, for e.g. when
using the emulator, pass in blob_service_kwargs={'is_emulated': True}.

.. deprecated:: 2.8.3
Use ``client`` instead.

dimension_separator : {'.', '/'}, optional
Separator placed between the dimensions of a chunk.

client : azure.storage.blob.ContainerClient, optional
An ``azure.storage.blob.ContainerClient`` to connect with. See
`here <https://docs.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python>`_ # noqa
16 changes: 8 additions & 8 deletions zarr/core.py
@@ -81,13 +81,13 @@ class Array:
.. versionadded:: 2.7

     write_empty_chunks : bool, optional
-        If True (default), all chunks will be stored regardless of their
-        contents. If False, each chunk is compared to the array's fill
-        value prior to storing. If a chunk is uniformly equal to the fill
-        value, then that chunk is not be stored, and the store entry for
-        that chunk's key is deleted. This setting enables sparser storage,
-        as only chunks with non-fill-value data are stored, at the expense
-        of overhead associated with checking the data of each chunk.
+        If True, all chunks will be stored regardless of their contents. If
+        False (default), each chunk is compared to the array's fill value
+        prior to storing. If a chunk is uniformly equal to the fill value,
+        then that chunk is not stored, and the store entry for that chunk's
+        key is deleted. This setting enables sparser storage, as only chunks
+        with non-fill-value data are stored, at the expense of overhead
+        associated with checking the data of each chunk.

.. versionadded:: 2.11

@@ -154,7 +154,7 @@ def __init__(
cache_metadata=True,
cache_attrs=True,
partial_decompress=False,
-        write_empty_chunks=True,
+        write_empty_chunks=False,
):
# N.B., expect at this point store is fully initialized with all
# configuration metadata fully specified and normalized
35 changes: 19 additions & 16 deletions zarr/creation.py
@@ -19,7 +19,8 @@ def create(shape, chunks=True, dtype=None, compressor='default',
fill_value=0, order='C', store=None, synchronizer=None,
overwrite=False, path=None, chunk_store=None, filters=None,
cache_metadata=True, cache_attrs=True, read_only=False,
-           object_codec=None, dimension_separator=None, write_empty_chunks=True, **kwargs):
+           object_codec=None, dimension_separator=None,
+           write_empty_chunks=False, **kwargs):
"""Create an array.

Parameters
@@ -72,13 +73,14 @@ def create(shape, chunks=True, dtype=None, compressor='default',
.. versionadded:: 2.8

     write_empty_chunks : bool, optional
-        If True (default), all chunks will be stored regardless of their
-        contents. If False, each chunk is compared to the array's fill
-        value prior to storing. If a chunk is uniformly equal to the fill
-        value, then that chunk is not be stored, and the store entry for
-        that chunk's key is deleted. This setting enables sparser storage,
-        as only chunks with non-fill-value data are stored, at the expense
-        of overhead associated with checking the data of each chunk.
+        If True, all chunks will be stored regardless of their contents. If
+        False (default), each chunk is compared to the array's fill value
+        prior to storing. If a chunk is uniformly equal to the fill value,
+        then that chunk is not stored, and the store entry for that chunk's
+        key is deleted. This setting enables sparser storage, as only chunks
+        with non-fill-value data are stored, at the expense of overhead
+        associated with checking the data of each chunk.
+
+        .. versionadded:: 2.11


Returns
@@ -389,7 +391,7 @@ def open_array(
chunk_store=None,
storage_options=None,
partial_decompress=False,
-    write_empty_chunks=True,
+    write_empty_chunks=False,
**kwargs
):
"""Open an array using file-mode-like semantics.
@@ -445,13 +447,14 @@ def open_array(
is Blosc, when getting data from the array chunks will be partially
read and decompressed when possible.
     write_empty_chunks : bool, optional
-        If True (default), all chunks will be stored regardless of their
-        contents. If False, each chunk is compared to the array's fill
-        value prior to storing. If a chunk is uniformly equal to the fill
-        value, then that chunk is not be stored, and the store entry for
-        that chunk's key is deleted. This setting enables sparser storage,
-        as only chunks with non-fill-value data are stored, at the expense
-        of overhead associated with checking the data of each chunk.
+        If True, all chunks will be stored regardless of their contents. If
+        False (default), each chunk is compared to the array's fill value
+        prior to storing. If a chunk is uniformly equal to the fill value,
+        then that chunk is not stored, and the store entry for that chunk's
+        key is deleted. This setting enables sparser storage, as only chunks
+        with non-fill-value data are stored, at the expense of overhead
+        associated with checking the data of each chunk.
+
+        .. versionadded:: 2.11

Returns
-------
2 changes: 1 addition & 1 deletion zarr/util.py
@@ -670,7 +670,7 @@ def all_equal(value: Any, array: Any):
     # optimized to return on the first truthy value in `array`.
     try:
         return not np.any(array)
-    except TypeError:  # pragma: no cover
+    except (TypeError, ValueError):  # pragma: no cover
Comment (Member):
What causes the ValueError? Is there a particular array type or value that np.any is raising on?

Comment (Member):
It occurs when a category value (baz) gets passed to int().

Comment (Member):
from failing build
______________________ test_array_with_categorize_filter _______________________

    def test_array_with_categorize_filter():
    
        # setup
        data = np.random.choice(['foo', 'bar', 'baz'], size=100)
        flt = Categorize(dtype=data.dtype, labels=['foo', 'bar', 'baz'])
        filters = [flt]
    
        for compressor in compressors:
    
>           a = array(data, chunks=5, compressor=compressor, filters=filters)

zarr/tests/test_filters.py:172: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
zarr/creation.py:366: in array
    z[...] = data
zarr/core.py:1285: in __setitem__
    self.set_basic_selection(pure_selection, value, fields=fields)
zarr/core.py:1380: in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
zarr/core.py:1680: in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
zarr/core.py:1732: in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
zarr/core.py:1994: in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
zarr/core.py:2002: in _chunk_setitem_nosync
    if (not self.write_empty_chunks) and all_equal(self.fill_value, cdata):
zarr/util.py:672: in all_equal
    return not np.any(array)
<__array_function__ internals>:180: in any
    ???
/usr/share/miniconda/envs/minimal/lib/python3.10/site-packages/numpy/core/fromnumeric.py:2395: in any
    return _wrapreduction(a, np.logical_or, 'any', axis, None, out,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = array(['baz', 'baz', 'baz', 'bar', 'foo'], dtype='<U3')
ufunc = <ufunc 'logical_or'>, method = 'any', axis = None, dtype = None
out = None, kwargs = {'keepdims': <no value>, 'where': <no value>}
passkwargs = {}

    def _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs):
        passkwargs = {k: v for k, v in kwargs.items()
                      if v is not np._NoValue}
    
        if type(obj) is not mu.ndarray:
            try:
                reduction = getattr(obj, method)
            except AttributeError:
                pass
            else:
                # This branch is needed for reductions like any which don't
                # support a dtype.
                if dtype is not None:
                    return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
                else:
                    return reduction(axis=axis, out=out, **passkwargs)
    
>       return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E       ValueError: invalid literal for int() with base 10: 'baz'

/usr/share/miniconda/envs/minimal/lib/python3.10/site-packages/numpy/core/fromnumeric.py:86: ValueError

Comment (Member):
Interesting. Looks like the error message and type changed from NumPy 1.21 to 1.22. Don't think that was intentional, but could be wrong. Raised issue ( numpy/numpy#20898 )

In any event this workaround seems reasonable in the interim
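For reference, the failure is reproducible with numpy alone; a minimal sketch of why the broader ``except`` clause is needed:

```python
import numpy as np

# Reducing a string array with logical_or raises, and the exception
# type has differed across NumPy releases (TypeError vs ValueError),
# hence all_equal now catches both.
arr = np.array(['baz', 'baz', 'bar'])
try:
    np.any(arr)
except (TypeError, ValueError) as exc:
    print(f'np.any raised {type(exc).__name__}')
```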

Comment (Member):
Great, thanks! Once this is green, I'll move forward with 2.11.0.

pass
if np.issubdtype(array.dtype, np.object_):
# we have to flatten the result of np.equal to handle outputs like
Expand Down