Default write empty chunks false #951

Closed
4 changes: 2 additions & 2 deletions .github/workflows/minimal.yml
@@ -25,11 +25,11 @@ jobs:
run: |
conda activate minimal
python -m pip install .
pytest -svx
pytest -svx --timeout=300
- name: Fixture generation
shell: "bash -l {0}"
run: |
conda activate minimal
rm -rf fixture/
pytest -svx zarr/tests/test_dim_separator.py zarr/tests/test_storage.py
pytest -svx --timeout=300 zarr/tests/test_dim_separator.py zarr/tests/test_storage.py
# This simulates fixture-less tests in conda and debian packaging
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -73,7 +73,7 @@ jobs:
conda activate zarr-env
mkdir ~/blob_emulator
azurite -l ~/blob_emulator --debug debug.log 2>&1 > stdouterr.log &
pytest --cov=zarr --cov-config=.coveragerc --doctest-plus --cov-report xml --cov=./
pytest --cov=zarr --cov-config=.coveragerc --doctest-plus --cov-report xml --cov=./ --timeout=300
- uses: codecov/codecov-action@v1
with:
#token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
3 changes: 3 additions & 0 deletions docs/release.rst
@@ -9,6 +9,9 @@ Unreleased
Enhancements
~~~~~~~~~~~~

* ``write_empty_chunks`` defaults to ``False``.
By :user:`Juan Nunez-Iglesias <jni>`; :issue:`853`.

* Allow to assign array ``fill_values`` and update metadata accordingly. :issue:`662`

* array indexing with [] (getitem and setitem) now supports fancy indexing.
69 changes: 69 additions & 0 deletions docs/tutorial.rst
@@ -1302,6 +1302,75 @@ bytes within chunks of an array may improve the compression ratio, depending on
the structure of the data, the compression algorithm used, and which compression
filters (e.g., byte-shuffle) have been applied.

.. _tutorial_chunks_empty_chunks:

Empty chunks
~~~~~~~~~~~~

As of version 2.11, it is possible to configure how Zarr handles the storage of
chunks that are "empty" (i.e., every element in the chunk is equal to the array's fill value).
When creating an array with ``write_empty_chunks=False`` (the default),
Zarr will check whether a chunk is empty before compression and storage. If a chunk is empty,
then Zarr does not store it, and instead deletes the chunk from storage
if the chunk had been previously stored.

This optimization prevents storing redundant objects and can speed up reads, at the cost of
added computation during array writes, since the contents of
each chunk must be compared to the fill value. Whether the trade-off pays off depends on the content of the array.
If you know that your data will form chunks that are almost always non-empty, then the check is pure overhead and the optimization offers no advantage.
In this case, creating an array with ``write_empty_chunks=True`` instructs Zarr to write every chunk without checking for emptiness.

The following example illustrates the effect of the ``write_empty_chunks`` flag on
the time required to write an array with different values::

    >>> import zarr
    >>> import numpy as np
    >>> import time
    >>> from tempfile import TemporaryDirectory
    >>> def timed_write(write_empty_chunks):
    ...     """
    ...     Measure the time required and number of objects created when writing
    ...     to a Zarr array with random ints or fill value.
    ...     """
    ...     chunks = (8192,)
    ...     shape = (chunks[0] * 1024,)
    ...     data = np.random.randint(0, 255, shape)
    ...     dtype = 'uint8'
    ...
    ...     with TemporaryDirectory() as store:
    ...         arr = zarr.open(store,
    ...                         shape=shape,
    ...                         chunks=chunks,
    ...                         dtype=dtype,
    ...                         write_empty_chunks=write_empty_chunks,
    ...                         fill_value=0,
    ...                         mode='w')
    ...         # initialize all chunks
    ...         arr[:] = 100
    ...         result = []
    ...         for value in (data, arr.fill_value):
    ...             start = time.time()
    ...             arr[:] = value
    ...             elapsed = time.time() - start
    ...             result.append((elapsed, arr.nchunks_initialized))
    ...
    ...         return result
    >>> for write_empty_chunks in (True, False):
    ...     full, empty = timed_write(write_empty_chunks)
    ...     print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')

    write_empty_chunks=True:
        Random Data: 0.1252s, 1024 objects stored
         Empty Data: 0.1060s, 1024 objects stored


    write_empty_chunks=False:
        Random Data: 0.1359s, 1024 objects stored
         Empty Data: 0.0301s, 0 objects stored

In this example, writing random data is slightly slower with ``write_empty_chunks=False``,
but writing empty data is substantially faster and generates far fewer objects in storage.
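
The skip-or-delete behaviour described in this section can also be sketched in plain Python. This is only an illustration under stated assumptions, not Zarr's implementation: the store is an ordinary ``dict``, a chunk is a list of small integers, and ``write_chunk`` is a hypothetical helper name.

```python
def write_chunk(store, key, chunk, fill_value, write_empty_chunks=False):
    """Hypothetical helper: store `chunk` under `key`, or skip it if empty.

    Returns True if the chunk was written, False if it was skipped.
    """
    # The emptiness check only runs when write_empty_chunks is False,
    # which is where the extra write-time cost comes from.
    if not write_empty_chunks and all(v == fill_value for v in chunk):
        # Do not store an empty chunk; delete any previously stored copy.
        store.pop(key, None)
        return False
    store[key] = bytes(chunk)
    return True


store = {}
write_chunk(store, "0", [1, 2, 3], fill_value=0)  # non-empty: stored
write_chunk(store, "1", [0, 0, 0], fill_value=0)  # empty: skipped
write_chunk(store, "0", [0, 0, 0], fill_value=0)  # now empty: stored copy deleted
write_chunk(store, "2", [0, 0, 0], fill_value=0, write_empty_chunks=True)  # stored anyway
print(sorted(store))  # → ['2']
```

A reader that finds no entry for a key would fall back to the fill value, which is why dropping empty chunks loses no information in this model.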

.. _tutorial_rechunking:

Changing chunk shapes (rechunking)
5 changes: 3 additions & 2 deletions environment.yml
@@ -4,10 +4,11 @@ channels:
dependencies:
- wheel
- numcodecs >= 0.6.4
- numpy >= 1.7
- numpy == 1.20.3
- pip
- pip:
- asciitree
- fasteners
- fasteners == 0.16.3
- pytest
- pytest-timeout
- setuptools_scm
2 changes: 1 addition & 1 deletion requirements_dev_minimal.txt
@@ -1,6 +1,6 @@
# library requirements
asciitree==0.3.3
fasteners==0.17.3
fasteners==0.16.3
numcodecs==0.9.1
msgpack-python==0.5.6
setuptools-scm==6.4.2
2 changes: 1 addition & 1 deletion requirements_dev_numpy.txt
@@ -1,4 +1,4 @@
# Break this out into a separate file to allow testing against
# different versions of numpy. This file should pin to the latest
# numpy version.
numpy==1.22.1
numpy==1.22.0
4 changes: 2 additions & 2 deletions setup.py
@@ -8,8 +8,8 @@

dependencies = [
'asciitree',
'numpy>=1.7',
'fasteners',
'numpy==1.22.0',
'fasteners==0.16.3',
'numcodecs>=0.6.4',
]

1 change: 1 addition & 0 deletions tox.ini
@@ -10,6 +10,7 @@ envlist = py37-npy{117,latest}, py38, py39, docs
install_command = pip install --no-binary=numcodecs {opts} {packages}
setenv =
PYTHONHASHSEED = 42
PYTEST_TIMEOUT = {env:PYTEST_TIMEOUT:300}
passenv =
ZARR_TEST_ABS
ZARR_TEST_MONGO
4 changes: 2 additions & 2 deletions windows_conda_dev.txt
@@ -1,9 +1,9 @@
coverage
fasteners
fasteners==0.16.3
flake8
monotonic
msgpack-python
numcodecs
numpy
numpy==1.22.0
setuptools_scm
twine
10 changes: 10 additions & 0 deletions zarr/_storage/absstore.py
@@ -17,26 +17,36 @@ class ABSStore(Store):
----------
container : string
The name of the ABS container to use.

.. deprecated::
Use ``client`` instead.

prefix : string
Location of the "directory" to use as the root of the storage hierarchy
within the container.

account_name : string
The Azure blob storage account name.

.. deprecated:: 2.8.3
Use ``client`` instead.

account_key : string
The Azure blob storage account access key.

.. deprecated:: 2.8.3
Use ``client`` instead.

blob_service_kwargs : dictionary
Extra arguments to be passed into the azure blob client, e.g., when
using the emulator, pass in blob_service_kwargs={'is_emulated': True}.

.. deprecated:: 2.8.3
Use ``client`` instead.

dimension_separator : {'.', '/'}, optional
Separator placed between the dimensions of a chunk.

client : azure.storage.blob.ContainerClient, optional
An ``azure.storage.blob.ContainerClient`` to connect with. See
`here <https://docs.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python>`_ # noqa
16 changes: 8 additions & 8 deletions zarr/core.py
@@ -81,13 +81,13 @@ class Array:
.. versionadded:: 2.7

write_empty_chunks : bool, optional
If True (default), all chunks will be stored regardless of their
contents. If False, each chunk is compared to the array's fill
value prior to storing. If a chunk is uniformly equal to the fill
value, then that chunk is not be stored, and the store entry for
that chunk's key is deleted. This setting enables sparser storage,
as only chunks with non-fill-value data are stored, at the expense
of overhead associated with checking the data of each chunk.
If True, all chunks will be stored regardless of their contents. If
False (default), each chunk is compared to the array's fill value prior
to storing. If a chunk is uniformly equal to the fill value, then that
chunk is not stored, and the store entry for that chunk's key is
deleted. This setting enables sparser storage, as only chunks with
non-fill-value data are stored, at the expense of overhead associated
with checking the data of each chunk.

.. versionadded:: 2.11

@@ -154,7 +154,7 @@ def __init__(
cache_metadata=True,
cache_attrs=True,
partial_decompress=False,
write_empty_chunks=True,
write_empty_chunks=False,
):
# N.B., expect at this point store is fully initialized with all
# configuration metadata fully specified and normalized
35 changes: 19 additions & 16 deletions zarr/creation.py
@@ -19,7 +19,8 @@ def create(shape, chunks=True, dtype=None, compressor='default',
fill_value=0, order='C', store=None, synchronizer=None,
overwrite=False, path=None, chunk_store=None, filters=None,
cache_metadata=True, cache_attrs=True, read_only=False,
object_codec=None, dimension_separator=None, write_empty_chunks=True, **kwargs):
object_codec=None, dimension_separator=None,
write_empty_chunks=False, **kwargs):
"""Create an array.

Parameters
@@ -72,13 +73,14 @@ def create(shape, chunks=True, dtype=None, compressor='default',
.. versionadded:: 2.8

write_empty_chunks : bool, optional
If True (default), all chunks will be stored regardless of their
contents. If False, each chunk is compared to the array's fill
value prior to storing. If a chunk is uniformly equal to the fill
value, then that chunk is not be stored, and the store entry for
that chunk's key is deleted. This setting enables sparser storage,
as only chunks with non-fill-value data are stored, at the expense
of overhead associated with checking the data of each chunk.
If True, all chunks will be stored regardless of their contents. If
False (default), each chunk is compared to the array's fill value prior
to storing. If a chunk is uniformly equal to the fill value, then that
chunk is not stored, and the store entry for that chunk's key is
deleted. This setting enables sparser storage, as only chunks with
non-fill-value data are stored, at the expense of overhead associated
with checking the data of each chunk.

.. versionadded:: 2.11


Returns
@@ -389,7 +391,7 @@ def open_array(
chunk_store=None,
storage_options=None,
partial_decompress=False,
write_empty_chunks=True,
write_empty_chunks=False,
**kwargs
):
"""Open an array using file-mode-like semantics.
@@ -445,13 +447,14 @@ def open_array(
is Blosc, when getting data from the array chunks will be partially
read and decompressed when possible.
write_empty_chunks : bool, optional
If True (default), all chunks will be stored regardless of their
contents. If False, each chunk is compared to the array's fill
value prior to storing. If a chunk is uniformly equal to the fill
value, then that chunk is not be stored, and the store entry for
that chunk's key is deleted. This setting enables sparser storage,
as only chunks with non-fill-value data are stored, at the expense
of overhead associated with checking the data of each chunk.
If True, all chunks will be stored regardless of their contents. If
False (default), each chunk is compared to the array's fill value prior
to storing. If a chunk is uniformly equal to the fill value, then that
chunk is not stored, and the store entry for that chunk's key is
deleted. This setting enables sparser storage, as only chunks with
non-fill-value data are stored, at the expense of overhead associated
with checking the data of each chunk.

.. versionadded:: 2.11

Returns
-------
2 changes: 1 addition & 1 deletion zarr/util.py
@@ -670,7 +670,7 @@ def all_equal(value: Any, array: Any):
# optimized to return on the first truthy value in `array`.
try:
return not np.any(array)
except TypeError: # pragma: no cover
except (TypeError, ValueError): # pragma: no cover
pass
if np.issubdtype(array.dtype, np.object_):
# we have to flatten the result of np.equal to handle outputs like
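
The widened ``except`` clause in this hunk can be illustrated without NumPy. The sketch below is a plain-Python analogue, not the code in ``zarr/util.py``: ``all_equal_sketch`` and ``Ambiguous`` are hypothetical names, and it assumes that the fast path applies only to a falsy ``value`` and that some elements may raise ``ValueError`` when asked for a truth value.

```python
def all_equal_sketch(value, items):
    """Hypothetical stand-in: is every element of `items` equal to `value`?"""
    if not value:
        # Fast path for a falsy value (0, 0.0, False): any() stops at the
        # first truthy element. Assumes falsy elements equal `value`.
        try:
            return not any(items)
        except (TypeError, ValueError):
            # Some element has no single truth value; fall through to the
            # slower element-wise comparison instead of failing.
            pass
    return all(item == value for item in items)


class Ambiguous:
    """Hypothetical element whose truth value is undefined."""

    def __bool__(self):
        raise ValueError("truth value is ambiguous")

    def __eq__(self, other):
        return False


print(all_equal_sketch(0, [0, 0, 0]))         # → True
print(all_equal_sketch(0, [0, Ambiguous()]))  # → False
```

Catching only ``TypeError`` would let the second call raise; catching both lets the check fall back to element-wise comparison.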