Commit f461eb7

jni, d-v-b, and joshmoore authored
Set write_empty_chunks to default to False (#853)
* Set write_empty_chunks to default to False
* Add release entry for write_empty_chunks default
* add Empty chunks section to tutorial.rst
* add benchmarky example
* proper formatting of code block
* Fix abstore deprecated strings
* Also catch ValueError in all_equal

  The call to `np.any(array)` in zarr.util.all_equal triggers the following ValueError:

  ```
  > return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  E ValueError: invalid literal for int() with base 10: 'baz'
  ```

  Extending the catch block allows test_array_with_categorize_filter to pass, but it's unclear if this points to a deeper issue.
* Add --timeout argument to all uses of pytest
* Pin fasteners to 0.16.3 (see #952)

Co-authored-by: Davis Vann Bennett <[email protected]>
Co-authored-by: Josh Moore <[email protected]>
Co-authored-by: jmoore <[email protected]>
1 parent 24dae27 commit f461eb7

13 files changed, with 119 additions and 32 deletions

.github/workflows/minimal.yml

+2 -2
@@ -25,11 +25,11 @@ jobs:
         run: |
           conda activate minimal
           python -m pip install .
-          pytest -svx
+          pytest -svx --timeout=300
       - name: Fixture generation
         shell: "bash -l {0}"
         run: |
           conda activate minimal
           rm -rf fixture/
-          pytest -svx zarr/tests/test_dim_separator.py zarr/tests/test_storage.py
+          pytest -svx --timeout=300 zarr/tests/test_dim_separator.py zarr/tests/test_storage.py
           # This simulates fixture-less tests in conda and debian packaging

.github/workflows/python-package.yml

+1 -1
@@ -73,7 +73,7 @@ jobs:
           conda activate zarr-env
           mkdir ~/blob_emulator
           azurite -l ~/blob_emulator --debug debug.log 2>&1 > stdouterr.log &
-          pytest --cov=zarr --cov-config=.coveragerc --doctest-plus --cov-report xml --cov=./
+          pytest --cov=zarr --cov-config=.coveragerc --doctest-plus --cov-report xml --cov=./ --timeout=300
       - uses: codecov/codecov-action@v1
         with:
           #token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos

docs/release.rst

+3
@@ -12,6 +12,9 @@ This release of Zarr Python introduces a new ``BaseStore`` class that all provid
 Enhancements
 ~~~~~~~~~~~~

+* write_empty_chunks defaults to False.
+  By :user:`Juan Nunez-Iglesias <jni>`; :issue:`853`.
+
 * Allow to assign array ``fill_values`` and update metadata accordingly. :issue:`662`

 * array indexing with [] (getitem and setitem) now supports fancy indexing.

docs/tutorial.rst

+69
@@ -1302,6 +1302,75 @@ bytes within chunks of an array may improve the compression ratio, depending on
 the structure of the data, the compression algorithm used, and which compression
 filters (e.g., byte-shuffle) have been applied.

+.. _tutorial_chunks_empty_chunks:
+
+Empty chunks
+~~~~~~~~~~~~
+
+As of version 2.11, it is possible to configure how Zarr handles the storage of
+chunks that are "empty" (i.e., every element in the chunk is equal to the array's fill value).
+When creating an array with ``write_empty_chunks=False`` (the default),
+Zarr will check whether a chunk is empty before compression and storage. If a chunk is empty,
+then Zarr does not store it, and instead deletes the chunk from storage
+if the chunk had been previously stored.
+
+This optimization prevents storing redundant objects and can speed up reads, but the cost is
+added computation during array writes, since the contents of
+each chunk must be compared to the fill value. These advantages are contingent on the content of the array.
+If you know that your data will form chunks that are almost always non-empty, then there is no advantage to the optimization described above.
+In this case, creating an array with ``write_empty_chunks=True`` will instruct Zarr to write every chunk without checking for emptiness.
+
+The following example illustrates the effect of the ``write_empty_chunks`` flag on
+the time required to write an array with different values::
+
+    >>> import zarr
+    >>> import numpy as np
+    >>> import time
+    >>> from tempfile import TemporaryDirectory
+    >>> def timed_write(write_empty_chunks):
+    ...     """
+    ...     Measure the time required and number of objects created when writing
+    ...     to a Zarr array with random ints or fill value.
+    ...     """
+    ...     chunks = (8192,)
+    ...     shape = (chunks[0] * 1024,)
+    ...     data = np.random.randint(0, 255, shape)
+    ...     dtype = 'uint8'
+    ...
+    ...     with TemporaryDirectory() as store:
+    ...         arr = zarr.open(store,
+    ...                         shape=shape,
+    ...                         chunks=chunks,
+    ...                         dtype=dtype,
+    ...                         write_empty_chunks=write_empty_chunks,
+    ...                         fill_value=0,
+    ...                         mode='w')
+    ...         # initialize all chunks
+    ...         arr[:] = 100
+    ...         result = []
+    ...         for value in (data, arr.fill_value):
+    ...             start = time.time()
+    ...             arr[:] = value
+    ...             elapsed = time.time() - start
+    ...             result.append((elapsed, arr.nchunks_initialized))
+    ...
+    ...     return result
+    >>> for write_empty_chunks in (True, False):
+    ...     full, empty = timed_write(write_empty_chunks)
+    ...     print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')
+
+    write_empty_chunks=True:
+        Random Data: 0.1252s, 1024 objects stored
+         Empty Data: 0.1060s, 1024 objects stored
+
+
+    write_empty_chunks=False:
+        Random Data: 0.1359s, 1024 objects stored
+         Empty Data: 0.0301s, 0 objects stored
+
+In this example, writing random data is slightly slower with ``write_empty_chunks=False``,
+but writing empty data is substantially faster and generates far fewer objects in storage.
+
 .. _tutorial_rechunking:

 Changing chunk shapes (rechunking)
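
The empty-chunk rule described in the tutorial text above can be sketched in plain Python. This is only an illustration under stated assumptions, not Zarr's implementation: `write_chunk` is a hypothetical helper, a `dict` stands in for the chunk store, and chunks are plain lists rather than NumPy arrays.

```python
# Minimal sketch of the empty-chunk rule; NOT Zarr's actual implementation.
# `write_chunk` is a hypothetical helper and `store` is a plain dict that
# stands in for a chunk store.

def write_chunk(store, key, chunk, fill_value, write_empty_chunks=False):
    """Store `chunk` under `key`, skipping chunks equal to the fill value."""
    is_empty = all(item == fill_value for item in chunk)
    if is_empty and not write_empty_chunks:
        # Do not store the chunk, and drop any previously stored copy.
        store.pop(key, None)
    else:
        store[key] = list(chunk)


store = {}
write_chunk(store, "0", [100, 100, 100], fill_value=0)  # non-empty: stored
write_chunk(store, "1", [0, 0, 0], fill_value=0)        # empty: never stored
write_chunk(store, "0", [0, 0, 0], fill_value=0)        # empty: old entry deleted
print(sorted(store))      # []

store_all = {}
write_chunk(store_all, "0", [0, 0, 0], fill_value=0, write_empty_chunks=True)
print(sorted(store_all))  # ['0']
```

The per-chunk `all(...)` comparison is the added write-time cost the tutorial mentions; it pays off only when a meaningful share of chunks turn out to be empty.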

environment.yml

+2 -1
@@ -8,6 +8,7 @@ dependencies:
   - pip
   - pip:
     - asciitree
-    - fasteners
+    - fasteners == 0.16.3
     - pytest
+    - pytest-timeout
     - setuptools_scm

requirements_dev_minimal.txt

+1 -1
@@ -1,6 +1,6 @@
 # library requirements
 asciitree==0.3.3
-fasteners==0.17.3
+fasteners==0.16.3
 numcodecs==0.9.1
 msgpack-python==0.5.6
 setuptools-scm==6.4.2

setup.py

+1 -1
@@ -9,7 +9,7 @@
 dependencies = [
     'asciitree',
     'numpy>=1.7',
-    'fasteners',
+    'fasteners==0.16.3',
     'numcodecs>=0.6.4',
 ]

tox.ini

+1
@@ -10,6 +10,7 @@ envlist = py37-npy{117,latest}, py38, py39, docs
 install_command = pip install --no-binary=numcodecs {opts} {packages}
 setenv =
     PYTHONHASHSEED = 42
+    PYTEST_TIMEOUT = {env:PYTEST_TIMEOUT:300}
 passenv =
     ZARR_TEST_ABS
     ZARR_TEST_MONGO
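
The `{env:PYTEST_TIMEOUT:300}` substitution above takes the value of `PYTEST_TIMEOUT` from the calling environment when it is set and falls back to `300` otherwise. A rough Python equivalent of that lookup, with `resolve_timeout` as a hypothetical helper name that is not part of tox or pytest-timeout:

```python
import os


def resolve_timeout(environ, default="300"):
    """Mimic {env:PYTEST_TIMEOUT:300}: use the variable if set, else the default."""
    return environ.get("PYTEST_TIMEOUT", default)


print(resolve_timeout({}))                        # 300
print(resolve_timeout({"PYTEST_TIMEOUT": "60"}))  # 60

# With the real process environment:
timeout_seconds = resolve_timeout(os.environ)
```

This keeps the 300-second limit as the default while still letting a developer or CI job override it from outside.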

windows_conda_dev.txt

+1 -1
@@ -1,5 +1,5 @@
 coverage
-fasteners
+fasteners==0.16.3
 flake8
 monotonic
 msgpack-python

zarr/_storage/absstore.py

+10
@@ -17,26 +17,36 @@ class ABSStore(Store):
     ----------
     container : string
         The name of the ABS container to use.
+
         .. deprecated::
            Use ``client`` instead.
+
     prefix : string
         Location of the "directory" to use as the root of the storage hierarchy
         within the container.
+
     account_name : string
         The Azure blob storage account name.
+
         .. deprecated:: 2.8.3
            Use ``client`` instead.
+
     account_key : string
         The Azure blob storage account access key.
+
         .. deprecated:: 2.8.3
            Use ``client`` instead.
+
     blob_service_kwargs : dictionary
         Extra arguments to be passed into the azure blob client, for e.g. when
         using the emulator, pass in blob_service_kwargs={'is_emulated': True}.
+
         .. deprecated:: 2.8.3
            Use ``client`` instead.
+
     dimension_separator : {'.', '/'}, optional
         Separator placed between the dimensions of a chunk.
+
     client : azure.storage.blob.ContainerClient, optional
         An ``azure.storage.blob.ContainerClient`` to connect with. See
         `here <https://docs.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python>`_ # noqa

zarr/core.py

+8 -8
@@ -81,13 +81,13 @@ class Array:
         .. versionadded:: 2.7

     write_empty_chunks : bool, optional
-        If True (default), all chunks will be stored regardless of their
-        contents. If False, each chunk is compared to the array's fill
-        value prior to storing. If a chunk is uniformly equal to the fill
-        value, then that chunk is not be stored, and the store entry for
-        that chunk's key is deleted. This setting enables sparser storage,
-        as only chunks with non-fill-value data are stored, at the expense
-        of overhead associated with checking the data of each chunk.
+        If True, all chunks will be stored regardless of their contents. If
+        False (default), each chunk is compared to the array's fill value prior
+        to storing. If a chunk is uniformly equal to the fill value, then that
+        chunk is not stored, and the store entry for that chunk's key is
+        deleted. This setting enables sparser storage, as only chunks with
+        non-fill-value data are stored, at the expense of overhead associated
+        with checking the data of each chunk.

         .. versionadded:: 2.11

@@ -154,7 +154,7 @@ def __init__(
         cache_metadata=True,
         cache_attrs=True,
         partial_decompress=False,
-        write_empty_chunks=True,
+        write_empty_chunks=False,
     ):
         # N.B., expect at this point store is fully initialized with all
         # configuration metadata fully specified and normalized

zarr/creation.py

+19 -16
@@ -19,7 +19,8 @@ def create(shape, chunks=True, dtype=None, compressor='default',
            fill_value=0, order='C', store=None, synchronizer=None,
            overwrite=False, path=None, chunk_store=None, filters=None,
            cache_metadata=True, cache_attrs=True, read_only=False,
-           object_codec=None, dimension_separator=None, write_empty_chunks=True, **kwargs):
+           object_codec=None, dimension_separator=None,
+           write_empty_chunks=False, **kwargs):
     """Create an array.

     Parameters
@@ -72,13 +73,14 @@ def create(shape, chunks=True, dtype=None, compressor='default',
         .. versionadded:: 2.8

     write_empty_chunks : bool, optional
-        If True (default), all chunks will be stored regardless of their
-        contents. If False, each chunk is compared to the array's fill
-        value prior to storing. If a chunk is uniformly equal to the fill
-        value, then that chunk is not be stored, and the store entry for
-        that chunk's key is deleted. This setting enables sparser storage,
-        as only chunks with non-fill-value data are stored, at the expense
-        of overhead associated with checking the data of each chunk.
+        If True, all chunks will be stored regardless of their contents. If
+        False (default), each chunk is compared to the array's fill value prior
+        to storing. If a chunk is uniformly equal to the fill value, then that
+        chunk is not stored, and the store entry for that chunk's key is
+        deleted. This setting enables sparser storage, as only chunks with
+        non-fill-value data are stored, at the expense of overhead associated
+        with checking the data of each chunk.
+        .. versionadded:: 2.11


     Returns
@@ -389,7 +391,7 @@ def open_array(
     chunk_store=None,
     storage_options=None,
     partial_decompress=False,
-    write_empty_chunks=True,
+    write_empty_chunks=False,
     **kwargs
 ):
     """Open an array using file-mode-like semantics.
@@ -445,13 +447,14 @@ def open_array(
         is Blosc, when getting data from the array chunks will be partially
         read and decompressed when possible.
     write_empty_chunks : bool, optional
-        If True (default), all chunks will be stored regardless of their
-        contents. If False, each chunk is compared to the array's fill
-        value prior to storing. If a chunk is uniformly equal to the fill
-        value, then that chunk is not be stored, and the store entry for
-        that chunk's key is deleted. This setting enables sparser storage,
-        as only chunks with non-fill-value data are stored, at the expense
-        of overhead associated with checking the data of each chunk.
+        If True, all chunks will be stored regardless of their contents. If
+        False (default), each chunk is compared to the array's fill value prior
+        to storing. If a chunk is uniformly equal to the fill value, then that
+        chunk is not stored, and the store entry for that chunk's key is
+        deleted. This setting enables sparser storage, as only chunks with
+        non-fill-value data are stored, at the expense of overhead associated
+        with checking the data of each chunk.
+        .. versionadded:: 2.11

     Returns
     -------

zarr/util.py

+1 -1
@@ -670,7 +670,7 @@ def all_equal(value: Any, array: Any):
         # optimized to return on the first truthy value in `array`.
         try:
             return not np.any(array)
-        except TypeError:  # pragma: no cover
+        except (TypeError, ValueError):  # pragma: no cover
             pass
     if np.issubdtype(array.dtype, np.object_):
         # we have to flatten the result of np.equal to handle outputs like
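
The pattern in this hunk is a fast path that may raise, followed by a slower general comparison. The sketch below shows that shape in plain Python. It is not `zarr.util.all_equal`: `all_equal_sketch` is a hypothetical name, lists replace NumPy arrays, and `int()` merely stands in for a reduction that rejects some element types, as in the `invalid literal for int()` error quoted in the commit message.

```python
# Illustrative sketch only; `all_equal_sketch` is a hypothetical helper,
# not zarr.util.all_equal, and it works on plain lists instead of arrays.

def all_equal_sketch(value, items):
    """Return True if every element of `items` equals `value`."""
    if not value:
        # Fast path for a falsy `value`: a single truthy element is enough
        # to answer False. int() stands in for a reduction that may reject
        # some element types.
        try:
            return not any(int(item) for item in items)
        except (TypeError, ValueError):
            # Fall through to the general comparison below.
            pass
    return all(item == value for item in items)


print(all_equal_sketch(0, [0, 0, 0]))       # True
print(all_equal_sketch(0, [0, 1, 0]))       # False
print(all_equal_sketch(0, ["foo", "baz"]))  # False, via the fallback
print(all_equal_sketch("", ["", ""]))       # True, via the fallback
```

Catching only `TypeError` would let the `ValueError` from the string inputs escape, which mirrors the failure the commit message describes.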
