
Commit df18805

Adds documentation for sharding and sharding in Array.info (#2644)
1 parent 0adcbe7 commit df18805

7 files changed (+220 lines, -61 lines)

docs/user-guide/arrays.rst

Lines changed: 35 additions & 2 deletions
@@ -574,8 +574,41 @@ Any combination of integer and slice can be used for block indexing::
 Sharding
 --------
 
-Coming soon.
-
+Using small chunk shapes in very large arrays can lead to a huge number of chunks.
+This can become a performance issue for file systems and object storage.
+With Zarr format 3, a new sharding feature has been added to address this issue.
+
+With sharding, multiple chunks can be stored in a single storage object (e.g. a file).
+Within a shard, chunks are compressed and serialized separately.
+This allows individual chunks to be read independently.
+However, when writing data, a full shard must be written in one go for optimal
+performance and to avoid concurrency issues.
+That means that shards are the units of writing and chunks are the units of reading.
+Users need to configure the chunk and shard shapes accordingly.
+
+Sharded arrays can be created by providing the ``shards`` parameter to :func:`zarr.create_array`.
+
+>>> a = zarr.create_array('data/example-20.zarr', shape=(10000, 10000), shards=(1000, 1000), chunks=(100, 100), dtype='uint8')
+>>> a[:] = (np.arange(10000 * 10000) % 256).astype('uint8').reshape(10000, 10000)
+>>> a.info_complete()
+Type : Array
+Zarr format : 3
+Data type : DataType.uint8
+Shape : (10000, 10000)
+Shard shape : (1000, 1000)
+Chunk shape : (100, 100)
+Order : C
+Read-only : False
+Store type : LocalStore
+Codecs : [{'chunk_shape': (100, 100), 'codecs': ({'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}), 'index_codecs': ({'endian': <Endian.little: 'little'>}, {}), 'index_location': <ShardingCodecIndexLocation.end: 'end'>}]
+No. bytes : 100000000 (95.4M)
+No. bytes stored : 3981060
+Storage ratio : 25.1
+Chunks Initialized : 100
+
+In this example, a shard shape of (1000, 1000) and a chunk shape of (100, 100) are used.
+This means that 10*10 chunks are stored in each shard, and there are 10*10 shards in total.
+Without the ``shards`` argument, there would be 10,000 chunks stored as individual files.
 
 Missing features in 3.0
 -----------------------
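Note: as a quick illustration of the read/write asymmetry described in the new docs above, here is a standalone Python sketch (in-memory store and small made-up shapes; not part of the committed docs) that writes one full shard and then reads a single chunk back:

    import numpy as np
    import zarr

    # In-memory store; each 4x4 shard holds four 2x2 chunks.
    a = zarr.create_array(store={}, shape=(8, 8), shards=(4, 4), chunks=(2, 2), dtype="uint8")

    # Shards are the unit of writing: fill a whole shard in one assignment.
    a[0:4, 0:4] = np.arange(16, dtype="uint8").reshape(4, 4)

    # Chunks are the unit of reading: a single 2x2 chunk can be read on its own.
    print(a[0:2, 0:2])
    print(a.shards, a.chunks)  # (4, 4) (2, 2)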

docs/user-guide/performance.rst

Lines changed: 39 additions & 0 deletions
@@ -62,6 +62,45 @@ will be one single chunk for the array::
 >>> z5.chunks
 (10000, 10000)
 
+
+Sharding
+~~~~~~~~
+
+If you have large arrays but need small chunks to efficiently access the data, you can
+use sharding. Sharding provides a mechanism to store multiple chunks in a single
+storage object or file. This can be useful because traditional file systems and object
+storage systems may have performance issues storing and accessing many files.
+Additionally, small files can be inefficient to store if they are smaller than the
+block size of the file system.
+
+Picking a good combination of chunk shape and shard shape is important for performance.
+The chunk shape determines what unit of your data can be read independently, while the
+shard shape determines what unit of your data can be written efficiently.
+
+As an example, suppose you have a 100 GB array and need to read small chunks of 1 MB.
+Without sharding, each chunk would be one file, resulting in 100,000 files. That can
+already cause performance issues on some file systems.
+With sharding, you could use a shard size of 1 GB. This would result in 1000 chunks per
+file and 100 files in total, which is manageable for most storage systems.
+You would still be able to read each 1 MB chunk independently, but you would need to
+write your data in 1 GB increments.
+
+To use sharding, you need to specify the ``shards`` parameter when creating the array.
+
+>>> z6 = zarr.create_array(store={}, shape=(10000, 10000, 1000), shards=(1000, 1000, 1000), chunks=(100, 100, 100), dtype='uint8')
+>>> z6.info
+Type : Array
+Zarr format : 3
+Data type : DataType.uint8
+Shape : (10000, 10000, 1000)
+Shard shape : (1000, 1000, 1000)
+Chunk shape : (100, 100, 100)
+Order : C
+Read-only : False
+Store type : MemoryStore
+Codecs : [{'chunk_shape': (100, 100, 100), 'codecs': ({'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}), 'index_codecs': ({'endian': <Endian.little: 'little'>}, {}), 'index_location': <ShardingCodecIndexLocation.end: 'end'>}]
+No. bytes : 100000000000 (93.1G)
+
 .. _user-guide-chunks-order:
 
 Chunk memory layout
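Note: the arithmetic behind the 100 GB example above can be checked directly; this sketch assumes decimal units (1 GB = 10**9 bytes) and one byte per element, which matches the numbers quoted in the docs:

    # Decimal units, one byte per element, as in the prose above.
    array_bytes = 100 * 10**9  # 100 GB array
    chunk_bytes = 1 * 10**6    # 1 MB chunks
    shard_bytes = 1 * 10**9    # 1 GB shards

    print(array_bytes // chunk_bytes)  # 100000 chunks -> 100,000 files without sharding
    print(shard_bytes // chunk_bytes)  # 1000 chunks stored together in each shard
    print(array_bytes // shard_bytes)  # 100 shard files in total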

src/zarr/core/_info.py

Lines changed: 8 additions & 1 deletion
@@ -80,6 +80,7 @@ class ArrayInfo:
     _zarr_format: ZarrFormat
     _data_type: np.dtype[Any] | DataType
     _shape: tuple[int, ...]
+    _shard_shape: tuple[int, ...] | None = None
     _chunk_shape: tuple[int, ...] | None = None
     _order: Literal["C", "F"]
     _read_only: bool
@@ -96,7 +97,13 @@ def __repr__(self) -> str:
             Type : {_type}
             Zarr format : {_zarr_format}
             Data type : {_data_type}
-            Shape : {_shape}
+            Shape : {_shape}""")
+
+        if self._shard_shape is not None:
+            template += textwrap.dedent("""
+            Shard shape : {_shard_shape}""")
+
+        template += textwrap.dedent("""
             Chunk shape : {_chunk_shape}
             Order : {_order}
             Read-only : {_read_only}
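Note: a simplified, self-contained sketch of the template-building pattern used in this change (stand-in class and field names, not the real ArrayInfo), showing how a "Shard shape" line is appended only when a shard shape is present:

    import textwrap
    from dataclasses import dataclass

    @dataclass
    class InfoSketch:
        # Stand-in for ArrayInfo with just the fields needed for the demo.
        shape: tuple[int, ...]
        chunk_shape: tuple[int, ...]
        shard_shape: tuple[int, ...] | None = None

        def __repr__(self) -> str:
            template = textwrap.dedent("""\
                Shape : {shape}""")
            if self.shard_shape is not None:
                # Only sharded arrays get a "Shard shape" line.
                template += textwrap.dedent("""
                Shard shape : {shard_shape}""")
            template += textwrap.dedent("""
                Chunk shape : {chunk_shape}""")
            return template.format(
                shape=self.shape, chunk_shape=self.chunk_shape, shard_shape=self.shard_shape
            )

    print(InfoSketch(shape=(8, 8), chunk_shape=(2, 2)))
    print(InfoSketch(shape=(8, 8), chunk_shape=(2, 2), shard_shape=(4, 4)))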

src/zarr/core/array.py

Lines changed: 2 additions & 8 deletions
@@ -1573,14 +1573,8 @@ def _info(
         else:
             kwargs["_codecs"] = self.metadata.codecs
             kwargs["_data_type"] = self.metadata.data_type
-            # just regular?
-            chunk_grid = self.metadata.chunk_grid
-            if isinstance(chunk_grid, RegularChunkGrid):
-                kwargs["_chunk_shape"] = chunk_grid.chunk_shape
-            else:
-                raise NotImplementedError(
-                    "'info' is not yet implemented for chunk grids of type {type(self.metadata.chunk_grid)}"
-                )
+            kwargs["_chunk_shape"] = self.chunks
+            kwargs["_shard_shape"] = self.shards
 
         return ArrayInfo(
             _zarr_format=self.metadata.zarr_format,
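Note: the simplified code relies on the array's ``chunks`` and ``shards`` properties; a small illustrative sketch (made-up shapes) of what they report for unsharded and sharded arrays:

    import zarr

    # Unsharded: only a chunk shape is defined, so `shards` is None.
    plain = zarr.create_array(store={}, shape=(8, 8), chunks=(2, 2), dtype="f8")
    print(plain.chunks, plain.shards)  # (2, 2) None

    # Sharded: both the shard shape and the inner chunk shape are reported.
    sharded = zarr.create_array(store={}, shape=(8, 8), chunks=(2, 2), shards=(4, 4), dtype="f8")
    print(sharded.chunks, sharded.shards)  # (2, 2) (4, 4)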

src/zarr/core/codec_pipeline.py

Lines changed: 15 additions & 6 deletions
@@ -332,12 +332,21 @@ async def write_batch(
         drop_axes: tuple[int, ...] = (),
     ) -> None:
         if self.supports_partial_encode:
-            await self.encode_partial_batch(
-                [
-                    (byte_setter, value[out_selection], chunk_selection, chunk_spec)
-                    for byte_setter, chunk_spec, chunk_selection, out_selection in batch_info
-                ],
-            )
+            # Pass scalar values as is
+            if len(value.shape) == 0:
+                await self.encode_partial_batch(
+                    [
+                        (byte_setter, value, chunk_selection, chunk_spec)
+                        for byte_setter, chunk_spec, chunk_selection, out_selection in batch_info
+                    ],
+                )
+            else:
+                await self.encode_partial_batch(
+                    [
+                        (byte_setter, value[out_selection], chunk_selection, chunk_spec)
+                        for byte_setter, chunk_spec, chunk_selection, out_selection in batch_info
+                    ],
+                )
 
         else:
             # Read existing bytes if not total slice
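Note: judging from the updated tests, the user-facing operation that exercises the new scalar branch is assigning a plain scalar into part of a sharded (partial-encode-capable) array; a minimal sketch of both write paths (illustrative shapes only):

    import numpy as np
    import zarr

    # A sharded array, so writes go through the partial-encode path of the pipeline.
    arr = zarr.create_array(store={}, shape=(8, 8), chunks=(2, 2), shards=(4, 4), dtype="f8")

    # Scalar assignment: this appears to be the case the new
    # `len(value.shape) == 0` branch passes through unchanged.
    arr[:4, :4] = 10

    # Array assignment of matching shape takes the existing
    # `value[out_selection]` branch.
    arr[4:8, 4:8] = np.ones((4, 4))

    print(arr[0, 0], arr[7, 7])  # 10.0 1.0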

tests/test_array.py

Lines changed: 92 additions & 44 deletions
@@ -20,6 +20,7 @@
     VLenUTF8Codec,
     ZstdCodec,
 )
+from zarr.codecs.sharding import ShardingCodec
 from zarr.core._info import ArrayInfo
 from zarr.core.array import (
     CompressorsLike,
@@ -478,121 +479,168 @@ def test_update_attrs(zarr_format: ZarrFormat) -> None:
     assert arr2.attrs["foo"] == "bar"
 
 
+@pytest.mark.parametrize(("chunks", "shards"), [((2, 2), None), ((2, 2), (4, 4))])
 class TestInfo:
-    def test_info_v2(self) -> None:
-        arr = zarr.create(shape=(4, 4), chunks=(2, 2), zarr_format=2)
+    def test_info_v2(self, chunks: tuple[int, int], shards: tuple[int, int] | None) -> None:
+        arr = zarr.create_array(store={}, shape=(8, 8), dtype="f8", chunks=chunks, zarr_format=2)
         result = arr.info
         expected = ArrayInfo(
             _zarr_format=2,
             _data_type=np.dtype("float64"),
-            _shape=(4, 4),
-            _chunk_shape=(2, 2),
+            _shape=(8, 8),
+            _chunk_shape=chunks,
+            _shard_shape=None,
             _order="C",
             _read_only=False,
             _store_type="MemoryStore",
-            _count_bytes=128,
+            _count_bytes=512,
             _compressor=numcodecs.Zstd(),
         )
         assert result == expected
 
-    def test_info_v3(self) -> None:
-        arr = zarr.create(shape=(4, 4), chunks=(2, 2), zarr_format=3)
+    def test_info_v3(self, chunks: tuple[int, int], shards: tuple[int, int] | None) -> None:
+        arr = zarr.create_array(store={}, shape=(8, 8), dtype="f8", chunks=chunks, shards=shards)
         result = arr.info
         expected = ArrayInfo(
             _zarr_format=3,
             _data_type=DataType.parse("float64"),
-            _shape=(4, 4),
-            _chunk_shape=(2, 2),
+            _shape=(8, 8),
+            _chunk_shape=chunks,
+            _shard_shape=shards,
             _order="C",
             _read_only=False,
             _store_type="MemoryStore",
-            _codecs=[BytesCodec(), ZstdCodec()],
-            _count_bytes=128,
+            _codecs=[BytesCodec(), ZstdCodec()]
+            if shards is None
+            else [ShardingCodec(chunk_shape=chunks, codecs=[BytesCodec(), ZstdCodec()])],
+            _count_bytes=512,
         )
         assert result == expected
 
-    def test_info_complete(self) -> None:
-        arr = zarr.create(shape=(4, 4), chunks=(2, 2), zarr_format=3, codecs=[BytesCodec()])
+    def test_info_complete(self, chunks: tuple[int, int], shards: tuple[int, int] | None) -> None:
+        arr = zarr.create_array(
+            store={},
+            shape=(8, 8),
+            dtype="f8",
+            chunks=chunks,
+            shards=shards,
+            compressors=(),
+        )
         result = arr.info_complete()
         expected = ArrayInfo(
             _zarr_format=3,
             _data_type=DataType.parse("float64"),
-            _shape=(4, 4),
-            _chunk_shape=(2, 2),
+            _shape=(8, 8),
+            _chunk_shape=chunks,
+            _shard_shape=shards,
             _order="C",
             _read_only=False,
            _store_type="MemoryStore",
-            _codecs=[BytesCodec()],
-            _count_bytes=128,
+            _codecs=[BytesCodec()] if shards is None else [ShardingCodec(chunk_shape=chunks)],
+            _count_bytes=512,
             _count_chunks_initialized=0,
-            _count_bytes_stored=373,  # the metadata?
+            _count_bytes_stored=373 if shards is None else 578,  # the metadata?
         )
         assert result == expected
 
-        arr[:2, :2] = 10
+        arr[:4, :4] = 10
         result = arr.info_complete()
-        expected = dataclasses.replace(
-            expected, _count_chunks_initialized=1, _count_bytes_stored=405
-        )
+        if shards is None:
+            expected = dataclasses.replace(
+                expected, _count_chunks_initialized=4, _count_bytes_stored=501
+            )
+        else:
+            expected = dataclasses.replace(
+                expected, _count_chunks_initialized=1, _count_bytes_stored=774
+            )
         assert result == expected
 
-    async def test_info_v2_async(self) -> None:
-        arr = await zarr.api.asynchronous.create(shape=(4, 4), chunks=(2, 2), zarr_format=2)
+    async def test_info_v2_async(
+        self, chunks: tuple[int, int], shards: tuple[int, int] | None
+    ) -> None:
+        arr = await zarr.api.asynchronous.create_array(
+            store={}, shape=(8, 8), dtype="f8", chunks=chunks, zarr_format=2
+        )
         result = arr.info
         expected = ArrayInfo(
             _zarr_format=2,
             _data_type=np.dtype("float64"),
-            _shape=(4, 4),
+            _shape=(8, 8),
             _chunk_shape=(2, 2),
+            _shard_shape=None,
             _order="C",
             _read_only=False,
             _store_type="MemoryStore",
-            _count_bytes=128,
+            _count_bytes=512,
             _compressor=numcodecs.Zstd(),
         )
         assert result == expected
 
-    async def test_info_v3_async(self) -> None:
-        arr = await zarr.api.asynchronous.create(shape=(4, 4), chunks=(2, 2), zarr_format=3)
+    async def test_info_v3_async(
+        self, chunks: tuple[int, int], shards: tuple[int, int] | None
+    ) -> None:
+        arr = await zarr.api.asynchronous.create_array(
+            store={},
+            shape=(8, 8),
+            dtype="f8",
+            chunks=chunks,
+            shards=shards,
+        )
         result = arr.info
         expected = ArrayInfo(
             _zarr_format=3,
             _data_type=DataType.parse("float64"),
-            _shape=(4, 4),
-            _chunk_shape=(2, 2),
+            _shape=(8, 8),
+            _chunk_shape=chunks,
+            _shard_shape=shards,
             _order="C",
             _read_only=False,
             _store_type="MemoryStore",
-            _codecs=[BytesCodec(), ZstdCodec()],
-            _count_bytes=128,
+            _codecs=[BytesCodec(), ZstdCodec()]
+            if shards is None
+            else [ShardingCodec(chunk_shape=chunks, codecs=[BytesCodec(), ZstdCodec()])],
+            _count_bytes=512,
         )
         assert result == expected
 
-    async def test_info_complete_async(self) -> None:
-        arr = await zarr.api.asynchronous.create(
-            shape=(4, 4), chunks=(2, 2), zarr_format=3, codecs=[BytesCodec()]
+    async def test_info_complete_async(
+        self, chunks: tuple[int, int], shards: tuple[int, int] | None
+    ) -> None:
+        arr = await zarr.api.asynchronous.create_array(
+            store={},
+            dtype="f8",
+            shape=(8, 8),
+            chunks=chunks,
+            shards=shards,
+            compressors=None,
         )
         result = await arr.info_complete()
         expected = ArrayInfo(
             _zarr_format=3,
             _data_type=DataType.parse("float64"),
-            _shape=(4, 4),
-            _chunk_shape=(2, 2),
+            _shape=(8, 8),
+            _chunk_shape=chunks,
+            _shard_shape=shards,
             _order="C",
             _read_only=False,
             _store_type="MemoryStore",
-            _codecs=[BytesCodec()],
-            _count_bytes=128,
+            _codecs=[BytesCodec()] if shards is None else [ShardingCodec(chunk_shape=chunks)],
+            _count_bytes=512,
             _count_chunks_initialized=0,
-            _count_bytes_stored=373,  # the metadata?
+            _count_bytes_stored=373 if shards is None else 578,  # the metadata?
         )
         assert result == expected
 
-        await arr.setitem((slice(2), slice(2)), 10)
+        await arr.setitem((slice(4), slice(4)), 10)
         result = await arr.info_complete()
-        expected = dataclasses.replace(
-            expected, _count_chunks_initialized=1, _count_bytes_stored=405
-        )
+        if shards is None:
+            expected = dataclasses.replace(
+                expected, _count_chunks_initialized=4, _count_bytes_stored=501
+            )
+        else:
+            expected = dataclasses.replace(
+                expected, _count_chunks_initialized=1, _count_bytes_stored=774
+            )
         assert result == expected
 
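Note: the class-level ``pytest.mark.parametrize`` used above makes every method of ``TestInfo`` receive the ``chunks``/``shards`` pair, so each existing test now runs once without sharding and once with it. A tiny standalone sketch of that pattern (hypothetical test class, unrelated to zarr):

    import pytest

    @pytest.mark.parametrize(("chunks", "shards"), [((2, 2), None), ((2, 2), (4, 4))])
    class TestShardParams:
        # Each method below runs twice: once with shards=None, once with shards=(4, 4).
        def test_shards_match_chunk_rank(
            self, chunks: tuple[int, int], shards: tuple[int, int] | None
        ) -> None:
            assert shards is None or len(shards) == len(chunks)

        def test_chunks_value(self, chunks: tuple[int, int], shards: tuple[int, int] | None) -> None:
            assert chunks == (2, 2)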