Commit 4e19759

rebase; resolve issues with structured arrays
1 parent 600aa93 commit 4e19759

11 files changed

+157
-121
lines changed

docs/spec/v2.rst

Lines changed: 65 additions & 61 deletions
Original file line number | Diff line number | Diff line change
@@ -3,31 +3,31 @@
33
Zarr storage specification version 2
44
====================================
55

6-
This document provides a technical specification of the protocol and format
7-
used for storing Zarr arrays. The key words "MUST", "MUST NOT", "REQUIRED",
8-
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
9-
"OPTIONAL" in this document are to be interpreted as described in `RFC 2119
6+
This document provides a technical specification of the protocol and format
7+
used for storing Zarr arrays. The key words "MUST", "MUST NOT", "REQUIRED",
8+
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
9+
"OPTIONAL" in this document are to be interpreted as described in `RFC 2119
1010
<https://www.ietf.org/rfc/rfc2119.txt>`_.
1111

1212
Status
1313
------
1414

15-
This specification is the latest version. See :ref:`spec` for previous
15+
This specification is the latest version. See :ref:`spec` for previous
1616
versions.
1717

1818
Storage
1919
-------
2020

21-
A Zarr array can be stored in any storage system that provides a key/value
22-
interface, where a key is an ASCII string and a value is an arbitrary sequence
23-
of bytes, and the supported operations are read (get the sequence of bytes
24-
associated with a given key), write (set the sequence of bytes associated with
21+
A Zarr array can be stored in any storage system that provides a key/value
22+
interface, where a key is an ASCII string and a value is an arbitrary sequence
23+
of bytes, and the supported operations are read (get the sequence of bytes
24+
associated with a given key), write (set the sequence of bytes associated with
2525
a given key) and delete (remove a key/value pair).
2626

27-
For example, a directory in a file system can provide this interface, where
28-
keys are file names, values are file contents, and files can be read, written
29-
or deleted via the operating system. Equally, an S3 bucket can provide this
30-
interface, where keys are resource names, values are resource contents, and
27+
For example, a directory in a file system can provide this interface, where
28+
keys are file names, values are file contents, and files can be read, written
29+
or deleted via the operating system. Equally, an S3 bucket can provide this
30+
interface, where keys are resource names, values are resource contents, and
3131
resources can be read, written or deleted via HTTP.
3232

3333
Below an "array store" refers to any system implementing this interface.
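
As an illustration only, the key/value interface described in this section can be sketched with an in-memory mapping. The class name ``MemoryStore`` and its method names are hypothetical; the specification only requires that read, write and delete operations exist.

```python
class MemoryStore:
    """Hypothetical in-memory array store: ASCII string keys, bytes values."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        # Get the sequence of bytes associated with a given key.
        return self._data[key]

    def write(self, key, value):
        # Keys are ASCII strings; encode() raises UnicodeEncodeError otherwise.
        key.encode("ascii")
        # Values are arbitrary sequences of bytes.
        self._data[key] = bytes(value)

    def delete(self, key):
        # Remove a key/value pair.
        del self._data[key]


store = MemoryStore()
store.write(".zarray", b"{}")
payload = store.read(".zarray")
store.delete(".zarray")
```

A directory on a file system or an S3 bucket would expose the same three operations, with file or resource names playing the role of keys.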
@@ -38,11 +38,11 @@ Arrays
3838
Metadata
3939
~~~~~~~~
4040

41-
Each array requires essential configuration metadata to be stored, enabling
42-
correct interpretation of the stored data. This metadata is encoded using JSON
41+
Each array requires essential configuration metadata to be stored, enabling
42+
correct interpretation of the stored data. This metadata is encoded using JSON
4343
and stored as the value of the ".zarray" key within an array store.
4444

45-
The metadata resource is a JSON object. The following keys MUST be present
45+
The metadata resource is a JSON object. The following keys MUST be present
4646
within the object:
4747

4848
zarr_format
@@ -57,8 +57,8 @@ dtype
5757
A string or list defining a valid data type for the array. See also
5858
the subsection below on data type encoding.
5959
compressor
60-
A JSON object identifying the primary compression codec and providing
61-
configuration parameters, or ``null`` if no compressor is to be used.
60+
A JSON object identifying the primary compression codec and providing
61+
configuration parameters, or ``null`` if no compressor is to be used.
6262
The object MUST contain an ``"id"`` key identifying the codec to be used.
6363
fill_value
6464
A scalar value providing the default value to use for uninitialized
@@ -74,10 +74,10 @@ filters
7474

7575
Other keys MUST NOT be present within the metadata object.
7676

77-
For example, the JSON object below defines a 2-dimensional array of 64-bit
78-
little-endian floating point numbers with 10000 rows and 10000 columns, divided
79-
into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total
80-
arranged in a 10 by 10 grid). Within each chunk the data are laid out in C
77+
For example, the JSON object below defines a 2-dimensional array of 64-bit
78+
little-endian floating point numbers with 10000 rows and 10000 columns, divided
79+
into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total
80+
arranged in a 10 by 10 grid). Within each chunk the data are laid out in C
8181
contiguous order. Each chunk is encoded using a delta filter and compressed
8282
using the Blosc compression library prior to storage::
8383

@@ -109,8 +109,8 @@ Data type encoding
109109
~~~~~~~~~~~~~~~~~~
110110

111111
Simple data types are encoded within the array metadata as a string,
112-
following the `NumPy array protocol type string (typestr) format
113-
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The format
112+
following the `NumPy array protocol type string (typestr) format
113+
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The format
114114
consists of 3 parts:
115115

116116
* One character describing the byteorder of the data (``"<"``: little-endian;
@@ -127,9 +127,9 @@ The byte order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and
127127
``"|S12"`` are valid data type encodings.
128128

129129
Structured data types (i.e., with multiple named fields) are encoded as a list
130-
of two-element lists, following `NumPy array protocol type descriptions (descr)
131-
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html#>`_. For
132-
example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a
130+
of two-element lists, following `NumPy array protocol type descriptions (descr)
131+
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html#>`_. For
132+
example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a
133133
data type composed of three single-byte unsigned integers labelled "r", "g" and
134134
"b".
135135

@@ -147,37 +147,41 @@ Positive Infinity ``"Infinity"``
147147
Negative Infinity ``"-Infinity"``
148148
================= ===============
149149

150+
If an array has a fixed length byte string data type (e.g., ``"|S12"``), or a
151+
structured data type, and if the fill value is not null, then the fill value
152+
MUST be encoded as an ASCII string using the standard Base64 alphabet.
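
A minimal sketch of the Base64 rule above, using Python's standard library. The helper names are hypothetical, and exactly which raw bytes make up the fill value (for example, whether a ``"|S12"`` value is padded to 12 bytes) is an assumption left to the implementation.

```python
import base64


def encode_fill_value(raw):
    # Encode raw fill-value bytes with the standard Base64 alphabet and
    # return an ASCII string suitable for the JSON metadata.
    return base64.standard_b64encode(raw).decode("ascii")


def decode_fill_value(text):
    # Recover the raw fill-value bytes from the metadata string.
    return base64.standard_b64decode(text)


encoded = encode_fill_value(b"missing")
decoded = decode_fill_value(encoded)
```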
153+
150154
Chunks
151155
~~~~~~
152156

153-
Each chunk of the array is compressed by passing the raw bytes for the chunk
154-
through the primary compression library to obtain a new sequence of bytes
155-
comprising the compressed chunk data. No header is added to the compressed
156-
bytes or any other modification made. The internal structure of the compressed
157-
bytes will depend on which primary compressor was used. For example, the `Blosc
158-
compressor <https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst>`_
159-
produces a sequence of bytes that begins with a 16-byte header followed by
157+
Each chunk of the array is compressed by passing the raw bytes for the chunk
158+
through the primary compression library to obtain a new sequence of bytes
159+
comprising the compressed chunk data. No header is added to the compressed
160+
bytes or any other modification made. The internal structure of the compressed
161+
bytes will depend on which primary compressor was used. For example, the `Blosc
162+
compressor <https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst>`_
163+
produces a sequence of bytes that begins with a 16-byte header followed by
160164
compressed data.
161165

162-
The compressed sequence of bytes for each chunk is stored under a key formed
163-
from the index of the chunk within the grid of chunks representing the array.
164-
To form a string key for a chunk, the indices are converted to strings and
166+
The compressed sequence of bytes for each chunk is stored under a key formed
167+
from the index of the chunk within the grid of chunks representing the array.
168+
To form a string key for a chunk, the indices are converted to strings and
165169
concatenated with the period character (".") separating each index. For
166-
example, given an array with shape (10000, 10000) and chunk shape (1000, 1000)
167-
there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices
168-
(0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the
170+
example, given an array with shape (10000, 10000) and chunk shape (1000, 1000)
171+
there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices
172+
(0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the
169173
key "0.0"; the chunk with indices (2, 4) provides data for rows 2000-3000 and
170174
columns 4000-5000 and is stored under the key "2.4"; etc.
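
The key scheme described above can be sketched as follows. The function names are hypothetical and not part of the specification; the arithmetic simply follows the worked example in the text.

```python
def chunk_key(indices):
    # Convert each chunk index to a string and join with the period character.
    return ".".join(str(i) for i in indices)


def chunk_grid_shape(shape, chunks):
    # Number of chunks along each dimension (ceiling division), so that
    # chunks overhanging the edge of the array are counted.
    return tuple(-(-s // c) for s, c in zip(shape, chunks))


def chunk_indices_for(coords, chunks):
    # Indices of the chunk that holds the array element at `coords`.
    return tuple(x // c for x, c in zip(coords, chunks))


grid = chunk_grid_shape((10000, 10000), (1000, 1000))
key = chunk_key(chunk_indices_for((2500, 4999), (1000, 1000)))
```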
171175

172-
There is no need for all chunks to be present within an array store. If a chunk
173-
is not present then it is considered to be in an uninitialized state. An
174-
uninitialized chunk MUST be treated as if it was uniformly filled with the value
176+
There is no need for all chunks to be present within an array store. If a chunk
177+
is not present then it is considered to be in an uninitialized state. An
178+
uninitialized chunk MUST be treated as if it was uniformly filled with the value
175179
of the "fill_value" field in the array metadata. If the "fill_value" field is
176180
``null`` then the contents of the chunk are undefined.
177181

178-
Note that all chunks in an array have the same shape. If the length of any
179-
array dimension is not exactly divisible by the length of the corresponding
180-
chunk dimension then some chunks will overhang the edge of the array. The
182+
Note that all chunks in an array have the same shape. If the length of any
183+
array dimension is not exactly divisible by the length of the corresponding
184+
chunk dimension then some chunks will overhang the edge of the array. The
181185
contents of any chunk region falling outside the array are undefined.
182186

183187
Filters
@@ -196,15 +200,15 @@ Hierarchies
196200
Logical storage paths
197201
~~~~~~~~~~~~~~~~~~~~~
198202

199-
Multiple arrays can be stored in the same array store by associating each array
200-
with a different logical path. A logical path is simply an ASCII string. The
201-
logical path is used to form a prefix for keys used by the array. For example,
203+
Multiple arrays can be stored in the same array store by associating each array
204+
with a different logical path. A logical path is simply an ASCII string. The
205+
logical path is used to form a prefix for keys used by the array. For example,
202206
if an array is stored at logical path "foo/bar" then the array metadata will be
203207
stored under the key "foo/bar/.zarray", the user-defined attributes will be
204208
stored under the key "foo/bar/.zattrs", and the chunks will be stored under
205209
keys like "foo/bar/0.0", "foo/bar/0.1", etc.
206210

207-
To ensure consistent behaviour across different storage systems, logical paths
211+
To ensure consistent behaviour across different storage systems, logical paths
208212
MUST be normalized as follows:
209213

210214
* Replace all backward slash characters ("\\") with forward slash characters
@@ -221,24 +225,24 @@ After normalization, if splitting a logical path by the "/" character results
221225
in any path segment equal to the string "." or the string ".." then an error
222226
MUST be raised.
223227

224-
N.B., how the underlying array store processes requests to store values under
228+
N.B., how the underlying array store processes requests to store values under
225229
keys containing the "/" character is entirely up to the store implementation
226-
and is not constrained by this specification. E.g., an array store could simply
227-
treat all keys as opaque ASCII strings; equally, an array store could map
228-
logical paths onto some kind of hierarchical storage (e.g., directories on a
230+
and is not constrained by this specification. E.g., an array store could simply
231+
treat all keys as opaque ASCII strings; equally, an array store could map
232+
logical paths onto some kind of hierarchical storage (e.g., directories on a
229233
file system).
230234

231235
Groups
232236
~~~~~~
233237

234238
Arrays can be organized into groups which can also contain other groups. A
235239
group is created by storing group metadata under the ".zgroup" key under some
236-
logical path. E.g., a group exists at the root of an array store if the
240+
logical path. E.g., a group exists at the root of an array store if the
237241
".zgroup" key exists in the store, and a group exists at logical path "foo/bar"
238242
if the "foo/bar/.zgroup" key exists in the store.
239243

240-
If the user requests a group to be created under some logical path, then groups
241-
MUST also be created at all ancestor paths. E.g., if the user requests group
244+
If the user requests a group to be created under some logical path, then groups
245+
MUST also be created at all ancestor paths. E.g., if the user requests group
242246
creation at path "foo/bar" then groups MUST be created at path "foo" and the
243247
root of the store, if they don't already exist.
244248

@@ -256,7 +260,7 @@ zarr_format
256260

257261
Other keys MUST NOT be present within the metadata object.
258262

259-
The members of a group are arrays and groups stored under logical paths that
263+
The members of a group are arrays and groups stored under logical paths that
260264
are direct children of the parent group's logical path. E.g., if groups exist
261265
under the logical paths "foo" and "foo/bar" and an array exists at logical path
262266
"foo/baz" then the members of the group at path "foo" are the group at path
@@ -265,8 +269,8 @@ under the logical paths "foo" and "foo/bar" and an array exists at logical path
265269
Attributes
266270
----------
267271

268-
An array or group can be associated with custom attributes, which are simple
269-
key/value items with application-specific meaning. Custom attributes are
272+
An array or group can be associated with custom attributes, which are simple
273+
key/value items with application-specific meaning. Custom attributes are
270274
encoded as a JSON object and stored under the ".zattrs" key within an array
271275
store.
272276

@@ -377,7 +381,7 @@ Modify the array attributes::
377381
Storing multiple arrays in a hierarchy
378382
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
379383

380-
Below is an example of storing multiple Zarr arrays organized into a group
384+
Below is an example of storing multiple Zarr arrays organized into a group
381385
hierarchy, using a directory on the local file system as storage. This storage
382386
implementation maps logical paths onto directory paths on the file system,
383387
however this is an implementation choice and is not required.

docs/tutorial.rst

Lines changed: 4 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -76,7 +76,7 @@ stored in memory. Zarr arrays can also be stored on a file system,
7676
enabling persistence of data between sessions. For example::
7777

7878
>>> z1 = zarr.open_array('example.zarr', mode='w', shape=(10000, 10000),
79-
... chunks=(1000, 1000), dtype='i4', fill_value=0)
79+
... chunks=(1000, 1000), dtype='i4')
8080

8181
The array above will store its configuration metadata and all
8282
compressed chunk data in a directory called 'example.zarr' relative to
@@ -382,8 +382,7 @@ and :func:`zarr.hierarchy.Group.require_dataset` methods, e.g.::
382382

383383
>>> z = bar_group.create_dataset('quux', shape=(10000, 10000),
384384
... chunks=(1000, 1000), dtype='i4',
385-
... fill_value=0, compression='gzip',
386-
... compression_opts=1)
385+
... compression='gzip', compression_opts=1)
387386
>>> z
388387
<zarr.core.Array '/foo/bar/quux' (10000, 10000) int32>
389388

@@ -408,8 +407,7 @@ stored in sub-directories, e.g.::
408407
>>> persistent_group
409408
<zarr.hierarchy.Group '/'>
410409
>>> z = persistent_group.create_dataset('foo/bar/baz', shape=(10000, 10000),
411-
... chunks=(1000, 1000), dtype='i4',
412-
... fill_value=0)
410+
... chunks=(1000, 1000), dtype='i4')
413411
>>> z
414412
<zarr.core.Array '/foo/bar/baz' (10000, 10000) int32>
415413

@@ -722,7 +720,7 @@ directory on the local file system. This is used under the hood by the
722720
:func:`zarr.creation.open_array` and :func:`zarr.hierarchy.open_group` functions. In other words,
723721
the following code::
724722

725-
>>> z = zarr.open_array('example.zarr', mode='w', shape=1000000, dtype='i4', fill_value=0)
723+
>>> z = zarr.open_array('example.zarr', mode='w', shape=1000000, dtype='i4')
726724

727725
...is just short-hand for::
728726
