diff --git a/docs/api/codecs.rst b/docs/api/codecs.rst index e35de08b30..f35ea861b4 100644 --- a/docs/api/codecs.rst +++ b/docs/api/codecs.rst @@ -2,27 +2,22 @@ Compressors and filters (``zarr.codecs``) ========================================= .. module:: zarr.codecs -This module contains compressor and filter classes for use with Zarr. +This module contains compressor and filter classes for use with Zarr. Please note that this module +is provided for backwards compatibility with previous versions of Zarr. From Zarr version 2.2 +onwards, all codec classes have been moved to a separate package called Numcodecs_. The two +packages (Zarr and Numcodecs_) are designed to be used together. For example, a Numcodecs_ codec +class can be used as a compressor for a Zarr array:: -Other codecs can be registered dynamically with Zarr. All that is required -is to implement a class that provides the same interface as the classes listed -below, and then to add the class to the ``codec_registry``. See the source -code of this module for details. + >>> import zarr + >>> from numcodecs import Blosc + >>> z = zarr.zeros(1000000, compressor=Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE)) -.. autoclass:: Codec +Codec classes can also be used as filters. See the tutorial section on :ref:`tutorial_filters` +for more information. - .. automethod:: encode - .. automethod:: decode - .. automethod:: get_config - .. automethod:: from_config +Please note that it is also relatively straightforward to define and register custom codec +classes. See the Numcodecs `codec API `_ and +`codec registry `_ documentation for more +information. -.. autoclass:: Blosc -.. autoclass:: Zlib -.. autoclass:: BZ2 -.. autoclass:: LZMA -.. autoclass:: Delta -.. autoclass:: AsType -.. autoclass:: FixedScaleOffset -.. autoclass:: Quantize -.. autoclass:: PackBits -.. autoclass:: Categorize +.. 
_Numcodecs: http://numcodecs.readthedocs.io/ diff --git a/docs/api/core.rst b/docs/api/core.rst index 4f2c5cc6bb..ada6a653ca 100644 --- a/docs/api/core.rst +++ b/docs/api/core.rst @@ -6,6 +6,14 @@ The Array class (``zarr.core``) .. automethod:: __getitem__ .. automethod:: __setitem__ + .. automethod:: get_basic_selection + .. automethod:: set_basic_selection + .. automethod:: get_mask_selection + .. automethod:: set_mask_selection + .. automethod:: get_coordinate_selection + .. automethod:: set_coordinate_selection + .. automethod:: get_orthogonal_selection + .. automethod:: set_orthogonal_selection .. automethod:: resize .. automethod:: append .. automethod:: view diff --git a/docs/index.rst b/docs/index.rst index 5215ba272a..80c7de664d 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,27 +12,25 @@ Highlights * Create N-dimensional arrays with any NumPy dtype. * Chunk arrays along any dimension. -* Compress chunks using the fast Blosc_ meta-compressor or alternatively using zlib, BZ2 or LZMA. +* Compress and/or filter chunks using any numcodecs_ codec. * Store arrays in memory, on disk, inside a Zip file, on S3, ... * Read an array concurrently from multiple threads or processes. * Write to an array concurrently from multiple threads or processes. * Organize arrays into hierarchies via groups. -* Use filters to preprocess data and improve compression. Status ------ -Zarr is still in an early phase of development. Feedback and bug -reports are very welcome, please get in touch via the `GitHub issue -tracker `_. +Zarr is still a young project. Feedback and bug reports are very welcome, please get in touch via +the `GitHub issue tracker `_. Installation ------------ Zarr depends on NumPy. It is generally best to `install NumPy -`_ first using -whatever method is most appropriate for you operating system and -Python distribution. +`_ first using whatever method is most +appropriate for your operating system and Python distribution. 
Other dependencies should be +installed automatically if using one of the installation methods below. Install Zarr from PyPI:: @@ -41,26 +39,18 @@ Install Zarr from PyPI:: Alternatively, install Zarr via conda:: $ conda install -c conda-forge zarr - -Zarr includes a C extension providing integration with the Blosc_ -library. Installing via conda will install a pre-compiled binary distribution. -However, if you have a newer CPU that supports the AVX2 instruction set (e.g., -Intel Haswell, Broadwell or Skylake) then installing via pip is preferable, -because this will compile the Blosc library from source with optimisations -for AVX2. - + To work with Zarr source code in development, install from GitHub:: $ git clone --recursive https://github.com/alimanfoo/zarr.git $ cd zarr $ python setup.py install -To verify that Zarr has been fully installed (including the Blosc -extension) run the test suite:: +To verify that Zarr has been fully installed, run the test suite:: $ pip install nose $ python -m nose -v zarr - + Contents -------- @@ -75,13 +65,20 @@ Contents Acknowledgments --------------- -Zarr bundles the `c-blosc `_ -library and uses it as the default compressor. +The following people have contributed to the development of Zarr, by contributing code and/or +providing ideas, feedback and advice: + +* `Francesc Alted `_ +* `Stephan Hoyer `_ +* `John Kirkham `_ +* `Alistair Miles `_ +* `Matthew Rocklin `_ +* `Vincent Schut `_ Zarr is inspired by `HDF5 `_, `h5py `_ and `bcolz `_. -Development of this package is supported by the +Development of Zarr is supported by the `MRC Centre for Genomics and Global Health `_. Indices and tables @@ -91,4 +88,4 @@ Indices and tables * :ref:`modindex` * :ref:`search` -.. _Blosc: http://www.blosc.org/ +.. 
_numcodecs: http://numcodecs.readthedocs.io/ diff --git a/docs/spec/v2.rst b/docs/spec/v2.rst index 00a9bcc495..88df4f9439 100644 --- a/docs/spec/v2.rst +++ b/docs/spec/v2.rst @@ -3,31 +3,31 @@ Zarr storage specification version 2 ==================================== -This document provides a technical specification of the protocol and format -used for storing Zarr arrays. The key words "MUST", "MUST NOT", "REQUIRED", -"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and -"OPTIONAL" in this document are to be interpreted as described in `RFC 2119 +This document provides a technical specification of the protocol and format +used for storing Zarr arrays. The key words "MUST", "MUST NOT", "REQUIRED", +"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and +"OPTIONAL" in this document are to be interpreted as described in `RFC 2119 `_. Status ------ -This specification is the latest version. See :ref:`spec` for previous +This specification is the latest version. See :ref:`spec` for previous versions. Storage ------- -A Zarr array can be stored in any storage system that provides a key/value -interface, where a key is an ASCII string and a value is an arbitrary sequence -of bytes, and the supported operations are read (get the sequence of bytes -associated with a given key), write (set the sequence of bytes associated with +A Zarr array can be stored in any storage system that provides a key/value +interface, where a key is an ASCII string and a value is an arbitrary sequence +of bytes, and the supported operations are read (get the sequence of bytes +associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair). -For example, a directory in a file system can provide this interface, where -keys are file names, values are file contents, and files can be read, written -or deleted via the operating system. 
Equally, an S3 bucket can provide this -interface, where keys are resource names, values are resource contents, and +For example, a directory in a file system can provide this interface, where +keys are file names, values are file contents, and files can be read, written +or deleted via the operating system. Equally, an S3 bucket can provide this +interface, where keys are resource names, values are resource contents, and resources can be read, written or deleted via HTTP. Below an "array store" refers to any system implementing this interface. @@ -38,11 +38,11 @@ Arrays Metadata ~~~~~~~~ -Each array requires essential configuration metadata to be stored, enabling -correct interpretation of the stored data. This metadata is encoded using JSON +Each array requires essential configuration metadata to be stored, enabling +correct interpretation of the stored data. This metadata is encoded using JSON and stored as the value of the ".zarray" key within an array store. -The metadata resource is a JSON object. The following keys MUST be present +The metadata resource is a JSON object. The following keys MUST be present within the object: zarr_format @@ -57,8 +57,8 @@ dtype A string or list defining a valid data type for the array. See also the subsection below on data type encoding. compressor - A JSON object identifying the primary compression codec and providing - configuration parameters, or ``null`` if no compressor is to be used. + A JSON object identifying the primary compression codec and providing + configuration parameters, or ``null`` if no compressor is to be used. The object MUST contain an ``"id"`` key identifying the codec to be used. fill_value A scalar value providing the default value to use for uninitialized @@ -74,10 +74,10 @@ filters Other keys MUST NOT be present within the metadata object. 
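The metadata rules above (an exact set of required keys, and an ``"id"`` key inside any non-null compressor object) can be sketched as a small validation routine. This is an illustrative sketch only; the full required key set (``zarr_format``, ``shape``, ``chunks``, ``dtype``, ``compressor``, ``fill_value``, ``order``, ``filters``) is assumed from the complete v2 specification, parts of which fall outside the hunks shown in this diff:

```python
import json

# Assumed full required key set per the Zarr v2 array metadata spec;
# only some of these keys are visible in the hunk above.
REQUIRED_KEYS = {"zarr_format", "shape", "chunks", "dtype", "compressor",
                 "fill_value", "order", "filters"}

def validate_zarray(value: bytes) -> dict:
    """Validate a JSON document stored under the '.zarray' key."""
    meta = json.loads(value.decode("ascii"))
    keys = set(meta)
    if keys != REQUIRED_KEYS:
        # other keys MUST NOT be present, and all required keys MUST be
        raise ValueError("bad metadata keys: missing %s, extra %s"
                         % (REQUIRED_KEYS - keys, keys - REQUIRED_KEYS))
    compressor = meta["compressor"]
    if compressor is not None and "id" not in compressor:
        raise ValueError("compressor object must contain an 'id' key")
    return meta
```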
-For example, the JSON object below defines a 2-dimensional array of 64-bit -little-endian floating point numbers with 10000 rows and 10000 columns, divided -into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total -arranged in a 10 by 10 grid). Within each chunk the data are laid out in C +For example, the JSON object below defines a 2-dimensional array of 64-bit +little-endian floating point numbers with 10000 rows and 10000 columns, divided +into chunks of 1000 rows and 1000 columns (so there will be 100 chunks in total +arranged in a 10 by 10 grid). Within each chunk the data are laid out in C contiguous order. Each chunk is encoded using a delta filter and compressed using the Blosc compression library prior to storage:: @@ -109,8 +109,8 @@ Data type encoding ~~~~~~~~~~~~~~~~~~ Simple data types are encoded within the array metadata as a string, -following the `NumPy array protocol type string (typestr) format -`_. The format +following the `NumPy array protocol type string (typestr) format +`_. The format consists of 3 parts: * One character describing the byteorder of the data (``"<"``: little-endian; @@ -127,9 +127,9 @@ The byte order MUST be specified. E.g., ``"i4"``, ``"|b1"`` and ``"|S12"`` are valid data type encodings. Structured data types (i.e., with multiple named fields) are encoded as a list -of two-element lists, following `NumPy array protocol type descriptions (descr) -`_. For -example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a +of two-element lists, following `NumPy array protocol type descriptions (descr) +`_. For +example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a data type composed of three single-byte unsigned integers labelled "r", "g" and "b". 
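The correspondence between the JSON encodings described above and NumPy dtypes can be sketched as follows (a minimal example, assuming NumPy is available):

```python
import json
import numpy as np

# A structured "descr" encoding is a JSON list of two-element lists;
# converting each inner list to a tuple yields a NumPy dtype specification.
descr = json.loads('[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]')
dtype = np.dtype([tuple(field) for field in descr])

# Simple data types round-trip through the typestr format directly.
simple = np.dtype('<f8')
```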
@@ -147,37 +147,41 @@ Positive Infinity ``"Infinity"`` Negative Infinity ``"-Infinity"`` ================= =============== +If an array has a fixed length byte string data type (e.g., ``"|S12"``), or a +structured data type, and if the fill value is not null, then the fill value +MUST be encoded as an ASCII string using the standard Base64 alphabet. + Chunks ~~~~~~ -Each chunk of the array is compressed by passing the raw bytes for the chunk -through the primary compression library to obtain a new sequence of bytes -comprising the compressed chunk data. No header is added to the compressed -bytes or any other modification made. The internal structure of the compressed -bytes will depend on which primary compressor was used. For example, the `Blosc -compressor `_ -produces a sequence of bytes that begins with a 16-byte header followed by +Each chunk of the array is compressed by passing the raw bytes for the chunk +through the primary compression library to obtain a new sequence of bytes +comprising the compressed chunk data. No header is added to the compressed +bytes or any other modification made. The internal structure of the compressed +bytes will depend on which primary compressor was used. For example, the `Blosc +compressor `_ +produces a sequence of bytes that begins with a 16-byte header followed by compressed data. -The compressed sequence of bytes for each chunk is stored under a key formed -from the index of the chunk within the grid of chunks representing the array. -To form a string key for a chunk, the indices are converted to strings and +The compressed sequence of bytes for each chunk is stored under a key formed +from the index of the chunk within the grid of chunks representing the array. +To form a string key for a chunk, the indices are converted to strings and concatenated with the period character (".") separating each index. 
For -example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) -there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices -(0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the +example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) +there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices +(0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key "0.0"; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key "2.4"; etc. -There is no need for all chunks to be present within an array store. If a chunk -is not present then it is considered to be in an uninitialized state. An -unitialized chunk MUST be treated as if it was uniformly filled with the value +There is no need for all chunks to be present within an array store. If a chunk +is not present then it is considered to be in an uninitialized state. An +uninitialized chunk MUST be treated as if it was uniformly filled with the value of the "fill_value" field in the array metadata. If the "fill_value" field is ``null`` then the contents of the chunk are undefined. -Note that all chunks in an array have the same shape. If the length of any -array dimension is not exactly divisible by the length of the corresponding -chunk dimension then some chunks will overhang the edge of the array. The +Note that all chunks in an array have the same shape. If the length of any +array dimension is not exactly divisible by the length of the corresponding +chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined. Filters @@ -196,15 +200,15 @@ Hierarchies Logical storage paths ~~~~~~~~~~~~~~~~~~~~~ -Multiple arrays can be stored in the same array store by associating each array -with a different logical path. A logical path is simply an ASCII string. 
The -logical path is used to form a prefix for keys used by the array. For example, +Multiple arrays can be stored in the same array store by associating each array +with a different logical path. A logical path is simply an ASCII string. The +logical path is used to form a prefix for keys used by the array. For example, if an array is stored at logical path "foo/bar" then the array metadata will be stored under the key "foo/bar/.zarray", the user-defined attributes will be stored under the key "foo/bar/.zattrs", and the chunks will be stored under keys like "foo/bar/0.0", "foo/bar/0.1", etc. -To ensure consistent behaviour across different storage systems, logical paths +To ensure consistent behaviour across different storage systems, logical paths MUST be normalized as follows: * Replace all backward slash characters ("\\") with forward slash characters @@ -221,11 +225,11 @@ After normalization, if splitting a logical path by the "/" character results in any path segment equal to the string "." or the string ".." then an error MUST be raised. -N.B., how the underlying array store processes requests to store values under +N.B., how the underlying array store processes requests to store values under keys containing the "/" character is entirely up to the store implementation -and is not constrained by this specification. E.g., an array store could simply -treat all keys as opaque ASCII strings; equally, an array store could map -logical paths onto some kind of hierarchical storage (e.g., directories on a +and is not constrained by this specification. E.g., an array store could simply +treat all keys as opaque ASCII strings; equally, an array store could map +logical paths onto some kind of hierarchical storage (e.g., directories on a file system). Groups @@ -233,12 +237,12 @@ Groups Arrays can be organized into groups which can also contain other groups. A group is created by storing group metadata under the ".zgroup" key under some -logical path. 
E.g., a group exists at the root of an array store if the +logical path. E.g., a group exists at the root of an array store if the ".zgroup" key exists in the store, and a group exists at logical path "foo/bar" if the "foo/bar/.zgroup" key exists in the store. -If the user requests a group to be created under some logical path, then groups -MUST also be created at all ancestor paths. E.g., if the user requests group +If the user requests a group to be created under some logical path, then groups +MUST also be created at all ancestor paths. E.g., if the user requests group creation at path "foo/bar" then groups MUST be created at path "foo" and the root of the store, if they don't already exist. @@ -256,7 +260,7 @@ zarr_format Other keys MUST NOT be present within the metadata object. -The members of a group are arrays and groups stored under logical paths that +The members of a group are arrays and groups stored under logical paths that are direct children of the parent group's logical path. E.g., if groups exist under the logical paths "foo" and "foo/bar" and an array exists at logical path "foo/baz" then the members of the group at path "foo" are the group at path @@ -265,8 +269,8 @@ under the logical paths "foo" and "foo/bar" and an array exists at logical path Attributes ---------- -An array or group can be associated with custom attributes, which are simple -key/value items with application-specific meaning. Custom attributes are +An array or group can be associated with custom attributes, which are simple +key/value items with application-specific meaning. Custom attributes are encoded as a JSON object and stored under the ".zattrs" key within an array store. 
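The attribute encoding described above can be illustrated with a plain Python dict standing in for an array store (the attribute names here are made up for the example):

```python
import json

store = {}  # any key/value store mapping ASCII keys to byte sequences

# custom attributes are encoded as a single JSON object under '.zattrs'
attrs = {"units": "metres", "description": "elevation data"}
store[".zattrs"] = json.dumps(attrs).encode("ascii")

# reading the attributes back simply decodes the stored JSON object
decoded = json.loads(store[".zattrs"].decode("ascii"))
```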
@@ -377,7 +381,7 @@ Modify the array attributes:: Storing multiple arrays in a hierarchy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Below is an example of storing multiple Zarr arrays organized into a group +Below is an example of storing multiple Zarr arrays organized into a group hierarchy, using a directory on the local file system as storage. This storage implementation maps logical paths onto directory paths on the file system, however this is an implementation choice and is not required. diff --git a/docs/tutorial.rst b/docs/tutorial.rst index cb633af27a..8b8d2cbe83 100644 --- a/docs/tutorial.rst +++ b/docs/tutorial.rst @@ -40,10 +40,6 @@ scalar value:: >>> z[:] = 42 -Notice that the values of ``initialized`` has changed. This is because -when a Zarr array is first created, none of the chunks are initialized. -Writing data into the array will cause the necessary chunks to be initialized. - Regions of the array can also be written to, e.g.:: >>> import numpy as np @@ -51,7 +47,7 @@ Regions of the array can also be written to, e.g.:: >>> z[:, 0] = np.arange(10000) The contents of the array can be retrieved by slicing, which will load -the requested region into a NumPy array, e.g.:: +the requested region into memory as a NumPy array, e.g.:: >>> z[0, 0] 0 @@ -61,7 +57,7 @@ the requested region into a NumPy array, e.g.:: array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32) >>> z[:, 0] array([ 0, 1, 2, ..., 9997, 9998, 9999], dtype=int32) - >>> z[:] + >>> z[...] array([[ 0, 1, 2, ..., 9997, 9998, 9999], [ 1, 42, 42, ..., 42, 42, 42], [ 2, 42, 42, ..., 42, 42, 42], @@ -80,9 +76,7 @@ stored in memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:: >>> z1 = zarr.open_array('example.zarr', mode='w', shape=(10000, 10000), - ... chunks=(1000, 1000), dtype='i4', fill_value=0) - >>> z1 - + ... 
chunks=(1000, 1000), dtype='i4') The array above will store its configuration metadata and all compressed chunk data in a directory called 'example.zarr' relative to @@ -102,11 +96,12 @@ data, e.g.:: Check that the data have been written and can be read again:: >>> z2 = zarr.open_array('example.zarr', mode='r') - >>> z2 - - >>> np.all(z1[:] == z2[:]) + >>> np.all(z1[...] == z2[...]) True +Please note that there are a number of other options for persistent array storage, see the +section on :ref:`tutorial_tips_storage` below. + .. _tutorial_resize: Resizing and appending @@ -145,44 +140,57 @@ which can be used to append data to any axis. E.g.:: Compressors ----------- -By default, Zarr uses the `Blosc `_ compression -library to compress each chunk of an array. Blosc is extremely fast -and can be configured in a variety of ways to improve the compression -ratio for different types of data. Blosc is in fact a -"meta-compressor", which means that it can used a number of different -compression algorithms internally to compress the data. Blosc also -provides highly optimized implementations of byte and bit shuffle -filters, which can significantly improve compression ratios for some -data. - -Different compressors can be provided via the ``compressor`` keyword argument -accepted by all array creation functions. For example:: +A number of different compressors can be used with Zarr. A separate package called Numcodecs_ is +available which provides an interface to various compressor libraries including Blosc, Zstandard, +LZ4, Zlib, BZ2 and LZMA. Different compressors can be provided via the ``compressor`` keyword +argument accepted by all array creation functions. For example:: >>> from numcodecs import Blosc - >>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000), - ... chunks=(1000, 1000), - ... 
compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)) + >>> compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE) + >>> data = np.arange(100000000, dtype='i4').reshape(10000, 10000) + >>> z = zarr.array(data, chunks=(1000, 1000), compressor=compressor) >>> z.compressor Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0) -The array above will use Blosc as the primary compressor, using the -Zstandard algorithm (compression level 3) internally within Blosc, and with -the bitshuffle filter applied. +The array above will use Blosc as the primary compressor, using the Zstandard algorithm +(compression level 3) internally within Blosc, and with the bitshuffle filter applied. + +When using a compressor, it can be useful to get some diagnostics on the compression ratio. Zarr +arrays provide an ``info`` property which can be used to print some diagnostics, e.g.:: + + >>> z.info + Type : zarr.core.Array + Data type : int32 + Shape : (10000, 10000) + Chunk shape : (1000, 1000) + Order : C + Read-only : False + Compressor : Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, + : blocksize=0) + Store type : builtins.dict + No. bytes : 400000000 (381.5M) + No. bytes stored : 4565055 (4.4M) + Storage ratio : 87.6 + Chunks initialized : 100/100 -A list of the internal compression libraries available within Blosc can be -obtained via:: +If you don't specify a compressor, by default Zarr uses the Blosc compressor. Blosc is extremely +fast and can be configured in a variety of ways to improve the compression ratio for different +types of data. Blosc is in fact a "meta-compressor", which means that it can use a number of +different compression algorithms internally to compress the data. Blosc also provides highly +optimized implementations of byte and bit shuffle filters, which can significantly improve +compression ratios for some data. 
A list of the internal compression libraries available within +Blosc can be obtained via:: >>> from numcodecs import blosc >>> blosc.list_compressors() ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'] -In addition to Blosc, other compression libraries can also be -used. For example, here is an array using Zstandard compression, level 1:: +In addition to Blosc, other compression libraries can also be used. For example, here is an array +using Zstandard compression, level 1:: >>> from numcodecs import Zstd >>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000), - ... chunks=(1000, 1000), - ... compressor=Zstd(level=1)) + ... chunks=(1000, 1000), compressor=Zstd(level=1)) >>> z.compressor Zstd(level=1) @@ -235,15 +243,13 @@ flexibility for implementing and using filters in combination with different compressors, Zarr also provides a mechanism for configuring filters outside of the primary compressor. -Here is an example using the delta filter with the Blosc compressor: +Here is an example using the delta filter with the Blosc compressor:: >>> from numcodecs import Blosc, Delta >>> filters = [Delta(dtype='i4')] >>> compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE) - >>> z = zarr.array(np.arange(100000000, dtype='i4').reshape(10000, 10000), - ... chunks=(1000, 1000), filters=filters, compressor=compressor) - >>> z - + >>> data = np.arange(100000000, dtype='i4').reshape(10000, 10000) + >>> z = zarr.array(data, chunks=(1000, 1000), filters=filters, compressor=compressor) >>> z.info Type : zarr.core.Array Data type : int32 @@ -376,8 +382,7 @@ and :func:`zarr.hierarchy.Group.require_dataset` methods, e.g.:: >>> z = bar_group.create_dataset('quux', shape=(10000, 10000), ... chunks=(1000, 1000), dtype='i4', - ... fill_value=0, compression='gzip', - ... compression_opts=1) + ... 
compression='gzip', compression_opts=1) >>> z @@ -402,13 +407,192 @@ stored in sub-directories, e.g.:: >>> persistent_group >>> z = persistent_group.create_dataset('foo/bar/baz', shape=(10000, 10000), - ... chunks=(1000, 1000), dtype='i4', - ... fill_value=0) + ... chunks=(1000, 1000), dtype='i4') >>> z For more information on groups see the :mod:`zarr.hierarchy` API docs. +.. _tutorial_indexing: + +Advanced indexing +----------------- + +As of Zarr version 2.2, Zarr arrays support several methods for advanced or "fancy" indexing, +which enable a subset of data items to be extracted or updated in an array without loading the +entire array into memory. Note that although this functionality is similar to some of the +advanced indexing capabilities available on NumPy arrays and on h5py datasets, **the Zarr API for +advanced indexing is different from both NumPy and h5py**, so please read this section carefully. +For a complete description of the indexing API, see the documentation for the +:class:`zarr.core.Array` class. + +Indexing with coordinate arrays +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Items from a Zarr array can be extracted by providing an integer array of coordinates. E.g.:: + + >>> z = zarr.array(np.arange(10)) + >>> z[...] + array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) + >>> z.get_coordinate_selection([1, 4]) + array([1, 4]) + +Coordinate arrays can also be used to update data, e.g.:: + + >>> z.set_coordinate_selection([1, 4], [-1, -2]) + >>> z[...] + array([ 0, -1, 2, 3, -2, 5, 6, 7, 8, 9]) + +For multidimensional arrays, coordinates must be provided for each dimension, e.g.:: + + >>> z = zarr.array(np.arange(15).reshape(3, 5)) + >>> z[...] + array([[ 0, 1, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, 13, 14]]) + >>> z.get_coordinate_selection(([0, 2], [1, 3])) + array([ 1, 13]) + >>> z.set_coordinate_selection(([0, 2], [1, 3]), [-1, -2]) + >>> z[...] 
+ array([[ 0, -1, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, -2, 14]]) + +For convenience, coordinate indexing is also available via the ``vindex`` property, e.g.:: + + >>> z.vindex[[0, 2], [1, 3]] + array([-1, -2]) + >>> z.vindex[[0, 2], [1, 3]] = [-3, -4] + >>> z[...] + array([[ 0, -3, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, -4, 14]]) + +Indexing with a mask array +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Items can also be extracted by providing a Boolean mask array. E.g.:: + + >>> z = zarr.array(np.arange(10)) + >>> z[...] + array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) + >>> sel = np.zeros_like(z, dtype=bool) + >>> sel[1] = True + >>> sel[4] = True + >>> z.get_mask_selection(sel) + array([1, 4]) + >>> z.set_mask_selection(sel, [-1, -2]) + >>> z[...] + array([ 0, -1, 2, 3, -2, 5, 6, 7, 8, 9]) + +Here is a multidimensional example:: + + >>> z = zarr.array(np.arange(15).reshape(3, 5)) + >>> z[...] + array([[ 0, 1, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, 13, 14]]) + >>> sel = np.zeros_like(z, dtype=bool) + >>> sel[0, 1] = True + >>> sel[2, 3] = True + >>> z.get_mask_selection(sel) + array([ 1, 13]) + >>> z.set_mask_selection(sel, [-1, -2]) + >>> z[...] + array([[ 0, -1, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, -2, 14]]) + +For convenience, mask indexing is also available via the ``vindex`` property, e.g.:: + + >>> z.vindex[sel] + array([-1, -2]) + >>> z.vindex[sel] = [-3, -4] + >>> z[...] + array([[ 0, -3, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, -4, 14]]) + +Mask indexing is conceptually the same as coordinate indexing, and is implemented internally via +the same machinery. Both styles of indexing allow selecting arbitrary items from an array, also +known as point selection. + +Orthogonal indexing +~~~~~~~~~~~~~~~~~~~ + +Zarr arrays also support methods for orthogonal indexing, which allows selections to be made +along each dimension of an array independently. For example, this allows selecting a subset of +rows and/or columns from a 2-dimensional array. 
E.g.:: + + >>> z = zarr.array(np.arange(15).reshape(3, 5)) + >>> z[...] + array([[ 0, 1, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, 13, 14]]) + >>> z.get_orthogonal_selection(([0, 2], slice(None))) # select first and third rows + array([[ 0, 1, 2, 3, 4], + [10, 11, 12, 13, 14]]) + >>> z.get_orthogonal_selection((slice(None), [1, 3])) # select second and fourth columns + array([[ 1, 3], + [ 6, 8], + [11, 13]]) + >>> z.get_orthogonal_selection(([0, 2], [1, 3])) # select rows [0, 2] and columns [1, 3] + array([[ 1, 3], + [11, 13]]) + +Data can also be modified, e.g.:: + + >>> z.set_orthogonal_selection(([0, 2], [1, 3]), [[-1, -2], [-3, -4]]) + >>> z[...] + array([[ 0, -1, 2, -2, 4], + [ 5, 6, 7, 8, 9], + [10, -3, 12, -4, 14]]) + +For convenience, the orthogonal indexing functionality is also available via the ``oindex`` +property, e.g.:: + + >>> z = zarr.array(np.arange(15).reshape(3, 5)) + >>> z.oindex[[0, 2], :] # select first and third rows + array([[ 0, 1, 2, 3, 4], + [10, 11, 12, 13, 14]]) + >>> z.oindex[:, [1, 3]] # select second and fourth columns + array([[ 1, 3], + [ 6, 8], + [11, 13]]) + >>> z.oindex[[0, 2], [1, 3]] # select rows [0, 2] and columns [1, 3] + array([[ 1, 3], + [11, 13]]) + >>> z.oindex[[0, 2], [1, 3]] = [[-1, -2], [-3, -4]] + >>> z[...] + array([[ 0, -1, 2, -2, 4], + [ 5, 6, 7, 8, 9], + [10, -3, 12, -4, 14]]) + +Any combination of integer, slice, integer array and/or Boolean array can be used for orthogonal +indexing. + +Indexing fields in structured arrays +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All selection methods support a ``fields`` parameter which allows retrieving or replacing data +for a specific field in an array with a structured dtype. E.g.:: + + >>> a = np.array([(b'aaa', 1, 4.2), + ... (b'bbb', 2, 8.4), + ... (b'ccc', 3, 12.6)], + ... 
dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + >>> z = zarr.array(a) + >>> z['foo'] + array([b'aaa', b'bbb', b'ccc'], + dtype='|S3') + >>> z['baz'] + array([ 4.2, 8.4, 12.6]) + >>> z.get_basic_selection(slice(0, 2), fields='bar') + array([1, 2], dtype=int32) + >>> z.get_coordinate_selection([0, 2], fields=['foo', 'baz']) + array([(b'aaa', 4.2), (b'ccc', 12.6)], + dtype=[('foo', 'S3'), ('baz', '<f8')]) >>> foo_group = root_group.create_group('foo') >>> z = foo_group.zeros('bar', shape=1000000, chunks=100000) >>> z[:] = 42 - >>> root_group - >>> root_group.info Name : / Type : zarr.hierarchy.Group @@ -437,8 +619,6 @@ No. groups : 1 Groups : foo - >>> foo_group - >>> foo_group.info Name : /foo Type : zarr.hierarchy.Group @@ -449,8 +629,6 @@ Diagnostic information about arrays and groups is available via the ``info`` pro No. groups : 0 Arrays : bar - >>> z - >>> z.info Name : /foo/bar Type : zarr.core.Array @@ -535,10 +713,31 @@ which compression filters (e.g., byte shuffle) have been applied. Storage alternatives ~~~~~~~~~~~~~~~~~~~~ -Zarr can use any object that implements the ``MutableMapping`` interface as -the store for a group or an array. +Zarr can use any object that implements the ``MutableMapping`` interface as the store for a group +or an array. Some storage classes are provided in the :mod:`zarr.storage` module. For example, +the :class:`zarr.storage.DirectoryStore` class provides a ``MutableMapping`` interface to a +directory on the local file system. This is used under the hood by the +:func:`zarr.creation.open_array` and :func:`zarr.hierarchy.open_group` functions. 
In other words, +the following code:: + + >>> z = zarr.open_array('example.zarr', mode='w', shape=1000000, dtype='i4') + +...is just short-hand for:: + + >>> store = zarr.DirectoryStore('example.zarr') + >>> z = zarr.zeros(store=store, overwrite=True, shape=1000000, dtype='i4') + +...and the following code:: -Here is an example storing an array directly into a Zip file:: + >>> grp = zarr.open_group('example.zarr', mode='w') + +...is just a short-hand for:: + + >>> store = zarr.DirectoryStore('example.zarr') + >>> grp = zarr.group(store=store, overwrite=True) + +Any other storage class could be used in place of :class:`zarr.storage.DirectoryStore`. For +example, here is an array stored directly into a Zip file:: >>> store = zarr.ZipStore('example.zip', mode='w') >>> root_group = zarr.group(store=store) @@ -567,11 +766,9 @@ Re-open and check that data have been written:: Note that there are some restrictions on how Zip files can be used, because items within a Zip file cannot be updated in place. This means that data in the array should only be written once and write -operations should be aligned with chunk boundaries. - -Note also that the ``close()`` method must be called after writing any data to -the store, otherwise essential records will not be written to the underlying -zip file. +operations should be aligned with chunk boundaries. Note also that the ``close()`` method must be +called after writing any data to the store, otherwise essential records will not be written to +the underlying zip file. The Dask project has implementations of the ``MutableMapping`` interface for distributed storage systems, see the `S3Map @@ -618,7 +815,7 @@ simple heuristics and may be far from optimal. E.g.:: >>> z4 = zarr.zeros((10000, 10000), dtype='i4') >>> z4.chunks - (313, 313) + (313, 625) .. 
_tutorial_tips_blosc: diff --git a/notebooks/advanced_indexing.ipynb b/notebooks/advanced_indexing.ipynb new file mode 100644 index 0000000000..eba6b5880b --- /dev/null +++ b/notebooks/advanced_indexing.ipynb @@ -0,0 +1,2798 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Advanced indexing" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'2.1.5.dev144'" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import sys\n", + "sys.path.insert(0, '..')\n", + "import zarr\n", + "import numpy as np\n", + "np.random.seed(42)\n", + "import cProfile\n", + "zarr.__version__" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Functionality and API" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Indexing a 1D array with a Boolean (mask) array\n", + "\n", + "Supported via ``get/set_mask_selection()`` and ``.vindex[]``. Also supported via ``get/set_orthogonal_selection()`` and ``.oindex[]``." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "a = np.arange(10)\n", + "za = zarr.array(a, chunks=2)\n", + "ix = [False, True, False, True, False, True, False, True, False, True]" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 3, 5, 7, 9])" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get items\n", + "za.vindex[ix]" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 3, 5, 7, 9])" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get items\n", + "za.oindex[ix]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0, 10, 2, 30, 4, 50, 6, 70, 8, 90])" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items\n", + "za.vindex[ix] = a[ix] * 10\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0, 100, 2, 300, 4, 500, 6, 700, 8, 900])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items\n", + "za.oindex[ix] = a[ix] * 100\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 3, 5, 7, 9])" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# if using .oindex, indexing array can be any array-like, e.g., Zarr array\n", + "zix = zarr.array(ix, chunks=2)\n", + "za = zarr.array(a, chunks=2)\n", + "za.oindex[zix] # will not load all 
zix into memory" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Indexing a 1D array with a 1D integer (coordinate) array\n", + "\n", + "Supported via ``get/set_coordinate_selection()`` and ``.vindex[]``. Also supported via ``get/set_orthogonal_selection()`` and ``.oindex[]``." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "a = np.arange(10)\n", + "za = zarr.array(a, chunks=2)\n", + "ix = [1, 3, 5, 7, 9]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 3, 5, 7, 9])" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get items\n", + "za.vindex[ix]" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 3, 5, 7, 9])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get items\n", + "za.oindex[ix]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0, 10, 2, 30, 4, 50, 6, 70, 8, 90])" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items\n", + "za.vindex[ix] = a[ix] * 10\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0, 100, 2, 300, 4, 500, 6, 700, 8, 900])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items\n", + "za.oindex[ix] = a[ix] * 100\n", + "za[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Indexing a 1D array with a multi-dimensional integer (coordinate) array\n", + "\n", + "Supported via 
``get/set_coordinate_selection()`` and ``.vindex[]``." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "a = np.arange(10)\n", + "za = zarr.array(a, chunks=2)\n", + "ix = np.array([[1, 3, 5], [2, 4, 6]])" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[1, 3, 5],\n", + " [2, 4, 6]])" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get items\n", + "za.vindex[ix]" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0, 10, 20, 30, 40, 50, 60, 7, 8, 9])" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items\n", + "za.vindex[ix] = a[ix] * 10\n", + "za[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Slicing a 1D array with step > 1\n", + "\n", + "Slices with step > 1 are supported via ``get/set_basic_selection()``, ``get/set_orthogonal_selection()``, ``__getitem__`` and ``.oindex[]``. Negative steps are not supported." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "a = np.arange(10)\n", + "za = zarr.array(a, chunks=2)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 3, 5, 7, 9])" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get items\n", + "za[1::2]" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0, 10, 2, 30, 4, 50, 6, 70, 8, 90])" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items\n", + "za.oindex[1::2] = a[1::2] * 10\n", + "za[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Orthogonal (outer) indexing of multi-dimensional arrays\n", + "\n", + "Orthogonal (a.k.a. outer) indexing is supported with either Boolean or integer arrays, in combination with integers and slices. This functionality is provided via the ``get/set_orthogonal_selection()`` methods. For convenience, this functionality is also available via the ``.oindex[]`` property." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [ 3, 4, 5],\n", + " [ 6, 7, 8],\n", + " [ 9, 10, 11],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = np.arange(15).reshape(5, 3)\n", + "za = zarr.array(a, chunks=(3, 2))\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 3, 5],\n", + " [ 9, 11]])" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# orthogonal indexing with Boolean arrays\n", + "ix0 = [False, True, False, True, False]\n", + "ix1 = [True, False, True]\n", + "za.get_orthogonal_selection((ix0, ix1))" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 3, 5],\n", + " [ 9, 11]])" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# alternative API\n", + "za.oindex[ix0, ix1]" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 3, 5],\n", + " [ 9, 11]])" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# orthogonal indexing with integer arrays\n", + "ix0 = [1, 3]\n", + "ix1 = [0, 2]\n", + "za.get_orthogonal_selection((ix0, ix1))" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 3, 5],\n", + " [ 9, 11]])" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# alternative API\n", + "za.oindex[ix0, ix1]" + ] + }, + { + "cell_type": "code", + "execution_count": 
24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 3, 4, 5],\n", + " [ 9, 10, 11]])" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# combine with slice\n", + "za.oindex[[1, 3], :]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 2],\n", + " [ 3, 5],\n", + " [ 6, 8],\n", + " [ 9, 11],\n", + " [12, 14]])" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# combine with slice\n", + "za.oindex[:, [0, 2]]" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [42, 4, 42],\n", + " [ 6, 7, 8],\n", + " [42, 10, 42],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items via Boolean selection\n", + "ix0 = [False, True, False, True, False]\n", + "ix1 = [True, False, True]\n", + "selection = ix0, ix1\n", + "value = 42\n", + "za.set_orthogonal_selection(selection, value)\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [44, 4, 44],\n", + " [ 6, 7, 8],\n", + " [44, 10, 44],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# alternative API\n", + "za.oindex[ix0, ix1] = 44\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [46, 4, 46],\n", + " [ 6, 7, 8],\n", + " [46, 10, 46],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# 
set items via integer selection\n", + "ix0 = [1, 3]\n", + "ix1 = [0, 2]\n", + "selection = ix0, ix1\n", + "value = 46\n", + "za.set_orthogonal_selection(selection, value)\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [48, 4, 48],\n", + " [ 6, 7, 8],\n", + " [48, 10, 48],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# alternative API\n", + "za.oindex[ix0, ix1] = 48\n", + "za[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Coordinate indexing of multi-dimensional arrays\n", + "\n", + "Selecting arbitrary points from a multi-dimensional array by indexing with integer (coordinate) arrays is supported. This functionality is provided via the ``get/set_coordinate_selection()`` methods. For convenience, this functionality is also available via the ``.vindex[]`` property." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [ 3, 4, 5],\n", + " [ 6, 7, 8],\n", + " [ 9, 10, 11],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = np.arange(15).reshape(5, 3)\n", + "za = zarr.array(a, chunks=(3, 2))\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 3, 11])" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get items\n", + "ix0 = [1, 3]\n", + "ix1 = [0, 2]\n", + "za.get_coordinate_selection((ix0, ix1))" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 3, 11])" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# alternative API\n", + "za.vindex[ix0, ix1]" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [42, 4, 5],\n", + " [ 6, 7, 8],\n", + " [ 9, 10, 42],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set items\n", + "za.set_coordinate_selection((ix0, ix1), 42)\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [44, 4, 5],\n", + " [ 6, 7, 8],\n", + " [ 9, 10, 44],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# alternative API\n", + "za.vindex[ix0, ix1] = 44\n", + "za[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "### Mask indexing of multi-dimensional arrays\n", + "\n", + "Selecting arbitrary points from a multi-dimensional array by a Boolean array is supported. This functionality is provided via the ``get/set_mask_selection()`` methods. For convenience, this functionality is also available via the ``.vindex[]`` property." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [ 3, 4, 5],\n", + " [ 6, 7, 8],\n", + " [ 9, 10, 11],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = np.arange(15).reshape(5, 3)\n", + "za = zarr.array(a, chunks=(3, 2))\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 3, 11])" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ix = np.zeros_like(a, dtype=bool)\n", + "ix[1, 0] = True\n", + "ix[3, 2] = True\n", + "za.get_mask_selection(ix)" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 3, 11])" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "za.vindex[ix]" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [42, 4, 5],\n", + " [ 6, 7, 8],\n", + " [ 9, 10, 42],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "za.set_mask_selection(ix, 42)\n", + "za[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 0, 1, 2],\n", + " [44, 4, 5],\n", + " [ 6, 7, 
8],\n", + " [ 9, 10, 44],\n", + " [12, 13, 14]])" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "za.vindex[ix] = 44\n", + "za[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Selecting fields from arrays with a structured dtype\n", + "\n", + "All ``get/set_selection_...()`` methods support a ``fields`` argument which allows retrieving/replacing data for a specific field or fields. Also h5py-like API is supported where fields can be provided within ``__getitem__``, ``.oindex[]`` and ``.vindex[]``." + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([(b'aaa', 1, 4.2), (b'bbb', 2, 8.4), (b'ccc', 3, 12.6)],\n", + " dtype=[('foo', 'S3'), ('bar', '\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'foo'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'baz'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mIndexError\u001b[0m: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices" + ] + } + ], + "source": [ + "a['foo', 'baz']" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([(b'aaa', 4.2), (b'bbb', 8.4), (b'ccc', 12.6)],\n", + " dtype=[('foo', 'S3'), ('baz', '", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mza\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'foo'\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m'baz'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/src/github/alimanfoo/zarr/zarr/core.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, selection)\u001b[0m\n\u001b[1;32m 537\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 538\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mselection\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpop_fields\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mselection\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 539\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_basic_selection\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mselection\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfields\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 540\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 541\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget_basic_selection\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mselection\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mEllipsis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/src/github/alimanfoo/zarr/zarr/core.py\u001b[0m in \u001b[0;36mget_basic_selection\u001b[0;34m(self, selection, out, fields)\u001b[0m\n\u001b[1;32m 661\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_basic_selection_zd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mselection\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mselection\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mfields\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfields\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 662\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 663\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_basic_selection_nd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mselection\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mselection\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfields\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 664\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 665\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_get_basic_selection_zd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mselection\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/src/github/alimanfoo/zarr/zarr/core.py\u001b[0m in \u001b[0;36m_get_basic_selection_nd\u001b[0;34m(self, selection, out, fields)\u001b[0m\n\u001b[1;32m 701\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 702\u001b[0m \u001b[0;31m# setup indexer\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 703\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mBasicIndexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mselection\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 704\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 705\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_selection\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfields\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/src/github/alimanfoo/zarr/zarr/indexing.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, selection, array)\u001b[0m\n\u001b[1;32m 275\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 276\u001b[0m raise IndexError('unsupported selection item for basic indexing; expected integer '\n\u001b[0;32m--> 277\u001b[0;31m 'or slice, got {!r}'.format(type(dim_sel)))\n\u001b[0m\u001b[1;32m 278\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 279\u001b[0m \u001b[0mdim_indexers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdim_indexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mIndexError\u001b[0m: unsupported selection item for basic indexing; expected integer or slice, got " + ] + } + ], + "source": [ + "za[['foo', 'baz']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1D Benchmarking" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "800000000" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "c = np.arange(100000000)\n", + "c.nbytes" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 480 ms, sys: 16 ms, total: 496 ms\n", + "Wall time: 141 ms\n" + ] + }, + { + "data": { + "text/html": [ + "
Typezarr.core.Array
Data typeint64
Shape(100000000,)
Chunk shape(97657,)
OrderC
Read-onlyFalse
CompressorBlosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store typebuiltins.dict
No. bytes800000000 (762.9M)
No. bytes stored11854081 (11.3M)
Storage ratio67.5
Chunks initialized1024/1024
" + ], + "text/plain": [ + "Type : zarr.core.Array\n", + "Data type : int64\n", + "Shape : (100000000,)\n", + "Chunk shape : (97657,)\n", + "Order : C\n", + "Read-only : False\n", + "Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)\n", + "Store type : builtins.dict\n", + "No. bytes : 800000000 (762.9M)\n", + "No. bytes stored : 11854081 (11.3M)\n", + "Storage ratio : 67.5\n", + "Chunks initialized : 1024/1024" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%time zc = zarr.array(c)\n", + "zc.info" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "121 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit c.copy()" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "254 ms ± 942 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### bool dense selection" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9997476" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# relatively dense selection - 10%\n", + "ix_dense_bool = np.random.binomial(1, 0.1, size=c.shape[0]).astype(bool)\n", + "np.count_nonzero(ix_dense_bool)" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "243 ms ± 5.8 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit c[ix_dense_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "433 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc.oindex[ix_dense_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "548 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc.vindex[ix_dense_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [], + "source": [ + "import tempfile\n", + "import cProfile\n", + "import pstats\n", + "\n", + "def profile(statement, sort='time', restrictions=(7,)):\n", + " with tempfile.NamedTemporaryFile() as f:\n", + " cProfile.run(statement, filename=f.name)\n", + " pstats.Stats(f.name).sort_stats(sort).print_stats(*restrictions)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:17:48 2017 /tmp/tmpruua2rs_\n", + "\n", + " 98386 function calls in 0.483 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 83 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1025 0.197 0.000 0.197 0.000 {method 'nonzero' of 'numpy.ndarray' objects}\n", + " 1024 0.149 0.000 0.159 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1024 0.044 0.000 0.231 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1024 0.009 0.000 0.009 0.000 {built-in method numpy.core.multiarray.count_nonzero}\n", + " 1025 0.007 0.000 0.238 0.000 ../zarr/indexing.py:541(__iter__)\n", + " 1024 0.006 0.000 0.207 0.000 
/home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/index_tricks.py:26(ix_)\n", + " 2048 0.005 0.000 0.005 0.000 ../zarr/core.py:337()\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.oindex[ix_dense_bool]')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Method ``nonzero`` is being called internally within numpy to convert bool to int selections, no way to avoid." + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:18:06 2017 /tmp/tmp7_bautep\n", + "\n", + " 52382 function calls in 0.592 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 88 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 2 0.219 0.110 0.219 0.110 {method 'nonzero' of 'numpy.ndarray' objects}\n", + " 1024 0.096 0.000 0.101 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 2 0.094 0.047 0.094 0.047 ../zarr/indexing.py:630()\n", + " 1024 0.044 0.000 0.167 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1 0.029 0.029 0.029 0.029 {built-in method numpy.core.multiarray.ravel_multi_index}\n", + " 1 0.023 0.023 0.023 0.023 {built-in method numpy.core.multiarray.bincount}\n", + " 1 0.021 0.021 0.181 0.181 ../zarr/indexing.py:603(__init__)\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.vindex[ix_dense_bool]')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "``.vindex[]`` is a bit slower, possibly because internally it converts to a coordinate array first." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### int dense selection" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10000000" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ix_dense_int = np.random.choice(c.shape[0], size=c.shape[0]//10, replace=True)\n", + "ix_dense_int_sorted = ix_dense_int.copy()\n", + "ix_dense_int_sorted.sort()\n", + "len(ix_dense_int)" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "62.2 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit c[ix_dense_int_sorted]" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "355 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc.oindex[ix_dense_int_sorted]" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "351 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc.vindex[ix_dense_int_sorted]" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "128 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit c[ix_dense_int]" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.71 s ± 5.1 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc.oindex[ix_dense_int]" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.68 s ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc.vindex[ix_dense_int]" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:19:09 2017 /tmp/tmpgmu5btr_\n", + "\n", + " 95338 function calls in 0.424 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 89 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1 0.141 0.141 0.184 0.184 ../zarr/indexing.py:369(__init__)\n", + " 1024 0.099 0.000 0.106 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1024 0.046 0.000 0.175 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1025 0.027 0.000 0.027 0.000 ../zarr/indexing.py:424(__iter__)\n", + " 1 0.023 0.023 0.023 0.023 {built-in method numpy.core.multiarray.bincount}\n", + " 1 0.010 0.010 0.010 0.010 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/function_base.py:1848(diff)\n", + " 1025 0.006 0.000 0.059 0.000 ../zarr/indexing.py:541(__iter__)\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.oindex[ix_dense_int_sorted]')" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:19:13 2017 /tmp/tmpay1gvnx8\n", + "\n", + " 52362 function calls in 0.398 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 85 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 2 0.107 0.054 0.107 0.054 ../zarr/indexing.py:630()\n", + " 1024 0.091 
0.000 0.096 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1024 0.041 0.000 0.160 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1 0.040 0.040 0.213 0.213 ../zarr/indexing.py:603(__init__)\n", + " 1 0.029 0.029 0.029 0.029 {built-in method numpy.core.multiarray.ravel_multi_index}\n", + " 1 0.023 0.023 0.023 0.023 {built-in method numpy.core.multiarray.bincount}\n", + " 2048 0.011 0.000 0.011 0.000 ../zarr/indexing.py:695()\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.vindex[ix_dense_int_sorted]')" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:19:20 2017 /tmp/tmpngsf6zpp\n", + "\n", + " 120946 function calls in 1.793 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 92 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1 1.128 1.128 1.128 1.128 {method 'argsort' of 'numpy.ndarray' objects}\n", + " 1024 0.139 0.000 0.285 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1 0.132 0.132 1.422 1.422 ../zarr/indexing.py:369(__init__)\n", + " 1 0.120 0.120 0.120 0.120 {method 'take' of 'numpy.ndarray' objects}\n", + " 1024 0.116 0.000 0.123 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1025 0.034 0.000 0.034 0.000 ../zarr/indexing.py:424(__iter__)\n", + " 1 0.023 0.023 0.023 0.023 {built-in method numpy.core.multiarray.bincount}\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.oindex[ix_dense_int]')" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:19:22 2017 /tmp/tmpbskhj8de\n", + "\n", + " 50320 function calls in 1.730 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 86 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall 
filename:lineno(function)\n", + " 1 1.116 1.116 1.116 1.116 {method 'argsort' of 'numpy.ndarray' objects}\n", + " 1024 0.133 0.000 0.275 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 2 0.121 0.060 0.121 0.060 ../zarr/indexing.py:654()\n", + " 1024 0.113 0.000 0.119 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 2 0.100 0.050 0.100 0.050 ../zarr/indexing.py:630()\n", + " 1 0.030 0.030 0.030 0.030 {built-in method numpy.core.multiarray.ravel_multi_index}\n", + " 1 0.024 0.024 1.427 1.427 ../zarr/indexing.py:603(__init__)\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.vindex[ix_dense_int]')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When indices are not sorted, zarr needs to partially sort them so they occur in chunk order, so we only have to visit each chunk once. This sorting dominates the processing time and is unavoidable AFAIK." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### bool sparse selection" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9932" + ] + }, + "execution_count": 75, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# relatively sparse selection\n", + "ix_sparse_bool = np.random.binomial(1, 0.0001, size=c.shape[0]).astype(bool)\n", + "np.count_nonzero(ix_sparse_bool)" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "15.7 ms ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "%timeit c[ix_sparse_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "156 ms ± 2.1 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc.oindex[ix_sparse_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "133 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc.vindex[ix_sparse_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:20:09 2017 /tmp/tmpb7nqc9ax\n", + "\n", + " 98386 function calls in 0.191 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 83 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1024 0.093 0.000 0.098 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1025 0.017 0.000 0.017 0.000 {method 'nonzero' of 'numpy.ndarray' objects}\n", + " 1024 0.007 0.000 0.007 0.000 {built-in method numpy.core.multiarray.count_nonzero}\n", + " 1024 0.007 0.000 0.129 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1025 0.005 0.000 0.052 0.000 ../zarr/indexing.py:541(__iter__)\n", + " 1024 0.005 0.000 0.025 0.000 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/index_tricks.py:26(ix_)\n", + " 2048 0.004 0.000 0.004 0.000 ../zarr/core.py:337()\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.oindex[ix_sparse_bool]')" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:20:09 2017 /tmp/tmphsko8nvh\n", + "\n", + " 52382 function calls in 0.160 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 88 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1024 0.093 0.000 0.098 0.000 
../zarr/core.py:1028(_decode_chunk)\n", + " 2 0.017 0.008 0.017 0.008 {method 'nonzero' of 'numpy.ndarray' objects}\n", + " 1025 0.008 0.000 0.014 0.000 ../zarr/indexing.py:674(__iter__)\n", + " 1024 0.006 0.000 0.127 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 2048 0.004 0.000 0.004 0.000 ../zarr/indexing.py:695()\n", + " 2054 0.003 0.000 0.003 0.000 ../zarr/core.py:337()\n", + " 1024 0.002 0.000 0.005 0.000 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/core/arrayprint.py:381(wrapper)\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.vindex[ix_sparse_bool]')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### int sparse selection" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10000" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ix_sparse_int = np.random.choice(c.shape[0], size=c.shape[0]//10000, replace=True)\n", + "ix_sparse_int_sorted = ix_sparse_int.copy()\n", + "ix_sparse_int_sorted.sort()\n", + "len(ix_sparse_int)" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "18.9 µs ± 392 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "%timeit c[ix_sparse_int_sorted]" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "20.3 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "%timeit c[ix_sparse_int]" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "125 ms ± 296 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc.oindex[ix_sparse_int_sorted]" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "109 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc.vindex[ix_sparse_int_sorted]" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "132 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc.oindex[ix_sparse_int]" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "108 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc.vindex[ix_sparse_int]" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:21:12 2017 /tmp/tmp0b0o2quo\n", + "\n", + " 120946 function calls in 0.196 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 92 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1024 0.105 0.000 0.111 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 2048 0.006 0.000 0.013 0.000 /home/aliman/pyenv/zarr_20171023/lib/python3.6/site-packages/numpy/lib/index_tricks.py:26(ix_)\n", + " 1025 0.006 0.000 0.051 0.000 ../zarr/indexing.py:541(__iter__)\n", + " 1024 0.006 0.000 0.141 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 2048 0.005 0.000 0.005 0.000 ../zarr/core.py:337()\n", + " 15373 0.004 0.000 0.010 0.000 {built-in method builtins.isinstance}\n", + " 1025 0.004 0.000 0.005 0.000 ../zarr/indexing.py:424(__iter__)\n", + "\n", + 
"\n" + ] + } + ], + "source": [ + "profile('zc.oindex[ix_sparse_int]')" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:21:19 2017 /tmp/tmpdwju98kn\n", + "\n", + " 50320 function calls in 0.167 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 86 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1024 0.105 0.000 0.111 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1025 0.009 0.000 0.017 0.000 ../zarr/indexing.py:674(__iter__)\n", + " 1024 0.006 0.000 0.142 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 2048 0.005 0.000 0.005 0.000 ../zarr/indexing.py:695()\n", + " 2054 0.004 0.000 0.004 0.000 ../zarr/core.py:337()\n", + " 1 0.003 0.003 0.162 0.162 ../zarr/core.py:591(_get_selection)\n", + " 1027 0.003 0.000 0.003 0.000 {method 'reshape' of 'numpy.ndarray' objects}\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc.vindex[ix_sparse_int]')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For sparse selections, processing time is dominated by decompression, so we can't do any better." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### sparse bool selection as zarr array" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Typezarr.core.Array
Data typebool
Shape(100000000,)
Chunk shape(390625,)
OrderC
Read-onlyFalse
CompressorBlosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store typebuiltins.dict
No. bytes100000000 (95.4M)
No. bytes stored507131 (495.2K)
Storage ratio197.2
Chunks initialized256/256
" + ], + "text/plain": [ + "Type : zarr.core.Array\n", + "Data type : bool\n", + "Shape : (100000000,)\n", + "Chunk shape : (390625,)\n", + "Order : C\n", + "Read-only : False\n", + "Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)\n", + "Store type : builtins.dict\n", + "No. bytes : 100000000 (95.4M)\n", + "No. bytes stored : 507131 (495.2K)\n", + "Storage ratio : 197.2\n", + "Chunks initialized : 256/256" + ] + }, + "execution_count": 90, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "zix_sparse_bool = zarr.array(ix_sparse_bool)\n", + "zix_sparse_bool.info" + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "387 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zc.oindex[zix_sparse_bool]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### slice with step" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "80.3 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit np.array(c[::2])" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "168 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc[::2]" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "136 ms ± 1.56 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc[::10]" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "104 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc[::100]" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit zc[::1000]" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:22:44 2017 /tmp/tmpg9dxqcpg\n", + "\n", + " 49193 function calls in 0.211 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 55 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1024 0.104 0.000 0.110 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1024 0.067 0.000 0.195 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1025 0.005 0.000 0.013 0.000 ../zarr/indexing.py:278(__iter__)\n", + " 2048 0.004 0.000 0.004 0.000 ../zarr/core.py:337()\n", + " 2050 0.003 0.000 0.003 0.000 ../zarr/indexing.py:90(ceildiv)\n", + " 1025 0.003 0.000 0.006 0.000 ../zarr/indexing.py:109(__iter__)\n", + " 1024 0.003 0.000 0.003 0.000 {method 'reshape' of 'numpy.ndarray' objects}\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zc[::2]')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2D Benchmarking" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(100000000,)" + ] + }, + "execution_count": 99, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"c.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(100000, 1000)" + ] + }, + "execution_count": 100, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "d = c.reshape(-1, 1000)\n", + "d.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Typezarr.core.Array
Data typeint64
Shape(100000, 1000)
Chunk shape(3125, 32)
OrderC
Read-onlyFalse
CompressorBlosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store typebuiltins.dict
No. bytes800000000 (762.9M)
No. bytes stored39228864 (37.4M)
Storage ratio20.4
Chunks initialized1024/1024
" + ], + "text/plain": [ + "Type : zarr.core.Array\n", + "Data type : int64\n", + "Shape : (100000, 1000)\n", + "Chunk shape : (3125, 32)\n", + "Order : C\n", + "Read-only : False\n", + "Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)\n", + "Store type : builtins.dict\n", + "No. bytes : 800000000 (762.9M)\n", + "No. bytes stored : 39228864 (37.4M)\n", + "Storage ratio : 20.4\n", + "Chunks initialized : 1024/1024" + ] + }, + "execution_count": 101, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "zd = zarr.array(d)\n", + "zd.info" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### bool orthogonal selection" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "metadata": {}, + "outputs": [], + "source": [ + "ix0 = np.random.binomial(1, 0.5, size=d.shape[0]).astype(bool)\n", + "ix1 = np.random.binomial(1, 0.5, size=d.shape[1]).astype(bool)" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "101 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit d[np.ix_(ix0, ix1)]" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "373 ms ± 5.45 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zd.oindex[ix0, ix1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### int orthogonal selection" + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "metadata": {}, + "outputs": [], + "source": [ + "ix0 = np.random.choice(d.shape[0], size=int(d.shape[0] * .5), replace=True)\n", + "ix1 = np.random.choice(d.shape[1], size=int(d.shape[1] * .5), replace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "174 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit d[np.ix_(ix0, ix1)]" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "566 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zd.oindex[ix0, ix1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### coordinate (point) selection" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10000000" + ] + }, + "execution_count": 108, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "n = int(d.size * .1)\n", + "ix0 = np.random.choice(d.shape[0], size=n, replace=True)\n", + "ix1 = np.random.choice(d.shape[1], size=n, replace=True)\n", + "n" + ] + }, + { + "cell_type": "code", + "execution_count": 109, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "243 ms ± 3.37 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit d[ix0, ix1]" + ] + }, + { + "cell_type": "code", + "execution_count": 110, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2.03 s ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit zd.vindex[ix0, ix1]" + ] + }, + { + "cell_type": "code", + "execution_count": 111, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Nov 8 17:24:31 2017 /tmp/tmp7c68z70p\n", + "\n", + " 62673 function calls in 2.065 seconds\n", + "\n", + " Ordered by: internal time\n", + " List reduced from 88 to 7 due to restriction <7>\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 1 1.112 1.112 1.112 1.112 {method 'argsort' of 'numpy.ndarray' objects}\n", + " 3 0.244 0.081 0.244 0.081 ../zarr/indexing.py:654()\n", + " 3 0.193 0.064 0.193 0.064 ../zarr/indexing.py:630()\n", + " 1024 0.170 0.000 0.350 0.000 ../zarr/core.py:849(_chunk_getitem)\n", + " 1024 0.142 0.000 0.151 0.000 ../zarr/core.py:1028(_decode_chunk)\n", + " 1 0.044 0.044 0.044 0.044 {built-in method numpy.core.multiarray.ravel_multi_index}\n", + " 1 0.043 0.043 1.676 1.676 ../zarr/indexing.py:603(__init__)\n", + "\n", + "\n" + ] + } + ], + "source": [ + "profile('zd.vindex[ix0, ix1]')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Points need to be partially sorted so all points in the same chunk are grouped and processed together. This requires ``argsort``, which dominates the processing time." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## h5py comparison\n", + "\n", + "N.B., not really a fair comparison because h5py is using a slower compressor (gzip), but included for interest..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [], + "source": [ + "import h5py\n", + "import tempfile" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [], + "source": [ + "h5f = h5py.File(tempfile.mktemp(), driver='core', backing_store=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "hc = h5f.create_dataset('c', data=c, compression='gzip', compression_opts=1, chunks=zc.chunks, shuffle=True)\n", + "hc" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 1.16 s, sys: 172 ms, total: 1.33 s\n", + "Wall time: 1.32 s\n" + ] + }, + { + "data": { + "text/plain": [ + "array([ 0, 1, 2, ..., 99999997, 99999998, 99999999])" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%time hc[:]" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 1.11 s, sys: 0 ns, total: 1.11 s\n", + "Wall time: 1.11 s\n" + ] + }, + { + "data": { + "text/plain": [ + "array([ 1063, 28396, 37229, ..., 99955875, 99979354, 99995791])" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "%time hc[ix_sparse_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [], + "source": [ + "# # this is pathological, takes minutes \n", + "# %time hc[ix_dense_bool]" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"CPU times: user 38.3 s, sys: 136 ms, total: 38.4 s\n", + "Wall time: 38.1 s\n" + ] + }, + { + "data": { + "text/plain": [ + "array([ 0, 1000, 2000, ..., 99997000, 99998000, 99999000])" + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# this is pretty slow\n", + "%time hc[::1000]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/requirements.txt b/requirements.txt index 8427764e04..28a7ceece4 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,3 +1,4 @@ +nose numpy fasteners numcodecs diff --git a/windows_conda_dev.txt b/windows_conda_dev.txt new file mode 100644 index 0000000000..85b43d4255 --- /dev/null +++ b/windows_conda_dev.txt @@ -0,0 +1,11 @@ +coverage +coveralls +fasteners +flake8 +monotonic +msgpack-python +nose +numcodecs +numpy +setuptools_scm +twine diff --git a/zarr/core.py b/zarr/core.py index f7e55b1f23..1901a0c1f3 100644 --- a/zarr/core.py +++ b/zarr/core.py @@ -8,17 +8,21 @@ import numpy as np -from zarr.util import is_total_slice, normalize_array_selection, get_chunk_range, \ - human_readable_size, normalize_resize_args, normalize_storage_path, normalize_shape, \ - normalize_chunks, InfoReporter +from zarr.util import (is_total_slice, human_readable_size, normalize_resize_args, + normalize_storage_path, normalize_shape, normalize_chunks, InfoReporter, + check_array_shape) from zarr.storage import array_meta_key, attrs_key, listdir, getsize from zarr.meta import decode_array_metadata, 
encode_array_metadata from zarr.attrs import Attributes from zarr.errors import PermissionError, err_read_only, err_array_not_found from zarr.compat import reduce from zarr.codecs import AsType, get_codec +from zarr.indexing import (OIndex, OrthogonalIndexer, BasicIndexer, VIndex, CoordinateIndexer, + MaskIndexer, check_fields, pop_fields, ensure_tuple, is_scalar, + is_contiguous_selection, err_too_many_indices, check_no_multi_fields) +# noinspection PyUnresolvedReferences class Array(object): """Instantiate an array from an initialized store. @@ -67,11 +71,21 @@ class Array(object): nchunks_initialized is_view info + vindex + oindex Methods ------- __getitem__ __setitem__ + get_basic_selection + set_basic_selection + get_orthogonal_selection + set_orthogonal_selection + get_mask_selection + set_mask_selection + get_coordinate_selection + set_coordinate_selection resize append view @@ -107,6 +121,10 @@ def __init__(self, store, path=None, read_only=False, chunk_store=None, # initialize info reporter self._info_reporter = InfoReporter(self) + # initialize indexing helpers + self._oindex = OIndex(self) + self._vindex = VIndex(self) + def _load_metadata(self): """(Re)load metadata from store.""" if self._synchronizer is None: @@ -124,7 +142,7 @@ def _load_metadata_nosync(self): err_array_not_found(self._path) else: - # decode and store metadata + # decode and store metadata as instance members meta = decode_array_metadata(meta_bytes) self._meta = meta self._shape = meta['shape'] @@ -156,7 +174,7 @@ def _refresh_metadata_nosync(self): def _flush_metadata_nosync(self): if self._is_view: - raise PermissionError('not permitted for views') + raise PermissionError('operation not permitted for views') if self._compressor: compressor_config = self._compressor.get_config() @@ -315,7 +333,7 @@ def nbytes_stored(self): @property def _cdata_shape(self): if self._shape == (): - return (1,) + return 1, else: return tuple(int(np.ceil(s / c)) for s, c in zip(self._shape, 
self._chunks)) @@ -340,7 +358,6 @@ def nchunks(self): @property def nchunks_initialized(self): """The number of chunks that have been initialized with some data.""" - # TODO fix bug here, need to only count chunks # key pattern for chunk keys prog = re.compile(r'\.'.join([r'\d+'] * min(1, self.ndim))) @@ -356,6 +373,19 @@ def is_view(self): """A boolean, True if this array is a view on another array.""" return self._is_view + @property + def oindex(self): + """Shortcut for orthogonal (outer) indexing, see :func:`get_orthogonal_selection` and + :func:`set_orthogonal_selection` for documentation and examples.""" + return self._oindex + + @property + def vindex(self): + """Shortcut for vectorized (inner) indexing, see :func:`get_coordinate_selection`, + :func:`set_coordinate_selection`, :func:`get_mask_selection` and + :func:`set_mask_selection` for documentation and examples.""" + return self._vindex + def __eq__(self, other): return ( isinstance(other, Array) and @@ -377,11 +407,17 @@ def __len__(self): if self.shape: return self.shape[0] else: + # 0-dimensional array, same error message as numpy raise TypeError('len() of unsized object') - def __getitem__(self, item): - """Retrieve data for some portion of the array. Most NumPy-style - slicing operations are supported. + def __getitem__(self, selection): + """Retrieve data for an item or region of the array. + + Parameters + ---------- + selection : tuple + An integer index or slice or tuple of int/slice objects specifying the requested + item or region for each dimension of the array. 
Returns ------- @@ -390,63 +426,225 @@ def __getitem__(self, item): Examples -------- - Setup a 1-dimensional array:: >>> import zarr >>> import numpy as np - >>> z = zarr.array(np.arange(100000000), chunks=1000000, dtype='i4') - >>> z - + >>> z = zarr.array(np.arange(100)) - Take some slices:: + Retrieve a single item:: >>> z[5] 5 + + Retrieve a region via slicing:: + >>> z[:5] - array([0, 1, 2, 3, 4], dtype=int32) + array([0, 1, 2, 3, 4]) >>> z[-5:] - array([99999995, 99999996, 99999997, 99999998, 99999999], dtype=int32) + array([95, 96, 97, 98, 99]) >>> z[5:10] - array([5, 6, 7, 8, 9], dtype=int32) - >>> z[:] - array([ 0, 1, 2, ..., 99999997, 99999998, 99999999], dtype=int32) + array([5, 6, 7, 8, 9]) + >>> z[5:10:2] + array([5, 7, 9]) + >>> z[::2] + array([ 0, 2, 4, ..., 94, 96, 98]) + + Load the entire array into memory:: + + >>> z[...] + array([ 0, 1, 2, ..., 97, 98, 99]) Setup a 2-dimensional array:: + >>> z = zarr.array(np.arange(100).reshape(10, 10)) + + Retrieve an item:: + + >>> z[2, 2] + 22 + + Retrieve a region via slicing:: + + >>> z[1:3, 1:3] + array([[11, 12], + [21, 22]]) + >>> z[1:3, :] + array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19], + [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]]) + >>> z[:, 1:3] + array([[ 1, 2], + [11, 12], + [21, 22], + [31, 32], + [41, 42], + [51, 52], + [61, 62], + [71, 72], + [81, 82], + [91, 92]]) + >>> z[0:5:2, 0:5:2] + array([[ 0, 2, 4], + [20, 22, 24], + [40, 42, 44]]) + >>> z[::2, ::2] + array([[ 0, 2, 4, 6, 8], + [20, 22, 24, 26, 28], + [40, 42, 44, 46, 48], + [60, 62, 64, 66, 68], + [80, 82, 84, 86, 88]]) + + Load the entire array into memory:: + + >>> z[...] 
+ array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], + [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], + [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], + [30, 31, 32, 33, 34, 35, 36, 37, 38, 39], + [40, 41, 42, 43, 44, 45, 46, 47, 48, 49], + [50, 51, 52, 53, 54, 55, 56, 57, 58, 59], + [60, 61, 62, 63, 64, 65, 66, 67, 68, 69], + [70, 71, 72, 73, 74, 75, 76, 77, 78, 79], + [80, 81, 82, 83, 84, 85, 86, 87, 88, 89], + [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]) + + For arrays with a structured dtype, specific fields can be retrieved, e.g.:: + + >>> a = np.array([(b'aaa', 1, 4.2), + ... (b'bbb', 2, 8.4), + ... (b'ccc', 3, 12.6)], + ... dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + >>> z = zarr.array(a) + >>> z['foo'] + array([b'aaa', b'bbb', b'ccc'], + dtype='|S3') + + Notes + ----- + Slices with step > 1 are supported, but slices with negative step are not. + + Currently the implementation for __getitem__ is provided by :func:`get_basic_selection`. + For advanced ("fancy") indexing, see the methods listed under See Also. + + See Also + -------- + get_basic_selection, set_basic_selection, get_mask_selection, set_mask_selection, + get_coordinate_selection, set_coordinate_selection, get_orthogonal_selection, + set_orthogonal_selection, vindex, oindex, __setitem__ + + """ + + fields, selection = pop_fields(selection) + return self.get_basic_selection(selection, fields=fields) + + def get_basic_selection(self, selection=Ellipsis, out=None, fields=None): + """Retrieve data for an item or region of the array. + + Parameters + ---------- + selection : tuple + A tuple specifying the requested item or region for each dimension of the array. May + be any combination of int and/or slice for multidimensional arrays. + out : ndarray, optional + If given, load the selected data directly into this array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to extract + data for. 
+ + Returns + ------- + out : ndarray + A NumPy array containing the data for the requested region. + + Examples + -------- + Setup a 1-dimensional array:: + >>> import zarr >>> import numpy as np - >>> z = zarr.array(np.arange(100000000).reshape(10000, 10000), - ... chunks=(1000, 1000), dtype='i4') - >>> z - + >>> z = zarr.array(np.arange(100)) - Take some slices:: + Retrieve a single item:: - >>> z[2, 2] - 20002 - >>> z[:2, :2] - array([[ 0, 1], - [10000, 10001]], dtype=int32) - >>> z[:2] - array([[ 0, 1, 2, ..., 9997, 9998, 9999], - [10000, 10001, 10002, ..., 19997, 19998, 19999]], dtype=int32) - >>> z[:, :2] - array([[ 0, 1], - [ 10000, 10001], - [ 20000, 20001], - ..., - [99970000, 99970001], - [99980000, 99980001], - [99990000, 99990001]], dtype=int32) - >>> z[:] - array([[ 0, 1, 2, ..., 9997, 9998, 9999], - [ 10000, 10001, 10002, ..., 19997, 19998, 19999], - [ 20000, 20001, 20002, ..., 29997, 29998, 29999], - ..., - [99970000, 99970001, 99970002, ..., 99979997, 99979998, 99979999], - [99980000, 99980001, 99980002, ..., 99989997, 99989998, 99989999], - [99990000, 99990001, 99990002, ..., 99999997, 99999998, 99999999]], dtype=int32) + >>> z.get_basic_selection(5) + 5 + + Retrieve a region via slicing:: + + >>> z.get_basic_selection(slice(5)) + array([0, 1, 2, 3, 4]) + >>> z.get_basic_selection(slice(-5, None)) + array([95, 96, 97, 98, 99]) + >>> z.get_basic_selection(slice(5, 10)) + array([5, 6, 7, 8, 9]) + >>> z.get_basic_selection(slice(5, 10, 2)) + array([5, 7, 9]) + >>> z.get_basic_selection(slice(None, None, 2)) + array([ 0, 2, 4, ..., 94, 96, 98]) + + Setup a 2-dimensional array:: + + >>> z = zarr.array(np.arange(100).reshape(10, 10)) + + Retrieve an item:: + + >>> z.get_basic_selection((2, 2)) + 22 + + Retrieve a region via slicing:: + + >>> z.get_basic_selection((slice(1, 3), slice(1, 3))) + array([[11, 12], + [21, 22]]) + >>> z.get_basic_selection((slice(1, 3), slice(None))) + array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19], + [20, 21, 22, 23, 24, 25, 
26, 27, 28, 29]]) + >>> z.get_basic_selection((slice(None), slice(1, 3))) + array([[ 1, 2], + [11, 12], + [21, 22], + [31, 32], + [41, 42], + [51, 52], + [61, 62], + [71, 72], + [81, 82], + [91, 92]]) + >>> z.get_basic_selection((slice(0, 5, 2), slice(0, 5, 2))) + array([[ 0, 2, 4], + [20, 22, 24], + [40, 42, 44]]) + >>> z.get_basic_selection((slice(None, None, 2), slice(None, None, 2))) + array([[ 0, 2, 4, 6, 8], + [20, 22, 24, 26, 28], + [40, 42, 44, 46, 48], + [60, 62, 64, 66, 68], + [80, 82, 84, 86, 88]]) + + For arrays with a structured dtype, specific fields can be retrieved, e.g.:: + + >>> a = np.array([(b'aaa', 1, 4.2), + ... (b'bbb', 2, 8.4), + ... (b'ccc', 3, 12.6)], + ... dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + >>> z = zarr.array(a) + >>> z.get_basic_selection(slice(2), fields='foo') + array([b'aaa', b'bbb'], + dtype='|S3') + + Notes + ----- + Slices with step > 1 are supported, but slices with negative step are not. + + Currently this method provides the implementation for accessing data via the square + bracket notation (__getitem__). See :func:`__getitem__` for examples using the + alternative notation. 
+ + See Also + -------- + set_basic_selection, get_mask_selection, set_mask_selection, + get_coordinate_selection, set_coordinate_selection, get_orthogonal_selection, + set_orthogonal_selection, vindex, oindex, __getitem__, __setitem__ """ @@ -454,18 +652,22 @@ def __getitem__(self, item): if not self._cache_metadata: self._load_metadata() + # check args + check_fields(fields, self._dtype) + # handle zero-dimensional arrays if self._shape == (): - return self._getitem_zd(item) + return self._get_basic_selection_zd(selection=selection, out=out, fields=fields) else: - return self._getitem_nd(item) + return self._get_basic_selection_nd(selection=selection, out=out, fields=fields) - def _getitem_zd(self, item): - # special case __getitem__ for zero-dimensional array + def _get_basic_selection_zd(self, selection, out=None, fields=None): + # special case basic selection for zero-dimensional array - # check item is valid - if item not in ((), Ellipsis): - raise IndexError('too many indices for array') + # check selection is valid + selection = ensure_tuple(selection) + if selection not in ((), (Ellipsis,)): + err_too_many_indices(selection, ()) try: # obtain encoded data for chunk @@ -474,126 +676,484 @@ def _getitem_zd(self, item): except KeyError: # chunk not initialized - out = np.empty((), dtype=self._dtype) + chunk = np.zeros((), dtype=self._dtype) if self._fill_value is not None: - out.fill(self._fill_value) + chunk.fill(self._fill_value) else: - out = self._decode_chunk(cdata) + chunk = self._decode_chunk(cdata) + + # handle fields + if fields: + chunk = chunk[fields] # handle selection of the scalar value via empty tuple - out = out[item] + if out is None: + out = chunk[selection] + else: + out[selection] = chunk[selection] return out - def _getitem_nd(self, item): - # implementation of __getitem__ for array with at least one dimension + def _get_basic_selection_nd(self, selection, out=None, fields=None): + # implementation of basic selection for array with at 
least one dimension - # normalize selection - selection = normalize_array_selection(item, self._shape) + # setup indexer + indexer = BasicIndexer(selection, self) - # determine output array shape - out_shape = tuple(s.stop - s.start for s in selection - if isinstance(s, slice)) + return self._get_selection(indexer=indexer, out=out, fields=fields) - # setup output array - out = np.empty(out_shape, dtype=self._dtype, order=self._order) + def get_orthogonal_selection(self, selection, out=None, fields=None): + """Retrieve data by making a selection for each dimension of the array. For example, + if an array has 2 dimensions, allows selecting specific rows and/or columns. The + selection for each dimension can be either an integer (indexing a single item), a slice, + an array of integers, or a Boolean array where True values indicate a selection. - # determine indices of chunks overlapping the selection - chunk_range = get_chunk_range(selection, self._chunks) + Parameters + ---------- + selection : tuple + A selection for each dimension of the array. May be any combination of int, slice, + integer array or Boolean array. + out : ndarray, optional + If given, load the selected data directly into this array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to extract + data for. 
- # iterate over chunks in range - for cidx in itertools.product(*chunk_range): - - # determine chunk offset - offset = [i * c for i, c in zip(cidx, self._chunks)] - - # determine region within output array - out_selection = tuple( - slice(max(0, o - s.start), - min(o + c - s.start, s.stop - s.start)) - for s, o, c, in zip(selection, offset, self._chunks) - if isinstance(s, slice) - ) - - # determine region within chunk - chunk_selection = tuple( - slice(max(0, s.start - o), min(c, s.stop - o)) - if isinstance(s, slice) - else s - o - for s, o, c in zip(selection, offset, self._chunks) - ) - - # obtain the destination array as a view of the output array - if out_selection: - dest = out[out_selection] - else: - dest = out + Returns + ------- + out : ndarray + A NumPy array containing the data for the requested selection. + + Examples + -------- + Setup a 2-dimensional array:: + + >>> import zarr + >>> import numpy as np + >>> z = zarr.array(np.arange(100).reshape(10, 10)) + + Retrieve rows and columns via any combination of int, slice, integer array and/or Boolean + array:: + + >>> z.get_orthogonal_selection(([1, 4], slice(None))) + array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19], + [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]) + >>> z.get_orthogonal_selection((slice(None), [1, 4])) + array([[ 1, 4], + [11, 14], + [21, 24], + [31, 34], + [41, 44], + [51, 54], + [61, 64], + [71, 74], + [81, 84], + [91, 94]]) + >>> z.get_orthogonal_selection(([1, 4], [1, 4])) + array([[11, 14], + [41, 44]]) + >>> sel = np.zeros(z.shape[0], dtype=bool) + >>> sel[1] = True + >>> sel[4] = True + >>> z.get_orthogonal_selection((sel, sel)) + array([[11, 14], + [41, 44]]) + + For convenience, the orthogonal selection functionality is also available via the + `oindex` property, e.g.:: + + >>> z.oindex[[1, 4], :] + array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19], + [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]) + >>> z.oindex[:, [1, 4]] + array([[ 1, 4], + [11, 14], + [21, 24], + [31, 34], + [41, 
44], + [51, 54], + [61, 64], + [71, 74], + [81, 84], + [91, 94]]) + >>> z.oindex[[1, 4], [1, 4]] + array([[11, 14], + [41, 44]]) + >>> sel = np.zeros(z.shape[0], dtype=bool) + >>> sel[1] = True + >>> sel[4] = True + >>> z.oindex[sel, sel] + array([[11, 14], + [41, 44]]) + + Notes + ----- + Orthogonal indexing is also known as outer indexing. + + Slices with step > 1 are supported, but slices with negative step are not. + + See Also + -------- + get_basic_selection, set_basic_selection, get_mask_selection, set_mask_selection, + get_coordinate_selection, set_coordinate_selection, set_orthogonal_selection, vindex, + oindex, __getitem__, __setitem__ + + """ + + # refresh metadata + if not self._cache_metadata: + self._load_metadata() + + # check args + check_fields(fields, self._dtype) + + # setup indexer + indexer = OrthogonalIndexer(selection, self) + + return self._get_selection(indexer=indexer, out=out, fields=fields) + + def get_coordinate_selection(self, selection, out=None, fields=None): + """Retrieve a selection of individual items, by providing the indices (coordinates) for + each selected item. + + Parameters + ---------- + selection : tuple + An integer (coordinate) array for each dimension of the array. + out : ndarray, optional + If given, load the selected data directly into this array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to extract + data for. + + Returns + ------- + out : ndarray + A NumPy array containing the data for the requested selection. 
+ + Examples + -------- + Setup a 2-dimensional array:: + + >>> import zarr + >>> import numpy as np + >>> z = zarr.array(np.arange(100).reshape(10, 10)) + + Retrieve items by specifying their coordinates:: + + >>> z.get_coordinate_selection(([1, 4], [1, 4])) + array([11, 44]) + + For convenience, the coordinate selection functionality is also available via the + `vindex` property, e.g.:: + + >>> z.vindex[[1, 4], [1, 4]] + array([11, 44]) + + Notes + ----- + Coordinate indexing is also known as point selection, and is a form of vectorized or inner + indexing. + + Slices are not supported. Coordinate arrays must be provided for all dimensions of the + array. + + Coordinate arrays may be multidimensional, in which case the output array will also be + multidimensional. Coordinate arrays are broadcast against each other before being + applied. The shape of the output will be the same as the shape of each coordinate array + after broadcasting. + + See Also + -------- + get_basic_selection, set_basic_selection, get_mask_selection, set_mask_selection, + get_orthogonal_selection, set_orthogonal_selection, set_coordinate_selection, vindex, + oindex, __getitem__, __setitem__ + + """ + + # refresh metadata + if not self._cache_metadata: + self._load_metadata() + + # check args + check_fields(fields, self._dtype) + + # setup indexer + indexer = CoordinateIndexer(selection, self) + + # handle output - need to flatten + if out is not None: + out = out.reshape(-1) + + out = self._get_selection(indexer=indexer, out=out, fields=fields) + + # restore shape + out = out.reshape(indexer.sel_shape) + + return out + + def get_mask_selection(self, selection, out=None, fields=None): + """Retrieve a selection of individual items, by providing a Boolean array of the same + shape as the array against which the selection is being made, where True values indicate + a selected item. 
+ + Parameters + ---------- + selection : ndarray, bool + A Boolean array of the same shape as the array against which the selection is being + made. + out : ndarray, optional + If given, load the selected data directly into this array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to extract + data for. + + Returns + ------- + out : ndarray + A NumPy array containing the data for the requested selection. + + Examples + -------- + Setup a 2-dimensional array:: + + >>> import zarr + >>> import numpy as np + >>> z = zarr.array(np.arange(100).reshape(10, 10)) + + Retrieve items by specifying a mask:: + + >>> sel = np.zeros_like(z, dtype=bool) + >>> sel[1, 1] = True + >>> sel[4, 4] = True + >>> z.get_mask_selection(sel) + array([11, 44]) + + For convenience, the mask selection functionality is also available via the + `vindex` property, e.g.:: + + >>> z.vindex[sel] + array([11, 44]) + + Notes + ----- + Mask indexing is a form of vectorized or inner indexing, and is equivalent to coordinate + indexing. Internally the mask array is converted to coordinate arrays by calling + `np.nonzero`. + + See Also + -------- + get_basic_selection, set_basic_selection, set_mask_selection, get_orthogonal_selection, + set_orthogonal_selection, get_coordinate_selection, set_coordinate_selection, vindex, + oindex, __getitem__, __setitem__ + + """ + + # refresh metadata + if not self._cache_metadata: + self._load_metadata() + + # check args + check_fields(fields, self._dtype) + + # setup indexer + indexer = MaskIndexer(selection, self) + + return self._get_selection(indexer=indexer, out=out, fields=fields) + + def _get_selection(self, indexer, out=None, fields=None): + + # We iterate over all chunks which overlap the selection and thus contain data that needs + # to be extracted. Each chunk is processed in turn, extracting the necessary data and + # storing into the correct location in the output array.
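The docstring above notes that mask indexing is equivalent to coordinate indexing because the mask is converted to coordinate arrays with `np.nonzero`. The equivalence can be seen with plain NumPy, independent of Zarr (a sketch, not Zarr's internal code):

```python
import numpy as np

# Mask selection is equivalent to coordinate selection: the Boolean mask
# is converted to coordinate arrays with np.nonzero.
a = np.arange(100).reshape(10, 10)
mask = np.zeros(a.shape, dtype=bool)
mask[1, 1] = True
mask[4, 4] = True

coords = np.nonzero(mask)   # (array([1, 4]), array([1, 4]))
assert np.array_equal(a[mask], a[coords])
print(a[coords])  # [11 44]
```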
+ + # N.B., it is an important optimisation that we only visit chunks which overlap the + # selection. This minimises the number of iterations in the main for loop. + + # check fields are sensible + out_dtype = check_fields(fields, self._dtype) + + # determine output shape + out_shape = indexer.shape + + # setup output array + if out is None: + out = np.empty(out_shape, dtype=out_dtype, order=self._order) + else: + check_array_shape('out', out, out_shape) + + # iterate over chunks + for chunk_coords, chunk_selection, out_selection in indexer: # load chunk selection into output array - self._chunk_getitem(cidx, chunk_selection, dest) + self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection, + drop_axes=indexer.drop_axes, fields=fields) if out.shape: return out else: return out[()] - def __setitem__(self, item, value): - """Modify data for some portion of the array. + def __setitem__(self, selection, value): + """Modify data for an item or region of the array. + + Parameters + ---------- + selection : tuple + An integer index or slice or tuple of int/slice specifying the requested region for + each dimension of the array. + value : scalar or array-like + Value to be stored into the array. Examples -------- - Setup a 1-dimensional array:: >>> import zarr - >>> z = zarr.zeros(100000000, chunks=1000000, dtype='i4') - >>> z - + >>> z = zarr.zeros(100, dtype=int) Set all array elements to the same scalar value:: - >>> z[:] = 42 - >>> z[:] - array([42, 42, 42, ..., 42, 42, 42], dtype=int32) + >>> z[...] = 42 + >>> z[...] + array([42, 42, 42, ..., 42, 42, 42]) Set a portion of the array:: - >>> z[:100] = np.arange(100) - >>> z[-100:] = np.arange(100)[::-1] - >>> z[:] - array([0, 1, 2, ..., 2, 1, 0], dtype=int32) + >>> z[:10] = np.arange(10) + >>> z[-10:] = np.arange(10)[::-1] + >>> z[...] 
+ array([ 0, 1, 2, ..., 2, 1, 0]) Setup a 2-dimensional array:: - >>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4') - >>> z - + >>> z = zarr.zeros((5, 5), dtype=int) Set all array elements to the same scalar value:: - >>> z[:] = 42 - >>> z[:] - array([[42, 42, 42, ..., 42, 42, 42], - [42, 42, 42, ..., 42, 42, 42], - [42, 42, 42, ..., 42, 42, 42], - ..., - [42, 42, 42, ..., 42, 42, 42], - [42, 42, 42, ..., 42, 42, 42], - [42, 42, 42, ..., 42, 42, 42]], dtype=int32) + >>> z[...] = 42 Set a portion of the array:: >>> z[0, :] = np.arange(z.shape[1]) >>> z[:, 0] = np.arange(z.shape[0]) + >>> z[...] + array([[ 0, 1, 2, 3, 4], + [ 1, 42, 42, 42, 42], + [ 2, 42, 42, 42, 42], + [ 3, 42, 42, 42, 42], + [ 4, 42, 42, 42, 42]]) + + For arrays with a structured dtype, specific fields can be modified, e.g.:: + + >>> a = np.array([(b'aaa', 1, 4.2), + ... (b'bbb', 2, 8.4), + ... (b'ccc', 3, 12.6)], + ... dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + >>> z = zarr.array(a) + >>> z['foo'] = b'zzz' + >>> z[...] + array([(b'zzz', 1, 4.2), (b'zzz', 2, 8.4), (b'zzz', 3, 12.6)], + dtype=[('foo', 'S3'), ('bar', '<i4'), ('baz', '<f8')]) + + Notes + ----- + Slices with step > 1 are supported, but slices with negative step are not. + + Currently the implementation for __setitem__ is provided by :func:`set_basic_selection`, + which means that only integers and slices are supported within the selection. For + advanced ("fancy") indexing, see the methods listed under See Also. + + See Also + -------- + get_basic_selection, set_basic_selection, get_mask_selection, set_mask_selection, + get_coordinate_selection, set_coordinate_selection, get_orthogonal_selection, + set_orthogonal_selection, vindex, oindex, __getitem__ + + """ + + fields, selection = pop_fields(selection) + self.set_basic_selection(selection, value, fields=fields) + + def set_basic_selection(self, selection, value, fields=None): + """Modify data for an item or region of the array.
+ + Parameters + ---------- + selection : tuple + An integer index or slice or tuple of int/slice specifying the requested region for + each dimension of the array. + value : scalar or array-like + Value to be stored into the array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to set + data for. + + Examples + -------- + Setup a 1-dimensional array:: + + >>> import zarr + >>> import numpy as np + >>> z = zarr.zeros(100, dtype=int) + + Set all array elements to the same scalar value:: + + >>> z.set_basic_selection(..., 42) + >>> z[...] + array([42, 42, 42, ..., 42, 42, 42]) + + Set a portion of the array:: + + >>> z.set_basic_selection(slice(10), np.arange(10)) + >>> z.set_basic_selection(slice(-10, None), np.arange(10)[::-1]) + >>> z[...] + array([ 0, 1, 2, ..., 2, 1, 0]) + + Setup a 2-dimensional array:: + + >>> z = zarr.zeros((5, 5), dtype=int) + + Set all array elements to the same scalar value:: + + >>> z.set_basic_selection(..., 42) + + Set a portion of the array:: + + >>> z.set_basic_selection((0, slice(None)), np.arange(z.shape[1])) + >>> z.set_basic_selection((slice(None), 0), np.arange(z.shape[0])) + >>> z[...] + array([[ 0, 1, 2, 3, 4], + [ 1, 42, 42, 42, 42], + [ 2, 42, 42, 42, 42], + [ 3, 42, 42, 42, 42], + [ 4, 42, 42, 42, 42]]) + + For arrays with a structured dtype, the `fields` parameter can be used to set data for + a specific field, e.g.:: + + >>> a = np.array([(b'aaa', 1, 4.2), + ... (b'bbb', 2, 8.4), + ... (b'ccc', 3, 12.6)], + ... 
dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + >>> z = zarr.array(a) + >>> z.set_basic_selection(slice(0, 2), b'zzz', fields='foo') >>> z[:] - array([[ 0, 1, 2, ..., 9997, 9998, 9999], - [ 1, 42, 42, ..., 42, 42, 42], - [ 2, 42, 42, ..., 42, 42, 42], - ..., - [9997, 42, 42, ..., 42, 42, 42], - [9998, 42, 42, ..., 42, 42, 42], - [9999, 42, 42, ..., 42, 42, 42]], dtype=int32) + array([(b'zzz', 1, 4.2), (b'zzz', 2, 8.4), (b'ccc', 3, 12.6)], + dtype=[('foo', 'S3'), ('bar', '<i4'), ('baz', '<f8')]) + + Notes + ----- + Slices with step > 1 are supported, but slices with negative step are not. + + Currently this method provides the implementation for modifying data via the square + bracket notation (__setitem__). See :func:`__setitem__` for examples using the + alternative notation. + + See Also + -------- + get_basic_selection, get_mask_selection, set_mask_selection, + get_coordinate_selection, set_coordinate_selection, get_orthogonal_selection, + set_orthogonal_selection, vindex, oindex, __getitem__, __setitem__ + + """ + + # guard conditions + if self._read_only: + err_read_only() + + # refresh metadata + if not self._cache_metadata: + self._load_metadata_nosync() + + # handle zero-dimensional arrays + if self._shape == (): + return self._set_basic_selection_zd(selection, value, fields=fields) + else: + return self._set_basic_selection_nd(selection, value, fields=fields) + + def set_orthogonal_selection(self, selection, value, fields=None): + """Modify data via a selection for each dimension of the array. + + Parameters + ---------- + selection : tuple + A selection for each dimension of the array. May be any combination of int, slice, + integer array or Boolean array. + value : scalar or array-like + Value to be stored into the array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to set + data for. + + Examples + -------- + Setup a 2-dimensional array:: + + >>> import zarr + >>> import numpy as np + >>> z = zarr.zeros((5, 5), dtype=int) + + Set data for a selection of rows:: + + >>> z.set_orthogonal_selection(([1, 4], slice(None)), 1) + >>> z[...] + array([[0, 0, 0, 0, 0], + [1, 1, 1, 1, 1], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 0], + [1, 1, 1, 1, 1]]) + + Set data for a selection of columns:: + + >>> z.set_orthogonal_selection((slice(None), [1, 4]), 2) + >>> z[...] + array([[0, 2, 0, 0, 2], + [1, 2, 1, 1, 2], + [0, 2, 0, 0, 2], + [0, 2, 0, 0, 2], + [1, 2, 1, 1, 2]]) + + Set data for a selection of rows and columns:: + + >>> z.set_orthogonal_selection(([1, 4], [1, 4]), 3) + >>> z[...] + array([[0, 2, 0, 0, 2], + [1, 3, 1, 1, 3], + [0, 2, 0, 0, 2], + [0, 2, 0, 0, 2], + [1, 3, 1, 1, 3]]) + + For convenience, this functionality is also available via the `oindex` property. E.g.:: + + >>> z.oindex[[1, 4], [1, 4]] = 4 + >>> z[...] + array([[0, 2, 0, 0, 2], + [1, 4, 1, 1, 4], + [0, 2, 0, 0, 2], + [0, 2, 0, 0, 2], + [1, 4, 1, 1, 4]]) + + Notes + ----- + Orthogonal indexing is also known as outer indexing. - # check item is valid - if item not in ((), Ellipsis): - raise IndexError('too many indices for array') + Slices with step > 1 are supported, but slices with negative step are not.
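The difference between orthogonal (outer) and coordinate (vectorized) selection semantics can be illustrated with plain NumPy, where `np.ix_` plays the role of `oindex` (a sketch, not Zarr's implementation):

```python
import numpy as np

a = np.arange(100).reshape(10, 10)

# Orthogonal (outer) indexing selects every combination of the row and
# column indices, like z.oindex[[1, 4], [1, 4]].
outer = a[np.ix_([1, 4], [1, 4])]
print(outer)
# [[11 14]
#  [41 44]]

# Coordinate (vectorized) indexing pairs the indices element-wise, like
# z.vindex[[1, 4], [1, 4]].
pairs = a[[1, 4], [1, 4]]
print(pairs)  # [11 44]
```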
- # setup data to store - arr = np.asarray(value, dtype=self._dtype) + See Also + -------- + get_basic_selection, set_basic_selection, get_mask_selection, set_mask_selection, + get_coordinate_selection, set_coordinate_selection, get_orthogonal_selection, + vindex, oindex, __getitem__, __setitem__ - # check value - if arr.shape != (): - raise ValueError('bad value; expected scalar, found %r' % value) + """ - # obtain key for chunk storage + # guard conditions + if self._read_only: + err_read_only() + + # refresh metadata + if not self._cache_metadata: + self._load_metadata_nosync() + + # setup indexer + indexer = OrthogonalIndexer(selection, self) + + self._set_selection(indexer, value, fields=fields) + + def set_coordinate_selection(self, selection, value, fields=None): + """Modify a selection of individual items, by providing the indices (coordinates) for + each item to be modified. + + Parameters + ---------- + selection : tuple + An integer (coordinate) array for each dimension of the array. + value : scalar or array-like + Value to be stored into the array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to set + data for. + + Examples + -------- + Setup a 2-dimensional array:: + + >>> import zarr + >>> import numpy as np + >>> z = zarr.zeros((5, 5), dtype=int) + + Set data for a selection of items:: + + >>> z.set_coordinate_selection(([1, 4], [1, 4]), 1) + >>> z[...] + array([[0, 0, 0, 0, 0], + [0, 1, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 1]]) + + For convenience, this functionality is also available via the `vindex` property. E.g.:: + + >>> z.vindex[[1, 4], [1, 4]] = 2 + >>> z[...] + array([[0, 0, 0, 0, 0], + [0, 2, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 2]]) + + Notes + ----- + Coordinate indexing is also known as point selection, and is a form of vectorized or inner + indexing. + + Slices are not supported. 
Coordinate arrays must be provided for all dimensions of the + array. + + See Also + -------- + get_basic_selection, set_basic_selection, get_mask_selection, set_mask_selection, + get_orthogonal_selection, set_orthogonal_selection, get_coordinate_selection, vindex, + oindex, __getitem__, __setitem__ + + """ + + # guard conditions + if self._read_only: + err_read_only() + + # refresh metadata + if not self._cache_metadata: + self._load_metadata_nosync() + + # setup indexer + indexer = CoordinateIndexer(selection, self) + + # handle value - need to flatten + if not is_scalar(value, self._dtype): + value = np.asanyarray(value) + if hasattr(value, 'shape') and len(value.shape) > 1: + value = value.reshape(-1) + + self._set_selection(indexer, value, fields=fields) + + def set_mask_selection(self, selection, value, fields=None): + """Modify a selection of individual items, by providing a Boolean array of the same + shape as the array against which the selection is being made, where True values indicate + a selected item. + + Parameters + ---------- + selection : ndarray, bool + A Boolean array of the same shape as the array against which the selection is being + made. + value : scalar or array-like + Value to be stored into the array. + fields : str or sequence of str, optional + For arrays with a structured dtype, one or more fields can be specified to set + data for. + + Examples + -------- + Setup a 2-dimensional array:: + + >>> import zarr + >>> import numpy as np + >>> z = zarr.zeros((5, 5), dtype=int) + + Set data for a selection of items:: + + >>> sel = np.zeros_like(z, dtype=bool) + >>> sel[1, 1] = True + >>> sel[4, 4] = True + >>> z.set_mask_selection(sel, 1) + >>> z[...] + array([[0, 0, 0, 0, 0], + [0, 1, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 1]]) + + For convenience, this functionality is also available via the `vindex` property. E.g.:: + + >>> z.vindex[sel] = 2 + >>> z[...] 
+ array([[0, 0, 0, 0, 0], + [0, 2, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 0], + [0, 0, 0, 0, 2]]) + + Notes + ----- + Mask indexing is a form of vectorized or inner indexing, and is equivalent to coordinate + indexing. Internally the mask array is converted to coordinate arrays by calling + `np.nonzero`. + + See Also + -------- + get_basic_selection, set_basic_selection, get_mask_selection, get_orthogonal_selection, + set_orthogonal_selection, get_coordinate_selection, set_coordinate_selection, vindex, + oindex, __getitem__, __setitem__ + + """ + + # guard conditions + if self._read_only: + err_read_only() + + # refresh metadata + if not self._cache_metadata: + self._load_metadata_nosync() + + # setup indexer + indexer = MaskIndexer(selection, self) + + self._set_selection(indexer, value, fields=fields) + + def _set_basic_selection_zd(self, selection, value, fields=None): + # special case __setitem__ for zero-dimensional array + + # check selection is valid + selection = ensure_tuple(selection) + if selection not in ((), (Ellipsis,)): + err_too_many_indices(selection, self._shape) + + # check fields + check_fields(fields, self._dtype) + fields = check_no_multi_fields(fields) + + # obtain key for chunk ckey = self._chunk_key((0,)) + # setup chunk + try: + # obtain compressed data for chunk + cdata = self.chunk_store[ckey] + + except KeyError: + # chunk not initialized + chunk = np.zeros((), dtype=self._dtype) + if self._fill_value is not None: + chunk.fill(self._fill_value) + + else: + # decode chunk + chunk = self._decode_chunk(cdata).copy() + + # set value + if fields: + chunk[fields][selection] = value + else: + chunk[selection] = value + # encode and store - cdata = self._encode_chunk(arr) + cdata = self._encode_chunk(chunk) self.chunk_store[ckey] = cdata - def _setitem_nd(self, item, value): + def _set_basic_selection_nd(self, selection, value, fields=None): # implementation of __setitem__ for array with at least one dimension - # normalize selection - 
selection = normalize_array_selection(item, self._shape) + # setup indexer + indexer = BasicIndexer(selection, self) - # check value shape - expected_shape = tuple( - s.stop - s.start for s in selection - if isinstance(s, slice) - ) - if np.isscalar(value): - pass - elif expected_shape != value.shape: - raise ValueError('value has wrong shape; expected %s, found %s' - % (str(expected_shape), - str(value.shape))) - - # determine indices of chunks overlapping the selection - chunk_range = get_chunk_range(selection, self._chunks) + self._set_selection(indexer, value, fields=fields) - # iterate over chunks in range - for cidx in itertools.product(*chunk_range): + def _set_selection(self, indexer, value, fields=None): - # determine chunk offset - offset = [i * c for i, c in zip(cidx, self._chunks)] + # We iterate over all chunks which overlap the selection and thus contain data that needs + # to be replaced. Each chunk is processed in turn, extracting the necessary data from the + # value array and storing into the chunk array. - # determine required index range within chunk - chunk_selection = tuple( - slice(max(0, s.start - o), min(c, s.stop - o)) - if isinstance(s, slice) - else s - o - for s, o, c in zip(selection, offset, self._chunks) - ) + # N.B., it is an important optimisation that we only visit chunks which overlap the + # selection. This minimises the number of iterations in the main for loop.
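The chunk-overlap optimisation described in the comment above can be sketched for the simple case of one slice per dimension. `overlapping_chunks` is a hypothetical helper for illustration only; the actual work is done by the indexer classes in `zarr.indexing`:

```python
import itertools
import math

def overlapping_chunks(selection, chunks):
    # For one slice per dimension (with start/stop filled in), compute the
    # coordinates of the chunks that overlap the selection. Hypothetical
    # helper, not the real BasicIndexer logic.
    ranges = []
    for sel, csize in zip(selection, chunks):
        ranges.append(range(sel.start // csize, math.ceil(sel.stop / csize)))
    return list(itertools.product(*ranges))

# A selection of [25:35, 0:10] over 10x10 chunks touches only two chunks,
# so only those two are visited in the main loop.
print(overlapping_chunks((slice(25, 35), slice(0, 10)), (10, 10)))
# [(2, 0), (3, 0)]
```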
- if np.isscalar(value): + # check fields are sensible + check_fields(fields, self._dtype) + fields = check_no_multi_fields(fields) - # put data - self._chunk_setitem(cidx, chunk_selection, value) + # determine shape of the selection + sel_shape = indexer.shape - else: - # assume value is array-like - - # determine index within value - value_selection = tuple( - slice(max(0, o - s.start), - min(o + c - s.start, s.stop - s.start)) - for s, o, c in zip(selection, offset, self._chunks) - if isinstance(s, slice) - ) + # check value shape + if is_scalar(value, self._dtype): + pass + else: + if not hasattr(value, 'shape'): + value = np.asanyarray(value) + check_array_shape('value', value, sel_shape) - # put data - self._chunk_setitem(cidx, chunk_selection, value[value_selection]) - def _chunk_getitem(self, cidx, item, dest): + # iterate over chunks + for chunk_coords, chunk_selection, out_selection in indexer: - # determine chunk offset - offset = [i * c for i, c in zip(cidx, self._chunks)] + # extract data to store + if is_scalar(value, self._dtype): + chunk_value = value + else: + chunk_value = value[out_selection] + # handle missing singleton dimensions + if indexer.drop_axes: + item = [slice(None)] * self.ndim + for a in indexer.drop_axes: + item[a] = np.newaxis + chunk_value = chunk_value[item] + + # put data + self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields) + + def _chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection, drop_axes=None, + fields=None): """Obtain part or whole of a chunk. Parameters ---------- - cidx : tuple of ints + chunk_coords : tuple of ints Indices of the chunk. - item : tuple of slices - Location of region within the chunk. - dest : ndarray - Numpy array to store result in. + chunk_selection : selection + Location of region within the chunk to extract. + out : ndarray + Array to store result in. + out_selection : selection + Location of region within output array to store results in. + drop_axes : tuple of ints + Axes to squeeze out of the chunk.
+ fields + TODO """ - try: + assert len(chunk_coords) == len(self._cdata_shape) + + # obtain key for chunk + ckey = self._chunk_key(chunk_coords) + try: # obtain compressed data for chunk - ckey = self._chunk_key(cidx) cdata = self.chunk_store[ckey] except KeyError: - # chunk not initialized if self._fill_value is not None: - dest.fill(self._fill_value) + out[out_selection] = self._fill_value else: - if is_total_slice(item, self._chunks) and \ - not self._filters and \ - ((self._order == 'C' and dest.flags.c_contiguous) or - (self._order == 'F' and dest.flags.f_contiguous)): + if (isinstance(out, np.ndarray) and + not fields and + is_contiguous_selection(out_selection) and + is_total_slice(chunk_selection, self._chunks) and + not self._filters): - # optimization: we want the whole chunk, and the destination is - # contiguous, so we can decompress directly from the chunk - # into the destination array + dest = out[out_selection] + write_direct = ( + dest.flags.writeable and ( + (self._order == 'C' and dest.flags.c_contiguous) or + (self._order == 'F' and dest.flags.f_contiguous) + ) + ) - if self._compressor: - self._compressor.decode(cdata, dest) - else: - arr = np.frombuffer(cdata, dtype=self._dtype) - arr = arr.reshape(self._chunks, order=self._order) - np.copyto(dest, arr) + if write_direct: - else: + # optimization: we want the whole chunk, and the destination is + # contiguous, so we can decompress directly from the chunk + # into the destination array - # decode chunk - chunk = self._decode_chunk(cdata) + if self._compressor: + self._compressor.decode(cdata, dest) + else: + chunk = np.frombuffer(cdata, dtype=self._dtype) + chunk = chunk.reshape(self._chunks, order=self._order) + np.copyto(dest, chunk) + return - # set data in output array - # (split into two lines for profiling) - tmp = chunk[item] - if dest.shape: - dest[:] = tmp - else: - dest[()] = tmp + # decode chunk + chunk = self._decode_chunk(cdata) + + # select data from chunk + if fields: + chunk = 
chunk[fields] + tmp = chunk[chunk_selection] + if drop_axes: + tmp = np.squeeze(tmp, axis=drop_axes) + + # store selected data in output + out[out_selection] = tmp - def _chunk_setitem(self, cidx, item, value): + def _chunk_setitem(self, chunk_coords, chunk_selection, value, fields=None): """Replace part or whole of a chunk. Parameters ---------- - cidx : tuple of ints + chunk_coords : tuple of ints Indices of the chunk. - item : tuple of slices + chunk_selection : tuple of slices Location of region within the chunk. value : scalar or ndarray Value to set. @@ -759,25 +1595,25 @@ def _chunk_setitem(self, cidx, item, value): # synchronization if self._synchronizer is None: - self._chunk_setitem_nosync(cidx, item, value) + self._chunk_setitem_nosync(chunk_coords, chunk_selection, value, fields=fields) else: # synchronize on the chunk - ckey = self._chunk_key(cidx) + ckey = self._chunk_key(chunk_coords) with self._synchronizer[ckey]: - self._chunk_setitem_nosync(cidx, item, value) + self._chunk_setitem_nosync(chunk_coords, chunk_selection, value, fields=fields) - def _chunk_setitem_nosync(self, cidx, item, value): + def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=None): # obtain key for chunk storage - ckey = self._chunk_key(cidx) + ckey = self._chunk_key(chunk_coords) - if is_total_slice(item, self._chunks): + if is_total_slice(chunk_selection, self._chunks) and not fields: # totally replace chunk # optimization: we are completely replacing the chunk, so no need # to access the existing chunk data - if np.isscalar(value): + if is_scalar(value, self._dtype): # setup array filled with value chunk = np.empty(self._chunks, dtype=self._dtype, order=self._order) @@ -812,9 +1648,13 @@ def _chunk_setitem_nosync(self, cidx, item, value): except KeyError: # chunk not initialized - chunk = np.empty(self._chunks, dtype=self._dtype, order=self._order) if self._fill_value is not None: + chunk = np.empty(self._chunks, dtype=self._dtype, 
order=self._order) chunk.fill(self._fill_value) + else: + # N.B., use zeros here so any region beyond the array has consistent and + # compressible data + chunk = np.zeros(self._chunks, dtype=self._dtype, order=self._order) else: @@ -824,7 +1664,12 @@ def _chunk_setitem_nosync(self, cidx, item, value): chunk = chunk.copy(order='K') # modify - chunk[item] = value + if fields: + # N.B., currently multi-field assignment is not supported in numpy, so this only + # works for a single field + chunk[fields][chunk_selection] = value + else: + chunk[chunk_selection] = value # encode chunk cdata = self._encode_chunk(chunk) @@ -832,8 +1677,8 @@ def _chunk_setitem_nosync(self, cidx, item, value): # store self.chunk_store[ckey] = cdata - def _chunk_key(self, cidx): - return self._key_prefix + '.'.join(map(str, cidx)) + def _chunk_key(self, chunk_coords): + return self._key_prefix + '.'.join(map(str, chunk_coords)) def _decode_chunk(self, cdata): @@ -966,8 +1811,8 @@ def bytestr(n): return items def __getstate__(self): - return self._store, self._path, self._read_only, self._chunk_store, self._synchronizer, \ - self._cache_metadata + return (self._store, self._path, self._read_only, self._chunk_store, self._synchronizer, + self._cache_metadata) def __setstate__(self, state): self.__init__(*state) @@ -1083,8 +1928,8 @@ def append(self, data, axis=0): (20000, 1000) >>> z.append(np.vstack([a, a]), axis=1) (20000, 2000) - >>> z - + >>> z.shape + (20000, 2000) """ return self._write_op(self._append_nosync, data, axis=axis) @@ -1092,7 +1937,7 @@ def append(self, data, axis=0): def _append_nosync(self, data, axis=0): # ensure data is array-like - if not hasattr(data, 'shape') or not hasattr(data, 'dtype'): + if not hasattr(data, 'shape'): data = np.asanyarray(data) # ensure shapes are compatible for non-append dimensions @@ -1101,7 +1946,8 @@ def _append_nosync(self, data, axis=0): data_shape_preserved = tuple(s for i, s in enumerate(data.shape) if i != axis) if self_shape_preserved != 
data_shape_preserved: - raise ValueError('shapes not compatible') + raise ValueError('shape of data to append is not compatible with the array; all ' + 'dimensions must match except for the dimension being appended') # remember old shape old_shape = self._shape @@ -1231,7 +2077,7 @@ def view(self, shape=None, chunks=None, dtype=None, ... v.resize(20000) ... except PermissionError as e: ... print(e) - not permitted for views + operation not permitted for views """ @@ -1268,7 +2114,7 @@ def view(self, shape=None, chunks=None, dtype=None, return a def astype(self, dtype): - """Does on the fly type conversion of the underlying data. + """Returns a view that does on the fly type conversion of the underlying data. Parameters ---------- diff --git a/zarr/creation.py b/zarr/creation.py index 0e3e3750cc..4dfc20c4c6 100644 --- a/zarr/creation.py +++ b/zarr/creation.py @@ -103,10 +103,6 @@ def create(shape, chunks=None, dtype=None, compressor='default', # API compatibility with h5py compressor, fill_value = _handle_kwargs(compressor, fill_value, kwargs) - # ensure fill_value of correct type - if fill_value is not None: - fill_value = np.array(fill_value, dtype=dtype)[()] - # initialize array metadata init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor, fill_value=fill_value, order=order, @@ -329,22 +325,21 @@ def array(data, **kwargs): return z -def open_array(store=None, mode='a', shape=None, chunks=None, dtype=None, - compressor='default', fill_value=0, order='C', - synchronizer=None, filters=None, cache_metadata=True, +def open_array(store, mode='a', shape=None, chunks=None, dtype=None, compressor='default', + fill_value=0, order='C', synchronizer=None, filters=None, cache_metadata=True, path=None, **kwargs): - """Open array using mode-like semantics. + """Open an array using file-mode-like semantics. Parameters ---------- store : MutableMapping or string Store or path to directory in file system. 
- mode : {'r', 'r+', 'a', 'w', 'w-'} + mode : {'r', 'r+', 'a', 'w', 'w-'}, optional Persistence mode: 'r' means read only (must exist); 'r+' means read/write (must exist); 'a' means read/write (create if doesn't exist); 'w' means create (overwrite if exists); 'w-' means create (fail if exists). - shape : int or tuple of ints + shape : int or tuple of ints, optional Array shape. chunks : int or tuple of ints, optional Chunk shape. If not provided, will be guessed from `shape` and `dtype`. @@ -352,7 +347,7 @@ def open_array(store=None, mode='a', shape=None, chunks=None, dtype=None, NumPy dtype. compressor : Codec, optional Primary compressor. - fill_value : object + fill_value : object, optional Default value to use for uninitialized portions of the array. order : {'C', 'F'}, optional Memory layout to be used within each chunk. @@ -366,7 +361,7 @@ def open_array(store=None, mode='a', shape=None, chunks=None, dtype=None, prior to all data access and modification operations (may incur overhead depending on storage and data access pattern). path : string, optional - Array path. + Array path within store. 
Returns ------- diff --git a/zarr/errors.py b/zarr/errors.py index 82c9306ca7..8829ec3e02 100644 --- a/zarr/errors.py +++ b/zarr/errors.py @@ -50,3 +50,23 @@ def err_fspath_exists_notdir(fspath): def err_read_only(): raise PermissionError('object is read-only') + + +def err_boundscheck(dim_len): + raise IndexError('index out of bounds for dimension with length {}' + .format(dim_len)) + + +def err_negative_step(): + raise IndexError('only slices with step >= 1 are supported') + + +def err_too_many_indices(selection, shape): + raise IndexError('too many indices for array; expected {}, got {}' + .format(len(shape), len(selection))) + + +def err_vindex_invalid_selection(selection): + raise IndexError('unsupported selection type for vectorized indexing; only coordinate ' + 'selection (tuple of integer arrays) and mask selection (single ' + 'Boolean array) are supported; got {!r}'.format(selection)) diff --git a/zarr/hierarchy.py b/zarr/hierarchy.py index 937c53d12c..fb19782168 100644 --- a/zarr/hierarchy.py +++ b/zarr/hierarchy.py @@ -719,6 +719,9 @@ def _require_dataset_nosync(self, name, shape, dtype=None, exact=False, path = self._item_path(name) if contains_array(self._store, path): + + # array already exists at path, validate that it is the right shape and type + synchronizer = kwargs.get('synchronizer', self._synchronizer) cache_metadata = kwargs.get('cache_metadata', True) a = Array(self._store, path=path, read_only=self._read_only, @@ -726,14 +729,17 @@ def _require_dataset_nosync(self, name, shape, dtype=None, exact=False, cache_metadata=cache_metadata) shape = normalize_shape(shape) if shape != a.shape: - raise TypeError('shapes do not match') + raise TypeError('shapes do not match existing array; expected {}, got {}' + .format(a.shape, shape)) dtype = np.dtype(dtype) if exact: if dtype != a.dtype: - raise TypeError('dtypes do not match exactly') + raise TypeError('dtypes do not match exactly; expected {}, got {}' + .format(a.dtype, dtype)) else: if not
np.can_cast(dtype, a.dtype): - raise TypeError('dtypes cannot be safely cast') + raise TypeError('dtypes ({}, {}) cannot be safely cast' + .format(dtype, a.dtype)) return a else: @@ -855,13 +861,12 @@ def _handle_store_arg(store): return store -def group(store=None, overwrite=False, chunk_store=None, synchronizer=None, - path=None): +def group(store=None, overwrite=False, chunk_store=None, synchronizer=None, path=None): """Create a group. Parameters ---------- - store : MutableMapping or string + store : MutableMapping or string, optional Store or path to directory in file system. overwrite : bool, optional If True, delete any pre-existing data in `store` at `path` before @@ -872,7 +877,7 @@ def group(store=None, overwrite=False, chunk_store=None, synchronizer=None, synchronizer : object, optional Array synchronizer. path : string, optional - Group path. + Group path within store. Returns ------- @@ -910,14 +915,14 @@ def group(store=None, overwrite=False, chunk_store=None, synchronizer=None, synchronizer=synchronizer, path=path) -def open_group(store=None, mode='a', synchronizer=None, path=None): - """Open a group using mode-like semantics. +def open_group(store, mode='a', synchronizer=None, path=None): + """Open a group using file-mode-like semantics. Parameters ---------- store : MutableMapping or string Store or path to directory in file system. - mode : {'r', 'r+', 'a', 'w', 'w-'} + mode : {'r', 'r+', 'a', 'w', 'w-'}, optional Persistence mode: 'r' means read only (must exist); 'r+' means read/write (must exist); 'a' means read/write (create if doesn't exist); 'w' means create (overwrite if exists); 'w-' means create @@ -925,7 +930,7 @@ def open_group(store=None, mode='a', synchronizer=None, path=None): synchronizer : object, optional Array synchronizer. path : string, optional - Group path. + Group path within store. 
Returns ------- diff --git a/zarr/indexing.py b/zarr/indexing.py new file mode 100644 index 0000000000..656efc201b --- /dev/null +++ b/zarr/indexing.py @@ -0,0 +1,811 @@ +# -*- coding: utf-8 -*- +from __future__ import absolute_import, print_function, division +import numbers +import itertools +import collections + + +import numpy as np + + +from zarr.errors import (err_too_many_indices, err_boundscheck, err_negative_step, + err_vindex_invalid_selection) + + +def is_integer(x): + return isinstance(x, numbers.Integral) + + +def is_integer_array(x, ndim=None): + t = hasattr(x, 'shape') and hasattr(x, 'dtype') and x.dtype.kind in 'ui' + if ndim is not None: + t = t and len(x.shape) == ndim + return t + + +def is_bool_array(x, ndim=None): + t = hasattr(x, 'shape') and hasattr(x, 'dtype') and x.dtype == bool + if ndim is not None: + t = t and len(x.shape) == ndim + return t + + +def is_scalar(value, dtype): + if np.isscalar(value): + return True + if isinstance(value, tuple) and dtype.names and len(value) == len(dtype.names): + return True + return False + + +def normalize_integer_selection(dim_sel, dim_len): + + # normalize type to int + dim_sel = int(dim_sel) + + # handle wraparound + if dim_sel < 0: + dim_sel = dim_len + dim_sel + + # handle out of bounds + if dim_sel >= dim_len or dim_sel < 0: + err_boundscheck(dim_len) + + return dim_sel + + +ChunkDimProjection = collections.namedtuple('ChunkDimProjection', + ('dim_chunk_ix', 'dim_chunk_sel', 'dim_out_sel')) +"""A mapping from chunk to output array for a single dimension. + +Parameters +---------- +dim_chunk_ix + Index of chunk. +dim_chunk_sel + Selection of items from chunk array. +dim_out_sel + Selection of items in target (output) array. 
+ +""" + + +class IntDimIndexer(object): + + def __init__(self, dim_sel, dim_len, dim_chunk_len): + + # normalize + dim_sel = normalize_integer_selection(dim_sel, dim_len) + + # store attributes + self.dim_sel = dim_sel + self.dim_len = dim_len + self.dim_chunk_len = dim_chunk_len + self.nitems = 1 + + def __iter__(self): + dim_chunk_ix = self.dim_sel // self.dim_chunk_len + dim_offset = dim_chunk_ix * self.dim_chunk_len + dim_chunk_sel = self.dim_sel - dim_offset + dim_out_sel = None + yield ChunkDimProjection(dim_chunk_ix, dim_chunk_sel, dim_out_sel) + + +def ceildiv(a, b): + return int(np.ceil(a / b)) + + +class SliceDimIndexer(object): + + def __init__(self, dim_sel, dim_len, dim_chunk_len): + + # normalize + self.start, self.stop, self.step = dim_sel.indices(dim_len) + if self.step < 1: + err_negative_step() + + # store attributes + self.dim_len = dim_len + self.dim_chunk_len = dim_chunk_len + self.nitems = max(0, ceildiv((self.stop - self.start), self.step)) + self.nchunks = ceildiv(self.dim_len, self.dim_chunk_len) + + def __iter__(self): + + # figure out the range of chunks we need to visit + dim_chunk_ix_from = self.start // self.dim_chunk_len + dim_chunk_ix_to = ceildiv(self.stop, self.dim_chunk_len) + + # iterate over chunks in range + for dim_chunk_ix in range(dim_chunk_ix_from, dim_chunk_ix_to): + + # compute offsets for chunk within overall array + dim_offset = dim_chunk_ix * self.dim_chunk_len + dim_limit = min(self.dim_len, (dim_chunk_ix + 1) * self.dim_chunk_len) + + # determine chunk length, accounting for trailing chunk + dim_chunk_len = dim_limit - dim_offset + + if self.start < dim_offset: + # selection starts before current chunk + dim_chunk_sel_start = 0 + remainder = (dim_offset - self.start) % self.step + if remainder: + dim_chunk_sel_start += self.step - remainder + # compute number of previous items, provides offset into output array + dim_out_offset = ceildiv((dim_offset - self.start), self.step) + + else: + # selection starts within 
current chunk + dim_chunk_sel_start = self.start - dim_offset + dim_out_offset = 0 + + if self.stop > dim_limit: + # selection ends after current chunk + dim_chunk_sel_stop = dim_chunk_len + + else: + # selection ends within current chunk + dim_chunk_sel_stop = self.stop - dim_offset + + dim_chunk_sel = slice(dim_chunk_sel_start, dim_chunk_sel_stop, self.step) + dim_chunk_nitems = ceildiv((dim_chunk_sel_stop - dim_chunk_sel_start), self.step) + dim_out_sel = slice(dim_out_offset, dim_out_offset + dim_chunk_nitems) + + yield ChunkDimProjection(dim_chunk_ix, dim_chunk_sel, dim_out_sel) + + +def check_selection_length(selection, shape): + if len(selection) > len(shape): + err_too_many_indices(selection, shape) + + +def replace_ellipsis(selection, shape): + + selection = ensure_tuple(selection) + + # count number of ellipsis present + n_ellipsis = sum(1 for i in selection if i is Ellipsis) + + if n_ellipsis > 1: + # more than 1 is an error + raise IndexError("an index can only have a single ellipsis ('...')") + + elif n_ellipsis == 1: + # locate the ellipsis, count how many items to left and right + n_items_l = selection.index(Ellipsis) # items to left of ellipsis + n_items_r = len(selection) - (n_items_l + 1) # items to right of ellipsis + n_items = len(selection) - 1 # all non-ellipsis items + + if n_items >= len(shape): + # ellipsis does nothing, just remove it + selection = tuple(i for i in selection if i != Ellipsis) + + else: + # replace ellipsis with as many slices are needed for number of dims + new_item = selection[:n_items_l] + ((slice(None),) * (len(shape) - n_items)) + if n_items_r: + new_item += selection[-n_items_r:] + selection = new_item + + # fill out selection if not completely specified + if len(selection) < len(shape): + selection += (slice(None),) * (len(shape) - len(selection)) + + # check selection not too long + check_selection_length(selection, shape) + + return selection + + +def replace_lists(selection): + return tuple( + np.asarray(dim_sel) 
if isinstance(dim_sel, list) else dim_sel + for dim_sel in selection + ) + + +def ensure_tuple(v): + if not isinstance(v, tuple): + v = (v,) + return v + + +ChunkProjection = collections.namedtuple('ChunkProjection', + ('chunk_coords', 'chunk_selection', 'out_selection')) +"""A mapping of items from chunk to output array. Can be used to extract items from the chunk +array for loading into an output array. Can also be used to extract items from a value array for +setting/updating in a chunk array. + +Parameters +---------- +chunk_coords + Indices of chunk. +chunk_selection + Selection of items from chunk array. +out_selection + Selection of items in target (output) array. + +""" + + +def is_slice(s): + return isinstance(s, slice) + + +def is_contiguous_slice(s): + return is_slice(s) and (s.step is None or s.step == 1) + + +def is_positive_slice(s): + return is_slice(s) and (s.step is None or s.step >= 1) + + +def is_contiguous_selection(selection): + selection = ensure_tuple(selection) + return all([ + (is_integer_array(s) or is_contiguous_slice(s) or s == Ellipsis) + for s in selection + ]) + + +def is_basic_selection(selection): + selection = ensure_tuple(selection) + return all([is_integer(s) or is_positive_slice(s) for s in selection]) + + +# noinspection PyProtectedMember +class BasicIndexer(object): + + def __init__(self, selection, array): + + # handle ellipsis + selection = replace_ellipsis(selection, array._shape) + + # setup per-dimension indexers + dim_indexers = [] + for dim_sel, dim_len, dim_chunk_len in zip(selection, array._shape, array._chunks): + + if is_integer(dim_sel): + dim_indexer = IntDimIndexer(dim_sel, dim_len, dim_chunk_len) + + elif is_slice(dim_sel): + dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_chunk_len) + + else: + raise IndexError('unsupported selection item for basic indexing; expected integer ' + 'or slice, got {!r}'.format(type(dim_sel))) + + dim_indexers.append(dim_indexer) + + self.dim_indexers = dim_indexers + self.shape 
= tuple(s.nitems for s in self.dim_indexers + if not isinstance(s, IntDimIndexer)) + self.drop_axes = None + + def __iter__(self): + for dim_projections in itertools.product(*self.dim_indexers): + + chunk_coords = tuple(p.dim_chunk_ix for p in dim_projections) + chunk_selection = tuple(p.dim_chunk_sel for p in dim_projections) + out_selection = tuple(p.dim_out_sel for p in dim_projections + if p.dim_out_sel is not None) + + yield ChunkProjection(chunk_coords, chunk_selection, out_selection) + + +class BoolArrayDimIndexer(object): + + def __init__(self, dim_sel, dim_len, dim_chunk_len): + + # check number of dimensions + if not is_bool_array(dim_sel, 1): + raise IndexError('Boolean arrays in an orthogonal selection must be 1-dimensional') + + # check shape + if dim_sel.shape[0] != dim_len: + raise IndexError('Boolean array has the wrong length for dimension; ' + 'expected {}, got {}'.format(dim_len, dim_sel.shape[0])) + + # store attributes + self.dim_sel = dim_sel + self.dim_len = dim_len + self.dim_chunk_len = dim_chunk_len + self.nchunks = ceildiv(self.dim_len, self.dim_chunk_len) + + # precompute number of selected items for each chunk + self.chunk_nitems = np.zeros(self.nchunks, dtype='i8') + for dim_chunk_ix in range(self.nchunks): + dim_offset = dim_chunk_ix * self.dim_chunk_len + self.chunk_nitems[dim_chunk_ix] = np.count_nonzero( + self.dim_sel[dim_offset:dim_offset + self.dim_chunk_len] + ) + self.chunk_nitems_cumsum = np.cumsum(self.chunk_nitems) + self.nitems = self.chunk_nitems_cumsum[-1] + self.dim_chunk_ixs = np.nonzero(self.chunk_nitems)[0] + + def __iter__(self): + + # iterate over chunks with at least one item + for dim_chunk_ix in self.dim_chunk_ixs: + + # find region in chunk + dim_offset = dim_chunk_ix * self.dim_chunk_len + dim_chunk_sel = self.dim_sel[dim_offset:dim_offset + self.dim_chunk_len] + + # pad out if final chunk + if dim_chunk_sel.shape[0] < self.dim_chunk_len: + tmp = np.zeros(self.dim_chunk_len, dtype=bool) +
tmp[:dim_chunk_sel.shape[0]] = dim_chunk_sel + dim_chunk_sel = tmp + + # find region in output + if dim_chunk_ix == 0: + start = 0 + else: + start = self.chunk_nitems_cumsum[dim_chunk_ix - 1] + stop = self.chunk_nitems_cumsum[dim_chunk_ix] + dim_out_sel = slice(start, stop) + + yield ChunkDimProjection(dim_chunk_ix, dim_chunk_sel, dim_out_sel) + + +class Order: + UNKNOWN = 0 + INCREASING = 1 + DECREASING = 2 + UNORDERED = 3 + + @staticmethod + def check(a): + diff = np.diff(a) + diff_positive = diff >= 0 + n_diff_positive = np.count_nonzero(diff_positive) + all_increasing = n_diff_positive == len(diff_positive) + any_increasing = n_diff_positive > 0 + if all_increasing: + order = Order.INCREASING + elif any_increasing: + order = Order.UNORDERED + else: + order = Order.DECREASING + return order + + +def wraparound_indices(x, dim_len): + loc_neg = x < 0 + if np.any(loc_neg): + x[loc_neg] = x[loc_neg] + dim_len + + +def boundscheck_indices(x, dim_len): + if np.any(x < 0) or np.any(x >= dim_len): + err_boundscheck(dim_len) + + +class IntArrayDimIndexer(object): + """Integer array selection against a single dimension.""" + + def __init__(self, dim_sel, dim_len, dim_chunk_len, wraparound=True, boundscheck=True, + order=Order.UNKNOWN): + + # ensure 1d array + dim_sel = np.asanyarray(dim_sel) + if not is_integer_array(dim_sel, 1): + raise IndexError('integer arrays in an orthogonal selection must be 1-dimensional only') + + # handle wraparound + if wraparound: + wraparound_indices(dim_sel, dim_len) + + # handle out of bounds + if boundscheck: + boundscheck_indices(dim_sel, dim_len) + + # store attributes + self.dim_len = dim_len + self.dim_chunk_len = dim_chunk_len + self.nchunks = ceildiv(self.dim_len, self.dim_chunk_len) + self.nitems = len(dim_sel) + + # determine which chunk is needed for each selection item + # note: for dense integer selections, the division operation here is the bottleneck + dim_sel_chunk = dim_sel // dim_chunk_len + + # determine order of indices + 
if order == Order.UNKNOWN: + order = Order.check(dim_sel) + self.order = order + + if self.order == Order.INCREASING: + self.dim_sel = dim_sel + self.dim_out_sel = None + elif self.order == Order.DECREASING: + self.dim_sel = dim_sel[::-1] + # TODO should be possible to do this without creating an arange + self.dim_out_sel = np.arange(self.nitems - 1, -1, -1) + else: + # sort indices to group by chunk + self.dim_out_sel = np.argsort(dim_sel_chunk) + self.dim_sel = np.take(dim_sel, self.dim_out_sel) + + # precompute number of selected items for each chunk + self.chunk_nitems = np.bincount(dim_sel_chunk, minlength=self.nchunks) + + # find chunks that we need to visit + self.dim_chunk_ixs = np.nonzero(self.chunk_nitems)[0] + + # compute offsets into the output array + self.chunk_nitems_cumsum = np.cumsum(self.chunk_nitems) + + def __iter__(self): + + for dim_chunk_ix in self.dim_chunk_ixs: + + # find region in output + if dim_chunk_ix == 0: + start = 0 + else: + start = self.chunk_nitems_cumsum[dim_chunk_ix - 1] + stop = self.chunk_nitems_cumsum[dim_chunk_ix] + if self.order == Order.INCREASING: + dim_out_sel = slice(start, stop) + else: + dim_out_sel = self.dim_out_sel[start:stop] + + # find region in chunk + dim_offset = dim_chunk_ix * self.dim_chunk_len + dim_chunk_sel = self.dim_sel[start:stop] - dim_offset + + yield ChunkDimProjection(dim_chunk_ix, dim_chunk_sel, dim_out_sel) + + +def slice_to_range(s, l): + return range(*s.indices(l)) + + +def ix_(selection, shape): + """Convert an orthogonal selection to a numpy advanced (fancy) selection, like numpy.ix_ + but with support for slices and single ints.""" + + # normalisation + selection = replace_ellipsis(selection, shape) + + # replace slice and int as these are not supported by numpy.ix_ + selection = [slice_to_range(dim_sel, dim_len) if isinstance(dim_sel, slice) + else [dim_sel] if is_integer(dim_sel) + else dim_sel + for dim_sel, dim_len in zip(selection, shape)] + + # now get numpy to convert to a coordinate 
selection + selection = np.ix_(*selection) + + return selection + + +def oindex(a, selection): + """Implementation of orthogonal indexing with slices and ints.""" + selection = replace_ellipsis(selection, a.shape) + drop_axes = tuple([i for i, s in enumerate(selection) if is_integer(s)]) + selection = ix_(selection, a.shape) + result = a[selection] + if drop_axes: + result = result.squeeze(axis=drop_axes) + return result + + +def oindex_set(a, selection, value): + selection = replace_ellipsis(selection, a.shape) + drop_axes = tuple([i for i, s in enumerate(selection) if is_integer(s)]) + selection = ix_(selection, a.shape) + if not np.isscalar(value) and drop_axes: + value = np.asanyarray(value) + value_selection = [slice(None)] * len(a.shape) + for i in drop_axes: + value_selection[i] = np.newaxis + value = value[value_selection] + a[selection] = value + + +# noinspection PyProtectedMember +class OrthogonalIndexer(object): + + def __init__(self, selection, array): + + # handle ellipsis + selection = replace_ellipsis(selection, array._shape) + + # normalize list to array + selection = replace_lists(selection) + + # setup per-dimension indexers + dim_indexers = [] + for dim_sel, dim_len, dim_chunk_len in zip(selection, array._shape, array._chunks): + + if is_integer(dim_sel): + dim_indexer = IntDimIndexer(dim_sel, dim_len, dim_chunk_len) + + elif isinstance(dim_sel, slice): + dim_indexer = SliceDimIndexer(dim_sel, dim_len, dim_chunk_len) + + elif is_integer_array(dim_sel): + dim_indexer = IntArrayDimIndexer(dim_sel, dim_len, dim_chunk_len) + + elif is_bool_array(dim_sel): + dim_indexer = BoolArrayDimIndexer(dim_sel, dim_len, dim_chunk_len) + + else: + raise IndexError('unsupported selection item for orthogonal indexing; expected ' + 'integer, slice, integer array or Boolean array, got {!r}' + .format(type(dim_sel))) + + dim_indexers.append(dim_indexer) + + self.array = array + self.dim_indexers = dim_indexers + self.shape = tuple(s.nitems for s in self.dim_indexers 
+ if not isinstance(s, IntDimIndexer)) + self.is_advanced = not is_basic_selection(selection) + if self.is_advanced: + self.drop_axes = tuple([i for i, dim_indexer in enumerate(self.dim_indexers) + if isinstance(dim_indexer, IntDimIndexer)]) + else: + self.drop_axes = None + + def __iter__(self): + for dim_projections in itertools.product(*self.dim_indexers): + + chunk_coords = tuple(p.dim_chunk_ix for p in dim_projections) + chunk_selection = tuple(p.dim_chunk_sel for p in dim_projections) + out_selection = tuple(p.dim_out_sel for p in dim_projections + if p.dim_out_sel is not None) + + # handle advanced indexing arrays orthogonally + if self.is_advanced: + + # numpy doesn't support orthogonal indexing directly as yet, so need to work + # around via np.ix_. Also np.ix_ does not support a mixture of arrays and slices + # or integers, so need to convert slices and integers into ranges. + chunk_selection = ix_(chunk_selection, self.array._chunks) + + # special case for non-monotonic indices + if not is_basic_selection(out_selection): + out_selection = ix_(out_selection, self.shape) + + yield ChunkProjection(chunk_coords, chunk_selection, out_selection) + + +class OIndex(object): + + def __init__(self, array): + self.array = array + + def __getitem__(self, selection): + fields, selection = pop_fields(selection) + selection = ensure_tuple(selection) + selection = replace_lists(selection) + return self.array.get_orthogonal_selection(selection, fields=fields) + + def __setitem__(self, selection, value): + fields, selection = pop_fields(selection) + selection = ensure_tuple(selection) + selection = replace_lists(selection) + return self.array.set_orthogonal_selection(selection, value, fields=fields) + + +# noinspection PyProtectedMember +def is_coordinate_selection(selection, array): + return ( + (len(selection) == len(array._shape)) and + all([is_integer(dim_sel) or is_integer_array(dim_sel) + for dim_sel in selection]) + ) + + +# noinspection PyProtectedMember +def 
is_mask_selection(selection, array): + return ( + len(selection) == 1 and + is_bool_array(selection[0]) and + selection[0].shape == array._shape + ) + + +# noinspection PyProtectedMember +class CoordinateIndexer(object): + + def __init__(self, selection, array): + + # some initial normalization + selection = ensure_tuple(selection) + selection = tuple([i] if is_integer(i) else i for i in selection) + selection = replace_lists(selection) + + # validation + if not is_coordinate_selection(selection, array): + raise IndexError('invalid coordinate selection; expected one integer (coordinate) ' + 'array per dimension of the target array, got {!r}'.format(selection)) + + # handle wraparound, boundscheck + for dim_sel, dim_len in zip(selection, array.shape): + + # handle wraparound + wraparound_indices(dim_sel, dim_len) + + # handle out of bounds + boundscheck_indices(dim_sel, dim_len) + + # compute chunk index for each point in the selection + chunks_multi_index = tuple( + dim_sel // dim_chunk_len + for (dim_sel, dim_chunk_len) in zip(selection, array._chunks) + ) + + # broadcast selection - this will raise error if array dimensions don't match + selection = np.broadcast_arrays(*selection) + chunks_multi_index = np.broadcast_arrays(*chunks_multi_index) + + # remember shape of selection, because we will flatten indices for processing + self.sel_shape = selection[0].shape if selection[0].shape else (1,) + + # flatten selection + selection = [dim_sel.reshape(-1) for dim_sel in selection] + chunks_multi_index = [dim_chunks.reshape(-1) for dim_chunks in chunks_multi_index] + + # ravel chunk indices + chunks_raveled_indices = np.ravel_multi_index(chunks_multi_index, + dims=array._cdata_shape) + + # group points by chunk + if np.any(np.diff(chunks_raveled_indices) < 0): + # optimisation, only sort if needed + sel_sort = np.argsort(chunks_raveled_indices) + selection = tuple(dim_sel[sel_sort] for dim_sel in selection) + else: + sel_sort = None + + # store attributes +
self.selection = selection + self.sel_sort = sel_sort + self.shape = selection[0].shape if selection[0].shape else (1,) + self.drop_axes = None + self.array = array + + # precompute number of selected items for each chunk + self.chunk_nitems = np.bincount(chunks_raveled_indices, minlength=array.nchunks) + self.chunk_nitems_cumsum = np.cumsum(self.chunk_nitems) + # locate the chunks we need to process + self.chunk_rixs = np.nonzero(self.chunk_nitems)[0] + + # unravel chunk indices + self.chunk_mixs = np.unravel_index(self.chunk_rixs, dims=array._cdata_shape) + + def __iter__(self): + + # iterate over chunks + for i, chunk_rix in enumerate(self.chunk_rixs): + + chunk_coords = tuple(m[i] for m in self.chunk_mixs) + if chunk_rix == 0: + start = 0 + else: + start = self.chunk_nitems_cumsum[chunk_rix - 1] + stop = self.chunk_nitems_cumsum[chunk_rix] + if self.sel_sort is None: + out_selection = slice(start, stop) + else: + out_selection = self.sel_sort[start:stop] + + chunk_offsets = tuple( + dim_chunk_ix * dim_chunk_len + for dim_chunk_ix, dim_chunk_len in zip(chunk_coords, self.array._chunks) + ) + chunk_selection = tuple( + dim_sel[start:stop] - dim_chunk_offset + for (dim_sel, dim_chunk_offset) in zip(self.selection, chunk_offsets) + ) + + yield ChunkProjection(chunk_coords, chunk_selection, out_selection) + + +# noinspection PyProtectedMember +class MaskIndexer(CoordinateIndexer): + + def __init__(self, selection, array): + + # some initial normalization + selection = ensure_tuple(selection) + selection = replace_lists(selection) + + # validation + if not is_mask_selection(selection, array): + raise IndexError('invalid mask selection; expected one Boolean (mask) ' + 'array with the same shape as the target array, got {!r}' + .format(selection)) + + # convert to indices + selection = np.nonzero(selection[0]) + + # delegate the rest to superclass + super(MaskIndexer, self).__init__(selection, array) + + +class VIndex(object): + + def __init__(self, array): + self.array 
= array + + def __getitem__(self, selection): + fields, selection = pop_fields(selection) + selection = ensure_tuple(selection) + selection = replace_lists(selection) + if is_coordinate_selection(selection, self.array): + return self.array.get_coordinate_selection(selection, fields=fields) + elif is_mask_selection(selection, self.array): + return self.array.get_mask_selection(selection, fields=fields) + else: + err_vindex_invalid_selection(selection) + + def __setitem__(self, selection, value): + fields, selection = pop_fields(selection) + selection = ensure_tuple(selection) + selection = replace_lists(selection) + if is_coordinate_selection(selection, self.array): + self.array.set_coordinate_selection(selection, value, fields=fields) + elif is_mask_selection(selection, self.array): + self.array.set_mask_selection(selection, value, fields=fields) + else: + err_vindex_invalid_selection(selection) + + +def check_fields(fields, dtype): + # early out + if fields is None: + return dtype + # check type + if not isinstance(fields, (str, list, tuple)): + raise IndexError("'fields' argument must be a string or list of strings; found {!r}" + .format(type(fields))) + if fields: + if dtype.names is None: + raise IndexError("invalid 'fields' argument, array does not have any fields") + try: + if isinstance(fields, str): + # single field selection + out_dtype = dtype[fields] + else: + # multiple field selection + out_dtype = np.dtype([(f, dtype[f]) for f in fields]) + except KeyError as e: + raise IndexError("invalid 'fields' argument, field not found: {!r}".format(e)) + else: + return out_dtype + else: + return dtype + + +def check_no_multi_fields(fields): + if isinstance(fields, list): + if len(fields) == 1: + return fields[0] + elif len(fields) > 1: + raise IndexError('multiple fields are not supported for this operation') + return fields + + +def pop_fields(selection): + if isinstance(selection, str): + # single field selection + fields = selection + selection = () + elif 
not isinstance(selection, tuple): + # single selection item, no fields + fields = None + # leave selection as-is + else: + # multiple items, split fields from selection items + fields = [f for f in selection if isinstance(f, str)] + fields = fields[0] if len(fields) == 1 else fields + selection = tuple(s for s in selection if not isinstance(s, str)) + selection = selection[0] if len(selection) == 1 else selection + return fields, selection diff --git a/zarr/meta.py b/zarr/meta.py index 59fe2d22d5..62852247ec 100644 --- a/zarr/meta.py +++ b/zarr/meta.py @@ -124,15 +124,20 @@ def decode_fill_value(v, dtype): return np.NINF else: return np.array(v, dtype=dtype)[()] - elif dtype.kind == 'S': + elif dtype.kind in 'SV': try: - return base64.standard_b64decode(v) + v = base64.standard_b64decode(v) + v = np.array(v, dtype=dtype)[()] + return v except Exception: # be lenient, allow for other values that may have been used before base64 encoding # and may work as fill values, e.g., the number 0 return v - else: + elif dtype.kind == 'U': + # leave as-is return v + else: + return np.array(v, dtype=dtype)[()] def encode_fill_value(v, dtype): @@ -152,10 +157,12 @@ def encode_fill_value(v, dtype): return int(v) elif dtype.kind == 'b': return bool(v) - elif dtype.kind == 'S': + elif dtype.kind in 'SV': v = base64.standard_b64encode(v) if not PY2: v = str(v, 'ascii') return v + elif dtype.kind == 'U': + return v else: return v diff --git a/zarr/storage.py b/zarr/storage.py index 939e4ef85a..302dc44530 100644 --- a/zarr/storage.py +++ b/zarr/storage.py @@ -19,7 +19,7 @@ from zarr.util import normalize_shape, normalize_chunks, normalize_order, \ - normalize_storage_path, buffer_size + normalize_storage_path, buffer_size, normalize_fill_value from zarr.meta import encode_array_metadata, encode_group_metadata from zarr.compat import PY2, binary_type from numcodecs.registry import codec_registry @@ -285,6 +285,7 @@ def _init_array_metadata(store, shape, chunks=None, dtype=None, 
compressor='defa dtype = np.dtype(dtype) chunks = normalize_chunks(chunks, shape, dtype.itemsize) order = normalize_order(order) + fill_value = normalize_fill_value(fill_value, dtype) # compressor prep if shape == (): diff --git a/zarr/tests/test_core.py b/zarr/tests/test_core.py index d7957d162b..03811e1ab7 100644 --- a/zarr/tests/test_core.py +++ b/zarr/tests/test_core.py @@ -21,6 +21,7 @@ from numcodecs import Delta, FixedScaleOffset, Zlib, Blosc, BZ2 +# noinspection PyMethodMayBeStatic class TestArray(unittest.TestCase): def test_array_init(self): @@ -81,6 +82,7 @@ def test_nbytes_stored(self): except TypeError: pass + # noinspection PyStatementEffect def test_array_1d(self): a = np.arange(1050) z = self.create_array(shape=a.shape, chunks=100, dtype=a.dtype) @@ -123,6 +125,12 @@ def test_array_1d(self): assert_array_equal(a[:10], z[:10]) assert_array_equal(a[10:20], z[10:20]) assert_array_equal(a[-10:], z[-10:]) + assert_array_equal(a[:10, ...], z[:10, ...]) + assert_array_equal(a[10:20, ...], z[10:20, ...]) + assert_array_equal(a[-10:, ...], z[-10:, ...]) + assert_array_equal(a[..., :10], z[..., :10]) + assert_array_equal(a[..., 10:20], z[..., 10:20]) + assert_array_equal(a[..., -10:], z[..., -10:]) # ...across chunk boundaries... assert_array_equal(a[:110], z[:110]) assert_array_equal(a[190:310], z[190:310]) @@ -135,6 +143,18 @@ def test_array_1d(self): eq(a[42], z[np.int32(42)]) eq(a[42], z[np.uint64(42)]) eq(a[42], z[np.uint32(42)]) + # too many indices + with assert_raises(IndexError): + z[:, :] + with assert_raises(IndexError): + z[0, :] + with assert_raises(IndexError): + z[:, 0] + with assert_raises(IndexError): + z[0, 0] + # only single ellipsis allowed + with assert_raises(IndexError): + z[..., ...] 
# check partial assignment b = np.arange(1e5, 2e5) @@ -174,6 +194,49 @@ def test_array_1d_set_scalar(self): z[:] = value assert_array_equal(a, z[:]) + def test_array_1d_selections(self): + # light test here, full tests in test_indexing + + # setup + a = np.arange(1050) + z = self.create_array(shape=a.shape, chunks=100, dtype=a.dtype) + z[:] = a + + # get + assert_array_equal(a[50:150], z.get_orthogonal_selection(slice(50, 150))) + assert_array_equal(a[50:150], z.oindex[50: 150]) + ix = [99, 100, 101] + bix = np.zeros_like(a, dtype=bool) + bix[ix] = True + assert_array_equal(a[ix], z.get_orthogonal_selection(ix)) + assert_array_equal(a[ix], z.oindex[ix]) + assert_array_equal(a[ix], z.get_coordinate_selection(ix)) + assert_array_equal(a[ix], z.vindex[ix]) + assert_array_equal(a[bix], z.get_mask_selection(bix)) + assert_array_equal(a[bix], z.oindex[bix]) + assert_array_equal(a[bix], z.vindex[bix]) + + # set + z.set_orthogonal_selection(slice(50, 150), 1) + assert_array_equal(1, z[50:150]) + z.oindex[50:150] = 2 + assert_array_equal(2, z[50:150]) + z.set_orthogonal_selection(ix, 3) + assert_array_equal(3, z.get_coordinate_selection(ix)) + z.oindex[ix] = 4 + assert_array_equal(4, z.oindex[ix]) + z.set_coordinate_selection(ix, 5) + assert_array_equal(5, z.get_coordinate_selection(ix)) + z.vindex[ix] = 6 + assert_array_equal(6, z.vindex[ix]) + z.set_mask_selection(bix, 7) + assert_array_equal(7, z.get_mask_selection(bix)) + z.vindex[bix] = 8 + assert_array_equal(8, z.vindex[bix]) + z.oindex[bix] = 9 + assert_array_equal(9, z.oindex[bix]) + + # noinspection PyStatementEffect def test_array_2d(self): a = np.arange(10000).reshape((1000, 10)) z = self.create_array(shape=a.shape, chunks=(100, 2), dtype=a.dtype) @@ -194,37 +257,84 @@ def test_array_2d(self): eq(a.nbytes, z.nbytes) eq(50, z.nchunks_initialized) - # check slicing + # check array-like assert_array_equal(a, np.array(z)) + + # check slicing + + # total slice assert_array_equal(a, z[:]) assert_array_equal(a, z[...]) 
# noinspection PyTypeChecker assert_array_equal(a, z[slice(None)]) + + # slice first dimension assert_array_equal(a[:10], z[:10]) assert_array_equal(a[10:20], z[10:20]) assert_array_equal(a[-10:], z[-10:]) + assert_array_equal(a[:10, :], z[:10, :]) + assert_array_equal(a[10:20, :], z[10:20, :]) + assert_array_equal(a[-10:, :], z[-10:, :]) + assert_array_equal(a[:10, ...], z[:10, ...]) + assert_array_equal(a[10:20, ...], z[10:20, ...]) + assert_array_equal(a[-10:, ...], z[-10:, ...]) + assert_array_equal(a[:10, :, ...], z[:10, :, ...]) + assert_array_equal(a[10:20, :, ...], z[10:20, :, ...]) + assert_array_equal(a[-10:, :, ...], z[-10:, :, ...]) + + # slice second dimension assert_array_equal(a[:, :2], z[:, :2]) assert_array_equal(a[:, 2:4], z[:, 2:4]) assert_array_equal(a[:, -2:], z[:, -2:]) + assert_array_equal(a[..., :2], z[..., :2]) + assert_array_equal(a[..., 2:4], z[..., 2:4]) + assert_array_equal(a[..., -2:], z[..., -2:]) + assert_array_equal(a[:, ..., :2], z[:, ..., :2]) + assert_array_equal(a[:, ..., 2:4], z[:, ..., 2:4]) + assert_array_equal(a[:, ..., -2:], z[:, ..., -2:]) + + # slice both dimensions assert_array_equal(a[:10, :2], z[:10, :2]) assert_array_equal(a[10:20, 2:4], z[10:20, 2:4]) assert_array_equal(a[-10:, -2:], z[-10:, -2:]) - # ...across chunk boundaries... 
+ + # slicing across chunk boundaries assert_array_equal(a[:110], z[:110]) assert_array_equal(a[190:310], z[190:310]) assert_array_equal(a[-110:], z[-110:]) + assert_array_equal(a[:110, :], z[:110, :]) + assert_array_equal(a[190:310, :], z[190:310, :]) + assert_array_equal(a[-110:, :], z[-110:, :]) assert_array_equal(a[:, :3], z[:, :3]) assert_array_equal(a[:, 3:7], z[:, 3:7]) assert_array_equal(a[:, -3:], z[:, -3:]) assert_array_equal(a[:110, :3], z[:110, :3]) assert_array_equal(a[190:310, 3:7], z[190:310, 3:7]) assert_array_equal(a[-110:, -3:], z[-110:, -3:]) - # single item + + # single row/col/item assert_array_equal(a[0], z[0]) assert_array_equal(a[-1], z[-1]) + assert_array_equal(a[:, 0], z[:, 0]) + assert_array_equal(a[:, -1], z[:, -1]) eq(a[0, 0], z[0, 0]) eq(a[-1, -1], z[-1, -1]) + # too many indices + with assert_raises(IndexError): + z[:, :, :] + with assert_raises(IndexError): + z[0, :, :] + with assert_raises(IndexError): + z[:, 0, :] + with assert_raises(IndexError): + z[:, :, 0] + with assert_raises(IndexError): + z[0, 0, 0] + # only single ellipsis allowed + with assert_raises(IndexError): + z[..., ...] + # check partial assignment b = np.arange(10000, 20000).reshape((1000, 10)) z[190:310, 3:7] = b[190:310, 3:7] @@ -234,6 +344,18 @@ def test_array_2d(self): assert_array_equal(a[310:], z[310:]) assert_array_equal(a[:, 7:], z[:, 7:]) + def test_array_2d_edge_case(self): + # this fails with filters - chunks extend beyond edge of array, messes with delta filter + # if no fill value? 
+ shape = 1000, 10 + chunks = 300, 30 + dtype = 'i8' + z = self.create_array(shape=shape, dtype=dtype, chunks=chunks) + z[:] = 0 + expect = np.zeros(shape, dtype=dtype) + actual = z[:] + assert_array_equal(expect, actual) + def test_array_2d_partial(self): z = self.create_array(shape=(1000, 10), chunks=(100, 2), dtype='i4', fill_value=0) @@ -478,6 +600,18 @@ def test_read_only(self): z.resize(2000) with assert_raises(PermissionError): z.append(np.arange(1000)) + with assert_raises(PermissionError): + z.set_basic_selection(Ellipsis, 42) + with assert_raises(PermissionError): + z.set_orthogonal_selection([0, 1, 2], 42) + with assert_raises(PermissionError): + z.oindex[[0, 1, 2]] = 42 + with assert_raises(PermissionError): + z.set_coordinate_selection([0, 1, 2], 42) + with assert_raises(PermissionError): + z.vindex[[0, 1, 2]] = 42 + with assert_raises(PermissionError): + z.set_mask_selection(np.ones(z.shape, dtype=bool), 42) def test_pickle(self): @@ -521,6 +655,7 @@ def test_np_ufuncs(self): assert_array_equal(np.take(a, indices, axis=1), np.take(a, zi, axis=1)) + # noinspection PyStatementEffect def test_0len_dim_1d(self): # Test behaviour for 1D array with zero-length dimension. @@ -553,6 +688,7 @@ def test_0len_dim_1d(self): with assert_raises(IndexError): z[0] = 42 + # noinspection PyStatementEffect def test_0len_dim_2d(self): # Test behaviour for 2D array with a zero-length dimension. 
@@ -589,6 +725,7 @@ def test_0len_dim_2d(self): with assert_raises(IndexError): z[:, 0] = 42 + # noinspection PyStatementEffect def test_array_0d(self): # test behaviour for array with 0 dimensions @@ -645,6 +782,32 @@ def test_nchunks_initialized(self): z[:] = 42 eq(10, z.nchunks_initialized) + def test_structured_array(self): + + # setup some data + a = np.array([(b'aaa', 1, 4.2), + (b'bbb', 2, 8.4), + (b'ccc', 3, 12.6)], + dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + for fill_value in None, b'', (b'zzz', 0, 0.0): + z = self.create_array(shape=a.shape, chunks=2, dtype=a.dtype, fill_value=fill_value) + eq(3, len(z)) + if fill_value is not None: + np_fill_value = np.array(fill_value, dtype=a.dtype)[()] + eq(np_fill_value, z.fill_value) + eq(np_fill_value, z[0]) + eq(np_fill_value, z[-1]) + z[...] = a + eq(a[0], z[0]) + assert_array_equal(a, z[...]) + assert_array_equal(a['foo'], z['foo']) + assert_array_equal(a['bar'], z['bar']) + assert_array_equal(a['baz'], z['baz']) + + with assert_raises(ValueError): + # dodgy fill value + self.create_array(shape=a.shape, chunks=2, dtype=a.dtype, fill_value=42) + class TestArrayWithPath(TestArray): @@ -829,6 +992,10 @@ def test_astype(self): expected = data.astype(astype) assert_array_equal(expected, z2) + def test_structured_array(self): + # don't implement this one, cannot do delta on structured array + pass + # custom store, does not support getsize() class CustomMapping(object): diff --git a/zarr/tests/test_creation.py b/zarr/tests/test_creation.py index bb617fff14..e159019921 100644 --- a/zarr/tests/test_creation.py +++ b/zarr/tests/test_creation.py @@ -19,6 +19,7 @@ from zarr.hierarchy import open_group from zarr.errors import PermissionError from zarr.codecs import Zlib +from zarr.compat import PY2 # something bcolz-like @@ -97,6 +98,12 @@ def test_array(): assert_array_equal(a[:], z[:]) eq(a.dtype, z.dtype) + # with dtype=something else + a = np.arange(100, dtype='i4') + z = array(a, dtype='i8') + 
assert_array_equal(a[:], z[:]) + eq(np.dtype('i8'), z.dtype) + def test_empty(): z = empty(100, chunks=10) @@ -128,9 +135,37 @@ def test_full(): z = full(100, chunks=10, fill_value=np.nan, dtype='f8') assert np.all(np.isnan(z[:])) - # "NaN" - z = full(100, chunks=10, fill_value='NaN', dtype='U3') - assert np.all(z[:] == 'NaN') + # byte string dtype + v = b'xxx' + z = full(100, chunks=10, fill_value=v, dtype='S3') + eq(v, z[0]) + a = z[...] + eq(z.dtype, a.dtype) + eq(v, a[0]) + assert np.all(a == v) + + # unicode string dtype + v = u'xxx' + z = full(100, chunks=10, fill_value=v, dtype='U3') + eq(v, z[0]) + a = z[...] + eq(z.dtype, a.dtype) + eq(v, a[0]) + assert np.all(a == v) + + # bytes fill value / unicode dtype + v = b'xxx' + if PY2: # pragma: py3 no cover + # allow this on PY2 + z = full(100, chunks=10, fill_value=v, dtype='U3') + a = z[...] + eq(z.dtype, a.dtype) + eq(v, a[0]) + assert np.all(a == v) + else: # pragma: py2 no cover + # be strict on PY3 + with assert_raises(ValueError): + full(100, chunks=10, fill_value=v, dtype='U3') def test_open_array(): @@ -370,7 +405,7 @@ def test_create(): with assert_raises(ValueError): create(100, chunks=10, compressor='zlib') - # compatibility + # h5py compatibility z = create(100, compression='zlib', compression_opts=9) eq('zlib', z.compressor.codec_id) @@ -381,7 +416,11 @@ def test_create(): # errors with assert_raises(ValueError): + # bad compression argument create(100, compression=1) + with assert_raises(ValueError): + # bad fill value + create(100, dtype='i4', fill_value='foo') def test_compression_args(): diff --git a/zarr/tests/test_filters.py b/zarr/tests/test_filters.py index f9c9d04434..101f90d1d3 100644 --- a/zarr/tests/test_filters.py +++ b/zarr/tests/test_filters.py @@ -35,7 +35,6 @@ def test_array_with_delta_filter(): data = np.arange(100, dtype=dtype) for compressor in compressors: - # print(repr(compressor)) a = array(data, chunks=10, compressor=compressor, filters=filters) @@ -66,7 +65,6 @@ def 
test_array_with_astype_filter(): data = np.arange(shape, dtype=decode_dtype) for compressor in compressors: - # print(repr(compressor)) a = array(data, chunks=chunks, compressor=compressor, filters=filters) @@ -96,7 +94,6 @@ def test_array_with_scaleoffset_filter(): data = np.linspace(1000, 1001, 34, dtype='f8') for compressor in compressors: - # print(repr(compressor)) a = array(data, chunks=5, compressor=compressor, filters=filters) @@ -125,7 +122,6 @@ def test_array_with_quantize_filter(): data = np.linspace(0, 1, 34, dtype=dtype) for compressor in compressors: - # print(repr(compressor)) a = array(data, chunks=5, compressor=compressor, filters=filters) @@ -152,7 +148,6 @@ def test_array_with_packbits_filter(): data = np.random.randint(0, 2, size=100, dtype=bool) for compressor in compressors: - # print(repr(compressor)) a = array(data, chunks=5, compressor=compressor, filters=filters) @@ -179,7 +174,6 @@ def test_array_with_categorize_filter(): filters = [flt] for compressor in compressors: - # print(repr(compressor)) a = array(data, chunks=5, compressor=compressor, filters=filters) @@ -203,7 +197,6 @@ def test_compressor_as_filter(): if compressor is None: # skip continue - # print(repr(compressor)) # setup filters dtype = 'i8' diff --git a/zarr/tests/test_indexing.py b/zarr/tests/test_indexing.py new file mode 100644 index 0000000000..6400d5d62b --- /dev/null +++ b/zarr/tests/test_indexing.py @@ -0,0 +1,1293 @@ +# -*- coding: utf-8 -*- +from __future__ import absolute_import, print_function, division + + +import numpy as np +from numpy.testing import assert_array_equal +from nose.tools import assert_raises, eq_ as eq + + +from zarr.indexing import (normalize_integer_selection, replace_ellipsis, oindex, oindex_set) +import zarr + + +def test_normalize_integer_selection(): + + eq(1, normalize_integer_selection(1, 100)) + eq(99, normalize_integer_selection(-1, 100)) + with assert_raises(IndexError): + normalize_integer_selection(100, 100) + with 
assert_raises(IndexError): + normalize_integer_selection(1000, 100) + with assert_raises(IndexError): + normalize_integer_selection(-1000, 100) + + +def test_replace_ellipsis(): + + # 1D, single item + eq((0,), replace_ellipsis(0, (100,))) + + # 1D + eq((slice(None),), replace_ellipsis(Ellipsis, (100,))) + eq((slice(None),), replace_ellipsis(slice(None), (100,))) + eq((slice(None, 100),), replace_ellipsis(slice(None, 100), (100,))) + eq((slice(0, None),), replace_ellipsis(slice(0, None), (100,))) + eq((slice(None),), replace_ellipsis((slice(None), Ellipsis), (100,))) + eq((slice(None),), replace_ellipsis((Ellipsis, slice(None)), (100,))) + + # 2D, single item + eq((0, 0), replace_ellipsis((0, 0), (100, 100))) + eq((-1, 1), replace_ellipsis((-1, 1), (100, 100))) + + # 2D, single col/row + eq((0, slice(None)), replace_ellipsis((0, slice(None)), (100, 100))) + eq((0, slice(None)), replace_ellipsis((0,), (100, 100))) + eq((slice(None), 0), replace_ellipsis((slice(None), 0), (100, 100))) + + # 2D slice + eq((slice(None), slice(None)), + replace_ellipsis(Ellipsis, (100, 100))) + eq((slice(None), slice(None)), + replace_ellipsis(slice(None), (100, 100))) + eq((slice(None), slice(None)), + replace_ellipsis((slice(None), slice(None)), (100, 100))) + eq((slice(None), slice(None)), + replace_ellipsis((Ellipsis, slice(None)), (100, 100))) + eq((slice(None), slice(None)), + replace_ellipsis((slice(None), Ellipsis), (100, 100))) + eq((slice(None), slice(None)), + replace_ellipsis((slice(None), Ellipsis, slice(None)), (100, 100))) + eq((slice(None), slice(None)), + replace_ellipsis((Ellipsis, slice(None), slice(None)), (100, 100))) + eq((slice(None), slice(None)), + replace_ellipsis((slice(None), slice(None), Ellipsis), (100, 100))) + + +def test_get_basic_selection_0d(): + + # setup + a = np.array(42) + z = zarr.create(shape=a.shape, dtype=a.dtype, fill_value=None) + z[...] 
= a + + assert_array_equal(a, z.get_basic_selection(Ellipsis)) + assert_array_equal(a, z[...]) + eq(42, z.get_basic_selection(())) + eq(42, z[()]) + + # test out param + b = np.zeros_like(a) + z.get_basic_selection(Ellipsis, out=b) + assert_array_equal(a, b) + + # test structured array + value = (b'aaa', 1, 4.2) + a = np.array(value, dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + z = zarr.create(shape=a.shape, dtype=a.dtype, fill_value=None) + z[()] = value + assert_array_equal(a, z.get_basic_selection(Ellipsis)) + assert_array_equal(a, z[...]) + eq(a[()], z.get_basic_selection(())) + eq(a[()], z[()]) + eq(b'aaa', z.get_basic_selection((), fields='foo')) + eq(b'aaa', z['foo']) + eq(a[['foo', 'bar']], z.get_basic_selection((), fields=['foo', 'bar'])) + eq(a[['foo', 'bar']], z['foo', 'bar']) + # test out param + b = np.zeros_like(a) + z.get_basic_selection(Ellipsis, out=b) + assert_array_equal(a, b) + c = np.zeros_like(a[['foo', 'bar']]) + z.get_basic_selection(Ellipsis, out=c, fields=['foo', 'bar']) + assert_array_equal(a[['foo', 'bar']], c) + + +basic_selections_1d = [ + # single value + 42, + -1, + # slices + slice(0, 1050), + slice(50, 150), + slice(0, 2000), + slice(-150, -50), + slice(-2000, 2000), + slice(0, 0), # empty result + slice(-1, 0), # empty result + # total selections + slice(None), + Ellipsis, + (), + (Ellipsis, slice(None)), + # slice with step + slice(None), + slice(None, None), + slice(None, None, 1), + slice(None, None, 10), + slice(None, None, 100), + slice(None, None, 1000), + slice(None, None, 10000), + slice(0, 1050), + slice(0, 1050, 1), + slice(0, 1050, 10), + slice(0, 1050, 100), + slice(0, 1050, 1000), + slice(0, 1050, 10000), + slice(1, 31, 3), + slice(1, 31, 30), + slice(1, 31, 300), + slice(81, 121, 3), + slice(81, 121, 30), + slice(81, 121, 300), + slice(50, 150), + slice(50, 150, 1), + slice(50, 150, 10), +] + + +basic_selections_1d_bad = [ + # only positive step supported + slice(None, None, -1), + slice(None, None, -10), + 
slice(None, None, -100), + slice(None, None, -1000), + slice(None, None, -10000), + slice(1050, -1, -1), + slice(1050, -1, -10), + slice(1050, -1, -100), + slice(1050, -1, -1000), + slice(1050, -1, -10000), + slice(1050, 0, -1), + slice(1050, 0, -10), + slice(1050, 0, -100), + slice(1050, 0, -1000), + slice(1050, 0, -10000), + slice(150, 50, -1), + slice(150, 50, -10), + slice(31, 1, -3), + slice(121, 81, -3), + slice(-1, 0, -1), + # bad stuff + 2.3, + 'foo', + b'xxx', + None, + (0, 0), + (slice(None), slice(None)), +] + + +def _test_get_basic_selection(a, z, selection): + expect = a[selection] + actual = z.get_basic_selection(selection) + assert_array_equal(expect, actual) + actual = z[selection] + assert_array_equal(expect, actual) + + +# noinspection PyStatementEffect +def test_get_basic_selection_1d(): + + # setup + a = np.arange(1050, dtype=int) + z = zarr.create(shape=a.shape, chunks=100, dtype=a.dtype) + z[:] = a + + for selection in basic_selections_1d: + _test_get_basic_selection(a, z, selection) + + bad_selections = basic_selections_1d_bad + [ + [0, 1], # fancy indexing + ] + for selection in bad_selections: + with assert_raises(IndexError): + z.get_basic_selection(selection) + with assert_raises(IndexError): + z[selection] + + +basic_selections_2d = [ + # single row + 42, + -1, + (42, slice(None)), + (-1, slice(None)), + # single col + (slice(None), 4), + (slice(None), -1), + # row slices + slice(None), + slice(0, 1000), + slice(250, 350), + slice(0, 2000), + slice(-350, -250), + slice(0, 0), # empty result + slice(-1, 0), # empty result + slice(-2000, 0), + slice(-2000, 2000), + # 2D slices + (slice(None), slice(1, 5)), + (slice(250, 350), slice(None)), + (slice(250, 350), slice(1, 5)), + (slice(250, 350), slice(-5, -1)), + (slice(250, 350), slice(-50, 50)), + (slice(250, 350, 10), slice(1, 5)), + (slice(250, 350), slice(1, 5, 2)), + (slice(250, 350, 33), slice(1, 5, 3)), + # total selections + (slice(None), slice(None)), + Ellipsis, + (), + (Ellipsis, 
slice(None)), + (Ellipsis, slice(None), slice(None)), +] + + +basic_selections_2d_bad = [ + # bad stuff + 2.3, + 'foo', + b'xxx', + None, + (2.3, slice(None)), + # only positive step supported + slice(None, None, -1), + (slice(None, None, -1), slice(None)), + (0, 0, 0), + (slice(None), slice(None), slice(None)), +] + + +# noinspection PyStatementEffect +def test_get_basic_selection_2d(): + + # setup + a = np.arange(10000, dtype=int).reshape(1000, 10) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + z[:] = a + + for selection in basic_selections_2d: + _test_get_basic_selection(a, z, selection) + + bad_selections = basic_selections_2d_bad + [ + # integer arrays + [0, 1], + ([0, 1], [0, 1]), + (slice(None), [0, 1]), + ] + for selection in bad_selections: + with assert_raises(IndexError): + z.get_basic_selection(selection) + with assert_raises(IndexError): + z[selection] + + +def test_set_basic_selection_0d(): + + # setup + v = np.array(42) + a = np.zeros_like(v) + z = zarr.zeros_like(v) + assert_array_equal(a, z) + + # tests + z.set_basic_selection(Ellipsis, v) + assert_array_equal(v, z) + z[...] = 0 + assert_array_equal(a, z) + z[...] = v + assert_array_equal(v, z) + + # test structured array + value = (b'aaa', 1, 4.2) + v = np.array(value, dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + a = np.zeros_like(v) + z = zarr.create(shape=a.shape, dtype=a.dtype, fill_value=None) + + # tests + z.set_basic_selection(Ellipsis, v) + assert_array_equal(v, z) + z.set_basic_selection(Ellipsis, a) + assert_array_equal(a, z) + z[...] = v + assert_array_equal(v, z) + z[...] 
= a + assert_array_equal(a, z) + # with fields + z.set_basic_selection(Ellipsis, v['foo'], fields='foo') + eq(v['foo'], z['foo']) + eq(a['bar'], z['bar']) + eq(a['baz'], z['baz']) + z['bar'] = v['bar'] + eq(v['foo'], z['foo']) + eq(v['bar'], z['bar']) + eq(a['baz'], z['baz']) + # multiple field assignment not supported + with assert_raises(IndexError): + z.set_basic_selection(Ellipsis, v[['foo', 'bar']], fields=['foo', 'bar']) + with assert_raises(IndexError): + z[..., 'foo', 'bar'] = v[['foo', 'bar']] + + +def _test_get_orthogonal_selection(a, z, selection): + expect = oindex(a, selection) + actual = z.get_orthogonal_selection(selection) + assert_array_equal(expect, actual) + actual = z.oindex[selection] + assert_array_equal(expect, actual) + + +# noinspection PyStatementEffect +def test_get_orthogonal_selection_1d_bool(): + + # setup + a = np.arange(1050, dtype=int) + z = zarr.create(shape=a.shape, chunks=100, dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + ix = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + _test_get_orthogonal_selection(a, z, ix) + + # test errors + with assert_raises(IndexError): + z.oindex[np.zeros(50, dtype=bool)] # too short + with assert_raises(IndexError): + z.oindex[np.zeros(2000, dtype=bool)] # too long + with assert_raises(IndexError): + z.oindex[[[True, False], [False, True]]] # too many dimensions + + +# noinspection PyStatementEffect +def test_get_orthogonal_selection_1d_int(): + + # setup + a = np.arange(1050, dtype=int) + z = zarr.create(shape=a.shape, chunks=100, dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 2, 0.5, 0.1, 0.01: + # unordered + ix = np.random.choice(a.shape[0], size=int(a.shape[0] * p), replace=True) + _test_get_orthogonal_selection(a, z, ix) + # increasing + ix.sort() + _test_get_orthogonal_selection(a, z, ix) + # decreasing + ix = ix[::-1] + 
_test_get_orthogonal_selection(a, z, ix) + + selections = basic_selections_1d + [ + # test wraparound + [0, 3, 10, -23, -12, -1], + # explicit test not sorted + [3, 105, 23, 127], + + ] + for selection in selections: + _test_get_orthogonal_selection(a, z, selection) + + bad_selections = basic_selections_1d_bad + [ + [a.shape[0] + 1], # out of bounds + [-(a.shape[0] + 1)], # out of bounds + [[2, 4], [6, 8]], # too many dimensions + ] + for selection in bad_selections: + with assert_raises(IndexError): + z.get_orthogonal_selection(selection) + with assert_raises(IndexError): + z.oindex[selection] + + +def _test_get_orthogonal_selection_2d(a, z, ix0, ix1): + selections = [ + # index both axes with array + (ix0, ix1), + # mixed indexing with array / slice + (ix0, slice(1, 5)), + (ix0, slice(1, 5, 2)), + (slice(250, 350), ix1), + (slice(250, 350, 10), ix1), + # mixed indexing with array / int + (ix0, 4), + (42, ix1), + ] + for selection in selections: + _test_get_orthogonal_selection(a, z, selection) + + +# noinspection PyStatementEffect +def test_get_orthogonal_selection_2d(): + + # setup + a = np.arange(10000, dtype=int).reshape(1000, 10) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + + # boolean arrays + ix0 = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + ix1 = np.random.binomial(1, 0.5, size=a.shape[1]).astype(bool) + _test_get_orthogonal_selection_2d(a, z, ix0, ix1) + + # mixed int array / bool array + selections = ( + (ix0, np.nonzero(ix1)[0]), + (np.nonzero(ix0)[0], ix1), + ) + for selection in selections: + _test_get_orthogonal_selection(a, z, selection) + + # integer arrays + ix0 = np.random.choice(a.shape[0], size=int(a.shape[0] * p), replace=True) + ix1 = np.random.choice(a.shape[1], size=int(a.shape[1] * .5), replace=True) + _test_get_orthogonal_selection_2d(a, z, ix0, ix1) + ix0.sort() + ix1.sort() + 
_test_get_orthogonal_selection_2d(a, z, ix0, ix1) + ix0 = ix0[::-1] + ix1 = ix1[::-1] + _test_get_orthogonal_selection_2d(a, z, ix0, ix1) + + for selection in basic_selections_2d: + _test_get_orthogonal_selection(a, z, selection) + + for selection in basic_selections_2d_bad: + with assert_raises(IndexError): + z.get_orthogonal_selection(selection) + with assert_raises(IndexError): + z.oindex[selection] + + +def _test_get_orthogonal_selection_3d(a, z, ix0, ix1, ix2): + selections = [ + # single value + (84, 42, 4), + (-1, -1, -1), + # index all axes with array + (ix0, ix1, ix2), + # mixed indexing with single array / slices + (ix0, slice(15, 25), slice(1, 5)), + (slice(50, 70), ix1, slice(1, 5)), + (slice(50, 70), slice(15, 25), ix2), + (ix0, slice(15, 25, 5), slice(1, 5, 2)), + (slice(50, 70, 3), ix1, slice(1, 5, 2)), + (slice(50, 70, 3), slice(15, 25, 5), ix2), + # mixed indexing with single array / ints + (ix0, 42, 4), + (84, ix1, 4), + (84, 42, ix2), + # mixed indexing with single array / slice / int + (ix0, slice(15, 25), 4), + (42, ix1, slice(1, 5)), + (slice(50, 70), 42, ix2), + # mixed indexing with two array / slice + (ix0, ix1, slice(1, 5)), + (slice(50, 70), ix1, ix2), + (ix0, slice(15, 25), ix2), + # mixed indexing with two array / integer + (ix0, ix1, 4), + (42, ix1, ix2), + (ix0, 42, ix2), + ] + for selection in selections: + _test_get_orthogonal_selection(a, z, selection) + + +def test_get_orthogonal_selection_3d(): + + # setup + a = np.arange(100000, dtype=int).reshape(200, 50, 10) + z = zarr.create(shape=a.shape, chunks=(60, 20, 3), dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + + # boolean arrays + ix0 = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + ix1 = np.random.binomial(1, .5, size=a.shape[1]).astype(bool) + ix2 = np.random.binomial(1, .5, size=a.shape[2]).astype(bool) + _test_get_orthogonal_selection_3d(a, z, ix0, ix1, ix2) + + # integer arrays + ix0 = 
np.random.choice(a.shape[0], size=int(a.shape[0] * p), replace=True) + ix1 = np.random.choice(a.shape[1], size=int(a.shape[1] * .5), replace=True) + ix2 = np.random.choice(a.shape[2], size=int(a.shape[2] * .5), replace=True) + _test_get_orthogonal_selection_3d(a, z, ix0, ix1, ix2) + ix0.sort() + ix1.sort() + ix2.sort() + _test_get_orthogonal_selection_3d(a, z, ix0, ix1, ix2) + ix0 = ix0[::-1] + ix1 = ix1[::-1] + ix2 = ix2[::-1] + _test_get_orthogonal_selection_3d(a, z, ix0, ix1, ix2) + + +def test_orthogonal_indexing_edge_cases(): + + a = np.arange(6).reshape(1, 2, 3) + z = zarr.create(shape=a.shape, chunks=(1, 2, 3), dtype=a.dtype) + z[:] = a + + expect = oindex(a, (0, slice(None), [0, 1, 2])) + actual = z.oindex[0, :, [0, 1, 2]] + assert_array_equal(expect, actual) + + expect = oindex(a, (0, slice(None), [True, True, True])) + actual = z.oindex[0, :, [True, True, True]] + assert_array_equal(expect, actual) + + +def _test_set_orthogonal_selection(v, a, z, selection): + for value in 42, oindex(v, selection), oindex(v, selection).tolist(): + if isinstance(value, list) and value == []: + # skip these cases as cannot preserve all dimensions + continue + # setup expectation + a[:] = 0 + oindex_set(a, selection, value) + # long-form API + z[:] = 0 + z.set_orthogonal_selection(selection, value) + assert_array_equal(a, z[:]) + # short-form API + z[:] = 0 + z.oindex[selection] = value + assert_array_equal(a, z[:]) + + +def test_set_orthogonal_selection_1d(): + + # setup + v = np.arange(1050, dtype=int) + a = np.empty(v.shape, dtype=int) + z = zarr.create(shape=a.shape, chunks=100, dtype=a.dtype) + + # test with different degrees of sparseness + np.random.seed(42) + for p in 0.5, 0.1, 0.01: + + # boolean arrays + ix = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + _test_set_orthogonal_selection(v, a, z, ix) + + # integer arrays + ix = np.random.choice(a.shape[0], size=int(a.shape[0] * p), replace=True) + _test_set_orthogonal_selection(v, a, z, ix) + ix.sort() + 
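The write path exercised by the `set_orthogonal_selection` tests above has the same outer-product shape. As a sketch in plain NumPy (an analogue of what `z.oindex[selection] = value` does for a Zarr array), assigning through an orthogonal selection touches exactly the cross product of the per-axis indices:

```python
import numpy as np

# Writing through an orthogonal selection assigns into the outer product of
# the per-axis selections; for a NumPy array this is a[np.ix_(...)] = value.
a = np.zeros((6, 4), dtype=int)
ix0 = [1, 3]
ix1 = [0, 2]

a[np.ix_(ix0, ix1)] = 42
assert a[1, 0] == 42 and a[3, 2] == 42
assert a.sum() == 42 * 4  # exactly the 2x2 outer product was written
```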
_test_set_orthogonal_selection(v, a, z, ix) + ix = ix[::-1] + _test_set_orthogonal_selection(v, a, z, ix) + + # basic selections + for selection in basic_selections_1d: + _test_set_orthogonal_selection(v, a, z, selection) + + +def _test_set_orthogonal_selection_2d(v, a, z, ix0, ix1): + + selections = [ + # index both axes with array + (ix0, ix1), + # mixed indexing with array / slice or int + (ix0, slice(1, 5)), + (slice(250, 350), ix1), + (ix0, 4), + (42, ix1), + ] + for selection in selections: + _test_set_orthogonal_selection(v, a, z, selection) + + +def test_set_orthogonal_selection_2d(): + + # setup + v = np.arange(10000, dtype=int).reshape(1000, 10) + a = np.empty_like(v) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + + # boolean arrays + ix0 = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + ix1 = np.random.binomial(1, .5, size=a.shape[1]).astype(bool) + _test_set_orthogonal_selection_2d(v, a, z, ix0, ix1) + + # integer arrays + ix0 = np.random.choice(a.shape[0], size=int(a.shape[0] * p), replace=True) + ix1 = np.random.choice(a.shape[1], size=int(a.shape[1] * .5), replace=True) + _test_set_orthogonal_selection_2d(v, a, z, ix0, ix1) + ix0.sort() + ix1.sort() + _test_set_orthogonal_selection_2d(v, a, z, ix0, ix1) + ix0 = ix0[::-1] + ix1 = ix1[::-1] + _test_set_orthogonal_selection_2d(v, a, z, ix0, ix1) + + for selection in basic_selections_2d: + _test_set_orthogonal_selection(v, a, z, selection) + + +def _test_set_orthogonal_selection_3d(v, a, z, ix0, ix1, ix2): + + selections = ( + # single value + (84, 42, 4), + (-1, -1, -1), + # index all axes with bool array + (ix0, ix1, ix2), + # mixed indexing with single bool array / slice or int + (ix0, slice(15, 25), slice(1, 5)), + (slice(50, 70), ix1, slice(1, 5)), + (slice(50, 70), slice(15, 25), ix2), + (ix0, 42, 4), + (84, ix1, 4), + (84, 42, ix2), + (ix0, slice(15, 25), 4), + (slice(50, 
70), ix1, 4), + (slice(50, 70), 42, ix2), + # indexing with two arrays / slice + (ix0, ix1, slice(1, 5)), + # indexing with two arrays / integer + (ix0, ix1, 4), + ) + for selection in selections: + _test_set_orthogonal_selection(v, a, z, selection) + + +def test_set_orthogonal_selection_3d(): + + # setup + v = np.arange(100000, dtype=int).reshape(200, 50, 10) + a = np.empty_like(v) + z = zarr.create(shape=a.shape, chunks=(60, 20, 3), dtype=a.dtype) + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + + # boolean arrays + ix0 = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + ix1 = np.random.binomial(1, .5, size=a.shape[1]).astype(bool) + ix2 = np.random.binomial(1, .5, size=a.shape[2]).astype(bool) + _test_set_orthogonal_selection_3d(v, a, z, ix0, ix1, ix2) + + # integer arrays + ix0 = np.random.choice(a.shape[0], size=int(a.shape[0] * p), replace=True) + ix1 = np.random.choice(a.shape[1], size=int(a.shape[1] * .5), replace=True) + ix2 = np.random.choice(a.shape[2], size=int(a.shape[2] * .5), replace=True) + _test_set_orthogonal_selection_3d(v, a, z, ix0, ix1, ix2) + + # sorted increasing + ix0.sort() + ix1.sort() + ix2.sort() + _test_set_orthogonal_selection_3d(v, a, z, ix0, ix1, ix2) + + # sorted decreasing + ix0 = ix0[::-1] + ix1 = ix1[::-1] + ix2 = ix2[::-1] + _test_set_orthogonal_selection_3d(v, a, z, ix0, ix1, ix2) + + +def _test_get_coordinate_selection(a, z, selection): + expect = a[selection] + actual = z.get_coordinate_selection(selection) + assert_array_equal(expect, actual) + actual = z.vindex[selection] + assert_array_equal(expect, actual) + + +coordinate_selections_1d_bad = [ + # slice not supported + slice(5, 15), + slice(None), + Ellipsis, + # bad stuff + 2.3, + 'foo', + b'xxx', + None, + (0, 0), + (slice(None), slice(None)), +] + + +# noinspection PyStatementEffect +def test_get_coordinate_selection_1d(): + + # setup + a = np.arange(1050, dtype=int) + z = zarr.create(shape=a.shape, chunks=100, 
dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 2, 0.5, 0.1, 0.01: + n = int(a.size * p) + ix = np.random.choice(a.shape[0], size=n, replace=True) + _test_get_coordinate_selection(a, z, ix) + ix.sort() + _test_get_coordinate_selection(a, z, ix) + ix = ix[::-1] + _test_get_coordinate_selection(a, z, ix) + + selections = [ + # test single item + 42, + -1, + # test wraparound + [0, 3, 10, -23, -12, -1], + # test out of order + [3, 105, 23, 127], # not monotonically increasing + # test multi-dimensional selection + np.array([[2, 4], [6, 8]]), + ] + for selection in selections: + _test_get_coordinate_selection(a, z, selection) + + # test errors + bad_selections = coordinate_selections_1d_bad + [ + [a.shape[0] + 1], # out of bounds + [-(a.shape[0] + 1)], # out of bounds + ] + for selection in bad_selections: + with assert_raises(IndexError): + z.get_coordinate_selection(selection) + with assert_raises(IndexError): + z.vindex[selection] + + +def test_get_coordinate_selection_2d(): + + # setup + a = np.arange(10000, dtype=int).reshape(1000, 10) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 2, 0.5, 0.1, 0.01: + n = int(a.size * p) + ix0 = np.random.choice(a.shape[0], size=n, replace=True) + ix1 = np.random.choice(a.shape[1], size=n, replace=True) + selections = [ + # single value + (42, 4), + (-1, -1), + # index both axes with array + (ix0, ix1), + # mixed indexing with array / int + (ix0, 4), + (42, ix1), + (42, 4), + ] + for selection in selections: + _test_get_coordinate_selection(a, z, selection) + + # not monotonically increasing (first dim) + ix0 = [3, 3, 4, 2, 5] + ix1 = [1, 3, 5, 7, 9] + _test_get_coordinate_selection(a, z, (ix0, ix1)) + + # not monotonically increasing (second dim) + ix0 = [1, 1, 2, 2, 5] + ix1 = [1, 3, 2, 1, 0] + _test_get_coordinate_selection(a, z, (ix0, ix1)) + + # 
multi-dimensional selection + ix0 = np.array([[1, 1, 2], + [2, 2, 5]]) + ix1 = np.array([[1, 3, 2], + [1, 0, 0]]) + _test_get_coordinate_selection(a, z, (ix0, ix1)) + + with assert_raises(IndexError): + selection = slice(5, 15), [1, 2, 3] + z.get_coordinate_selection(selection) + with assert_raises(IndexError): + selection = [1, 2, 3], slice(5, 15) + z.get_coordinate_selection(selection) + with assert_raises(IndexError): + selection = Ellipsis, [1, 2, 3] + z.get_coordinate_selection(selection) + with assert_raises(IndexError): + selection = Ellipsis + z.get_coordinate_selection(selection) + + +def _test_set_coordinate_selection(v, a, z, selection): + for value in 42, v[selection], v[selection].tolist(): + # setup expectation + a[:] = 0 + a[selection] = value + # test long-form API + z[:] = 0 + z.set_coordinate_selection(selection, value) + assert_array_equal(a, z[:]) + # test short-form API + z[:] = 0 + z.vindex[selection] = value + assert_array_equal(a, z[:]) + + +def test_set_coordinate_selection_1d(): + + # setup + v = np.arange(1050, dtype=int) + a = np.empty(v.shape, dtype=v.dtype) + z = zarr.create(shape=a.shape, chunks=100, dtype=a.dtype) + + np.random.seed(42) + # test with different degrees of sparseness + for p in 2, 0.5, 0.1, 0.01: + n = int(a.size * p) + ix = np.random.choice(a.shape[0], size=n, replace=True) + _test_set_coordinate_selection(v, a, z, ix) + + # multi-dimensional selection + ix = np.array([[2, 4], [6, 8]]) + _test_set_coordinate_selection(v, a, z, ix) + + for selection in coordinate_selections_1d_bad: + with assert_raises(IndexError): + z.set_coordinate_selection(selection, 42) + with assert_raises(IndexError): + z.vindex[selection] = 42 + + +def test_set_coordinate_selection_2d(): + + # setup + v = np.arange(10000, dtype=int).reshape(1000, 10) + a = np.empty_like(v) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + + np.random.seed(42) + # test with different degrees of sparseness + for p in 2, 0.5, 0.1, 0.01: + n = 
int(a.size * p) + ix0 = np.random.choice(a.shape[0], size=n, replace=True) + ix1 = np.random.choice(a.shape[1], size=n, replace=True) + + selections = ( + (42, 4), + (-1, -1), + # index both axes with array + (ix0, ix1), + # mixed indexing with array / int + (ix0, 4), + (42, ix1), + ) + for selection in selections: + _test_set_coordinate_selection(v, a, z, selection) + + # multi-dimensional selection + ix0 = np.array([[1, 2, 3], + [4, 5, 6]]) + ix1 = np.array([[1, 3, 2], + [2, 0, 5]]) + _test_set_coordinate_selection(v, a, z, (ix0, ix1)) + + +def _test_get_mask_selection(a, z, selection): + expect = a[selection] + actual = z.get_mask_selection(selection) + assert_array_equal(expect, actual) + actual = z.vindex[selection] + assert_array_equal(expect, actual) + + +mask_selections_1d_bad = [ + # slice not supported + slice(5, 15), + slice(None), + Ellipsis, + # bad stuff + 2.3, + 'foo', + b'xxx', + None, + (0, 0), + (slice(None), slice(None)), +] + + +# noinspection PyStatementEffect +def test_get_mask_selection_1d(): + + # setup + a = np.arange(1050, dtype=int) + z = zarr.create(shape=a.shape, chunks=100, dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + ix = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + _test_get_mask_selection(a, z, ix) + + # test errors + bad_selections = mask_selections_1d_bad + [ + np.zeros(50, dtype=bool), # too short + np.zeros(2000, dtype=bool), # too long + [[True, False], [False, True]], # too many dimensions + ] + for selection in bad_selections: + with assert_raises(IndexError): + z.get_mask_selection(selection) + with assert_raises(IndexError): + z.vindex[selection] + + +# noinspection PyStatementEffect +def test_get_mask_selection_2d(): + + # setup + a = np.arange(10000, dtype=int).reshape(1000, 10) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + z[:] = a + + np.random.seed(42) + # test with different degrees of sparseness + for p in 
0.5, 0.1, 0.01: + ix = np.random.binomial(1, p, size=a.size).astype(bool).reshape(a.shape) + _test_get_mask_selection(a, z, ix) + + # test errors + with assert_raises(IndexError): + z.vindex[np.zeros((1000, 5), dtype=bool)] # too short + with assert_raises(IndexError): + z.vindex[np.zeros((2000, 10), dtype=bool)] # too long + with assert_raises(IndexError): + z.vindex[[True, False]] # wrong no. dimensions + + +def _test_set_mask_selection(v, a, z, selection): + a[:] = 0 + z[:] = 0 + a[selection] = v[selection] + z.set_mask_selection(selection, v[selection]) + assert_array_equal(a, z[:]) + z[:] = 0 + z.vindex[selection] = v[selection] + assert_array_equal(a, z[:]) + + +def test_set_mask_selection_1d(): + + # setup + v = np.arange(1050, dtype=int) + a = np.empty_like(v) + z = zarr.create(shape=a.shape, chunks=100, dtype=a.dtype) + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + ix = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + _test_set_mask_selection(v, a, z, ix) + + for selection in mask_selections_1d_bad: + with assert_raises(IndexError): + z.set_mask_selection(selection, 42) + with assert_raises(IndexError): + z.vindex[selection] = 42 + + +def test_set_mask_selection_2d(): + + # setup + v = np.arange(10000, dtype=int).reshape(1000, 10) + a = np.empty_like(v) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + ix = np.random.binomial(1, p, size=a.size).astype(bool).reshape(a.shape) + _test_set_mask_selection(v, a, z, ix) + + +def test_get_selection_out(): + + # basic selections + a = np.arange(1050) + z = zarr.create(shape=1050, chunks=100, dtype=a.dtype) + z[:] = a + selections = [ + slice(50, 150), + slice(0, 1050), + slice(1, 2), + ] + for selection in selections: + expect = a[selection] + out = zarr.create(shape=expect.shape, chunks=10, dtype=expect.dtype, fill_value=0) + 
z.get_basic_selection(selection, out=out) + assert_array_equal(expect, out[:]) + + with assert_raises(TypeError): + z.get_basic_selection(Ellipsis, out=[]) + + # orthogonal selections + a = np.arange(10000, dtype=int).reshape(1000, 10) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + z[:] = a + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + ix0 = np.random.binomial(1, p, size=a.shape[0]).astype(bool) + ix1 = np.random.binomial(1, .5, size=a.shape[1]).astype(bool) + selections = [ + # index both axes with array + (ix0, ix1), + # mixed indexing with array / slice + (ix0, slice(1, 5)), + (slice(250, 350), ix1), + # mixed indexing with array / int + (ix0, 4), + (42, ix1), + # mixed int array / bool array + (ix0, np.nonzero(ix1)[0]), + (np.nonzero(ix0)[0], ix1), + ] + for selection in selections: + expect = oindex(a, selection) + # out = zarr.create(shape=expect.shape, chunks=10, dtype=expect.dtype, + # fill_value=0) + out = np.zeros(expect.shape, dtype=expect.dtype) + z.get_orthogonal_selection(selection, out=out) + assert_array_equal(expect, out[:]) + + # coordinate selections + a = np.arange(10000, dtype=int).reshape(1000, 10) + z = zarr.create(shape=a.shape, chunks=(300, 3), dtype=a.dtype) + z[:] = a + np.random.seed(42) + # test with different degrees of sparseness + for p in 0.5, 0.1, 0.01: + n = int(a.size * p) + ix0 = np.random.choice(a.shape[0], size=n, replace=True) + ix1 = np.random.choice(a.shape[1], size=n, replace=True) + selections = [ + # index both axes with array + (ix0, ix1), + # mixed indexing with array / int + (ix0, 4), + (42, ix1), + ] + for selection in selections: + expect = a[selection] + out = np.zeros(expect.shape, dtype=expect.dtype) + z.get_coordinate_selection(selection, out=out) + assert_array_equal(expect, out[:]) + + +def test_get_selections_with_fields(): + + a = [('aaa', 1, 4.2), + ('bbb', 2, 8.4), + ('ccc', 3, 12.6)] + a = np.array(a, dtype=[('foo', 'S3'), ('bar', 
'i4'), ('baz', 'f8')]) + z = zarr.create(shape=a.shape, chunks=2, dtype=a.dtype, fill_value=None) + z[:] = a + + fields_fixture = [ + 'foo', + ['foo'], + ['foo', 'bar'], + ['foo', 'baz'], + ['bar', 'baz'], + ['foo', 'bar', 'baz'], + ['bar', 'foo'], + ['baz', 'bar', 'foo'], + ] + + for fields in fields_fixture: + + # total selection + expect = a[fields] + actual = z.get_basic_selection(Ellipsis, fields=fields) + assert_array_equal(expect, actual) + # alternative API + if isinstance(fields, str): + actual = z[fields] + assert_array_equal(expect, actual) + elif len(fields) == 2: + actual = z[fields[0], fields[1]] + assert_array_equal(expect, actual) + if isinstance(fields, str): + actual = z[..., fields] + assert_array_equal(expect, actual) + elif len(fields) == 2: + actual = z[..., fields[0], fields[1]] + assert_array_equal(expect, actual) + + # basic selection with slice + expect = a[fields][0:2] + actual = z.get_basic_selection(slice(0, 2), fields=fields) + assert_array_equal(expect, actual) + # alternative API + if isinstance(fields, str): + actual = z[0:2, fields] + assert_array_equal(expect, actual) + elif len(fields) == 2: + actual = z[0:2, fields[0], fields[1]] + assert_array_equal(expect, actual) + + # basic selection with single item + expect = a[fields][1] + actual = z.get_basic_selection(1, fields=fields) + assert_array_equal(expect, actual) + # alternative API + if isinstance(fields, str): + actual = z[1, fields] + assert_array_equal(expect, actual) + elif len(fields) == 2: + actual = z[1, fields[0], fields[1]] + assert_array_equal(expect, actual) + + # orthogonal selection + ix = [0, 2] + expect = a[fields][ix] + actual = z.get_orthogonal_selection(ix, fields=fields) + assert_array_equal(expect, actual) + # alternative API + if isinstance(fields, str): + actual = z.oindex[ix, fields] + assert_array_equal(expect, actual) + elif len(fields) == 2: + actual = z.oindex[ix, fields[0], fields[1]] + assert_array_equal(expect, actual) + + # coordinate selection + 
ix = [0, 2] + expect = a[fields][ix] + actual = z.get_coordinate_selection(ix, fields=fields) + assert_array_equal(expect, actual) + # alternative API + if isinstance(fields, str): + actual = z.vindex[ix, fields] + assert_array_equal(expect, actual) + elif len(fields) == 2: + actual = z.vindex[ix, fields[0], fields[1]] + assert_array_equal(expect, actual) + + # mask selection + ix = [True, False, True] + expect = a[fields][ix] + actual = z.get_mask_selection(ix, fields=fields) + assert_array_equal(expect, actual) + # alternative API + if isinstance(fields, str): + actual = z.vindex[ix, fields] + assert_array_equal(expect, actual) + elif len(fields) == 2: + actual = z.vindex[ix, fields[0], fields[1]] + assert_array_equal(expect, actual) + + # missing/bad fields + with assert_raises(IndexError): + z.get_basic_selection(Ellipsis, fields=['notafield']) + with assert_raises(IndexError): + z.get_basic_selection(Ellipsis, fields=slice(None)) + + +def test_set_selections_with_fields(): + + v = [('aaa', 1, 4.2), + ('bbb', 2, 8.4), + ('ccc', 3, 12.6)] + v = np.array(v, dtype=[('foo', 'S3'), ('bar', 'i4'), ('baz', 'f8')]) + a = np.empty_like(v) + z = zarr.empty_like(v, chunks=2) + + fields_fixture = [ + 'foo', + [], + ['foo'], + ['foo', 'bar'], + ['foo', 'baz'], + ['bar', 'baz'], + ['foo', 'bar', 'baz'], + ['bar', 'foo'], + ['baz', 'bar', 'foo'], + ] + + for fields in fields_fixture: + + # currently multi-field assignment is not supported in numpy, so we won't support it either + if isinstance(fields, list) and len(fields) > 1: + with assert_raises(IndexError): + z.set_basic_selection(Ellipsis, v, fields=fields) + with assert_raises(IndexError): + z.set_orthogonal_selection([0, 2], v, fields=fields) + with assert_raises(IndexError): + z.set_coordinate_selection([0, 2], v, fields=fields) + with assert_raises(IndexError): + z.set_mask_selection([True, False, True], v, fields=fields) + + else: + + if isinstance(fields, list) and len(fields) == 1: + # work around numpy does not 
support multi-field assignment even if there is only + # one field + key = fields[0] + elif isinstance(fields, list) and len(fields) == 0: + # work around numpy ambiguity about what is a field selection + key = Ellipsis + else: + key = fields + + # setup expectation + a[:] = ('', 0, 0) + z[:] = ('', 0, 0) + assert_array_equal(a, z[:]) + a[key] = v[key] + # total selection + z.set_basic_selection(Ellipsis, v[key], fields=fields) + assert_array_equal(a, z[:]) + + # basic selection with slice + a[:] = ('', 0, 0) + z[:] = ('', 0, 0) + a[key][0:2] = v[key][0:2] + z.set_basic_selection(slice(0, 2), v[key][0:2], fields=fields) + assert_array_equal(a, z[:]) + + # orthogonal selection + a[:] = ('', 0, 0) + z[:] = ('', 0, 0) + ix = [0, 2] + a[key][ix] = v[key][ix] + z.set_orthogonal_selection(ix, v[key][ix], fields=fields) + assert_array_equal(a, z[:]) + + # coordinate selection + a[:] = ('', 0, 0) + z[:] = ('', 0, 0) + ix = [0, 2] + a[key][ix] = v[key][ix] + z.set_coordinate_selection(ix, v[key][ix], fields=fields) + assert_array_equal(a, z[:]) + + # mask selection + a[:] = ('', 0, 0) + z[:] = ('', 0, 0) + ix = [True, False, True] + a[key][ix] = v[key][ix] + z.set_mask_selection(ix, v[key][ix], fields=fields) + assert_array_equal(a, z[:]) diff --git a/zarr/tests/test_meta.py b/zarr/tests/test_meta.py index d1f1814cf2..3269760b1d 100644 --- a/zarr/tests/test_meta.py +++ b/zarr/tests/test_meta.py @@ -74,7 +74,7 @@ def test_encode_decode_array_2(): chunks=(10, 10), dtype=np.dtype([('a', 'i4'), ('b', 'S10')]), compressor=compressor.get_config(), - fill_value=42, + fill_value=b'', order='F', filters=[df.get_config()] ) @@ -89,7 +89,7 @@ def test_encode_decode_array_2(): "blocksize": 0 }, "dtype": [["a", "= length or item < 0: - raise IndexError('index out of bounds: %s' % item) - - return item - - elif isinstance(item, slice): - - # handle slice with step - if item.step is not None and item.step != 1: - raise NotImplementedError('slice with step not implemented') - - # handle 
slice with None bound - start = 0 if item.start is None else item.start - stop = length if item.stop is None else item.stop - - # handle wraparound - if start < 0: - start = length + start - if stop < 0: - stop = length + stop - - # handle zero-length axis - if start == stop == length == 0: - return slice(0, 0) - - # handle out of bounds - if start < 0 or stop < 0: - raise IndexError('index out of bounds: %s, %s' % (start, stop)) - if start >= length: - raise IndexError('index out of bounds: %s, %s' % (start, stop)) - if stop > length: - stop = length - if stop < start: - raise IndexError('index out of bounds: %s, %s' % (start, stop)) - - return slice(start, stop) - - else: - raise TypeError('expected integer or slice, found: %r' % item) - - -# noinspection PyTypeChecker -def normalize_array_selection(item, shape): - """Convenience function to normalize a selection within an array with - the given `shape`.""" - - # normalize item - if isinstance(item, numbers.Integral): - item = (int(item),) - elif isinstance(item, slice): - item = (item,) - elif item == Ellipsis: - item = (slice(None),) - - # handle tuple of indices/slices - if isinstance(item, tuple): - - # determine start and stop indices for all axes - selection = tuple(normalize_axis_selection(i, l) - for i, l in zip(item, shape)) - - # fill out selection if not completely specified - if len(selection) < len(shape): - selection += tuple(slice(0, l) for l in shape[len(selection):]) - - return selection - - else: - raise TypeError('expected indices or slice, found: %r' % item) - - -def get_chunk_range(selection, chunks): - """Convenience function to get a range over all chunk indices, - for iterating over chunks.""" - chunk_range = [range(s.start//l, int(np.ceil(s.stop/l))) - if isinstance(s, slice) - else range(s//l, (s//l)+1) - for s, l in zip(selection, chunks)] - return chunk_range - - def normalize_resize_args(old_shape, *args): # normalize new shape argument @@ -270,6 +177,38 @@ def normalize_order(order): 
return order +def normalize_fill_value(fill_value, dtype): + + if fill_value is None: + # no fill value + pass + + elif fill_value == 0 and dtype.kind == 'V': + # special case because 0 used as default, but cannot be used for structured arrays + fill_value = b'' + + elif dtype.kind == 'U': + # special case unicode because of encoding issues on Windows if passed through numpy + # https://github.com/alimanfoo/zarr/pull/172#issuecomment-343782713 + + if PY2 and isinstance(fill_value, binary_type): # pragma: py3 no cover + # this is OK on PY2, can be written as JSON + pass + + elif not isinstance(fill_value, text_type): + raise ValueError('fill_value {!r} is not valid for dtype {}; must be a unicode string' + .format(fill_value, dtype)) + + else: + try: + fill_value = np.array(fill_value, dtype=dtype)[()] + except Exception as e: + # re-raise with our own error message to be helpful + raise ValueError('fill_value {!r} is not valid for dtype {}; nested exception: {}' + .format(fill_value, dtype, e)) + return fill_value + + def normalize_storage_path(path): # handle bytes @@ -365,3 +304,12 @@ def __repr__(self): def _repr_html_(self): items = self.obj.info_items() return info_html_report(items) + + +def check_array_shape(param, array, shape): + if not hasattr(array, 'shape'): + raise TypeError('parameter {!r}: expected an array-like object, got {!r}' + .format(param, type(array))) + if array.shape != shape: + raise ValueError('parameter {!r}: expected array with shape {!r}, got {!r}' + .format(param, shape, array.shape))
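The coordinate and mask selections tested in this diff follow NumPy fancy-indexing semantics, which is what the tests compare against (`expect = a[selection]`). A small NumPy sketch of the two behaviours that `z.vindex` / `get_coordinate_selection` / `get_mask_selection` mirror:

```python
import numpy as np

# Coordinate (point) selection: the index arrays are broadcast against each
# other and each (ix0[i], ix1[i]) pair picks out a single element.
a = np.arange(10000, dtype=int).reshape(1000, 10)
ix0 = [3, 3, 4]
ix1 = [1, 5, 7]
picked = a[ix0, ix1]  # one element per coordinate pair
assert picked.tolist() == [31, 35, 47]

# Mask selection: a boolean array of exactly the array's shape; selected
# elements come back as a flat array in row-major order.
mask = np.zeros(a.shape, dtype=bool)
mask[3, 1] = mask[4, 7] = True
assert a[mask].tolist() == [31, 47]
```

This also explains the error cases in the tests: slices and `Ellipsis` are rejected for coordinate/mask selection because fancy indexing operates purely on index arrays, and a mask must match the array shape exactly.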