Dimension scales #167
Comments
+1
We would also make use of something like this feature in Xarray.
Thanks both for the feedback. If this would be useful to you then I'd be
very happy to discuss adding the feature. It would be good to talk through
some of the implementation details first before working up a PR, there's a
couple of technical points that are not immediately obvious to me.
Regarding the underlying implementation, i.e., how the links between
dimension scales and arrays are stored, I had imagined we would do
something analogous to how this is implemented in HDF5 using attributes, as
described here: https://support.hdfgroup.org/HDF5/doc/HL/H5DS_Spec.pdf
I think this means the following.
Dimensions of an array can be labelled, and this can be done without
attaching dimension scales. In HDF5 this is done by setting an attribute
DIMENSION_LABELLIST=<sequence of strings> on the array. In h5py the API (
http://docs.h5py.org/en/latest/high/dims.html) seems fine, e.g., to set a
label on the first dimension of array 'a':
a.dims[0].label = 'foo'
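For illustration, here is a minimal sketch of how dimension labels could be stored in Zarr under this scheme. A plain dict stands in for an array's `.zattrs`, and the attribute name simply mirrors the HDF5 convention; nothing here is a settled Zarr API.

```python
# Sketch only: a plain dict stands in for a Zarr array's .zattrs.
# The attribute name DIMENSION_LABELLIST mirrors the HDF5 spec and is
# an assumption, not an agreed Zarr convention.

def set_dim_label(attrs, ndim, dim, label):
    """Label one dimension of an array via its attributes."""
    labels = attrs.get('DIMENSION_LABELLIST', [''] * ndim)
    labels[dim] = label
    attrs['DIMENSION_LABELLIST'] = labels

attrs = {}  # attributes of a 2D array 'a'
set_dim_label(attrs, ndim=2, dim=0, label='foo')
print(attrs)  # {'DIMENSION_LABELLIST': ['foo', '']}
```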
Creating a dimension scale (i.e., converting an existing array to a
dimension scale) is done in HDF5 by setting an attribute
CLASS=DIMENSION_SCALE on the array. Optionally, the dimension scale may
also be given a label, in which case an attribute NAME=<label> is also set.
The spec also describes the optional use of a SUB_CLASS attribute, but I
think we could ignore that for now.
Personally I find the h5py API for doing this to be confusing. It has to be
done via the Dataset.dims property, however it doesn't matter which
dataset, because attaching a scale is a separate step. I think it would be
more natural to have a method like "Array.set_scale(label)" which does the
conversion on the array owning the method. I.e., if 'ds1' is an array to be
converted to a dimension scale:
ds1.set_scale('foo')
We could always support the full h5py API too for compatibility, e.g., if
'a' is an arbitrary array:
a.dims.create_scale(ds1, 'foo')
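In attribute terms, the proposed `set_scale` would amount to something like the following sketch, again with a dict standing in for the array's `.zattrs` (the function name follows the suggestion above and is hypothetical):

```python
# Sketch only: converting an array to a dimension scale by setting
# attributes, following the HDF5 convention (CLASS, and optionally
# NAME). 'attrs' stands in for the array's .zattrs.

def set_scale(attrs, label=None):
    attrs['CLASS'] = 'DIMENSION_SCALE'
    if label is not None:
        attrs['NAME'] = label

ds1_attrs = {}
set_scale(ds1_attrs, 'foo')
print(ds1_attrs)  # {'CLASS': 'DIMENSION_SCALE', 'NAME': 'foo'}
```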
Attaching a dimension scale to an array involves two parts in HDF5.
On the array, an attribute DIMENSION_LIST=<sequence of sequence of ref> is
set. The value is a sequence of length equal to the number of dimensions on
the array. Each element of the sequence is a sequence of references to
dimension scales. This allows a dimension to have more than one dimension
scale attached.
On the dimension scale array, an attribute REFERENCE_LIST=<sequence of
(ref, index)> is set. Each element in this sequence contains a reference to
an array to which the dimension scale has been attached, and the index of
the attached dimension within that array.
The h5py API for attaching a dimension scale seems fine to me. E.g., to
attach scale 'ds1' to the first dimension of a dataset 'a':
a.dims[0].attach_scale(ds1)
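To make the two-sided bookkeeping concrete, here is a sketch of what attaching a scale would do to the attributes on both objects, assuming (for the moment, and subject to the path question below) that absolute hierarchy paths are used as references:

```python
# Sketch only: attaching a dimension scale updates both sides of the
# link, as in HDF5. Dicts stand in for .zattrs; hierarchy paths are
# assumed as references, which is one of the open questions below.

def attach_scale(a_attrs, a_ndim, a_path, dim, ds_attrs, ds_path):
    # Forward link on the array: one list of scale references per dim.
    dim_list = a_attrs.get('DIMENSION_LIST', [[] for _ in range(a_ndim)])
    dim_list[dim].append(ds_path)
    a_attrs['DIMENSION_LIST'] = dim_list
    # Back link on the scale: (array reference, dimension index) pairs.
    ref_list = ds_attrs.get('REFERENCE_LIST', [])
    ref_list.append([a_path, dim])
    ds_attrs['REFERENCE_LIST'] = ref_list

a_attrs, ds_attrs = {}, {'CLASS': 'DIMENSION_SCALE'}
attach_scale(a_attrs, 2, '/data/a', 0, ds_attrs, '/scales/ds1')
print(a_attrs['DIMENSION_LIST'])   # [['/scales/ds1'], []]
print(ds_attrs['REFERENCE_LIST'])  # [['/data/a', 0]]
```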
Detaching a scale implies updating the DIMENSION_LIST attribute on the
array, and also updating the REFERENCE_LIST attribute on the dimension
scale array.
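The detach operation is the mirror image; a sketch under the same dict-for-`.zattrs` and paths-as-references assumptions:

```python
# Sketch only: detaching a scale removes the forward link from the
# array's DIMENSION_LIST and the matching back link from the scale's
# REFERENCE_LIST. Paths-as-references is an assumption here.

def detach_scale(a_attrs, dim, ds_attrs, a_path, ds_path):
    a_attrs['DIMENSION_LIST'][dim].remove(ds_path)
    ds_attrs['REFERENCE_LIST'].remove([a_path, dim])

a_attrs = {'DIMENSION_LIST': [['/scales/ds1'], []]}
ds_attrs = {'REFERENCE_LIST': [['/data/a', 0]]}
detach_scale(a_attrs, 0, ds_attrs, '/data/a', '/scales/ds1')
print(a_attrs['DIMENSION_LIST'])   # [[], []]
print(ds_attrs['REFERENCE_LIST'])  # []
```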
Most of the above seems OK to me, in terms of both following the HDF5
implementation and the h5py API, apart from the slightly confusing h5py
API for creating a scale, although I have no experience of using this
myself and I'd be interested in what others think.
I have two main questions about the implementation.
First, creating the links between an array and a dimension scale requires
some form of reference. In HDF5 every object has an object reference which
is a unique identifier, in addition to its path within the HDF5 file. In
Zarr we don't have identifiers for array or group objects, we just use
paths within the hierarchy. So in Zarr, instead of object references, do we
just use the absolute path in the hierarchy? E.g., if a dimension scale
array is at path /scales/ds1 and a 1D array is at path /data/a, when
attaching the scale we store, e.g., DIMENSION_LIST=[['/scales/ds1']] on the
array at /data/a, and REFERENCE_LIST=[['/data/a', 0]] on /scales/ds1?
This would seem the obvious answer, but I have one small concern, which is
that for some storage backends there is no fixed notion of the root of a
hierarchy. I.e., you can create a hierarchy, then re-open at some point
within the hierarchy, which will treat that point as the root. E.g.:
root = zarr.open_group('/data/root')
grp = root.create_group('foo/bar')
new_root = zarr.open_group('/data/root/foo/bar')
...is perfectly valid. Obviously absolute paths will be different when
interacting with root versus new_root. Options for coping with this seem to
be either (1) just don't support this use case for dimension scales, i.e.,
use absolute paths as references and live with this being broken if people
do strange things like re-open a group at the wrong hierarchy level, or (2)
use some kind of relative path as references. Zarr has no notion of
relative paths at the moment, but this could be done.
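To make option (2) concrete, here is a sketch of how relative references could be computed and resolved with the standard library's posixpath module (the example paths are illustrative, not part of any proposal):

```python
# Sketch only: relative references computed with posixpath. The paths
# are illustrative. An array at /foo/bar/data/a (relative to the
# original root) refers to a scale at /foo/bar/scales/ds1.
import posixpath

a_path = '/foo/bar/data/a'
ds_path = '/foo/bar/scales/ds1'

# An absolute reference like '/foo/bar/scales/ds1' breaks if the
# hierarchy is re-opened at '/foo/bar', where the scale's path
# becomes '/scales/ds1'. A reference relative to the referring
# array's parent group is stable under such re-rooting:
rel = posixpath.relpath(ds_path, start=posixpath.dirname(a_path))
print(rel)  # ../scales/ds1

# Resolving the relative reference recovers the original target:
resolved = posixpath.normpath(
    posixpath.join(posixpath.dirname(a_path), rel))
print(resolved == ds_path)  # True
```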
The second question I have is what (if anything) do we do about maintaining
the referential integrity of dimension scale forward and backward links?
E.g., if you delete a dimension scale array, do you also try to update the
DIMENSION_LIST attribute on all arrays to which the scale is attached? Or,
e.g., if you delete an array with an attached scale, do you also try to
update the REFERENCE_LIST attribute on the scale array? This is suggested
in the HDF5 dimension scale spec, however I think this could get messy in
the implementation for a couple of reasons. First, it would require adding
logic into the deletion implementations to check the attributes on any
deleted array prior to deletion, and then try to update any dimension scale
links, which would complicate the code and also add some overhead
especially for remote backends. Also, if this is being done against a
remote backend, there is no guarantee that all operations will succeed,
i.e., it would be entirely possible to end up with some hanging links.
Also, if someone deletes a group then a recursive search would have to be
made of all descendant arrays first and links removed.
Here it seems to me like there are at least three options. (1) Don't try to
maintain referential integrity, i.e., if someone deletes an array or
dimension scale array, let any dimension scale links dangle. (2) Make a
best effort to update all dimension scale links whenever anything is
deleted. (3) Give the user a choice between (1) and (2) via some API switch.
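For option (2), the cleanup on deleting a dimension scale might look something like this sketch, with a dict keyed by array path standing in for the store; against a remote backend any of these updates could still fail partway through, which is exactly the hanging-links concern above:

```python
# Sketch only: best-effort link cleanup (option 2) when deleting a
# dimension scale. A dict keyed by array path stands in for the
# store's attributes; paths-as-references is an assumption.

def delete_scale(attrs_by_path, ds_path):
    ds_attrs = attrs_by_path.pop(ds_path)
    # Walk the scale's back links and drop the matching forward links.
    for a_path, dim in ds_attrs.get('REFERENCE_LIST', []):
        a_attrs = attrs_by_path.get(a_path)
        if a_attrs is not None:
            a_attrs['DIMENSION_LIST'][dim].remove(ds_path)

store = {
    '/scales/ds1': {'CLASS': 'DIMENSION_SCALE',
                    'REFERENCE_LIST': [['/data/a', 0]]},
    '/data/a': {'DIMENSION_LIST': [['/scales/ds1'], []]},
}
delete_scale(store, '/scales/ds1')
print(store)  # {'/data/a': {'DIMENSION_LIST': [[], []]}}
```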
Any thoughts?
--
If I do not respond to an email within a few days and you need a response,
please feel free to resend your email and/or contact me by other means.
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596 or +44 (0)7866 541624
Skype: londonbonsaipurple
Email: [email protected]
Web: http://alimanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
@alimanfoo - thinking about this more, let me clarify how we would likely use this in xarray.
References:
Thanks Joe, need to mull this over a bit.
Was talking with @shoyer about this at scipy2018. Currently
but perhaps could find a home in
as a new field I imagine there will be other packages soon trying to use Whatever is decided, @shoyer suggests we call these the "NetZDF" conventions! 😸
See #276 for a proposed Zarr spec v3, incorporating optional dimension names and the "netzdf" format.
Closing as this is now part of the
Implement the h5py dimension scales API?