Dimension scales #167
Comments
+1
We would also make use of something like this feature in Xarray.
Thanks both for the feedback. If this would be useful to you then I'd be
very happy to discuss adding the feature. It would be good to talk through
some of the implementation details first before working up a PR, there's a
couple of technical points that are not immediately obvious to me.
Regarding the underlying implementation, i.e., how the links between
dimension scales and arrays are stored, I had imagined we would do
something analogous to how this is implemented in HDF5 using attributes, as
described here: https://support.hdfgroup.org/HDF5/doc/HL/H5DS_Spec.pdf
I think this means the following.
Dimensions of an array can be labelled, and this can be done without
attaching dimension scales. In HDF5 this is done by setting an attribute
DIMENSION_LABELLIST=<sequence of strings> on the array. In h5py the API (
http://docs.h5py.org/en/latest/high/dims.html) seems fine, e.g., to set a
label on the first dimension of array 'a':
a.dims[0].label = 'foo'
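For illustration, here is a minimal sketch of how dimension labels could be stored in Zarr under this scheme. A plain dict stands in for an array's `.zattrs`, and the attribute name simply mirrors the HDF5 convention; nothing here is a settled Zarr API.

```python
# Sketch only: a plain dict stands in for a Zarr array's .zattrs.
# The attribute name DIMENSION_LABELLIST mirrors the HDF5 spec and is
# an assumption, not an agreed Zarr convention.

def set_dim_label(attrs, ndim, dim, label):
    """Label one dimension of an array via its attributes."""
    labels = attrs.get('DIMENSION_LABELLIST', [''] * ndim)
    labels[dim] = label
    attrs['DIMENSION_LABELLIST'] = labels

attrs = {}  # attributes of a 2D array 'a'
set_dim_label(attrs, ndim=2, dim=0, label='foo')
print(attrs)  # {'DIMENSION_LABELLIST': ['foo', '']}
```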
Creating a dimension scale (i.e., converting an existing array to a
dimension scale) is done in HDF5 by setting an attribute
CLASS=DIMENSION_SCALE on the array. Optionally, the dimension scale may
also be given a label, in which case an attribute NAME=<label> is also set.
The spec also describes the optional use of a SUB_CLASS attribute, but I
think we could ignore that for now.
Personally I find the h5py API for doing this to be confusing. It has to be
done via the Dataset.dims property, however it doesn't matter which
dataset, because attaching a scale is a separate step. I think it would be
more natural to have a method like "Array.set_scale(label)" which does the
conversion on the array owning the method. I.e., if 'ds1' is an array to be
converted to a dimension scale:
ds1.set_scale('foo')
We could always support the full h5py API too for compatibility, e.g., if
'a' is an arbitrary array:
a.dims.create_scale(ds1, 'foo')
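In attribute terms, the proposed `set_scale` would amount to something like the following sketch, again with a dict standing in for the array's `.zattrs` (the function name follows the suggestion above and is hypothetical):

```python
# Sketch only: converting an array to a dimension scale by setting
# attributes, following the HDF5 convention (CLASS, and optionally
# NAME). 'attrs' stands in for the array's .zattrs.

def set_scale(attrs, label=None):
    attrs['CLASS'] = 'DIMENSION_SCALE'
    if label is not None:
        attrs['NAME'] = label

ds1_attrs = {}
set_scale(ds1_attrs, 'foo')
print(ds1_attrs)  # {'CLASS': 'DIMENSION_SCALE', 'NAME': 'foo'}
```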
Attaching a dimension scale to an array involves two parts in HDF5.
On the array, an attribute DIMENSION_LIST=<sequence of sequence of ref> is
set. The value is a sequence of length equal to the number of dimensions on
the array. Each element of the sequence is a sequence of references to
dimension scales. This allows a dimension to have more than one dimension
scale attached.
On the dimension scale array, an attribute REFERENCE_LIST=<sequence of
(ref, index)> is set. Each element in this sequence contains a reference to
an array to which the dimension scale has been attached, and the index of
the attached dimension within that array.
The h5py API for attaching a dimension scale seems fine to me. E.g., to
attach scale 'ds1' to the first dimension of a dataset 'a':
a.dims[0].attach_scale(ds1)
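To make the two-sided bookkeeping concrete, here is a sketch of what attaching a scale would do to the attributes on both objects, assuming (for the moment, and subject to the path question below) that absolute hierarchy paths are used as references:

```python
# Sketch only: attaching a dimension scale updates both sides of the
# link, as in HDF5. Dicts stand in for .zattrs; hierarchy paths are
# assumed as references, which is one of the open questions below.

def attach_scale(a_attrs, a_ndim, a_path, dim, ds_attrs, ds_path):
    # Forward link on the array: one list of scale references per dim.
    dim_list = a_attrs.get('DIMENSION_LIST', [[] for _ in range(a_ndim)])
    dim_list[dim].append(ds_path)
    a_attrs['DIMENSION_LIST'] = dim_list
    # Back link on the scale: (array reference, dimension index) pairs.
    ref_list = ds_attrs.get('REFERENCE_LIST', [])
    ref_list.append([a_path, dim])
    ds_attrs['REFERENCE_LIST'] = ref_list

a_attrs, ds_attrs = {}, {'CLASS': 'DIMENSION_SCALE'}
attach_scale(a_attrs, 2, '/data/a', 0, ds_attrs, '/scales/ds1')
print(a_attrs['DIMENSION_LIST'])   # [['/scales/ds1'], []]
print(ds_attrs['REFERENCE_LIST'])  # [['/data/a', 0]]
```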
Detaching a scale implies updating the DIMENSION_LIST attribute on the
array, and also updating the REFERENCE_LIST attribute on the dimension
scale array.
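The detach operation is the mirror image; a sketch under the same dict-for-`.zattrs` and paths-as-references assumptions:

```python
# Sketch only: detaching a scale removes the forward link from the
# array's DIMENSION_LIST and the matching back link from the scale's
# REFERENCE_LIST. Paths-as-references is an assumption here.

def detach_scale(a_attrs, dim, ds_attrs, a_path, ds_path):
    a_attrs['DIMENSION_LIST'][dim].remove(ds_path)
    ds_attrs['REFERENCE_LIST'].remove([a_path, dim])

a_attrs = {'DIMENSION_LIST': [['/scales/ds1'], []]}
ds_attrs = {'REFERENCE_LIST': [['/data/a', 0]]}
detach_scale(a_attrs, 0, ds_attrs, '/data/a', '/scales/ds1')
print(a_attrs['DIMENSION_LIST'])   # [[], []]
print(ds_attrs['REFERENCE_LIST'])  # []
```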
Most of the above seems OK to me, in terms of both following the HDF5
implementation and the h5py API, apart from the slightly confusing h5py
API for creating a scale, although I have no experience of using this
myself and I'd be interested in what others think.
I have two main questions about the implementation.
First, creating the links between an array and a dimension scale requires
some form of reference. In HDF5 every object has an object reference which
is a unique identifier, in addition to its path within the HDF5 file. In
Zarr we don't have identifiers for array or group objects, we just use
paths within the hierarchy. So in Zarr, instead of object references, do we
just use the absolute path in the hierarchy? E.g., if a dimension scale
array is at path /scales/ds1 and a 1D array is at path /data/a, when
attaching the scale we store, e.g., DIMENSION_LIST=[['/scales/ds1']] on the
array at /data/a, and REFERENCE_LIST=[['/data/a', 0]] on /scales/ds1?
This would seem the obvious answer, but I have one small concern, which is
that for some storage backends there is no fixed notion of the root of a
hierarchy. I.e., you can create a hierarchy, then re-open at some point
within the hierarchy, which will treat that point as the root. E.g.:
root = zarr.open_group('/data/root')
grp = root.create_group('foo/bar')
new_root = zarr.open_group('/data/root/foo/bar')
...is perfectly valid. Obviously absolute paths will be different when
interacting with root versus new_root. Options for coping with this seem to
be either (1) just don't support this use case for dimension scales, i.e.,
use absolute paths as references and live with this being broken if people
do strange things like re-open a group at the wrong hierarchy level, or (2)
use some kind of relative path as references. Zarr has no notion of
relative paths at the moment, but this could be done.
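To make option (2) concrete, here is a sketch of how relative references could be computed and resolved with the standard library's posixpath module (the example paths are illustrative, not part of any proposal):

```python
# Sketch only: relative references computed with posixpath. The paths
# are illustrative. An array at /foo/bar/data/a (relative to the
# original root) refers to a scale at /foo/bar/scales/ds1.
import posixpath

a_path = '/foo/bar/data/a'
ds_path = '/foo/bar/scales/ds1'

# An absolute reference like '/foo/bar/scales/ds1' breaks if the
# hierarchy is re-opened at '/foo/bar', where the scale's path
# becomes '/scales/ds1'. A reference relative to the referring
# array's parent group is stable under such re-rooting:
rel = posixpath.relpath(ds_path, start=posixpath.dirname(a_path))
print(rel)  # ../scales/ds1

# Resolving the relative reference recovers the original target:
resolved = posixpath.normpath(
    posixpath.join(posixpath.dirname(a_path), rel))
print(resolved == ds_path)  # True
```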
The second question I have is what (if anything) do we do about maintaining
the referential integrity of dimension scale forward and backward links?
E.g., if you delete a dimension scale array, do you also try to update the
DIMENSION_LIST attribute on all arrays to which the scale is attached? Or,
e.g., if you delete an array with an attached scale, do you also try to
update the REFERENCE_LIST attribute on the scale array? This is suggested
in the HDF5 dimension scale spec, however I think this could get messy in
the implementation for a couple of reasons. First, it would require adding
logic into the deletion implementations to check the attributes on any
deleted array prior to deletion, and then try to update any dimension scale
links, which would complicate the code and also add some overhead
especially for remote backends. Also, if this is being done against a
remote backend, there is no guarantee that all operations will succeed,
i.e., it would be entirely possible to end up with some hanging links.
Also, if someone deletes a group then a recursive search would have to be
made of all descendant arrays first and links removed.
Here it seems to me like there are at least three options. (1) Don't try to
maintain referential integrity, i.e., if someone deletes an array or
dimension scale array, let any dimension scale links dangle. (2) Make a
best effort to update all dimension scale links whenever anything is
deleted. (3) Give the user a choice between (1) and (2) via some API switch.
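For option (2), the cleanup on deleting a dimension scale might look something like this sketch, with a dict keyed by array path standing in for the store; against a remote backend any of these updates could still fail partway through, which is exactly the hanging-links concern above:

```python
# Sketch only: best-effort link cleanup (option 2) when deleting a
# dimension scale. A dict keyed by array path stands in for the
# store's attributes; paths-as-references is an assumption.

def delete_scale(attrs_by_path, ds_path):
    ds_attrs = attrs_by_path.pop(ds_path)
    # Walk the scale's back links and drop the matching forward links.
    for a_path, dim in ds_attrs.get('REFERENCE_LIST', []):
        a_attrs = attrs_by_path.get(a_path)
        if a_attrs is not None:
            a_attrs['DIMENSION_LIST'][dim].remove(ds_path)

store = {
    '/scales/ds1': {'CLASS': 'DIMENSION_SCALE',
                    'REFERENCE_LIST': [['/data/a', 0]]},
    '/data/a': {'DIMENSION_LIST': [['/scales/ds1'], []]},
}
delete_scale(store, '/scales/ds1')
print(store)  # {'/data/a': {'DIMENSION_LIST': [[], []]}}
```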
Any thoughts?
--
If I do not respond to an email within a few days and you need a response,
please feel free to resend your email and/or contact me by other means.
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596 or +44 (0)7866 541624
Skype: londonbonsaipurple
Email: [email protected]
Web: http://alimanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
@alimanfoo - thinking about this more, let me clarify how we would likely use this in xarray.
References:
Thanks Joe, need to mull this over a bit.
Was talking with @shoyer about this at scipy2018. Currently
but perhaps could find a home in
as a new field I imagine there will be other packages soon trying to use Whatever is decided, @shoyer suggests we call these the "NetZDF" conventions! 😸
See #276 for a proposed Zarr spec v3, incorporating optional dimension names and the "netzdf" format.
Closing as this is now part of the
Implement the h5py dimension scales API?