Skip to content

Refactor xarray.conventions into VariableCoder #1752

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 14, 2017
Merged

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Dec 1, 2017

Building off of discussion in #1087, I would like to propose refactoring xarray.conventions to use an interface based on VariableCoder objects with encode() and decode() methods.

The idea is make it easier to write new backends, by making decoding variables according to CF conventions as simple as calling decode() on each coder in a list of coders, with encoding defined by calling the same list of encoders in the opposite order.

As a proof of concept, here I implement only a single Coder. In addition to making use of xarray's existing lazy indexing behavior, I have written it so that dask arrays are decoded using dask (which would solve #1372)

Eventually, we should port all the coders in xarray.conventions to this new format. This is probably best saved for future PRs -- help would be appreciated!

  • Tests added / passed
  • Passes git diff upstream/master **/*py | flake8 --diff

I would like to propose refactoring xarray.conventions to use an interface
based on VariableCoder objects with encode() and decode() methods.

The idea is make it easier to write new backends, by making decoding variables
according to CF conventions as simple as calling decode() on each coder in a
list of coders, with encoding defined by calling the same list of encoders in
the opposite order.

As a proof of concept, here I implement only a single Coder. In addition to
making use of xarray's existing lazy indexing behavior, I have written it so
that dask arrays are decoded using dask.

Eventually, we should port all the coders in xarray.conventions to this new
format. This is probably best saved for future PRs -- help would be appreciated!
@shoyer shoyer requested a review from jhamman December 4, 2017 02:17
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. This is going to really cleanup the conventions module nicely. It may also allow for a more modular decoding schema somewhat detached from specific backends. I'm specifically thinking about a set of Coders that are more/less sensitive to CF conventions.

if '_FillValue' in attrs:
raw_fill_value = pop_to(attrs, encoding, '_FillValue', name=name)
encoded_fill_values = [
fv for fv in np.ravel(raw_fill_value) if not pd.isnull(fv)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it even possible to specify a sequence of fill values in a netcdf file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently? This came from a pull request, though I refactored the logic here slightly.

@shoyer
Copy link
Member Author

shoyer commented Dec 13, 2017

I'm going to merge this shortly if there are no objections.

@alexamici
Copy link
Collaborator

alexamici commented Dec 23, 2017

@shoyer I'm writing a small helper that looks very much like a decoding filter, but I'm not sure it fits the VariableCoder model.

The intent is to use CF attributes to identify coordinate variables, for example {'units': 'degrees_north'} identify a 'latitude' coordinate, and apply a set of transformations to harmonise datasets from different providers: I rename the coordinates to a known name, for example 'lat' for latitude, and convert to a known unit, for example to 'Pa' for pressure.

To my understanding some of the transformations need to be at a DataArray / Dataset level, not Variable one, because I need to manipulate the names of the coordinate variables and keep the name of the dimensions in sync, for example I'm currently using DataArray.rename a lot, but I also have an example of using DataArray.swap_dims. A simplified example is:

def harmonise_latitude(dataset):
    # (xr.Dataset) -> xr.Dataset
    for coord in dataset.coords.values():
        if coord.attrs.get('units') == 'degrees_north':
            # update the other CF attributes that reference coord.name, for example 'coordinates'.
            return dataset.rename({coord.name: 'lat'})
    return dataset

Would you be open to add support for Dataset -> Dataset decoding filters similar to the Variable -> Variable?

@shoyer
Copy link
Member Author

shoyer commented Dec 23, 2017

The intent with decoders is handle the invertible transformations between a raw file on disk and xarray's data model. So we will definitely need some transformations that act on a collection of variables (e.g., to decode coordinates vs data variables), but not in the form of a Dataset yet. These decoders/encoders will probably take/return three arguments: variables (an OrderedDict of xarray.Variable objects with both coordinates and data), dims (dict of dimension names to sizes) and (global) attrs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants