-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Refactor xarray.conventions into VariableCoder #1752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I would like to propose refactoring xarray.conventions to use an interface based on VariableCoder objects with encode() and decode() methods. The idea is make it easier to write new backends, by making decoding variables according to CF conventions as simple as calling decode() on each coder in a list of coders, with encoding defined by calling the same list of encoders in the opposite order. As a proof of concept, here I implement only a single Coder. In addition to making use of xarray's existing lazy indexing behavior, I have written it so that dask arrays are decoded using dask. Eventually, we should port all the coders in xarray.conventions to this new format. This is probably best saved for future PRs -- help would be appreciated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. This is going to really cleanup the conventions module nicely. It may also allow for a more modular decoding schema somewhat detached from specific backends. I'm specifically thinking about a set of Coders that are more/less sensitive to CF conventions.
if '_FillValue' in attrs: | ||
raw_fill_value = pop_to(attrs, encoding, '_FillValue', name=name) | ||
encoded_fill_values = [ | ||
fv for fv in np.ravel(raw_fill_value) if not pd.isnull(fv)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it even possible to specify a sequence of fill values in a netcdf file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently? This came from a pull request, though I refactored the logic here slightly.
I'm going to merge this shortly if there are no objections. |
@shoyer I'm writing a small helper that looks very much like a decoding filter, but I'm not sure it fits the VariableCoder model. The intent is to use CF attributes to identify coordinate variables, for example {'units': 'degrees_north'} identify a 'latitude' coordinate, and apply a set of transformations to harmonise datasets from different providers: I rename the coordinates to a known name, for example 'lat' for latitude, and convert to a known unit, for example to 'Pa' for pressure. To my understanding some of the transformations need to be at a DataArray / Dataset level, not Variable one, because I need to manipulate the names of the coordinate variables and keep the name of the dimensions in sync, for example I'm currently using DataArray.rename a lot, but I also have an example of using DataArray.swap_dims. A simplified example is: def harmonise_latitude(dataset):
# (xr.Dataset) -> xr.Dataset
for coord in dataset.coords.values():
if coord.attrs.get('units') == 'degrees_north':
# update the other CF attributes that reference coord.name, for example 'coordinates'.
return dataset.rename({coord.name: 'lat'})
return dataset Would you be open to add support for Dataset -> Dataset decoding filters similar to the Variable -> Variable? |
The intent with decoders is handle the invertible transformations between a raw file on disk and xarray's data model. So we will definitely need some transformations that act on a collection of variables (e.g., to decode coordinates vs data variables), but not in the form of a Dataset yet. These decoders/encoders will probably take/return three arguments: |
Building off of discussion in #1087, I would like to propose refactoring xarray.conventions to use an interface based on VariableCoder objects with encode() and decode() methods.
The idea is make it easier to write new backends, by making decoding variables according to CF conventions as simple as calling decode() on each coder in a list of coders, with encoding defined by calling the same list of encoders in the opposite order.
As a proof of concept, here I implement only a single Coder. In addition to making use of xarray's existing lazy indexing behavior, I have written it so that dask arrays are decoded using dask (which would solve #1372)
Eventually, we should port all the coders in xarray.conventions to this new format. This is probably best saved for future PRs -- help would be appreciated!
git diff upstream/master **/*py | flake8 --diff