We need some way to identify non-index coordinates #197

Closed
shoyer opened this issue Aug 1, 2014 · 3 comments

shoyer commented Aug 1, 2014

I am currently working with station data. In order to keep around latitude and longitude (I use station_id as the coordinate variable), I need to resort to some ridiculous contortions:

residuals = results['y'] - observations['y']
residuals.dataset.update(results.select_vars('longitude', 'latitude'))

There has got to be an easier way to handle this.


I don't want to revert to some primitive guessing strategy (e.g., looking at attrs['coordinates']) to figure out which extra variables can be safely kept after mathematical operations.

Another approach would be to try to preserve everything in the dataset linked to a DataArray when doing math. But I don't really like this option, either, because it would lead to serious propagation of "linked dataset variables", which are rather surprising and can have unexpected performance consequences (though at least they appear in repr as of #128).


This leads me to a final alternative: restructuring xray's internals to provide first-class support for coordinates that are not indexes. For example, this would mean promoting ds.coordinates to an actual dictionary stored on a dataset, and allowing it to hold objects that aren't an xray.Coordinate.

Making this change transparent to users would likely require changing the Dataset signature to something like Dataset(variables, coords, attrs). We might (yet again) want to rename Coordinate to something like IndexVar, to emphasize the notion of "index" and "non-index" coordinates. And we could get rid of the terrible "linked dataset variable".

Once we have non-index coordinates, we need a policy for what to do when adding two DataArrays for which they differ. I think my preferred approach is to not enforce that they be found on both arrays, but to raise an exception if there are any conflicting values, unless they are scalar valued, in which case they are dropped, turned into a tuple, or given different names. (Otherwise there would be cases where you couldn't calculate x[1] - x[0].)

We might even be able to keep around multi-dimensional coordinates this way (e.g., 2D lat/lon arrays for projected data)... I'll need to think about that one some more.

@shoyer shoyer added the API label Aug 1, 2014

shoyer commented Aug 3, 2014

Some further thinking suggests that we can allow for multi-dimensional coordinates, but the only sane way to handle conflicting non-index coordinates is to drop them. Even Iris, with its strict interpretation of CF conventions, takes this approach.

Raising an exception for conflicting non-scalar variables would make multi-dimensional coordinates impractical.
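
To make the "drop on conflict" policy concrete, here is a minimal sketch in plain Python. It assumes non-index coordinates can be modeled as dicts mapping names to lists of values; `merge_non_index_coords` is a hypothetical helper for illustration and is not part of xray.

```python
def merge_non_index_coords(left, right):
    """Combine the non-index coordinates of two operands.

    Hypothetical helper, for illustration only (not an xray function).
    A coordinate found on only one operand is kept; a coordinate found
    on both is kept only if the values agree, and dropped otherwise.
    """
    merged = {}
    for name in sorted(set(left) | set(right)):
        if name in left and name in right:
            if left[name] == right[name]:
                merged[name] = left[name]
            # Conflicting values: drop the coordinate instead of raising.
        else:
            merged[name] = left[name] if name in left else right[name]
    return merged


a = {"latitude": [40.0, 41.5], "longitude": [-105.0, -104.2]}
b = {
    "latitude": [40.0, 41.5],
    "longitude": [-100.0, -99.0],
    "elevation": [1600, 1700],
}
result = merge_non_index_coords(a, b)
print(result)
```

Here `latitude` agrees on both operands and is kept, `elevation` exists on only one operand and is kept, and the conflicting `longitude` is dropped without an error.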

@shoyer shoyer modified the milestones: 0.3, 1.0 Aug 3, 2014
shoyer added a commit to shoyer/xarray that referenced this issue Aug 14, 2014
shoyer added a commit to shoyer/xarray that referenced this issue Aug 14, 2014

shoyer commented Aug 18, 2014

Here's my current thinking on implementation:

  • Dataset._coords_keys is a set that keeps track of the names of variables that are coordinates. This would let us implement Dataset.coords as a dict-like object with very little overhead for lookups.
  • DataArray.dataset should go away from the public API (use to_dataset() instead); more importantly, DataArray objects should only know about their coordinates, not any other aspects of the underlying dataset (it's a leaky abstraction).
  • Dataset.__iter__ should only iterate over non-coordinates; but __contains__ and __getitem__ should be unchanged.
  • Of course, we should automatically save and load coordinates according to CF conventions.
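
As a rough illustration of the first and third bullets, the sketch below uses a toy class rather than xray's real Dataset. The class name `ToyDataset` and its constructor arguments are assumptions made for this example; only the `_coords_keys` set and the behavior of `__iter__`, `__contains__`, and `__getitem__` come from the list above.

```python
class ToyDataset:
    """Toy stand-in for a dataset; not xray's actual implementation."""

    def __init__(self, variables, coord_names=()):
        self._variables = dict(variables)
        # Names of the variables that are coordinates.
        self._coords_keys = set(coord_names)

    @property
    def coords(self):
        # Built on demand from the set of names. A real implementation
        # would presumably return a lightweight dict-like view instead.
        return {
            name: value
            for name, value in self._variables.items()
            if name in self._coords_keys
        }

    def __iter__(self):
        # Iteration skips coordinates...
        return (name for name in self._variables if name not in self._coords_keys)

    def __contains__(self, key):
        # ...but membership tests still see every variable...
        return key in self._variables

    def __getitem__(self, key):
        # ...and so does item lookup.
        return self._variables[key]


ds = ToyDataset(
    {"y": [1.0, 2.0], "station_id": ["a", "b"], "latitude": [40.0, 41.5]},
    coord_names=["station_id", "latitude"],
)
print(list(ds))
print("latitude" in ds)
print(sorted(ds.coords))
```

Iterating yields only `y`, while `"latitude" in ds` and `ds["latitude"]` keep working as before.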

Also, we need some new methods to make this workable (modeled off of pandas's set_index and reset_index):

  • Dataset.set_coords(keys, inplace=False) turns the variables in keys into coordinates
  • Dataset.reset_coords(keys=None, drop=False, inplace=False) turns the coordinates in keys (all coordinates by default) back into variables, or removes them entirely if drop=True.
  • DataArray.reset_coords() would be very similar to the dataset method; it would return a Dataset unless drop=True (in which case it would return another DataArray).
  • set_index(keys, inplace=False) should be both a DataArray and Dataset method.
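
The intended semantics of set_coords and reset_coords can be sketched with a toy class. This is an illustration under assumptions, not the proposed implementation: `ToyDataset`, its attributes, and the `_target` helper are hypothetical, `keys` is assumed to be an iterable of names, and index coordinates are not modeled at all.

```python
class ToyDataset:
    """Toy model: a dict of variables plus a set of coordinate names."""

    def __init__(self, variables, coord_names=()):
        self.variables = dict(variables)
        self.coord_names = set(coord_names)

    def _target(self, inplace):
        # Hypothetical helper: operate on self, or on a shallow copy.
        return self if inplace else ToyDataset(self.variables, self.coord_names)

    def set_coords(self, keys, inplace=False):
        """Mark the variables named in `keys` as coordinates."""
        obj = self._target(inplace)
        missing = set(keys) - set(obj.variables)
        if missing:
            raise KeyError(f"not variables: {sorted(missing)}")
        obj.coord_names.update(keys)
        return obj

    def reset_coords(self, keys=None, drop=False, inplace=False):
        """Turn coordinates back into plain variables, or drop them."""
        obj = self._target(inplace)
        # Default to all coordinates; copy so the set is not mutated
        # while it is being subtracted from itself.
        keys = set(obj.coord_names) if keys is None else set(keys)
        obj.coord_names -= keys
        if drop:
            for name in keys:
                obj.variables.pop(name, None)
        return obj


ds = ToyDataset(
    {"y": [1, 2], "latitude": [40.0, 41.5], "longitude": [-105.0, -104.2]}
)
with_coords = ds.set_coords(["latitude", "longitude"])
plain = with_coords.reset_coords()
dropped = with_coords.reset_coords(drop=True)
print(sorted(with_coords.coord_names))
print(sorted(dropped.variables))
```

With inplace=False, each call returns a new object and leaves the original untouched, mirroring the pandas set_index and reset_index methods that the proposal is modeled on.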

An important aspect is that using non-index coordinates (a power user feature) should be optional, just like how using and understanding Dataset objects should be optional.


shoyer commented Sep 10, 2014

Still a few items to check off (see the open associated issues) but I don't think they are blockers to v0.3.
