multiple files - variable X not equal across datasets #443

Closed
razcore-rad opened this issue Jun 26, 2015 · 9 comments
Comments

@razcore-rad

The other day I was playing with xray.open_mfdataset and noticed that you can get this error when opening multiple files at the same time. I think there is a pretty easy solution:

import glob as g
from toolz.curried import curry, map, pipe
import xray

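def get_ds(glob):
    def _get_ds(file_path):
        dim = 'mean_height_agl'  # coordinate whose values differ between files
        dim_new = 'agl'          # dummy integer index that replaces it
        with xray.open_dataset(file_path) as _ds:
            _ds.load()  # read into memory before the file is closed
            return (_ds.assign_coords(**{dim_new:
                                         (dim, range(_ds.coords[dim].size))})
                    .swap_dims({dim: dim_new}))

    # Open the files in sorted order, fix each one, then concatenate along time.
    return pipe(g.glob(glob),
                sorted,
                map(_get_ds),
                curry(xray.concat)(dim='time'))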

Of course, this is for the particular variable I was having trouble with, but the idea is to swap dimensions: create a dummy dimension with the same length as the troublesome variable, then swap the two. This can be done for any number of troublesome variables. I don't know how feasible this is, though. Just thought I'd share my idea...

@shoyer
Member

shoyer commented Jun 26, 2015

Could you print two of the incompatible datasets? I'm not sure if there is a general pattern here (or not).

@razcore-rad
Author

I don't know if this is what you're asking for (I only have one dataset with this problem), but here's what it looks like:

<xray.Dataset>
Dimensions:            (agl: 50, lat: 39, lon: 59, time: 192)
Coordinates:
  * lon                (lon) float64 -29.0 -28.0 -27.0 -26.0 -25.0 -24.0 ...
  * lat                (lat) float64 32.0 33.0 34.0 35.0 36.0 37.0 38.0 39.0 ...
  * agl                (agl) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * time               (time) datetime64[ns] 2011-05-21T13:00:00 ...
    mean_height_agl    (time, agl) float64 28.28 97.21 191.1 310.7 460.9 ...
Data variables:
    so2_concentration  (time, agl, lat, lon) float64 3.199e-13 3.199e-13 ...
    ash_wetdep         (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    ash_concentration  (time, agl, lat, lon) float64 9.583e-16 9.581e-16 ...
    ash_mass_loading   (time, lat, lon) float64 1.091e-11 1.091e-11 ...
    so2_mass_loading   (time, lat, lon) float64 2.602e-09 2.602e-09 ...
    ash_drydep         (time, lat, lon) float64 4.086e-10 4.084e-10 4.08e-10 ...
Attributes:

This is read in with get_ds from above. I wouldn't be able to read it normally with xray.open_mfdataset because it would give me 'Variable mean_height_agl not equal across datasets'. But 'mean_height_agl' is indeed a coordinate per individual file, so I had to create the dummy 'agl' coordinate and convert 'mean_height_agl' to a variable basically. This way I can still treat the data as being part of one file.

@shoyer
Member

shoyer commented Jun 26, 2015

Do you get the error message if you specify the full path to this file in open_mfdataset?

@razcore-rad
Author

I get this with open_mfdataset:

Traceback (most recent call last):
  File "box.py", line 59, in <module>
    concat_dim='time'))
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/backends/api.py", line 205, in open_mfdataset
    combined = auto_combine(datasets, concat_dim=concat_dim)
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/core/alignment.py", line 352, in auto_combine
    concatenated = [_auto_concat(ds, dim=concat_dim) for ds in grouped]
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/core/alignment.py", line 352, in <listcomp>
    concatenated = [_auto_concat(ds, dim=concat_dim) for ds in grouped]
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/core/alignment.py", line 303, in _auto_concat
    return concat(datasets, dim=dim)
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/core/alignment.py", line 278, in concat
    return cls._concat(objs, dim, indexers, mode, concat_over, compat)
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/core/dataset.py", line 1712, in _concat
    'variable %r not %s across datasets' % (k, verb))
ValueError: variable 'mean_height_agl' not equal across datasets

@shoyer
Member

shoyer commented Jun 26, 2015

Marking this as a bug, I'll see if I can reproduce this with a similar dataset.

@shoyer shoyer added the bug label Jun 26, 2015
@razcore-rad
Author

Well, I'm not sure if it's a bug, I would say it's more like a missing feature... in my case, each netCDF file has a different mean_height_agl coordinate, that is, they have the same length (it's 1D), but different values in each file. I can understand why it can't concatenate, but I would argue that a better way to handle this is to create a dummy coordinate (as I did) and replace the troublesome coordinate with that dummy coordinate.
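The mismatch described here can be reproduced in memory. The sketch below is only an illustration: it assumes the modern `xarray` package (the successor to `xray`), uses made-up height values, and relies on a hypothetical helper `make_file_like` in place of real netCDF files. It also asks `concat` for `join="exact"` explicitly, because current `xarray` otherwise aligns mismatched indexes instead of refusing them, which is not how `xray` behaved in the traceback above.

```python
import numpy as np
import xarray as xr  # assumption: the package that succeeded xray


def make_file_like(time, heights):
    """Hypothetical stand-in for one per-timestep netCDF file."""
    heights = np.asarray(heights, dtype=float)
    return xr.Dataset(
        {"ash_concentration": (("time", "mean_height_agl"),
                               np.ones((1, heights.size)))},
        coords={"time": [time], "mean_height_agl": heights},
    )


first = make_file_like(0, [28.28, 97.21, 191.1])
second = make_file_like(1, [30.02, 99.85, 195.4])  # made-up values

# The coordinate has the same length in both datasets but different values.
same_length = first.sizes["mean_height_agl"] == second.sizes["mean_height_agl"]
same_values = first["mean_height_agl"].equals(second["mean_height_agl"])

# join="exact" tells concat to refuse mismatched indexes rather than align them.
try:
    xr.concat([first, second], dim="time", join="exact")
    concat_failed = False
except ValueError:
    concat_failed = True
```

Because the heights are an index along their own dimension, the two datasets do not share a common grid along it, which is why replacing that index with a plain integer level makes concatenation along 'time' possible.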

@shoyer
Member

shoyer commented Jun 26, 2015

Do you concatenate these files along one of the existing axes or a new axis? This might require new API but should probably be supported.

Could you print two of these netCDF files that you want to automatically combine with open_mfdataset? I know they have the same structure but it's useful to see how/if the values differ.

@razcore-rad
Author

I'm trying to concatenate along an existing axis, the 'time' axis. I uploaded a couple of files here. It's just easier that way, and you can experiment with them.

@shoyer shoyer removed the bug label Jun 26, 2015
@shoyer
Member

shoyer commented Jun 27, 2015

OK, I understand now. One of these files looks like:

<xray.Dataset>
Dimensions:            (lat: 39, lon: 59, mean_height_agl: 50, time: 1)
Coordinates:
  * time               (time) datetime64[ns] 2011-05-21T13:00:00
  * lon                (lon) float64 -29.0 -28.0 -27.0 -26.0 -25.0 -24.0 ...
  * lat                (lat) float64 32.0 33.0 34.0 35.0 36.0 37.0 38.0 39.0 ...
  * mean_height_agl    (mean_height_agl) float64 28.28 97.21 191.1 310.7 ...
Data variables:
    ash_concentration  (mean_height_agl, lat, lon) float64 9.583e-16 ...
    ash_mass_loading   (lat, lon) float64 1.091e-11 1.091e-11 1.091e-11 ...
    ash_drydep         (lat, lon) float64 4.086e-10 4.084e-10 4.08e-10 ...
    ash_wetdep         (lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    so2_concentration  (mean_height_agl, lat, lon) float64 3.199e-13 ...
    so2_mass_loading   (lat, lon) float64 2.602e-09 2.602e-09 2.602e-09 ...

The problem is that the mean_height_agl coordinate changes between each file.

Another interesting aspect of this file, which relates to how I was hoping to fix #438, is that it includes a time coordinate with length 1, but none of the other dataset variables use that coordinate.

This suggests to me that we need some sort of hook that lets you transform each dataset before they are joined with open_mfdataset. Perhaps a preprocess argument? Then you could write, e.g.:

def fix_my_data(ds):
    return (ds.assign_coords(
                agl=('mean_height_agl', range(ds.dims['mean_height_agl'])))
            .swap_dims({'mean_height_agl': 'agl'})
            .squeeze('time'))

ds = xray.open_mfdataset('*.nc', preprocess=fix_my_data)
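What such a hook would do can be sketched without any files. The example below is a rough in-memory illustration, not the library's implementation: it assumes the modern `xarray` package rather than `xray`, builds synthetic datasets with a hypothetical `make_file_like` helper, and uses a hypothetical `combine_with_preprocess` function to stand in for the transform-then-join step. The `data_vars` and `coords` arguments are passed explicitly so the result does not depend on version-specific defaults.

```python
import numpy as np
import xarray as xr  # assumption: the package that succeeded xray


def fix_my_data(ds):
    """Replace the per-file height index with an integer level index."""
    levels = np.arange(ds.sizes["mean_height_agl"])
    return (
        ds.assign_coords(agl=("mean_height_agl", levels))
        .swap_dims({"mean_height_agl": "agl"})
        .squeeze("time")  # time becomes a scalar coordinate
    )


def combine_with_preprocess(datasets, preprocess, concat_dim="time"):
    """Hypothetical helper: transform every dataset, then concatenate."""
    fixed = [preprocess(ds) for ds in datasets]
    return xr.concat(fixed, dim=concat_dim, data_vars="all", coords="different")


def make_file_like(time, heights):
    """Hypothetical stand-in for one file: length-1 time, unused by the data."""
    heights = np.asarray(heights, dtype=float)
    return xr.Dataset(
        {"ash_concentration": (("mean_height_agl",), np.ones(heights.size))},
        coords={"time": [time], "mean_height_agl": heights},
    )


datasets = [
    make_file_like(0, [28.28, 97.21, 191.1]),
    make_file_like(1, [30.02, 99.85, 195.4]),  # made-up values
]
combined = combine_with_preprocess(datasets, fix_my_data)
```

After combining, `mean_height_agl` is kept as an ordinary coordinate that varies with both `time` and `agl`, so the per-file heights are not lost.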
