xray.open_mfdataset concatenates also variables without time dimension #438

Closed
j08lue opened this issue Jun 18, 2015 · 13 comments

@j08lue
Contributor

j08lue commented Jun 18, 2015

When opening a multi-file dataset with xray.open_mfdataset, some variables that do not have a time dimension are concatenated as well.

My netCDF files contain a lot of those "static" variables (e.g. grid spacing etc.). netCDF4.MFDataset used to handle those as expected (i.e. did not concatenate them).

Is the different behaviour of xray.open_mfdataset intentional or due to a bug?

Note: I am using decode_times=False.

Example

    with xray.open_dataset(files[0], decode_times=False) as single:
        print(single['dz'])
<xray.DataArray 'dz' (z_t: 60)>
array([  1000.        ,   1000.        ,   1000.        ,   1000.        ,
         1000.        ,   1000.        ,   1000.        ,   1000.        ,
         1000.        ,   1000.        ,   1000.        ,   1000.        ,
         1000.        ,   1000.        ,   1000.        ,   1000.        ,
         1019.68078613,   1056.44836426,   1105.99511719,   1167.80700684,
         1242.41333008,   1330.96777344,   1435.14099121,   1557.12585449,
         1699.67956543,   1866.21240234,   2060.90234375,   2288.85205078,
         2556.24707031,   2870.57495117,   3240.8371582 ,   3677.77246094,
         4194.03076172,   4804.22363281,   5524.75439453,   6373.19189453,
         7366.94482422,   8520.89257812,   9843.65820312,  11332.46582031,
        12967.19921875,  14705.34375   ,  16480.70898438,  18209.13476562,
        19802.234375  ,  21185.95703125,  22316.50976562,  23186.49414062,
        23819.44921875,  24257.21679688,  24546.77929688,  24731.01367188,
        24844.328125  ,  24911.97460938,  24951.29101562,  24973.59375   ,
        24985.9609375 ,  24992.67382812,  24996.24414062,  24998.109375  ])
Coordinates:
  * z_t      (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 6500.0 ...
Attributes:
    long_name: thickness of layer k
    units: centimeters

    with xray.open_mfdataset(files, decode_times=False) as multiple:
        print(multiple['dz'])
<xray.DataArray 'dz' (time: 12, z_t: 60)>
dask.array<concatenate-1156, shape=(12, 60), chunks=((1, 1, 1, ..., 1, 1), (60,)), dtype=float64>
Coordinates:
  * z_t      (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 6500.0 ...
  * time     (time) float64 3.653e+04 3.656e+04 3.659e+04 3.662e+04 ...
Attributes:
    long_name: thickness of layer k
    units: centimeters
@shoyer
Member

shoyer commented Jun 18, 2015

Hmm, I'll have to think about this one. We use some heuristics to figure out what to concatenate, but they aren't perfect. Could you print what the full datasets look like, not just this variable?

@j08lue
Contributor Author

j08lue commented Jun 19, 2015

Here is a print-out of the full dataset for POP ocean model output (see that gist in nbviewer).

I can see that the heuristics exclude variables from concatenation that are associated with dimensions of other variables. But why not just exclude all that do not have a time dimension?

@j08lue
Contributor Author

j08lue commented Jun 19, 2015

netCDF4-python uses a dimension specified by the user or an unlimited dimension it finds in the dataset. Here is the corresponding code section.
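
A rough sketch of that selection rule, not the netCDF4-python source itself. The helper names are hypothetical, `TEMP` is a made-up time-dependent variable, and treating `time` as the unlimited dimension is an assumption about these files. The second helper also assumes that only variables whose leftmost dimension is the aggregation dimension get aggregated.

```python
def pick_aggregation_dim(dimensions, user_dim=None):
    """Return the dimension to aggregate over.

    `dimensions` maps each dimension name to True if it is unlimited.
    A user-specified dimension wins; otherwise fall back to the first
    unlimited dimension found.
    """
    if user_dim is not None:
        if user_dim not in dimensions:
            raise ValueError("unknown dimension: %r" % (user_dim,))
        return user_dim
    unlimited = [name for name, is_unlimited in dimensions.items() if is_unlimited]
    if not unlimited:
        raise ValueError("no unlimited dimension found; pass one explicitly")
    return unlimited[0]


def aggregated_variables(var_dims, agg_dim):
    """Names of variables whose leftmost dimension is `agg_dim` (assumed rule)."""
    return [name for name, dims in var_dims.items() if dims and dims[0] == agg_dim]


# Assumption: 'time' is the unlimited dimension in these files.
dimensions = {"time": True, "z_t": False}
# 'dz' is from the example above; 'TEMP' is a hypothetical variable.
var_dims = {"dz": ("z_t",), "TEMP": ("time", "z_t")}

agg_dim = pick_aggregation_dim(dimensions)
print(agg_dim)                                  # time
print(aggregated_variables(var_dims, agg_dim))  # ['TEMP']
```

Under this rule a static variable such as `dz` is never aggregated, which matches the behaviour reported for netCDF4.MFDataset above.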

@shoyer
Member

shoyer commented Jun 19, 2015

> But why not just exclude all that do not have a time dimension?

Yeah, this is probably a good idea.

concat also covers combining datasets along a new dimension (i.e., if time was not a dimension of any of the individual datasets), but that's not the case here.
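
The proposed rule can be sketched in plain Python. This is only an illustration, not xray's actual heuristic: the helper name is hypothetical, and variables are reduced to their dimension names, taken from the Daymet dataset print-out later in this thread.

```python
def split_by_concat_dim(var_dims, dim):
    """Partition variable names into (to_concatenate, to_keep_once).

    A variable is concatenated only if `dim` is already one of its
    dimensions; everything else is treated as static.
    """
    to_concatenate = [name for name, dims in var_dims.items() if dim in dims]
    to_keep_once = [name for name, dims in var_dims.items() if dim not in dims]
    return to_concatenate, to_keep_once


var_dims = {
    "tmin": ("time", "y", "x"),
    "lon": ("y", "x"),
    "lat": ("y", "x"),
    "lambert_conformal_conic": (),
}

concatenated, static = split_by_concat_dim(var_dims, "time")
print(concatenated)  # ['tmin']
print(static)        # ['lon', 'lat', 'lambert_conformal_conic']
```

If the concatenation dimension is new (no variable has it yet), this rule selects nothing, which is why concat needs a different strategy for that case.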

@shoyer
Member

shoyer commented Jul 15, 2015

With #473, you will be able to achieve your desired result by adjusting the data_vars argument. data_vars='minimal' will probably do the trick.

@guziy
Contributor

guziy commented Sep 19, 2017

Hi @shoyer:

Where is this data_vars='minimal' set? Or maybe I am using the wrong version of xarray? Here is what I get:

In [10]: ds = xarray.open_mfdataset("/snow3/huziy/Daymet_daily_derivatives/daymet_spatial_agg_tmin_10x10/daymet_v3_tmin_*.nc4", data_vars="minimum")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-ba844206f74a> in <module>()
----> 1 ds = xarray.open_mfdataset("/snow3/huziy/Daymet_daily_derivatives/daymet_spatial_agg_tmin_10x10/daymet_v3_tmin_*.nc4", data_vars="minimum")

/snow3/huziy/Python/python_builds/anaconda3/envs/py3.6-a3/lib/python3.6/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, **kwargs)
    503         paths = sorted(glob(paths))
    504     else:
--> 505         paths = [str(p) if isinstance(p, path_type) else p for p in paths]
    506
    507     if not paths:

/snow3/huziy/Python/python_builds/anaconda3/envs/py3.6-a3/lib/python3.6/site-packages/xarray/backends/api.py in <listcomp>(.0)
    503         paths = sorted(glob(paths))
    504     else:
--> 505         paths = [str(p) if isinstance(p, path_type) else p for p in paths]
    506
    507     if not paths:

TypeError: open_dataset() got an unexpected keyword argument 'data_vars'

In [11]: ds = xarray.open_mfdataset("/snow3/huziy/Daymet_daily_derivatives/daymet_spatial_agg_tmin_10x10/daymet_v3_tmin_*.nc4")

In [12]: ds.data_vars = "minimum"
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-9026b067946d> in <module>()
----> 1 ds.data_vars = "minimum"

/snow3/huziy/Python/python_builds/anaconda3/envs/py3.6-a3/lib/python3.6/site-packages/xarray/core/common.py in __setattr__(self, name, value)
    180                 raise AttributeError(
    181                     "cannot set attribute %r on a %r object. Use __setitem__ "
--> 182                     "style assignment (e.g., `ds['name'] = ...`) instead to "
    183                     "assign variables." % (name, type(self).__name__))
    184         object.__setattr__(self, name, value)

AttributeError: can't set attribute

In [13]: ds.data_vars
Out[13]:
Data variables:
    tmin                     (time, y, x) float64 nan nan nan nan nan nan ...
    lon                      (time, y, x) float64 156.5 156.5 156.6 156.6 ...
    lat                      (time, y, x) float64 58.55 58.64 58.72 58.81 ...
    lambert_conformal_conic  (time) int64 -32767 -32767 -32767 -32767 -32767 ...

In [16]: xarray.__version__
Out[16]: '0.9.6'

Cheers

@spencerahill
Contributor

I think you've accidentally used "minimum" instead of "minimal"

@guziy
Contributor

guziy commented Sep 19, 2017

Thanks @spencerahill :
You are right, but the error message won't change, since the data_vars keyword is not known, and I am not able to change the dataset's attribute data_vars. I think I have to use concat explicitly...

Cheers
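
The traceback above shows open_mfdataset taking **kwargs, and the error names open_dataset(). A minimal sketch of why an unknown keyword surfaces that way, using hypothetical stand-in functions (`open_one`, `open_many`) rather than the real xarray code:

```python
def open_one(path, decode_times=True):
    """Hypothetical stand-in for a single-file opener."""
    return {"path": path, "decode_times": decode_times}


def open_many(paths, **kwargs):
    """Hypothetical stand-in for a multi-file opener.

    Every keyword it does not consume itself is forwarded unchanged
    to the single-file opener.
    """
    return [open_one(p, **kwargs) for p in paths]


try:
    open_many(["a.nc", "b.nc"], data_vars="minimal")
except TypeError as err:
    message = str(err)
    # The message names the inner function, open_one(), not open_many().
    print(message)
```

Because the keyword is rejected before its value is ever looked at, "minimum" and "minimal" fail in exactly the same way.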

@jhamman
Member

jhamman commented Sep 19, 2017

use of data_vars was deprecated in #473.

@guziy
Contributor

guziy commented Sep 19, 2017

This seems to be working, and no deprecation warning... (But probably I have to sort paths...)

In [8]: ds = xarray.concat([xarray.open_dataset(p, chunks={"time": 100}) for p in paths], data_vars="minimal", dim="time")

In [9]: ds
Out[9]:
<xarray.Dataset>
Dimensions:                  (time: 13505, x: 782, y: 808)
Coordinates:
  * x                        (x) float64 -4.556e+06 -4.546e+06 -4.536e+06 ...
  * y                        (y) float64 4.98e+06 4.97e+06 4.96e+06 4.95e+06 ...
  * time                     (time) datetime64[ns] 1993-01-01T12:00:09.140797440 ...
Data variables:
    lon                      (y, x) float64 156.5 156.5 156.6 156.6 156.7 ...
    lat                      (y, x) float64 58.55 58.64 58.72 58.81 58.9 ...
    lambert_conformal_conic  int16 -32767
    tmin                     (time, y, x) float64 nan nan nan nan nan nan ...

In [10]: lon = ds["lon"]

In [11]: lon.ndim
Out[11]: 2

Cheers

@shoyer
Member

shoyer commented Sep 19, 2017

Indeed, data_vars is a (somewhat confusingly named) argument to xarray.concat, and isn't deprecated. open_mfdataset could pass the argument on through to concat, but it doesn't do that yet.
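
A small in-memory sketch of what data_vars="minimal" does in xarray.concat. The variable names mirror the dataset printed above, but the values, shapes, and the `make_piece` helper are made up for illustration, and each piece stands in for one file.

```python
import numpy as np
import xarray


def make_piece(t):
    """Build a tiny one-time-step dataset (hypothetical stand-in for one file)."""
    return xarray.Dataset(
        {
            # Time-dependent variable: should be concatenated.
            "tmin": (("time", "x"), np.full((1, 3), float(t))),
            # Static variable, identical in every piece: should be kept once.
            "lon": (("x",), np.array([10.0, 20.0, 30.0])),
        },
        coords={"time": [t], "x": [0, 1, 2]},
    )


pieces = [make_piece(t) for t in range(4)]
combined = xarray.concat(pieces, dim="time", data_vars="minimal")

print(combined["tmin"].dims)  # ('time', 'x')
print(combined["lon"].dims)   # ('x',)
```

With "minimal", only data variables that already have the time dimension are concatenated, so lon keeps its original dimensions.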

@jhamman
Member

jhamman commented Sep 19, 2017

I stand corrected. Misread the old issue. ☕️ 😪

@guziy
Contributor

guziy commented Sep 19, 2017

Thanks @shoyer:
Is there a how-to-contribute guide? I am basically looking for how to write a test and run the tests to check that adding this option did not break anything...

Cheers
