
open_mfdataset cannot open multiple netcdf files written by NASA/GEOS MAPL v1.0.0 that contain data on a cubed-sphere grid #3286


Closed
yantosca opened this issue Sep 6, 2019 · 9 comments

Comments

@yantosca

yantosca commented Sep 6, 2019

MCVE Code Sample

First download these files:

Then run this code:

import xarray as xr
filelist = ['GCHP.SpeciesConc.20160716_1200z.nc4', 'GCHP.AerosolMass.20160716_1200z.nc4']
ds = xr.open_mfdataset(filelist)
print(ds)

Expected Output

This should load data from both files into a single xarray Dataset object and print its contents.

Problem Description

Instead, this error occurs:

  File "./run_1mo_benchmark.py", line 479, in <module>
    ds = xr.open_mfdataset(filelist)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/backends/api.py", line 719, in open_mfdataset
    ids=ids)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 553, in _auto_combine
    data_vars=data_vars, coords=coords)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 475, in _combine_nd
    compat=compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 493, in _auto_combine_all_along_first_dim
    data_vars, coords)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/combine.py", line 514, in _auto_combine_1d
    merged = merge(concatenated, compat=compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 532, in merge
    variables, coord_names, dims = merge_core(dict_like_objects, compat, join)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 451, in merge_core
    variables = merge_variables(expanded, priority_vars, compat=compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 170, in merge_variables
    merged[name] = unique_variable(name, var_list, compat)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/merge.py", line 90, in unique_variable
    % (name, out, var))
xarray.core.merge.MergeError: conflicting values for variable 'anchor' on objects to be combined:
first value: <xarray.Variable (nf: 6, ncontact: 4)>
dask.array<shape=(6, 4, 4), dtype=int32, chunksize=(6, 4, 4)>
Attributes:
    long_name:  anchor point
second value: <xarray.Variable (nf: 6, ncontact: 4)>
dask.array<shape=(6, 4, 4), dtype=int32, chunksize=(6, 4, 4)>
Attributes:
    long_name:  anchor point

It seems to get hung up on trying to merge the "anchor" variable. As a workaround, if I drop the "anchor" variable from both datasets and then use xr.open_mfdataset, the merge works properly.
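
For reference, a minimal sketch of that workaround, reusing the two filenames from the code sample above (note: Dataset.drop is the method name on the xarray version in this report, 0.12.1; newer releases prefer the equivalent Dataset.drop_vars):

import xarray as xr

# Open each file separately and drop the problematic "anchor" variable.
ds1 = xr.open_dataset('GCHP.SpeciesConc.20160716_1200z.nc4').drop('anchor')
ds2 = xr.open_dataset('GCHP.AerosolMass.20160716_1200z.nc4').drop('anchor')

# With "anchor" removed from both datasets, the merge succeeds.
ds = xr.merge([ds1, ds2])
print(ds)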

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.6.1

xarray: 0.12.1
pandas: 0.25.1
numpy: 1.16.4
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 41.2.0
pip: 19.2.2
conda: None
pytest: 4.2.0
IPython: 7.7.0
sphinx: 2.1.2

@dcherian
Contributor

dcherian commented Sep 6, 2019

Is anchor equal in the two files or does it need to be concatenated?

@yantosca
Author

yantosca commented Sep 6, 2019

They appear to be the same:

import xarray as xr
ds1 = xr.open_dataset('GCHP.SpeciesConc.20160716_1200z.nc4')
ds2 = xr.open_dataset('GCHP.AerosolMass.20160716_1200z.nc4')
print(ds1['anchor'].values - ds2['anchor'].values)

Which gives:

[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]]
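
Equivalently, a compact check of the same thing (a small sketch assuming numpy is available; it reopens the two files used above):

import numpy as np
import xarray as xr

ds1 = xr.open_dataset('GCHP.SpeciesConc.20160716_1200z.nc4')
ds2 = xr.open_dataset('GCHP.AerosolMass.20160716_1200z.nc4')

# True if every element of the two "anchor" arrays is identical.
print(np.array_equal(ds1['anchor'].values, ds2['anchor'].values))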

I am not 100% sure what this "anchor" variable represents. It was added when NASA updated MAPL to v1.0.0. It seems to be something to do with the cubed-sphere coordinates. But for the purposes of plotting and analyzing the data we don't need it.

If it helps...I also get this error if I just try to subtract the DataArrays from each other directly, instead of the numpy ndarray values:

import xarray as xr
ds1 = xr.open_dataset('GCHP.SpeciesConc.20160716_1200z.nc4')
ds2 = xr.open_dataset('GCHP.AerosolMass.20160716_1200z.nc4')
dr = ds1['anchor'] - ds2['anchor']
print(dr)

Which gives:

Traceback (most recent call last):
  File "./run_1mo_benchmark.py", line 481, in <module>
    dr = ds1['anchor'] - ds2['anchor']
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo   /lib/python3.6/site-packages/xarray/core/dataarray.py", line 2009, in func
    if not reflexive
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 1767, in func
    self_data, other_data, dims = _broadcast_compat_data(self, other)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 2043, in _broadcast_compat_data
    new_self, new_other = _broadcast_compat_variables(self, other)
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 2018, in _broadcast_compat_variables
    dims = tuple(_unified_dims(variables))
  File "/net/seasasfs02/srv/export/seasasfs02/share_root/ryantosca/python/geo/miniconda/envs/geo/lib/python3.6/site-packages/xarray/core/variable.py", line 2001, in _unified_dims
    'dimensions: %r' % list(var_dims))
ValueError: broadcasting cannot handle duplicate dimensions: ['nf', 'ncontact', 'ncontact']

If the two arrays are the same like this, is there a way to manually tell open_mfdataset not to broadcast them, but just to use the same values?

@dcherian
Contributor

dcherian commented Sep 6, 2019

It looks like anchor has a repeated dimension name. This is not well supported in xarray. See #1378. If you don't need it, then I think it's best to drop it.
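
For context, the repeated dimension name is visible directly on the variable; a quick illustrative check on one of the files (the dims tuple matches the one reported in the ValueError above):

import xarray as xr

ds1 = xr.open_dataset('GCHP.SpeciesConc.20160716_1200z.nc4')

# 'ncontact' appears twice, which is what merging and broadcasting choke on.
print(ds1['anchor'].dims)   # ('nf', 'ncontact', 'ncontact')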

@yantosca
Author

yantosca commented Sep 6, 2019

Thanks again. I will implement a workaround to drop it (probably a wrapper function that calls open_mfdataset). Good to know.

@jhamman
Member

jhamman commented Sep 6, 2019

You should be able to do that with the drop_variables kwarg to open_mfdataset. If that doesn't work, the preprocess kwarg is the Swiss Army Knife here that lets you pass in a custom function that will be applied before the datasets are combined.

@dcherian
Contributor

dcherian commented Sep 6, 2019

You can use the drop_variables kwarg. This is passed down to open_dataset. For more general manipulation, you can use the preprocess argument.
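
A minimal sketch of both suggestions, reusing the file list from the code sample at the top of the issue (strip_anchor is just an illustrative name):

import xarray as xr

filelist = ['GCHP.SpeciesConc.20160716_1200z.nc4', 'GCHP.AerosolMass.20160716_1200z.nc4']

# Option 1: drop_variables is forwarded to each underlying open_dataset call.
ds = xr.open_mfdataset(filelist, drop_variables='anchor')

# Option 2: preprocess applies an arbitrary function to each dataset
# before they are combined (handy for more general cleanup).
def strip_anchor(ds):
    return ds.drop('anchor')

ds = xr.open_mfdataset(filelist, preprocess=strip_anchor)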

@dcherian
Contributor

dcherian commented Sep 6, 2019

@jhamman Hahaha.

@yantosca
Author

yantosca commented Sep 6, 2019

Thanks, I'll check it out. Wasn't aware of drop_variables.

@crusaderky
Contributor

Closing as a duplicate of #1378

yantosca added a commit to geoschem/gcpy that referenced this issue Sep 6, 2019
According to xarray issues:
   pydata/xarray#3286
   pydata/xarray#1378

The open_mfdataset function has problems in creating a merged
dataset from multiple files in which variables have repeated
dimension names.  The easiest thing to do in this case is to
prevent such variables from being read in.

We now have added the drop_variables keyword to avoid reading
in the "anchor" variable in all calls to open_dataset and
open_mfdataset in both benchmark.py and core.py.  This variable is
only present in GCHP-created netCDF files using MAPL v1.0.0, which
is in GCHP 12.5.0 and later.

This commit should resolve GCPy issue #26:
   #26

Signed-off-by: Bob Yantosca <[email protected]>