Dataset combine_by_coords unexpected behavior #8828

Closed
5 tasks done
schunkes opened this issue Mar 13, 2024 · 2 comments · Fixed by #9070
Labels
bug topic-combine combine/concat/merge

Comments

@schunkes
What happened?

I am trying to combine two datasets with partially overlapping data variables and identical coordinates, using the compat="override" option.
In the resulting dataset, the data variable that appears in both datasets contains the values from the dataset placed second in the list of datasets to combine.
The result is the same if I switch the order of the two datasets in the call to combine_by_coords.

What did you expect to happen?

I would expect that, for data variables that appear in both datasets, the values from the first dataset in the list passed to combine_by_coords are used.

Minimal Complete Verifiable Example

import numpy as np
import pandas as pd
import xarray as xr

temperature = np.random.randint(1,255,size=(9,10,10))
precipitation = np.random.randint(1,255,size=(9,10,10))
precipitation_alt = np.random.randint(255,1000,size=(9,10,10))
lon = np.linspace(10,100,10)
lat = np.linspace(10,100,10)
time_0 = pd.date_range("2014-09-06", periods=9, freq='2D')

ds_0 = xr.Dataset(
    data_vars=dict(
        temperature=(('time', 'y', 'x'), temperature),
        precipitation=(('time', 'y', 'x'), precipitation)),
    coords=dict(
        x=lon,
        y=lat,
        time=time_0)
)
ds_1 = xr.Dataset(
    data_vars=dict(
        temperature=(('time', 'y', 'x'), temperature),
        precipitation=(('time', 'y', 'x'), precipitation_alt),
        precipitation2=(('time', 'y', 'x'), precipitation)),
    coords=dict(
        x=lon,
        y=lat,
        time=time_0)
)

res1 = xr.combine_by_coords([ds_0, ds_1], compat="override")
res2 = xr.combine_by_coords([ds_1, ds_0], compat="override")

# In the first case, the variables precipitation and precipitation2 in the resulting dataset should contain the same values, but they do not.

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-91-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2023.1.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.10.1
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: None
distributed: None
matplotlib: 3.7.2
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: 0.21.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: 8.12.3
sphinx: None

schunkes added the bug and needs triage labels on Mar 13, 2024
TomNicholas added the topic-combine label and removed the needs triage label on Mar 13, 2024

@kmuehlbauer
Contributor

The root cause of your issue is that the datasets are re-sorted before being combined:

# Group by data vars
sorted_datasets = sorted(data_objects, key=vars_as_keys)
grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)

The first line re-sorts the data_objects to [ds_1, ds_0] because vars_as_keys will give something like this:

(('precipitation', 'temperature'), ('precipitation', 'precipitation2', 'temperature'))

and the sorted representation is:

(('precipitation', 'precipitation2', 'temperature'), ('precipitation', 'temperature'))

So the datasets end up in the order [ds_1, ds_0] before entering the underlying combine processing, regardless of the order in which they were passed.

I'm not sure how this could be resolved. Given what the docstring states, this is very surprising behaviour.
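
The reordering described in this comment can be reproduced without xarray. The sketch below is only an illustration: plain dicts stand in for the two datasets, and `vars_as_keys` is an assumed stand-in that returns the sorted variable names as a tuple, which matches the keys shown above but may not be the exact library implementation. `processing_order` and `labels` are hypothetical names introduced for the example.

```python
import itertools

# Plain dicts stand in for the two Datasets; only the variable names matter here.
ds_0 = {"temperature": 0, "precipitation": 0}
ds_1 = {"temperature": 0, "precipitation": 1, "precipitation2": 0}
labels = {id(ds_0): "ds_0", id(ds_1): "ds_1"}


def vars_as_keys(ds):
    # Assumed stand-in: the sorted variable names as a tuple.
    return tuple(sorted(ds))


def processing_order(data_objects):
    # Same two steps as the snippet quoted above: sort, then group by variables.
    sorted_datasets = sorted(data_objects, key=vars_as_keys)
    grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)
    return [labels[id(ds)] for _, group in grouped_by_vars for ds in group]


order_a = processing_order([ds_0, ds_1])
order_b = processing_order([ds_1, ds_0])
print(order_a)  # ['ds_1', 'ds_0']
print(order_b)  # ['ds_1', 'ds_0']
```

Tuples compare element by element, and 'precipitation2' sorts before 'temperature', so the key for ds_1 always comes first and the caller's order is lost.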

@kmuehlbauer
Contributor

It seems like the additional sorting is not needed at all.
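
If grouping by data variables is still wanted, one way to do it without sorting is to collect the groups in a dict, which keeps keys in first-seen order. This is a rough sketch under the same assumptions as before (dicts in place of datasets, an assumed `vars_as_keys`), and `group_by_vars_in_input_order` is a hypothetical helper. It is not necessarily what the linked fix does.

```python
def vars_as_keys(ds):
    # Assumed stand-in: the sorted variable names as a tuple.
    return tuple(sorted(ds))


def group_by_vars_in_input_order(data_objects):
    # Dicts preserve insertion order, so groups appear in the order
    # in which their first member was passed in.
    groups = {}
    for ds in data_objects:
        groups.setdefault(vars_as_keys(ds), []).append(ds)
    return groups


ds_0 = {"temperature": 0, "precipitation": 0}
ds_1 = {"temperature": 0, "precipitation": 1, "precipitation2": 0}

first_keys = list(group_by_vars_in_input_order([ds_0, ds_1]))
second_keys = list(group_by_vars_in_input_order([ds_1, ds_0]))
print(first_keys[0])   # ('precipitation', 'temperature')
print(second_keys[0])  # ('precipitation', 'precipitation2', 'temperature')
```

With this grouping, the group belonging to the first dataset passed in is processed first, which is what compat="override" needs in order to pick values from the first dataset.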

3 participants