-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
combine_by_coords can succed when it shouldn't #4824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
merge leads to interlaced coords In [2]: xr.merge([da0, da1])
Out[2]:
<xarray.Dataset>
Dimensions: (lat: 10, t: 2)
Coordinates:
* lat (lat) float64 0.0 1e-06 1.0 1.0 2.0 2.0 3.0 3.0 4.0 4.0
* t (t) int64 1 2
Data variables:
a (t, lat) float64 0.0 nan 1.0 nan 2.0 nan ... 2.0 nan 3.0 nan 4.0 concat leads to - well - concatenated coords xr.concat([da0, da1], dim="lat")
Out[5]:
<xarray.Dataset>
Dimensions: (lat: 10, t: 2)
Coordinates:
* t (t) int64 1 2
* lat (lat) float64 0.0 1.0 2.0 3.0 4.0 1e-06 1.0 2.0 3.0 4.0
Data variables:
a (t, lat) float64 0.0 1.0 2.0 3.0 4.0 nan ... 0.0 1.0 2.0 3.0 4.0 Yes I think I'd be interested to fix this... One question on the scope of
Or as ds_a = xr.Dataset({"data": (("y", "x"), [[1, 2]]), "x": [0, 1], "y": [0]})
ds_b = xr.Dataset({"data": (("y", "x"), [[3, 4, 5]]), "x": [2, 3, 4], "y": [0]})
ds_c = xr.Dataset({"data": (("y", "x"), [[11, 12, 14]]), "x": [0, 1, 2], "y": [1]})
ds_d = xr.Dataset({"data": (("y", "x"), [[14, 15]]), "x": [3, 4], "y": [1]})
xr.combine_by_coords([ds_a, ds_b, ds_c, ds_d]).data Should that work or not? Current behaviour Currently it yields the following: <xarray.DataArray 'data' (y: 2, x: 5)>
array([[ 1, 2, 3, 4, 5],
[11, 12, 14, 14, 15]])
Coordinates:
* x (x) int64 0 1 2 3 4
* y (y) int64 0 1 However, if |
Thanks for these examples @mathause , these are useful.
Not sure if this is what you meant, but to be clear:
As far I can see here then
This is a good question, but I'm 99% sure I didn't intend for either combine function to be able to handle this case. The problem with this case is that it's not order-invariant: you could concat Try comparing it to the behaviour of |
tl;dr, I currently think there are two issues:
|
Thanks for the thorough answer! Thus, we need to figure out if there is a "bug" in Disallowing "ragged hypercubes" certainly makes the problem much easier. Then we can say: "if two coords have the same start they need to be equal" (I think). |
I don't actually know - this behaviour of
Yes exactly. Well, more specifically, "if they have the same start they need to be equal in length otherwise its a ragged hypercube, and if they have the same start, equal in length, but different values in the middle then it's a valid hypercube but an inconsistent dataset". It is currently assumed that the user passes a valid hypercube with consistent data, see this comment in
Though I realise now that I don't think this assumption is made explicit in the docs anywhere, instead it just talks about coords being monotonic. If people pass "dodgy" hypercubes then this could currently fail in multiple ways (including silently), but the reason we didn't just actually check that the coords were completely equal throughout was because then you have to load all the actual values from the files, which could incur a significant performance cost. Adding a check of just the last value of each coord would help considerably (it should solve #4077), but unless we check every value then there will always be a way to silently produce a nonsense result by feeding it inconsistent data. We might consider some kind of flag for whether or not these checks should be done, which defaults to on, and users can turn off if they trust their data but want more speed. |
this is the result of You can avoid this by setting |
Thanks @dcherian. So passing <xarray.Dataset>
Dimensions: (lat: 5, t: 2)
Coordinates:
* lat (lat) int64 0 1 2 3 4
* t (t) int64 1 2
Data variables:
a (t, lat) int64 0 1 2 3 4 0 1 2 3 4 which seems fine - that solves issue (1) doesn't it? Your other example though @mathause, of allowing certain ragged arrays to pass through, presumably we still need a new check of some kind to disallow that? |
But then it also overrides when it should concatenate... import numpy as np
import xarray as xr
data = np.arange(5).reshape(1, 5)
x = np.arange(5)
x_name = "lat"
da0 = xr.DataArray(data, dims=("t", x_name), coords={"t": [1], x_name: x}).to_dataset(name="a")
x = x + 5
da1 = xr.DataArray(data, dims=("t", x_name), coords={"t": [2], x_name: x}).to_dataset(name="a")
ds = xr.combine_by_coords((da0, da1), join="override") out <xarray.Dataset>
Dimensions: (lat: 5, t: 2)
Coordinates:
* lat (lat) int64 0 1 2 3 4
* t (t) int64 1 2
Data variables:
a (t, lat) int64 0 1 2 3 4 0 1 2 3 4 Again if we set <xarray.Dataset>
Dimensions: (t: 1, y: 10)
Coordinates:
* t (t) int64 1
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
a (t, y) int64 0 1 2 3 4 0 1 2 3 4
|
Is there a reasonable way of preventing the ambiguity in |
I honestly don't really understand |
For I don't think it makes sense at all for |
Thanks @dcherian that helps. It does seem silly to be like "I'm going to use the coordinates to decide how everything should be concatenated, but I'm also happy to change some of those coordinate values whilst I'm combining". Does that mean we should just not to allow |
I think so. I suspect the original problem can only be solved by allowing approximate alignment within a specified tolerance. |
I add the example from #4753 (comment) here. I am not 100% if that is the same problem or not: def combine_by_coords_problem(name, join="override"):
ds0 = xr.Dataset(coords={"x1": [1, 2, 3], name: [10, 20, 30]})
ds1 = xr.Dataset(coords={"x1": [4, 5, 6], name: [40, 50, 60]})
return xr.combine_by_coords([ds0, ds1], join=join)
combine_by_coords_problem("x0") # concatenates 1, 2, 3, 4, 5, 6
combine_by_coords_problem("x2") # concatenates 10, 20, 30, 40, 50, 60 with
|
What happened:
combine_by_coords
can succeed when it should not - depending on the name of the dimensions (which determines the order of operations incombine_by_coords
).What you expected to happen:
Minimal Complete Verifiable Example:
returns:
Thus lat is interlaced - it don't think
combine_by_coords
should do this. If you setand run the example again, it returns:
Anything else we need to know?:
combine_by_coords
concatenates over all dimensions where the coords are different - thereforecompat="override"
doesn't actually do anything? Or does it?xarray/xarray/core/combine.py
Line 69 in ba42c08
cc @dcherian @TomNicholas
Environment:
Output of xr.show_versions()
The text was updated successfully, but these errors were encountered: