Fixing GH1434 xr.concat loses coordinate dtype information with recarrays in 0.9 #1438
Conversation
…rrays that are numpy.recarrays. Used try/except to catch the exception that may arise when running isnull on a structured array. Added an appropriate test to test_dataarray to check for correct handling of dimensions that are structured arrays.

The following code WILL run with the fix but crashes on current xarray:

```python
import numpy as np
import xarray as xr

p_data = np.array([('John', 180), ('Stacy', 150), ('Dick', 200)],
                  dtype=[('name', '|S256'), ('height', int)])
p_data_1 = np.array([('John', 180), ('Stacy', 150), ('Dick', 200)],
                    dtype=[('name', '|S256'), ('height', int)])

weights_0 = xr.DataArray([80, 56, 120], dims=['participant'],
                         coords={'participant': p_data})
weights_1 = xr.DataArray([81, 52, 115], dims=['participant'],
                         coords={'participant': p_data_1})

print(weights_1 - weights_0)
```

…snull on structured array that happens to be one of the dims of DataArray

# Conflicts:
#	xarray/core/ops.py
#	xarray/tests/test_dataarray.py
Two problems were fixed:

1. The dtype of the dimension we concatenate along was lost.
2. The dtype of the other dimensions (ones that are aligned but not concatenated) was lost.

Fixing 1. required a small refactoring of the `concat` classmethod in variable.py: by default we check whether the dimensions can be concatenated using `numpy.concatenate`, and only if that fails do we fall back to the current method of concatenation using the `append` function from pandas.

As for 2., it turns out the dtype of the dimensions that are not concatenated was lost in the `align` function in alignment.py (see my comment tagged `# GH1434`). The solution was to store the dtypes of the original dimensions and reapply them to the dimension DataArrays after the `align` call.
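The fallback strategy described in point 1 can be sketched outside of xarray with plain NumPy. The helper name `_concat_preserving_dtype` and the object-dtype fallback below are illustrative assumptions, not xarray's actual code:

```python
import numpy as np

def _concat_preserving_dtype(arrays):
    """Concatenate along axis 0, trying numpy first (illustrative sketch).

    np.concatenate preserves structured (composite) dtypes, so try it
    first and fall back to an object-dtype result only if it fails.
    """
    try:
        return np.concatenate(arrays)
    except (TypeError, ValueError):
        # Fallback: coerce everything to object dtype -- this is the
        # lossy path that mirrors the old pandas-append behavior.
        out = np.empty(sum(len(a) for a in arrays), dtype=object)
        out[:] = [item for a in arrays for item in a]
        return out

p1 = np.array([('A', 180), ('B', 150)], dtype=[('name', 'S8'), ('height', 'i8')])
p2 = np.array([('C', 200)], dtype=[('name', 'S8'), ('height', 'i8')])
combined = _concat_preserving_dtype([p1, p2])
```

Because `np.concatenate` understands structured dtypes, the fast path keeps the composite dtype intact; only genuinely incompatible inputs hit the lossy fallback.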
xarray/core/alignment.py (outdated)

    # GH1434
    # dtype is lost after obj.indexes[dim]
    # problem originates in Indexes.__getitem__
    # in coordinates.py
I'm not sure why you're adding these comments here?
Removed those
xarray/core/combine.py (outdated)

    datasets = align(*datasets, join='outer', copy=False, exclude=[dim])

    # GH1434
    # restoring original dtype of the datasets' dimensions
This fix should either be done in `align` (actually inside `core.alignment.reindex_variables`) or not at all. We cannot preserve all dtypes in alignment operations; e.g., there is no missing value for structured dtypes. If you care about this, you should use `join='inner'` or `join='exact'` (which will appear in the next release, from #1330). It would be nice to surface these options up into `xarray.concat`, though.
Thanks for the pointer. It was indeed the place where the loss of dtype happened.
xarray/core/variable.py (outdated)

    # GH1434
    # Fixes bug: "xr.concat loses coordinate dtype
    # information with recarrays in 0.9"
    try:
Rather than using exceptions for control flow, can we divide this into two cases based upon whether any (all?) of the variables have dtype=object?
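The reviewer's suggestion can be sketched with plain NumPy; the function name and the object-dtype slow path below are illustrative assumptions, not the actual xarray implementation:

```python
import numpy as np

def concat_data(arrays):
    # Branch explicitly on dtype instead of using exceptions for
    # control flow (illustrative sketch, not xarray's real code).
    if any(a.dtype == object for a in arrays):
        # slow path: object arrays keep object dtype
        out = np.empty(sum(len(a) for a in arrays), dtype=object)
        out[:] = [x for a in arrays for x in a]
        return out
    # fast path: np.concatenate handles ordinary and structured dtypes
    return np.concatenate(arrays)
```

The dtype test makes the two code paths explicit, which is easier to reason about than catching whatever exception the fast path happens to raise.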
done
xarray/core/variable.py (outdated)

    variables = list(variables)
    first_var = variables[0]
    if len(variable_type_set) > 1:
        raise TypeError('Trying to concatenate variables of '
These extra checks would break some things that currently work with `np.concatenate` (e.g., `concatenate([float, int]) -> float`). So I would remove them, and replace this with the simpler `np.concatenate([v.data for v in variables])`.
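The promotion behavior being referred to is plain NumPy semantics and is easy to confirm directly:

```python
import numpy as np

# np.concatenate applies ordinary type promotion, so concatenating a
# float array with an int array yields a float result -- behavior the
# proposed extra type checks would have rejected.
result = np.concatenate([np.array([1.5, 2.5]), np.array([3, 4])])
```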
xarray/core/variable.py (outdated)

    variable_type = list(variable_type_set)[0]
    if not variable_type == np.ndarray:
        raise TypeError('Can only concatenate variables whose '
                        '_data member is ndarray')
This is already guaranteed by xarray's data model, so this should either be an `assert` or you should skip it entirely.
OK
xarray/core/variable.py (outdated)

    @classmethod
    def concat_pandas(cls, variables, positions=None):
If you keep these methods, they should be private (prefix them with `_`).
done
xarray/tests/__init__.py (outdated)

    @@ -101,7 +101,7 @@
     try:
         _SKIP_FLAKY = not pytest.config.getoption("--run-flaky")
         _SKIP_NETWORK_TESTS = not pytest.config.getoption("--run-network-tests")
    -except ValueError:
    +except (ValueError, AttributeError) as e:
Did `AttributeError` come up for you? Even if so, we don't need `as e` unless we're doing something with the error.
doc/whats-new.rst (outdated)

    - Attributes were being retained by default for some resampling
      operations when they should not. With the ``keep_attrs=False`` option, they
      will no longer be retained by default. This may be backwards-incompatible
      with some scripts, but the attributes may be kept by adding the
      ``keep_attrs=True`` option. By
      `Jeremy McGibbon <https://github.com/mcgibbon>`_.

    - Fixed bug in arithmetic operations on DataArray objects whose dimensions
      are numpy structured arrays or recarrays :issue:`861`, :issue:`837`.
This looks like a different fix?
fixed
… array) in reindex_variables (alignment.py) and implemented appropriate dtype-casting in case dtype information was lost for composite dtypes. Cleaned up the code
    # ensures that dtype of numpy structured arrays is preserved
    if len(args) and is_composite_dtype(var.dtype) \
            and idx_var.dtype != var.dtype:
        idx_var.data = idx_var.data.astype(var.dtype)
I still don't understand why this change is necessary. Can you give an example of what this would fix?
This is a simple demo that shows the problem. After concatenation, the dtype of the "participant" variable is lost, but it should be preserved, especially since `participant` is not the dimension we are concatenating along:
```python
import numpy as np
from xarray import DataArray, concat

p1 = np.array([('A', 180), ('B', 150), ('C', 200)],
              dtype=[('name', '|S256'), ('height', int)])
p2 = np.array([('D', 170), ('E', 250), ('F', 150)],
              dtype=[('name', '|S256'), ('height', int)])

data = np.arange(50, 80, 1, dtype=float)
dims = ['measurement', 'participant']

da1 = DataArray(
    data.reshape(10, 3),
    coords={
        'measurement': np.arange(10),
        'participant': p1,
    },
    dims=dims
)
da2 = DataArray(
    data.reshape(10, 3),
    coords={
        'measurement': np.arange(10),
        'participant': p2,
    },
    dims=dims
)

combined_2 = concat([da1, da2], dim='measurement')
assert combined_2.participant.dtype == da1.participant.dtype
```
@shoyer: let me know if this makes sense.
OK, this does indeed make sense. To isolate the issue from your example, let's see what we get from aligning instead:
```
>>> da1_aligned, da2_aligned = xarray.align(da1, da2, join='outer')
>>> da1
<xarray.DataArray (measurement: 10, participant: 3)>
array([[ 50.,  51.,  52.],
       [ 53.,  54.,  55.],
       [ 56.,  57.,  58.],
       [ 59.,  60.,  61.],
       [ 62.,  63.,  64.],
       [ 65.,  66.,  67.],
       [ 68.,  69.,  70.],
       [ 71.,  72.,  73.],
       [ 74.,  75.,  76.],
       [ 77.,  78.,  79.]])
Coordinates:
  * measurement  (measurement) int64 0 1 2 3 4 5 6 7 8 9
  * participant  (participant) [('name', 'S256'), ('height', '<i8')] ('A', 180) ...
>>> da1_aligned
<xarray.DataArray (measurement: 10, participant: 6)>
array([[ 50.,  51.,  52.,  nan,  nan,  nan],
       [ 53.,  54.,  55.,  nan,  nan,  nan],
       [ 56.,  57.,  58.,  nan,  nan,  nan],
       [ 59.,  60.,  61.,  nan,  nan,  nan],
       [ 62.,  63.,  64.,  nan,  nan,  nan],
       [ 65.,  66.,  67.,  nan,  nan,  nan],
       [ 68.,  69.,  70.,  nan,  nan,  nan],
       [ 71.,  72.,  73.,  nan,  nan,  nan],
       [ 74.,  75.,  76.,  nan,  nan,  nan],
       [ 77.,  78.,  79.,  nan,  nan,  nan]])
Coordinates:
  * participant  (participant) object ('A', 180) ('B', 150) ('C', 200) ...
  * measurement  (measurement) int64 0 1 2 3 4 5 6 7 8 9
```
Notice that the dtype of `participant` is now `object`.
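The dtype loss can be reproduced with NumPy alone; the sketch below imitates what happens when structured rows pass through an object-dtype index, and shows the explicit cast needed to recover the composite dtype (variable names are illustrative):

```python
import numpy as np

# Once structured rows pass through an object-dtype container (as
# happens inside a pandas Index), the composite dtype must be
# reapplied explicitly.
ptype = [('name', 'S8'), ('height', 'i8')]
p = np.array([('A', 180), ('B', 150)], dtype=ptype)

as_object = np.empty(len(p), dtype=object)   # what the index holds
as_object[:] = p.tolist()                    # rows become plain tuples

restored = np.array(as_object.tolist(), dtype=ptype)  # reapply dtype
```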
To fix this, we need a dtype safe version of `pandas.Index.__or__`, which is used for joining together indexes: https://github.com/maciekswat/xarray/blob/7cec75c6252c34aee46f3e7bb8c170c169429f37/xarray/core/alignment.py#L20

The cleanest way I see to handle this is to make the logic inside `align` always use some sort of xarray wrapper around `pandas.Index`, similar to (or building on) `PandasIndexAdapter`, that propagates dtype properly in these operations. I don't want to add dtype specific logic in `align`/`reindex_variables` that fixes things up after the fact, because the result is much less maintainable.
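As a rough illustration of what a "dtype safe" union could look like, here is a NumPy-only sketch that joins two structured indexes without coercing to object. `structured_union` is a hypothetical helper, not part of xarray or pandas, and unlike `pandas.Index.__or__` it sorts the result and requires identical dtypes:

```python
import numpy as np

def structured_union(a, b):
    # Concatenate then deduplicate entirely in NumPy, so the structured
    # dtype survives (pandas.Index.__or__ would coerce to object).
    return np.unique(np.concatenate([a, b]))

ptype = [('name', 'S8'), ('height', 'i8')]
p1 = np.array([('A', 180), ('B', 150)], dtype=ptype)
p2 = np.array([('B', 150), ('C', 200)], dtype=ptype)
u = structured_union(p1, p2)
```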
    data = indexes[0].append(indexes[1:])

    if positions is not None:
This block is the same for both `_concat_pandas` and `_concat_numpy` -- please move it up into the caller instead.
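The requested refactor is a standard hoisting of shared post-processing into the caller. Below is a NumPy-only sketch under the assumption that `positions` records which original positions each variable's elements came from; the function names are illustrative, not xarray's actual signatures:

```python
import numpy as np

def _concat_numpy(variables):
    # backend only concatenates; nothing else
    return np.concatenate(list(variables))

def concat(variables, positions=None):
    data = _concat_numpy(variables)
    if positions is not None:
        # shared step, done once in the caller: restore the original
        # ordering via the inverse permutation of the given positions
        data = data[np.argsort(np.concatenate(positions))]
    return data
```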
xarray/core/variable.py (outdated)

    indexes = [v._data for v in variables]

    if not indexes:
We don't need this special branch for numpy. Literally `return np.concatenate([v.data for v in variables])` would do for this function.
done
xarray/core/variable.py (outdated)

    @@ -1212,34 +1212,79 @@ def __setitem__(self, key, value):
         raise TypeError('%s values cannot be modified' % type(self).__name__)

    +    @classmethod
    +    def _concat_numpy(cls, variables, positions=None):
I'm not sure this has enough to it to need to be a separate function, but if so:

- It shouldn't be a `classmethod` -- we don't need to override it.
- The docstring should follow the numpydoc style, which we use throughout the project.
I can clean it up a little
xarray/core/variable.py (outdated)

    # GH1434
    # Fixes bug: "xr.concat loses coordinate dtype
    # information with recarrays in 0.9"
    if np.object in map(lambda var: var.dtype, variables):
Can you use a comprehension here instead, e.g., `any(var.dtype == np.object for var in variables)`?
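For reference, the suggested comprehension behaves as expected on structured arrays. A small sketch (note that the `np.object` alias has since been removed from NumPy, so plain `object` is used here):

```python
import numpy as np

# A generator comprehension reads more clearly than
# `np.object in map(...)` and behaves predictably for structured
# dtypes: a structured dtype compares unequal to object.
arrays = [
    np.array([('A', 180)], dtype=[('name', 'S8'), ('height', 'i8')]),
    np.array([1.0, 2.0]),
]
has_object = any(a.dtype == object for a in arrays)  # no object dtype here
```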
Done
I couldn't find a
I don't know why this was closed by #1545 -- this is an unrelated issue.
The issue seems to have been resolved in the meantime, so I'm closing. Thanks for the effort, @maciekswat.
- `git diff upstream/master | flake8 --diff`
- `whats-new.rst` for all changes and `api.rst` for new API