Fixing GH1434 xr.concat loses coordinate dtype information with recarrays in 0.9 #1438

Closed (wants to merge 15 commits)
Changes from 11 commits
5 changes: 5 additions & 0 deletions doc/whats-new.rst
@@ -624,13 +624,18 @@ Enhancements
Bug fixes
~~~~~~~~~

- Fixed loss of dimension coordinate dtype during ``concat`` operations
(:issue:`1434`). By
`Maciek Swat <https://github.com/maciekswat>`_.

- Attributes were being retained by default for some resampling
operations when they should not. With the ``keep_attrs=False`` option, they
will no longer be retained by default. This may be backwards-incompatible
with some scripts, but the attributes may be kept by adding the
``keep_attrs=True`` option. By
`Jeremy McGibbon <https://github.com/mcgibbon>`_.

- Fixed bug in arithmetic operations on DataArray objects whose dimensions
are numpy structured arrays or recarrays (:issue:`861`, :issue:`837`).
Member: This looks like a different fix?

Author: fixed
- Concatenating xarray objects along an axis with a MultiIndex or PeriodIndex
preserves the nature of the index (:issue:`875`). By
`Stephan Hoyer <https://github.com/shoyer>`_.
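For readers unfamiliar with the coordinates involved, here is a minimal numpy-only sketch of the structured arrays ("recarrays") these entries refer to; the values are hypothetical:

```python
import numpy as np

# A structured array like the 'participant' coordinate in the tests below.
p = np.array([('A', 180), ('B', 150)],
             dtype=[('name', '|S256'), ('height', int)])

print(p['name'])   # access a field by name
print(p.dtype)     # the compound dtype that GH1434 was losing on concat
```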
4 changes: 4 additions & 0 deletions xarray/core/alignment.py
@@ -103,6 +103,10 @@ def align(*objects, **kwargs):
for dim in obj.dims:
if dim not in exclude:
try:
# GH1434
# dtype is lost after obj.indexes[dim]
# problem originates in Indexes.__getitem__
# in coordinates.py
Member: I'm not sure why you're adding these comments here?

Author: Removed those

index = obj.indexes[dim]
except KeyError:
unlabeled_dim_sizes[dim].add(obj.sizes[dim])
21 changes: 21 additions & 0 deletions xarray/core/combine.py
@@ -207,8 +207,29 @@ def _dataset_concat(datasets, dim, data_vars, coords, compat, positions):

dim, coord = _calc_concat_dim_coord(dim)
datasets = [as_dataset(ds) for ds in datasets]

# GH1434
# constructing a dictionary that will be used to preserve dtype
# of the original dataset dimensions
dtype_dict = {}
for ds in datasets:
for dim_name in ds.dims:
if dim_name != dim:
dtype_dict[dim_name] = ds[dim_name].dtype

# align loses original dtype of the datasets' dim variables
datasets = align(*datasets, join='outer', copy=False, exclude=[dim])

# GH1434
# restoring original dtype of the datasets' dimensions
Member: This fix should either be done in align (actually inside core.alignment.reindex_variables) or not at all. We cannot preserve all dtypes in alignment operations, e.g., there is no missing value for structured dtypes. If you care about this, you should use join='inner' or join='exact' (will appear in the next release, from #1330). It would be nice to surface these options up into xarray.concat, though.

Author: Thanks for the pointer. It was indeed the place where the loss of dtype happened.
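The reviewer's point about join modes can be seen even with plain integer coordinates; a minimal sketch with hypothetical data: join='outer' unions the labels and must invent fill values, while join='inner' keeps only the shared labels and needs none.

```python
import numpy as np
import xarray as xr

a = xr.Dataset({'v': ('x', [1.0, 2.0])}, coords={'x': [0, 1]})
b = xr.Dataset({'v': ('x', [3.0, 4.0])}, coords={'x': [1, 2]})

# Outer join: union of labels, gaps filled with NaN.
oa, ob = xr.align(a, b, join='outer')

# Inner join: only shared labels, no fill values introduced.
ia, ib = xr.align(a, b, join='inner')
```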

for ds in datasets:
for dim_name, dim_dtype in dtype_dict.items():
try:
ds[dim_name] = ds[dim_name].astype(dim_dtype)
except KeyError:
pass

concat_over = _calc_concat_over(datasets, dim, data_vars, coords)

def insert_result_variable(k, v):
85 changes: 72 additions & 13 deletions xarray/core/variable.py
@@ -1212,35 +1212,94 @@ def __setitem__(self, key, value):
raise TypeError('%s values cannot be modified' % type(self).__name__)

@classmethod
def concat_numpy(cls, variables, positions=None):
"""
Concatenates variables backed by numpy arrays. Works for variables
whose dtype is different from numpy.object. If a variable's dtype
is numpy.object this raises TypeError, and the "concat" function
falls back to concat_pandas.
:param variables: list of variables to concatenate
:return: concatenated data
"""
variables = list(variables)
variable_type_set = set(type(v.data) for v in variables)

if len(variable_type_set) > 1:
raise TypeError('Trying to concatenate variables of '
'different types')

Member: These extra checks would break some things that currently work with np.concatenate (e.g., concatenate([float, int]) -> float). So I would remove them, and replace this with the simpler np.concatenate([v.data for v in variables]).
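The promotion behavior the reviewer mentions is easy to check with a quick numpy sketch:

```python
import numpy as np

# np.concatenate promotes mixed numeric dtypes instead of raising,
# which the strict same-type check above would disallow.
out = np.concatenate([np.array([1.0, 2.0]), np.array([3, 4])])
print(out.dtype)  # float64
```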

variable_type = list(variable_type_set)[0]
if not variable_type == np.ndarray:
raise TypeError('Can only concatenate variables whose '
'_data member is ndarray')
Member: This is already guaranteed by xarray's data model, so this should either be an assert or you should skip it entirely.

Author: OK


if variables[0].dtype == np.object:
raise TypeError('concat_numpy only handles variables whose '
'dtype is different from numpy.object')

indexes = [v._data for v in variables]

if not indexes:
data = []
else:
data = np.concatenate(indexes)

Member: We don't need this special branch for numpy. Literally return np.concatenate([v.data for v in variables]) would do for this function.

Author: done

if positions is not None:
indices = nputils.inverse_permutation(
np.concatenate(positions))
data = data.take(indices)

return data

@classmethod
def concat_pandas(cls, variables, positions=None):
"""
Concatenates variables. This is a generic function that handles
all cases for which concat_numpy does not work.
:param variables: list of variables to concatenate
:return: concatenated data
"""

Member: if you keep these methods, they should be private (preface with _)

Author: done
indexes = [v._data.array for v in variables]

if not indexes:
data = []
else:
data = indexes[0].append(indexes[1:])

if positions is not None:
indices = nputils.inverse_permutation(
np.concatenate(positions))
data = data.take(indices)

return data

Member: This block is the same for both _concat_pandas and _concat_numpy -- please move it up into the caller instead.
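What nputils.inverse_permutation accomplishes in the positions branch can be sketched with plain numpy (the values here are hypothetical; the real helper lives in xarray.core.nputils):

```python
import numpy as np

# Data was concatenated group by group; positions records where each
# group's elements belong in the original order.
positions = [np.array([2, 0]), np.array([1])]
data = np.array(['c', 'a', 'b'])  # group order: items from index 2, 0, then 1

order = np.concatenate(positions)       # [2, 0, 1]
inverse = np.empty_like(order)
inverse[order] = np.arange(order.size)  # invert the permutation

print(data.take(inverse))  # back in original index order
```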

@classmethod
def concat(cls, variables, dim='concat_dim', positions=None,
shortcut=False):
"""Specialized version of Variable.concat for
IndexVariable objects.
This exists because we want to avoid converting Index objects to NumPy
arrays, if possible.
"""
if not isinstance(dim, basestring):
dim, = dim.dims

variables = list(variables)

first_var = variables[0]

if any(not isinstance(v, cls) for v in variables):
raise TypeError('IndexVariable.concat requires that all input '
'variables be IndexVariable objects')

# GH1434
# Fixes bug: "xr.concat loses coordinate dtype
# information with recarrays in 0.9"
try:
data = cls.concat_numpy(variables, positions)
except TypeError:
data = cls.concat_pandas(variables, positions)

Member: Rather than using exceptions for control flow, can we divide this into two cases based upon whether any (all?) of the variables have dtype=object?

Author: done

attrs = OrderedDict(first_var.attrs)
if not shortcut:
for var in variables:
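The dispatch the reviewer asks for, branching on dtype up front instead of catching TypeError, could look like this; a sketch only, and concat_dispatch is a hypothetical helper, not xarray API:

```python
import numpy as np

def concat_dispatch(arrays):
    """Pick a concat strategy by inspecting dtypes, not via exceptions."""
    if any(a.dtype == object for a in arrays):
        # Object-dtype indexes would go through the pandas-based path.
        return 'pandas'
    # Plain numeric or structured dtypes concatenate fine with numpy.
    return 'numpy'

print(concat_dispatch([np.arange(3), np.arange(3)]))        # numpy
print(concat_dispatch([np.array(['x', 1], dtype=object)]))  # pandas
```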
2 changes: 1 addition & 1 deletion xarray/tests/__init__.py
@@ -101,7 +101,7 @@
try:
_SKIP_FLAKY = not pytest.config.getoption("--run-flaky")
_SKIP_NETWORK_TESTS = not pytest.config.getoption("--run-network-tests")
except ValueError:
except (ValueError, AttributeError) as e:
Member: did AttributeError come up for you? Even if so, we don't need as e unless we're doing something with the error.

# Can't get config from pytest, e.g., because xarray is installed instead
# of being run from a development version (and hence conftests.py is not
# available). Don't run flaky tests.
Expand Down
45 changes: 45 additions & 0 deletions xarray/tests/test_combine.py
@@ -75,6 +75,51 @@ def rectify_dim_order(dataset):
expected['dim1'] = dim
self.assertDatasetIdentical(expected, concat(datasets, dim))

def test_concat_dtype_preservation(self):
"""
This test checks whether concatennation of two DataArrays
along the axis whose dimension is numpy structured array
preserves dtype of the numpy structured array
"""

p1 = np.array([('A', 180), ('B', 150), ('C', 200)],
dtype=[('name', '|S256'), ('height', int)])
p2 = np.array([('D', 170), ('E', 250), ('F', 150)],
dtype=[('name', '|S256'), ('height', int)])

data = np.arange(50, 80, 1, dtype=np.float)

dims = ['measurement', 'participant']

da1 = DataArray(
data.reshape(10, 3),
coords={
'measurement': np.arange(10),
'participant': p1,
},
dims=dims
)

da2 = DataArray(
data.reshape(10, 3),
coords={
'measurement': np.arange(10),
'participant': p2,
},
dims=dims
)

combined_1 = concat([da1, da2], dim='participant')

assert combined_1.participant.dtype == da1.participant.dtype
assert combined_1.measurement.dtype == da1.measurement.dtype

combined_2 = concat([da1, da2], dim='measurement')

assert combined_2.participant.dtype == da1.participant.dtype
assert combined_2.measurement.dtype == da1.measurement.dtype

def test_concat_data_vars(self):
data = Dataset({'foo': ('x', np.random.randn(10))})
objs = [data.isel(x=slice(5)), data.isel(x=slice(5, None))]