
Fixing GH1434 xr.concat loses coordinate dtype information with recarrays in 0.9 #1438


Closed
wants to merge 15 commits

Conversation

@maciekswat (Contributor) commented Jun 1, 2017

…rrays that are numpy.recarrays

The following code will run with the fix but crashes with current xarray:

import numpy as np
import xarray as xr

p_data = np.array([('John', 180), ('Stacy', 150), ('Dick', 200)],
                  dtype=[('name', '|S256'), ('height', int)])
p_data_1 = np.array([('John', 180), ('Stacy', 150), ('Dick', 200)],
                    dtype=[('name', '|S256'), ('height', int)])

weights_0 = xr.DataArray([80, 56, 120], dims=['participant'],
                         coords={'participant': p_data})
weights_1 = xr.DataArray([81, 52, 115], dims=['participant'],
                         coords={'participant': p_data_1})

print(weights_1 - weights_0)
…rrays that are numpy.recarrays. Used try/except to catch the exception that may arise when running isnull on a structured array. Added an appropriate test to test_dataarray to check for correct handling of dimensions that are structured arrays.

…snull on structured array that happens to be one of the dims of DataArray
# Conflicts:
#	xarray/core/ops.py
#	xarray/tests/test_dataarray.py
1. The dtype of the dimension we concatenate along was lost.
2. The dtype of the other dimensions (those that are aligned rather than concatenated) was lost.

Fixing 1. required a small refactoring of the "concat" classmethod in variables.py: by default we check whether the dimensions can be concatenated using numpy.concatenate, and if that fails we fall back to the current method of concatenation using the append function from pandas.
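The fallback described above can be sketched roughly like this (a minimal illustration rather than the PR's actual code; the helper name is hypothetical):

```python
import numpy as np
import pandas as pd

def concat_index_values(arrays):
    # Minimal sketch of the fallback described above: try np.concatenate
    # first, which preserves structured (recarray) dtypes; if the dtypes
    # cannot be concatenated that way, fall back to the pandas
    # Index.append path used previously.
    try:
        return np.concatenate([np.asarray(a) for a in arrays])
    except (TypeError, ValueError):
        indexes = [pd.Index(a) for a in arrays]
        return indexes[0].append(indexes[1:])
```

For two arrays sharing a structured dtype, the np.concatenate branch keeps that dtype intact, which is exactly what the pandas path loses.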

As for 2., it turns out the dtype of the non-concatenated dimensions was lost in the "align" function in alignment.py (see my comment with the # GH1434 tag). The solution was to store the dtypes of the original dimensions and reapply them to the dimension DataArrays after the "align" call.
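The store-and-reapply idea can be illustrated without xarray at all (a minimal sketch; the object-array detour stands in for what alignment does to structured coordinates):

```python
import numpy as np

dt = [('name', 'S8'), ('height', 'i8')]
orig = np.array([(b'A', 180), (b'B', 150)], dtype=dt)

# Alignment can hand coordinate values back as a plain object array of
# tuples, which is how the structured dtype gets lost:
as_object = np.empty(len(orig), dtype=object)
for i, record in enumerate(orig.tolist()):
    as_object[i] = record
assert as_object.dtype == object

# The workaround described above: remember the original dtype up front
# and re-cast the coordinate values after the align call.
restored = np.array(as_object.tolist(), dtype=dt)
assert restored.dtype == np.dtype(dt)
```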
# GH1434
# dtype is lost after obj.indexes[dim]
# problem originates in Indexes.__getitem__
# in coordinates.py
Member:

I'm not sure why you're adding these comments here?


Removed those

datasets = align(*datasets, join='outer', copy=False, exclude=[dim])

# GH1434
# restoring original dtype of the datasets' dimensions
Member:

This fix should either be done in align (actually inside core.alignment.reindex_variables) or not at all. We cannot preserve all dtypes in alignment operations, e.g., there is no missing value for structured dtypes. If you care about this, you should use join='inner' or join='exact' (which will appear in the next release, from #1330). It would be nice to surface these options up into xarray.concat, though.
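The missing-value point can be seen with plain pandas: an outer-style reindex introduces NaN, which already forces a dtype change for int data, and structured dtypes have no NaN equivalent at all (a small illustration of the comment above, not code from the PR):

```python
import numpy as np
import pandas as pd

s = pd.Series([180, 150], index=['A', 'B'])

# Reindexing against an outer-joined index introduces missing values;
# for int data this forces promotion to float64. Structured dtypes have
# no missing-value sentinel at all, which is why an outer join must
# degrade them to object dtype.
out = s.reindex(['A', 'B', 'C'])
assert out.dtype == np.float64
assert np.isnan(out['C'])
```

With join='inner' or join='exact' no fill values are needed, so the question of a missing-value sentinel never arises.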


Thanks for the pointer. It was indeed the place where loss of dtype happened

# GH1434
# Fixes bug: "xr.concat loses coordinate dtype
# information with recarrays in 0.9"
try:
Member:

Rather than using exceptions for control flow, can we divide this into two cases based upon whether any (all?) of the variables have dtype=object?
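The suggested dtype-based dispatch could look roughly like this (an illustrative sketch under the reviewer's suggestion, not the PR's actual code; the function name is hypothetical):

```python
import numpy as np
import pandas as pd

def concat_values(arrays):
    # Dispatch on dtype instead of using exceptions for control flow:
    # object-dtype inputs go through the pandas Index.append path,
    # everything else (including structured dtypes) through
    # np.concatenate.
    if any(np.asarray(a).dtype == object for a in arrays):
        indexes = [pd.Index(a) for a in arrays]
        return indexes[0].append(indexes[1:])
    return np.concatenate([np.asarray(a) for a in arrays])
```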


done

variables = list(variables)
first_var = variables[0]
if len(variable_type_set) > 1:
raise TypeError('Trying to concatenate variables of '
Member:

These extra checks would break some things that currently work with np.concatenate (e.g., concatenate([float, int]) -> float). So I would remove them, and replace this with the simpler np.concatenate([v.data for v in variables]).
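The promotion behaviour referred to here is easy to verify (a quick illustration):

```python
import numpy as np

# np.concatenate already promotes compatible dtypes, e.g. float + int
# -> float64, so a strict same-type check on the inputs would reject
# this perfectly valid case.
out = np.concatenate([np.array([1.5, 2.5]), np.array([3, 4])])
assert out.dtype == np.float64
```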

variable_type = list(variable_type_set)[0]
if not variable_type == np.ndarray:
raise TypeError('Can only concatenate variables whose '
'_data member is ndarray')
Member:

This is already guaranteed by xarray's data model, so this should either be an assert or you should skip it entirely.


OK


@classmethod
def concat_pandas(cls, variables, positions=None):
Member:

if you keep these methods already, they should be private (preface with _)


done

@@ -101,7 +101,7 @@
try:
_SKIP_FLAKY = not pytest.config.getoption("--run-flaky")
_SKIP_NETWORK_TESTS = not pytest.config.getoption("--run-network-tests")
except ValueError:
except (ValueError, AttributeError) as e:
Member:

did AttributeError come up for you?

Even if so, we don't need as e unless we're doing something with the error

- Attributes were being retained by default for some resampling
operations when they should not. With the ``keep_attrs=False`` option, they
will no longer be retained by default. This may be backwards-incompatible
with some scripts, but the attributes may be kept by adding the
``keep_attrs=True`` option. By
`Jeremy McGibbon <https://github.com/mcgibbon>`_.

- Fixed bug in arithmetic operations on DataArray objects whose dimensions
are numpy structured arrays or recarrays :issue:`861`, :issue:`837`.
Member:

This looks like a different fix?


fixed

… array) in reindex_variables (alignment.py) and implemented appropriate dtype-casting for cases where dtype information was lost for composite dtypes. Cleaned up the code.
# ensures that dtype of numpy structured arrays is preserved
if len(args) and is_composite_dtype(var.dtype) \
and idx_var.dtype != var.dtype:
idx_var.data = idx_var.data.astype(var.dtype)
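A structured ("composite") dtype can be detected via its fields; a minimal equivalent of the is_composite_dtype helper referenced in the diff could look like this (the implementation here is an assumption for illustration, not the PR's code):

```python
import numpy as np

def is_composite_dtype(dtype):
    # Structured dtypes expose their field names via .names;
    # plain dtypes have .names == None.
    return np.dtype(dtype).names is not None
```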
Member:

I still don't understand why this change is necessary. Can you give an example of what this would fix?


This is a simple demo that shows the problem. After concatenation, the dtype of the "participant" variable will be lost, but it should be preserved, especially since participant is not the dimension we are concatenating along:

import numpy as np
from xarray import DataArray, concat

p1 = np.array([('A', 180), ('B', 150), ('C', 200)],
              dtype=[('name', '|S256'), ('height', int)])
p2 = np.array([('D', 170), ('E', 250), ('F', 150)],
              dtype=[('name', '|S256'), ('height', int)])

data = np.arange(50, 80, 1, dtype=float)

dims = ['measurement', 'participant']

da1 = DataArray(
    data.reshape(10, 3),
    coords={
        'measurement': np.arange(10),
        'participant': p1,
    },
    dims=dims
)

da2 = DataArray(
    data.reshape(10, 3),
    coords={
        'measurement': np.arange(10),
        'participant': p2,
    },
    dims=dims
)

combined_2 = concat([da1, da2], dim='measurement')

assert combined_2.participant.dtype == da1.participant.dtype


@shoyer: let me know if this makes sense.

Member:

OK, this does indeed make sense. To isolate the issue from your example, let's see what we get from aligning instead:

>>> da1_aligned, da2_aligned = xarray.align(da1, da2, join='outer')
>>> da1
<xarray.DataArray (measurement: 10, participant: 3)>
array([[ 50.,  51.,  52.],
       [ 53.,  54.,  55.],
       [ 56.,  57.,  58.],
       [ 59.,  60.,  61.],
       [ 62.,  63.,  64.],
       [ 65.,  66.,  67.],
       [ 68.,  69.,  70.],
       [ 71.,  72.,  73.],
       [ 74.,  75.,  76.],
       [ 77.,  78.,  79.]])
Coordinates:
  * measurement  (measurement) int64 0 1 2 3 4 5 6 7 8 9
  * participant  (participant) [('name', 'S256'), ('height', '<i8')] ('A', 180) ...

>>> da1_aligned
<xarray.DataArray (measurement: 10, participant: 6)>
array([[ 50.,  51.,  52.,  nan,  nan,  nan],
       [ 53.,  54.,  55.,  nan,  nan,  nan],
       [ 56.,  57.,  58.,  nan,  nan,  nan],
       [ 59.,  60.,  61.,  nan,  nan,  nan],
       [ 62.,  63.,  64.,  nan,  nan,  nan],
       [ 65.,  66.,  67.,  nan,  nan,  nan],
       [ 68.,  69.,  70.,  nan,  nan,  nan],
       [ 71.,  72.,  73.,  nan,  nan,  nan],
       [ 74.,  75.,  76.,  nan,  nan,  nan],
       [ 77.,  78.,  79.,  nan,  nan,  nan]])
Coordinates:
  * participant  (participant) object ('A', 180) ('B', 150) ('C', 200) ...
  * measurement  (measurement) int64 0 1 2 3 4 5 6 7 8 9

Notice that the dtype of participant is now `object`.

Member:

To fix this, we need a dtype safe version of pandas.Index.__or__, which is used for joining together indexes: https://github.com/maciekswat/xarray/blob/7cec75c6252c34aee46f3e7bb8c170c169429f37/xarray/core/alignment.py#L20

The cleanest way I see to handle this is to make the logic inside align always use some sort of xarray wrapper around pandas.Index similar to (or building on) PandasIndexAdapter that propagates dtype properly in these operations. I don't want to add dtype specific logic in align/reindex_variables that fixes things up after the fact, because the result is much less maintainable.


data = indexes[0].append(indexes[1:])

if positions is not None:
Member:

This block is the same for both _concat_pandas and _concat_numpy -- please move it up into the caller instead.
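The shared block being discussed reorders the concatenated values according to positions; pulled up into the caller, it amounts to a single inverse-permutation step (a sketch; xarray has a helper for this in nputils, inlined here for illustration):

```python
import numpy as np

def reorder_by_positions(data, positions):
    # `positions` lists, per input variable, where its elements should
    # sit in the final output; concatenating them yields a permutation
    # whose inverse restores the desired order.
    if positions is None:
        return data
    indices = np.concatenate([np.asarray(p) for p in positions])
    inverse = np.empty_like(indices)
    inverse[indices] = np.arange(len(indices))
    return data[inverse]
```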


indexes = [v._data for v in variables]

if not indexes:
Member:

We don't need this special branch for numpy. Literally return np.concatenate([v.data for v in variables]) would do for this function.


done

@@ -1212,34 +1212,79 @@ def __setitem__(self, key, value):
raise TypeError('%s values cannot be modified' % type(self).__name__)

@classmethod
def _concat_numpy(cls, variables, positions=None):
Member:

I'm not sure this has enough to need to be a separate function, but if so:

  1. It shouldn't be a classmethod -- we don't need to overwrite it.
  2. Docstring should follow the numpydoc style, which we use throughout the project.


I can clean it up a little

# GH1434
# Fixes bug: "xr.concat loses coordinate dtype
# information with recarrays in 0.9"
if np.object in map(lambda var: var.dtype, variables):
Member:

Can you use a comprehension here instead, e.g., any(var.dtype == np.object for var in variables)?
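For reference, the suggested comprehension form (np.object was simply an alias for the builtin object, so the check can be spelled either way):

```python
import numpy as np

variables = [np.array([1, 2]), np.array(['a', None], dtype=object)]

# any() with a generator short-circuits on the first match and reads
# directly, unlike testing membership in a map() over all the dtypes.
has_object_dtype = any(v.dtype == object for v in variables)
assert has_object_dtype
```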


Done

@stickler-ci (Contributor):

I couldn't find a .stickler.yml file in this repository. I can make one for you, or you can create one by following the documentation.

@shoyer (Member) commented May 13, 2018

I don't know why this was closed by #1545 -- this is an unrelated issue.

@keewis (Collaborator) commented Jan 23, 2023

the issue seems to have been resolved in the meantime, so I'm closing. Thanks for the effort, @maciekswat

@keewis keewis closed this Jan 23, 2023
Successfully merging this pull request may close these issues.

xr.concat loses coordinate dtype information with recarrays in 0.9
6 participants