
Fixing GH1434 xr.concat loses coordinate dtype information with recarrays in 0.9 #1438


Closed
wants to merge 15 commits

Conversation

@maciekswat (Contributor) commented Jun 1, 2017

…rrays that are numpy.recarrays

The following code will run with the fix but crashes with current xarray:

import numpy as np
import xarray as xr

p_data = np.array([('John', 180), ('Stacy', 150), ('Dick', 200)],
                  dtype=[('name', '|S256'), ('height', int)])
p_data_1 = np.array([('John', 180), ('Stacy', 150), ('Dick', 200)],
                    dtype=[('name', '|S256'), ('height', int)])

weights_0 = xr.DataArray([80, 56, 120], dims=['participant'],
                         coords={'participant': p_data})
weights_1 = xr.DataArray([81, 52, 115], dims=['participant'],
                         coords={'participant': p_data_1})

print(weights_1 - weights_0)
…rrays that are numpy.recarrays. Used try/except to catch the exception that may arise when running isnull on a structured array. Added an appropriate test to test_dataarray to check for correct handling of dimensions that are structured arrays.

…snull on structured array that happens to be one of the dims of DataArray
# Conflicts:
#	xarray/core/ops.py
#	xarray/tests/test_dataarray.py
1. The dtype of the dimension we concatenate along was lost.
2. The dtype of the other dimensions (those that are aligned rather than concatenated) was lost.

Fixing 1. required a small refactoring of the "concat" classmethod in variables.py: by default we check whether the dimensions can be concatenated using numpy.concatenate, and if that fails we fall back to the current method of concatenation using the append function from pandas.
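The fallback described above can be sketched roughly like this (a minimal illustration rather than the PR's actual code; the helper name is hypothetical):

```python
import numpy as np
import pandas as pd

def concat_index_values(arrays):
    # Minimal sketch of the fallback described above: try np.concatenate
    # first, which preserves structured (recarray) dtypes; if the dtypes
    # cannot be concatenated that way, fall back to the pandas
    # Index.append path used previously.
    try:
        return np.concatenate([np.asarray(a) for a in arrays])
    except (TypeError, ValueError):
        indexes = [pd.Index(a) for a in arrays]
        return indexes[0].append(indexes[1:])
```

For two arrays sharing a structured dtype, the np.concatenate branch keeps that dtype intact, which is exactly what the pandas path loses.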

As for 2., it turns out the dtype of the non-concatenated dimensions was lost in the "align" function in alignment.py (see my comment with the # GH1434 tag). The solution was to store the dtypes of the original dimensions and reapply them to the dimension DataArrays after the "align" call.
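The store-and-reapply idea can be illustrated without xarray at all (a minimal sketch; the object-array detour stands in for what alignment does to structured coordinates):

```python
import numpy as np

dt = [('name', 'S8'), ('height', 'i8')]
orig = np.array([(b'A', 180), (b'B', 150)], dtype=dt)

# Alignment can hand coordinate values back as a plain object array of
# tuples, which is how the structured dtype gets lost:
as_object = np.empty(len(orig), dtype=object)
for i, record in enumerate(orig.tolist()):
    as_object[i] = record
assert as_object.dtype == object

# The workaround described above: remember the original dtype up front
# and re-cast the coordinate values after the align call.
restored = np.array(as_object.tolist(), dtype=dt)
assert restored.dtype == np.dtype(dt)
```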
# GH1434
# dtype is lost after obj.indexes[dim]
# problem originates in Indexes.__getitem__
# in coordinates.py
Member:

I'm not sure why you're adding these comments here?


Removed those

datasets = align(*datasets, join='outer', copy=False, exclude=[dim])

# GH1434
# restoring original dtype of the datasets' dimensions
Member:

This fix should either be done in align (actually inside core.alignment.reindex_variables) or not at all. We cannot preserve all dtypes in alignment operations, e.g., there is no missing value for structured dtypes. If you care about this, you should use join='inner' or join='exact' (which will appear in the next release, from #1330). It would be nice to surface these options up into xarray.concat, though.
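The missing-value point can be seen with plain pandas: an outer-style reindex introduces NaN, which already forces a dtype change for int data, and structured dtypes have no NaN equivalent at all (a small illustration of the comment above, not code from the PR):

```python
import numpy as np
import pandas as pd

s = pd.Series([180, 150], index=['A', 'B'])

# Reindexing against an outer-joined index introduces missing values;
# for int data this forces promotion to float64. Structured dtypes have
# no missing-value sentinel at all, which is why an outer join must
# degrade them to object dtype.
out = s.reindex(['A', 'B', 'C'])
assert out.dtype == np.float64
assert np.isnan(out['C'])
```

With join='inner' or join='exact' no fill values are needed, so the question of a missing-value sentinel never arises.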


Thanks for the pointer. It was indeed the place where loss of dtype happened

# GH1434
# Fixes bug: "xr.concat loses coordinate dtype
# information with recarrays in 0.9"
try:
Member:

Rather than using exceptions for control flow, can we divide this into two cases based upon whether any (all?) of the variables have dtype=object?
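The suggested dtype-based dispatch could look roughly like this (an illustrative sketch under the reviewer's suggestion, not the PR's actual code; the function name is hypothetical):

```python
import numpy as np
import pandas as pd

def concat_values(arrays):
    # Dispatch on dtype instead of using exceptions for control flow:
    # object-dtype inputs go through the pandas Index.append path,
    # everything else (including structured dtypes) through
    # np.concatenate.
    if any(np.asarray(a).dtype == object for a in arrays):
        indexes = [pd.Index(a) for a in arrays]
        return indexes[0].append(indexes[1:])
    return np.concatenate([np.asarray(a) for a in arrays])
```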


done

variables = list(variables)
first_var = variables[0]
if len(variable_type_set) > 1:
raise TypeError('Trying to concatenate variables of '
Member:

These extra checks would break some things that currently work with np.concatenate (e.g., concatenate([float, int]) -> float). So I would remove them, and replace this with the simpler np.concatenate([v.data for v in variables]).
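The promotion behaviour referred to here is easy to verify (a quick illustration):

```python
import numpy as np

# np.concatenate already promotes compatible dtypes, e.g. float + int
# -> float64, so a strict same-type check on the inputs would reject
# this perfectly valid case.
out = np.concatenate([np.array([1.5, 2.5]), np.array([3, 4])])
assert out.dtype == np.float64
```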

variable_type = list(variable_type_set)[0]
if not variable_type == np.ndarray:
raise TypeError('Can only concatenate variables whose '
'_data member is ndarray')
Member:

This is already guaranteed by xarray's data model, so this should either be an assert or you should skip it entirely.


OK


@classmethod
def concat_pandas(cls, variables, positions=None):
Member:

if you keep these methods already, they should be private (preface with _)


done

@@ -101,7 +101,7 @@
try:
_SKIP_FLAKY = not pytest.config.getoption("--run-flaky")
_SKIP_NETWORK_TESTS = not pytest.config.getoption("--run-network-tests")
except ValueError:
except (ValueError, AttributeError) as e:
Member:

did AttributeError come up for you?

Even if so, we don't need as e unless we're doing something with the error

- Attributes were being retained by default for some resampling
operations when they should not. With the ``keep_attrs=False`` option, they
will no longer be retained by default. This may be backwards-incompatible
with some scripts, but the attributes may be kept by adding the
``keep_attrs=True`` option. By
`Jeremy McGibbon <https://github.com/mcgibbon>`_.

- Fixed bug in arithmetic operations on DataArray objects whose dimensions
are numpy structured arrays or recarrays :issue:`861`, :issue:`837`.
Member:

This looks like a different fix?


fixed

… array) in reindex_variables (alignment.py) and implemented appropriate dtype-casting for cases where dtype information was lost for composite dtypes. Cleaned up the code.
# ensures that dtype of numpy structured arrays is preserved
if len(args) and is_composite_dtype(var.dtype) \
and idx_var.dtype != var.dtype:
idx_var.data = idx_var.data.astype(var.dtype)
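A structured ("composite") dtype can be detected via its fields; a minimal equivalent of the is_composite_dtype helper referenced in the diff could look like this (the implementation here is an assumption for illustration, not the PR's code):

```python
import numpy as np

def is_composite_dtype(dtype):
    # Structured dtypes expose their field names via .names;
    # plain dtypes have .names == None.
    return np.dtype(dtype).names is not None
```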
Member:

I still don't understand why this change is necessary. Can you give an example of what this would fix?


This is a simple demo that shows the problem. After concatenation, the dtype of the "participant" variable will be lost, but it should be preserved, especially since participant is not the dimension we are concatenating along:

import numpy as np
from xarray import DataArray, concat

p1 = np.array([('A', 180), ('B', 150), ('C', 200)],
              dtype=[('name', '|S256'), ('height', int)])
p2 = np.array([('D', 170), ('E', 250), ('F', 150)],
              dtype=[('name', '|S256'), ('height', int)])

data = np.arange(50, 80, 1, dtype=float)

dims = ['measurement', 'participant']

da1 = DataArray(
    data.reshape(10, 3),
    coords={
        'measurement': np.arange(10),
        'participant': p1,
    },
    dims=dims
)

da2 = DataArray(
    data.reshape(10, 3),
    coords={
        'measurement': np.arange(10),
        'participant': p2,
    },
    dims=dims
)

combined_2 = concat([da1, da2], dim='measurement')

assert combined_2.participant.dtype == da1.participant.dtype


@shoyer: let me know if this makes sense.

Member:

OK, this does indeed make sense. To isolate the issue from your example, let's see what we get from aligning instead:

>>> da1_aligned, da2_aligned = xarray.align(da1, da2, join='outer')
>>> da1
<xarray.DataArray (measurement: 10, participant: 3)>
array([[ 50.,  51.,  52.],
       [ 53.,  54.,  55.],
       [ 56.,  57.,  58.],
       [ 59.,  60.,  61.],
       [ 62.,  63.,  64.],
       [ 65.,  66.,  67.],
       [ 68.,  69.,  70.],
       [ 71.,  72.,  73.],
       [ 74.,  75.,  76.],
       [ 77.,  78.,  79.]])
Coordinates:
  * measurement  (measurement) int64 0 1 2 3 4 5 6 7 8 9
  * participant  (participant) [('name', 'S256'), ('height', '<i8')] ('A', 180) ...

>>> da1_aligned
<xarray.DataArray (measurement: 10, participant: 6)>
array([[ 50.,  51.,  52.,  nan,  nan,  nan],
       [ 53.,  54.,  55.,  nan,  nan,  nan],
       [ 56.,  57.,  58.,  nan,  nan,  nan],
       [ 59.,  60.,  61.,  nan,  nan,  nan],
       [ 62.,  63.,  64.,  nan,  nan,  nan],
       [ 65.,  66.,  67.,  nan,  nan,  nan],
       [ 68.,  69.,  70.,  nan,  nan,  nan],
       [ 71.,  72.,  73.,  nan,  nan,  nan],
       [ 74.,  75.,  76.,  nan,  nan,  nan],
       [ 77.,  78.,  79.,  nan,  nan,  nan]])
Coordinates:
  * participant  (participant) object ('A', 180) ('B', 150) ('C', 200) ...
  * measurement  (measurement) int64 0 1 2 3 4 5 6 7 8 9

Notice that the dtype of participant is now `object`.

Member:

To fix this, we need a dtype safe version of pandas.Index.__or__, which is used for joining together indexes: https://github.com/maciekswat/xarray/blob/7cec75c6252c34aee46f3e7bb8c170c169429f37/xarray/core/alignment.py#L20

The cleanest way I see to handle this is to make the logic inside align always use some sort of xarray wrapper around pandas.Index similar to (or building on) PandasIndexAdapter that propagates dtype properly in these operations. I don't want to add dtype specific logic in align/reindex_variables that fixes things up after the fact, because the result is much less maintainable.


data = indexes[0].append(indexes[1:])

if positions is not None:
Member:

This block is the same for both _concat_pandas and _concat_numpy -- please move it up into the caller instead.
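The shared block being discussed reorders the concatenated values according to positions; pulled up into the caller, it amounts to a single inverse-permutation step (a sketch; xarray has a helper for this in nputils, inlined here for illustration):

```python
import numpy as np

def reorder_by_positions(data, positions):
    # `positions` lists, per input variable, where its elements should
    # sit in the final output; concatenating them yields a permutation
    # whose inverse restores the desired order.
    if positions is None:
        return data
    indices = np.concatenate([np.asarray(p) for p in positions])
    inverse = np.empty_like(indices)
    inverse[indices] = np.arange(len(indices))
    return data[inverse]
```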


indexes = [v._data for v in variables]

if not indexes:
Member:

We don't need this special branch for numpy. Literally return np.concatenate([v.data for v in variables]) would do for this function.


done

@@ -1212,34 +1212,79 @@ def __setitem__(self, key, value):
raise TypeError('%s values cannot be modified' % type(self).__name__)

@classmethod
def _concat_numpy(cls, variables, positions=None):
Member:

I'm not sure this has enough to need to be a separate function, but if so:

  1. It shouldn't be a classmethod -- we don't need to overwrite it.
  2. Docstring should follow the numpydoc style, which we use throughout the project.


I can clean it up a little

# GH1434
# Fixes bug: "xr.concat loses coordinate dtype
# information with recarrays in 0.9"
if np.object in map(lambda var: var.dtype, variables):
Member:

Can you use a comprehension here instead, e.g., any(var.dtype == np.object for var in variables)?
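For reference, the suggested comprehension form (np.object was simply an alias for the builtin object, so the check can be spelled either way):

```python
import numpy as np

variables = [np.array([1, 2]), np.array(['a', None], dtype=object)]

# any() with a generator short-circuits on the first match and reads
# directly, unlike testing membership in a map() over all the dtypes.
has_object_dtype = any(v.dtype == object for v in variables)
assert has_object_dtype
```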


Done

@stickler-ci (Contributor):

I couldn't find a .stickler.yml file in this repository. I can make one for you, or you can create one by following the documentation.

@shoyer (Member) commented May 13, 2018

I don't know why this was closed by #1545 -- this is an unrelated issue.

@keewis (Collaborator) commented Jan 23, 2023

the issue seems to have been resolved in the meantime, so I'm closing. Thanks for the effort, @maciekswat

@keewis keewis closed this Jan 23, 2023
Successfully merging this pull request may close these issues.

xr.concat loses coordinate dtype information with recarrays in 0.9
6 participants