WIP: Automatic label alignment for mathematical operations #184

shoyer · 2014-07-15T01:57:18Z

This still needs a bit of cleanup (note the failing test), but an interesting design decision came up: how should we handle alignment for in-place operations when the operation would produce missing values that cannot be represented by the existing data type?

For example, what should x be after the following?

x = DataArray([1, 2], coordinates=[['a', 'b']], dimensions=['foo'])
y = DataArray([3], coordinates=[['b']], dimensions=['foo'])
x += y

If we do automatic alignment like pandas, in-place operations should not change the coordinates of the object to which the operation is being applied. Thus, y should first be aligned to the coordinates of x, which makes it equivalent to:

y_prime = DataArray([np.nan, 3], coordinates=[['a', 'b']], dimensions=['foo'])
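
For comparison, the pandas behavior referred to above can be sketched with plain Series standing in for DataArray (an assumption made for illustration, not the code in this patch). Reindexing y onto the labels of x fills the missing label with NaN and upcasts the data to float:

```python
import numpy as np
import pandas as pd

# pandas Series stand in for the DataArray objects above (illustration only).
x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([3], index=['b'])

# Align y to the labels of x; the missing label 'a' is filled with NaN,
# which forces the integer data to be upcast to float64.
y_prime = y.reindex(x.index)

# Out-of-place addition aligns automatically and returns a new float Series.
total = x + y
```

Here x itself keeps its integer dtype; only the aligned copy and the result are floats.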

Here arises the problem: x has dtype=int, so it cannot represent NaN. If I run this example using the current version of this patch, I end up with:

In [5]: x
Out[5]:
<xray.DataArray (foo: 2)>
array([-9223372036854775808,                    5])
Coordinates:
    foo: Index([u'a', u'b'], dtype='object')
Attributes:
    Empty

There are several options here:

1. Don't actually do in-place operations on the underlying ndarrays: x += y should translate under the hood to x = x + y, which sidesteps the issue, because x + y results in a new floating point array. This is what pandas does.
2. Do the operation in-place on the ndarray like numpy -- it's the user's problem if they try to add np.nan in-place to an integer.

3. Do the operation in-place, but raise a warning or error if the right hand side expression ends up including any missing values. Interestingly, this is what numpy does, but only for 0-dimensional arrays:

In [3]: x = np.array(0)

In [4]: x += np.nan
/Users/shoyer/miniconda/envs/tcc-climatology/bin/ipython:1: RuntimeWarning: invalid value encountered in add
  #!/Users/shoyer/miniconda/envs/tcc-climatology/python.app/Contents/MacOS/python
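
The difference between the first two options can be seen with plain numpy arrays. This is only a sketch of the semantics, not of how the patch implements either option, and the variable names are made up for illustration:

```python
import numpy as np

x = np.array([1, 2])
alias = x  # a second reference to the same integer buffer
y_prime = np.array([np.nan, 3.0])

# Option 1: treat `x += y` as `x = x + y`. This allocates a new float64
# array and rebinds the name, so other references still see the old data.
x = x + y_prime

# Option 2 would instead write into the existing integer buffer, e.g.
#     np.add(alias, y_prime, out=alias, casting='unsafe')
# where the value stored for the NaN element is not meaningful.
```

After the rebinding, x is a float array that can hold NaN, while alias still points at the untouched integer data.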

Option 1 has negative performance implications for all in-place array operations (they would be no faster than the non-in-place versions), and might also complicate the hypothetical future feature of datasets linked on disk (but we might also just disallow in-place operations for such arrays).

Option 2 is one principled choice, but the outcome with missing values would be pretty surprising (note that in this scenario, both x and y were integer arrays).

I like option 3 (with the warning), but unfortunately it has most of the negative performance implications of option 1, because we may need to make a copy of y to check for missing values. This could be partially alleviated by using something like bottleneck.anynan instead, and by the fact that we would only need to do this check if the in-place operation is adding a float to an int.
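
A minimal sketch of what option 3 might look like, assuming the right hand side has already been aligned. The helper name checked_inplace_add and the warning text are hypothetical and not part of this patch:

```python
import warnings
import numpy as np

def checked_inplace_add(target, other):
    """Add `other` into `target` in-place, warning if NaN would be lost.

    Hypothetical helper for illustration; `other` is assumed to be already
    aligned to `target`.
    """
    # The check is only needed when float data is written into an integer array.
    needs_check = (np.issubdtype(target.dtype, np.integer)
                   and np.issubdtype(other.dtype, np.floating))
    # np.isnan(...).any() allocates a temporary boolean array; a single-pass
    # reduction such as bottleneck.anynan would avoid that.
    if needs_check and np.isnan(other).any():
        warnings.warn("in-place operation would introduce missing values "
                      "into an integer array", RuntimeWarning)
    # casting='unsafe' permits writing the float result into the int buffer.
    np.add(target, other, out=target, casting='unsafe')
    return target
```

Raising an error instead of warning would only require replacing the warnings.warn call; the cost of the NaN check is the same either way.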

Any thoughts?

shoyer · 2014-08-21T05:44:30Z

closing this for now until after I implement #197 (since that will require a refactor)

shoyer added 2 commits July 10, 2014 18:07

more tests for DataArray constructor

cf16aaa

WIP: auto-align math

8724e86

shoyer mentioned this pull request Jul 15, 2014

Checklist for v0.2 release #183

Closed

16 tasks

shoyer added copy from pandas and removed API labels Jul 17, 2014

shoyer mentioned this pull request Jul 17, 2014

Automatic label alignment #186

Closed

3 tasks

shoyer mentioned this pull request Aug 9, 2014

ENH: Allow inplace arithmetic operations pandas-dev/pandas#5104

Closed

shoyer closed this Aug 21, 2014
