WIP: Automatic label alignment for mathematical operations #184

Closed
wants to merge 2 commits into from

shoyer
Member

@shoyer shoyer commented Jul 15, 2014

This still needs a bit of cleanup (note the failing test), but there is an interesting design decision that came up: How should we handle alignment for in-place operations when the operation would result in missing values that cannot be represented by the existing data type?

For example, what should x be after the following?

x = DataArray([1, 2], coordinates=[['a', 'b']], dimensions=['foo'])
y = DataArray([3], coordinates=[['b']], dimensions=['foo'])
x += y

If we do automatic alignment like pandas, in-place operations should not change the coordinates of the object to which the operation is being applied. Thus, y should first be aligned to the coordinates of x, making it equivalent to:

y_prime = DataArray([np.nan, 3], coordinates=[['a', 'b']], dimensions=['foo'])

Here arises the problem: x has dtype=int, so it cannot represent NaN. If I run this example using the current version of this patch, I end up with:

In [5]: x
Out[5]:
<xray.DataArray (foo: 2)>
array([-9223372036854775808,                    5])
Coordinates:
    foo: Index([u'a', u'b'], dtype='object')
Attributes:
    Empty
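
The integer result above can be reproduced with plain numpy, assuming that alignment fills the missing label 'a' with NaN before the in-place add. This is only a sketch: the helper `align_to` is hypothetical and not part of this patch, and the integer that ends up in the NaN slot is platform dependent.

```python
import numpy as np

def align_to(target_labels, labels, values):
    """Hypothetical helper: reorder `values` to match `target_labels`,
    filling labels that are absent with NaN (which forces a float result)."""
    lookup = dict(zip(labels, values))
    return np.array([lookup.get(label, np.nan) for label in target_labels],
                    dtype=float)

x = np.array([1, 2])                        # integer dtype, labels ['a', 'b']
y_prime = align_to(['a', 'b'], ['b'], [3])  # NaN for 'a', 3.0 for 'b'

# Write the float result back into the integer buffer, as an in-place
# operation on the underlying ndarray would. The NaN slot becomes an
# arbitrary, platform-dependent integer; the 'b' slot holds 2 + 3.
with np.errstate(invalid='ignore'):
    np.add(x, y_prime, out=x, casting='unsafe')
```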

There are several options here:

  1. Don't actually do in-place operations on the underlying ndarrays: x += y should translate under the hood to x = x + y, which sidesteps the issue, because x + y results in a new floating point array. This is what pandas does.

  2. Do the operation in-place on the ndarray like numpy -- it's the user's problem if they try to add np.nan in-place to an integer.

  3. Do the operation in-place, but raise a warning or error if the right hand side expression ends up including any missing values. Interestingly, this is what numpy does, but only for 0-dimensional arrays:

    In [3]: x = np.array(0)
    
    In [4]: x += np.nan
    /Users/shoyer/miniconda/envs/tcc-climatology/bin/ipython:1: RuntimeWarning: invalid value encountered in add
      #!/Users/shoyer/miniconda/envs/tcc-climatology/python.app/Contents/MacOS/python
    

Option 1 has negative performance implications for all in-place array operations (they would be no faster than the non-in-place versions), and might also complicate the hypothetical future feature of datasets linked on disk (but we might also just disallow in-place operations for such arrays).
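
For comparison, here is a small sketch of what option 1 amounts to, using pandas Series as a stand-in for DataArray (an assumption made for illustration; the patch itself is not shown). The non-in-place sum aligns on the index and allocates a new floating point array, so the missing value is representable.

```python
import pandas as pd

x = pd.Series([1, 2], index=['a', 'b'])  # integer dtype
y = pd.Series([3], index=['b'])          # integer dtype

# Option 1 treats `x += y` as `x = x + y`: the sum is a new object whose
# dtype is upcast to float so that the unmatched label 'a' can hold NaN.
result = x + y
```

The cost is the extra allocation for `result`, which is why this approach is no faster than the non-in-place version.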

Option 2 is one principled choice, but the outcome with missing values would be pretty surprising (note that in this scenario, both x and y were integer arrays).

I like option 3 (with the warning), but unfortunately it has most of the negative performance implications of option 1, because we would need to make a copy of y to check for missing values. This could be partially alleviated by using something like bottleneck.anynan instead, and by the fact that we would only need to do this check if the in-place operation is adding a float to an int.
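
A rough sketch of how option 3 could look on bare ndarrays follows. The function name `inplace_add_checked` and the warning text are hypothetical, and `np.isnan(...).any()` is used here in place of bottleneck.anynan so the example depends only on numpy; it allocates a temporary boolean mask, which is the overhead discussed above.

```python
import warnings
import numpy as np

def inplace_add_checked(target, other):
    """Hypothetical sketch of option 3: add `other` into `target` in place,
    warning if NaN would be written into an integer array."""
    other = np.asarray(other)
    # The check is only needed when a float is added to an integer target.
    if target.dtype.kind in 'iu' and other.dtype.kind == 'f':
        if np.isnan(other).any():  # allocates a temporary boolean mask
            warnings.warn("in-place operation would write missing values "
                          "into an integer array", RuntimeWarning)
    with np.errstate(invalid='ignore'):
        np.add(target, other, out=target, casting='unsafe')
    return target

x = np.array([1, 2])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    inplace_add_checked(x, np.array([np.nan, 3.0]))
```

Integer-to-integer and float-to-float additions skip the NaN scan entirely, so only the mixed case pays for the check.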

Any thoughts?

@shoyer
Member Author

shoyer commented Aug 21, 2014

closing this for now until after I implement #197 (since that will require a refactor)

@shoyer shoyer closed this Aug 21, 2014