WIP: Automatic label alignment for mathematical operations #184
This still needs a bit of cleanup (note the failing test), but there is an interesting design decision that came up: How should we handle alignment for in-place operations when the operation would result in missing values that cannot be represented by the existing data type?
For example, what should `x` be after the following?
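A minimal sketch of the kind of setup in question, assuming two integer arrays whose labels only partially overlap (the names, the values, and the modern `xarray` import are illustrative, not the original example):

```python
import xarray as xr

# Two integer arrays whose labels only partially overlap.
x = xr.DataArray([1, 2, 3], dims='t', coords={'t': [0, 1, 2]})
y = xr.DataArray([10, 20, 30], dims='t', coords={'t': [1, 2, 3]})

# The design question: what should `x` hold after this?
# x += y
```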
If we do automatic alignment like pandas, in-place operations should not change the coordinates of the object to which the operation is being applied. Thus, `y` should be equivalent to:
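Roughly, the `y` that actually participates in the operation would be its values reindexed onto `x`'s labels; a sketch of that reading, using the modern `reindex_like` spelling and the illustrative `x`/`y` above:

```python
# The effective `y`: reindexed onto x's labels, which introduces NaN
# (and therefore a float dtype) wherever y has no matching entry.
y_effective = y.reindex_like(x)
```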
Here arises the problem: `x` has `dtype=int`, so it cannot represent `NaN`.
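For context, forcing `NaN` into an integer array in plain NumPy does not produce a missing-value marker; it produces a platform-dependent garbage integer (typically the minimum `int64`):

```python
import numpy as np

# NaN has no integer representation; the cast yields a garbage value
# instead of anything resembling a missing value.
print(np.array([np.nan]).astype(int))  # e.g. [-9223372036854775808]
```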
If I run this example using the current version of this patch, I end up with garbage integer values where the missing entries ought to be.

There are several options here:
1. Don't actually do in-place operations on the underlying ndarrays: `x += y` should translate under the hood to `x = x + y`, which sidesteps the issue, because `x + y` results in a new floating point array. This is what pandas does.
2. Do the operation in-place on the ndarray like numpy -- it's the user's problem if they try to add `np.nan` in-place to an integer.
3. Do the operation in-place, but raise a warning or error if the right-hand side expression ends up including any missing values. Interestingly, this is what numpy does, but only for 0-dimensional arrays:
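For reference, current NumPy refuses the unsafe in-place float-into-int cast for arrays of any dimensionality, not only the 0-d case; a quick check:

```python
import numpy as np

a = np.array([1, 2, 3])  # integer dtype
try:
    a += np.nan          # in-place addition of a float into an integer array
except TypeError as err:
    # current NumPy raises a casting error rather than silently truncating
    print(err)
```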
Option 1 has negative performance implications for all in-place array operations (they would be no faster than the non-in-place versions), and might also complicate the hypothetical future feature of datasets linked on disk (but we might also just disallow in-place operations for such arrays).
Option 2 is one principled choice, but the outcome with missing values would be pretty surprising (note that in this scenario, both `x` and `y` were integer arrays).

I like option 3 (with the warning), but unfortunately it has most of the negative performance implications of option 1, because we may need to make a copy of `y` to check for missing values. This could be partially alleviated by using something like `bottleneck.anynan` instead, and by the fact that we would only need to do this check if the in-place operation is adding a float to an int.
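To make the proposed check concrete, a rough sketch of what an option-3 guard could look like on bare ndarrays (`checked_inplace_add` is a hypothetical helper, not part of this patch):

```python
import numpy as np

def checked_inplace_add(lhs, rhs):
    """Hypothetical sketch of option 3, operating on bare ndarrays.

    Refuse the in-place add if it would force missing values into an
    integer array; ``bottleneck.anynan(rhs)`` could replace the
    ``np.isnan(rhs).any()`` check to avoid a temporary boolean array.
    """
    if (np.issubdtype(lhs.dtype, np.integer)
            and np.issubdtype(rhs.dtype, np.floating)
            and np.isnan(rhs).any()):
        raise ValueError("in-place operation would introduce missing values "
                         "that an integer array cannot represent")
    # Explicit unsafe casting mirrors the silent float -> int truncation that
    # older NumPy performed for `lhs += rhs`.
    np.add(lhs, rhs, out=lhs, casting='unsafe')
    return lhs
```

Only the int-plus-float path pays for the NaN scan; every other dtype combination falls straight through to the in-place add.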
Any thoughts?