-
Notifications
You must be signed in to change notification settings - Fork 367
Fix comparison operators to treat NULL as False #1029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 5 commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
6da920c
Fix _column_op
harupy 7e51add
Handle not-equal operator
harupy 0b9229e
Fix
harupy e27d6aa
Fix doctests
harupy da80958
Remove f-string
harupy e1b93fb
Fix for __or__ and __and__
harupy 473a1a8
Fix
harupy 93e6505
Fix comments
harupy 880c4dd
Fix comments
harupy 000d3fd
Fix indent
harupy File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -57,6 +57,14 @@ def wrapper(self, *args): | |
| args = [arg._scol if isinstance(arg, IndexOpsMixin) else arg for arg in args] | ||
| scol = f(self._scol, *args) | ||
|
|
||
| # If f is a logistic operator, fill NULL with False | ||
| log_ops = ['eq', 'ne', 'lt', 'le', 'ge', 'gt'] | ||
| is_log_op = any(f == getattr(spark.Column, '__{}__'.format(log_op)) | ||
| for log_op in log_ops) | ||
| if is_log_op: | ||
| filler = f == spark.Column.__ne__ | ||
| scol = F.when(scol.isNull(), filler).otherwise(scol) | ||
|
|
||
| return self._with_new_scol(scol) | ||
| else: | ||
| # Different DataFrame anchors | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems like we should apply the updates for this branch as well? >>> ks.options.compute.ops_on_diff_frames = True
>>> s1 = pd.Series([True, False, True], index=list("ABC"), name="x")
>>> s2 = pd.Series([True, True, False], index=list("ABD"), name="x")
>>> s1
A True
B False
C True
Name: x, dtype: bool
>>> s2
A True
B True
D False
Name: x, dtype: bool
>>> s1 | s2
A True
B True
C True
D False
Name: x, dtype: bool
>>> (ks.from_pandas(s1) | ks.from_pandas(s2)).sort_index()
A True
B True
C True
D None
Name: x, dtype: object
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ueshin Thank you for catching this. I'm working on this. |
||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon @ueshin (cc @itholic @charlesdong1991 )
Not sure if this is the right implementation ...
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not pretty sure, but maybe is
scol = F.when(scol.isNull(), False).otherwise(True)correct?Because when type of
scolis notboolean, it may raisesAnalysisExceptiondue to data type mismatch betweenFalseandscolUh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, and i think maybe it should be better to make test of this if you available 😄ah it's already tested :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@itholic Thanks! I haven't fixed the tests yet because I want to make sure first if this is the right implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the problem is not that simple. E.g.,:
and
If we simply convert
NonetoFalse, the!=case is not the same as pandas'.We might need to add some logic one-by-one.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin @charlesdong1991
Looks like this is how pandas should behave.
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_operators.py#L447-L452
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, then it behaves this way on purpose in pandas. thanks for looking at source code to clear things up @harupy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss
&and|and address if needed in a separate issue/PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
&and|should be addressed in a separate PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like pandas tries to follow the boolean operations rules in Python (
orandand) - https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-notThen, ^ could make sense.
However, I am still kind of lost about "np.nan, filled with False" because
bool(np.nan)is coerced toTrue, e.g.)