-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: rank with +-inf, #6945 #17903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: rank with +-inf, #6945 #17903
Conversation
tests! |
Codecov Report
@@ Coverage Diff @@
## master #17903 +/- ##
==========================================
+ Coverage 91.46% 91.58% +0.11%
==========================================
Files 157 155 -2
Lines 51439 51255 -184
==========================================
- Hits 47051 46941 -110
+ Misses 4388 4314 -74
Continue to review full report at Codecov.
|
pandas/tests/series/test_rank.py
Outdated
@@ -196,9 +196,6 @@ def test_rank_signature(self): | |||
pytest.raises(ValueError, s.rank, 'average') | |||
|
|||
def test_rank_inf(self): | |||
pytest.skip('DataFrame.rank does not currently rank ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some nans in here as well. is this enough of a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can add nans there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to need a battery of tests for types accepting nan
iow float32/float64
also test non nan types like uint64 (yes they cannot have nan but may not be tested fully); since u r adding paths that exclude these need to beef up testing a bit
@@ -44,6 +44,10 @@ def rank_1d_{{dtype}}(object in_arr, ties_method='average', ascending=True, | |||
|
|||
{{if dtype == 'object'}} | |||
ndarray sorted_data, values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can put the sorted_nanmask for all dtypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes , you are right. I should add for all dtypes.
@@ -54,7 +58,7 @@ def rank_1d_{{dtype}}(object in_arr, ties_method='average', ascending=True, | |||
{{if dtype == 'uint64'}} | |||
{{ctype}} val | |||
{{else}} | |||
{{ctype}} val, nan_value | |||
{{ctype}} val, nan_value, isnan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnan is a very confusing name here as this is an actual value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I just call it masked ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually if you fix as below this is ok
but isnan should be same type as sorted_mask and mask iow uint_8
# need to distinguish between pos/neg nan and real nan when keep_na is true | ||
{{if dtype != 'uint64'}} | ||
sorted_namask = mask.take(_as) | ||
sorted_namask = sorted_namask.astype(np.bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very odd thing to do, are you just computing the mask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I am sorting the mask to match the sorted_data. When I loop through the sorted_data when computing ranks, I need to find the corresponding indicator for nan value. The current version is comparing the each element against nan_value, which is causing the problem
if (val is nan_value) and keep_na:
ranks[argsorted[i]] = nan
Because nan_value is either pos or neg infinite value.
if ascending ^ (na_option == 'top'):
nan_value = {{pos_nan_value}}
else:
nan_value = {{neg_nan_value}}
As a result when the underlying value is truely +/- infinite, the rank will be just nan for both. That's why ranking with inf values are failing when keep_na is true. Can you elaborate more on why this is odd ? Do you mean the sorted_namask.astype(np.bool)
is redundant ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this sorted_mask to be consistent
you should need the astype at all; mask is already a uint_8 (which in cython is bool)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note for 0.22 in bug fixes / conversion
# need to distinguish between pos/neg nan and real nan when keep_na is true | ||
{{if dtype != 'uint64'}} | ||
sorted_namask = mask.take(_as) | ||
sorted_namask = sorted_namask.astype(np.bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this sorted_mask to be consistent
you should need the astype at all; mask is already a uint_8 (which in cython is bool)
@@ -54,7 +58,7 @@ def rank_1d_{{dtype}}(object in_arr, ties_method='average', ascending=True, | |||
{{if dtype == 'uint64'}} | |||
{{ctype}} val | |||
{{else}} | |||
{{ctype}} val, nan_value | |||
{{ctype}} val, nan_value, isnan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually if you fix as below this is ok
but isnan should be same type as sorted_mask and mask iow uint_8
pandas/tests/series/test_rank.py
Outdated
@@ -196,9 +196,6 @@ def test_rank_signature(self): | |||
pytest.raises(ValueError, s.rank, 'average') | |||
|
|||
def test_rank_inf(self): | |||
pytest.skip('DataFrame.rank does not currently rank ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to need a battery of tests for types accepting nan
iow float32/float64
also test non nan types like uint64 (yes they cannot have nan but may not be tested fully); since u r adding paths that exclude these need to beef up testing a bit
{{endif}} | ||
|
||
n = len(values) | ||
ranks = np.empty(n, dtype='f8') | ||
|
||
{{if dtype == 'object'}} | ||
try: | ||
_as = values.argsort() | ||
# lexsort on object array will raise TypeError for numpy version | ||
# earlier than 1.11.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you adding all of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to keep nan values at the every end or at the every beginning. Otherwise sum_ranks and dups will be messed up when real +/- infs are intertwined with replaced infs (originally Nans )
if tiebreak == TIEBREAK_AVERAGE:
for j in range(i - dups + 1, i + 1):
ranks[argsorted[j]] = sum_ranks / dups
It is hard to let j skip the index corresponding to nan value, and to subtract extra counts from sum_ranks and dups at the same time. So I try to separate nans when I calculate _as.
For float and int dtype, it is easy to achieve this by a double sort.
_as = np.lexsort(keys=(mask, values))
But numpy does not support lexsort on object array before version 1.11.0.
numpy/numpy#6312
Then I tried argsort with an order argument
_dt = [('values', 'object'), ('mask', 'bool')]
_values = np.asarray(list(zip(mask, values)), dtype=_dt)
_as = np.argsort(_values, kind='mergesort', order=('values', 'mask'))
Somehow, it is not working properly on Infinity, NegInfinity object. As a result, I sort it by mask first. The downside is I need to specify the condition on whether to reverse the sign of mask so as to control whether put Nans at the beginning or the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you make this a try/except/else to isolate the thing which is actually raising
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Infinity and NegInfinity might work if you add missing methods (might be a comparison operator eg dunder lt)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean nesting the try/except clause?
{{if dtype == 'object'}}
try:
# lexsort on object array will raise TypeError for numpy version
# earlier than 1.11.0
_as = np.lexsort(keys=(mask, values))
except TypeError:
try:
_dt = [('values', 'O'), ('mask', '?')]
_values = np.asarray(list(zip(values, mask)), dtype=_dt)
_as = np.argsort(_values, kind='mergesort', order=('values', 'mask'))
except TypeError:
if not retry:
raise
valid_locs = (~mask).nonzero()[0]
ranks.put(valid_locs, rank_1d_object(values.take(valid_locs), 0,
ties_method=ties_method,
ascending=ascending))
np.putmask(ranks, mask, np.nan)
return ranks
{{else}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I mean try/except/else. you should put minimal amounts in the try portion; else is evaluation if no exception is raised. nested try/excepts are way harder to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow, it is not working properly on Infinity, NegInfinity object. As a result, I sort it by mask first. The downside is I need to specify the condition on whether to reverse the sign of mask so as to control whether put Nans at the beginning or the end.
looks like you fixed Infinity, doesn't this original strategy work (and is across versions)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I still don't know how to do it. (I get it , what try/except/else does)
When I am trying,
_as = np.lexsort(keys=(mask, values))
I will be expecting two kinds of errors when sorting an object array. One is due to lexsort is buggy in certain numpy versions, which can be fixed by argsort with order argument. Another is due to incomparable types in the array like comparing a str with an int. But they all raise same TypeError. I don't know how to isolate them.
Moreover, if I use argsort with order argument to handle the version issue, it still might fail due to incomparable types within the array and raise another TypeError
What is the Pythonic way to fix it other they nested try/except ?
pandas/tests/series/test_rank.py
Outdated
[np.iinfo(np.uint8).min, 1, 2, 100, np.iinfo(np.uint8).max], | ||
# -inf is eqivatlent to na value for int64 dtype becasue iNaT is | ||
# the same as np.iinfo(np.int64).min. The expected rank for iNaT is | ||
# nan if keep_na is true. Pending GH issue #16674 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to make this the correct result and just xfail it pending that issue
pandas/tests/series/test_rank.py
Outdated
1e10, np.iinfo(np.int64).max], | ||
[NegInfinity(), '1', 'A', 'BA', 'Ba', 'C', Infinity()] | ||
) | ||
dtypes = ('float64', 'float32', 'uint8', 'int64', 'object') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parametrize this test
{{endif}} | ||
|
||
n = len(values) | ||
ranks = np.empty(n, dtype='f8') | ||
|
||
{{if dtype == 'object'}} | ||
try: | ||
_as = values.argsort() | ||
# lexsort on object array will raise TypeError for numpy version | ||
# earlier than 1.11.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I mean try/except/else. you should put minimal amounts in the try portion; else is evaluation if no exception is raised. nested try/excepts are way harder to read.
{{endif}} | ||
|
||
n = len(values) | ||
ranks = np.empty(n, dtype='f8') | ||
|
||
{{if dtype == 'object'}} | ||
try: | ||
_as = values.argsort() | ||
# lexsort on object array will raise TypeError for numpy version | ||
# earlier than 1.11.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow, it is not working properly on Infinity, NegInfinity object. As a result, I sort it by mask first. The downside is I need to specify the condition on whether to reverse the sign of mask so as to control whether put Nans at the beginning or the end.
looks like you fixed Infinity, doesn't this original strategy work (and is across versions)?
pandas/_libs/algos.pyx
Outdated
__eq__ = lambda self, other: self is other | ||
__ne__ = lambda self, other: self is not other | ||
__gt__ = lambda self, other: self is not other | ||
__le__ = lambda self, other: isinstance(other, Infinity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some tests for Infinity/NegInfitiy (some in test_algos already). esp how they react with NaN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can add some tests.
Is it necessary to take the NaN into consideration, since Infinity/NegInfitiy is used to replace all NaN values ? There are no NaNs when comparing elements against each other.
np.putmask(values, mask, nan_value)
(nan_value is eiterh Infinity or NegInfitiy)
Hello @peterpanmj! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 04, 2017 at 15:42 Hours UTC |
pandas/tests/series/test_rank.py
Outdated
values = np.array(contents, dtype=dtype) | ||
nan_indices = np.random.choice(range(len(values)), 3) | ||
exp_order = np.array(range(len(values)), dtype='float64') + 1.0 | ||
if dtype in dtype_na_map: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put a comment why you are doing this.
pandas/tests/test_algos.py
Outdated
assert libalgos.NegInfinity() == libalgos.NegInfinity() | ||
assert not libalgos.NegInfinity() != libalgos.NegInfinity() | ||
|
||
assert Inf > np.nan and Inf >= np.nan and Inf != np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the nan comparison in a separate test / section (and comment that you are testing vs nan).
These should be consistent, where nothing matches nan.
In [1]: np.inf > np.nan
Out[1]: False
In [2]: np.inf < np.nan
Out[2]: False
In [3]: np.inf == np.nan
Out[3]: False
In [4]: np.inf != np.nan
Out[4]: True
_dt = [('values', 'O'), ('mask', '?')] | ||
_values = np.asarray(list(zip(values, mask)), dtype=_dt) | ||
_as = np.argsort(_values, kind='mergesort', order=('values', 'mask')) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
liberally put some comments on what is going on here.
what is retry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. small changes. can you add note in the bug fix numeric section. if you think its worth a sub-section ok with that as well (as this is a numerically impactful change and should be highlited a bit).
@@ -90,56 +86,69 @@ def rank_1d_{{dtype}}(object in_arr, ties_method='average', ascending=True, | |||
mask = values == iNaT | |||
{{endif}} | |||
|
|||
# doulbe sort first by mask and then by values to ensure nan values are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
@@ -27,7 +27,7 @@ dtypes = [('object', 'object', 'Infinity()', 'NegInfinity()'), | |||
{{if dtype == 'object'}} | |||
|
|||
|
|||
def rank_1d_{{dtype}}(object in_arr, bint retry=1, ties_method='average', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very odd that we had this arg here in the first place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. small changes. can you add note in the bug fix numeric section. if you think its worth a sub-section ok with that as well (as this is a numerically impactful change and should be highlited a bit).
Btw, where should I put the note ? Which file ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v0.22.0.txt, add a small sub-section in other enhancements with a mini-example.
rebase as well as the missing value support was moved around a bit (e.g. changes to infinity w/o nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge in master, you are picking up other changes as well.
doc/source/whatsnew/v0.22.0.txt
Outdated
- Better support for ``Dataframe.style.to_excel()`` output with the ``xlsxwriter`` engine. (:issue:`16149`) | ||
- | ||
- | ||
``Series.rank`` ``DataFrame.rank`` now can handle ``inf`` values and missing values properly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a ref above this, you can just say
.rank
will now handle inf
values when NaN
are present (or somethign like this)
doc/source/whatsnew/v0.22.0.txt
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [17]: pd.Series([-np.inf, 0, 1, np.inf]).rank() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add in a NaN in your example
doc/source/whatsnew/v0.22.0.txt
Outdated
3 4.0 | ||
dtype: float64 | ||
|
||
Furthermore, in previous versions, missing values were not distinguished from infinit values in the calculation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
infinit -> infinity
doc/source/whatsnew/v0.22.0.txt
Outdated
.. code-block:: ipython | ||
|
||
In [3]: pd.Series([None, None, None, 'A', 'B']).rank(na_option='top') | ||
Out[3]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't need an example (for the object
dtype)
doc/source/whatsnew/v0.22.0.txt
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [15]: pd.Series([np.nan, np.nan, -np.inf, -np.inf]).rank(na_option='top') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would think about combining this with the above example (ok if you can't / better like this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good! some doc changes. can you also verify that we have asv's (perf) for this case? if not can you add (and show the results); I expect it to be slightly worse, but that's ok, just want to quantify it.
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -76,6 +76,67 @@ Other Enhancements | |||
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`) | |||
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`) | |||
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`) | |||
- :func:`Series.rank` and :func:`DataFrame.rank` now can handle ``inf`` values properly when ``NaN`` are present (:issue:`6945`) | |||
|
|||
handle ``inf`` values properly when ``NaN`` are present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this above Other Enhancements, add a ref as well, e.g. something like
.. _whatsnew_0220.enhancements.rank_inf:
doc/source/whatsnew/v0.22.0.txt
Outdated
handle ``inf`` values properly when ``NaN`` are present | ||
""""""""""""""""""""""""""""""""""""""""""""""""""""""" | ||
|
||
In previous versions, ``inf`` elements were assigned ``NaN`` as their ranks. Now ranks are calculated properly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number here
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -76,6 +76,67 @@ Other Enhancements | |||
- Improved wording of ``ValueError`` raised in :func:`to_datetime` when ``unit=`` is passed with a non-convertible value (:issue:`14350`) | |||
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`) | |||
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`) | |||
- :func:`Series.rank` and :func:`DataFrame.rank` now can handle ``inf`` values properly when ``NaN`` are present (:issue:`6945`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed (as the new section covers)
|
||
Current Behavior | ||
|
||
.. ipython:: python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you use the ipython directive, it executes the code, so you just need to
s.rank()
doc/source/whatsnew/v0.22.0.txt
Outdated
In previous versions, ``inf`` elements were assigned ``NaN`` as their ranks. Now ranks are calculated properly. | ||
|
||
Previous Behavior: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before previous behavior, use an ipython directive to define and how s
.. ipython:: python
s = pd.Series([.....])
s
doc/source/whatsnew/v0.22.0.txt
Outdated
|
||
.. ipython:: python | ||
|
||
In [2]: import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
impot not needed
doc/source/whatsnew/v0.22.0.txt
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [6]: pd.Series([-np.inf, 0, 1, np.nan, np.inf]).rank() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.rank()
(as its now defined above), but you actually should show the execution adn result of this code (e.g. just copy-past from an ipython session using 0.21.0)
doc/source/whatsnew/v0.22.0.txt
Outdated
|
||
Furthermore, previously if you rank ``inf`` or ``-inf`` values together with ``NaN`` values, the calculation won't distinguish ``NaN`` from infinity when using 'top' or 'bottom' argument. | ||
|
||
Previously Behavior: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, define and show s
doc/source/whatsnew/v0.22.0.txt
Outdated
@@ -227,7 +288,7 @@ Reshaping | |||
Numeric | |||
^^^^^^^ | |||
|
|||
- | |||
- Bug in :func:`Series.rank` and :func:`DataFrame.rank` could not properly rank infinity values when ``NaN`` values are present (:issue:`6945`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this as well (covered by above)
thanks @peterpanmj ! nice patch & very responsive! |
The bug is caused by replacing nan with positive infinite or negative infinite values. As a result, when keep_na is true, both real nan and positive/negative infinite will be assigned nan as the rank. The solution is to keep track of the real nan value when generate the nan mask.
git diff upstream/master -u -- "*.py" | flake8 --diff