Revert Cythonized Kendall implementation and improve test case to prevent regressions #43403
zrait
commented
Sep 4, 2021
- closes BUG: Cythonized kendall correlation implementation is incorrect #43401
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
instead of reverting would be better to actually fix it
this is an edge case
I don't actually agree that this is an edge case. Kendall correlation is intended to be used with ordinal data, and a textbook application would be something like computing the correlation between a survey response with a limited domain (1-10) and some other variable, a case in which you're almost certain to have many duplicates.
Looking at the original diff here and the performance improvements cited on it, I actually think a significant amount of the benefit comes from ignoring ties, which allows you to take some major shortcuts. I'll note that I'm not an experienced Cython optimizer, so if someone can make this handle ties while retaining the same margin of performance improvement I'd be very impressed and agree that'd be better. My intuition is that, no matter what, a correct implementation will still have appreciably worse performance than the current incorrect one.
I'm fine with reverting this.
This is in released pandas so would also need a release note if we do this.
I added a release note referencing this as fixing a regression. To add a bit more to my above remarks: in general, if a change made solely to improve performance introduces incorrect behavior, then unless the fix is trivial, the correct response is to first revert it and then reopen the original goal of improving performance as a separate issue. Incorrect code is worse than slow code, and the baseline for a new attempt at improving performance should be the original implementation (in this case the scipy path that was used prior to the Cython implementation), not the wrong-but-faster code.
of course, pandas is a very large project and if something is not tested (which is the case here), regressions are certainly possible. that's why we need external testing. so thank you for that. all that said, happy to have an implementation that is performant and is correct.
this is failing pre-commit: https://github.com/pandas-dev/pandas/pull/43403/checks?check_run_id=3526049796
This reverts commit 57ccd2a. The Kendall implementation failed to take ties into account and was inconsistent with scipy's method.
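For anyone reading along who wants to see why ties matter, here is a minimal pure-Python sketch. It is not the pandas or scipy code: `kendall_tau_b` and `kendall_tau_ignoring_ties` are hypothetical helper names written for illustration, and the O(n^2) pair loop is chosen for clarity rather than speed. The first function applies the tau-b tie correction in the denominator; the second divides by the total number of pairs, which is one way an implementation can go wrong when it assumes there are no ties (an assumption about the failure mode, not a description of the reverted code).

```python
from itertools import combinations
from math import sqrt


def _pair_counts(x, y):
    """Count concordant, discordant, x-only-tied and y-only-tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither count
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant, ties_x, ties_y


def kendall_tau_b(x, y):
    """Tau-b: the denominator is corrected for ties in each variable."""
    c, d, tx, ty = _pair_counts(x, y)
    denom = sqrt((c + d + tx) * (c + d + ty))
    return (c - d) / denom if denom else float("nan")


def kendall_tau_ignoring_ties(x, y):
    """Illustrative only: divides by all n*(n-1)/2 pairs, as if no ties existed."""
    c, d, _, _ = _pair_counts(x, y)
    n = len(x)
    return (c - d) / (n * (n - 1) / 2)


# Ordinal-style data with duplicates, as in the survey example above.
x = [1, 2, 2, 3]
y = [1, 2, 3, 3]
tau_b = kendall_tau_b(x, y)                  # 4 / sqrt(5 * 5) = 0.8
tau_naive = kendall_tau_ignoring_ties(x, y)  # 4 / 6
```

On data without duplicates the two functions agree, which is why a test suite that only uses distinct values would not catch the difference.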
Oops, should be fixed now--accidentally removed an extra newline in |
thanks @zrait
@meeseeksdev backport 1.3.x
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove If these instruction are inaccurate, feel free to suggest an improvement. |
@zrait, if you wouldn't mind, please follow the instructions above for the backport.
…n and improve test case to prevent regressions
…ove test case to prevent regressions (#43431)