Skip to content

Revert Cythonized Kendall implementation and improve test case to prevent regressions #43403

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 6, 2021

Conversation

zrait
Copy link
Contributor

@zrait zrait commented Sep 4, 2021

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of reverting would be better to actually fix it
this is an edge case

@zrait
Copy link
Contributor Author

zrait commented Sep 4, 2021

I don't actually agree that this is an edge case. Kendall correlation is intended to be used with ordinal data and a textbook application would be something like computing correlation between a survey response with a limited domain 1-10 and some other variable--a case in which you're almost certain to have many duplicates.

@zrait
Copy link
Contributor Author

zrait commented Sep 4, 2021

If you look at the scipy Kendall code it is pretty clever and well vectorized and also uses a Cythonized Fenwick tree to compute the discordant pairs.

Looking at the original diff here and the performance improvements cited on it, I actually think a significant amount of the benefit is that ignoring ties allows you to take some major shortcuts. I'll note that I'm not an experienced Cython optimizer so if someone can make this handle ties while retaining the same margin of performance improvements I'd be very impressed and agree that'd be better. My intuition is that no matter what, a correct implementation will still have appreciably worse performance than the current incorrect one.

@lithomas1
Copy link
Member

I'm fine with reverting this.

@simonjayhawkins simonjayhawkins added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Regression Functionality that used to work in a prior pandas version labels Sep 5, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Sep 5, 2021
@simonjayhawkins
Copy link
Member

I'm fine with reverting this.

This is in released pandas so would also need a release note if we do this.

@zrait
Copy link
Contributor Author

zrait commented Sep 5, 2021

I'm fine with reverting this.

This is in released pandas so would also need a release note if we do this.

I added a release note referencing this is as fixing a regression.

To add a bit more to my above remarks, I think that in general if a change made solely to improve performance introduces incorrect behavior, unless the fix is trivial, the correct response is to first revert it and then reopen the original goal of improving performance as a separate issue. Incorrect code is worse than slow code, and the baseline for a new attempt at improving performance should be the original implementation (in this case the scipy path that was used prior to the Cython implementation), not the wrong-but-faster code.

@zrait zrait requested a review from jreback September 5, 2021 18:55
@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

To add a bit more to my above remarks, I think that in general if a change made solely to improve performance introduces incorrect behavior, unless the fix is trivial, the correct response is to first revert it and then reopen the original goal of improving performance as a separate issue. Incorrect code is worse than slow code, and the baseline for a new attempt at improving performance should be the original implementation (in this case the scipy path that was used prior to the Cython implementation), not the wrong-but-faster code.

of course, pandas is a very large project and if something is not tested (which is the case here), regressions are certainly possible. that's why we need external testing. so thank you for that. all that said, happy to have an implementation that is performant and is correct.

@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

This reverts commit 57ccd2a.

The Kendall implementation failed to take into account ties
and was inconsistent with scipy's method
@zrait
Copy link
Contributor Author

zrait commented Sep 6, 2021

this is failing pre-commit: https://github.com/pandas-dev/pandas/pull/43403/checks?check_run_id=3526049796

Oops, should be fixed now--accidentally removed an extra newline in pandas/_libs/algos.pyx

@jreback jreback merged commit daaf286 into pandas-dev:master Sep 6, 2021
@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

thanks @zrait

@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Sep 6, 2021

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 1.3.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 daaf2868e0d8841b2d4cdaa4ca766f41fe8dc6c6
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #43403: Revert Cythonized Kendall implementation and improve test case to prevent regressions'
  1. Push to a named branch :
git push YOURFORK 1.3.x:auto-backport-of-pr-43403-on-1.3.x
  1. Create a PR against branch 1.3.x, I would have named this PR:

"Backport PR #43403 on branch 1.3.x (Revert Cythonized Kendall implementation and improve test case to prevent regressions)"

And apply the correct labels and milestones.

Congratulation you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove Still Needs Manual Backport label once the PR gets merged.

If these instruction are inaccurate, feel free to suggest an improvement.

@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

@zrait if you wouldn't mind following the instructions above for the backport.

@lumberbot-app
Copy link

lumberbot-app bot commented Sep 6, 2021

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 1.3.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 daaf2868e0d8841b2d4cdaa4ca766f41fe8dc6c6
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #43403: Revert Cythonized Kendall implementation and improve test case to prevent regressions'
  1. Push to a named branch :
git push YOURFORK 1.3.x:auto-backport-of-pr-43403-on-1.3.x
  1. Create a PR against branch 1.3.x, I would have named this PR:

"Backport PR #43403 on branch 1.3.x (Revert Cythonized Kendall implementation and improve test case to prevent regressions)"

And apply the correct labels and milestones.

Congratulation you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove Still Needs Manual Backport label once the PR gets merged.

If these instruction are inaccurate, feel free to suggest an improvement.

zrait added a commit to zrait/pandas that referenced this pull request Sep 6, 2021
…n and improve test case to prevent regressions
@zrait
Copy link
Contributor Author

zrait commented Sep 6, 2021

@zrait if you wouldn't mind following the instructions above for the backport.

#43431

lithomas1 pushed a commit that referenced this pull request Sep 6, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Cythonized kendall correlation implementation is incorrect
4 participants