Revert Cythonized Kendall implementation and improve test case to prevent regressions #43403

zrait · 2021-09-04T18:31:51Z

closes BUG: Cythonized kendall correlation implementation is incorrect #43401
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

jreback

instead of reverting would be better to actually fix it
this is an edge case

zrait · 2021-09-04T18:44:42Z

I don't actually agree that this is an edge case. Kendall correlation is intended to be used with ordinal data and a textbook application would be something like computing correlation between a survey response with a limited domain 1-10 and some other variable--a case in which you're almost certain to have many duplicates.
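To make the point about duplicates concrete, here is a minimal sketch in pure Python. It is not the pandas or scipy code: `kendall_tau_b` is a hypothetical O(n^2) reference for the tie-aware tau-b statistic, and `kendall_tau_ignoring_ties` is an illustrative stand-in for an estimator that divides by every pair regardless of ties. The sample data is made up.

```python
from itertools import combinations
from math import sqrt


def _pair_counts(x, y):
    """Count concordant, discordant, x-only-tied and y-only-tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to nothing
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant, ties_x, ties_y


def kendall_tau_b(x, y):
    """Tie-aware tau-b: tied pairs shrink the denominator."""
    c, d, tx, ty = _pair_counts(x, y)
    denom = sqrt((c + d + tx) * (c + d + ty))
    return (c - d) / denom if denom else float("nan")


def kendall_tau_ignoring_ties(x, y):
    """Illustrative only: divides by all n*(n-1)/2 pairs, ties or not."""
    c, d, _, _ = _pair_counts(x, y)
    n = len(x)
    return (c - d) / (n * (n - 1) / 2)


# Hypothetical survey-style data with a small domain and many duplicates.
ratings = [1, 1, 2, 2]
other = [1, 1, 2, 2]
print(kendall_tau_b(ratings, other))              # perfectly monotone
print(kendall_tau_ignoring_ties(ratings, other))  # pulled toward zero by ties
```

With these inputs the two variables agree perfectly, yet only the tie-aware version reports a correlation of 1; the tie-ignoring one is diluted by the two fully tied pairs.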

zrait · 2021-09-04T18:49:57Z

If you look at the scipy Kendall code it is pretty clever and well vectorized and also uses a Cythonized Fenwick tree to compute the discordant pairs.

Looking at the original diff here and the performance improvements cited on it, I actually think a significant amount of the benefit is that ignoring ties allows you to take some major shortcuts. I'll note that I'm not an experienced Cython optimizer so if someone can make this handle ties while retaining the same margin of performance improvements I'd be very impressed and agree that'd be better. My intuition is that no matter what, a correct implementation will still have appreciably worse performance than the current incorrect one.
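For readers unfamiliar with the Fenwick tree idea mentioned above, the following is a rough pure-Python sketch of counting discordant pairs in O(n log n) while leaving tied pairs out. It is an assumption-laden illustration, not scipy's Cython routine: the function name `count_discordant` and the rank-compression step are choices made here for the example.

```python
def count_discordant(x, y):
    """Count pairs with x_i < x_j and y_i > y_j using a Fenwick tree.

    Sorting by (x, y) keeps y non-decreasing inside each group of tied x,
    so pairs tied in x are never counted; comparing ranks strictly keeps
    pairs tied in y out as well.
    """
    pairs = sorted(zip(x, y))
    # Compress y values to ranks 1..k so they can index the tree.
    ranks = {value: i + 1 for i, value in enumerate(sorted(set(y)))}
    tree = [0] * (len(ranks) + 1)

    discordant = 0
    for seen, (_, y_value) in enumerate(pairs):
        rank = ranks[y_value]

        # Prefix sum: how many earlier points have y rank <= this rank.
        i, not_greater = rank, 0
        while i > 0:
            not_greater += tree[i]
            i -= i & -i

        # Every earlier point with a strictly larger y is discordant.
        discordant += seen - not_greater

        # Record this point in the tree.
        i = rank
        while i < len(tree):
            tree[i] += 1
            i += i & -i
    return discordant


print(count_discordant([1, 2, 3, 4], [4, 3, 2, 1]))
print(count_discordant([1, 2, 2, 3], [2, 1, 1, 1]))
```

The tie handling costs only a sort and a strict comparison here, but a full tau-b also needs the tie counts for the denominator, which is extra work that a tie-ignoring implementation skips.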

lithomas1 · 2021-09-04T19:47:17Z

I'm fine with reverting this.

simonjayhawkins · 2021-09-05T07:32:54Z

I'm fine with reverting this.

This is in released pandas so would also need a release note if we do this.

zrait · 2021-09-05T16:45:54Z

> I'm fine with reverting this.
>
> This is in released pandas so would also need a release note if we do this.

I added a release note referencing this as fixing a regression.

To add a bit more to my above remarks, I think that in general if a change made solely to improve performance introduces incorrect behavior, unless the fix is trivial, the correct response is to first revert it and then reopen the original goal of improving performance as a separate issue. Incorrect code is worse than slow code, and the baseline for a new attempt at improving performance should be the original implementation (in this case the scipy path that was used prior to the Cython implementation), not the wrong-but-faster code.

jreback · 2021-09-06T15:10:34Z

> To add a bit more to my above remarks, I think that in general if a change made solely to improve performance introduces incorrect behavior, unless the fix is trivial, the correct response is to first revert it and then reopen the original goal of improving performance as a separate issue. Incorrect code is worse than slow code, and the baseline for a new attempt at improving performance should be the original implementation (in this case the scipy path that was used prior to the Cython implementation), not the wrong-but-faster code.

of course, pandas is a very large project and if something is not tested (which is the case here), regressions are certainly possible. that's why we need external testing. so thank you for that. all that said, happy to have an implementation that is performant and is correct.

jreback · 2021-09-06T15:21:37Z

this is failing pre-commit: https://github.com/pandas-dev/pandas/pull/43403/checks?check_run_id=3526049796

This reverts commit 57ccd2a. The Kendall implementation failed to take into account ties and was inconsistent with scipy's method

zrait · 2021-09-06T16:29:07Z

> this is failing pre-commit: https://github.com/pandas-dev/pandas/pull/43403/checks?check_run_id=3526049796

Oops, should be fixed now--accidentally removed an extra newline in pandas/_libs/algos.pyx

jreback · 2021-09-06T18:35:51Z

thanks @zrait

jreback · 2021-09-06T18:35:59Z

@meeseeksdev backport 1.3.x

lumberbot-app · 2021-09-06T18:36:12Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

$ git checkout 1.3.x
$ git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

$ git cherry-pick -m1 daaf2868e0d8841b2d4cdaa4ca766f41fe8dc6c6

You will likely have some merge/cherry-pick conflicts here; fix them and commit:

$ git commit -am 'Backport PR #43403: Revert Cythonized Kendall implementation and improve test case to prevent regressions'

Push to a named branch:

$ git push YOURFORK 1.3.x:auto-backport-of-pr-43403-on-1.3.x

Create a PR against branch 1.3.x, I would have named this PR:

"Backport PR #43403 on branch 1.3.x (Revert Cythonized Kendall implementation and improve test case to prevent regressions)"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

jreback · 2021-09-06T18:36:39Z

@zrait if you wouldn't mind following the instructions above for the backport.

zrait · 2021-09-06T19:47:03Z

> @zrait if you wouldn't mind following the instructions above for the backport.

#43431

jreback requested changes Sep 4, 2021

simonjayhawkins added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Regression (functionality that used to work in a prior pandas version) labels Sep 5, 2021

simonjayhawkins added this to the 1.3.3 milestone Sep 5, 2021

Add duplicated values to correlation test data

51fae8f

zrait force-pushed the master branch from df2d86f to f228a0a Compare September 5, 2021 16:38

zrait requested a review from jreback September 5, 2021 18:55

zrait added 2 commits September 6, 2021 12:27

Revert "PERF: cythonize kendall correlation (pandas-dev#39132)"

d027c56

This reverts commit 57ccd2a. The Kendall implementation failed to take into account ties and was inconsistent with scipy's method

Add release note

25331d2

zrait force-pushed the master branch from f228a0a to 25331d2 Compare September 6, 2021 16:28

jreback approved these changes Sep 6, 2021

jreback merged commit daaf286 into pandas-dev:master Sep 6, 2021

lumberbot-app bot added the Still Needs Manual Backport label Sep 6, 2021

zrait added a commit to zrait/pandas that referenced this pull request Sep 6, 2021

Backport PR pandas-dev#43403: Revert Cythonized Kendall implementation and improve test case to prevent regressions

7edb04c

lithomas1 pushed a commit that referenced this pull request Sep 6, 2021

Backport PR #43403: Revert Cythonized Kendall implementation and improve test case to prevent regressions (#43431)

b7082dd

lithomas1 removed the Still Needs Manual Backport label Sep 6, 2021

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

Revert Cythonized Kendall implementation and improve test case to prevent regressions (pandas-dev#43403)

50f384b

simonjayhawkins mentioned this pull request Oct 29, 2021

TYP: type annotations for nancorr #44227

Merged
