Skip to content

PERF: efficient argmax/argmin for SparseArray #47779

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jul 27, 2022
Merged

Conversation

GYHHAHA
Copy link
Contributor

@GYHHAHA GYHHAHA commented Jul 18, 2022

Thanks for the review. Currently, only argmax/argmin are implemented since argsort has many annoying corner cases and I have to spent more time on the correctness and take the follow-up in a separate PR. Simple benchmark:

>>> val = np.random.rand(1000000)
>>> mask = val < 0.99
>>> val[mask] = np.nan
>>> arr = SparseArray(val)
>>> %timeit arr.argmax()
8.6 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- master
44.3 µs ± 934 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <- this pr

@pep8speaks
Copy link

pep8speaks commented Jul 18, 2022

Hello @GYHHAHA! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-07-27 16:40:06 UTC

Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr @GYHHAHA! Some comments below, general question is why can't we just dispatch to a SparseArray.argmin/argmax which directly uses np.argmin/max instead of having to modify the general ExtensionArray path (for example, how SparseArray handles min/max. This could also avoid materializing the data.

@mzeitlin11 mzeitlin11 added Performance Memory or execution speed performance Sparse Sparse Data Type ExtensionArray Extending pandas with custom dtypes or arrays. labels Jul 18, 2022
@GYHHAHA
Copy link
Contributor Author

GYHHAHA commented Jul 18, 2022

@mzeitlin11 I think you are right. Originally I want to reuse some general codes. It seems to be unnecessary. I will handle this in SparseArray.

@GYHHAHA
Copy link
Contributor Author

GYHHAHA commented Jul 18, 2022

I find the current implementation for _first_fill_value_loc is wrong. A simple bad case:

>>> arr = SparseArray([np.nan, 1, 0, 0, np.nan, 2], fill_value=1)
>>> arr._first_fill_value_loc()
5 # should be 1

I will commit a fix after merging this one.

@mzeitlin11
Copy link
Member

I find the current implementation for _first_fill_value_loc is wrong. A simple bad case:

>>> arr = SparseArray([np.nan, 1, 0, 0, np.nan, 2], fill_value=1)
>>> arr._first_fill_value_loc()
5 # should be 1

I will commit a fix after merging this one.

I'm not entirely sure what that function is meant to do, but the name might also just be misleading. It looks to only be used in unique as some kind of mechanism for figuring out where to insert the fill value (probably to keep sorted order in the result?).

@GYHHAHA
Copy link
Contributor Author

GYHHAHA commented Jul 18, 2022

The reason why unique need to find the loc is to keep the insertion place right, which aligns the normal array.unique() result. @mzeitlin11

@mzeitlin11
Copy link
Member

@GYHHAHA what's the status here? This generally lgtm - but does the _first_fill_value_loc issues you discovered cause any regressions in cases where _first_fill_value_loc is wrong? Also just confirming benchmarks still look good with new impl?

@GYHHAHA
Copy link
Contributor Author

GYHHAHA commented Jul 20, 2022

How about I submit the fix PR for unique and _first_fill_value_loc (and add tests) now? After the fix merged, then we can deal with the current one. @mzeitlin11

@mzeitlin11
Copy link
Member

How about I submit the fix PR for unique and _first_fill_value_loc (and add tests) now? After the fix merged, then we can deal with the current one. @mzeitlin11

Sounds great, thanks!

@GYHHAHA GYHHAHA requested a review from mzeitlin11 July 25, 2022 03:54
Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @GYHHAHA! LGTM pending one small comment. Can you please also show latest benchmarks for the cases you showed in the pr description?

@mroeschke mroeschke added this to the 1.5 milestone Jul 26, 2022
@GYHHAHA
Copy link
Contributor Author

GYHHAHA commented Jul 26, 2022

@mzeitlin11 New benchmark is added in the description.

@GYHHAHA GYHHAHA changed the title ENH/PERF: efficient argmax/argmin for SparseArray PERF: efficient argmax/argmin for SparseArray Jul 27, 2022
@mroeschke mroeschke merged commit 8d7a379 into pandas-dev:main Jul 27, 2022
@mroeschke
Copy link
Member

Thanks @GYHHAHA

@GYHHAHA GYHHAHA deleted the patch-3 branch July 27, 2022 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Performance Memory or execution speed performance Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: implement efficient sorting methods for SparseArray (argsort/argmin/argmax)
4 participants