PERF: efficient argmax/argmin for SparseArray #47779

GYHHAHA · 2022-07-18T18:04:22Z

partially closes ENH: implement efficient sorting methods for SparseArray (argsort/argmin/argmax) #34197
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Thanks for the review. Currently, only argmax/argmin are implemented: argsort has many annoying corner cases, so I need to spend more time on its correctness and will follow up in a separate PR. Simple benchmark:

>>> import numpy as np
>>> from pandas.arrays import SparseArray
>>> val = np.random.rand(1000000)
>>> mask = val < 0.99
>>> val[mask] = np.nan
>>> arr = SparseArray(val)
>>> %timeit arr.argmax()
8.6 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- master
44.3 µs ± 934 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) <- this pr

pep8speaks · 2022-07-18T18:04:26Z

Hello @GYHHAHA! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-07-27 16:40:06 UTC

mzeitlin11

Thanks for the PR @GYHHAHA! Some comments below. A general question: why can't we just dispatch to a SparseArray.argmin/argmax that directly uses np.argmin/np.argmax, instead of modifying the general ExtensionArray path (for example, the way SparseArray handles min/max)? This could also avoid materializing the data.
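As a rough illustration of the suggestion above, here is a minimal, dependency-free sketch of an argmax that works on the sparse representation alone and never builds the dense array. It is not the code from this PR: `sparse_argmax` and `first_fill_loc` are hypothetical helper names, and the inputs are assumed to be plain lists holding the stored values, their sorted positions, the fill value, and the total length.

```python
import math


def first_fill_loc(sp_indices, length):
    """First position that is not stored explicitly, or -1 if every one is."""
    for expected, actual in enumerate(sp_indices):
        if actual != expected:
            return expected
    return len(sp_indices) if len(sp_indices) < length else -1


def sparse_argmax(sp_values, sp_indices, fill_value, length):
    """Position of the maximum, skipping NaN, without densifying."""
    # Best candidate among the explicitly stored, non-NaN values.
    best = None
    for pos, val in zip(sp_indices, sp_values):
        if math.isnan(val):
            continue
        if best is None or val > best[1]:
            best = (pos, val)

    # The fill value competes only if it occurs and is not NaN.
    fill_pos = first_fill_loc(sp_indices, length)
    fill_usable = fill_pos != -1 and not math.isnan(fill_value)

    if best is None:
        if fill_usable:
            return fill_pos
        raise ValueError("no non-NaN values to compare")
    if not fill_usable or best[1] > fill_value:
        return best[0]
    if best[1] < fill_value:
        return fill_pos
    # Tie between a stored value and the fill value: first occurrence wins.
    return min(best[0], fill_pos)


nan = float("nan")
# Dense equivalent: [nan, 1, 0, 0, nan, 2] with fill_value=1.
result = sparse_argmax([nan, 0, 0, nan, 2], [0, 2, 3, 4, 5], fill_value=1, length=6)
```

The work is proportional to the number of stored values rather than the full length, which is the effect the benchmark in the description measures.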

pandas/core/arrays/base.py

pandas/core/arrays/sparse/array.py

pandas/tests/arrays/sparse/test_reductions.py

GYHHAHA · 2022-07-18T20:17:45Z

@mzeitlin11 I think you are right. Originally I wanted to reuse some of the general code, but that seems unnecessary. I will handle this in SparseArray.

GYHHAHA · 2022-07-18T22:16:56Z

I found that the current implementation of _first_fill_value_loc is wrong. A simple failing case:

>>> arr = SparseArray([np.nan, 1, 0, 0, np.nan, 2], fill_value=1)
>>> arr._first_fill_value_loc()
5 # should be 1

I will commit a fix after merging this one.
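The expected answer in the case above can be checked with a small standalone sketch. It assumes the stored positions are sorted and unique, and `first_fill_loc` is a hypothetical name used here for illustration, not pandas' private method.

```python
def first_fill_loc(sp_indices, length):
    """First position that is not stored explicitly, or -1 if every one is."""
    # Stored positions are sorted, so the first gap is the first place
    # where the i-th stored position is no longer equal to i.
    for expected, actual in enumerate(sp_indices):
        if actual != expected:
            return expected
    # No gap among the stored positions: the fill value can only start
    # right after them, provided the array is longer than that.
    return len(sp_indices) if len(sp_indices) < length else -1


# [nan, 1, 0, 0, nan, 2] with fill_value=1 stores every position except 1.
stored_positions = [0, 2, 3, 4, 5]
result = first_fill_loc(stored_positions, 6)
```

Here `result` is 1, the position of the only element equal to the fill value.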

mzeitlin11 · 2022-07-18T23:19:51Z

I found that the current implementation of _first_fill_value_loc is wrong. A simple failing case:
>>> arr = SparseArray([np.nan, 1, 0, 0, np.nan, 2], fill_value=1)
>>> arr._first_fill_value_loc()
5 # should be 1
I will commit a fix after merging this one.

I'm not entirely sure what that function is meant to do, but the name might also just be misleading. It looks to only be used in unique as some kind of mechanism for figuring out where to insert the fill value (probably to keep sorted order in the result?).

pandas/core/arrays/sparse/array.py

GYHHAHA · 2022-07-18T23:52:05Z

unique needs the location of the first fill value so that it can insert the fill value at the right place, which keeps the result aligned with what unique() returns for a normal array. @mzeitlin11
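To make the insertion-place argument concrete, here is a small sketch under stated assumptions: plain lists stand in for the sparse internals, the stored values are hashable, contain no NaN, and never include the fill value itself. `sparse_unique` and `first_fill_loc` are hypothetical names, not the pandas implementation.

```python
def first_fill_loc(sp_indices, length):
    """First position that is not stored explicitly, or -1 if every one is."""
    for expected, actual in enumerate(sp_indices):
        if actual != expected:
            return expected
    return len(sp_indices) if len(sp_indices) < length else -1


def sparse_unique(sp_values, sp_indices, fill_value, length):
    """Unique values in order of first appearance, without densifying."""
    # dict.fromkeys de-duplicates while preserving insertion order.
    uniques = list(dict.fromkeys(sp_values))
    fill_loc = first_fill_loc(sp_indices, length)
    if fill_loc == -1:
        # The fill value never occurs, so there is nothing to insert.
        return uniques
    # Every position before fill_loc is stored, so the first fill_loc
    # stored values are exactly the dense prefix before the fill value.
    seen_before_fill = len(dict.fromkeys(sp_values[:fill_loc]))
    uniques.insert(seen_before_fill, fill_value)
    return uniques


# Dense equivalent: [3, 0, 0, 5, 3, 0, 7] with fill_value=0.
result = sparse_unique([3, 5, 3, 7], [0, 3, 4, 6], fill_value=0, length=7)
```

With these inputs `result` is `[3, 0, 5, 7]`, the same order a dense unique would give. Appending the fill value at the end instead would produce `[3, 5, 7, 0]`.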

mzeitlin11 · 2022-07-20T16:38:26Z

@GYHHAHA what's the status here? This generally lgtm, but do the _first_fill_value_loc issues you discovered cause any regressions in cases where _first_fill_value_loc is wrong? Also, just confirming: do the benchmarks still look good with the new impl?

GYHHAHA · 2022-07-20T20:29:40Z

How about I submit the fix PR for unique and _first_fill_value_loc (with tests) now? After that fix is merged, we can come back to the current one. @mzeitlin11

mzeitlin11 · 2022-07-20T21:00:31Z

How about I submit the fix PR for unique and _first_fill_value_loc (with tests) now? After that fix is merged, we can come back to the current one. @mzeitlin11

Sounds great, thanks!

mzeitlin11

Thanks @GYHHAHA! LGTM pending one small comment. Can you please also show the latest benchmarks for the cases you showed in the PR description?

pandas/core/arrays/sparse/array.py

GYHHAHA · 2022-07-26T22:35:08Z

@mzeitlin11 The new benchmark has been added to the description.

doc/source/whatsnew/v1.5.0.rst

mroeschke · 2022-07-27T16:40:27Z

Thanks @GYHHAHA

GYHHAHA added 5 commits July 18, 2022 12:38

Update test_reductions.py

bb3d23a

Update v1.5.0.rst

f7f00fc

Update array.py

fdf7e96

Update base.py

301c5d9

Update sorting.py

e36f605

GYHHAHA added 3 commits July 18, 2022 13:06

fix format

11421bd

Update array.py

196a4c3

Update base.py

89e24e6

mzeitlin11 reviewed Jul 18, 2022

pandas/core/arrays/base.py (outdated, resolved)

pandas/core/arrays/base.py (outdated, resolved)

pandas/core/arrays/sparse/array.py (outdated, resolved)

pandas/tests/arrays/sparse/test_reductions.py (resolved)

mzeitlin11 added the Performance, Sparse, and ExtensionArray labels Jul 18, 2022

GYHHAHA added 6 commits July 18, 2022 15:32

Update sorting.py

2861307

Update base.py

784ecb7

Update array.py

96bd368

Update test_reductions.py

dfccadf

fix format

393f358

fix import

9f43ec6

mzeitlin11 reviewed Jul 18, 2022

pandas/core/arrays/sparse/array.py (resolved)

GYHHAHA mentioned this pull request Jul 21, 2022

BUG: fix SparseArray.unique IndexError and _first_fill_value_loc algo #47810

Merged

4 tasks

Update test_reductions.py

f1fb365

GYHHAHA requested a review from mzeitlin11 July 25, 2022 03:54

mzeitlin11 approved these changes Jul 26, 2022

pandas/core/arrays/sparse/array.py (outdated, resolved)

pandas/core/arrays/sparse/array.py (outdated, resolved)

Update array.py

0057def

mroeschke added this to the 1.5 milestone Jul 26, 2022

mroeschke reviewed Jul 26, 2022

doc/source/whatsnew/v1.5.0.rst (outdated, resolved)

GYHHAHA changed the title ~~ENH/PERF: efficient argmax/argmin for SparseArray~~ PERF: efficient argmax/argmin for SparseArray Jul 27, 2022

move to perf

b80ea9e

mroeschke reviewed Jul 27, 2022

doc/source/whatsnew/v1.5.0.rst (outdated, resolved)

Update doc/source/whatsnew/v1.5.0.rst

8ee4987

mroeschke approved these changes Jul 27, 2022

mroeschke merged commit 8d7a379 into pandas-dev:main Jul 27, 2022

GYHHAHA deleted the patch-3 branch July 27, 2022 17:03
