Skip to content

Performance regression in replace.ReplaceDict.time_replace_series #33920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue May 1, 2020 · 8 comments · Fixed by #35229
Closed

Performance regression in replace.ReplaceDict.time_replace_series #33920

TomAugspurger opened this issue May 1, 2020 · 8 comments · Fixed by #35229
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version replace replace method
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Setup

import pandas as pd
import numpy as np

inplace = False

N = 10 ** 5
start_value = 10 ** 5
to_rep = dict(enumerate(np.arange(N) + start_value))
s = pd.Series(np.random.randint(N, size=10 ** 3))

%timeit s.replace(to_rep, inplace=inplace)
# 1.0.2
2.48 s ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# master
5.77 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

https://pandas.pydata.org/speed/pandas/index.html#replace.ReplaceDict.time_replace_series?p-inplace=False&commits=acb525a79fd3496a57b93fcfdb86be3de28a1815-aa8e869d76878f07dff065f947c99b5663342087 points to

aa8e869d76 BLD: recursive inclusion of DLLs in package data (#33246)
610568c146 REF: remove replace_list kludge (#33445)
df68cceeb3 REF: .dot tests (#33214)
8d299e4ce1 REF: call pandas_dtype up-front in Index.__new__ (#33407)
e7cbe6dc1a BUG: df.iloc[:, :1] with EA column (#32959)
31ea45f8a3 BUG: Add test to ensure, that bug will not occur again. #33058 (#33072)
f334fcc681 Properly handle missing attributes in query/eval strings (#32408)
12ffb23a72 DOC: Change doc template to fix SA04 errors in docstrings #28792 (#32972)
7f276c8ba1 DOC: Fixed examples in `pandas/core/groupby/` (#33230)
982b4aadef CI: fix lint issue (#33461)
6bc8a49f25 DOC: Fix heading capitalization in doc/source/whatsnew - part2 (#32550) (#33403)
47de449681 API/TST: Call __finalize__ in more places (#33379)
916d1f3786 DOC: Fix EX01 in DataFrame.drop_duplicates (#33283)
4fc8c2515e DOC: Fix heading capitalization in doc/source/whatsnew - part3 (#32550) (#33429)
ef9b9387c8 BUG: Fix bug when concatenating Index of strings (#33436)
40fd73ab8f Update citation webpage (#33311)
3cca07c8a5 BUG: Fix replacing in `string` series with NA (pandas-dev#32621) (#32890)
a2cdd50427 INT: provide helpers for accessing the values of DataFrame columns (#33252)
efa85af76d PERF: improve IntegerArray fast constructor (#33359)
716689a9bf REGR: Fix construction of PeriodIndex from strings (#33304)
4a74463d02 Updated headers for files in doc/source/whatsnew (#33376)
542ef40bb4 Updated headers for files in doc/source/whatsnew (#33378)
12b0d4523a TYP: F used in decorators to _typing (#33456)
d72116b0e7 BUG/PLT: Multiple plots with different cmap, colorbars legends use first cmap (#33392)
2a68c12509 BUG: `weights` is not working for multiple columns in df.plot.hist (#33440)
b5f6e59cb4 BUG: Fix ValueError when grouping by read-only category (#33410) (#33446)
0c69615ad5 DEP: Bump min version of dateutil to 2.7.3 (#33363)
be86b6583f CLN: remove unnecessary Series._convert calls (#32949)
4334482c34 BUG/API: prohibit dtype-changing IntervalArray.__setitem__ (#32782)
6e3537dbab CLN: Static types in `pandas/_lib/lib.pyx` (#33329)
54f9b03cf6 BUG/REF: unstack with EA dtypes (#33356)
496c982b4e REF: reshape.concat operate on arrays, not SingleBlockManagers (#33125)
d72dc24e62 CLN: avoid accessing private functions (#33427)
a142ad7bb2 BUG: DataFrame.diff(axis=1) with mixed (or EA) dtypes (#32995)
185a654e3d BUG: scalar indexing on 2D DTA/TDA/PA (#33342)
5f2cdf8e1d REF: call _block_shape from EABlock.make_block (#33308)
5d0faa8ca5 BUG: Series.__getitem__ with MultiIndex and leading integer level (#33404)
fe42954e08 BUG: Timedelta == ndarray[td64] (#33441)
991f784a72 STY: Using __all__; removed "noqa" comment (#33143)
fbae09e11a CLN/TYP: redundant casts and unused ignores (#33453)
78d8af1fe3 Bump cython for asv environment (#33454)

perhaps one of based on the commit message. Not sure.

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version replace replace method labels May 1, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone May 1, 2020
@TomAugspurger
Copy link
Contributor Author

Reverting #33445 doesn't restore the perf, so it likely wasn't that.

We do now spend twice as much time in _compare_or_regex_search, which points to #32890. Can you take a look @chrispe92?

@chrispe
Copy link
Contributor

chrispe commented Jul 3, 2020

Hi @TomAugspurger, yes I'll try to have a look at it this weekend.

@chrispe
Copy link
Contributor

chrispe commented Jul 4, 2020

I was able to reproduce about the same time percentage difference between the two versions (i.e. the one before and after #32890). It seems like the main overhead is coming from creating the mask array.

mask = np.reshape(~(isna(a)), a.shape)

Notice: My analysis showed that the overhead is not coming from the np.reshape, but from the call of pd.isna.

Background: The mask array is used to filter out pd.NA values in order to avoid using them in a comparison (which would raise an error). I'm not entirely sure how to reduce the complexity of that. Will have to give it some thought, but feel free to make any suggestions!

@chrispe
Copy link
Contributor

chrispe commented Jul 5, 2020

If we were to replace the op function with something like the following (in order to handle the NA values), then it would make it even slower, so that's likely not a possible solution.

if not regex:
    op = np.vectorize(
        lambda x: operator.eq(x, b)
        if isna(x) is False
        else False
    )
else:
    op = np.vectorize(
        lambda x: bool(re.search(b, x))
        if isinstance(x, str) and isinstance(b, str)
        else False
    )

@TomAugspurger
Copy link
Contributor Author

@chrispe92 thanks. How often is _compare_or_regex_search called? Once per scalar element in to_replace right? I wonder if we could call isna() outside of _compare_or_regex_search and pass it in. Something like (untested)

diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py
index c82670106d..3f0b3c9e8e 100644
--- a/pandas/core/internals/managers.py
+++ b/pandas/core/internals/managers.py
@@ -596,7 +596,7 @@ class BlockManager(PandasObject):
         # figure out our mask apriori to avoid repeated replacements
         values = self.as_array()
 
-        def comp(s, regex=False):
+        def comp(s, regex=False, mask=None):
             """
             Generate a bool array by perform an equality check, or perform
             an element-wise regular expression matching
@@ -605,9 +605,10 @@ class BlockManager(PandasObject):
                 return isna(values)
 
             s = com.maybe_box_datetimelike(s)
-            return _compare_or_regex_search(values, s, regex)
+            return _compare_or_regex_search(values, s, regex, mask)
 
-        masks = [comp(s, regex) for s in src_list]
+        mask = ~isna(values)
+        masks = [comp(s, regex, mask) for s in src_list]
 
         result_blocks = []
         src_len = len(src_list) - 1
@@ -1895,7 +1896,7 @@ def _merge_blocks(
 
 
 def _compare_or_regex_search(
-    a: ArrayLike, b: Scalar, regex: bool = False
+    a: ArrayLike, b: Scalar, regex: bool = False, mask=None
 ) -> Union[ArrayLike, bool]:
     """
     Compare two array_like inputs of the same shape or two scalar values
@@ -1941,7 +1942,7 @@ def _compare_or_regex_search(
         )
 
     # GH#32621 use mask to avoid comparing to NAs
-    if isinstance(a, np.ndarray) and not isinstance(b, np.ndarray):
+    if mask is None and isinstance(a, np.ndarray) and not isinstance(b, np.ndarray):
         mask = np.reshape(~(isna(a)), a.shape)
     if isinstance(a, np.ndarray):
         a = a[mask]

@chrispe
Copy link
Contributor

chrispe commented Jul 8, 2020

Good suggestion @TomAugspurger! I'll give this a go and will keep you posted.

@jreback jreback modified the milestones: 1.1, 1.1.1 Jul 10, 2020
@chrispe
Copy link
Contributor

chrispe commented Jul 11, 2020

I applied the suggested changes and it seems to significantly improve the performance. Here are the stats:

No Mask   (before #32890 - reference):  3.79s (±139.1 ms) [+0.0%]
With Mask (after #32890):               6.60s (±184.5 ms) [+73.9%]
With Mask (after suggested changes):    4.49s (±198.3 ms) [+18.4%]

The tests also seem to pass fine. If you agree I will continue with a PR to include your suggested solution @TomAugspurger.

@jreback
Copy link
Contributor

jreback commented Jul 11, 2020

yep would taken PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version replace replace method
Projects
None yet
3 participants