Performance regression in replace.ReplaceDict.time_replace_series #33920

TomAugspurger · 2020-05-01T15:11:41Z

Setup

import pandas as pd
import numpy as np

inplace = False

N = 10 ** 5
start_value = 10 ** 5
to_rep = dict(enumerate(np.arange(N) + start_value))
s = pd.Series(np.random.randint(N, size=10 ** 3))

%timeit s.replace(to_rep, inplace=inplace)

# 1.0.2
2.48 s ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# master
5.77 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

https://pandas.pydata.org/speed/pandas/index.html#replace.ReplaceDict.time_replace_series?p-inplace=False&commits=acb525a79fd3496a57b93fcfdb86be3de28a1815-aa8e869d76878f07dff065f947c99b5663342087 points to

aa8e869d76 BLD: recursive inclusion of DLLs in package data (#33246)
610568c146 REF: remove replace_list kludge (#33445)
df68cceeb3 REF: .dot tests (#33214)
8d299e4ce1 REF: call pandas_dtype up-front in Index.__new__ (#33407)
e7cbe6dc1a BUG: df.iloc[:, :1] with EA column (#32959)
31ea45f8a3 BUG: Add test to ensure, that bug will not occur again. #33058 (#33072)
f334fcc681 Properly handle missing attributes in query/eval strings (#32408)
12ffb23a72 DOC: Change doc template to fix SA04 errors in docstrings #28792 (#32972)
7f276c8ba1 DOC: Fixed examples in `pandas/core/groupby/` (#33230)
982b4aadef CI: fix lint issue (#33461)
6bc8a49f25 DOC: Fix heading capitalization in doc/source/whatsnew - part2 (#32550) (#33403)
47de449681 API/TST: Call __finalize__ in more places (#33379)
916d1f3786 DOC: Fix EX01 in DataFrame.drop_duplicates (#33283)
4fc8c2515e DOC: Fix heading capitalization in doc/source/whatsnew - part3 (#32550) (#33429)
ef9b9387c8 BUG: Fix bug when concatenating Index of strings (#33436)
40fd73ab8f Update citation webpage (#33311)
3cca07c8a5 BUG: Fix replacing in `string` series with NA (pandas-dev#32621) (#32890)
a2cdd50427 INT: provide helpers for accessing the values of DataFrame columns (#33252)
efa85af76d PERF: improve IntegerArray fast constructor (#33359)
716689a9bf REGR: Fix construction of PeriodIndex from strings (#33304)
4a74463d02 Updated headers for files in doc/source/whatsnew (#33376)
542ef40bb4 Updated headers for files in doc/source/whatsnew (#33378)
12b0d4523a TYP: F used in decorators to _typing (#33456)
d72116b0e7 BUG/PLT: Multiple plots with different cmap, colorbars legends use first cmap (#33392)
2a68c12509 BUG: `weights` is not working for multiple columns in df.plot.hist (#33440)
b5f6e59cb4 BUG: Fix ValueError when grouping by read-only category (#33410) (#33446)
0c69615ad5 DEP: Bump min version of dateutil to 2.7.3 (#33363)
be86b6583f CLN: remove unnecessary Series._convert calls (#32949)
4334482c34 BUG/API: prohibit dtype-changing IntervalArray.__setitem__ (#32782)
6e3537dbab CLN: Static types in `pandas/_lib/lib.pyx` (#33329)
54f9b03cf6 BUG/REF: unstack with EA dtypes (#33356)
496c982b4e REF: reshape.concat operate on arrays, not SingleBlockManagers (#33125)
d72dc24e62 CLN: avoid accessing private functions (#33427)
a142ad7bb2 BUG: DataFrame.diff(axis=1) with mixed (or EA) dtypes (#32995)
185a654e3d BUG: scalar indexing on 2D DTA/TDA/PA (#33342)
5f2cdf8e1d REF: call _block_shape from EABlock.make_block (#33308)
5d0faa8ca5 BUG: Series.__getitem__ with MultiIndex and leading integer level (#33404)
fe42954e08 BUG: Timedelta == ndarray[td64] (#33441)
991f784a72 STY: Using __all__; removed "noqa" comment (#33143)
fbae09e11a CLN/TYP: redundant casts and unused ignores (#33453)
78d8af1fe3 Bump cython for asv environment (#33454)

perhaps one of based on the commit message. Not sure.

610568c REF: remove replace_list kludge (REF: remove replace_list kludge #33445)
3cca07c BUG: Fix replacing in string series with NA (BUG: Replace in string series with NA #32621) (BUG: Fix replacing in string series with NA (pandas-dev#32621) #32890)

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2020-07-02T21:11:10Z

Reverting #33445 doesn't restore the perf, so it likely wasn't that.

We do now spend twice as much time in _compare_or_regex_search, which points to #32890. Can you take a look @chrispe92?

chrispe · 2020-07-03T06:18:35Z

Hi @TomAugspurger, yes I'll try to have a look at it this weekend.

chrispe · 2020-07-04T13:22:58Z

I was able to reproduce about the same time percentage difference between the two versions (i.e. the one before and after #32890). It seems like the main overhead is coming from creating the mask array.

pandas/pandas/core/internals/managers.py

Line 1933 in 7e25af8

mask = np.reshape(~(isna(a)), a.shape)

Notice: My analysis showed that the overhead is not coming from the np.reshape, but from the call of pd.isna.

Background: The mask array is used to filter out pd.NA values in order to avoid using them in a comparison (which would raise an error). I'm not entirely sure how to reduce the complexity of that. Will have to give it some thought, but feel free to make any suggestions!

chrispe · 2020-07-05T07:46:04Z

If we were to replace the op function with something like the following (in order to handle the NA values), then it would make it even slower, so that's likely not a possible solution.

if not regex:
    op = np.vectorize(
        lambda x: operator.eq(x, b)
        if isna(x) is False
        else False
    )
else:
    op = np.vectorize(
        lambda x: bool(re.search(b, x))
        if isinstance(x, str) and isinstance(b, str)
        else False
    )

TomAugspurger · 2020-07-07T20:26:50Z

@chrispe92 thanks. How often is _compare_or_regex_search called? Once per scalar element in to_replace right? I wonder if we could call isna() outside of _compare_or_regex_search and pass it in. Something like (untested)

diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py
index c82670106d..3f0b3c9e8e 100644
--- a/pandas/core/internals/managers.py
+++ b/pandas/core/internals/managers.py
@@ -596,7 +596,7 @@ class BlockManager(PandasObject):
         # figure out our mask apriori to avoid repeated replacements
         values = self.as_array()
 
-        def comp(s, regex=False):
+        def comp(s, regex=False, mask=None):
             """
             Generate a bool array by perform an equality check, or perform
             an element-wise regular expression matching
@@ -605,9 +605,10 @@ class BlockManager(PandasObject):
                 return isna(values)
 
             s = com.maybe_box_datetimelike(s)
-            return _compare_or_regex_search(values, s, regex)
+            return _compare_or_regex_search(values, s, regex, mask)
 
-        masks = [comp(s, regex) for s in src_list]
+        mask = ~isna(values)
+        masks = [comp(s, regex, mask) for s in src_list]
 
         result_blocks = []
         src_len = len(src_list) - 1
@@ -1895,7 +1896,7 @@ def _merge_blocks(
 
 
 def _compare_or_regex_search(
-    a: ArrayLike, b: Scalar, regex: bool = False
+    a: ArrayLike, b: Scalar, regex: bool = False, mask=None
 ) -> Union[ArrayLike, bool]:
     """
     Compare two array_like inputs of the same shape or two scalar values
@@ -1941,7 +1942,7 @@ def _compare_or_regex_search(
         )
 
     # GH#32621 use mask to avoid comparing to NAs
-    if isinstance(a, np.ndarray) and not isinstance(b, np.ndarray):
+    if mask is None and isinstance(a, np.ndarray) and not isinstance(b, np.ndarray):
         mask = np.reshape(~(isna(a)), a.shape)
     if isinstance(a, np.ndarray):
         a = a[mask]

chrispe · 2020-07-08T16:54:53Z

Good suggestion @TomAugspurger! I'll give this a go and will keep you posted.

chrispe · 2020-07-11T11:35:18Z

I applied the suggested changes and it seems to significantly improve the performance. Here are the stats:

No Mask   (before #32890 - reference):  3.79s (±139.1 ms) [+0.0%]
With Mask (after #32890):               6.60s (±184.5 ms) [+73.9%]
With Mask (after suggested changes):    4.49s (±198.3 ms) [+18.4%]

The tests also seem to pass fine. If you agree I will continue with a PR to include your suggested solution @TomAugspurger.

jreback · 2020-07-11T12:22:11Z

yep would taken PR

TomAugspurger added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version replace replace method labels May 1, 2020

TomAugspurger added this to the 1.1 milestone May 1, 2020

jreback modified the milestones: 1.1, 1.1.1 Jul 10, 2020

chrispe mentioned this issue Jul 11, 2020

Place the calculation of mask prior to the calls of comp in replace_list to improve performance #35229

Merged

5 tasks

jreback modified the milestones: 1.1.1, 1.1 Jul 13, 2020

jreback closed this as completed in #35229 Jul 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression in replace.ReplaceDict.time_replace_series #33920

Performance regression in replace.ReplaceDict.time_replace_series #33920

TomAugspurger commented May 1, 2020

TomAugspurger commented Jul 2, 2020

chrispe commented Jul 3, 2020 •

edited

Loading

chrispe commented Jul 4, 2020 •

edited

Loading

chrispe commented Jul 5, 2020 •

edited

Loading

TomAugspurger commented Jul 7, 2020

chrispe commented Jul 8, 2020

chrispe commented Jul 11, 2020 •

edited

Loading

jreback commented Jul 11, 2020

Performance regression in replace.ReplaceDict.time_replace_series #33920

Performance regression in replace.ReplaceDict.time_replace_series #33920

Comments

TomAugspurger commented May 1, 2020

TomAugspurger commented Jul 2, 2020

chrispe commented Jul 3, 2020 • edited Loading

chrispe commented Jul 4, 2020 • edited Loading

chrispe commented Jul 5, 2020 • edited Loading

TomAugspurger commented Jul 7, 2020

chrispe commented Jul 8, 2020

chrispe commented Jul 11, 2020 • edited Loading

jreback commented Jul 11, 2020

chrispe commented Jul 3, 2020 •

edited

Loading

chrispe commented Jul 4, 2020 •

edited

Loading

chrispe commented Jul 5, 2020 •

edited

Loading

chrispe commented Jul 11, 2020 •

edited

Loading