Performance regression in replace.ReplaceDict.time_replace_series #33920
Comments
Reverting #33445 doesn't restore the perf, so it likely wasn't that. We do now spend twice as much time in |
Hi @TomAugspurger, yes I'll try to have a look at it this weekend. |
I was able to reproduce roughly the same relative slowdown between the two versions (i.e. the one before and the one after #32890). Most of the overhead seems to come from creating the mask array (pandas/core/internals/managers.py, line 1933 at 7e25af8).
Notice: My analysis showed that the overhead is not coming from the Background: The mask array is used to filter out |
If we were to replace the op function with something like the following (in order to handle the NA values), it would be even slower, so that's likely not a viable solution.

    if not regex:
        op = np.vectorize(
            lambda x: operator.eq(x, b)
            if isna(x) is False
            else False
        )
    else:
        op = np.vectorize(
            lambda x: bool(re.search(b, x))
            if isinstance(x, str) and isinstance(b, str)
            else False
        ) |
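To illustrate why the `np.vectorize` variant is expected to be slow, here is a minimal sketch comparing it with a mask-based comparison. The helper names `eq_vectorized` and `eq_masked` and the sample data are hypothetical and are not taken from pandas; `np.vectorize` runs a Python-level loop, so every element pays for a lambda call plus an `isna` check.

```python
import operator
import timeit

import numpy as np
import pandas as pd


def eq_vectorized(values, b):
    """Hypothetical helper: NA-aware equality via a per-element Python lambda."""
    op = np.vectorize(
        lambda x: operator.eq(x, b) if not pd.isna(x) else False,
        otypes=[bool],
    )
    return op(values)


def eq_masked(values, b):
    """Hypothetical helper: compare only non-NA entries, then scatter back."""
    mask = ~pd.isna(values)
    result = np.zeros(values.shape, dtype=bool)
    result[mask] = values[mask] == b
    return result


values = np.array(["a", pd.NA, "b", "a", None], dtype=object)
print(eq_vectorized(values, "a"))
print(eq_masked(values, "a"))

# Rough timing on a larger array; absolute numbers depend on the machine.
big = np.tile(values, 2_000)
print("vectorized:", timeit.timeit(lambda: eq_vectorized(big, "a"), number=5))
print("masked:    ", timeit.timeit(lambda: eq_masked(big, "a"), number=5))
```

Both helpers return the same boolean array; only the cost differs.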
@chrispe92 thanks. How often is
diff --git a/pandas/core/internals/managers.py b/pandas/core/internals/managers.py
index c82670106d..3f0b3c9e8e 100644
--- a/pandas/core/internals/managers.py
+++ b/pandas/core/internals/managers.py
@@ -596,7 +596,7 @@ class BlockManager(PandasObject):
         # figure out our mask apriori to avoid repeated replacements
         values = self.as_array()
-        def comp(s, regex=False):
+        def comp(s, regex=False, mask=None):
             """
             Generate a bool array by perform an equality check, or perform
             an element-wise regular expression matching
@@ -605,9 +605,10 @@ class BlockManager(PandasObject):
                 return isna(values)
             s = com.maybe_box_datetimelike(s)
-            return _compare_or_regex_search(values, s, regex)
+            return _compare_or_regex_search(values, s, regex, mask)
-        masks = [comp(s, regex) for s in src_list]
+        mask = ~isna(values)
+        masks = [comp(s, regex, mask) for s in src_list]
         result_blocks = []
         src_len = len(src_list) - 1
@@ -1895,7 +1896,7 @@ def _merge_blocks(
 def _compare_or_regex_search(
-    a: ArrayLike, b: Scalar, regex: bool = False
+    a: ArrayLike, b: Scalar, regex: bool = False, mask=None
 ) -> Union[ArrayLike, bool]:
     """
     Compare two array_like inputs of the same shape or two scalar values
@@ -1941,7 +1942,7 @@ def _compare_or_regex_search(
         )
     # GH#32621 use mask to avoid comparing to NAs
-    if isinstance(a, np.ndarray) and not isinstance(b, np.ndarray):
+    if mask is None and isinstance(a, np.ndarray) and not isinstance(b, np.ndarray):
         mask = np.reshape(~(isna(a)), a.shape)
     if isinstance(a, np.ndarray):
         a = a[mask] |
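The idea behind the diff above, computing the NA mask once and passing it into each comparison instead of recomputing it per replacement key, can be sketched outside of pandas internals. The function `compare` below is a hypothetical stand-in, not the real `_compare_or_regex_search`, and the data is made up for illustration.

```python
import numpy as np
import pandas as pd


def compare(values, target, mask=None):
    """Hypothetical stand-in: equality check restricted to non-NA entries."""
    if mask is None:
        # Fallback path: the NA mask is recomputed on every call.
        mask = ~pd.isna(values)
    result = np.zeros(values.shape, dtype=bool)
    result[mask] = values[mask] == target
    return result


values = np.array([1.0, np.nan, 3.0, 1.0])
src_list = [1.0, 3.0, 5.0]

# Hoist the mask out of the loop: one isna() pass instead of len(src_list).
mask = ~pd.isna(values)
masks = [compare(values, s, mask) for s in src_list]

for s, m in zip(src_list, masks):
    print(s, m)
```

With a replacement dict of many keys, the per-key `isna` pass is repeated work on the same array, which is why hoisting it plausibly recovers most of the lost time.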
Good suggestion @TomAugspurger! I'll give this a go and will keep you posted. |
I applied the suggested changes and it seems to significantly improve the performance. Here are the stats:
The tests also seem to pass fine. If you agree I will continue with a PR to include your suggested solution @TomAugspurger. |
Yep, would take a PR.
Setup
https://pandas.pydata.org/speed/pandas/index.html#replace.ReplaceDict.time_replace_series?p-inplace=False&commits=acb525a79fd3496a57b93fcfdb86be3de28a1815-aa8e869d76878f07dff065f947c99b5663342087 points to
perhaps one of based on the commit message. Not sure.
string series with NA (BUG: Replace in string series with NA #32621) (BUG: Fix replacing in string series with NA (pandas-dev#32621) #32890)
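For anyone wanting to time this locally without the asv suite, the following is a rough sketch of a benchmark in the spirit of `replace.ReplaceDict.time_replace_series`. The sizes and the construction of the replacement dict are assumptions chosen so that the script runs quickly; they are not copied from the actual benchmark.

```python
import timeit

import numpy as np
import pandas as pd

N = 1_000  # illustrative size (assumption), kept small so the sketch is fast

# Replacement dict with many keys: {0: N, 1: N + 1, ...}
to_rep = dict(enumerate(np.arange(N) + N))
s = pd.Series(np.random.randint(N, size=1_000))

result = s.replace(to_rep)

repeats = 3
elapsed = timeit.timeit(lambda: s.replace(to_rep), number=repeats)
print(f"mean time per replace: {elapsed / repeats:.4f} s")
```

Running the same script on the commits before and after the suspected change should show whether the slowdown reproduces on a given machine.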