-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: operate on arrays instead of Series in DataFrame/DataFrame ops #33561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
1102f0d
cd0218c
44ec4f4
9ff61c4
c4b7823
b8de50e
103db6c
d16ced8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -262,15 +262,11 @@ def dispatch_to_series(left, right, func, axis=None): | |
------- | ||
DataFrame | ||
""" | ||
# Note: we use iloc to access columns for compat with cases | ||
# with non-unique columns. | ||
import pandas.core.computation.expressions as expressions | ||
# Get the appropriate array-op to apply to each column/block's values. | ||
array_op = get_array_op(func) | ||
|
||
right = lib.item_from_zerodim(right) | ||
if lib.is_scalar(right) or np.ndim(right) == 0: | ||
|
||
# Get the appropriate array-op to apply to each block's values. | ||
array_op = get_array_op(func) | ||
bm = left._mgr.apply(array_op, right=right) | ||
return type(left)(bm) | ||
|
||
|
@@ -281,7 +277,6 @@ def dispatch_to_series(left, right, func, axis=None): | |
# fails in cases with empty columns reached via | ||
# _frame_arith_method_with_reindex | ||
|
||
array_op = get_array_op(func) | ||
bm = left._mgr.operate_blockwise(right._mgr, array_op) | ||
return type(left)(bm) | ||
|
||
|
@@ -295,27 +290,24 @@ def dispatch_to_series(left, right, func, axis=None): | |
# Note: we do not do this unconditionally as it may be lossy or | ||
# expensive for EA dtypes. | ||
right = np.asarray(right) | ||
|
||
def column_op(a, b): | ||
return {i: func(a.iloc[:, i], b[i]) for i in range(len(a.columns))} | ||
|
||
else: | ||
right = right._values | ||
|
||
def column_op(a, b): | ||
return {i: func(a.iloc[:, i], b.iloc[i]) for i in range(len(a.columns))} | ||
arrays = [array_op(l, r) for l, r in zip(left._iter_column_arrays(), right)] | ||
|
||
elif isinstance(right, ABCSeries): | ||
assert right.index.equals(left.index) # Handle other cases later | ||
right = right._values | ||
|
||
def column_op(a, b): | ||
return {i: func(a.iloc[:, i], b) for i in range(len(a.columns))} | ||
arrays = [array_op(l, right) for l in left._iter_column_arrays()] | ||
|
||
else: | ||
# Remaining cases have less-obvious dispatch rules | ||
raise NotImplementedError(right) | ||
|
||
new_data = expressions.evaluate(column_op, left, right) | ||
return new_data | ||
return type(left)._from_arrays( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think you can just return There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I want to explicitly use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sounds good |
||
arrays, left.columns, left.index, verify_integrity=False | ||
) | ||
|
||
|
||
# ----------------------------------------------------------------------------- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesnt need to be in this PR, but since you're attention is already focused on this line: this can be incorrect if
b
is e.g. Categorical, sinceb.iloc[i]
will have the scalar behavior instead of the Categorical (usually raising) behavior. (This is also a place where NaT causes issues, and the idea of only-one-NA causes me nightmares)The main idea I've had here is to use
b.iloc[[i]]
instead ofb.iloc[i]
, haven't gotten around to implementing it. Does this change make it easier to address the issue?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when using something like
b.iloc[[i]]
, you would need to handle broadcasting here as well (since the array ops expect either same length array or scalar). So since that is a pre-existing issue, I would leave that for a dedicated PR.Is there an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think there's an issue for this. If I had found one, i would have added the Numeric label