[ArrowStringArray] PERF: implement ArrowStringArray._str_split #41085
pandas/core/arrays/string_arrow.py
Outdated
result[~is_valid] = self.dtype.na_value
valid = result[is_valid]
# we need to loop through to avoid numpy indexing assignment errors when
# the result is not a ragged array and interpreted as a 2 dimensional
Can you show a small example of the problem? Quickly testing I see no 2D array:
In [49]: np.array(pa.array([[1, 2], [3, 4]]))
Out[49]: array([array([1, 2]), array([3, 4])], dtype=object)
pandas/core/arrays/string_arrow.py
Outdated
else:
    result = np.array(result)
    for i, val in enumerate(result):
        result[i] = val.tolist()
In general, I would maybe leave out this conversion to lists for now (it's opt-in, so there can be some change in behaviour where needed; and since the main reason to use this arrow dtype is performance, it could make sense to keep the arrays).
Or does that break many tests?
No tests break with expand=True, and only one test (which we can change) breaks with expand=False.
But that is only part of the perf issue. I have added a benchmark but will circle back to this shortly (pushing changes to origin triggers a notification, but this is not ready for review yet).
The additional time in DataFrame construction from an array of arrays is more than the time taken to convert to lists.
Converting to lists:
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         100000    0.051    0.000    0.051    0.000 {method 'tolist' of 'numpy.ndarray' objects}
             23    0.018    0.001    0.018    0.001 {built-in method numpy.array}
             10    0.015    0.001    0.015    0.001 {pyarrow.lib.array}
              2    0.012    0.006    0.012    0.006 {method 'call' of 'pyarrow._compute.Function' objects}
              1    0.012    0.012    0.015    0.015 accessor.py:290(<listcomp>)
              1    0.011    0.011    0.072    0.072 {pandas._libs.lib.map_infer_mask}
              1    0.010    0.010    0.027    0.027 accessor.py:286(<listcomp>)
         100000    0.010    0.000    0.061    0.000 string_arrow.py:923(<lambda>)
         100000    0.009    0.000    0.016    0.000 accessor.py:280(cons_row)
         100001    0.009    0.000    0.012    0.000 accessor.py:289(<genexpr>)
             10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
         100032    0.007    0.000    0.007    0.000 {pandas._libs.lib.is_list_like}
  200095/200071    0.007    0.000    0.007    0.000 {built-in method builtins.len}
Not converting to lists:
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              1    0.070    0.070    0.070    0.070 construction.py:793(<listcomp>)
             21    0.021    0.001    0.021    0.001 {built-in method numpy.array}
              1    0.015    0.015    0.019    0.019 accessor.py:290(<listcomp>)
             10    0.015    0.002    0.015    0.002 {pyarrow.lib.array}
              1    0.012    0.012    0.012    0.012 {method 'call' of 'pyarrow._compute.Function' objects}
              1    0.010    0.010    0.025    0.025 accessor.py:286(<listcomp>)
             10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
         100001    0.009    0.000    0.013    0.000 accessor.py:289(<genexpr>)
         100000    0.008    0.000    0.015    0.000 accessor.py:280(cons_row)
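The comment does not show how the profiles above were collected. A listing in the same format can be produced with the standard library's `cProfile` and `pstats`, as in the minimal sketch below. The workload (`to_lists`) and the helper `profile_top` are hypothetical stand-ins for illustration; they are not the benchmark added in this PR.

```python
import cProfile
import io
import pstats

import numpy as np


def to_lists(arrays):
    """Stand-in workload: convert each row array to a Python list."""
    return [arr.tolist() for arr in arrays]


def profile_top(func, *args, limit=10):
    """Run func under cProfile and return the top entries by tottime."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()


# Assumed input: many small object arrays, as an array-of-arrays result
# from a split operation might look.
arrays = [np.array(["a", "b", "c"], dtype=object) for _ in range(1000)]
report = profile_top(to_lists, arrays)
print(report)
```

Sorting by `tottime` puts the functions that spend the most time in their own body first, which is how the `tolist` calls and the `construction.py` list comprehension surface at the top of the two listings.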
Hello @simonjayhawkins! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-05-19 13:41:18 UTC
will close for now to help clear the queue. no reason to merge a PR that is slower than object fallback (other than to exercise the pyarrow functions)
Marked as draft because of performance issues with converting the pyarrow result to a numpy object array of lists.
str.rsplit can probably also be added to this PR (the tests are parameterised, with one failing for StringArray, so a precursor may be needed). There is also an issue with the empty string.
Failing tests are xfailed; maybe fix later.