[ArrowStringArray] PERF: implement ArrowStringArray._str_split #41085

Closed
wants to merge 20 commits into from

Conversation

simonjayhawkins
Member

Marked as draft because of performance issues when converting the pyarrow result to a NumPy object array of lists.

str.rsplit can probably also be added to this PR (the tests are parameterised; one is failing for StringArray, so a precursor PR may be needed).

There is also an issue with the empty string.

Failing tests are xfailed for now; maybe fix later.

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Apr 21, 2021
result[~is_valid] = self.dtype.na_value
valid = result[is_valid]
# we need to loop through to avoid numpy indexing assignment errors when
# the result is not a ragged array and interpreted as a 2 dimensional
Member

Can you show a small example of the problem? Quickly testing I see no 2D array:

In [49]: np.array(pa.array([[1, 2], [3, 4]]))
Out[49]: array([array([1, 2]), array([3, 4])], dtype=object)
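
For comparison, here is a minimal sketch of the situation the code comment in the diff seems to describe. It assumes the concern is NumPy's shape inference on equal-length Python lists, not on pyarrow list arrays. The helper name `to_object_array_of_lists` is hypothetical and not part of the PR.

```python
import numpy as np

# Equal-length rows: NumPy infers a regular 2-D array, even with dtype=object.
rows = [["a", "b"], ["c", "d"]]
as_2d = np.array(rows, dtype=object)
print(as_2d.shape)  # (2, 2)


def to_object_array_of_lists(rows):
    """Hypothetical helper: build a 1-D object array holding one list per cell."""
    out = np.empty(len(rows), dtype=object)
    # Element-wise assignment avoids NumPy treating the rows as a second dimension.
    for i, row in enumerate(rows):
        out[i] = list(row)
    return out


as_1d = to_object_array_of_lists(rows)
print(as_1d.shape)  # (2,)
```

Preallocating with `np.empty(..., dtype=object)` and filling by integer index is one way to keep the result one-dimensional; whether it matches the exact failure seen in the PR is an assumption.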

else:
    result = np.array(result)
    for i, val in enumerate(result):
        result[i] = val.tolist()
Member

In general, I would maybe leave out this conversion to lists for now (it's opt-in, so there can be some change in behaviour where needed; and since the main reason to use this arrow dtype is performance, it could make sense to keep the arrays).
Or does that break many tests?

Member Author

No tests break with expand=True, and only one test (which we can change) breaks with expand=False.

But that is only part of the perf issue. I have added a benchmark and will circle back to this shortly (pushing changes to origin triggers a notification, but this is not ready for review yet).

Member Author

@simonjayhawkins simonjayhawkins May 19, 2021

The additional time in DataFrame construction from an array of arrays is more than the time taken to convert to lists.

convert to lists

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.051    0.000    0.051    0.000 {method 'tolist' of 'numpy.ndarray' objects}
       23    0.018    0.001    0.018    0.001 {built-in method numpy.array}
       10    0.015    0.001    0.015    0.001 {pyarrow.lib.array}
        2    0.012    0.006    0.012    0.006 {method 'call' of 'pyarrow._compute.Function' objects}
        1    0.012    0.012    0.015    0.015 accessor.py:290(<listcomp>)
        1    0.011    0.011    0.072    0.072 {pandas._libs.lib.map_infer_mask}
        1    0.010    0.010    0.027    0.027 accessor.py:286(<listcomp>)
   100000    0.010    0.000    0.061    0.000 string_arrow.py:923(<lambda>)
   100000    0.009    0.000    0.016    0.000 accessor.py:280(cons_row)
   100001    0.009    0.000    0.012    0.000 accessor.py:289(<genexpr>)
       10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
   100032    0.007    0.000    0.007    0.000 {pandas._libs.lib.is_list_like}
200095/200071    0.007    0.000    0.007    0.000 {built-in method builtins.len}

not converting to lists

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.070    0.070    0.070    0.070 construction.py:793(<listcomp>)
       21    0.021    0.001    0.021    0.001 {built-in method numpy.array}
        1    0.015    0.015    0.019    0.019 accessor.py:290(<listcomp>)
       10    0.015    0.002    0.015    0.002 {pyarrow.lib.array}
        1    0.012    0.012    0.012    0.012 {method 'call' of 'pyarrow._compute.Function' objects}
        1    0.010    0.010    0.025    0.025 accessor.py:286(<listcomp>)
       10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
   100001    0.009    0.000    0.013    0.000 accessor.py:289(<genexpr>)
   100000    0.008    0.000    0.015    0.000 accessor.py:280(cons_row)
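
Tables in the format shown above can be produced with the standard-library profiler. The sketch below is only an illustration of that workflow: `workload` is a hypothetical stand-in for the `str.split` call that was actually profiled, and the sort key (`tottime`) is an assumption based on the ordering of the rows above.

```python
import cProfile
import io
import pstats


def workload(n=10_000):
    # Hypothetical stand-in for the operation being profiled.
    return [s.split("_") for s in ("a_b_c" for _ in range(n))]


def profile_table(func, top=10):
    """Run func under cProfile and return the top rows sorted by tottime."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("tottime").print_stats(top)
    return buffer.getvalue()


table = profile_table(workload)
print(table)
```

Comparing two such tables side by side, one per code path, is how a difference like the `construction.py` list comprehension versus the `tolist` calls becomes visible.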

@pep8speaks

pep8speaks commented Apr 29, 2021

Hello @simonjayhawkins! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-19 13:41:18 UTC

@simonjayhawkins simonjayhawkins changed the title [ArrowStringArray] implement ArrowStringArray._str_split [ArrowStringArray] PERF: implement ArrowStringArray._str_split May 4, 2021
@simonjayhawkins
Member Author

Will close for now to help clear the queue. There is no reason to merge a PR that is slower than the object fallback (other than to exercise the pyarrow functions).

Labels
Strings String extension data type and string data
3 participants