[ArrowStringArray] PERF: implement ArrowStringArray._str_split #41085

simonjayhawkins · 2021-04-21T19:33:20Z

Marked as draft because of performance issues when converting the pyarrow result to a numpy object array of lists.

str.rsplit can probably also be added to this PR (the tests are parametrized, with one failing for StringArray; a precursor PR may be needed).

There is also an issue with an empty string.

Failing tests are xfailed for now; maybe fix later.

jorisvandenbossche · 2021-04-22T06:50:18Z

pandas/core/arrays/string_arrow.py

+            result[~is_valid] = self.dtype.na_value
+            valid = result[is_valid]
+            # we need to loop through to avoid numpy indexing assignment errors when
+            # the result is not a ragged array and interpreted as a 2 dimensional


Can you show a small example of the problem? Quickly testing I see no 2D array:

In [49]: np.array(pa.array([[1, 2], [3, 4]]))
Out[49]: array([array([1, 2]), array([3, 4])], dtype=object)

jorisvandenbossche · 2021-04-22T06:52:10Z

pandas/core/arrays/string_arrow.py

+        else:
+            result = np.array(result)
+            for i, val in enumerate(result):
+                result[i] = val.tolist()


In general, I would maybe leave out this conversion to lists for now (it's opt-in, so there can be some change in behaviour where needed; and since the main reason to use this arrow dtype is performance, it could make sense to keep the arrays).
Or does that break many tests?

No tests break with expand=True, and only one test (which we can change) breaks with expand=False.

But that is only part of the perf issue. I have added a benchmark and will circle back to this shortly (pushing changes to origin triggers a notification, but this is not ready for review yet).

The additional time in DataFrame construction from an array of arrays is more than the time taken to convert to lists.

Converting to lists:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       100000    0.051    0.000    0.051    0.000 {method 'tolist' of 'numpy.ndarray' objects}
           23    0.018    0.001    0.018    0.001 {built-in method numpy.array}
           10    0.015    0.001    0.015    0.001 {pyarrow.lib.array}
            2    0.012    0.006    0.012    0.006 {method 'call' of 'pyarrow._compute.Function' objects}
            1    0.012    0.012    0.015    0.015 accessor.py:290(<listcomp>)
            1    0.011    0.011    0.072    0.072 {pandas._libs.lib.map_infer_mask}
            1    0.010    0.010    0.027    0.027 accessor.py:286(<listcomp>)
       100000    0.010    0.000    0.061    0.000 string_arrow.py:923(<lambda>)
       100000    0.009    0.000    0.016    0.000 accessor.py:280(cons_row)
       100001    0.009    0.000    0.012    0.000 accessor.py:289(<genexpr>)
           10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
       100032    0.007    0.000    0.007    0.000 {pandas._libs.lib.is_list_like}
200095/200071    0.007    0.000    0.007    0.000 {built-in method builtins.len}

Not converting to lists:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.070    0.070    0.070    0.070 construction.py:793(<listcomp>)
           21    0.021    0.001    0.021    0.001 {built-in method numpy.array}
            1    0.015    0.015    0.019    0.019 accessor.py:290(<listcomp>)
           10    0.015    0.002    0.015    0.002 {pyarrow.lib.array}
            1    0.012    0.012    0.012    0.012 {method 'call' of 'pyarrow._compute.Function' objects}
            1    0.010    0.010    0.025    0.025 accessor.py:286(<listcomp>)
           10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
       100001    0.009    0.000    0.013    0.000 accessor.py:289(<genexpr>)
       100000    0.008    0.000    0.015    0.000 accessor.py:280(cons_row)
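Listings of this shape (sorted by `tottime`) can be produced with the standard-library `cProfile` and `pstats` modules. The sketch below is not the benchmark used in this PR; `arrays_to_lists` and `profile_by_tottime` are hypothetical helpers, and the workload is a small stand-in that only mimics the per-element `tolist()` conversion discussed above.

```python
import cProfile
import io
import pstats

import numpy as np


def arrays_to_lists(values):
    """Convert each inner ndarray of an object array to a Python list."""
    out = np.empty(len(values), dtype=object)
    for i, val in enumerate(values):
        out[i] = val.tolist()
    return out


def profile_by_tottime(func, *args, limit=10):
    """Run func(*args) under cProfile and return the top entries by tottime."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(limit)
    return stream.getvalue()


# Stand-in workload: an object array whose elements are small ndarrays.
values = np.empty(1_000, dtype=object)
for i in range(len(values)):
    values[i] = np.array(["a", "b", "c"], dtype=object)

report = profile_by_tottime(arrays_to_lists, values)
print(report)
```

Comparing two such reports side by side, as done above, shows where the time moves when a step is removed: here the `tolist` cost disappears but a larger cost appears in DataFrame construction.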

pep8speaks · 2021-04-29T12:48:47Z

Hello @simonjayhawkins! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-19 13:41:18 UTC

simonjayhawkins · 2021-05-24T19:40:09Z

Closing for now to help clear the queue. There is no reason to merge a PR that is slower than the object fallback (other than to exercise the pyarrow functions).

[ArrowStringArray] implement ArrowStringArray._str_split

34df9e5

simonjayhawkins added the Strings String extension data type and string data label Apr 21, 2021

jorisvandenbossche reviewed Apr 22, 2021


simonjayhawkins added 5 commits April 29, 2021 08:11

Merge remote-tracking branch 'upstream/master' into _str_split

7ad0269

move fixture to conftest.py

427eff7

mixed object to seperate test

09ad85e

add benchmark

39dd30a

wip

c9511d9

simonjayhawkins mentioned this pull request May 1, 2021

[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908

Merged

4 tasks

simonjayhawkins changed the title ~~[ArrowStringArray] implement ArrowStringArray._str_split~~ [ArrowStringArray] PERF: implement ArrowStringArray._str_split May 4, 2021

This was referenced May 7, 2021

[ArrowStringArray] REF: str.extract dispatch to array #41372

Closed

[ArrowStringArray] TST: parametrize str.split tests #41392

Merged

simonjayhawkins added 14 commits May 12, 2021 12:51

Merge remote-tracking branch 'upstream/master' into _str_split

8678af7

post merge fix-up

5c2ab24

remove fixture

12407fb

remove xfail (need to fix failing test on blank string before merge)

24d2395

Merge remote-tracking branch 'upstream/master' into _str_split

f43c61c

seperate benchmark for pattern

3d9297d

use pa_version_under3p0 instead of hasattr

a574ccb

Merge remote-tracking branch 'upstream/master' into _str_split

c7ba99c

add test case

9fc0144

use ObjectStringArrayMixin._str_map

8855100

use lib.map_infer_mask

ad3480f

update benchmark

70677c4

always convert to lists

af58055

Merge remote-tracking branch 'upstream/master' into _str_split

2abc508

simonjayhawkins closed this May 24, 2021


simonjayhawkins commented Apr 21, 2021

jorisvandenbossche Apr 22, 2021

jorisvandenbossche Apr 22, 2021

simonjayhawkins Apr 29, 2021

simonjayhawkins May 19, 2021 (edited)

pep8speaks commented Apr 29, 2021 (edited)

simonjayhawkins commented May 24, 2021
