[ArrowStringArray] TST: parametrize str.split tests #41392

simonjayhawkins · 2021-05-09T10:13:04Z

These are the test (and benchmark) changes broken off from #41085, as a precursor to #41085 and #41372 (which currently changes the str.split path, although I may break that PR up as well).

jorisvandenbossche · 2021-05-10T09:48:37Z

asv_bench/benchmarks/strings.py

@@ -230,17 +230,24 @@ def time_contains(self, dtype, regex):

 class Split:

-    params = [True, False]
-    param_names = ["expand"]
+    params = (["str", "string", "arrow_string"], [None, "-", "--"], [True, False])


I would leave pat out of this benchmark to limit the combinatorial explosion of cases. Which pattern is used shouldn't influence the performance of expanding or not, so I would benchmark them separately.

Indeed. This was broken off from #41085, where sep determines the path taken: pyarrow.compute.utf8_split_whitespace, pyarrow.compute.split_pattern, or the object fallback.

So I can remove it from this PR for now.
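
To make the case count behind this suggestion concrete: asv runs a benchmark once per element of the Cartesian product of params. The sketch below only counts those combinations with itertools.product; the parameter lists are copied from the diff above, and the helper name count_cases is made up for illustration.

```python
from itertools import product

# Parameter lists as they appear in the diff (dtype, pat, expand).
dtypes = ["str", "string", "arrow_string"]
pats = [None, "-", "--"]
expands = [True, False]


def count_cases(*param_lists):
    """Number of parameter combinations for the given lists (hypothetical helper)."""
    return len(list(product(*param_lists)))


with_pat = count_cases(dtypes, pats, expands)  # dtype x pat x expand
without_pat = count_cases(dtypes, expands)     # dtype x expand only

print(with_pat, without_pat)
```

Dropping pat takes each benchmark from 18 cases to 6, which is the 3 x 2 grid of dtype and expand.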


simonjayhawkins · 2021-05-10T10:56:22Z

These are the current timings for master. I hope to improve the object fallback by breaking off changes from #41372, and to get further gains from the pyarrow kernels in #41085.

[ 25.00%] ··· strings.Split.time_rsplit                                                                                                                   ok
[ 25.00%] ··· ============== ========== ==========
              --                     expand       
              -------------- ---------------------
                  dtype         True      False   
              ============== ========== ==========
                   str        82.2±0ms   33.5±0ms 
                  string      66.9±0ms   29.9±0ms 
               arrow_string   88.0±0ms   33.7±0ms 
              ============== ========== ==========

[ 50.00%] ··· strings.Split.time_split                                                                                                                    ok
[ 50.00%] ··· ============== ========== ==========
              --                     expand       
              -------------- ---------------------
                  dtype         True      False   
              ============== ========== ==========
                   str        107±0ms    56.8±0ms 
                  string      92.4±0ms   54.3±0ms 
               arrow_string   111±0ms    58.0±0ms 
              ============== ========== ==========
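
For readers less familiar with the expand parameter timed above, here is a minimal sketch of what it changes in the result. It uses a small object-dtype Series rather than the benchmark's data or the arrow_string dtype, so it illustrates the API only and says nothing about the timings.

```python
import pandas as pd

# Small object-dtype example; the benchmark itself uses different data.
s = pd.Series(["a-b-c", "d-e-f"], dtype=object)

as_lists = s.str.split("-", expand=False)  # Series whose elements are lists
as_frame = s.str.split("-", expand=True)   # DataFrame with one column per piece

print(type(as_lists).__name__, list(as_lists.iloc[0]))
print(type(as_frame).__name__, as_frame.shape)
```

With expand=True the pieces are laid out as DataFrame columns; with expand=False each row keeps its pieces in a single list.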

jorisvandenbossche · 2021-05-10T11:59:02Z

Thanks Simon!

simonjayhawkins added 2 commits May 9, 2021 10:34

[ArrowStringArray] TST: paramerterise str.splt tests

2e55728

fixup test_rsplit_to_dataframe_expand

dbddda1

simonjayhawkins added the Testing (pandas testing functions or related to the test suite) and Strings (String extension data type and string data) labels May 9, 2021

simonjayhawkins added this to the 1.3 milestone May 9, 2021

simonjayhawkins changed the title ~~[ArrowStringArray] TST: paramerterise str.split tests~~ [ArrowStringArray] TST: parametrize str.split tests May 9, 2021

jorisvandenbossche reviewed May 10, 2021


Merge remote-tracking branch 'upstream/master' into split_tests-benchmark

e3f2903

remove pat from benchmark for now

005d881

jorisvandenbossche approved these changes May 10, 2021


jorisvandenbossche merged commit 59df6a8 into pandas-dev:master May 10, 2021

simonjayhawkins deleted the split_tests-benchmark branch May 10, 2021 12:14

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

[ArrowStringArray] TST: parametrize str.split tests (pandas-dev#41392)

9696e83
