[ArrowStringArray] TST: parametrize str.split tests #41392

Merged

simonjayhawkins
Member

The test (and benchmark) changes are broken off from #41085 as a precursor to #41085 and #41372 (which currently makes changes to the str.split path, although I may break that PR up as well).

@simonjayhawkins simonjayhawkins added Testing pandas testing functions or related to the test suite Strings String extension data type and string data labels May 9, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone May 9, 2021
@simonjayhawkins simonjayhawkins changed the title [ArrowStringArray] TST: paramerterise str.split tests [ArrowStringArray] TST: parametrize str.split tests May 9, 2021
@@ -230,17 +230,24 @@ def time_contains(self, dtype, regex):

 class Split:

-    params = [True, False]
-    param_names = ["expand"]
+    params = (["str", "string", "arrow_string"], [None, "-", "--"], [True, False])
Member

I would leave pat out of this benchmark to reduce the combinatorial explosion of cases. Which pattern is used shouldn't influence the performance of expanding or not, so I would benchmark them separately.
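To put numbers on the case count, here is a minimal sketch using the parameter lists from the diff. The variable names are hypothetical and only serve the illustration; this is not the benchmark code itself.

```python
from itertools import product

# Parameter values copied from the diff; the names are illustrative only.
dtypes = ["str", "string", "arrow_string"]
pats = [None, "-", "--"]
expand = [True, False]

# Full grid as proposed in the diff: every dtype x pat x expand combination.
full_grid = list(product(dtypes, pats, expand))

# Grid with pat left out, as suggested in the review.
reduced_grid = list(product(dtypes, expand))

print(len(full_grid), len(reduced_grid))  # 18 6
```

Each grid point is timed once per benchmark method (time_split and time_rsplit here), so dropping pat cuts the total from 36 timings to 12.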

Member Author

@simonjayhawkins simonjayhawkins May 10, 2021

Indeed. This was broken off from #41085, where sep determines the path taken: pyarrow.compute.utf8_split_whitespace, pyarrow.compute.split_pattern, or the object fallback.

So I can remove it from this PR for now.
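As a rough illustration of how a separator could select the path, here is a minimal sketch. It is not the code from #41085: the function name split_strings is hypothetical, and the fallback condition (pyarrow not importable) is an assumption made so the sketch runs anywhere. Only the two pyarrow.compute kernels are taken from the comment above.

```python
def split_strings(values, pat=None):
    """Split each string in values, choosing a path based on pat.

    Hypothetical helper for illustration; assumes values contains no nulls.
    """
    try:
        import pyarrow as pa
        import pyarrow.compute as pc
    except ImportError:
        # Assumed fallback condition: plain Python (object) path.
        return [v.split(pat) for v in values]

    arr = pa.array(values)
    if pat is None:
        # No separator given: split on runs of whitespace.
        return pc.utf8_split_whitespace(arr).to_pylist()
    # Literal separator: exact (non-regex) pattern match.
    return pc.split_pattern(arr, pattern=pat).to_pylist()


print(split_strings(["a b", "c d e"]))
print(split_strings(["a-b--c"], "-"))
```

Because the branch depends only on the separator, a benchmark that varies pat exercises different kernels rather than different expand behaviour, which is why the two are better measured separately.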

@simonjayhawkins
Member Author

These are the current timings for master (the object fallback will hopefully improve once the changes in #41372 are broken off, and further by using the pyarrow kernels in #41085):

[ 25.00%] ··· strings.Split.time_rsplit                                                                                                                   ok
[ 25.00%] ··· ============== ========== ==========
              --                     expand       
              -------------- ---------------------
                  dtype         True      False   
              ============== ========== ==========
                   str        82.2±0ms   33.5±0ms 
                  string      66.9±0ms   29.9±0ms 
               arrow_string   88.0±0ms   33.7±0ms 
              ============== ========== ==========

[ 50.00%] ··· strings.Split.time_split                                                                                                                    ok
[ 50.00%] ··· ============== ========== ==========
              --                     expand       
              -------------- ---------------------
                  dtype         True      False   
              ============== ========== ==========
                   str        107±0ms    56.8±0ms 
                  string      92.4±0ms   54.3±0ms 
               arrow_string   111±0ms    58.0±0ms 
              ============== ========== ==========
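To compare the two expand settings, the time_split numbers above can be turned into ratios. This is a small sketch over the figures as posted; the dictionary name is hypothetical.

```python
# Timings in ms copied from the time_split table: (expand=True, expand=False).
time_split_ms = {
    "str": (107.0, 56.8),
    "string": (92.4, 54.3),
    "arrow_string": (111.0, 58.0),
}

# Ratio of expand=True time to expand=False time for each dtype.
ratios = {dtype: t / f for dtype, (t, f) in time_split_ms.items()}

for dtype, ratio in ratios.items():
    print(f"{dtype}: expand=True takes {ratio:.2f}x the expand=False time")
```

For these figures the ratio falls between roughly 1.7 and 1.9 for all three dtypes.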

@jorisvandenbossche jorisvandenbossche merged commit 59df6a8 into pandas-dev:master May 10, 2021
@jorisvandenbossche
Member

Thanks Simon!

@simonjayhawkins simonjayhawkins deleted the split_tests-benchmark branch May 10, 2021 12:14
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Strings String extension data type and string data Testing pandas testing functions or related to the test suite