
CLN: Unify signatures in _libs.groupby #34372


Merged
merged 8 commits into pandas-dev:master from the _get_cython_result branch on Jun 18, 2020

Conversation

@rhshadrach (Member) commented May 25, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

xref #33630 (comment)

The main goal is to allow _get_cythonized_result to be used with the cython implementation of var. In order to do this, it was best to modify the signatures of the following functions in _libs.groupby:

  • group_any_all;
  • _group_var; and
  • group_quantile

so that the argument order matches the other cython functions (e.g. those used in ops.BaseGrouper.aggregate or transform).
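For illustration only, here is a rough Python sketch of the shared calling convention those aggregation kernels follow: results written in place into a preallocated out buffer, then counts, values, and group labels. The parameter names and shapes are assumptions for illustration, not the literal Cython declarations in _libs.groupby.

import numpy as np

# Hypothetical sketch of the common argument order for groupby aggregation
# kernels; an illustration of the convention, not the actual pandas signature.
def group_kernel(
    out: np.ndarray,     # (ngroups, ncols) output buffer, filled in place
    counts: np.ndarray,  # (ngroups,) number of observations per group
    values: np.ndarray,  # (nrows, ncols) input data
    labels: np.ndarray,  # (nrows,) group codes, -1 for missing
    **kwargs,            # per-function options, e.g. min_count or ddof
) -> None:
    ...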

Note this PR has the fix for #33955 that is also in #33988 - that PR should be merged first.

Timings using %timeit on DataFrames of various shapes (the code for generating them is at the bottom):

[image: table of %timeit results across the various DataFrame shapes]

I looked into the last row. 26.7% of the time is spent in just iterating over the columns, 12.8% in the cython call, and 35.7% in the call to _wrap_aggregated_output.

While working on this, it seemed best to implement support for ddof != 1 in cython for var. The following code takes 26.5ms on master and 3.88ms in this PR.

import numpy as np
import pandas as pd

# One million rows, a single float column, plus a 'key' column with 50 groups.
ncol, order = 1, 6
df = pd.DataFrame(np.random.uniform(low=0.0, high=10.0, size=(10**order, ncol)))
df['key'] = np.random.randint(0, 50, (10**order, 1))
g = df.groupby('key')
t = %timeit -o g.std(ddof=4)
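For context, the following is a minimal NumPy sketch of what the ddof generalization amounts to (my own illustration, not the pandas Cython code): each group's sum of squared deviations is divided by count - ddof rather than a hard-coded count - 1.

import numpy as np

# Grouped variance with a configurable ddof, using Welford's online update.
# Pure-Python sketch for illustration only.
def group_var_sketch(values, labels, ngroups, ddof=1):
    count = np.zeros(ngroups)
    mean = np.zeros(ngroups)
    m2 = np.zeros(ngroups)              # running sum of squared deviations
    for val, lab in zip(values, labels):
        count[lab] += 1
        delta = val - mean[lab]
        mean[lab] += delta / count[lab]
        m2[lab] += delta * (val - mean[lab])
    out = np.full(ngroups, np.nan)
    ok = count > ddof                   # need more than ddof observations
    out[ok] = m2[ok] / (count[ok] - ddof)
    return out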

@WillAyd (Member) left a comment

A noble cause here... thanks for tackling! Reading through your OP, I think these are all good thoughts, but I'd actually advise scaling this back a bit before trying to tackle more. Maybe just limit things for now to definition/declaration changes, leaving the implementations alone, and we can move on from there.

@WillAyd WillAyd added the Refactor Internal refactoring of code label May 27, 2020
@rhshadrach rhshadrach force-pushed the _get_cython_result branch from f2bb7a9 to 98f1c0d Compare June 2, 2020 20:50
@rhshadrach (Member Author) commented:

@WillAyd I got everything working until I tried to use it for var, which was the original motivation. For var, I couldn't get the tests with nullable-integer columns to work. So instead of changing the cython implementation of quantile et al. to use 2d input/output, I made _get_cythonized_result flexible enough to work with cython implementations that are meant for 2d data. More details in the OP.

@rhshadrach rhshadrach marked this pull request as ready for review June 5, 2020 01:38
@rhshadrach rhshadrach changed the title WIP: CLN: Unify signatures in _libs.groupby CLN: Unify signatures in _libs.groupby Jun 5, 2020
@WillAyd (Member) left a comment

Can you run the asv benchmarks for groupby? We use those to benchmark performance generally

@rhshadrach (Member Author) commented Jun 7, 2020

@WillAyd Here are the significant asv changes using "-b ^groupby"

       before           after         ratio
     [c71bfc36]       [98f1c0d2]
     <at_duplicate_labels>       <_get_cython_result>
+      7.89±0.4ms       9.38±0.5ms     1.19  groupby.Nth.time_series_nth('float32')
+         413±2ms         472±20ms     1.14  groupby.Apply.time_copy_overhead_single_col
+       101±0.7μs          112±4μs     1.12  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'direct')
+     6.70±0.03ms       7.47±0.6ms     1.12  groupby.Transform.time_transform_multi_key1
+      98.9±0.5μs          109±4μs     1.10  groupby.GroupByMethods.time_dtype_as_group('object', 'count', 'direct')
-        268±40μs          233±2μs     0.87  groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'direct')
-         622±3μs         523±10μs     0.84  groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'transformation')
-         627±4μs         522±10μs     0.83  groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'direct')
-         685±4μs          540±8μs     0.79  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'transformation')
-         684±4μs          533±8μs     0.78  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')
-        678±20μs          516±1μs     0.76  groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'direct')
-        696±40μs          526±7μs     0.76  groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'transformation')
-         708±4μs          528±2μs     0.75  groupby.GroupByMethods.time_dtype_as_group('int', 'sem', 'transformation')
-         710±8μs        525±0.5μs     0.74  groupby.GroupByMethods.time_dtype_as_group('int', 'sem', 'direct')
-        348±60μs          256±3μs     0.74  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'direct')
-       306±0.8μs          194±8μs     0.63  groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'transformation')
-         305±2μs          192±3μs     0.63  groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'direct')
-         366±7μs          200±4μs     0.55  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'direct')
-         362±4μs          197±1μs     0.54  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation')
-         362±5μs          191±1μs     0.53  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct')
-         367±8μs          191±1μs     0.52  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation')
-         387±6μs          197±2μs     0.51  groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'direct')
-         390±8μs        197±0.7μs     0.51  groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'transformation')

Edit: I'm new to using asv; I ran it using the command

asv continuous -f 1.1 upstream/master HEAD -b ^groupby

but it pulled in the branch name "at_duplicate_labels". I don't think this is of any consequence; that branch is identical to master.

@@ -1731,7 +1731,11 @@ def _wrap_aggregated_output(
DataFrame
"""
indexed_output = {key.position: val for key, val in output.items()}
columns = Index(key.label for key in output)
if self.axis == 0:
Member:

I think this can just be self._obj_with_exclusions._get_axis_name(self.axis)

Member Author:

Awesome, that is much better. Thanks!

Member Author:

I spoke too soon. This method returns either "index" or "columns", not the name that we desire here.

@rhshadrach (Member Author) Jun 10, 2020:

Ah ha - found it! self._obj_with_exclusions._get_axis(1-self.axis).name

if self.axis == 0:
name = self._obj_with_exclusions.columns.name
else:
name = self._obj_with_exclusions.index.name
Contributor:

is this code useful anywhere else? e.g. this would be fine as a property of the class L1734-37

Member Author:

I did some regex searches; not that I can tell currently. It seems suitable for a property though, so it might still make sense to add? Also note that with @WillAyd's help I was able to get it down to a single line with no branching. Let me know if it should be added.

@rhshadrach (Member Author) commented:

@jreback Friendly ping. In #34372 (comment) you asked if the extraction of the name might be used anywhere else. Not that I can tell from searching, but it's possible I've missed something. I think that means we should hold off on making it a property? Also note that the code to extract it has now been simplified to:

name = self._obj_with_exclusions._get_axis(1 - self.axis).name
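As a quick illustration (my own example, not from the PR) of the distinction discussed above: _get_axis_name returns the axis keyword ("index" or "columns"), while _get_axis(...).name returns the name attached to that axis's Index object, which is what _wrap_aggregated_output needs. Both are private pandas accessors.

import pandas as pd

# DataFrame whose columns Index carries a name.
df = pd.DataFrame([[1, 2]], columns=pd.Index(["a", "b"], name="cols"))

print(df._get_axis_name(1))  # "columns" -- the axis keyword, not what we want
print(df._get_axis(1).name)  # "cols"    -- the Index name we actually want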

@jreback (Contributor) left a comment

looks good. few comments. are there any perf implications here?

if aggregate:
result_sz = ngroups
else:
result_sz = len(values)

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
if needs_2d:
Contributor:

can you use atleast_2d or just reshape here?

Contributor:

can you do this here

@rhshadrach (Member Author) Jun 18, 2020:

I think you're asking to replace

if needs_2d:
    result = np.zeros((result_sz, 1), dtype=cython_dtype)
else:
    result = np.zeros(result_sz, dtype=cython_dtype)

with

result = np.zeros(result_sz, dtype=cython_dtype)
if needs_2d:
    result = result.reshape((-1, 1))

I think the reshape version is less performant when needs_2d is True, no?

Member Author:

I don't believe atleast_2d is applicable; it will turn a 1d array into a single row (1xn), whereas we need a column (nx1).

Contributor:

result = np.zeros(result_sz, dtype=cython_dtype)
if needs_2d:
    result = result.reshape((-1, 1))

yes this would be an improvement
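For what it's worth, the reshape in the suggested version returns a view of the freshly allocated 1-D buffer rather than a copy, so the overhead compared with allocating a 2-D array directly should be negligible. A quick check (my own snippet, not from the PR):

import numpy as np

result = np.zeros(5, dtype=np.float64)
result_2d = result.reshape((-1, 1))

print(result_2d.base is result)  # True: reshape returned a view, no data copied
print(result_2d.shape)           # (5, 1)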

@jreback jreback added this to the 1.1 milestone Jun 14, 2020
@jreback (Contributor) commented Jun 14, 2020

Note this PR has the fix for #33955 that is also in #33988 - that PR should be merged first.

is this comment still relevant?

@rhshadrach (Member Author) commented:

@jreback - changes made, still waiting on checks. I assumed you wanted to keep the prefix "needs", so the flag is now "needs_at_least2d".

are there any perf implications here?

asv is here: #34372 (comment)
The OP also has timings using %timeit for frames of various shapes: #34372 (comment)

I think the last line of the timings in the OP is the only one of concern, where on a short and wide frame var is taking 17x longer. I also profiled this using line_profiler. From the OP:

I looked into the last row. 26.7% of the time is spent in just iterating over the columns, 12.8% in the cython call, and 35.7% in the call to _wrap_aggregated_output.
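For reference, a hedged sketch of how such line-level timings can be gathered with line_profiler's IPython magic, reusing g from the snippet in the OP (the exact invocation and the dotted path to _get_cythonized_result are my assumptions; they are not shown in the PR):

# Load the line_profiler extension and collect per-line timings for
# _get_cythonized_result while running a grouped var.
%load_ext line_profiler
%lprun -f pd.core.groupby.groupby.GroupBy._get_cythonized_result g.var()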

Note this PR has the fix for #33955 that is also in #33988 - that PR should be merged first.

is this comment still relevant?

Unclear - that PR has the fix that is also in here, but it's never hit by any tests there. This PR, on the other hand, hits the change via tests involving std. I just don't want to step on any toes; I only saw that PR as I was submitting this one.

@jreback (Contributor) commented Jun 14, 2020

ok @rhshadrach happy to merge this, pls open an issue for the perf issue (on short / wide); can address in a followup.

if aggregate:
result_sz = ngroups
else:
result_sz = len(values)

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
if needs_at_least2d:
Contributor:

can you use atleast_2d here (the numpy function)? This is what I meant.

Member Author:

Ah, I see. I don't believe so; we'd like to convert [1, 2] to [[1], [2]], but atleast_2d will output [[1, 2]] instead.
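A quick check (my own snippet) confirming the shape mismatch described above:

import numpy as np

print(np.atleast_2d([1, 2]).shape)              # (1, 2): a single row
print(np.asarray([1, 2]).reshape(-1, 1).shape)  # (2, 1): the column shape needed here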

@TomAugspurger (Contributor) commented:

@rhshadrach can you merge master? Hopefully the CI issues have all been resolved.

@rhshadrach (Member Author) commented:

@TomAugspurger thanks!
@jreback all green now.

@jreback (Contributor) left a comment

still have the remaining comment

if aggregate:
result_sz = ngroups
else:
result_sz = len(values)

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
if needs_2d:
Contributor:

can you do this here

@rhshadrach (Member Author) commented:

@jreback Thanks for the comments, changes made and checks pass.

@jreback jreback merged commit c9144ca into pandas-dev:master Jun 18, 2020
@jreback (Contributor) commented Jun 18, 2020

thanks @rhshadrach nice cleanup

Labels: Groupby, Refactor