
CLN: Unify signatures in _libs.groupby #34372


Merged
merged 8 commits into pandas-dev:master from the _get_cython_result branch on Jun 18, 2020

Conversation

@rhshadrach (Member) commented May 25, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

xref #33630 (comment)

The main goal is to allow _get_cythonized_result to be used with the cython implementation of var. In order to do this, it was best to modify the signatures of the following functions in _libs.groupby:

  • group_any_all;
  • _group_var; and
  • group_quantile

so that the argument order matches the other cython functions (e.g. those used in ops.BaseGrouper.aggregate or transform).
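For illustration only, here is a rough Python sketch of the shared calling convention those aggregation kernels follow: results written in place into a preallocated out buffer, then counts, values, and group labels. The parameter names and shapes are assumptions for illustration, not the literal Cython declarations in _libs.groupby.

import numpy as np

# Hypothetical sketch of the common argument order for groupby aggregation
# kernels; an illustration of the convention, not the actual pandas signature.
def group_kernel(
    out: np.ndarray,     # (ngroups, ncols) output buffer, filled in place
    counts: np.ndarray,  # (ngroups,) number of observations per group
    values: np.ndarray,  # (nrows, ncols) input data
    labels: np.ndarray,  # (nrows,) group codes, -1 for missing
    **kwargs,            # per-function options, e.g. min_count or ddof
) -> None:
    ...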

Note this PR has the fix for #33955 that is also in #33988 - that PR should be merged first.

Timings using %timeit on DataFrames of various shapes (the code for generating them is at the bottom):

[image: table of %timeit results across the various DataFrame shapes]

I looked into the last row. 26.7% of the time is spent in just iterating over the columns, 12.8% in the cython call, and 35.7% in the call to _wrap_aggregated_output.

While working on this, it seemed best to implement support for ddof != 1 in cython for var. The following code takes 26.5ms on master and 3.88ms in this PR.

import numpy as np
import pandas as pd

# One million rows, a single float column, plus a 'key' column with 50 groups.
ncol, order = 1, 6
df = pd.DataFrame(np.random.uniform(low=0.0, high=10.0, size=(10**order, ncol)))
df['key'] = np.random.randint(0, 50, (10**order, 1))
g = df.groupby('key')
t = %timeit -o g.std(ddof=4)
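For context, the following is a minimal NumPy sketch of what the ddof generalization amounts to (my own illustration, not the pandas Cython code): each group's sum of squared deviations is divided by count - ddof rather than a hard-coded count - 1.

import numpy as np

# Grouped variance with a configurable ddof, using Welford's online update.
# Pure-Python sketch for illustration only.
def group_var_sketch(values, labels, ngroups, ddof=1):
    count = np.zeros(ngroups)
    mean = np.zeros(ngroups)
    m2 = np.zeros(ngroups)              # running sum of squared deviations
    for val, lab in zip(values, labels):
        count[lab] += 1
        delta = val - mean[lab]
        mean[lab] += delta / count[lab]
        m2[lab] += delta * (val - mean[lab])
    out = np.full(ngroups, np.nan)
    ok = count > ddof                   # need more than ddof observations
    out[ok] = m2[ok] / (count[ok] - ddof)
    return out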

@WillAyd (Member) left a comment

A noble cause here... thanks for tackling! Reading through your OP, I think these are all good thoughts, but I'd actually advise scaling this back a bit before trying to tackle more. Maybe just limit things for now to definition/declaration changes, leaving the implementations alone, and we can move on from there.

@WillAyd WillAyd added the Refactor Internal refactoring of code label May 27, 2020
@rhshadrach rhshadrach force-pushed the _get_cython_result branch from f2bb7a9 to 98f1c0d Compare June 2, 2020 20:50
@rhshadrach (Member Author) commented:

@WillAyd I got everything working until I tried to use it for var, which was the original motivation. For var, I couldn't get the tests with nullable-integer columns to work. So instead of changing the cython implementation of quantile et al. to use 2d input/output, I made _get_cythonized_result flexible enough to work with cython implementations that are meant for 2d data. More details in the OP.

@rhshadrach rhshadrach marked this pull request as ready for review June 5, 2020 01:38
@rhshadrach rhshadrach changed the title WIP: CLN: Unify signatures in _libs.groupby CLN: Unify signatures in _libs.groupby Jun 5, 2020
@WillAyd (Member) left a comment

Can you run the asv benchmarks for groupby? We use those to benchmark performance generally

@rhshadrach (Member Author) commented Jun 7, 2020

@WillAyd Here are the significant asv changes using "-b ^groupby"

       before           after         ratio
     [c71bfc36]       [98f1c0d2]
     <at_duplicate_labels>       <_get_cython_result>
+      7.89±0.4ms       9.38±0.5ms     1.19  groupby.Nth.time_series_nth('float32')
+         413±2ms         472±20ms     1.14  groupby.Apply.time_copy_overhead_single_col
+       101±0.7μs          112±4μs     1.12  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'direct')
+     6.70±0.03ms       7.47±0.6ms     1.12  groupby.Transform.time_transform_multi_key1
+      98.9±0.5μs          109±4μs     1.10  groupby.GroupByMethods.time_dtype_as_group('object', 'count', 'direct')
-        268±40μs          233±2μs     0.87  groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'direct')
-         622±3μs         523±10μs     0.84  groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'transformation')
-         627±4μs         522±10μs     0.83  groupby.GroupByMethods.time_dtype_as_field('float', 'sem', 'direct')
-         685±4μs          540±8μs     0.79  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'transformation')
-         684±4μs          533±8μs     0.78  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')
-        678±20μs          516±1μs     0.76  groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'direct')
-        696±40μs          526±7μs     0.76  groupby.GroupByMethods.time_dtype_as_field('int', 'sem', 'transformation')
-         708±4μs          528±2μs     0.75  groupby.GroupByMethods.time_dtype_as_group('int', 'sem', 'transformation')
-         710±8μs        525±0.5μs     0.74  groupby.GroupByMethods.time_dtype_as_group('int', 'sem', 'direct')
-        348±60μs          256±3μs     0.74  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'direct')
-       306±0.8μs          194±8μs     0.63  groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'transformation')
-         305±2μs          192±3μs     0.63  groupby.GroupByMethods.time_dtype_as_field('float', 'std', 'direct')
-         366±7μs          200±4μs     0.55  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'direct')
-         362±4μs          197±1μs     0.54  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation')
-         362±5μs          191±1μs     0.53  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'direct')
-         367±8μs          191±1μs     0.52  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation')
-         387±6μs          197±2μs     0.51  groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'direct')
-         390±8μs        197±0.7μs     0.51  groupby.GroupByMethods.time_dtype_as_group('int', 'std', 'transformation')

Edit: I'm new to using asv; I ran it using the command

asv continuous -f 1.1 upstream/master HEAD -b ^groupby

but it pulled in the branch name "at_duplicate_labels". I don't think this is of any consequence; that branch is identical to master.

@@ -1731,7 +1731,11 @@ def _wrap_aggregated_output(
DataFrame
"""
indexed_output = {key.position: val for key, val in output.items()}
columns = Index(key.label for key in output)
if self.axis == 0:
Member:

I think this can just be self._obj_with_exclusions._get_axis_name(self.axis)

Member Author:

Awesome, that is much better. Thanks!

Member Author:

I spoke too soon. This method returns either "index" or "columns", not the name that we desire here.

@rhshadrach (Member Author) Jun 10, 2020:

Ah ha - found it! self._obj_with_exclusions._get_axis(1-self.axis).name

if self.axis == 0:
name = self._obj_with_exclusions.columns.name
else:
name = self._obj_with_exclusions.index.name
Contributor:

is this code useful anywhere else? e.g. this would be fine as a property of the class L1734-37

Member Author:

I did some regex searches; not that I can tell currently. It seems suitable for a property though, so it might still make sense to add? Also note that with @WillAyd's help I was able to get it down to a single line with no branching. Let me know if it should be added.

@rhshadrach (Member Author) commented:

@jreback Friendly ping. In #34372 (comment) you asked if the extraction of the name might be used anywhere else. Not that I can tell from searching, but it's possible I've missed something. I think that means we should hold off on making it a property? Also note that the code to extract it has now been simplified to:

name = self._obj_with_exclusions._get_axis(1 - self.axis).name
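As a quick illustration (my own example, not from the PR) of the distinction discussed above: _get_axis_name returns the axis keyword ("index" or "columns"), while _get_axis(...).name returns the name attached to that axis's Index object, which is what _wrap_aggregated_output needs. Both are private pandas accessors.

import pandas as pd

# DataFrame whose columns Index carries a name.
df = pd.DataFrame([[1, 2]], columns=pd.Index(["a", "b"], name="cols"))

print(df._get_axis_name(1))  # "columns" -- the axis keyword, not what we want
print(df._get_axis(1).name)  # "cols"    -- the Index name we actually want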

@jreback (Contributor) left a comment

looks good. few comments. are there any perf implications here?

if aggregate:
result_sz = ngroups
else:
result_sz = len(values)

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
if needs_2d:
Contributor:

can you use atleast_2d or just reshape here?

Contributor:

can you do this here

@rhshadrach (Member Author) Jun 18, 2020:

I think you're asking to replace

if needs_2d:
    result = np.zeros((result_sz, 1), dtype=cython_dtype)
else:
    result = np.zeros(result_sz, dtype=cython_dtype)

with

result = np.zeros(result_sz, dtype=cython_dtype)
if needs_2d:
    result = result.reshape((-1, 1))

I think the reshape version is less performant when needs_2d is True, no?

Member Author:

I don't believe atleast_2d is applicable; it will turn a 1d array into a single row (1xn), whereas we need a column (nx1).

Contributor:

result = np.zeros(result_sz, dtype=cython_dtype)
if needs_2d:
    result = result.reshape((-1, 1))

yes this would be an improvement
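For what it's worth, the reshape in the suggested version returns a view of the freshly allocated 1-D buffer rather than a copy, so the overhead compared with allocating a 2-D array directly should be negligible. A quick check (my own snippet, not from the PR):

import numpy as np

result = np.zeros(5, dtype=np.float64)
result_2d = result.reshape((-1, 1))

print(result_2d.base is result)  # True: reshape returned a view, no data copied
print(result_2d.shape)           # (5, 1)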

@jreback jreback added this to the 1.1 milestone Jun 14, 2020
@jreback (Contributor) commented Jun 14, 2020

Note this PR has the fix for #33955 that is also in #33988 - that PR should be merged first.

is this comment still relevant?

@rhshadrach (Member Author) commented:

@jreback - changes made, still waiting on checks. I assumed you wanted to keep the prefix "needs", so the flag is now "needs_at_least2d".

are there any perf implications here?

asv is here: #34372 (comment)
The OP also has timings using %timeit for frames of various shapes: #34372 (comment)

I think the last line of the timings in the OP is the only one of concern, where on a short and wide frame var is taking 17x longer. I also profiled this using line_profiler. From the OP:

I looked into the last row. 26.7% of the time is spent in just iterating over the columns, 12.8% in the cython call, and 35.7% in the call to _wrap_aggregated_output.
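For reference, a hedged sketch of how such line-level timings can be gathered with line_profiler's IPython magic, reusing g from the snippet in the OP (the exact invocation and the dotted path to _get_cythonized_result are my assumptions; they are not shown in the PR):

# Load the line_profiler extension and collect per-line timings for
# _get_cythonized_result while running a grouped var.
%load_ext line_profiler
%lprun -f pd.core.groupby.groupby.GroupBy._get_cythonized_result g.var()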

Note this PR has the fix for #33955 that is also in #33988 - that PR should be merged first.

is this comment still relevant?

Unclear - that PR has the fix that is also in here, but it's never hit by any tests there. This PR, on the other hand, hits the change via tests involving std. I just don't want to step on any toes; I only saw that PR as I was submitting this one.

@jreback (Contributor) commented Jun 14, 2020

ok @rhshadrach happy to merge this, pls open an issue for the perf issue (on short / wide); can address in a followup.

if aggregate:
result_sz = ngroups
else:
result_sz = len(values)

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
if needs_at_least2d:
Contributor:

can you use atleast_2d here (the numpy function)? This is what I meant.

Member Author:

Ah, I see. I don't believe so; we'd like to convert [1, 2] to [[1], [2]], but atleast_2d will output [[1, 2]] instead.
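A quick check (my own snippet) confirming the shape mismatch described above:

import numpy as np

print(np.atleast_2d([1, 2]).shape)              # (1, 2): a single row
print(np.asarray([1, 2]).reshape(-1, 1).shape)  # (2, 1): the column shape needed here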

@TomAugspurger (Contributor) commented:

@rhshadrach can you merge master? Hopefully the CI issues have all been resolved.

@rhshadrach (Member Author) commented:

@TomAugspurger thanks!
@jreback all green now.

@jreback (Contributor) left a comment

still have the remaining comment

if aggregate:
result_sz = ngroups
else:
result_sz = len(values)

result = np.zeros(result_sz, dtype=cython_dtype)
func = partial(base_func, result, labels)
if needs_2d:
Contributor:

can you do this here

@rhshadrach (Member Author) commented:

@jreback Thanks for the comments, changes made and checks pass.

@jreback jreback merged commit c9144ca into pandas-dev:master Jun 18, 2020
@jreback (Contributor) commented Jun 18, 2020

thanks @rhshadrach nice cleanup

Labels: Groupby, Refactor