-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
CLN: Unify signatures in _libs.groupby #34372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A noble cause here...thanks for tackling! Reading through your OP I think all good thoughts, but I'd actually advise scaling this back a bit before trying to tackle more. Maybe just limit things now to definition / declaration changes leaving the implementations alone and we can move from there
f2bb7a9
to
98f1c0d
Compare
@WillAyd I got everything working until I tried to use it for var, which was the original motivation. For var, tests with some columns being nullable integers seemed not possible to make work. So instead of changing the cython implementation of quantile et al to use a 2d input/output, I instead made _get_cythonized_result more flexible to be able to work with cython implementations that are meant for 2d data. More details in the OP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you run the asv benchmarks for groupby? We use those to benchmark performance generally
@WillAyd Here are the significant asv changes using "-b ^groupby"
Edit: New to using asv - I ran it using the command
but it pulled in the branch name "at_duplicate_labels"? I don't think this is of any consequence, this branch is identical to master. |
pandas/core/groupby/generic.py
Outdated
@@ -1731,7 +1731,11 @@ def _wrap_aggregated_output( | |||
DataFrame | |||
""" | |||
indexed_output = {key.position: val for key, val in output.items()} | |||
columns = Index(key.label for key in output) | |||
if self.axis == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can just be self._obj_with_exclusions._get_axis_name(self.axis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, that is much better. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spoke to soon. This method returns either "index" or "columns", not the name that we desire here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ha - found it! self._obj_with_exclusions._get_axis(1-self.axis).name
pandas/core/groupby/generic.py
Outdated
if self.axis == 0: | ||
name = self._obj_with_exclusions.columns.name | ||
else: | ||
name = self._obj_with_exclusions.index.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this code useful anywhere else? e.g. this would be fine as a property of the class L1734-37
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some regex searches, not currently that I can tell. Seems suitable for a property though, so might still make sense to add? Also note that with @WillAyd's help I was able to get it down to a single line, no branching. Let me know if it should be added.
@jreback Friendly ping. In #34372 (comment) you asked if the extraction of name might be used anywhere else. Not that I can tell from searching, but it's possible I've missed something. I think that means hold off making it a property? Also note that the code to extract it has now been simplified to:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. few comments. are there any perf implications here?
if aggregate: | ||
result_sz = ngroups | ||
else: | ||
result_sz = len(values) | ||
|
||
result = np.zeros(result_sz, dtype=cython_dtype) | ||
func = partial(base_func, result, labels) | ||
if needs_2d: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use at_least2d or just reshape here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do this here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're asking to replace
if needs_2d:
result = np.zeros((result_sz, 1), dtype=cython_dtype)
else:
result = np.zeros(result_sz, dtype=cython_dtype)
with
result = np.zeros(result_sz, dtype=cython_dtype)
if needs_2d:
result = result.reshape((-1, 1))
I think the reshape version is less performant when needs_2d
is True
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe at_least2d is applicable, it will turn a 1d into a single row (1xn) whereas we need a column (nx1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result = np.zeros(result_sz, dtype=cython_dtype)
if needs_2d:
result = result.reshape((-1, 1))
yes this would be an improvement
@jreback - changes made, still waiting on checks. I assumed you wanted to keep the prefix "needs" to that the flag is now "needs_at_least2d".
asv is here: #34372 (comment) I think the last line of the timings in the OP is the only one of concern, where on a short and wide frame var is taking 17x longer. I also profiled this using line_profiler. From the OP:
Unclear - that PR has the fix that is also in here, but it's never hit by any tests there. This PR on the other hand has hits the change via tests involving std. Just don't want to step on any toes, I only saw that PR as I was submitting this one. |
ok @rhshadrach happy to merge this, pls open an issue for the perf issue (on short / wide) can address in a followup. |
pandas/core/groupby/groupby.py
Outdated
if aggregate: | ||
result_sz = ngroups | ||
else: | ||
result_sz = len(values) | ||
|
||
result = np.zeros(result_sz, dtype=cython_dtype) | ||
func = partial(base_func, result, labels) | ||
if needs_at_least2d: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use at_least2d here (the numpy function) this is what i meant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - I see. I don't believe so; we'd like to convert [1, 2] to [[1], [2]]. atleast_2d will do output [[1, 2]] instead.
@rhshadrach can you merge master? Hopefully the CI issues have all been resolved. |
@TomAugspurger thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still have the remaining comment
if aggregate: | ||
result_sz = ngroups | ||
else: | ||
result_sz = len(values) | ||
|
||
result = np.zeros(result_sz, dtype=cython_dtype) | ||
func = partial(base_func, result, labels) | ||
if needs_2d: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do this here
@jreback Thanks for the comments, changes made and checks pass. |
thanks @rhshadrach nice cleanup |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
xref #33630 (comment)
The main goal is to allow _get_cythonized_result to be used with the cython implementation of var. In order to do this, it was best to modify the signatures in _libs.groupby of
so that the order of the arguments matches other cython functions (e.g. those used in ops.BaseGrouper.aggregate or transform).
Note this PR has the fix for #33955 that is also in #33988 - that PR should be merged first.
Tests using %timeit on dataframes of various shapes (code for generating is at the bottom):
I looked into the last row. 26.7% of the time is spent in just iterating over the columns, 12.8% in the cython call, and 35.7% in the call to
_wrap_aggregated_output
.While working on this, it seemed best to implement
ddof != 1
in cython for var. The following code takes 26.5ms on master and 3.88ms in this PR.