Skip to content

REF: Separate groupby, rolling, and window agg/apply list/dict-like #53986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 9, 2023

Conversation

rhshadrach
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Should enable work on #53839.

@rhshadrach rhshadrach added Refactor Internal refactoring of code Apply Apply, Aggregate, Transform, Map labels Jul 3, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jul 3, 2023
@rhshadrach rhshadrach requested a review from topper-123 July 4, 2023 13:26
# For SeriesGroupBy this matches _obj_with_exclusions
selected_obj = obj._selected_obj
else:
assert False
selected_obj = obj._obj_with_exclusions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selected_obj is always the same as obj
here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct; I cleaned this up.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, but a few comments.

I've always wondered if it is possible to combine compute_(dict|list)_like by converting dictlike into listlikes, like we already do in Series.replace, but that can be for a future discussion.

else:
selected_obj = obj._selected_obj
selection = obj._selection
selected_obj = obj
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obj here will no longer be a groupby object, but mypy still thinks this can be groupby object. Can you fix the type situation also?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an assert here for mypy


class ResamplerWindowApply(Apply):
class ResamplerWindowApply(GroupByApply):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think Resampler is more similar to Groupby than to BaseWindows, because it inherits from BaseGroupBy. Can you explain why it shouldn't be treated in GroupByApply and rename this to ResamplerWindowApply?

Also, this means GroupByApply.obj can be Resampler | BaseWindow, which may not be precise/what we want?

Can we have a class BaseGroupbyApply(Apply) as a parent of both GroupByApply and ResamplerWindowApply and keep the common functionality there in order to have correct type hints?

Copy link
Member Author

@rhshadrach rhshadrach Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and rename this to ResamplerWindowApply?

It is named ResamplerWindowApply, so I'm not sure what you're asking.

The main reason to separate this off now is that GroupBy requires the context (to use as_index=True) whereas this does not.

In general, I'm thinking of removing the inheritance structure in this file, keeping a single class (something like ApplyOp or maybe UdfOp) that stores the state we need to pass around and making everything else methods on this class. Currently, the use of inheritance makes it hard to understand the code (both for humans and IDEs).

What do you think of moving toward this structure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is named ResamplerWindowApply, so I'm not sure what you're asking.

Sorry, I meant just WindowApply, but you answered my question (i.e. the difference is as_index=True), thanks.

These new methods are very similar to the methods on GroupbyApply, the difference being only the numba kwargs and the context manager. Can you look again if it's possible to avoid the repetition?

What do you think of moving toward this structure?

Fro a surface view if looks to me like it would be simplest to have the current Apply class only be used by Series and DataFrame and let the groupby/window cases be handled in a separate structure. That would simplify the type hints a lot, for example, and maybe the code also. Is that the plan you're describing, or do you have something else in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you look again if it's possible to avoid the repetition?

An alternative is to introduce a condition argument to temp_setattr so that optionally setting an attribute is a bit easier. I've taken this approach in the latest revision.

Is that the plan you're describing, or do you have something else in mind?

I was more thinking of still having groupby / window / resample in core.apply, just not having the inheritance structure. I attempted this today and it's a bit of a mixed bag. The code may be more straightforward to follow, but there is also more branching. I think your improvements that will take hold in 3.0 and some general cleanups will improve this code alone; I'm going to hold off on pursuing this more for now.

@@ -465,6 +452,7 @@ def agg_or_apply_dict_like(
assert op_name in ["agg", "apply"]

obj = self.obj
assert isinstance(obj, (ABCSeries, ABCDataFrame))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this method to NDFrameApply?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: if your idea is to make Apply.obj only be Series | DataFrame then there's no reason to ove because I guess you will want to merge Apply with NDFrameApply later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved this to NDFrameApply

@@ -343,6 +343,7 @@ def agg_or_apply_list_like(
self, op_name: Literal["agg", "apply"]
) -> DataFrame | Series:
obj = self.obj
assert isinstance(obj, (ABCSeries, ABCDataFrame))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this ethod to NDFrameApply?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


class ResamplerWindowApply(Apply):
class ResamplerWindowApply(GroupByApply):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is named ResamplerWindowApply, so I'm not sure what you're asking.

Sorry, I meant just WindowApply, but you answered my question (i.e. the difference is as_index=True), thanks.

These new methods are very similar to the methods on GroupbyApply, the difference being only the numba kwargs and the context manager. Can you look again if it's possible to avoid the repetition?

What do you think of moving toward this structure?

Fro a surface view if looks to me like it would be simplest to have the current Apply class only be used by Series and DataFrame and let the groupby/window cases be handled in a separate structure. That would simplify the type hints a lot, for example, and maybe the code also. Is that the plan you're describing, or do you have something else in mind?

@@ -1309,6 +1353,50 @@ def apply(self):
def transform(self):
raise NotImplementedError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is superfluous now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. We currently have this implemented on the base class Apply. It should not be called for e.g. GroupBy. Without this, if it accidentally gets called it would generate a perhaps confusing error; this is more straightforward to understand. Also - users should never encounter this; only developers.

That said, this is definitely a code smell. I think it may be good to move transform into core.apply.

@topper-123
Copy link
Contributor

Just two suggestion about explaining the return value types for compute_list_like and compute_list_like, because return value types can't be expressed in the type system and IMO it's not always super clear how we can know the return value types in result_data from the input data.

Else looks good to me.

@topper-123
Copy link
Contributor

There's a conflict that needs to be resolved, else looks good.

…rate_groupby_apply

# Conflicts:
#	pandas/core/apply.py
@topper-123 topper-123 merged commit c126eeb into pandas-dev:main Jul 9, 2023
@topper-123
Copy link
Contributor

Thanks, @rhshadrach.

@rhshadrach rhshadrach deleted the separate_groupby_apply branch July 9, 2023 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants