-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: inconsistent DataFrame.agg behavoir when passing as kwargs numeric_only=True #49534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: inconsistent DataFrame.agg behavoir when passing as kwargs numeric_only=True #49534
Conversation
@rhshadrach can you please have a look at the PR since v1.5.2 will be released. |
this doesn't have any tests |
I wasn't sure if they are covered by default. Will add them. Thanks for your response. |
To check in advance, is the appropriate location to add the test in the script below?:
|
Yes - that's correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
For the overall the behavior here, I didn't realize commenting in the issue that this approach would be something of a hidden feature. That is, numeric_only isn't in the arguments of the function, but we're essentially implementing it as if it were.
I think this would be (at least more) okay if we were only exercising numeric_only
when it came to string arguments. E.g. ['sum', lambda x, numeric_only: x.sum(), 'mean']` would do the computation on the numeric portion of the DataFrame for 'sum' and 'mean', but would attempt to pass every Series that makes up the frame to the lambda. This is admittedly not great - I think the only proper resolution is to use DataFrame methods in agg for list/dict like arguments, but that can't happen until 2.0.
cc @mroeschke for any thoughts.
If it were better documented that passing the string alias of the DataFrame methods accepts the associated keyword arguments then I supposed it makes sense to support this, especially if the user passes multiple strings aliases into agg. |
Thanks @mroeschke - @thlautenschlaeger can you add a description of whats happening here to the notes section of agg and apply. |
e35fa47
to
749a9eb
Compare
@rhshadrach Added the description to that section. Also added an example to the doc for agg. Hope everything is alright now. |
pandas/core/frame.py
Outdated
... columns=['A', 'B', 'C']) | ||
|
||
Works equivalently as above. Add argument `numeric_only=True` to avoid | ||
exceptions or warnings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The warning will go away soon (in 2.0) so I am not sure it's worth including here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you're just talking about the or warnings
bit, is that right @mroeschke? Agreed - I think the more general to aggregate only numeric columns.
would be sufficient - just like the apply docstring.
doc/source/whatsnew/v1.5.2.rst
Outdated
@@ -27,7 +27,7 @@ Fixed regressions | |||
Bug fixes | |||
~~~~~~~~~ | |||
- Bug in the Copy-on-Write implementation losing track of views in certain chained indexing cases (:issue:`48996`) | |||
- | |||
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` (:issue:`49352` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` (:issue:`49352` | |
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` with :class:`DataFrame` methods or their associated string alias as functions (:issue:`49352`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't we should be mentioning DataFrame methods here (#49352 (comment) and #49352 (comment)). +1 on the "string alias" bit though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach I'm assuming the other kwargs like skipna
and min_periods
have the same issue here in agg
, but those don't need to be addressed in 1.5.x since those don't affect the FutureWarning
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good - a few requests below. Also, when mentioning the numeric_only
in the docs, can you state it applies only to string arguments. That it works for e.g. DataFrame.mean
is part of the standard behavior of agg
(to pass any keywords through to the provided function), so I don't think that needs to be particularly called out.
@@ -284,6 +284,14 @@ def transform_str_or_callable(self, func) -> DataFrame | Series: | |||
except Exception: | |||
return func(obj, *args, **kwargs) | |||
|
|||
def _filter_numeric_only(self) -> list[Any]: | |||
if "numeric_only" in self.kwargs and self.kwargs["numeric_only"] is True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is True
here? I haven't checked all, but I believe DataFrame methods use "truthy" (i.e. and self.kwargs["numeric_only"]
):
df = pd.DataFrame({'a': [1, 1, 2], 'b': list('xyz')})
print(df.sum(numeric_only="string"))
# a 4
# dtype: int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe there is any need for the is True
here. This would give surprising results if the user passes e.g. np.bool_(True)
.
Also, can you use kwargs.get("numeric_only", False)
pandas/core/frame.py
Outdated
... columns=['A', 'B', 'C']) | ||
|
||
Works equivalently as above. Add argument `numeric_only=True` to avoid | ||
exceptions or warnings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you're just talking about the or warnings
bit, is that right @mroeschke? Agreed - I think the more general to aggregate only numeric columns.
would be sufficient - just like the apply docstring.
doc/source/whatsnew/v1.5.2.rst
Outdated
@@ -27,7 +27,7 @@ Fixed regressions | |||
Bug fixes | |||
~~~~~~~~~ | |||
- Bug in the Copy-on-Write implementation losing track of views in certain chained indexing cases (:issue:`48996`) | |||
- | |||
- Bug in inconsistent DataFrame.agg behavior when passing as kwargs ``numeric_only=True`` (:issue:`49352` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't we should be mentioning DataFrame methods here (#49352 (comment) and #49352 (comment)). +1 on the "string alias" bit though.
Correct. While I do think things like skipna should work, they are not part of the regression due to the numeric_only changes. |
Thanks for your reviews @mroeschke @rhshadrach |
32d50db
to
749a9eb
Compare
10ffdde
to
bf86a8d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to only filter the DataFrame due to numeric_only
on string arguments in the list-like. Otherwise, we will be creating an inconsistency between list and callable arguments. For example, with f
a callable user-defined function
df.agg(f, numeric_only=True)
will pass numeric_only
to the function f
whereas
df.agg([f], numeric_only=True)
will filter the DataFrame prior to calling f
.
@@ -284,6 +284,14 @@ def transform_str_or_callable(self, func) -> DataFrame | Series: | |||
except Exception: | |||
return func(obj, *args, **kwargs) | |||
|
|||
def _filter_numeric_only(self) -> list[Any]: | |||
if "numeric_only" in self.kwargs and self.kwargs["numeric_only"] is True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe there is any need for the is True
here. This would give surprising results if the user passes e.g. np.bool_(True)
.
Also, can you use kwargs.get("numeric_only", False)
... ['c', 3, 6]], | ||
... columns=['A', 'B', 'C']) | ||
|
||
Works equivalently as above. Add argument `numeric_only=True` to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "Works equivalently as above." mean here?
@@ -294,7 +299,7 @@ def agg_list_like(self) -> DataFrame | Series: | |||
""" | |||
from pandas.core.reshape.concat import concat | |||
|
|||
obj = self.obj | |||
obj = self._maybe_filter_numeric_only() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause different behaviors for list and non-list arguments that contain a callable. E.g.
df.agg(lambda x: x.sum(), numeric_only=True))
df.agg([lambda x: x.sum()], numeric_only=True)
Currently in this PR, I believe the first line will pass numeric_only
to the supplied UDF (and this fails as the UDF doesn't take a numeric_only argument). The second line will subset the DataFrame to only numeric columns and succeed. These should have the same behavior with regards to how the numeric_only argument is treated.
I suggest only using the DataFrame filtered to numeric_only when a given argument in the list is a string; otherwise to use the full DataFrame and pass the numeric_only kwarg to the supplied UDF.
@@ -858,6 +858,22 @@ def test_with_dictlike_columns_with_infer(): | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
|
|||
def test_with_dictlike_functions(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're no longer changing dictlike behavior, I think this can be removed. It should fail, but that is another (unrelated) issue.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@thlautenschlaeger - friendly ping. Are you interested in continuing? |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
DataFrame.agg
behavoir when passing as kwargsnumeric_only=True
#49352doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.