-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG/FIX: groupby should raise on multi-valued filter #7871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
val = res.ravel()[0] | ||
if np.isscalar(res) or len(res) == 1: | ||
try: | ||
val = res.item() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
insert item
abuse here!
Do we have to support this case? The docs do say it must return a bool. IMO it's pretty weird anyway... better the user thinks about whether there may be cases which are not single-valued rather than assume there isn't (i.e. raise and ask user to manually call item/all/any/whatever).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can set still get a Series
of length 1 and it's a valid filter, so yes we have to support it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this is totally hidden from the user, they wouldn't really be able to call item
anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My argument against item is the same as for the general case, you could have some groupby filter written as above chugging away fine and then one day you get dupes in a group (return value > 1) and an exception. The docstring says returns True or False, raising immediately and basically asks the user to make a reduction decision all/any/item/whatever.
I guess I don't know how widespread this (mis?)use is, so we probably do have to support it...
Also, this is totally hidden from the user, they wouldn't really be able to call item anyway
Or maybe I'm missing something, I thought this part is basically rewriting (in the case f(x) is a boolean series):
g.filter(f)
# as
g.filter(lambda x: f(x).item())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hayd sorry andy you're right! let me see what breaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback what do you think about this? it breaks functionality like:
df.groupby('c').filter(lambda x: x.mean() < 10)
instead you'd have to be more explicit and do
df.groupby('c').filter(lambda x: x.mean().item() < 10)
or
df.groupby('c').filter(lambda x: (x.mean() < 10).item())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could use a try: except: with series.bool()
(or something like this). I think to fix this, which deals with a 1 element boolean array (and raises otherwise); and would fail on ndarrays (or use code like that, e.g. accepting a 1-element ndarray/series of boolean dtype is ok (or a scalar).
I don't think allowing a non-reduction op is ok though (e.g. the user should be using any/all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback @hayd The main question is "Should we allow single element sequences in filter results?"
non reduction ops do raise, so there's no issue there.
Because DataFramGroupBy._selected_obj
is a DataFrame
(even when you group on a single column), when you do things like df.groupby('c').filter(lambda x: x.mean() < 10)
, x.mean()
will be the mean over a single column DataFrame
which is of course a Series
.
there's nothing to "fix" re single element behavior per se, just deciding whether or not to break it to force users to be more explicit about things.
frankly, i think that the fact that a reduction on _selected_obj
is a single element Series
is an implementation detail that users shouldn't have to think about, even if it's less explicit on their side. this has the added benefit of being backwards compatible
therefore i vote 👎 on disallowing single element sequences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpcloud ok, so you are +1 on allowing single element sequences, then. ok, I would agree with that. That's the bug here, right? I think if .item()
its a len-1 sequence (and you just need to apply .item()
) then its ok to take that as a result.
unless I am misunderstanding you. I think thats a nice convience
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm using squeeze
because then i don't have to catch the ValueError
, but yep that's what i'm talking about
then that all looks ok by me |
@jreback for whatever reason this isn't going to pull requests on pandas travis, but passing on my build |
yep...this ok |
BUG/FIX: groupby should raise on multi-valued filter
closes #7870