BUG/FIX: groupby should raise on multi-valued filter #7871

cpcloud · 2014-07-29T17:43:20Z

hayd · 2014-07-29T19:42:38Z

pandas/core/groupby.py

-                    val = res.ravel()[0]
+                if np.isscalar(res) or len(res) == 1:
+                    try:
+                        val = res.item()


insert item abuse here!

Do we have to support this case? The docs do say it must return a bool. IMO it's pretty weird anyway... better the user thinks about whether there may be cases which are not single-valued rather than assume there isn't (i.e. raise and ask user to manually call item/all/any/whatever).

You can set still get a Series of length 1 and it's a valid filter, so yes we have to support it

Also, this is totally hidden from the user, they wouldn't really be able to call item anyway

My argument against item is the same as for the general case, you could have some groupby filter written as above chugging away fine and then one day you get dupes in a group (return value > 1) and an exception. The docstring says returns True or False, raising immediately and basically asks the user to make a reduction decision all/any/item/whatever.

I guess I don't know how widespread this (mis?)use is, so we probably do have to support it...

Also, this is totally hidden from the user, they wouldn't really be able to call item anyway

Or maybe I'm missing something, I thought this part is basically rewriting (in the case f(x) is a boolean series):

g.filter(f) # as g.filter(lambda x: f(x).item())

@hayd sorry andy you're right! let me see what breaks

@jreback what do you think about this? it breaks functionality like:

df.groupby('c').filter(lambda x: x.mean() < 10)

instead you'd have to be more explicit and do

df.groupby('c').filter(lambda x: x.mean().item() < 10)

or

df.groupby('c').filter(lambda x: (x.mean() < 10).item())

you could use a try: except: with series.bool() (or something like this). I think to fix this, which deals with a 1 element boolean array (and raises otherwise); and would fail on ndarrays (or use code like that, e.g. accepting a 1-element ndarray/series of boolean dtype is ok (or a scalar).

I don't think allowing a non-reduction op is ok though (e.g. the user should be using any/all)

@jreback @hayd The main question is "Should we allow single element sequences in filter results?"

non reduction ops do raise, so there's no issue there.

Because DataFramGroupBy._selected_obj is a DataFrame (even when you group on a single column), when you do things like df.groupby('c').filter(lambda x: x.mean() < 10), x.mean() will be the mean over a single column DataFrame which is of course a Series.

there's nothing to "fix" re single element behavior per se, just deciding whether or not to break it to force users to be more explicit about things.

frankly, i think that the fact that a reduction on _selected_obj is a single element Series is an implementation detail that users shouldn't have to think about, even if it's less explicit on their side. this has the added benefit of being backwards compatible

therefore i vote 👎 on disallowing single element sequences

@cpcloud ok, so you are +1 on allowing single element sequences, then. ok, I would agree with that. That's the bug here, right? I think if .item() its a len-1 sequence (and you just need to apply .item()) then its ok to take that as a result.

unless I am misunderstanding you. I think thats a nice convience

i'm using squeeze because then i don't have to catch the ValueError, but yep that's what i'm talking about

jreback · 2014-07-30T15:09:41Z

then that all looks ok by me

cpcloud · 2014-07-30T22:55:02Z

@jreback for whatever reason this isn't going to pull requests on pandas travis, but passing on my build

jreback · 2014-07-30T22:57:30Z

yep...this ok

BUG/FIX: groupby should raise on multi-valued filter

cpcloud added Bug labels Jul 29, 2014

cpcloud modified the milestones: 0.15.1, 0.15.0 Jul 29, 2014

cpcloud self-assigned this Jul 29, 2014

hayd reviewed Jul 29, 2014
View reviewed changes

BUG/FIX: gut groupby.filter code, raise on multi-valued filter

f0fc4b5

cpcloud added a commit that referenced this pull request Jul 30, 2014

Merge pull request #7871 from cpcloud/groupby-filter-fix

c3d31c7

BUG/FIX: groupby should raise on multi-valued filter

cpcloud merged commit c3d31c7 into pandas-dev:master Jul 30, 2014

cpcloud deleted the groupby-filter-fix branch July 30, 2014 23:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG/FIX: groupby should raise on multi-valued filter #7871

BUG/FIX: groupby should raise on multi-valued filter #7871

cpcloud commented Jul 29, 2014

hayd Jul 29, 2014

cpcloud Jul 29, 2014

cpcloud Jul 29, 2014

hayd Jul 29, 2014

cpcloud Jul 29, 2014

cpcloud Jul 30, 2014

jreback Jul 30, 2014

cpcloud Jul 30, 2014

jreback Jul 30, 2014

cpcloud Jul 30, 2014

jreback commented Jul 30, 2014

cpcloud commented Jul 30, 2014

jreback commented Jul 30, 2014

BUG/FIX: groupby should raise on multi-valued filter #7871

BUG/FIX: groupby should raise on multi-valued filter #7871

Conversation

cpcloud commented Jul 29, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jul 30, 2014

cpcloud commented Jul 30, 2014

jreback commented Jul 30, 2014