Skip to content

BUG/FIX: groupby should raise on multi-valued filter #7871

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 30, 2014
Merged

BUG/FIX: groupby should raise on multi-valued filter #7871

merged 1 commit into from
Jul 30, 2014

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Jul 29, 2014

closes #7870

@cpcloud cpcloud modified the milestones: 0.15.1, 0.15.0 Jul 29, 2014
@cpcloud cpcloud self-assigned this Jul 29, 2014
val = res.ravel()[0]
if np.isscalar(res) or len(res) == 1:
try:
val = res.item()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

insert item abuse here!

Do we have to support this case? The docs do say it must return a bool. IMO it's pretty weird anyway... better the user thinks about whether there may be cases which are not single-valued rather than assume there isn't (i.e. raise and ask user to manually call item/all/any/whatever).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can set still get a Series of length 1 and it's a valid filter, so yes we have to support it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is totally hidden from the user, they wouldn't really be able to call item anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My argument against item is the same as for the general case, you could have some groupby filter written as above chugging away fine and then one day you get dupes in a group (return value > 1) and an exception. The docstring says returns True or False, raising immediately and basically asks the user to make a reduction decision all/any/item/whatever.

I guess I don't know how widespread this (mis?)use is, so we probably do have to support it...

Also, this is totally hidden from the user, they wouldn't really be able to call item anyway

Or maybe I'm missing something, I thought this part is basically rewriting (in the case f(x) is a boolean series):

g.filter(f)
# as
g.filter(lambda x: f(x).item())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hayd sorry andy you're right! let me see what breaks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback what do you think about this? it breaks functionality like:

df.groupby('c').filter(lambda x: x.mean() < 10)

instead you'd have to be more explicit and do

df.groupby('c').filter(lambda x: x.mean().item() < 10)

or

df.groupby('c').filter(lambda x: (x.mean() < 10).item())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use a try: except: with series.bool() (or something like this). I think to fix this, which deals with a 1 element boolean array (and raises otherwise); and would fail on ndarrays (or use code like that, e.g. accepting a 1-element ndarray/series of boolean dtype is ok (or a scalar).

I don't think allowing a non-reduction op is ok though (e.g. the user should be using any/all)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @hayd The main question is "Should we allow single element sequences in filter results?"

non reduction ops do raise, so there's no issue there.

Because DataFramGroupBy._selected_obj is a DataFrame (even when you group on a single column), when you do things like df.groupby('c').filter(lambda x: x.mean() < 10), x.mean() will be the mean over a single column DataFrame which is of course a Series.

there's nothing to "fix" re single element behavior per se, just deciding whether or not to break it to force users to be more explicit about things.

frankly, i think that the fact that a reduction on _selected_obj is a single element Series is an implementation detail that users shouldn't have to think about, even if it's less explicit on their side. this has the added benefit of being backwards compatible

therefore i vote 👎 on disallowing single element sequences

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpcloud ok, so you are +1 on allowing single element sequences, then. ok, I would agree with that. That's the bug here, right? I think if .item() its a len-1 sequence (and you just need to apply .item()) then its ok to take that as a result.

unless I am misunderstanding you. I think thats a nice convience

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm using squeeze because then i don't have to catch the ValueError, but yep that's what i'm talking about

@jreback
Copy link
Contributor

jreback commented Jul 30, 2014

then that all looks ok by me

@cpcloud
Copy link
Member Author

cpcloud commented Jul 30, 2014

@jreback for whatever reason this isn't going to pull requests on pandas travis, but passing on my build

@jreback
Copy link
Contributor

jreback commented Jul 30, 2014

yep...this ok

cpcloud added a commit that referenced this pull request Jul 30, 2014
BUG/FIX: groupby should raise on multi-valued filter
@cpcloud cpcloud merged commit c3d31c7 into pandas-dev:master Jul 30, 2014
@cpcloud cpcloud deleted the groupby-filter-fix branch July 30, 2014 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby filtering is missing some groups
3 participants