Skip to content

groupby filtering is missing some groups #7870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
phobson opened this issue Jul 29, 2014 · 13 comments · Fixed by #7871
Closed

groupby filtering is missing some groups #7870

phobson opened this issue Jul 29, 2014 · 13 comments · Fixed by #7871
Milestone

Comments

@phobson
Copy link

phobson commented Jul 29, 2014

I brought this up on the mailing list. @cpcloud modified my example into very concise sample showing expected and resulting output:

In [20]: df  = pd.DataFrame([
    ['best', 'a', 'x'],
    ['worst', 'b', 'y'],
    ['best', 'c', 'x'],
    ['best','d', 'y'],
    ['worst','d', 'y'],
    ['worst','d', 'y'],
    ['best','d', 'z'],
], columns=['a', 'b', 'c'])

In [21]: pd.concat(v[v.a == 'best'] for _, v in df.groupby('c'))
Out[21]:
      a  b  c
0  best  a  x
2  best  c  x
3  best  d  y # <--- missing from the next statement
6  best  d  z

In [22]: df.groupby('c').filter(lambda g: g.a == 'best')
Out[22]:
      a  b  c
0  best  a  x
2  best  c  x
6  best  d  z
@cpcloud cpcloud added this to the 0.15.0 milestone Jul 29, 2014
@dsm054
Copy link
Contributor

dsm054 commented Jul 29, 2014

I'm not sure this is what filter is meant for. I think it acts on the groups, not on elements of the groups.

If I'm right, then the bug is that this isn't tripping raise TypeError("the filter must return a boolean result"), not that the result differs from the concat result. If not, then the docs are confusing. (Maybe they are in any case.)

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2014

I'll repost how I thought it might've worked

In [20]: df
Out[20]:
       a  b  c
0   best  a  x
1  worst  b  y
2   best  c  x
3   best  d  y
4  worst  d  y
5  worst  d  y
6   best  d  z

In [21]: pd.concat(v[v.a == 'best'] for _, v in df.groupby('c'))
Out[21]:
      a  b  c
0  best  a  x
2  best  c  x
3  best  d  y
6  best  d  z

In [22]: df.groupby('c').filter(lambda g: g.a == 'best')
Out[22]:
      a  b  c
0  best  a  x
2  best  c  x
6  best  d  z

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2014

@dsm054 by groups do you mean ['x', 'y', 'z'] or the (name, frame) pairs?

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

I agree with @dsm054 here. This should return the SAME df, (as 'best' is in EACH group). The bug is that the complete groups are not being returned (because the filter function is tricking the detector).

@dsm054
Copy link
Contributor

dsm054 commented Jul 29, 2014

I mean the subframes that filter is passed. I think you're given a group and you're supposed to return a yea or nay bool based on the group, not an iterable of bools about your opinions on each row in the group.

The example in the function itself is (lambda x: x['A'].sum() + x['B'].sum() > 0), which is a reduction-to-bool operation.

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2014

well then this is very weird :)

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2014

pr coming

@dsm054
Copy link
Contributor

dsm054 commented Jul 29, 2014

Just to be clear, I'm not sure I agree with @jreback that the right thing to do is to return the same df because "best" is in each group. That's equivalent to saying that we're going to use the convention bool(array) == any(array), which is only going to work here.

@hayd
Copy link
Contributor

hayd commented Jul 29, 2014

Posted on ML before this. I think raise is the correct course of action here. Filter is for removing/including entire groups. Picking all or any (or first!) of the boolean Series would be strange IMO...

Function to apply to each subframe. Should return True or False.

@phobson
Copy link
Author

phobson commented Jul 29, 2014

Very insightful comments here. Much appreciated. I'm 👍 on raising an error, which I believe version 0.12 did.

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

I concur

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2014

Okay, this mainly had to do with the slow_path/fast_path business, where the slow path calls apply on the group (usually a frame) and the fast path just calls the function directly, which of course in the case of this:

In [8]: df
Out[8]:
   a       b
0  1  0.9712
1  1  0.7578
2  1  0.6300
3  2  1.7625
4  2 -0.0895

In [9]: df[df.a == 1].apply(lambda x: len(x) > 1)
Out[9]:
a    True
b    True
dtype: bool

In [10]: len(df[df.a == 1]) > 1
Out[10]: True

gives two different unrelated results and there were some hacks around trying to figure out which one was "correct".

i gutted all that code and now just call the fast path, we'll see what happens on travis but this doesn't break anything locally.

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2014

I went with raise for the same reason that np.array([False, True]).item() raises

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants