groupby filtering is missing some groups #7870

phobson · 2014-07-29T16:14:26Z

I brought this up on the mailing list. @cpcloud modified my example into very concise sample showing expected and resulting output:

In [20]: df  = pd.DataFrame([
    ['best', 'a', 'x'],
    ['worst', 'b', 'y'],
    ['best', 'c', 'x'],
    ['best','d', 'y'],
    ['worst','d', 'y'],
    ['worst','d', 'y'],
    ['best','d', 'z'],
], columns=['a', 'b', 'c'])

In [21]: pd.concat(v[v.a == 'best'] for _, v in df.groupby('c'))
Out[21]:
      a  b  c
0  best  a  x
2  best  c  x
3  best  d  y # <--- missing from the next statement
6  best  d  z

In [22]: df.groupby('c').filter(lambda g: g.a == 'best')
Out[22]:
      a  b  c
0  best  a  x
2  best  c  x
6  best  d  z

The text was updated successfully, but these errors were encountered:

dsm054 · 2014-07-29T16:34:16Z

I'm not sure this is what filter is meant for. I think it acts on the groups, not on elements of the groups.

If I'm right, then the bug is that this isn't tripping raise TypeError("the filter must return a boolean result"), not that the result differs from the concat result. If not, then the docs are confusing. (Maybe they are in any case.)

cpcloud · 2014-07-29T16:35:28Z

I'll repost how I thought it might've worked

In [20]: df
Out[20]:
       a  b  c
0   best  a  x
1  worst  b  y
2   best  c  x
3   best  d  y
4  worst  d  y
5  worst  d  y
6   best  d  z

In [21]: pd.concat(v[v.a == 'best'] for _, v in df.groupby('c'))
Out[21]:
      a  b  c
0  best  a  x
2  best  c  x
3  best  d  y
6  best  d  z

In [22]: df.groupby('c').filter(lambda g: g.a == 'best')
Out[22]:
      a  b  c
0  best  a  x
2  best  c  x
6  best  d  z

cpcloud · 2014-07-29T16:36:04Z

@dsm054 by groups do you mean ['x', 'y', 'z'] or the (name, frame) pairs?

jreback · 2014-07-29T16:39:58Z

I agree with @dsm054 here. This should return the SAME df, (as 'best' is in EACH group). The bug is that the complete groups are not being returned (because the filter function is tricking the detector).

dsm054 · 2014-07-29T16:41:49Z

I mean the subframes that filter is passed. I think you're given a group and you're supposed to return a yea or nay bool based on the group, not an iterable of bools about your opinions on each row in the group.

The example in the function itself is (lambda x: x['A'].sum() + x['B'].sum() > 0), which is a reduction-to-bool operation.

cpcloud · 2014-07-29T16:43:13Z

well then this is very weird :)

cpcloud · 2014-07-29T16:45:33Z

pr coming

dsm054 · 2014-07-29T16:54:28Z

Just to be clear, I'm not sure I agree with @jreback that the right thing to do is to return the same df because "best" is in each group. That's equivalent to saying that we're going to use the convention bool(array) == any(array), which is only going to work here.

hayd · 2014-07-29T17:00:51Z

Posted on ML before this. I think raise is the correct course of action here. Filter is for removing/including entire groups. Picking all or any (or first!) of the boolean Series would be strange IMO...

Function to apply to each subframe. Should return True or False.

phobson · 2014-07-29T17:03:56Z

Very insightful comments here. Much appreciated. I'm 👍 on raising an error, which I believe version 0.12 did.

jreback · 2014-07-29T17:28:57Z

I concur

cpcloud · 2014-07-29T17:37:26Z

Okay, this mainly had to do with the slow_path/fast_path business, where the slow path calls apply on the group (usually a frame) and the fast path just calls the function directly, which of course in the case of this:

In [8]: df
Out[8]:
   a       b
0  1  0.9712
1  1  0.7578
2  1  0.6300
3  2  1.7625
4  2 -0.0895

In [9]: df[df.a == 1].apply(lambda x: len(x) > 1)
Out[9]:
a    True
b    True
dtype: bool

In [10]: len(df[df.a == 1]) > 1
Out[10]: True

gives two different unrelated results and there were some hacks around trying to figure out which one was "correct".

i gutted all that code and now just call the fast path, we'll see what happens on travis but this doesn't break anything locally.

cpcloud · 2014-07-29T17:50:26Z

I went with raise for the same reason that np.array([False, True]).item() raises

cpcloud added Bug labels Jul 29, 2014

cpcloud added this to the 0.15.0 milestone Jul 29, 2014

cpcloud mentioned this issue Jul 29, 2014

BUG/FIX: groupby should raise on multi-valued filter #7871

Merged

cpcloud self-assigned this Jul 29, 2014

cpcloud closed this as completed in #7871 Jul 30, 2014

wesm unassigned cpcloud Oct 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

groupby filtering is missing some groups #7870

groupby filtering is missing some groups #7870

phobson commented Jul 29, 2014

dsm054 commented Jul 29, 2014

cpcloud commented Jul 29, 2014

cpcloud commented Jul 29, 2014

jreback commented Jul 29, 2014

dsm054 commented Jul 29, 2014

cpcloud commented Jul 29, 2014

cpcloud commented Jul 29, 2014

dsm054 commented Jul 29, 2014

hayd commented Jul 29, 2014

phobson commented Jul 29, 2014

jreback commented Jul 29, 2014

cpcloud commented Jul 29, 2014

cpcloud commented Jul 29, 2014

groupby filtering is missing some groups #7870

groupby filtering is missing some groups #7870

Comments

phobson commented Jul 29, 2014

dsm054 commented Jul 29, 2014

cpcloud commented Jul 29, 2014

cpcloud commented Jul 29, 2014

jreback commented Jul 29, 2014

dsm054 commented Jul 29, 2014

cpcloud commented Jul 29, 2014

cpcloud commented Jul 29, 2014

dsm054 commented Jul 29, 2014

hayd commented Jul 29, 2014

phobson commented Jul 29, 2014

jreback commented Jul 29, 2014

cpcloud commented Jul 29, 2014

cpcloud commented Jul 29, 2014