Lots of unexpected behavior using resample after groupby #12923

BreitA · 2016-04-19T08:13:19Z

Code Sample, a copy-pastable example if possible

PANDAS 0.18 code :

import numpy as np
import pandas as pd

df=pd.DataFrame(np.ones((150,4)),columns=['A','B','C','D'],
                index=pd.date_range('2014-01-01',freq='D',periods=150))
df2=pd.DataFrame(np.zeros((150,4)),columns=['A','B','C','D'],
                 index=pd.date_range('2014-01-01',freq='D',periods=150))

df=pd.concat([df,df2])

print df.groupby('B').mean()
print df.groupby('B').resample('MS').mean().head()
print 'shape : ',df.groupby('B').resample('MS').mean().shape
print df.groupby('B').apply(lambda x:x.resample('MS').mean()).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('MS').mean()).shape
print df.groupby('B').mean()
print df.groupby('B').resample('H').mean().head()
print 'shape : ',df.groupby('B').resample('H').mean().shape
print df.groupby('B').apply(lambda x:x.resample('H').mean()).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('H').mean()).shape
print 'pd version', pd.__version__

PANDAS 0.17 equivalent code:

df=pd.DataFrame(np.ones((150,4)),columns=['A','B','C','D'],index=pd.date_range('2014-01-01',freq='D',periods=150))
df2=pd.DataFrame(np.zeros((150,4)),columns=['A','B','C','D'],index=pd.date_range('2014-01-01',freq='D',periods=150))

df=pd.concat([df,df2])

print df.groupby('B').mean()
print df.groupby('B').resample('MS').head()
print 'shape : ',df.groupby('B').resample('MS').shape
print df.groupby('B').apply(lambda x:x.resample('MS')).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('MS')).shape
print df.groupby('B').mean()
print df.groupby('B').resample('H').head()
print 'shape : ',df.groupby('B').resample('H').shape
print df.groupby('B').apply(lambda x:x.resample('H')).head()
print 'shape : ',df.groupby('B').apply(lambda x:x.resample('H')).shape
print 'pd version', pd.__version__

Expected Output

Pandas 0.18 code Output :

       A    C    D
B
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0
                  A    B    C    D
B
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0
shape : (10, 4)
                  A    B    C    D
B
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0
shape : (10, 4)
       A    C    D
B
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0
                  A    B    C    D
B
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-01-02  0.0  0.0  0.0  0.0
    2014-01-03  0.0  0.0  0.0  0.0
    2014-01-04  0.0  0.0  0.0  0.0
    2014-01-05  0.0  0.0  0.0  0.0
shape : (300, 4)
                           A    B    C    D
B
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN
shape : (7154, 4)
pd version 0.18.0

Pandas 0.17 equivalent code Output :

   A  C  D
B
0  0  0  0
1  1  1  1
              A  C  D
B
0 2014-01-01  0  0  0
  2014-02-01  0  0  0
  2014-03-01  0  0  0
  2014-04-01  0  0  0
  2014-05-01  0  0  0
shape : (10, 3)
              A  B  C  D
B
0 2014-01-01  0  0  0  0
  2014-02-01  0  0  0  0
  2014-03-01  0  0  0  0
  2014-04-01  0  0  0  0
  2014-05-01  0  0  0  0
shape : (10, 4)
   A  C  D
B
0  0  0  0
1  1  1  1
                         A    C    D
B
0 2014-01-01 00:00:00    0    0    0
  2014-01-01 01:00:00  NaN  NaN  NaN
  2014-01-01 02:00:00  NaN  NaN  NaN
  2014-01-01 03:00:00  NaN  NaN  NaN
  2014-01-01 04:00:00  NaN  NaN  NaN
shape : (7154, 3)
                         A    B    C    D
B
0 2014-01-01 00:00:00    0    0    0    0
  2014-01-01 01:00:00  NaN  NaN  NaN  NaN
  2014-01-01 02:00:00  NaN  NaN  NaN  NaN
  2014-01-01 03:00:00  NaN  NaN  NaN  NaN
  2014-01-01 04:00:00  NaN  NaN  NaN  NaN
shape : (7154, 4)
pd version 0.17.1

ISSUES :

In pandas 0.18.0, column B is not dropped when resample is applied after groupby. It should be dropped and moved into the index, as happens in the simple example that calls .mean() directly after groupby.
In pandas 0.18.0, the behavior is correct when downsampling (the 'MS' example) but wrong when upsampling (the 'H' example): the DataFrame is not upsampled and stays at freq='D'.

A workaround is to use df.groupby('B').apply(lambda x: x.resample('H').mean()), but it is inelegant to say the least, and it does not solve the issue of B not being dropped from the columns.
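
For readers who want the grouping column kept out of the result, one possible approach is to select the value columns explicitly on the grouped resampler before aggregating. The following is a minimal sketch, not something proposed in the thread itself: it assumes a pandas release that supports column selection on the object returned by groupby(...).resample(...), and the names value_cols and monthly are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the report: 150 daily rows of ones and 150 of zeros.
idx = pd.date_range('2014-01-01', freq='D', periods=150)
ones = pd.DataFrame(np.ones((150, 4)), columns=['A', 'B', 'C', 'D'], index=idx)
zeros = pd.DataFrame(np.zeros((150, 4)), columns=['A', 'B', 'C', 'D'], index=idx)
df = pd.concat([ones, zeros])

# Hypothetical helper name: the columns to aggregate, leaving out the key 'B'.
value_cols = ['A', 'C', 'D']

# Select the value columns on the grouped resampler, then aggregate.
# 'B' then appears only as an index level, not as a data column.
monthly = df.groupby('B').resample('MS')[value_cols].mean()

print(monthly.columns.tolist())
print(monthly.shape)
```

Selecting columns explicitly makes the result independent of whether a given pandas version keeps or drops the grouping column by default.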


TomAugspurger · 2016-04-19T12:08:27Z

Can you check out #12743, which closes a bunch of these issues, and confirm that it gives you the expected answers? I'm having a bit of trouble understanding your output since the formatting is off, but it looks correct on that branch.

Let's move the discussion there if there are any issues.

jreback · 2016-04-19T12:32:07Z

In [1]: df=pd.DataFrame(np.ones((150,4)),columns=['A','B','C','D'],
   ...: index=pd.date_range('2014-01-01',freq='D',periods=150))

In [2]: df2=pd.DataFrame(np.zeros((150,4)),columns=['A','B','C','D'],
   ...: index=pd.date_range('2014-01-01',freq='D',periods=150))

In [3]: df=pd.concat([df,df2])

In [4]: print df.groupby('B').mean()
       A    C    D
B                 
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0

In [5]: print df.groupby('B').resample('MS').mean().head()
                  A    B    C    D
B                                 
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0

In [6]: print 'shape : ',df.groupby('B').resample('MS').mean().shape
shape :  (10, 4)

In [7]: print df.groupby('B').apply(lambda x:x.resample('MS').mean()).head()
                  A    B    C    D
B                                 
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-02-01  0.0  0.0  0.0  0.0
    2014-03-01  0.0  0.0  0.0  0.0
    2014-04-01  0.0  0.0  0.0  0.0
    2014-05-01  0.0  0.0  0.0  0.0

In [8]: print 'shape : ',df.groupby('B').apply(lambda x:x.resample('MS').mean()).shape
shape :  (10, 4)
In [9]: print df.groupby('B').mean()
       A    C    D
B                 
0.0  0.0  0.0  0.0
1.0  1.0  1.0  1.0

In [10]: print df.groupby('B').resample('H').mean().head()
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN

In [11]: print 'shape : ',df.groupby('B').resample('H').mean().shape
shape :  (7154, 4)

In [12]: print df.groupby('B').apply(lambda x:x.resample('H').mean()).head()
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN

In [13]: print 'shape : ',df.groupby('B').apply(lambda x:x.resample('H').mean()).shape
shape :  (7154, 4)

In [14]: print 'pd version', pd.__version__
pd version 0.18.0+129.g928a8b4

So these all look correct to me. As @TomAugspurger says, #12743 will resolve any remaining issues here. In essence, df.groupby(...).resample(...) is doing df.groupby(...).apply(lambda x: x.resample(...)) under the hood.
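
The equivalence described above can be checked with a small sketch. It is only an illustration under stated assumptions: the value columns are selected explicitly (the hypothetical value_cols) so that the grouping column does not enter either code path, and the comparison names same_shape, same_values and same_index are introduced here and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Same data as in the report.
idx = pd.date_range('2014-01-01', freq='D', periods=150)
ones = pd.DataFrame(np.ones((150, 4)), columns=['A', 'B', 'C', 'D'], index=idx)
zeros = pd.DataFrame(np.zeros((150, 4)), columns=['A', 'B', 'C', 'D'], index=idx)
df = pd.concat([ones, zeros])

value_cols = ['A', 'C', 'D']  # hypothetical name: columns to aggregate

# Path 1: resample directly on the groupby object.
direct = df.groupby('B').resample('MS')[value_cols].mean()

# Path 2: resample each group inside apply.
via_apply = df.groupby('B')[value_cols].apply(lambda g: g.resample('MS').mean())

# Compare shape, values and index labels of the two results.
same_shape = direct.shape == via_apply.shape
same_values = np.array_equal(direct.to_numpy(), via_apply.to_numpy())
same_index = list(direct.index) == list(via_apply.index)
print(same_shape, same_values, same_index)
```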

BreitA · 2016-04-19T13:08:27Z

So here is the difference in behavior between us:

YOU

In [10]: print df.groupby('B').resample('H').mean().head()
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  NaN  NaN  NaN  NaN
    2014-01-01 02:00:00  NaN  NaN  NaN  NaN
    2014-01-01 03:00:00  NaN  NaN  NaN  NaN
    2014-01-01 04:00:00  NaN  NaN  NaN  NaN

ME

In [10]: print df.groupby('B').resample('H').mean().head()
                  A    B    C    D
B                                 
0.0 2014-01-01  0.0  0.0  0.0  0.0
    2014-01-02  0.0  0.0  0.0  0.0
    2014-01-03  0.0  0.0  0.0  0.0
    2014-01-04  0.0  0.0  0.0  0.0
    2014-01-05  0.0  0.0  0.0  0.0
shape :  (225, 4)

Will this be fixed in the next build?

Also, is it normal that B isn't dropped anymore? It seems weird that it is dropped for simple functions such as .mean() but not for resampling.

jreback · 2016-04-19T13:14:56Z

@BreitA you are probably looking to do this:

In [6]: df.groupby('B').resample('H').ffill()
Out[6]: 
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  0.0  0.0  0.0  0.0
    2014-01-01 02:00:00  0.0  0.0  0.0  0.0
    2014-01-01 03:00:00  0.0  0.0  0.0  0.0
    2014-01-01 04:00:00  0.0  0.0  0.0  0.0
    2014-01-01 05:00:00  0.0  0.0  0.0  0.0
    2014-01-01 06:00:00  0.0  0.0  0.0  0.0
    2014-01-01 07:00:00  0.0  0.0  0.0  0.0
    2014-01-01 08:00:00  0.0  0.0  0.0  0.0

.mean() is a downsampling operation and doesn't make any sense here (it works, but is probably not what you want).

The implementation is exactly this. Yes, you are doing an operation on the entire frame, so it makes sense to keep all columns.

In [9]: df.groupby('B').apply(lambda x: x.resample('H').ffill())
Out[9]: 
                           A    B    C    D
B                                          
0.0 2014-01-01 00:00:00  0.0  0.0  0.0  0.0
    2014-01-01 01:00:00  0.0  0.0  0.0  0.0
    2014-01-01 02:00:00  0.0  0.0  0.0  0.0
    2014-01-01 03:00:00  0.0  0.0  0.0  0.0
    2014-01-01 04:00:00  0.0  0.0  0.0  0.0
    2014-01-01 05:00:00  0.0  0.0  0.0  0.0
    2014-01-01 06:00:00  0.0  0.0  0.0  0.0
    2014-01-01 07:00:00  0.0  0.0  0.0  0.0
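
The difference between aggregating and filling when upsampling can be seen without groupby at all. The sketch below is a minimal illustration on a two-row daily frame; the names daily, hourly, up_mean and up_ffill are hypothetical, and an offset object is passed to resample instead of the 'H' string only to stay independent of how a given pandas version spells the hourly alias.

```python
import pandas as pd

# Two daily observations, one per midnight.
idx = pd.date_range('2014-01-01', freq='D', periods=2)
daily = pd.DataFrame({'A': [1.0, 2.0]}, index=idx)

# Hourly target frequency, given as an offset object.
hourly = pd.offsets.Hour()

# Aggregating while upsampling: hourly bins that contain no
# observation have nothing to average and come out as NaN.
up_mean = daily.resample(hourly).mean()

# Forward filling while upsampling: each new hourly row takes
# the most recent daily value.
up_ffill = daily.resample(hourly).ffill()

print(len(up_mean), int(up_mean['A'].isna().sum()))
print(int(up_ffill['A'].isna().sum()))
```

Only the rows that coincide with an original timestamp carry a value after .mean(), which is why the NaN rows appear in the hourly outputs earlier in the thread.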

BreitA · 2016-04-19T13:53:28Z

Yeah, I know the example is kind of silly (using .mean() for upsampling). The point was that the behavior was not the same when using apply(lambda x: x.resample('H').mean()) instead of .resample('H').mean().

TomAugspurger closed this as completed Apr 19, 2016

TomAugspurger added the Groupby, Duplicate Report, and Resample labels Apr 19, 2016
