Sort()ing then selecting columns in a function apply()d to a grouped DataFrame #10671

TDaltonC · 2015-07-25T05:23:51Z

This code:

import pandas as pd
from numpy.random import randn


df = pd.DataFrame(randn(8, 4), columns=['A', 'B', 'C', 'D'])

def function(group):
    group.sort(columns = 'C', inplace=True)
    group = group[['A', 'B', 'C', 'D']]
    return group    

df2 = df.groupby(['A']).apply(function)

produces this error:

ValueError: cannot reindex from a duplicate axis

It's the combination of the sort and the column selection inside a grouped apply that causes the problem. I'm happy to give more detail on why I want to do this, but this is the simplest, most stripped-down code that produces the error.

jreback · 2015-07-25T13:52:23Z

Is there a reason you are not just doing

In [10]: df.sort('C')
Out[10]: 
          A         B         C         D
7 -0.065432  0.476895 -1.933456 -0.225273
5  0.364656  1.510392 -0.552039  0.927939
0  0.144173 -1.230840 -0.551998 -0.103711
6  1.046028  0.906485 -0.449859 -0.185228
1 -0.467742  0.965226  0.546713 -1.300566
3 -0.687709  1.468811  1.031457 -0.760951
2  0.221976 -1.374526  1.753068  0.026533
4 -0.997729 -0.996212  2.454797 -1.431332

what are you expecting this to do?

TDaltonC · 2015-07-25T14:54:57Z

Yes. In trying to make the smallest piece of code that would still produce the error, I made a script that doesn't actually do anything. What I'm actually trying to accomplish looks more like:

import pandas as pd
import numpy as np


df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : [0,0,3,3,2,2,1,1]})

def function(group):
    group.sort(columns = 'B', inplace=True)
    group['C'] = group['B'].diff()
    trimedGroup = group[['A', 'C']]
    return trimedGroup

df2 = df.groupby(['A']).apply(function)

And I expect to get back:

In[28]:df2
Out[28]: 
     A   C
0  foo NaN
1  bar NaN
2  foo   1
3  bar   1
4  foo   1
5  bar   1
6  foo   1
7  bar   1

I want to take the diff() of sort()ed groups.

jreback · 2015-07-25T15:04:00Z

I think .diff needs a slightly different definition,

as df.sort('B').groupby('A',as_index=False).B.diff() should work but raises a TypeError.

In [42]: pd.concat([df,df.sort('B').groupby('A').B.diff()],axis=1)
Out[42]: 
     A  B   0
0  foo  0 NaN
1  bar  0 NaN
2  foo  3   1
3  bar  3   1
4  foo  2   1
5  bar  2   1
6  foo  1   1
7  bar  1   1
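For readers on a later pandas, where sort_values replaces the DataFrame.sort call used in this thread, the same idea might look roughly like the sketch below. It relies on the grouped diff keeping the original row labels, so concat along axis=1 lines the values up with the unsorted frame. The column name B_diff is an illustrative choice, not something from the thread.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'bar'],
                   'B': [0, 0, 3, 3, 2, 2, 1, 1]})

# Sort the whole frame once, then take the per-group diff of B.
# sort_values is assumed here as the replacement for the older .sort().
diffs = df.sort_values('B').groupby('A')['B'].diff()

# diffs still carries the original row labels, so concat aligns on the index.
# 'B_diff' is a hypothetical column name chosen for this sketch.
result = pd.concat([df, diffs.rename('B_diff')], axis=1)
```

Looking rows up by label (for example result.loc[2, 'B_diff']) avoids depending on the row order that concat returns.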

jreback · 2015-07-25T15:04:41Z

You NEVER want to sort in a group; simply sort the entire frame beforehand. In fact, you always want to do as much work as possible in a vectorized fashion.
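As a minimal sketch of that advice applied to the example frame from earlier in the thread, assuming a pandas version that provides sort_values: sort once up front, compute the per-group diff without apply, then restore the original row order. The intermediate name ordered is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'bar'],
                   'B': [0, 0, 3, 3, 2, 2, 1, 1]})

# Sort once, outside of any group-wise function.
# 'ordered' is a hypothetical name for the sorted copy.
ordered = df.sort_values('B')

# Per-group diff computed by the groupby itself, with no apply().
ordered['C'] = ordered.groupby('A')['B'].diff()

# Restore the original row order and keep only the wanted columns.
df2 = ordered.sort_index()[['A', 'C']]
```

This should give NaN for the first row of each group in sorted order (labels 0 and 1) and 1.0 elsewhere, matching the shape of the result the reporter asked for.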

TomAugspurger · 2018-01-31T15:19:42Z

Duplicate of #19437

jreback added Bug Groupby labels Jul 25, 2015

jreback added this to the Next Major Release milestone Jul 25, 2015

TomAugspurger added the Duplicate Report label Jan 31, 2018

TomAugspurger modified the milestones: Next Major Release, No action Jan 31, 2018

TomAugspurger marked this as a duplicate of #19437 Jan 31, 2018

TomAugspurger closed this as completed Jan 31, 2018
