Sort()ing then selecting columns in a function apply()d to a grouped DataFrame #10671

Closed
TDaltonC opened this issue Jul 25, 2015 · 5 comments
Labels: Bug, Duplicate Report, Groupby

Comments

@TDaltonC

This code:

import pandas as pd
from numpy.random import randn


df = pd.DataFrame(randn(8, 4), columns=['A', 'B', 'C', 'D'])

def function(group):
    group.sort(columns = 'C', inplace=True)
    group = group[['A', 'B', 'C', 'D']]
    return group    

df2 = df.groupby(['A']).apply(function)

produces this error:

ValueError: cannot reindex from a duplicate axis

It's the combination of the sort and the column selection inside a grouped apply that causes the problem. I'm happy to give more detail on why I want to do this, but this is the simplest, most stripped-down code that produces the error.
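For context, this ValueError is the kind pandas raises when it is asked to reindex an axis whose labels are not unique. The sketch below triggers that class of error outside of groupby. Whether it is the exact code path hit by the report above is an assumption, and the wording of the message differs between pandas versions.

```python
import pandas as pd

# A Series whose index contains the label 0 twice.
s = pd.Series([10, 20], index=[0, 0])

try:
    # Reindexing to a different set of labels is ambiguous when the
    # existing labels are duplicated, so pandas refuses.
    s.reindex([0, 1])
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)  # exact text depends on the pandas version
```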

@jreback
Contributor

jreback commented Jul 25, 2015

Is there a reason you are not just doing

In [10]: df.sort('C')
Out[10]: 
          A         B         C         D
7 -0.065432  0.476895 -1.933456 -0.225273
5  0.364656  1.510392 -0.552039  0.927939
0  0.144173 -1.230840 -0.551998 -0.103711
6  1.046028  0.906485 -0.449859 -0.185228
1 -0.467742  0.965226  0.546713 -1.300566
3 -0.687709  1.468811  1.031457 -0.760951
2  0.221976 -1.374526  1.753068  0.026533
4 -0.997729 -0.996212  2.454797 -1.431332

What are you expecting this to do?

@TDaltonC
Author

Yes. In trying to make the smallest piece of code that would still produce the error, I made a script that doesn't actually do anything. What I'm actually trying to accomplish looks more like:

import pandas as pd
import numpy as np


df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'bar'],
                   'B' : [0,0,3,3,2,2,1,1]})

def function(group):
    group.sort(columns = 'B', inplace=True)
    group['C'] = group['B'].diff()
    trimedGroup = group[['A', 'C']]
    return trimedGroup

df2 = df.groupby(['A']).apply(function)

And I expect to get back:

In[28]:df2
Out[28]: 
     A   C
0  foo NaN
1  bar NaN
2  foo   1
3  bar   1
4  foo   1
5  bar   1
6  foo   1
7  bar   1

I want to take the diff() of sort()ed groups.

@jreback
Contributor

jreback commented Jul 25, 2015

I think .diff needs a slightly different definition, as df.sort('B').groupby('A',as_index=False).B.diff() should work but raises a TypeError.

In [42]: pd.concat([df,df.sort('B').groupby('A').B.diff()],axis=1)
Out[42]: 
     A  B   0
0  foo  0 NaN
1  bar  0 NaN
2  foo  3   1
3  bar  3   1
4  foo  2   1
5  bar  2   1
6  foo  1   1
7  bar  1   1

@jreback
Contributor

jreback commented Jul 25, 2015

You NEVER want to sort in a group; simply sort the entire frame beforehand. In fact, you always want to do as much work as possible in a vectorized fashion.
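The advice above can be sketched against more recent pandas releases, where DataFrame.sort no longer exists and sort_values is used instead. This is a minimal sketch, not the maintainers' fix: it reuses the example frame from earlier in the thread, and the names diffs and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'bar'],
                   'B': [0, 0, 3, 3, 2, 2, 1, 1]})

# Sort the whole frame once, then take the per-group diff in one
# vectorized call. The result keeps the original index labels.
diffs = df.sort_values('B').groupby('A')['B'].diff()

# assign() aligns on the index, so each diff lands on its original row.
result = df[['A']].assign(C=diffs)
print(result)
```

Because the diff is aligned back by index label, result keeps the original row order while C reflects differences taken in sorted order within each group.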

@jreback jreback added this to the Next Major Release milestone Jul 25, 2015
@TomAugspurger TomAugspurger added the Duplicate Report Duplicate issue or pull request label Jan 31, 2018
@TomAugspurger TomAugspurger modified the milestones: Next Major Release, No action Jan 31, 2018
@TomAugspurger
Contributor

Duplicate of #19437

@TomAugspurger TomAugspurger marked this as a duplicate of #19437 Jan 31, 2018