Skip to content

BUG: groupby.transform passing Series to transformation #13543

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bashtage opened this issue Jul 1, 2016 · 11 comments
Closed

BUG: groupby.transform passing Series to transformation #13543

bashtage opened this issue Jul 1, 2016 · 11 comments

Comments

@bashtage
Copy link
Contributor

bashtage commented Jul 1, 2016

Code Sample, a copy-pastable example if possible

np.random.seed(12345)
panel = pd.Panel(np.random.randn(125,200,10))
panel.iloc[:,:,0] = np.round(panel.iloc[:,:,0])
panel.iloc[:,:,1] = np.round(panel.iloc[:,:,1])
x = panel
cols = [0,1]
_x = x.swapaxes(0, 2).to_frame()
numeric_cols = []
for df_col in _x:
    if df_col not in cols and pd.core.common.is_numeric_dtype(_x[df_col].dtype):
        numeric_cols.append(df_col)

calls = 0


def _print_type(df):
    print(type(df))
    return df

index = _x.index
_x.index = pd.RangeIndex(0, _x.shape[0])
groups = _x.groupby(cols)
out = groups.transform(_print_type)

Comment

The slow_path operated series by series rather than on a group DataFrame. Once the slow path is accepted, it operated on the group DataFrames. I have 59 groups in my example with 8 columns, and so it runs 8 times with Series from the first group DataFrame and then, once happy, runs 58 more times on the DataFrames.

The description says that it onlly operated on the group DataFrames (which is the correct behavior IMO)

Expected Output

Many lines of

<class 'pandas.core.frame.DataFrame'>

Actual Output

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
...
<class 'pandas.core.frame.DataFrame'>

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.1
statsmodels: 0.8.0.dev0+f060948
xarray: 0.7.2
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1
@bashtage bashtage changed the title BUG: groupby.transform doesn't operate as described BUG: groupby.transform passing Series to transformation Jul 1, 2016
@jreback
Copy link
Contributor

jreback commented Jul 1, 2016

this is correct, it tries a series reduction first. not sure why this implementation detail actually matters.

@bashtage
Copy link
Contributor Author

bashtage commented Jul 1, 2016

Because when the transformation makes use of DataFrame only attributes
groupby.transform produces an error.

On Fri, Jul 1, 2016, 2:04 PM Jeff Reback [email protected] wrote:

Closed #13543 #13543.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#13543 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AFU5RcV9LjNXtj31MzXCRyxpDf-scxJdks5qRRA2gaJpZM4JDH31
.

@jreback
Copy link
Contributor

jreback commented Jul 1, 2016

this doesn't matter as the exception is caught and next things are tried. you would have to show a compelling example.

@bashtage
Copy link
Contributor Author

bashtage commented Jul 1, 2016

I will when I get back to the office. IMO there is a really simple fix -
only run slow path if fast path fails, rather than requiring slow path to
work.

On Fri, Jul 1, 2016, 2:22 PM Jeff Reback [email protected] wrote:

this doesn't matter as the exception is caught and next things are tried.
you would have to show a compelling example.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#13543 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AFU5RWmVFVjGR1OlNLjgI3G20y08S2Kyks5qRRSWgaJpZM4JDH31
.

@jreback
Copy link
Contributor

jreback commented Jul 1, 2016

could be

@bashtage
Copy link
Contributor Author

bashtage commented Jul 1, 2016

Here is my real example -- I am trying to do a selective groupby-demeaning where onlly a subset of columns are demeaned, but the retruend DataFrame has all columns.

# Setup
np.random.seed(12345)
panel = pd.Panel(np.random.randn(125,200,10))
panel.iloc[:,:,0] = np.round(panel.iloc[:,:,0])
panel.iloc[:,:,1] = np.round(panel.iloc[:,:,1])
x = panel
cols = [0,1]

demean_cols = []
for df_col in _x:
    if df_col not in cols and pd.core.common.is_numeric_dtype(_x[df_col].dtype):
        demean_cols.append(df_col)

# Function start
_x = x.swapaxes(0, 2).to_frame()
def _safe_demean(df):
    print(type(df))
    df[demean_cols] -= df[demean_cols].mean(0)
    return df

index = _x.index
_x.index = pd.RangeIndex(0, _x.shape[0])
groups = _x.groupby(cols)
out = groups.transform(_safe_demean)
out.index = index

The function fulfils the requrements in the docstring. It fails when it gets a Series since demean_cols is not in the series index. I can, of course, workaroudn with using an isinstance and handling paths, but this requires knowledge of the implementation. So at least this is a doc-bug, but I think it is an actual bug in the sense that the described (and reasonably correct) behavior of applying a group-wise transform to a DataFrame does not work when the callable returns a DataFrame. Another "real world" bug can be produced by using a group-wise grand demeaning, which would be a function like

def grand_demean(df):
    return df - df.mean().mean()

in this case, I don't think it is possible to ever get the correct answer in the current implementation since it isn't possible to compute the grand mean without the entire group DF. Even a simpler method would produce incorrect numbers:

def grand_demean_numpy(df):
    return df - np.mean(df) #  np-mean is all elements

The errors are

C:\anaconda\lib\site-packages\pandas\core\series.py in _set_labels(self, key, value)
    806         if mask.any():
--> 807             raise ValueError('%s not contained in the index' % str(key[mask]))
    808         self._set_values(indexer, value)

ValueError: ('[2 3 4 5 6 7 8 9] not contained in the index', 'occurred at index 2')

During handling of the above exception, another exception occurred:

ValueErrorTraceback (most recent call last)
<ipython-input-143-07044111bf11> in <module>()
     22 _x.index = pd.RangeIndex(0, _x.shape[0])
     23 groups = _x.groupby(cols)
---> 24 out = groups.transform(_safe_demean)
     25 out.index = index
     26 

C:\anaconda\lib\site-packages\pandas\core\groupby.py in transform(self, func, *args, **kwargs)
   3455                 result = getattr(self, func)(*args, **kwargs)
   3456         else:
-> 3457             return self._transform_general(func, *args, **kwargs)
   3458 
   3459         # a reduction transform

C:\anaconda\lib\site-packages\pandas\core\groupby.py in _transform_general(self, func, *args, **kwargs)
   3403                 except ValueError:
   3404                     msg = 'transform must return a scalar value for each group'
-> 3405                     raise ValueError(msg)
   3406             else:
   3407                 res = path(group)

ValueError: transform must return a scalar value for each group

@bashtage
Copy link
Contributor Author

bashtage commented Jul 1, 2016

FWIW the worksournd function to implement the selective group-wise demeaning looks like

def _safe_demean(df):
    if isinstance(df, pd.Series):
        if df.name in demean_cols:
            return df - df.mean()
        else:
            return df
    df = df.copy()
    df[demean_cols] -= df[demean_cols].mean(0)
    return df

@jreback
Copy link
Contributor

jreback commented Jul 1, 2016

this is VERY inefficient and not idiomatic. It might technically fulfull the doc-string, but that should simply be fixed.

you are MUCH better off doing something like this:

df - df.groupby(....).transform('mean')

@jreback
Copy link
Contributor

jreback commented Jul 1, 2016

its realated to this: #13281

groupby/transform is an immutable operation though its not technically marked as such. modification in the function should be banned (at least in the doc-string). If not actually banned (which is quite tricky to detect).

@bashtage
Copy link
Contributor Author

bashtage commented Jul 1, 2016

I agree that the df should be considered immutable - my first example is poor (and in fact has terrible performance, a .copy() fixes that though).

I suppose a better docstring would highlight that

  • transform is only safe for within-variable transformations. In particular the transform function must produce the same results if applied to the target df using .apply or directly. This will make it clear that operations like a groupby-grand-demeaning aren't possible.
  • the function will be executed column-by-column.

@jreback
Copy link
Contributor

jreback commented Jul 1, 2016

ok @bashtage if you want to do a better do-string (and maybe just turn off allowing mutation for transform), changing for apply would be too much ATM. Then I think that would be great.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 Jul 21, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Sep 1, 2016
bashtage added a commit to bashtage/pandas that referenced this issue Oct 10, 2016
Add requirements for user function in groupby transform

closes pandas-dev#13543
[skip ci]
bashtage added a commit to bashtage/pandas that referenced this issue Oct 24, 2016
Add requirements for user function in groupby transform

closes pandas-dev#13543
[skip ci]
bashtage added a commit to bashtage/pandas that referenced this issue Dec 31, 2016
Add requirements for user function in groupby transform

closes pandas-dev#13543
[skip ci]
bashtage added a commit to bashtage/pandas that referenced this issue Dec 31, 2016
Add requirements for user function in groupby transform

closes pandas-dev#13543
[skip ci]
bashtage added a commit to bashtage/pandas that referenced this issue Feb 17, 2017
Add requirements for user function in groupby transform

closes pandas-dev#13543
[skip ci]
@jreback jreback modified the milestones: 0.20.0, Next Major Release Feb 23, 2017
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
mattip pushed a commit to mattip/pandas that referenced this issue Apr 3, 2017
closes pandas-dev#13543

Author: Kevin Sheppard <[email protected]>

Closes pandas-dev#14388 from bashtage/groupby-transform-doc-string and squashes the following commits:

ef1ff13 [Kevin Sheppard] DOC: Add details to DataFrame groupby transform
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants