BUG: groupby.transform passing Series to transformation #13543
Comments
this is correct, it tries a series reduction first. not sure why this implementation detail actually matters. |
Because the transformation may make use of DataFrame-only attributes.
|
this doesn't matter as the exception is caught and next things are tried. you would have to show a compelling example. |
I will when I get back to the office. IMO there is a really simple fix -
|
could be |
Here is my real example -- I am trying to do a selective groupby-demeaning where only a subset of columns are demeaned, but the returned DataFrame has all columns.

import numpy as np
import pandas as pd

# Setup
np.random.seed(12345)
panel = pd.Panel(np.random.randn(125,200,10))
panel.iloc[:,:,0] = np.round(panel.iloc[:,:,0])
panel.iloc[:,:,1] = np.round(panel.iloc[:,:,1])
x = panel
cols = [0,1]

# Function start
_x = x.swapaxes(0, 2).to_frame()  # must exist before demean_cols is built

# Demean every numeric column that is not a grouping column
demean_cols = []
for df_col in _x:
    if df_col not in cols and pd.core.common.is_numeric_dtype(_x[df_col].dtype):
        demean_cols.append(df_col)

def _safe_demean(df):
    print(type(df))
    df[demean_cols] -= df[demean_cols].mean(0)
    return df

index = _x.index
_x.index = pd.RangeIndex(0, _x.shape[0])
groups = _x.groupby(cols)
out = groups.transform(_safe_demean)
out.index = index

The function fulfils the requirements in the docstring. It fails when it gets a Series.

def grand_demean(df):
    return df - df.mean().mean()

In this case, I don't think it is possible to ever get the correct answer in the current implementation, since it isn't possible to compute the grand mean without the entire group DataFrame. Even a simpler method would produce incorrect numbers:

def grand_demean_numpy(df):
    return df - np.mean(df)  # np.mean is over all elements

The errors are

C:\anaconda\lib\site-packages\pandas\core\series.py in _set_labels(self, key, value)
806 if mask.any():
--> 807 raise ValueError('%s not contained in the index' % str(key[mask]))
808 self._set_values(indexer, value)
ValueError: ('[2 3 4 5 6 7 8 9] not contained in the index', 'occurred at index 2')
During handling of the above exception, another exception occurred:
ValueError                                Traceback (most recent call last)
<ipython-input-143-07044111bf11> in <module>()
22 _x.index = pd.RangeIndex(0, _x.shape[0])
23 groups = _x.groupby(cols)
---> 24 out = groups.transform(_safe_demean)
25 out.index = index
26
C:\anaconda\lib\site-packages\pandas\core\groupby.py in transform(self, func, *args, **kwargs)
3455 result = getattr(self, func)(*args, **kwargs)
3456 else:
-> 3457 return self._transform_general(func, *args, **kwargs)
3458
3459 # a reduction transform
C:\anaconda\lib\site-packages\pandas\core\groupby.py in _transform_general(self, func, *args, **kwargs)
3403 except ValueError:
3404 msg = 'transform must return a scalar value for each group'
-> 3405 raise ValueError(msg)
3406 else:
3407 res = path(group)
ValueError: transform must return a scalar value for each group |
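To make the grand_demean point concrete, here is a minimal sketch that skips groupby entirely and compares one call on a whole group with column-by-column calls. The frame and its column names (a, b) are made up for illustration.

```python
import pandas as pd

def grand_demean(df):
    # Same function as in the comment: subtract the mean of the column means.
    return df - df.mean().mean()

# Hypothetical single group with two columns on very different scales.
group = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Called once with the whole group: the grand mean is (2 + 20) / 2 = 11.
whole = grand_demean(group)

# Called column by column: each Series only sees its own mean (2 or 20).
by_column = group.apply(grand_demean)

print(whole)
print(by_column)
```

The two results differ, so a function that needs the whole group cannot give the intended answer when it is handed one column at a time.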
FWIW the workaround function to implement the selective group-wise demeaning looks like

def _safe_demean(df):
    if isinstance(df, pd.Series):
        if df.name in demean_cols:
            return df - df.mean()
        else:
            return df
    df = df.copy()
    df[demean_cols] -= df[demean_cols].mean(0)
    return df
|
this is VERY inefficient and not idiomatic. It might technically fulfill the doc-string, but that should simply be fixed. You are MUCH better off doing something like this:
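A minimal sketch of a vectorized version of the selective demeaning, assuming a small made-up frame with hypothetical column names (k1, k2 as grouping keys, u, v as the columns to demean). This is not necessarily the snippet originally posted here. It only illustrates computing the group means with the built-in "mean" transform and subtracting them, rather than mutating each group inside a user function.

```python
import pandas as pd

# Hypothetical data: k1/k2 are grouping keys, u/v are the columns to demean.
df = pd.DataFrame({
    "k1": [0, 0, 1, 1, 1],
    "k2": [0, 0, 0, 1, 1],
    "u": [1.0, 3.0, 5.0, 2.0, 6.0],
    "v": [10.0, 30.0, 50.0, 20.0, 60.0],
})
group_cols = ["k1", "k2"]
demean_cols = ["u", "v"]

# Group means broadcast back to the original row order, selected columns only.
group_means = df.groupby(group_cols)[demean_cols].transform("mean")

# Work on a copy so the input frame is never mutated.
out = df.copy()
out[demean_cols] = df[demean_cols] - group_means

print(out)
```

The grouping columns pass through unchanged, and the original frame is left untouched.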
|
It's related to this: #13281. groupby/transform is an immutable operation, though it's not technically marked as such. Modification in the function should be banned (at least in the doc-string), if not actually banned (which is quite tricky to detect). |
I agree that the df should be considered immutable - my first example is poor (and in fact has terrible performance). I suppose a better docstring would highlight that
|
ok @bashtage, if you want to do a better doc-string (and maybe just turn off allowing mutation for transform; changing it for apply would be too much ATM), then I think that would be great. |
Add requirements for user function in groupby transform closes pandas-dev#13543 [skip ci]
closes pandas-dev#13543 Author: Kevin Sheppard <[email protected]> Closes pandas-dev#14388 from bashtage/groupby-transform-doc-string and squashes the following commits: ef1ff13 [Kevin Sheppard] DOC: Add details to DataFrame groupby transform
Code Sample, a copy-pastable example if possible
Comment
The slow_path operates Series by Series rather than on a group DataFrame. Once the slow path is accepted, it operates on the group DataFrames. I have 59 groups in my example with 8 columns, and so it runs 8 times with Series from the first group DataFrame and then, once happy, runs 58 more times on the DataFrames. The description says that it only operates on the group DataFrames (which is the correct behavior IMO).
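The call pattern described above can be observed by recording the type of each object handed to the function. This is a minimal sketch, assuming a small made-up frame (the column names g, a, b are hypothetical). The exact sequence of Series and DataFrame calls is an implementation detail and may differ between pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
df = pd.DataFrame({
    "g": np.repeat(["x", "y", "z"], 4),  # three groups of four rows
    "a": rng.standard_normal(12),
    "b": rng.standard_normal(12),
})

seen_types = []

def demean(obj):
    # Works for both a Series and a DataFrame and does not mutate its input.
    seen_types.append(type(obj).__name__)
    return obj - obj.mean()

out = df.groupby("g")[["a", "b"]].transform(demean)

# Reference result computed without a user function.
expected = df[["a", "b"]] - df.groupby("g")[["a", "b"]].transform("mean")

print(seen_types)
```

Because demean gives the same answer whether it receives one column or the whole group, the result matches the reference whichever path transform takes.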
Expected Output
Many lines of
Actual Output
output of pd.show_versions()