DOC: Strangest behavior of groupby aggregation ever, most likely a bug #17326
Comments
@FlorianWilhelm : Thanks for reporting this! Let's unpack your questions:
Those two are related. As @gfyoung says, the function is first tried inside a try/except block and, as you note, it fails, but the error is caught. Therefore you do not see the error, and you also do not see the next line (it is never executed because of the error).
But why it then falls back to executing the function on a Series with a different name, I don't know offhand.
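To make the behavior described above concrete, here is a minimal pure-Python sketch of the pattern: a first attempt whose exception is swallowed, followed by a fallback that hands the function a differently named object. The `Group` class and the `aggregate` and `picky` functions are hypothetical stand-ins invented for illustration; this is not the pandas source code, only a toy that reproduces the symptoms reported in this issue.

```python
class Group:
    """Tiny stand-in for one group's data; NOT a pandas object."""

    def __init__(self, name, values):
        self.name = name
        self.values = values


def aggregate(groups, column, func):
    """Hypothetical aggregation with a silent fallback path."""
    try:
        # First attempt: every group is labelled with the column name.
        return {key: func(Group(column, vals)) for key, vals in groups.items()}
    except Exception:
        # The error is swallowed, so the caller never sees it.
        pass
    # Fallback: each group is relabelled with its group key instead.
    return {key: func(Group(key, vals)) for key, vals in groups.items()}


seen_names = []


def picky(group):
    seen_names.append(group.name)
    if group.name == "feature":
        raise TypeError("refusing to aggregate 'feature'")
    return sum(group.values)


result = aggregate({1: [1, 2], 2: [3]}, "feature", picky)
print(seen_names)
print(result)
```

In this toy, the function sees the name `feature` once, raises, and is then called again with names taken from the group keys, which mirrors the observation that the later Series are named after the values in the `uid` column.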
@gfyoung and @jorisvandenbossche, thanks for your fast replies! I bet this is somehow related to some C++ pointer bug deep down in the pandas codebase. When the first test execution (for performance measurement) of the first group fails (and the exception is swallowed), the pointer to the feature column is somehow shifted to the groupby column... but this is just a wild guess ;-)
@FlorianWilhelm : Feel free to investigate the cause. That could potentially be it.
This is user error
Get the first group
Execute the function
You are raising inside the function (and that's why commenting it out works). You are doing way too many things inside the
@jreback Well, this is totally not a clear user error (or it is a very hard one to debug). And passing a Series with a different name as a second try, how can that ever be correct?
I totally agree with @jorisvandenbossche here. The user error surely is the wrong comparison, but the way this error is handled is buggy. Why is the error just caught silently and why is the
On a side note @jreback, I find it rather rude to just comment and close an issue when none of the other participants in the discussion had a chance to react to your comment.
That's my thought as well. @FlorianWilhelm if you're willing, could you track down where the name changes, and see if changing that breaks any tests?
Sorry about that. I know in some communities closing issues is like locking them and ending discussion, but not here. We're quick to close issues (and yet we still have over 2,000 open), but always happy to continue discussion and re-open if necessary.
I think that for this example:
In [13]: def f(x):
    ...:     if x.name == 'feature':
    ...:         raise TypeError
    ...:     else:
    ...:         return 'False'
    ...:
In [14]: grouped['feature'].agg(f)
Out[14]:
uid
1    False
2    False
Name: feature, dtype: object
A
TypeErrors are not re-raised; rather, they signal to ignore that column, e.g. when doing mean on a string column. This has long been the case. I don't think it's possible to remove this in any kind of clean way, so the OP is violating the guarantee of agg. There are several issues about this IIRC (which are closed; they might be hard to find). I suppose the guarantee could be better documented.
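The convention described above, where a `TypeError` means "skip this column" rather than "report a failure", can be sketched in plain Python. The helper names `mean` and `aggregate_columns` and the sample data are assumptions made up for this illustration; the snippet is a toy model of the convention, not pandas code.

```python
def mean(values):
    # sum() over strings raises TypeError, much like a mean on a string column.
    return sum(values) / len(values)


def aggregate_columns(columns, func):
    """Apply func to each column, silently dropping columns that raise TypeError."""
    result = {}
    for name, values in columns.items():
        try:
            result[name] = func(values)
        except TypeError:
            # Treated as "this column cannot be aggregated", not as an error.
            continue
    return result


columns = {"feature": [1, 2, 3], "label": ["a", "b", "c"]}
aggregated = aggregate_columns(columns, mean)
print(aggregated)
```

Under this convention, a user function that raises `TypeError` for its own reasons is indistinguishable from a column that simply cannot be aggregated, which is why the exception never reaches the caller.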
Thanks @jreback, that totally explains the behavior. If I got you correctly, the fact that I still think that is not a nice behavior, since it violates the principle of least surprise and "explicit is better than implicit". On the other hand, changing this would break a lot of code and therefore seems impractical for pandas 1.0, but it could be an option for pandas 2.0. I think it is always better to apply an aggregation explicitly over a certain set of columns that you expect to be correctly aggregated, and in this case it is safe to raise an exception that reaches the actual caller. I also agree that the current behavior should be better documented. Thanks again for the explanation.
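As a small illustration of aggregating explicitly over a chosen set of columns, the following sketch selects the column before aggregating, so nothing depends on unsuitable columns being skipped implicitly. The `label` column is an assumption added for the example and is not part of the original report.

```python
import pandas as pd

# 'label' is a hypothetical string column added only for this illustration.
df = pd.DataFrame({
    "uid": [1, 1, 2],
    "feature": [1, 2, 3],
    "label": ["a", "b", "c"],
})

# Name the columns to aggregate explicitly instead of aggregating everything.
explicit = df.groupby("uid")[["feature"]].sum()
print(explicit)
```

Only `feature` appears in the result because only `feature` was requested.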
I tried to change this once, but ran into a rabbit-hole and discarded it. Adding an optional to If you would offer up a short warning in groupby that covers 2 items:
would be fantastic
@jreback Thanks, I'll look into that! Sounds like a nice little task for me.
@jreback, I did some more investigation in order to find a proper warning for the docs, but some questions came up:
@FlorianWilhelm the idea is that these should be the same (
@jreback So
import pandas as pd

df = pd.DataFrame({'uid': [1, 1, 2], 'feature': [1, 2, 3]})


def raise_type_error(series):
    print("Type is: {}".format(type(series)))
    print("Name of series is: {}".format(series.name))
    if series.name == 'feature':
        print("Raise due to feature name")
        raise TypeError
    return series.sum()


def raise_value_error(series):
    print("Type is: {}".format(type(series)))
    print("Name of series is: {}".format(series.name))
    if series.name == 'feature':
        print("Raise due to feature name")
        raise ValueError
    return series.sum()


grouped = df.groupby('uid')
print(grouped['feature'].agg(raise_type_error))
print(grouped['feature'].agg(raise_value_error))
print("#" * 80)
print(grouped['feature'].apply(raise_type_error))
print(grouped['feature'].apply(raise_value_error))
Anyway, I think we can close this issue since it really seems to be intended behaviour.
Code Sample
Problem description
The code above displays:
Several things are weird here:
feature as expected but the following ones are named after the values in the uid column. Intended behavior?
where call to the series never shown?
x >= '1' should fail in Python3 but doesn't?
series.where(... leads to the expected result
My guess is that this is somehow related to the fact that pandas benchmarks different codepaths by calling the first group several times. Here something seems to go wrong and mess up the data structure, leading to this strange behavior. If this is intended then please let me know.
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None