Skip to content

DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug #17326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
FlorianWilhelm opened this issue Aug 24, 2017 · 17 comments
Open
Labels
Apply Apply, Aggregate, Transform, Map Docs Groupby

Comments

@FlorianWilhelm
Copy link
Contributor

FlorianWilhelm commented Aug 24, 2017

Code Sample

import sys
import logging
import pandas as pd

_logger = logging.getLogger(__name__)


def weird_func(series):
    series = pd.to_numeric(series)
    _logger.info("Series:\n{}".format(series))
    if series.name == 'feature':
        _logger.info("Calling a nonsense function with no effect on x")
        # Commenting the following line solves the problem but WHY???
        series.where(lambda x: x >= '1', inplace=False, try_cast=False, raise_on_error=True)
        _logger.error("This is NEVER SHOWN if line above is not commented!")
        result = 'Right'
    else:
        result = 'False'
    return result

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

df = pd.DataFrame({'uid': [1, 1, 2],
                   'feature': ['10', '20', '30']})

grouped = df.groupby('uid')
print("Final result:\n{}".format(grouped['feature'].agg(weird_func)))

Problem description

The code above displays:

INFO:__main__:Series:
0    10
1    20
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
INFO:__main__:Series:
0    10
1    20
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
INFO:__main__:Series:
0    10
1    20
Name: 1, dtype: int64
INFO:__main__:Series:
2    30
Name: 2, dtype: int64
Final result:
uid
1    False
2    False
Name: feature, dtype: object

Several things are weird here:

  1. The name of the first two series inside the aggregation function is feature as expected but the following ones are named after the the values in the uid column. Intended behavior?
  2. Why is the error log message after the meaningless where call to the series never shown?
  3. The dtype of the series is int64 and the comparison x >= '1' should fail in Python3 but doesn't?
  4. Commenting the line series.where(... leads to the expected result

My guess is that this is somehow related to the fact that pandas benchmarks different codepaths by calling the first group several times. Herein something seems to go wrong and messes up the data structure leading to this strange behavior. If this is intended then please let me know.

Expected Output

INFO:__main__:Series:
0    10
1    20
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
ERROR:__main__:This is NEVER SHOWN if line above is not commented!
INFO:__main__:Series:
2    30
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
ERROR:__main__:This is NEVER SHOWN if line above is not commented!
Final result:
uid
1    Right
2    Right
Name: feature, dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: Non

@gfyoung
Copy link
Member

gfyoung commented Aug 24, 2017

@FlorianWilhelm : Thanks for reporting this! Let's unpack your questions:

  1. Try setting the output of your .agg call to a variable. Then print it. You'll that the uid column is the output of your aggregation.

  2. I believe there's a try-except going on somewhere that causes the code to exit the function when you do the comparison. @jreback ? @jorisvandenbossche ?

  3. Try passing in a Series directly into weird_func (i.e. call the function yourself with df["feature"], and it errors as expected.

  4. I'm not surprised given my comment to 3.

@jorisvandenbossche
Copy link
Member

Why is the error log message after the meaningless where call to the series never shown?
The dtype of the series is int64 and the comparison x >= '1' should fail in Python3 but doesn't?

Those two are related. As @gfyoung says, the function is first tried inside a try except block, and as you note will fail, but the error is catched. Therefore you do not see the error, but you also do not see the next line (as this is never executed due to the error)

@jorisvandenbossche
Copy link
Member

But why it then falls back to executing the function on a Series with a different name, I don't directly know.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 24, 2017

Apparently I am wrong, the comparison seems to work fine: (sorry, forgot that the function first converts to numerical)

@FlorianWilhelm
Copy link
Contributor Author

@gfyoung and @jorisvandenbossche , thanks for your fast replies!
This small code snippet was actually boiled down from a code I was working on. Adding the series.where part all of a sudden caused the aggregation to return multi-index dataframes for no reason since I was not fiddling with the index at all. Digging into this I noticed that the Name: ... of the series inside the function passed to aggregate changed all of a sudden. Then, I started to remove all parts until I came to that version above. Still I think that we should not see these Name: 1, dtype: int64 and Name: 2, dtype: int64 series since these are values of the column we grouped by, not of the column feature we are operating on.

I bet this is somehow related to some C++ pointer bug deep down in the Pandas codebase. When the first test execution (for performance measurement) of the first group fails (and the exception is swallowed) then the pointer to the feature column is somehow shifted to the groupby column... but this is just a wild guess ;-)

@gfyoung
Copy link
Member

gfyoung commented Aug 24, 2017

@FlorianWilhelm : Feel free to investigate the cause. There could potentially be it.

@jreback
Copy link
Contributor

jreback commented Aug 25, 2017

This is user error

In [1]: df = pd.DataFrame({'uid': [1, 1, 2],
   ...:                    'feature': ['10', '20', '30']})
   ...: 
   ...: grouped = df.groupby('uid')
   ...: 

In [2]: grouped.groups
Out[2]: {1: Int64Index([0, 1], dtype='int64'), 2: Int64Index([2], dtype='int64')}

Get the first group

In [4]: s = grouped.get_group(1)

In [5]: s
Out[5]: 
  feature  uid
0      10    1
1      20    1

Execute the function

In [11]: s = pd.to_numeric(s.feature)

In [13]: s
Out[13]: 
2    30
Name: feature, dtype: int64

In [14]: s.where(lambda x: x >= '1', inplace=False, try_cast=False, raise_on_error=True)
TypeError: invalid type comparison

You are raising inside the function (and that's why commenting it out works). You are doing way too many things inside the .agg so of course very hard to find root causes.

@jreback jreback closed this as completed Aug 25, 2017
@jorisvandenbossche
Copy link
Member

@jreback Well, this is totally not a clear user error (or very hard to debug).
If the function errors, you would expect this error to bubble up (after first catching it and trying something else).

And certainly passing a Series with a different name as a second try, how can that ever be correct?

@FlorianWilhelm
Copy link
Contributor Author

I totally agree with @jorisvandenbossche here. The user error surely is the wrong comparison but the behavior how this error is handled is buggy. Why is the error just caught silently and why is the name attribute of the series changed?

On a side note @jreback , I find it rather rude to just comment and close an issue if none of the other participants of the discussion had a chance to react on your comment.

@TomAugspurger
Copy link
Contributor

And certainly passing a Series with a different name as a second try, how can that ever be correct?

That's my thought as well. @FlorianWilhelm if you're willing, could you track down where the name changes, and see if changing that breaks any tests?

On a side note @jreback , I find it rather rude to just comment and close an issue if none of the other participants of the discussion had a chance to react on your comment.

Sorry about that. I know in some communities closing issues is like locking them and ending discussion, but not here. We're quick to close issues (and yet we still have over 2,000 open), but always happy to continue discussion and re-open if necessary.

I think that for this example:

In [13]: def f(x):
    ...:     if x.name == 'feature':
    ...:         raise TypeError
    ...:     else:
    ...:         return 'False'
    ...:

In [14]: grouped['feature'].agg(f)
Out[14]:
uid
1    False
2    False
Name: feature, dtype: object

A TypeError should be raised back up to the user, similar to if f raised a BaseException, which isn't caught by pandas.

@TomAugspurger TomAugspurger reopened this Aug 25, 2017
@jreback
Copy link
Contributor

jreback commented Aug 25, 2017

TyoeErrors are not re raised rather they signal to ignore that column. E.g. doing mean on a string column

this has long been the case : I don't think it's possible to remove in any kind of clean way

thus the OP is violating the guarantee of agg

there are several issues about this IIRC (which are closed ; might be hard to find)

I suppose the guarantee could be better documented

@FlorianWilhelm
Copy link
Contributor Author

FlorianWilhelm commented Aug 28, 2017

Thanks @jreback that totally explains the behavior. If I got you correctly, the fact that weird_func raises an exception lets agg ignore the column feature that is used for naming the series and defaults therefore to the index column which will instead be used for naming the series.

I still think that is not a nice behavior since it violates the principle of least surprise and explicit is better than implicit. On the other hand, changing this would break a lot of code and seems therefore impractical for pandas 1.0, but could be an option for pandas 2.0. I think it is always better to apply an aggregation explicitly over a certain set of columns that you expect to correctly aggregated and in this case it is safe to raise an exception that reaches the actual caller.

I also agree that the current behavior should be better documented. Thanks for the explanation again.

@jreback
Copy link
Contributor

jreback commented Aug 29, 2017

@FlorianWilhelm

I tried to change this once, but ran into a rabbit-hole and discarded it. Adding an optional to .groupby() also is not IMHO a nice soln. Basically udf's that are not 'well-behaved' are somewhat problematic, thought these are itself hard to detect.

If you would offer up a short warning in groupby that covers 2 items:

would be fantastic

@jreback jreback added this to the 0.21.0 milestone Aug 29, 2017
@jreback jreback changed the title Strangest behavoir of groubpy aggregation ever, most likely a bug DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug Aug 29, 2017
@FlorianWilhelm
Copy link
Contributor Author

@jreback Thanks, I'll look into that! Sounds like a nice little task for me.

@FlorianWilhelm
Copy link
Contributor Author

@jreback, I did some more investigations in order to find a proper warning for the docs but some questions come up:

  1. The behavior or caught exceptions seems to happen only in aggregate, not in apply, right?
  2. TypeError and ValueError are caught and the corresponding columns dropped in the result but is there really a difference? I only found one more check for TypeError in the code. You wrote TypeError vs. ValueError

@jreback
Copy link
Contributor

jreback commented Sep 6, 2017

@FlorianWilhelm the ideas is that these should be the same (.aggregate vs .apply), but there may be a lingering difference. If you can show a specific example (simplified from your top post), that show a difference would take that as a bug fix.

@FlorianWilhelm
Copy link
Contributor Author

@jreback So aggregate and apply seem to differ in the sense that aggregate sets the current column name as name attribute for the series while apply always uses the group name as name attribute for the series.
In case of aggregate if an ValueError or TypeError is raised it will fall back to naming the series like in the case of apply. If then another of those exception is raised the whole aggregation fails.
In case of apply raising either ValueError or TypeError will always fail.
Still I find that behaviour really strange, my take away is that I should be careful with assumptions on how a series in a groupby is named and about raising exceptions.
The following code shows that:

import pandas as pd

df = pd.DataFrame({'uid': [1, 1, 2], 'feature': [1, 2, 3]})

def raise_type_error(series):
    print("Type is: {}".format(type(series)))
    print("Name of series is: {}".format(series.name))
    if series.name == 'feature':
        print("Raise due to feature name")
        raise TypeError
    return series.sum()

def raise_value_error(series):
    print("Type is: {}".format(type(series)))
    print("Name of series is: {}".format(series.name))
    if series.name == 'feature':
        print("Raise due to feature name")
        raise ValueError
    return series.sum()


grouped = df.groupby('uid')
print(grouped['feature'].agg(raise_type_error))
print(grouped['feature'].agg(raise_value_error))
print("#"*80)
print(grouped['feature'].apply(raise_type_error))
print(grouped['feature'].apply(raise_value_error))

Anyway, I think we can close this issue since it really seems to be intended behaviour.

@rhshadrach rhshadrach added the Apply Apply, Aggregate, Transform, Map label Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Docs Groupby
Projects
None yet
Development

No branches or pull requests

7 participants