DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug #17326

FlorianWilhelm · 2017-08-24T15:40:09Z

Code Sample

import sys
import logging
import pandas as pd

_logger = logging.getLogger(__name__)


def weird_func(series):
    series = pd.to_numeric(series)
    _logger.info("Series:\n{}".format(series))
    if series.name == 'feature':
        _logger.info("Calling a nonsense function with no effect on x")
        # Commenting the following line solves the problem but WHY???
        series.where(lambda x: x >= '1', inplace=False, try_cast=False, raise_on_error=True)
        _logger.error("This is NEVER SHOWN if line above is not commented!")
        result = 'Right'
    else:
        result = 'False'
    return result

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

df = pd.DataFrame({'uid': [1, 1, 2],
                   'feature': ['10', '20', '30']})

grouped = df.groupby('uid')
print("Final result:\n{}".format(grouped['feature'].agg(weird_func)))

Problem description

The code above displays:

INFO:__main__:Series:
0    10
1    20
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
INFO:__main__:Series:
0    10
1    20
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
INFO:__main__:Series:
0    10
1    20
Name: 1, dtype: int64
INFO:__main__:Series:
2    30
Name: 2, dtype: int64
Final result:
uid
1    False
2    False
Name: feature, dtype: object

Several things are weird here:

The name of the first two series inside the aggregation function is feature as expected but the following ones are named after the the values in the uid column. Intended behavior?
Why is the error log message after the meaningless where call to the series never shown?
The dtype of the series is int64 and the comparison x >= '1' should fail in Python3 but doesn't?
Commenting the line series.where(... leads to the expected result

My guess is that this is somehow related to the fact that pandas benchmarks different codepaths by calling the first group several times. Herein something seems to go wrong and messes up the data structure leading to this strange behavior. If this is intended then please let me know.

Expected Output

INFO:__main__:Series:
0    10
1    20
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
ERROR:__main__:This is NEVER SHOWN if line above is not commented!
INFO:__main__:Series:
2    30
Name: feature, dtype: int64
INFO:__main__:Calling a nonsense function with no effect on x
ERROR:__main__:This is NEVER SHOWN if line above is not commented!
Final result:
uid
1    Right
2    Right
Name: feature, dtype: object

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: Non

The text was updated successfully, but these errors were encountered:

gfyoung · 2017-08-24T16:11:49Z

@FlorianWilhelm : Thanks for reporting this! Let's unpack your questions:

Try setting the output of your .agg call to a variable. Then print it. You'll that the uid column is the output of your aggregation.
I believe there's a try-except going on somewhere that causes the code to exit the function when you do the comparison. @jreback ? @jorisvandenbossche ?
Try passing in a Series directly into weird_func (i.e. call the function yourself with df["feature"], and it errors as expected.
I'm not surprised given my comment to 3.

jorisvandenbossche · 2017-08-24T16:29:01Z

Why is the error log message after the meaningless where call to the series never shown?
The dtype of the series is int64 and the comparison x >= '1' should fail in Python3 but doesn't?

Those two are related. As @gfyoung says, the function is first tried inside a try except block, and as you note will fail, but the error is catched. Therefore you do not see the error, but you also do not see the next line (as this is never executed due to the error)

jorisvandenbossche · 2017-08-24T16:30:23Z

But why it then falls back to executing the function on a Series with a different name, I don't directly know.

jorisvandenbossche · 2017-08-24T16:33:06Z

~~Apparently I am wrong, the comparison seems to work fine:~~ (sorry, forgot that the function first converts to numerical)

FlorianWilhelm · 2017-08-24T17:26:42Z

@gfyoung and @jorisvandenbossche , thanks for your fast replies!
This small code snippet was actually boiled down from a code I was working on. Adding the series.where part all of a sudden caused the aggregation to return multi-index dataframes for no reason since I was not fiddling with the index at all. Digging into this I noticed that the Name: ... of the series inside the function passed to aggregate changed all of a sudden. Then, I started to remove all parts until I came to that version above. Still I think that we should not see these Name: 1, dtype: int64 and Name: 2, dtype: int64 series since these are values of the column we grouped by, not of the column feature we are operating on.

I bet this is somehow related to some C++ pointer bug deep down in the Pandas codebase. When the first test execution (for performance measurement) of the first group fails (and the exception is swallowed) then the pointer to the feature column is somehow shifted to the groupby column... but this is just a wild guess ;-)

gfyoung · 2017-08-24T17:32:00Z

@FlorianWilhelm : Feel free to investigate the cause. There could potentially be it.

jreback · 2017-08-25T11:07:42Z

This is user error

In [1]: df = pd.DataFrame({'uid': [1, 1, 2],
   ...:                    'feature': ['10', '20', '30']})
   ...: 
   ...: grouped = df.groupby('uid')
   ...: 

In [2]: grouped.groups
Out[2]: {1: Int64Index([0, 1], dtype='int64'), 2: Int64Index([2], dtype='int64')}

Get the first group

In [4]: s = grouped.get_group(1)

In [5]: s
Out[5]: 
  feature  uid
0      10    1
1      20    1

Execute the function

In [11]: s = pd.to_numeric(s.feature)

In [13]: s
Out[13]: 
2    30
Name: feature, dtype: int64

In [14]: s.where(lambda x: x >= '1', inplace=False, try_cast=False, raise_on_error=True)
TypeError: invalid type comparison

You are raising inside the function (and that's why commenting it out works). You are doing way too many things inside the .agg so of course very hard to find root causes.

jorisvandenbossche · 2017-08-25T11:56:40Z

@jreback Well, this is totally not a clear user error (or very hard to debug).
If the function errors, you would expect this error to bubble up (after first catching it and trying something else).

And certainly passing a Series with a different name as a second try, how can that ever be correct?

FlorianWilhelm · 2017-08-25T12:59:58Z

I totally agree with @jorisvandenbossche here. The user error surely is the wrong comparison but the behavior how this error is handled is buggy. Why is the error just caught silently and why is the name attribute of the series changed?

On a side note @jreback , I find it rather rude to just comment and close an issue if none of the other participants of the discussion had a chance to react on your comment.

TomAugspurger · 2017-08-25T14:50:15Z

And certainly passing a Series with a different name as a second try, how can that ever be correct?

That's my thought as well. @FlorianWilhelm if you're willing, could you track down where the name changes, and see if changing that breaks any tests?

On a side note @jreback , I find it rather rude to just comment and close an issue if none of the other participants of the discussion had a chance to react on your comment.

Sorry about that. I know in some communities closing issues is like locking them and ending discussion, but not here. We're quick to close issues (and yet we still have over 2,000 open), but always happy to continue discussion and re-open if necessary.

I think that for this example:

In [13]: def f(x):
    ...:     if x.name == 'feature':
    ...:         raise TypeError
    ...:     else:
    ...:         return 'False'
    ...:

In [14]: grouped['feature'].agg(f)
Out[14]:
uid
1    False
2    False
Name: feature, dtype: object

A TypeError should be raised back up to the user, similar to if f raised a BaseException, which isn't caught by pandas.

jreback · 2017-08-25T23:30:53Z

TyoeErrors are not re raised rather they signal to ignore that column. E.g. doing mean on a string column

this has long been the case : I don't think it's possible to remove in any kind of clean way

thus the OP is violating the guarantee of agg

there are several issues about this IIRC (which are closed ; might be hard to find)

I suppose the guarantee could be better documented

FlorianWilhelm · 2017-08-28T08:44:50Z

Thanks @jreback that totally explains the behavior. If I got you correctly, the fact that weird_func raises an exception lets agg ignore the column feature that is used for naming the series and defaults therefore to the index column which will instead be used for naming the series.

I still think that is not a nice behavior since it violates the principle of least surprise and explicit is better than implicit. On the other hand, changing this would break a lot of code and seems therefore impractical for pandas 1.0, but could be an option for pandas 2.0. I think it is always better to apply an aggregation explicitly over a certain set of columns that you expect to correctly aggregated and in this case it is safe to raise an exception that reaches the actual caller.

I also agree that the current behavior should be better documented. Thanks for the explanation again.

jreback · 2017-08-29T12:48:19Z

@FlorianWilhelm

I tried to change this once, but ran into a rabbit-hole and discarded it. Adding an optional to .groupby() also is not IMHO a nice soln. Basically udf's that are not 'well-behaved' are somewhat problematic, thought these are itself hard to detect.

If you would offer up a short warning in groupby that covers 2 items:

raising an exception (TypeError vs. ValueError) in a udf
mutating the input (DOC: clarify dangers of fast apply in GroupBy apply docs #14180)

would be fantastic

FlorianWilhelm · 2017-08-30T12:00:15Z

@jreback Thanks, I'll look into that! Sounds like a nice little task for me.

FlorianWilhelm · 2017-09-04T06:04:21Z

@jreback, I did some more investigations in order to find a proper warning for the docs but some questions come up:

The behavior or caught exceptions seems to happen only in aggregate, not in apply, right?
TypeError and ValueError are caught and the corresponding columns dropped in the result but is there really a difference? I only found one more check for TypeError in the code. You wrote TypeError vs. ValueError

jreback · 2017-09-06T10:33:24Z

@FlorianWilhelm the ideas is that these should be the same (.aggregate vs .apply), but there may be a lingering difference. If you can show a specific example (simplified from your top post), that show a difference would take that as a bug fix.

FlorianWilhelm · 2017-09-14T13:55:05Z

@jreback So aggregate and apply seem to differ in the sense that aggregate sets the current column name as name attribute for the series while apply always uses the group name as name attribute for the series.
In case of aggregate if an ValueError or TypeError is raised it will fall back to naming the series like in the case of apply. If then another of those exception is raised the whole aggregation fails.
In case of apply raising either ValueError or TypeError will always fail.
Still I find that behaviour really strange, my take away is that I should be careful with assumptions on how a series in a groupby is named and about raising exceptions.
The following code shows that:

import pandas as pd

df = pd.DataFrame({'uid': [1, 1, 2], 'feature': [1, 2, 3]})

def raise_type_error(series):
    print("Type is: {}".format(type(series)))
    print("Name of series is: {}".format(series.name))
    if series.name == 'feature':
        print("Raise due to feature name")
        raise TypeError
    return series.sum()

def raise_value_error(series):
    print("Type is: {}".format(type(series)))
    print("Name of series is: {}".format(series.name))
    if series.name == 'feature':
        print("Raise due to feature name")
        raise ValueError
    return series.sum()


grouped = df.groupby('uid')
print(grouped['feature'].agg(raise_type_error))
print(grouped['feature'].agg(raise_value_error))
print("#"*80)
print(grouped['feature'].apply(raise_type_error))
print(grouped['feature'].apply(raise_value_error))

Anyway, I think we can close this issue since it really seems to be intended behaviour.

gfyoung added the Groupby label Aug 24, 2017

jreback closed this as completed Aug 25, 2017

TomAugspurger reopened this Aug 25, 2017

jreback added Difficulty Novice Docs labels Aug 29, 2017

jreback added this to the 0.21.0 milestone Aug 29, 2017

jreback changed the title ~~Strangest behavoir of groubpy aggregation ever, most likely a bug~~ DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug Aug 29, 2017

FlorianWilhelm closed this as completed Sep 14, 2017

jorisvandenbossche reopened this Jul 8, 2018

jorisvandenbossche removed this from the 0.21.0 milestone Jul 8, 2018

jbrockmendel removed the Effort Low label Oct 21, 2019

Kodiologist mentioned this issue Feb 26, 2021

BUG: GroupBy.apply reruns the full loop if an exception is raised #40093

Closed

2 tasks

rhshadrach added the Apply Apply, Aggregate, Transform, Map label Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug #17326

DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug #17326

FlorianWilhelm commented Aug 24, 2017 •

edited

Loading

INSTALLED VERSIONS

gfyoung commented Aug 24, 2017

jorisvandenbossche commented Aug 24, 2017

jorisvandenbossche commented Aug 24, 2017

jorisvandenbossche commented Aug 24, 2017 •

edited

Loading

FlorianWilhelm commented Aug 24, 2017

gfyoung commented Aug 24, 2017

jreback commented Aug 25, 2017

jorisvandenbossche commented Aug 25, 2017

FlorianWilhelm commented Aug 25, 2017

TomAugspurger commented Aug 25, 2017

jreback commented Aug 25, 2017

FlorianWilhelm commented Aug 28, 2017 •

edited

Loading

jreback commented Aug 29, 2017

FlorianWilhelm commented Aug 30, 2017

FlorianWilhelm commented Sep 4, 2017

jreback commented Sep 6, 2017

FlorianWilhelm commented Sep 14, 2017

DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug #17326

DOC: Strangest behavoir of groubpy aggregation ever, most likely a bug #17326

Comments

FlorianWilhelm commented Aug 24, 2017 • edited Loading

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

gfyoung commented Aug 24, 2017

jorisvandenbossche commented Aug 24, 2017

jorisvandenbossche commented Aug 24, 2017

jorisvandenbossche commented Aug 24, 2017 • edited Loading

FlorianWilhelm commented Aug 24, 2017

gfyoung commented Aug 24, 2017

jreback commented Aug 25, 2017

jorisvandenbossche commented Aug 25, 2017

FlorianWilhelm commented Aug 25, 2017

TomAugspurger commented Aug 25, 2017

jreback commented Aug 25, 2017

FlorianWilhelm commented Aug 28, 2017 • edited Loading

jreback commented Aug 29, 2017

FlorianWilhelm commented Aug 30, 2017

FlorianWilhelm commented Sep 4, 2017

jreback commented Sep 6, 2017

FlorianWilhelm commented Sep 14, 2017

FlorianWilhelm commented Aug 24, 2017 •

edited

Loading

Output of `pd.show_versions()`

jorisvandenbossche commented Aug 24, 2017 •

edited

Loading

FlorianWilhelm commented Aug 28, 2017 •

edited

Loading