Skip to content

BUG: GroupBy.count() and GroupBy.sum() incorreclty return NaN instead of 0 for missing categories (Version 2) #35241

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

smithto1
Copy link
Member

Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, .sum() and .count() would return NaN for the missing categories, but they are expected to return 0 for the missing categories. Both these bugs are fixed.

Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting sum() to return NaN; these have been updated so that the tests now expect to get 0 (which is the desired behaviour).

One new test is added to ensure that the exception handling of the new try-except-finally block behaves as expected.

df.pivot_table
The changes to .sum() & .count() also impacts the df.pivot_table() if it is called with aggfunc=sum/count and is pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). Two tests on test_pivot.py was updated to reflect this change.

@jreback
Copy link
Contributor

jreback commented Jul 11, 2020

it’s better to push to the original PR i guess it’s ok this time

@smithto1
Copy link
Member Author

it’s better to push to the original PR i guess it’s ok this time

Ok. Is pushing to the existing PR the best practice even if all the new changes are on a new branch?

@smithto1
Copy link
Member Author

smithto1 commented Jul 11, 2020

GroupBy.apply(sum)
One issue to note is that GroupBy.agg(sum) and GroupBy.apply(sum) now produce different outputs. .agg(sum) produces the desired behaviour of returning 0 for missing categories, while .apply(sum) still has the old behaviour of returning NaN.

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using branch from this PR:
# .sum() returns zeros 
print( df_grp.sum() )

# agg returns zeros  
print( df_grp.agg(sum) )

# apply still returns NaN 
print( df_grp.apply(sum) )

Calling .apply(sum) never actually calls the GroupBy.sum() method, it just applies the passed in function. .agg(sum) does call the GroupBy.sum() method (happening here https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L253). Thes means that .agg(sum) does get the special behaviour to return zero for missing categories, while .apply(sum) misses out on the special behaviour and still returns NaN.

This change necessitated an update to groupby\test_categorical.py::test_seriesgroupby_observed_false_or_none.

@jreback , what are your thoughts on this? Is it fine to leave .apply(sum) as it is (returning NaN for missing categories).

I am not inclined to address it in this PR, but if we do want it fixed, I'll raise an issue for it.

@smithto1 smithto1 closed this Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants