BUG: GroupBy.count() and GroupBy.sum() incorrectly return NaN instead of 0 for missing categories (Version 2) #35241

smithto1 · 2020-07-11T22:42:16Z

closes Inconsistent behavior when groupby pandas Categorical variables #31422
closes BUG: df.groupby().count() returns NaN instead of Zero #35028
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, .sum() and .count() would return NaN for the missing categories, but they are expected to return 0 for the missing categories. Both these bugs are fixed.
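As a rough user-level illustration of the idea named in the first commit message (aggregate over the observed groups only, then reindex with a fill value of 0), something like the following could be used. This is only a sketch under that assumption, not the code changed in this PR; the names `full_index`, `filled_sum` and `filled_count` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

# Aggregate over the observed category combinations only.
grouped = df.groupby(["cat_1", "cat_2"], observed=True)["value"]
observed_sum = grouped.sum()
observed_count = grouped.count()

# Build every combination of the declared categories (3 x 2 = 6 rows).
full_index = pd.MultiIndex.from_product(
    [df["cat_1"].cat.categories, df["cat_2"].cat.categories],
    names=["cat_1", "cat_2"],
)

# Reindex so that combinations with no rows get 0 rather than NaN.
filled_sum = observed_sum.reindex(full_index, fill_value=0)
filled_count = observed_count.reindex(full_index, fill_value=0)

print(filled_sum)
print(filled_count)
```

Filling with 0 is only appropriate for aggregations whose value on an empty group is 0, which is why the fix is limited to .sum() and .count().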

Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting sum() to return NaN; these have been updated so that the tests now expect to get 0 (which is the desired behaviour).

One new test is added to ensure that the exception handling of the new try-except-finally block behaves as expected.

df.pivot_table
The changes to .sum() and .count() also affect df.pivot_table() when it is called with aggfunc=sum/count and pivots on a Categorical column with observed=False. Neither bug report mentions this explicitly, but the change makes the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). Two tests in test_pivot.py were updated to reflect this change.
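A small example of the kind of call affected is sketched below. The frame is the same toy data used elsewhere in this PR, and the variable name `table` is made up for the example. The cells for the missing categories depend on the pandas version in use, so no output is shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

# Pivot on two Categorical columns, keeping unobserved categories in play.
table = df.pivot_table(
    values="value",
    index="cat_1",
    columns="cat_2",
    aggfunc="sum",
    observed=False,
)

# With this PR's behaviour, missing categories are expected to show 0, not NaN.
print(table)
```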

jreback · 2020-07-11T22:44:31Z

it’s better to push to the original PR i guess it’s ok this time

smithto1 · 2020-07-11T22:58:29Z

> it’s better to push to the original PR i guess it’s ok this time

Ok. Is pushing to the existing PR the best practice even if all the new changes are on a new branch?

smithto1 · 2020-07-11T23:12:28Z

GroupBy.apply(sum)
One issue to note is that GroupBy.agg(sum) and GroupBy.apply(sum) now produce different outputs. .agg(sum) produces the desired behaviour of returning 0 for missing categories, while .apply(sum) still has the old behaviour of returning NaN.

import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)
df_grp = df.groupby(["cat_1", "cat_2"], observed=False)

# Using the branch from this PR:
# .sum() returns zeros
print(df_grp.sum())

# .agg(sum) returns zeros
print(df_grp.agg(sum))

# .apply(sum) still returns NaN
print(df_grp.apply(sum))

Calling .apply(sum) never actually calls the GroupBy.sum() method; it just applies the passed-in function. .agg(sum) does call the GroupBy.sum() method (happening here https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L253). This means that .agg(sum) does get the special behaviour of returning zero for missing categories, while .apply(sum) misses out on the special behaviour and still returns NaN.

This change necessitated an update to groupby\test_categorical.py::test_seriesgroupby_observed_false_or_none.

@jreback, what are your thoughts on this? Is it fine to leave .apply(sum) as it is (returning NaN for missing categories)?

I am not inclined to address it in this PR, but if we do want it fixed, I'll raise an issue for it.
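For anyone who needs matching output from .apply in the meantime, one possible workaround is sketched below. It is not part of this PR; it assumes the apply result is a Series indexed by the two grouping columns, and the names `applied`, `full_index` and `patched` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_1": pd.Categorical(list("AABB"), categories=list("ABC")),
        "cat_2": pd.Categorical(list("1111"), categories=list("12")),
        "value": [0.1, 0.1, 0.1, 0.1],
    }
)

# Apply a generic callable per group; this does not go through GroupBy.sum().
applied = df.groupby(["cat_1", "cat_2"], observed=False)["value"].apply(
    lambda s: s.sum()
)

# Every combination of the declared categories.
full_index = pd.MultiIndex.from_product(
    [df["cat_1"].cat.categories, df["cat_2"].cat.categories],
    names=["cat_1", "cat_2"],
)

# Align to the full set of combinations, then replace any NaN with 0.
patched = applied.reindex(full_index).fillna(0)
print(patched)
```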

smithto1 added 7 commits July 11, 2020 23:08

set self.observed=True for .sum() and .count() so that reindex can be called with 0

ec577d7

wrote test for .sum() and .count() exception handling

586b1bd

black

ec4bd28

updated pivot tests

eed48cb

whatsnew

88d8a14

comments

940b32e

black

3e39f1a

smithto1 mentioned this pull request Jul 11, 2020

BUG: GroupBy.count() and GroupBy.sum() incorrectly return NaN instead of 0 for missing categories #35201

Closed

6 tasks

smithto1 closed this Jul 14, 2020
