BUG: GroupBy.count() and GroupBy.sum() incorreclty return NaN instead of 0 for missing categories (Version 2) #35241
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, .sum() and .count() would return NaN for the missing categories, but they are expected to return 0 for the missing categories. Both these bugs are fixed.
Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting
sum()
to returnNaN
; these have been updated so that the tests now expect to get0
(which is the desired behaviour).One new test is added to ensure that the exception handling of the new
try-except-finally
block behaves as expected.df.pivot_table
The changes to
.sum()
&.count()
also impacts thedf.pivot_table()
if it is called withaggfunc=sum/count
and is pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). Two tests on test_pivot.py was updated to reflect this change.