BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147
Comments
Hi, Can I help in contributing to this issue? |
Thanks @BraisLP for the report
from the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
Given the above, I'm not so sure that this isn't the expected behavior and that the other cases are perhaps the incorrect ones. For instance, if the cat series included an unused category for the withsum case in the OP, you get
and for the withmin case
The withmin case does not include the unused category and so is IMO incorrect, and the withsum case repeats the unused category for all values in the first level, which makes sense and is consistent with including the extra rows in the withsum case without unused categories. |
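To make the sum/min difference concrete outside of crosstab, here is a minimal sketch using a plain groupby on a categorical key. The data and the unused category "c" are made up for illustration (they are not the OP's data), and this does not trace what crosstab does internally; it only shows that an empty group sums to 0 while it has no minimum.

```python
import pandas as pd

# Hypothetical data: "c" is declared as a category but never occurs.
cat = pd.Series(
    pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"]),
    name="cat",
)
values = pd.Series([1, 1, 4, 4], name="values")

# observed=False keeps every category in the result, used or not.
grouped = values.groupby(cat, observed=False)
sums = grouped.sum()   # the empty group "c" sums to 0
mins = grouped.min()   # the empty group "c" has no minimum, so it is NaN

print(sums)
print(mins)
```

If rows that are entirely NaN get dropped afterwards, the unused category would survive for sum and disappear for min, which may be related to the inconsistency discussed here.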
#16367 related |
I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.
withsum_inverted = pd.crosstab([rowcat, rowstr],
                               columns=dummy,
                               values=values,
                               aggfunc=sum,  # same for np.sum
                               )
>>> withsum_inverted
dummy    0
cat str
a   a    2
    b    0
b   a    0
    b    8 |
Again, what would be the expected output if the category Series includes an unused category? The first level of the MultiIndex should include the unused category (according to the docs), but what would be the second level? Maybe a missing value indicator (such as np.nan), or all of the unique values in the second level? (i.e. the MultiIndex should be the product of the categories with the unique values in the non-categorical series). Creating the MultiIndex from the product would be consistent with the current sum behavior for the reversed level order. |
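The "product" idea can be sketched by hand with a groupby followed by a reindex. This is only an illustration of the proposed index shape, not how crosstab is implemented; the data is made up (chosen to give the 2 and 8 seen in the output earlier in the thread), and the unused category "c" is an assumption.

```python
import pandas as pd

# Hypothetical data; "c" is an unused category.
rowcat = pd.Series(
    pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"]),
    name="cat",
)
rowstr = pd.Series(["a", "a", "b", "b"], name="str")
values = pd.Series([1, 1, 4, 4], name="values")

# Aggregate only the combinations that actually occur
# (the categorical key is cast to plain objects for this step).
observed = values.groupby([rowcat.astype(object), rowstr]).sum()

# Full index: every category (used or not) x every unique value of the
# non-categorical series.
full_index = pd.MultiIndex.from_product(
    [rowcat.cat.categories, rowstr.unique()], names=["cat", "str"]
)

# Missing combinations are filled with 0, which suits a sum; a different
# aggfunc might call for NaN instead.
expanded = observed.reindex(full_index, fill_value=0)
print(expanded)
```

With three categories and two unique strings this gives six rows, of which only (a, a) and (b, b) carry non-zero sums.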
In this last example, the column that gets "expanded" to include unused values is not categorical (the column titled "str" was defined in my first post as
Time permitting, I'm going to have a look at this in more detail... |
I discovered this in pandas 1.2.5 and I tested it in 1.4.2 and 1.3.1. I haven't found a similar issue reported here.
I'm using python 3.9.10
Here's a "minimal" example: if I provide e.g. two rows to pd.crosstab, one of which is of categorical type, there's odd behaviour exclusively when aggfunc=sum or np.sum:
This produces the odd-looking matrix:
Compare to the following three examples:
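The snippets themselves are not reproduced here. As a rough sketch of the kind of call being described, the following uses made-up data and borrows the variable names from the later comment (rowcat, rowstr, dummy, values); the actual definitions in the report may well differ. The string aliases "sum" and "min" stand in for the sum / np.sum callables mentioned above. No expected output is shown, because the row set returned for the sum case is exactly what is in question and may vary between pandas versions.

```python
import pandas as pd

# Hypothetical inputs: one categorical row key, one plain string row key.
rowcat = pd.Series(pd.Categorical(["a", "a", "b", "b"]), name="cat")
rowstr = pd.Series(["a", "a", "b", "b"], name="str")
dummy = pd.Series([0, 0, 0, 0], name="dummy")
values = pd.Series([1, 1, 4, 4], name="values")

# Same call twice, differing only in the aggregation function.
withsum = pd.crosstab([rowstr, rowcat], columns=dummy, values=values, aggfunc="sum")
withmin = pd.crosstab([rowstr, rowcat], columns=dummy, values=values, aggfunc="min")

# Compare which (str, cat) combinations each result contains.
print(withsum)
print(withmin)
```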