-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG #15150 normalization of crosstable with multiindex and margins #16599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #16599 +/- ##
==========================================
- Coverage 90.93% 90.93% -0.01%
==========================================
Files 161 161
Lines 49267 49253 -14
==========================================
- Hits 44800 44787 -13
+ Misses 4467 4466 -1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16599 +/- ##
==========================================
+ Coverage 91.02% 92.6% +1.57%
==========================================
Files 161 161
Lines 49393 63210 +13817
==========================================
+ Hits 44962 58536 +13574
- Misses 4431 4674 +243
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Could you add a release note in whatsnew/0.21.0
?
pandas/core/reshape/pivot.py
Outdated
table_index_names = table.index.names | ||
table_columns_names = table.columns.names | ||
else: | ||
raise ValueError("Not a valid margins argument") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe include the invalid argument in the error
ValueError("Not a valid margins argument: {!r}".format(margins))`
@@ -1266,21 +1266,18 @@ def test_crosstab_normalize(self): | |||
[0.25, 0.75], | |||
[0.4, 0.6]], | |||
index=pd.Index([1, 2, 'All'], | |||
name='a', | |||
dtype='object'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the changes here? They look fine, as object
is inferred, just wondering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see down below where it does change from object. I think that's ok though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, a bit confusing..see above comment where I tried to explain it
pandas/core/reshape/pivot.py
Outdated
try: | ||
f = normalizers[normalize] | ||
except KeyError: | ||
raise ValueError("Not a valid normalize argument") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, if you don't mind.
pandas/core/reshape/pivot.py
Outdated
# reset index to ensure default index dtype | ||
if normalize == 'index': | ||
colnames = table.columns.names | ||
table.columns = Index(table.columns.tolist()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the tolist
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added these lines to make a new index from scratch, to ensure that the default index type is used. If margins is True, the index dtype always changes to 'object', because margins index value is always a string. If normalize=='index' or 'column', the margin value is removed again. As a result, an index containing only integer values has dtype object, but by default should be Int64index. I changed test_crosstab_normalize slightly and removed the dtype='object', to ensure that the validation data contains default index dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use .tolist()
, this is completely inefficient. You should instead conditionally add the margin value to the index then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll try to do it that way. Requires some changes in the crosstab function, because margins can not be calculated within pivot_table(), then.
Ok. Thank you for checking. I'll modify the error messages and add the whatsnew. Could someone comment on my point above: I think that margins are not calculated correctly when normalize is set to 'columns' or 'index'. See my example above. Should I post a bug report for that? Or is the calculation correct? |
pandas/core/reshape/pivot.py
Outdated
# reset index to ensure default index dtype | ||
if normalize == 'index': | ||
colnames = table.columns.names | ||
table.columns = Index(table.columns.tolist()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use .tolist()
, this is completely inefficient. You should instead conditionally add the margin value to the index then.
268ce49
to
bbb979c
Compare
I tried to implement the changes as requested. I also modified some values in validation data of test_margin_dropna, because the expected margin values were not correct. |
def test_crosstab_norm_margins_with_multiindex(self): | ||
# GH 15150 | ||
a = np.array(['foo', 'bar', 'foo', 'bar', 'bar', 'foo']) | ||
b = np.array(['one', 'one', 'two', 'one', 'two', 'two']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is very hard to read, make it in several sections
input data
expected
result
assert_frame_equal
then repeat for cases ....
rather than naming things expected_col_colnorm
, just name them expected, and insted put a comment for that case
@jreback |
pandas/core/reshape/pivot.py
Outdated
for level in table.index.names: | ||
if margins_name in table.index.get_level_values(level): | ||
raise ValueError(exception_msg) | ||
# could be passed a Series object with no 'columns' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line here
pandas/core/reshape/pivot.py
Outdated
|
||
if normalize != 'columns': | ||
# add margin row | ||
if type(table.index) is MultiIndex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use isinstance
pandas/core/reshape/pivot.py
Outdated
# add margin row | ||
if type(table.index) is MultiIndex: | ||
table = table.transpose() | ||
table[margins_name] = table.sum(axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will mangle the dtypes
table.loc[margins_name] = table.sum(axis=0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solution does not work: it flattens the MultiIndex.
Strangely, it works for columns. Therefore I did this workaround using transpose().
Do you have another idea how to deal with that?
import pandas as pd
import numpy as np
a = np.array(['foo', 'bar', 'foo', 'bar', 'bar', 'foo'])
b = np.array(['one', 'one', 'two', 'one', 'two', 'two'])
c = np.array(['dull', 'shiny', 'dull', 'dull', 'dull', 'shiny'])
d = np.array(['a', 'a', 'b', 'a', 'b', 'b'])
#dataframe with mutliindex columns and multiindex index
df = pd.crosstab([a, b], [c, d], normalize='columns',
margins=False)
#this works
df['all'] = df.sum(axis=1)
#this destroys the multiindex
df.loc['all'] = df.sum(axis=0)
print(df)
col_0 dull shiny all
col_1 a b a b
(bar, one) 0.5 0.0 1.0 0.0 1.5
(bar, two) 0.0 0.5 0.0 0.0 0.5
(foo, one) 0.5 0.0 0.0 0.0 0.5
(foo, two) 0.0 0.5 0.0 1.0 1.5
all 1.0 1.0 1.0 1.0 4.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a solution now:
df.loc[margins_name, :] = df.sum(axis=1)
I'll make a new commit soon.
pandas/core/reshape/pivot.py
Outdated
table.index.names = table_index_names | ||
table.columns.names = table_columns_names | ||
try: | ||
f = normalizers[normalize] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an internal error yes? (IOW is NOT exposed to the user). is there a test?
I would have just raise KeyError if it fails (IOW its not found)
@cmohl2013 Sorry for coming in the discussion now only. This said: from #15150 :
I wonder whether we couldn't directly fix (it's not just a hack - one of the two changes just works around #17024 I think - it will also fail on >2 levels... but it seems pretty simple to fix) |
@@ -263,6 +252,21 @@ def _add_margins(table, data, values, rows, cols, aggfunc, | |||
return result | |||
|
|||
|
|||
def _check_margins_name(margins_name, table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring here would be good (for developers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I can do that
…based on normalization type, corrected expected margin values in test_margin_dropna
f8e7e72
to
93cb736
Compare
Couldn't we try to solve these problems? I'd be happy to help. On the other hand, the current PR introduces a lot of code duplication which I personally would prefer to avoid. Moreover, my understanding (admittedly after just a quick glance at the changes) is that you are fixing something in (Disclaimer: those are just suggestions, I'm not a maintainer) |
@@ -311,6 +311,9 @@ Reshaping | |||
- Bug in merging with categorical dtypes with datetimelikes incorrectly raised a ``TypeError`` (:issue:`16900`) | |||
- Bug when using :func:`isin` on a large object series and large comparison array (:issue:`16012`) | |||
- Fixes regression from 0.20, :func:`Series.aggregate` and :func:`DataFrame.aggregate` allow dictionaries as return values again (:issue:`16741`) | |||
- Bug in ``pd.crosstab(normalize=True, margins=True)`` when at least one axis has a multi-index (:issue:`15150`) | |||
|
|||
>>>>>>> added whatsnew and reformatted tests to be more readable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i missed that..
In fact, it reduces the code. We get rid of the whole margins dropping and fixing that goes on in lines 568 to 606 (in the master branch). Here we have also a separate calculation of margins, independent from the calculation in pivot_table. So the PR did not introduce this duplication of margins calculation, it was there before. However, I see your point that it is not good to use different code to either calculate margins in pivot_table or crosstable. I find the calculation of margins in pivot_table not very readable and cryptic. Would it be an option to transfer the code for calculating margins how I did it in crosstable to pivot_table and rewrite _add_margins? Could well be that I miss something and that it would be not as easy. What do you think? |
not sure what to do with this PR. @toobaz want to take a look and see how we can reconcile this (and other's indicated above) |
@jreback @cmohl2013 sorry for disappearing
Cool, but this is not a guarantee that there is no (also future) duplication involved... and the fact that you call
Probably, yes. I admit I'm a bit confused (and rebasing could help): the bug you are fixing is raised in the call to Oh, by the way:
Admittedly the result looks strange and the design choice can be discussed, but it is, strictly speaking, correct. You are asking to normalize columns, which means that each column should add up to 1... including the |
closing as stale. if you'd like to continue working, pls ping. |
git diff upstream/master --name-only -- '*.py' | flake8 --diff
When debugging this issue I came across some unexpected results for margins
when normalization 'index' or 'column' is performed. Here a cross table with 'column' normalization (example from line 1271 in test_pivot.py):
I would expect, that margin values should always be the sums of rows/cols, regardless if values were normalized or not, so I would expect the following:
This is not the case for 'index' and 'column' normalization. In fact, margin values are calculated as sums of raw values and then normalized. This is fine for normalization 'all'. But for normalization 'columns' and 'index', this leads to -- at least for me -- unexpected results.
I left the calculation as it is, because this kind of behavior is validated in test_crosstab_normalize
(test_pivot.py), so maybe the calculation is wanted like that. Is it?