BUG #15150 normalization of crosstable with multiindex and margins #16599

Closed
wants to merge 7 commits into from

cmohl2013 (Contributor) commented Jun 5, 2017

When debugging this issue I came across some unexpected results for margins when 'index' or 'columns' normalization is performed. Here is a cross table with 'columns' normalization (example from line 1271 in test_pivot.py):

b    3    4  All
a               
1  0.5  0.0  0.2
2  0.5  1.0  0.8

I would expect margin values to always be the sums of the rows/columns, regardless of whether the values were normalized, so I would expect the following:

b    3    4  All
a               
1  0.5  0.0  0.5
2  0.5  1.0  1.5

This is not the case for 'index' and 'columns' normalization. In fact, margin values are calculated as sums of the raw values and then normalized. This is fine for normalization 'all', but for 'columns' and 'index' normalization it leads to results that are, at least for me, unexpected.

I left the calculation as it is, because this behavior is validated in test_crosstab_normalize (test_pivot.py), so maybe the calculation is intended to work like that. Is it?
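To make the two readings concrete, here is a minimal sketch in plain Python (no pandas). The raw counts are an assumption: they are one set of counts consistent with the normalized table above, not values taken from the test file.

```python
# Raw counts consistent with the table above (an assumption):
# rows are a=1 and a=2, columns are b=3 and b=4.
counts = [[1, 0],
          [1, 3]]

col_totals = [sum(col) for col in zip(*counts)]  # one total per column
row_totals = [sum(row) for row in counts]        # one total per row
grand_total = sum(row_totals)

# 'columns' normalization of the body of the table.
normalized = [[value / total for value, total in zip(row, col_totals)]
              for row in counts]

# Reading 1 (reported behaviour): sum the raw rows first, then normalize
# the resulting "All" column like any other column.
margins_then_normalize = [total / grand_total for total in row_totals]

# Reading 2 (what this comment expected): normalize first, then sum rows.
normalize_then_margins = [sum(row) for row in normalized]

print(normalized)              # [[0.5, 0.0], [0.5, 1.0]]
print(margins_then_normalize)  # [0.2, 0.8]
print(normalize_then_margins)  # [0.5, 1.5]
```

The first reading reproduces the `All` column of the first table, the second reading reproduces the `All` column of the second one.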

codecov bot commented Jun 5, 2017

Codecov Report

Merging #16599 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16599      +/-   ##
==========================================
- Coverage   90.93%   90.93%   -0.01%     
==========================================
  Files         161      161              
  Lines       49267    49253      -14     
==========================================
- Hits        44800    44787      -13     
+ Misses       4467     4466       -1
Flag Coverage Δ
#multiple 88.69% <100%> (-0.01%) ⬇️
#single 40.23% <0%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/reshape/pivot.py 95.2% <100%> (+0.11%) ⬆️

Last update bf99975...268ce49.

codecov bot commented Jun 5, 2017

Codecov Report

Merging #16599 into master will increase coverage by 1.57%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16599      +/-   ##
==========================================
+ Coverage   91.02%    92.6%   +1.57%     
==========================================
  Files         161      161              
  Lines       49393    63210   +13817     
==========================================
+ Hits        44962    58536   +13574     
- Misses       4431     4674     +243
Flag Coverage Δ
#multiple 90.42% <100%> (+1.64%) ⬆️
#single 44.48% <4.65%> (+4.14%) ⬆️
Impacted Files Coverage Δ
pandas/core/reshape/pivot.py 97.51% <100%> (+2.32%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/util/_print_versions.py 14.28% <0%> (-1.43%) ⬇️
pandas/core/config_init.py 95.33% <0%> (-0.57%) ⬇️
pandas/core/computation/eval.py 96.47% <0%> (-0.46%) ⬇️
pandas/core/common.py 91.26% <0%> (-0.18%) ⬇️
pandas/core/indexes/category.py 98.53% <0%> (+0.02%) ⬆️
pandas/core/panel.py 97.03% <0%> (+0.11%) ⬆️
pandas/tseries/frequencies.py 96.87% <0%> (+0.2%) ⬆️
pandas/core/categorical.py 95.82% <0%> (+0.36%) ⬆️
... and 40 more

Last update 7930202...93cb736.

TomAugspurger (Contributor) left a comment

Looks good overall. Could you add a release note in whatsnew/0.21.0?

        table_index_names = table.index.names
        table_columns_names = table.columns.names
    else:
        raise ValueError("Not a valid margins argument")

Contributor

Maybe include the invalid argument in the error

ValueError("Not a valid margins argument: {!r}".format(margins))
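A small sketch of why `{!r}` helps in such a message: the repr makes it visible that a string was passed where a boolean was expected. The `check_margins` helper is hypothetical and only illustrates the formatting; it is not the check used in the PR.

```python
def check_margins(margins):
    """Hypothetical validator: accept only a real boolean for `margins`."""
    if not isinstance(margins, bool):
        # {!r} inserts repr(margins), so quotes around strings are kept.
        raise ValueError("Not a valid margins argument: {!r}".format(margins))
    return margins


try:
    check_margins("True")  # a string, not the boolean True
except ValueError as err:
    message = str(err)

print(message)  # Not a valid margins argument: 'True'
```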

@@ -1266,21 +1266,18 @@ def test_crosstab_normalize(self):
[0.25, 0.75],
[0.4, 0.6]],
index=pd.Index([1, 2, 'All'],
name='a',
dtype='object'),

Contributor

What are the changes here? They look fine, as object is inferred, just wondering.

Contributor

Ah, I see down below where it does change from object. I think that's ok though.

Contributor Author

yes, a bit confusing..see above comment where I tried to explain it

    try:
        f = normalizers[normalize]
    except KeyError:
        raise ValueError("Not a valid normalize argument")

Contributor

Same here, if you don't mind.
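For readers unfamiliar with the pattern in the snippet above, here is a minimal sketch of a dispatch dictionary whose lookup failure is turned into a `ValueError` that names the offending argument. It works on plain lists of lists; `normalize_counts` and its inner functions are hypothetical names, not the pandas implementation, and rows or columns that sum to zero are not handled.

```python
def normalize_counts(counts, normalize):
    """Hypothetical helper: normalize a list-of-lists table via a dispatch dict."""
    def by_all(rows):
        total = sum(sum(row) for row in rows)
        return [[value / total for value in row] for row in rows]

    def by_index(rows):
        return [[value / sum(row) for value in row] for row in rows]

    def by_columns(rows):
        totals = [sum(col) for col in zip(*rows)]
        return [[value / total for value, total in zip(row, totals)]
                for row in rows]

    normalizers = {"all": by_all, "index": by_index, "columns": by_columns}
    normalizers[True] = normalizers["all"]  # True is treated like 'all'

    try:
        func = normalizers[normalize]
    except KeyError:
        # Include the invalid value so the caller can see what was passed.
        raise ValueError(
            "Not a valid normalize argument: {!r}".format(normalize))
    return func(counts)


print(normalize_counts([[1, 0], [1, 3]], "index"))  # [[1.0, 0.0], [0.25, 0.75]]
```

Passing a misspelled value such as `"column"` then fails with a message that shows the misspelling.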

    # reset index to ensure default index dtype
    if normalize == 'index':
        colnames = table.columns.names
        table.columns = Index(table.columns.tolist())

Contributor

Is the tolist necessary?

Contributor Author

I added these lines to build a new index from scratch, to ensure that the default index dtype is used. If margins is True, the index dtype always changes to 'object', because the margins index value is always a string. If normalize is 'index' or 'columns', the margin value is removed again. As a result, an index containing only integer values has dtype object, although by default it should be an Int64Index. I changed test_crosstab_normalize slightly and removed the dtype='object', to ensure that the validation data contains default index dtypes.
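The dtype effect described here can be reproduced on a bare index. This is a minimal sketch, assuming pandas is installed; the index values are illustrative and the snippet is not taken from the PR.

```python
import pandas as pd

# Appending a string label such as 'All' forces an object-dtype index.
with_margin = pd.Index([1, 2, "All"], name="a")

# Dropping the label again does not restore the integer dtype.
dropped = with_margin.drop("All")

# Rebuilding from plain Python values lets pandas infer the dtype afresh.
rebuilt = pd.Index(dropped.tolist(), name=dropped.name)

print(with_margin.dtype, dropped.dtype, rebuilt.dtype)
```

Only the rebuilt index gets an integer dtype back, which is what the `Index(table.columns.tolist())` line relies on.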

Contributor

don't use .tolist(), this is completely inefficient. You should instead conditionally add the margin value to the index then.

Contributor Author

ok, I'll try to do it that way. Requires some changes in the crosstab function, because margins can not be calculated within pivot_table(), then.

TomAugspurger added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label Jun 8, 2017
TomAugspurger added this to the 0.21.0 milestone Jun 8, 2017

cmohl2013 (Contributor Author) commented

Ok. Thank you for checking. I'll modify the error messages and add the whatsnew.

Could someone comment on my point above: I think that margins are not calculated correctly when normalize is set to 'columns' or 'index'. See my example above. Should I post a bug report for that? Or is the calculation correct?

cmohl2013 (Contributor Author) commented

I tried to implement the changes as requested. I also modified some values in validation data of test_margin_dropna, because the expected margin values were not correct.

def test_crosstab_norm_margins_with_multiindex(self):
    # GH 15150
    a = np.array(['foo', 'bar', 'foo', 'bar', 'bar', 'foo'])
    b = np.array(['one', 'one', 'two', 'one', 'two', 'two'])

jreback (Contributor), Jun 14, 2017

this test is very hard to read, make it in several sections

input data
expected
result
assert_frame_equal

then repeat for cases ....

rather than naming things expected_col_colnorm, just name them expected, and instead put a comment for that case
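The layout being asked for could look roughly like the sketch below: one block of input data, then `expected`, `result` and `assert_frame_equal` repeated per case, with a comment naming the case. The data and the test name are made up for illustration and are much simpler than the MultiIndex case in the PR.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def test_crosstab_normalize_sketch():
    # input data
    a = pd.Series([1, 1, 2, 2], name="a")
    b = pd.Series([3, 4, 3, 3], name="b")

    # case: normalize over each row
    expected = pd.DataFrame([[0.5, 0.5], [1.0, 0.0]],
                            index=pd.Index([1, 2], name="a"),
                            columns=pd.Index([3, 4], name="b"))
    result = pd.crosstab(a, b, normalize="index")
    assert_frame_equal(result, expected)

    # case: normalize over each column
    expected = pd.DataFrame([[1 / 3, 1.0], [2 / 3, 0.0]],
                            index=pd.Index([1, 2], name="a"),
                            columns=pd.Index([3, 4], name="b"))
    result = pd.crosstab(a, b, normalize="columns")
    assert_frame_equal(result, expected)


test_crosstab_normalize_sketch()
```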

cmohl2013 (Contributor Author) commented

@jreback
I reformatted the test to be more readable, as you requested. is it ok now?

    for level in table.index.names:
        if margins_name in table.index.get_level_values(level):
            raise ValueError(exception_msg)
    # could be passed a Series object with no 'columns'

Contributor

blank line here


    if normalize != 'columns':
        # add margin row
        if type(table.index) is MultiIndex:

Contributor

use isinstance
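A short sketch of the suggested check, assuming pandas is installed. Unlike `type(x) is MultiIndex`, `isinstance` also accepts subclasses. The helper name `has_multiindex_rows` is hypothetical.

```python
import pandas as pd


def has_multiindex_rows(table):
    """Hypothetical helper: True if the row index is a MultiIndex (or subclass)."""
    return isinstance(table.index, pd.MultiIndex)


flat = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
nested = pd.DataFrame(
    {"x": [1, 2]},
    index=pd.MultiIndex.from_tuples([("a", "one"), ("a", "two")]))

print(has_multiindex_rows(flat), has_multiindex_rows(nested))  # False True
```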

        # add margin row
        if type(table.index) is MultiIndex:
            table = table.transpose()
            table[margins_name] = table.sum(axis=1)

Contributor

this will mangle the dtypes

table.loc[margins_name] = table.sum(axis=0)

Contributor Author

This solution does not work: it flattens the MultiIndex.
Strangely, it works for columns. Therefore I did this workaround using transpose().
Do you have another idea how to deal with that?

import pandas as pd
import numpy as np
a = np.array(['foo', 'bar', 'foo', 'bar', 'bar', 'foo'])
b = np.array(['one', 'one', 'two', 'one', 'two', 'two'])
c = np.array(['dull', 'shiny', 'dull', 'dull', 'dull', 'shiny'])
d = np.array(['a', 'a', 'b', 'a', 'b', 'b'])

# DataFrame with MultiIndex columns and MultiIndex index
df = pd.crosstab([a, b], [c, d], normalize='columns',
                 margins=False)

# this works
df['all'] = df.sum(axis=1)

# this destroys the MultiIndex
df.loc['all'] = df.sum(axis=0)


print(df)
col_0      dull      shiny       all
col_1         a    b     a    b     
(bar, one)  0.5  0.0   1.0  0.0  1.5
(bar, two)  0.0  0.5   0.0  0.0  0.5
(foo, one)  0.5  0.0   0.0  0.0  0.5
(foo, two)  0.0  0.5   0.0  1.0  1.5
all         1.0  1.0   1.0  1.0  4.0

Contributor Author

I have a solution now:

df.loc[margins_name, :] = df.sum(axis=1)

I'll make a new commit soon.

see here
https://stackoverflow.com/questions/44949953/how-to-add-a-row-to-a-pandas-dataframe-without-flattening-the-multiindex
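For comparison, here is a different way to append a margin row while keeping the row MultiIndex, using `pd.concat` rather than `.loc` assignment. It is an alternative sketch, not the approach from this PR or from the linked question, and the `("All", "")` key is an illustrative choice.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("foo", "one")], names=["a", "b"])
table = pd.DataFrame({"dull": [1, 0, 2], "shiny": [0, 3, 1]}, index=index)

# Build the margin as a one-row frame whose index is itself a MultiIndex
# with one entry per level, so concatenation has nothing to flatten.
margin = table.sum(axis=0).to_frame().T
margin.index = pd.MultiIndex.from_tuples([("All", "")],
                                         names=table.index.names)

with_margin = pd.concat([table, margin])

print(with_margin)
```

Because both pieces already carry a two-level index with the same names, the result keeps a two-level row index.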

    table.index.names = table_index_names
    table.columns.names = table_columns_names
    try:
        f = normalizers[normalize]

Contributor

this is an internal error, yes? (IOW it is NOT exposed to the user). Is there a test?
I would have just raised KeyError if it fails (IOW it's not found)

jreback (Contributor) commented Jul 19, 2017

@gfyoung can you have a look
@toobaz can you have a look

toobaz (Member) commented Jul 19, 2017

@cmohl2013 Sorry for joining the discussion only now.

That said, from #15150:

Possible solution: calling pivot_table in crosstab always with margins=False, then
do normalization and finally call _add_margins, if margins=True.

I wonder whether we couldn't directly fix pivot_table... as an ugly hack, the following seems to work: https://github.com/pandas-dev/pandas/compare/master...toobaz:crosstab_hack?expand=1

(it's not just a hack - one of the two changes just works around #17024 I think - it will also fail on >2 levels... but it seems pretty simple to fix)

@@ -263,6 +252,21 @@ def _add_margins(table, data, values, rows, cols, aggfunc,
return result


def _check_margins_name(margins_name, table):

Member

Docstring here would be good (for developers)

Contributor Author

yes, I can do that

cmohl2013 (Contributor Author) commented

@toobaz Yes, fixing pivot_table was my original plan as commented in #15150, but I did not manage to come up with a good solution, so I implemented adding margins in crosstable directly.
Your solution looks good to me (and can be adjusted as soon as #17024 is fixed).

cmohl2013 (Contributor Author) commented

@toobaz
I tried your solution, but then ran into problems with concat and MultiIndex. So I ended up with my previous solution and added the function _add_margins_to_multiindex as workaround for #17024. I left a comment in the code that the workaround should be removed when #17024 is solved.

toobaz (Member) commented Jul 25, 2017

I tried your solution, but then ran into problems with concat and MultiIndex

Couldn't we try to solve these problems? I'd be happy to help.

On the other hand, the current PR introduces a lot of code duplication which I personally would prefer to avoid. Moreover, my understanding (admittedly after just a quick glance at the changes) is that you are fixing something in crosstable but the same exact problem will remain in pivot_table, while if you fix it in pivot_table then crosstable will automatically benefit.

(Disclaimer: those are just suggestions, I'm not a maintainer)

@@ -311,6 +311,9 @@ Reshaping
- Bug in merging with categorical dtypes with datetimelikes incorrectly raised a ``TypeError`` (:issue:`16900`)
- Bug when using :func:`isin` on a large object series and large comparison array (:issue:`16012`)
- Fixes regression from 0.20, :func:`Series.aggregate` and :func:`DataFrame.aggregate` allow dictionaries as return values again (:issue:`16741`)
- Bug in ``pd.crosstab(normalize=True, margins=True)`` when at least one axis has a multi-index (:issue:`15150`)

>>>>>>> added whatsnew and reformatted tests to be more readable

Member

Remove this.

Contributor Author

oh i missed that..

jreback removed this from the 0.21.0 milestone Jul 26, 2017

cmohl2013 (Contributor Author) commented

@toobaz

On the other hand, the current PR introduces a lot of code duplication which I personally would prefer to avoid.

In fact, it reduces the code. We get rid of the whole margins dropping and fixing that goes on in lines 568 to 606 (in the master branch). Here we have also a separate calculation of margins, independent from the calculation in pivot_table. So the PR did not introduce this duplication of margins calculation, it was there before.

However, I see your point that it is not good to use different code to calculate margins in pivot_table and in crosstable. I find the calculation of margins in pivot_table cryptic and not very readable. Would it be an option to transfer the code for calculating margins how I did it in crosstable to pivot_table and rewrite _add_margins? It could well be that I am missing something and that it would not be that easy. What do you think?

jreback (Contributor) commented Sep 23, 2017

not sure what to do with this PR. @toobaz want to take a look and see how we can reconcile this (and the others indicated above)?

toobaz (Member) commented Sep 24, 2017

@jreback @cmohl2013 sorry for disappearing

In fact, it reduces the code

Cool, but this is not a guarantee that there is no (also future) duplication involved... and the fact that you call pivot_table with margins=False looks to me conceptually wrong (the if margins: part really mostly replicates stuff from pivot_table).

Would it be an option to transfer the code for calculating margins how I did it in crosstable to pivot_table and rewrite _add_margins?

Probably, yes. I admit I'm a bit confused (and rebasing could help): the bug you are fixing is raised in the call to _normalize inside crosstab. So either 1) _normalize must be fixed, or 2) the input that crosstab passes to _normalize is in some way wrong (or both). Now my approach just fixed _normalize, and apparently worked. So I tend to exclude that 2) is a problem. However, most of your changes take effect before the call to _normalize. Maybe some of them are good, but I fail to understand their purpose. Are they unrelated to the bug?

Oh, by the way:

When debugging this issue I came across some unexpected results for margins

Admittedly the result looks strange and the design choice can be discussed, but it is, strictly speaking, correct. You are asking to normalize columns, which means that each column should add up to 1... including the All column. But if you think this is worth reconsidering (i.e. "first normalize, then calculate margins"), I suggest doing so in a separate issue.
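The "each column adds up to 1" reading can be checked directly. This is a minimal sketch, assuming a reasonably recent pandas; the input series are an assumption, chosen so that the result matches the table quoted in the first comment.

```python
import pandas as pd

# Illustrative data (an assumption): counts are (1,3)=1, (2,3)=1, (2,4)=3.
a = pd.Series([1, 2, 2, 2, 2], name="a")
b = pd.Series([3, 3, 4, 4, 4], name="b")

table = pd.crosstab(a, b, normalize="columns", margins=True)

# With 'columns' normalization every column, including 'All',
# is divided by its own total, so each column should sum to 1.
column_sums = table.sum(axis=0)

print(table)
print(column_sums)
```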

jreback (Contributor) commented Nov 10, 2017

closing as stale. if you'd like to continue working, pls ping.

jreback closed this Nov 10, 2017

Labels: Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Successfully merging this pull request may close these issues:

BUG: crosstab cannot normalize multiple columns for the index

5 participants