BUG #15150 normalization of crosstable with multiindex and margins #16599

cmohl2013 · 2017-06-05T11:00:18Z

closes BUG: crosstab cannot normalize multiple columns for the index #15150
tests added / passed
passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
whatsnew entry

When debugging this issue I came across some unexpected results for margins when 'index' or 'columns' normalization is performed. Here is a cross table with 'columns' normalization (example from line 1271 in test_pivot.py):

b    3    4  All
a               
1  0.5  0.0  0.2
2  0.5  1.0  0.8

I would expect margin values to always be the sums of rows/columns, regardless of whether the values were normalized, so I would expect the following:

b    3    4  All
a               
1  0.5  0.0  0.5
2  0.5  1.0  1.5

This is not the case for 'index' and 'columns' normalization. In fact, margin values are calculated as sums of the raw values and then normalized. This is fine for normalize='all', but for 'columns' and 'index' it leads to results that are, at least for me, unexpected.

I left the calculation as it is, because this behavior is validated in test_crosstab_normalize (test_pivot.py), so maybe the calculation is intended to work like that. Is it?
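
The two ways of computing the 'All' column described in this comment can be compared by hand. This is only a minimal sketch: the data is chosen to reproduce the table shown here, the variable names are illustrative, and the margins are computed manually rather than through margins=True.

```python
import pandas as pd

# Data reproducing the table in this comment:
# counts are (a=1, b=3) -> 1, (a=2, b=3) -> 1, (a=2, b=4) -> 3.
df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4]})

counts = pd.crosstab(df['a'], df['b'])         # raw counts, no margins
col_normalized = counts / counts.sum(axis=0)   # each column sums to 1

# Behaviour described as current: sum the raw counts per row, then normalize.
row_totals = counts.sum(axis=1)
margin_from_raw = row_totals / row_totals.sum()

# Behaviour the author expected: sum the already-normalized values per row.
margin_from_normalized = col_normalized.sum(axis=1)

print(margin_from_raw.tolist())          # [0.2, 0.8]
print(margin_from_normalized.tolist())   # [0.5, 1.5]
```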

codecov · 2017-06-05T11:51:27Z

Codecov Report

Merging #16599 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16599      +/-   ##
==========================================
- Coverage   90.93%   90.93%   -0.01%     
==========================================
  Files         161      161              
  Lines       49267    49253      -14     
==========================================
- Hits        44800    44787      -13     
+ Misses       4467     4466       -1

Flag	Coverage Δ
#multiple	`88.69% <100%> (-0.01%)`	⬇️
#single	`40.23% <0%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/reshape/pivot.py	`95.2% <100%> (+0.11%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bf99975...268ce49.

codecov · 2017-06-05T11:51:30Z

Codecov Report

Merging #16599 into master will increase coverage by 1.57%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16599      +/-   ##
==========================================
+ Coverage   91.02%    92.6%   +1.57%     
==========================================
  Files         161      161              
  Lines       49393    63210   +13817     
==========================================
+ Hits        44962    58536   +13574     
- Misses       4431     4674     +243

Flag	Coverage Δ
#multiple	`90.42% <100%> (+1.64%)`	⬆️
#single	`44.48% <4.65%> (+4.14%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/reshape/pivot.py	`97.51% <100%> (+2.32%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/util/_print_versions.py	`14.28% <0%> (-1.43%)`	⬇️
pandas/core/config_init.py	`95.33% <0%> (-0.57%)`	⬇️
pandas/core/computation/eval.py	`96.47% <0%> (-0.46%)`	⬇️
pandas/core/common.py	`91.26% <0%> (-0.18%)`	⬇️
pandas/core/indexes/category.py	`98.53% <0%> (+0.02%)`	⬆️
pandas/core/panel.py	`97.03% <0%> (+0.11%)`	⬆️
pandas/tseries/frequencies.py	`96.87% <0%> (+0.2%)`	⬆️
pandas/core/categorical.py	`95.82% <0%> (+0.36%)`	⬆️
... and 40 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7930202...93cb736.

TomAugspurger

Looks good overall. Could you add a release note in whatsnew/0.21.0?

TomAugspurger · 2017-06-08T21:02:12Z

pandas/core/reshape/pivot.py

-        table_index_names = table.index.names
-        table_columns_names = table.columns.names
+    else:
+        raise ValueError("Not a valid margins argument")


Maybe include the invalid argument in the error

ValueError("Not a valid margins argument: {!r}".format(margins))

TomAugspurger · 2017-06-08T21:05:11Z

pandas/tests/reshape/test_pivot.py

@@ -1266,21 +1266,18 @@ def test_crosstab_normalize(self):
                                           [0.25, 0.75],
                                           [0.4, 0.6]],
                                          index=pd.Index([1, 2, 'All'],
-                                                         name='a',
-                                                         dtype='object'),


What are the changes here? They look fine, as object is inferred, just wondering.

Ah, I see down below where it does change from object. I think that's ok though.

yes, a bit confusing..see above comment where I tried to explain it

TomAugspurger · 2017-06-08T21:11:27Z

pandas/core/reshape/pivot.py

+    try:
+        f = normalizers[normalize]
+    except KeyError:
+        raise ValueError("Not a valid normalize argument")


Same here, if you don't mind.
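
The suggestion to include the invalid argument in the message could look roughly like the sketch below. The helper name get_normalizer and the three lambda normalizers are assumptions made for illustration; only the try/except lookup and the message format come from the review comments.

```python
def get_normalizer(normalize):
    """Hypothetical helper: look up a normalizing function by name."""
    normalizers = {
        'all': lambda t: t / t.sum().sum(),
        'columns': lambda t: t / t.sum(axis=0),
        'index': lambda t: t.div(t.sum(axis=1), axis=0),
    }
    try:
        return normalizers[normalize]
    except KeyError:
        # {!r} shows the offending value with quotes, so 'rows' and rows
        # (an unexpected object) are easy to tell apart in the message.
        raise ValueError(
            "Not a valid normalize argument: {!r}".format(normalize))


try:
    get_normalizer('rows')
except ValueError as err:
    print(err)  # Not a valid normalize argument: 'rows'
```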

TomAugspurger · 2017-06-08T21:15:51Z

pandas/core/reshape/pivot.py

+        # reset index to ensure default index dtype
+        if normalize == 'index':
+            colnames = table.columns.names
+            table.columns = Index(table.columns.tolist())


Is the tolist necessary?

I added these lines to make a new index from scratch, to ensure that the default index type is used. If margins is True, the index dtype always changes to 'object', because the margins index value is always a string. If normalize is 'index' or 'columns', the margin value is removed again. As a result, an index containing only integer values has dtype object, although by default it should be an Int64Index. I changed test_crosstab_normalize slightly and removed the dtype='object', to ensure that the validation data contains default index dtypes.

don't use .tolist(), this is completely inefficient. You should instead conditionally add the margin value to the index then.

ok, I'll try to do it that way. Requires some changes in the crosstab function, because margins can not be calculated within pivot_table(), then.
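
The dtype issue discussed in this thread can be reproduced on a bare Index. This is a small sketch under the assumption that the margin label is 'All' and the remaining labels are integers; it illustrates the behaviour and is not the code from the patch.

```python
import pandas as pd

# Mixing integer labels with the string margin label forces object dtype.
with_margin = pd.Index([1, 2, 'All'], name='a')

# Dropping the margin label leaves only integers, but the dtype is not
# re-inferred: the index is still object.
without_margin = with_margin.drop('All')
print(with_margin.dtype, without_margin.dtype)

# Rebuilding from a Python list (the .tolist() approach questioned above)
# re-infers an integer dtype, at the cost of a full copy through Python objects.
rebuilt = pd.Index(without_margin.tolist(), name=without_margin.name)
print(rebuilt.dtype)

# If the margin label is never added to the integer index in the first
# place, no round trip is needed.
plain = pd.Index([1, 2], name='a')
print(rebuilt.equals(plain))
```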

cmohl2013 · 2017-06-09T08:49:02Z

Ok. Thank you for checking. I'll modify the error messages and add the whatsnew.

Could someone comment on my point above: I think that margins are not calculated correctly when normalize is set to 'columns' or 'index'. See my example above. Should I post a bug report for that? Or is the calculation correct?

jreback · 2017-06-09T10:25:32Z

pandas/core/reshape/pivot.py

+        # reset index to ensure default index dtype
+        if normalize == 'index':
+            colnames = table.columns.names
+            table.columns = Index(table.columns.tolist())


don't use .tolist(), this is completely inefficient. You should instead conditionally add the margin value to the index then.

cmohl2013 · 2017-06-14T10:11:14Z

I tried to implement the changes as requested. I also modified some values in validation data of test_margin_dropna, because the expected margin values were not correct.

jreback · 2017-06-14T10:21:39Z

pandas/tests/reshape/test_pivot.py

+    def test_crosstab_norm_margins_with_multiindex(self):
+        # GH 15150
+        a = np.array(['foo', 'bar', 'foo', 'bar', 'bar', 'foo'])
+        b = np.array(['one', 'one', 'two', 'one', 'two', 'two'])


this test is very hard to read, make it in several sections

input data
expected
result
assert_frame_equal

then repeat for cases ....

rather than naming things expected_col_colnorm, just name them expected, and instead put a comment for that case
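
A test laid out in the sections requested above might look like the following sketch. The function name and the small input arrays are made up for illustration; they are not the data used in the PR's test.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def test_crosstab_normalize_sketch():
    # input data
    a = np.array([1, 1, 2, 2], dtype='int64')
    b = np.array([3, 4, 3, 3], dtype='int64')

    # case: normalize over the index (each row sums to 1)
    expected = pd.DataFrame([[0.5, 0.5], [1.0, 0.0]],
                            index=pd.Index([1, 2], name='a'),
                            columns=pd.Index([3, 4], name='b'))
    result = pd.crosstab(a, b, rownames=['a'], colnames=['b'],
                         normalize='index')
    assert_frame_equal(result, expected)

    # case: normalize over the columns (each column sums to 1)
    expected = pd.DataFrame([[1.0 / 3, 1.0], [2.0 / 3, 0.0]],
                            index=pd.Index([1, 2], name='a'),
                            columns=pd.Index([3, 4], name='b'))
    result = pd.crosstab(a, b, rownames=['a'], colnames=['b'],
                         normalize='columns')
    assert_frame_equal(result, expected)
```

Reusing the names expected and result in every case keeps each block readable on its own, with the comment above it stating which case is being checked.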

cmohl2013 · 2017-06-29T14:48:32Z

@jreback
I reformatted the test to be more readable, as you requested. is it ok now?

jreback · 2017-07-01T08:19:09Z

pandas/core/reshape/pivot.py

+    for level in table.index.names:
+        if margins_name in table.index.get_level_values(level):
+            raise ValueError(exception_msg)
+    # could be passed a Series object with no 'columns'


blank line here

jreback · 2017-07-01T08:19:35Z

pandas/core/reshape/pivot.py

+
+        if normalize != 'columns':
+            # add margin row
+            if type(table.index) is MultiIndex:


use isinstance

jreback · 2017-07-01T08:20:35Z

pandas/core/reshape/pivot.py

+            # add margin row
+            if type(table.index) is MultiIndex:
+                table = table.transpose()
+                table[margins_name] = table.sum(axis=1)


this will mangle the dtypes

table.loc[margins_name] = table.sum(axis=0)

This solution does not work: it flattens the MultiIndex.
Strangely, it works for columns. Therefore I did this workaround using transpose().
Do you have another idea how to deal with that?

import pandas as pd
import numpy as np

a = np.array(['foo', 'bar', 'foo', 'bar', 'bar', 'foo'])
b = np.array(['one', 'one', 'two', 'one', 'two', 'two'])
c = np.array(['dull', 'shiny', 'dull', 'dull', 'dull', 'shiny'])
d = np.array(['a', 'a', 'b', 'a', 'b', 'b'])

# dataframe with multiindex columns and multiindex index
df = pd.crosstab([a, b], [c, d], normalize='columns', margins=False)

# this works
df['all'] = df.sum(axis=1)

# this destroys the multiindex
df.loc['all'] = df.sum(axis=0)

print(df)

col_0       dull       shiny        all
col_1          a    b      a    b
(bar, one)   0.5  0.0    1.0  0.0   1.5
(bar, two)   0.0  0.5    0.0  0.0   0.5
(foo, one)   0.5  0.0    0.0  0.0   0.5
(foo, two)   0.0  0.5    0.0  1.0   1.5
all          1.0  1.0    1.0  1.0   4.0

I have a solution now:

df.loc[margins_name, :] = df.sum(axis=1)

I'll make a new commit soon.

see here
https://stackoverflow.com/questions/44949953/how-to-add-a-row-to-a-pandas-dataframe-without-flattening-the-multiindex
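
For comparison, one way to append a margin row without relying on .loc enlargement is to build the row as a one-row frame with its own MultiIndex and concatenate it. This is a sketch of an alternative, not the approach taken in this PR; the ('All', '') key is an assumed label for the margin row.

```python
import numpy as np
import pandas as pd

a = np.array(['foo', 'bar', 'foo', 'bar', 'bar', 'foo'])
b = np.array(['one', 'one', 'two', 'one', 'two', 'two'])
c = np.array(['dull', 'shiny', 'dull', 'dull', 'dull', 'shiny'])
d = np.array(['a', 'a', 'b', 'a', 'b', 'b'])

# MultiIndex on both axes, normalized so that each column sums to 1.
table = pd.crosstab([a, b], [c, d], normalize='columns')

# Column sums as a one-row frame; the transpose keeps the column MultiIndex.
margin = table.sum(axis=0).to_frame().T

# Give the row a full two-level key so it matches the index of `table`.
margin.index = pd.MultiIndex.from_tuples([('All', '')],
                                         names=table.index.names)

result = pd.concat([table, margin])
print(result)
print(type(result.index).__name__)
```

Because both frames carry an index with the same number of levels, the concatenated index stays a MultiIndex instead of being flattened to tuples.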

jreback · 2017-07-01T08:22:01Z

pandas/core/reshape/pivot.py

-        table.index.names = table_index_names
-        table.columns.names = table_columns_names
+    try:
+        f = normalizers[normalize]


this is an internal error yes? (IOW is NOT exposed to the user). is there a test?
I would have just raise KeyError if it fails (IOW its not found)

jreback · 2017-07-19T10:34:19Z

@gfyoung can you have a look
@toobaz can you have a look

toobaz · 2017-07-19T13:58:46Z

@cmohl2013 Sorry for joining the discussion only now.

This said: from #15150 :

Possible solution: calling pivot_table in crosstab always with margins=False, then
do normalization and finally call _add_margins, if margins=True.

I wonder whether we couldn't directly fix pivot_table... as an ugly hack, the following seems to work: https://github.com/pandas-dev/pandas/compare/master...toobaz:crosstab_hack?expand=1

(it's not just a hack - one of the two changes just works around #17024 I think - it will also fail on >2 levels... but it seems pretty simple to fix)

gfyoung · 2017-07-19T15:23:55Z

pandas/core/reshape/pivot.py

@@ -263,6 +252,21 @@ def _add_margins(table, data, values, rows, cols, aggfunc,
    return result


+def _check_margins_name(margins_name, table):


Docstring here would be good (for developers)

yes, I can do that

cmohl2013 · 2017-07-20T08:07:12Z

@toobaz Yes, fixing pivot_table was my original plan as commented in #15150, but I did not manage to come up with a good solution, so I implemented adding margins in crosstable directly.
Your solution looks good to me (and can be adjusted as soon as #17024 is fixed).

cmohl2013 · 2017-07-25T18:04:58Z

@toobaz
I tried your solution, but then ran into problems with concat and MultiIndex. So I ended up with my previous solution and added the function _add_margins_to_multiindex as workaround for #17024. I left a comment in the code that the workaround should be removed when #17024 is solved.

toobaz · 2017-07-25T22:46:32Z

I tried your solution, but then ran into problems with concat and MultiIndex

Couldn't we try to solve these problems? I'd be happy to help.

On the other hand, the current PR introduces a lot of code duplication which I personally would prefer to avoid. Moreover, my understanding (admittedly after just a quick glance at the changes) is that you are fixing something in crosstable but the same exact problem will remain in pivot_table, while if you fix it in pivot_table then crosstable will automatically benefit.

(Disclaimer: those are just suggestions, I'm not a maintainer)

gfyoung · 2017-07-25T22:49:01Z

doc/source/whatsnew/v0.21.0.txt

@@ -311,6 +311,9 @@ Reshaping
 - Bug in merging with categorical dtypes with datetimelikes incorrectly raised a ``TypeError`` (:issue:`16900`)
 - Bug when using :func:`isin` on a large object series and large comparison array (:issue:`16012`)
 - Fixes regression from 0.20, :func:`Series.aggregate` and :func:`DataFrame.aggregate` allow dictionaries as return values again (:issue:`16741`)
+- Bug in ``pd.crosstab(normalize=True, margins=True)`` when at least one axis has a multi-index (:issue:`15150`)
+
+>>>>>>> added whatsnew and reformatted tests to be more readable


Remove this.

oh i missed that..

cmohl2013 · 2017-07-26T11:28:08Z

@toobaz

On the other hand, the current PR introduces a lot of code duplication which I personally would prefer to avoid.

In fact, it reduces the code. We get rid of the whole margins dropping and fixing that goes on in lines 568 to 606 (in the master branch). Here we have also a separate calculation of margins, independent from the calculation in pivot_table. So the PR did not introduce this duplication of margins calculation, it was there before.

However, I see your point that it is not good to use different code to calculate margins in pivot_table and in crosstable. I find the calculation of margins in pivot_table cryptic and not very readable. Would it be an option to transfer the code for calculating margins how I did it in crosstable to pivot_table and rewrite _add_margins? It could well be that I am missing something and that it would not be that easy. What do you think?

jreback · 2017-09-23T16:56:06Z

not sure what to do with this PR. @toobaz want to take a look and see how we can reconcile this (and the others indicated above)

toobaz · 2017-09-24T10:46:31Z

@jreback @cmohl2013 sorry for disappearing

In fact, it reduces the code

Cool, but this is not a guarantee that there is no (also future) duplication involved... and the fact that you call pivot_table with margins=False looks to me conceptually wrong (the if margins: part really mostly replicates stuff from pivot_table).

Would it be an option to transfer the code for calculating margins how I did it in crosstable to pivot_table and rewrite _add_margins?

Probably, yes. I admit I'm a bit confused (and rebasing could help): the bug you are fixing is raised in the call to _normalize inside crosstab. So either 1) _normalize must be fixed, or 2) the input that crosstab passes to _normalize is in some way wrong (or both). Now my approach just fixed _normalize, and apparently worked. So I tend to exclude that 2) is a problem. However, most of your changes take effect before the call to _normalize. Maybe some of them are good, but I fail to understand their purpose. Are they unrelated to the bug?

Oh, by the way:

When debugging this issue I came across some unexpected results for margins

Admittedly the result looks strange and the design choice can be discussed, but it is, strictly speaking, correct. You are asking to normalize columns, which means that each column should add up to 1... including the All column. But if you think this is worth reconsidering (i.e. "first normalize, then calculate margins"), I suggest to do so in a separate issue.
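
The point that every column, including All, adds up to 1 can be checked directly. The sketch below assumes the same data as the table quoted from the PR description and relies on the existing behaviour of normalize='columns' combined with margins=True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2], 'b': [3, 3, 4, 4, 4]})

# Column normalization with margins: the All column holds the row totals
# of the raw counts, divided by the grand total.
result = pd.crosstab(df['a'], df['b'], normalize='columns', margins=True)
print(result)

# Every column, including the All margin, sums to 1.
column_sums = result.sum(axis=0)
print(np.allclose(column_sums.values, 1.0))
```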

jreback · 2017-11-10T20:19:16Z

closing as stale. if you'd like to continue working, pls ping.

TomAugspurger reviewed Jun 8, 2017

TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 8, 2017

TomAugspurger added this to the 0.21.0 milestone Jun 8, 2017

jreback requested changes Jun 9, 2017

cmohl2013 force-pushed the crosstable_norm branch from 268ce49 to bbb979c Compare June 14, 2017 09:37

jreback reviewed Jun 14, 2017

jreback requested changes Jul 1, 2017

toobaz mentioned this pull request Jul 19, 2017

.loc with partial key flattens MultiIndex on index #17024

Open

gfyoung reviewed Jul 19, 2017

Christoph Möhl added 7 commits July 25, 2017 15:49

worked on _normalize function

46d711e

modified crosstable normalization for multi index data

d746d0d

added test for crosstab norm with multiindex

2e1f5d7

BUG GH15150 crosstable normalize with multiindex

eeb7416

pandas-dev#15150 added conditional calculation of crosstable margins …

66ef8df

…based on normalization type, corrected expected margin values in test_margin_dropna

added whatsnew and reformatted tests to be more readable

9c55b4d

workaround for adding margins row to multiindex

93cb736

cmohl2013 force-pushed the crosstable_norm branch from f8e7e72 to 93cb736 Compare July 25, 2017 15:16

gfyoung reviewed Jul 25, 2017

jreback removed this from the 0.21.0 milestone Jul 26, 2017

jreback closed this Nov 10, 2017
