na_position doesn't work for sort_index() with MultiIndex #14784

brandonmburroughs · 2016-12-01T23:28:41Z

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: u'0.19.0'

In [4]: mi = pd.MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))

In [5]: df = pd.DataFrame([[1, 2], [3, 4]], mi)

In [6]: df.sort_index(na_position="first")
Out[6]: 
         0  1
A B C        
1 1 NaN  3  4
    3    1  2

In [7]: df.sort_index(na_position="last")
Out[7]: 
         0  1
A B C        
1 1 NaN  3  4
    3    1  2

Problem description

The na_position argument isn't used in DataFrame.sort_index() or Series.sort_index() due to the way we sort the MultiIndex. Whenever we create a MultiIndex, we store the labels as relative values. For instance, if we have the following MultiIndex:

MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))

the values get stored as

MultiIndex(levels=[[1], [1], [3]],
           labels=[[0, 0], [0, 0], [0, -1]],
           names=[u'A', u'B', u'C'])

with a NaN placeholder of -1.

These label values are what get passed to the sorting algorithm for both DataFrames and Series. Since the sorting only happens on the labels, it has no notion of the NaN.

This has been discussed in #14015 and #14672 .

My original naive solution was to change these lines from:

indexer = _lexsort_indexer(labels.labels, orders=ascending,
                                            na_position=na_position)

to

index_values_list = np.dstack(labels.get_values())[0].tolist()
indexer = _lexsort_indexer(index_values_list, orders=ascending,
                                           na_position=na_position)

This didn't break any tests, but it isn't necessarily the best approach.

Expected Output

In [7]: df.sort_index(na_position="last")
Out[7]: 
         0  1
A B C        
1 1 3    1  2
    NaN  3  4

Output of `pd.show_versions()`

In [3]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-77-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.4
pip: 9.0.0
setuptools: 27.2.0
Cython: 0.21
numpy: 1.11.2
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 0.9.7
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.7.3
boto: 2.32.1
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2017-02-27T16:52:26Z

So basically sorting a multi-index always puts NaNs first, which can be quite annoying.

If someone wants to try to tackle this, it should be rather easy I think. na_position is already passed to lexsort_indexer, there is only a bug in its implementation where mask = c.codes == -1 does not work because the -1 values in the labels were turned into a a category '-1', not into a code value of '-1'.

linebp · 2017-03-17T13:43:56Z

I'd like to give this a try!

I have been looking at the code and when sorting the multi-index, the labels are passed to the lexsort_indexer function. This does not go well, because they are not the original values, but have been renamed, specifically the NaN values have been renamed to -1, so it no longer has special meaning in Categories.

After a quick look I see several ways of fixing the mask = c.codes == -1

Figure out which code has been assigned to the categorical value -1 and check for that instead.
Replace the -1 in the labels with NaN so that the code assigned will be -1 and the mask statement will work as intended
After creating the Categories object from the labels, remove the -1 category if it exists, so that the mask statement will work as intended

Is the check best left as is and should I fix data to make it work or should I fix the check to match the data? That bit where the key is checked if it is already Categorical worries me.

jreback · 2017-03-17T14:05:15Z

you can do something like this right about here

There are 2 cases when this is called.

from MultiIndex, when you get the labels passed in, these are already factorized (IOW they are almost but not quite a categorical).
from sorting (e.g. DataFrame.sort_values() where these are NOT factorized and must be turned into a proper categorical (this is already handled correctly).

This will correctly handle case 1), case 2) is handled by the existing code.

diff --git a/pandas/core/sorting.py b/pandas/core/sorting.py
index 205d0d9..c62b4e2 100644
--- a/pandas/core/sorting.py
+++ b/pandas/core/sorting.py
@@ -174,7 +174,8 @@ def lexsort_indexer(keys, orders=None, na_position='last'):
 
         # create the Categorical
         else:
-            c = Categorical(key, ordered=True)
+            cats = algorithms.unique(key)
+            c = Categorical.from_codes(key, cats[cats != -1], ordered=True)
 
         if na_position not in ['last', 'first']:
             raise ValueError('invalid na_position: {!r}'.format(na_position))

Simply do the above in the MultiIndex code before passing to _lexsort_indexer; if its a categorical it will just work. (don't actually do this in _lexsort_indexer as I did (that was just a test).

linebp · 2017-03-29T09:03:19Z

Like so:

--- i/pandas/core/frame.py
+++ w/pandas/core/frame.py
@@ -3392,7 +3392,12 @@ it is assumed to be aliases for the column names.')
             if not labels.is_lexsorted():
                 labels = MultiIndex.from_tuples(labels.values)
 
-            indexer = lexsort_indexer(labels.labels, orders=ascending,
+            keys = []
+            for label in labels.labels:
+                cats = algorithms.unique(labels)
+                keys.append(Categorical.from_codes(label, cats[cats != -1], ordered=True))
+
+            indexer = lexsort_indexer(keys, orders=ascending,
                                       na_position=na_position)
         else:
             from pandas.core.sorting import nargsort

I figured the for loop was better than using either map or a list comprehension, since that is what was suggested and for loops are what is used elsewhere.

I also did a few tests:

--- i/pandas/tests/frame/test_sorting.py
+++ w/pandas/tests/frame/test_sorting.py
@@ -58,6 +58,20 @@ class TestDataFrameSorting(tm.TestCase, TestData):
         expected = df.sort_index()
         assert_frame_equal(result, expected)
 
+        # Setting up data for NaN sorting
+        mi = MultiIndex.from_tuples([[1, 2], [np.nan, np.nan], [np.nan, 3], [12, 13]])
+        frame = DataFrame(np.arange(16).reshape(4, 4), index=mi, columns=list('ABCD'))
+
+        # MI sort with NaN's first
+        result = frame.sort_index(na_position='first')
+        expected = frame.iloc[[1, 2, 0, 3], :]
+        assert_frame_equal(result, expected)
+
+        # MI sort with NaN's last
+        result = frame.sort_index(na_position='last')
+        expected = frame.iloc[[0, 3, 2, 1], :]
+        assert_frame_equal(result, expected)
+
     def test_sort(self):
         frame = DataFrame(np.arange(16).reshape(4, 4), index=[1, 2, 3, 4],
                           columns=['A', 'B', 'C', 'D'])

When sorting on levels

When sorting the index with the level option set, the na_position option is ignored. Is this working as intended or should the option be passed along?

jorisvandenbossche · 2017-03-29T09:18:50Z

@linebp It is probably easier if you open a PR with the above changes (even if you are not sure of the approach, or if it is not yet finished, just indicate so in the PR), that will make discussing it easier.

When sorting the index with the level option set, the na_position option is ignored. Is this working as intended or should the option be passed along?

I think ideally this should also work (so pass na_option along)

jreback · 2017-04-19T22:30:42Z

thanks @linebp

linebp · 2017-04-20T08:30:48Z

thanks, @jreback , for your patience with all the overthinking

closes pandas-dev#14784 Author: Line Pedersen <[email protected]> Closes pandas-dev#15845 from linebp/json_normalize_seperator and squashes the following commits: 66f809e [Line Pedersen] BUG GH14784 na_position doesn't work for sort_index() with MultiIndex

xref pandas-dev#14784

* TST: separate out groupby/test_nth * BUG: bug in groupby on empty frame with multi groupers xref #14784 closes #16064

closes pandas-dev#14784 Author: Line Pedersen <[email protected]> Closes pandas-dev#15845 from linebp/json_normalize_seperator and squashes the following commits: 66f809e [Line Pedersen] BUG GH14784 na_position doesn't work for sort_index() with MultiIndex

) * TST: separate out groupby/test_nth * BUG: bug in groupby on empty frame with multi groupers xref pandas-dev#14784 closes pandas-dev#16064

brandonmburroughs mentioned this issue Dec 1, 2016

ENH: support kind and na_position kwargs in Series.sort_index #14445

Merged

jorisvandenbossche changed the title ~~na_position argument doesn't work for DataFrame.sort_index()~~ na_position doesn't work for sort_index() with MultiIndex Dec 2, 2016

jorisvandenbossche added Enhancement MultiIndex labels Dec 2, 2016

jorisvandenbossche added this to the Someday milestone Dec 2, 2016

jorisvandenbossche modified the milestones: Next Major Release, Someday Feb 27, 2017

jorisvandenbossche added Bug Difficulty Novice and removed Enhancement labels Feb 27, 2017

linebp mentioned this issue Mar 30, 2017

na_position doesn't work for sort_index() with MultiIndex #15845

Closed

4 tasks

jreback modified the milestones: 0.20.0, Next Major Release Apr 19, 2017

jreback closed this as completed in dd5cef5 Apr 19, 2017

jreback added a commit to jreback/pandas that referenced this issue Apr 21, 2017

BUG: bug in groupby on empty frame with multi groupers

dc7d022

xref pandas-dev#14784

jreback mentioned this issue Apr 21, 2017

BUG: bug in groupby on empty frame with multi groupers #16090

Merged

jreback added a commit that referenced this issue Apr 22, 2017

BUG: bug in groupby on empty frame with multi groupers (#16090)

f562308

* TST: separate out groupby/test_nth * BUG: bug in groupby on empty frame with multi groupers xref #14784 closes #16064

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

na_position doesn't work for sort_index() with MultiIndex #14784

na_position doesn't work for sort_index() with MultiIndex #14784

brandonmburroughs commented Dec 1, 2016

INSTALLED VERSIONS

jorisvandenbossche commented Feb 27, 2017

linebp commented Mar 17, 2017

jreback commented Mar 17, 2017

linebp commented Mar 29, 2017

jorisvandenbossche commented Mar 29, 2017

jreback commented Apr 19, 2017

linebp commented Apr 20, 2017

na_position doesn't work for sort_index() with MultiIndex #14784

na_position doesn't work for sort_index() with MultiIndex #14784

Comments

brandonmburroughs commented Dec 1, 2016

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jorisvandenbossche commented Feb 27, 2017

linebp commented Mar 17, 2017

jreback commented Mar 17, 2017

linebp commented Mar 29, 2017

When sorting on levels

jorisvandenbossche commented Mar 29, 2017

jreback commented Apr 19, 2017

linebp commented Apr 20, 2017

Output of `pd.show_versions()`