Skip to content

na_position doesn't work for sort_index() with MultiIndex #14784

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
brandonmburroughs opened this issue Dec 1, 2016 · 7 comments
Closed

na_position doesn't work for sort_index() with MultiIndex #14784

brandonmburroughs opened this issue Dec 1, 2016 · 7 comments

Comments

@brandonmburroughs
Copy link
Contributor

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: u'0.19.0'

In [4]: mi = pd.MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))

In [5]: df = pd.DataFrame([[1, 2], [3, 4]], mi)

In [6]: df.sort_index(na_position="first")
Out[6]: 
         0  1
A B C        
1 1 NaN  3  4
    3    1  2

In [7]: df.sort_index(na_position="last")
Out[7]: 
         0  1
A B C        
1 1 NaN  3  4
    3    1  2

Problem description

The na_position argument isn't used in DataFrame.sort_index() or Series.sort_index() due to the way we sort the MultiIndex. Whenever we create a MultiIndex, we store the labels as relative values. For instance, if we have the following MultiIndex:

MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))

the values get stored as

MultiIndex(levels=[[1], [1], [3]],
           labels=[[0, 0], [0, 0], [0, -1]],
           names=[u'A', u'B', u'C'])

with a NaN placeholder of -1.

These label values are what get passed to the sorting algorithm for both DataFrames and Series. Since the sorting only happens on the labels, it has no notion of the NaN.

This has been discussed in #14015 and #14672 .

My original naive solution was to change these lines from:

indexer = _lexsort_indexer(labels.labels, orders=ascending,
                                            na_position=na_position)

to

index_values_list = np.dstack(labels.get_values())[0].tolist()
indexer = _lexsort_indexer(index_values_list, orders=ascending,
                                           na_position=na_position)

This didn't break any tests, but it isn't necessarily the best approach.

Expected Output

In [7]: df.sort_index(na_position="last")
Out[7]: 
         0  1
A B C        
1 1 3    1  2
    NaN  3  4

Output of pd.show_versions()

In [3]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-77-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.4
pip: 9.0.0
setuptools: 27.2.0
Cython: 0.21
numpy: 1.11.2
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 0.9.7
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.7.3
boto: 2.32.1
pandas_datareader: None

@jorisvandenbossche jorisvandenbossche changed the title na_position argument doesn't work for DataFrame.sort_index() na_position doesn't work for sort_index() with MultiIndex Dec 2, 2016
@jorisvandenbossche jorisvandenbossche added this to the Someday milestone Dec 2, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, Someday Feb 27, 2017
@jorisvandenbossche
Copy link
Member

So basically sorting a multi-index always puts NaNs first, which can be quite annoying.

If someone wants to try to tackle this, it should be rather easy I think. na_position is already passed to lexsort_indexer, there is only a bug in its implementation where mask = c.codes == -1 does not work because the -1 values in the labels were turned into a a category '-1', not into a code value of '-1'.

@linebp
Copy link
Contributor

linebp commented Mar 17, 2017

I'd like to give this a try!

I have been looking at the code and when sorting the multi-index, the labels are passed to the lexsort_indexer function. This does not go well, because they are not the original values, but have been renamed, specifically the NaN values have been renamed to -1, so it no longer has special meaning in Categories.

After a quick look I see several ways of fixing the mask = c.codes == -1

  • Figure out which code has been assigned to the categorical value -1 and check for that instead.
  • Replace the -1 in the labels with NaN so that the code assigned will be -1 and the mask statement will work as intended
  • After creating the Categories object from the labels, remove the -1 category if it exists, so that the mask statement will work as intended

Is the check best left as is and should I fix data to make it work or should I fix the check to match the data? That bit where the key is checked if it is already Categorical worries me.

@jreback
Copy link
Contributor

jreback commented Mar 17, 2017

you can do something like this right about here

There are 2 cases when this is called.

  1. from MultiIndex, when you get the labels passed in, these are already factorized (IOW they are almost but not quite a categorical).

  2. from sorting (e.g. DataFrame.sort_values() where these are NOT factorized and must be turned into a proper categorical (this is already handled correctly).

This will correctly handle case 1), case 2) is handled by the existing code.

diff --git a/pandas/core/sorting.py b/pandas/core/sorting.py
index 205d0d9..c62b4e2 100644
--- a/pandas/core/sorting.py
+++ b/pandas/core/sorting.py
@@ -174,7 +174,8 @@ def lexsort_indexer(keys, orders=None, na_position='last'):
 
         # create the Categorical
         else:
-            c = Categorical(key, ordered=True)
+            cats = algorithms.unique(key)
+            c = Categorical.from_codes(key, cats[cats != -1], ordered=True)
 
         if na_position not in ['last', 'first']:
             raise ValueError('invalid na_position: {!r}'.format(na_position))

Simply do the above in the MultiIndex code before passing to _lexsort_indexer; if its a categorical it will just work. (don't actually do this in _lexsort_indexer as I did (that was just a test).

@linebp
Copy link
Contributor

linebp commented Mar 29, 2017

Like so:

--- i/pandas/core/frame.py
+++ w/pandas/core/frame.py
@@ -3392,7 +3392,12 @@ it is assumed to be aliases for the column names.')
             if not labels.is_lexsorted():
                 labels = MultiIndex.from_tuples(labels.values)
 
-            indexer = lexsort_indexer(labels.labels, orders=ascending,
+            keys = []
+            for label in labels.labels:
+                cats = algorithms.unique(labels)
+                keys.append(Categorical.from_codes(label, cats[cats != -1], ordered=True))
+
+            indexer = lexsort_indexer(keys, orders=ascending,
                                       na_position=na_position)
         else:
             from pandas.core.sorting import nargsort

I figured the for loop was better than using either map or a list comprehension, since that is what was suggested and for loops are what is used elsewhere.

I also did a few tests:

--- i/pandas/tests/frame/test_sorting.py
+++ w/pandas/tests/frame/test_sorting.py
@@ -58,6 +58,20 @@ class TestDataFrameSorting(tm.TestCase, TestData):
         expected = df.sort_index()
         assert_frame_equal(result, expected)
 
+        # Setting up data for NaN sorting
+        mi = MultiIndex.from_tuples([[1, 2], [np.nan, np.nan], [np.nan, 3], [12, 13]])
+        frame = DataFrame(np.arange(16).reshape(4, 4), index=mi, columns=list('ABCD'))
+
+        # MI sort with NaN's first
+        result = frame.sort_index(na_position='first')
+        expected = frame.iloc[[1, 2, 0, 3], :]
+        assert_frame_equal(result, expected)
+
+        # MI sort with NaN's last
+        result = frame.sort_index(na_position='last')
+        expected = frame.iloc[[0, 3, 2, 1], :]
+        assert_frame_equal(result, expected)
+
     def test_sort(self):
         frame = DataFrame(np.arange(16).reshape(4, 4), index=[1, 2, 3, 4],
                           columns=['A', 'B', 'C', 'D'])

When sorting on levels

When sorting the index with the level option set, the na_position option is ignored. Is this working as intended or should the option be passed along?

@jorisvandenbossche
Copy link
Member

@linebp It is probably easier if you open a PR with the above changes (even if you are not sure of the approach, or if it is not yet finished, just indicate so in the PR), that will make discussing it easier.

When sorting the index with the level option set, the na_position option is ignored. Is this working as intended or should the option be passed along?

I think ideally this should also work (so pass na_option along)

@jreback jreback modified the milestones: 0.20.0, Next Major Release Apr 19, 2017
@jreback
Copy link
Contributor

jreback commented Apr 19, 2017

thanks @linebp

@linebp
Copy link
Contributor

linebp commented Apr 20, 2017

thanks, @jreback , for your patience with all the overthinking

analyticalmonk pushed a commit to analyticalmonk/pandas that referenced this issue Apr 20, 2017
closes pandas-dev#14784

Author: Line Pedersen <[email protected]>

Closes pandas-dev#15845 from linebp/json_normalize_seperator and squashes the following commits:

66f809e [Line Pedersen] BUG GH14784 na_position doesn't work for sort_index() with MultiIndex
jreback added a commit to jreback/pandas that referenced this issue Apr 21, 2017
jreback added a commit that referenced this issue Apr 22, 2017
* TST: separate out groupby/test_nth

* BUG: bug in groupby on empty frame with multi groupers

xref #14784
closes #16064
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
closes pandas-dev#14784

Author: Line Pedersen <[email protected]>

Closes pandas-dev#15845 from linebp/json_normalize_seperator and squashes the following commits:

66f809e [Line Pedersen] BUG GH14784 na_position doesn't work for sort_index() with MultiIndex
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
)

* TST: separate out groupby/test_nth

* BUG: bug in groupby on empty frame with multi groupers

xref pandas-dev#14784
closes pandas-dev#16064
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants