index into multi-index past the lex-sort depth #8526

behzadnouri · 2014-10-10T02:56:48Z

closes #7724
closes #2646

on current master:

>>> df
                           jolia
jim joe jolie      joline       
1   z   2014-10-14 a          30
    y   2014-10-13 b           3
    x   2014-10-12 c          15
0   z   2014-10-11 d          35
    y   2014-10-10 e          43
    x   2014-10-09 f          36

False negative if the key length exceeds lexsort depth:

>>> (0,) in df.index
False
>>> (0, 'z') in df.index
False
>>> (0, 'z', '2014-10-11') in df.index
False
>>> (0, 'z', Timestamp('2014-10-11')) in df.index
False
>>> (0, 'z', '2014-10-11', 'd') in df.index
False

the only ones which work:

>>> 0 in df.index
True
>>> (0, 'z', Timestamp('2014-10-11'), 'd') in df.index
True

which take different code paths. The last one only works if the index is unique:

>>> (0, 'z', Timestamp('2014-10-11'), 'd') in pd.concat([df, df]).index
False

for all of the false negative cases above, obviously df.loc[key] fails:

>>> df.loc[(0, 'z', Timestamp('2014-10-11'))]
KeyError: 'Key length (3) was greater than MultiIndex lexsort depth (0)'
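
For readers unfamiliar with the term, the "lexsort depth" in the error above is the number of leading index levels that are lexicographically sorted. A minimal pure-Python sketch of that idea follows; `lexsort_depth` here is a hypothetical helper written for illustration (it compares labels directly and is not the pandas implementation), and the dates are kept as plain ISO strings:

```python
def lexsort_depth(keys):
    """Return how many leading levels of `keys` are lexicographically sorted.

    `keys` is a list of equal-length tuples, one per index row.
    Hypothetical helper for illustration; not the pandas implementation.
    """
    if not keys:
        return 0
    nlevels = len(keys[0])
    # Try the longest prefix first and shrink until the prefixes are sorted.
    for depth in range(nlevels, 0, -1):
        prefixes = [key[:depth] for key in keys]
        if all(a <= b for a, b in zip(prefixes, prefixes[1:])):
            return depth
    return 0


# Row order of the unsorted frame above (dates kept as ISO strings).
unsorted_keys = [
    (1, "z", "2014-10-14", "a"),
    (1, "y", "2014-10-13", "b"),
    (1, "x", "2014-10-12", "c"),
    (0, "z", "2014-10-11", "d"),
    (0, "y", "2014-10-10", "e"),
    (0, "x", "2014-10-09", "f"),
]

print(lexsort_depth(unsorted_keys))          # 0: the first level is already out of order
print(lexsort_depth(sorted(unsorted_keys)))  # 4: every level is sorted
```

With a depth of 0, any tuple key is longer than the sorted prefix, which is why every partial-key lookup above hits the `KeyError`.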

Some of these issues persist even if the index is lexically sorted:

>>> df.sort_index(inplace=True)
>>> df  # lexically sorted
                           jolia
jim joe jolie      joline       
0   x   2014-10-09 f          36
    y   2014-10-10 e          43
    z   2014-10-11 d          35
1   x   2014-10-12 c          15
    y   2014-10-13 b           3
    z   2014-10-14 a          30

date-time indexing with a full key fails if the index is unique:

>>> (0, 'x', '2014-10-09') in df.index  # partial key, works!
True
>>> (0, 'x', '2014-10-09', 'f') in df.index  # full key, unique index, breaks!
False

also, a non-unique lexically sorted index always returns a false positive with any full key:

>>> xdf = pd.concat([df, df]).sort_index()
>>> xdf
                           jolia
jim joe jolie      joline       
0   x   2014-10-09 f          36
                   f          36
    y   2014-10-10 e          43
                   e          43
    z   2014-10-11 d          35
                   d          35
1   x   2014-10-12 c          15
                   c          15
    y   2014-10-13 b           3
                   b           3
    z   2014-10-14 a          30
                   a          30
>>> (0, '$', '2014-10-09') in xdf.index  # partial key works
False
>>> (0, '$', '2014-10-09', '#') in xdf.index  # full key always `True`
True
>>> xdf.loc[(0, '$', '2014-10-09', '#')]
KeyError: 'the label [$] is not in the [columns]'

jreback · 2014-10-10T11:43:08Z

we'll put this on tap for 0.15.1, even though it is a slight API change.

will need to add an example at some point.

further, pls put a replica of the tests in the issue (as well as your complete test)

behzadnouri · 2014-10-11T00:03:39Z

updated the pr with some examples

jreback · 2014-10-25T00:20:37Z

pandas/core/index.py

+        if keylen == self.nlevels and self.is_unique:
+            def _maybe_str_to_time_stamp(key, lev):
+                if lev.is_all_dates and not isinstance(key, Timestamp):
+                    try: return Timestamp(key, tz=getattr(lev, 'tz', None))


can you put these on sep lines (I know it's trivial, but more readable)

jreback · 2014-10-25T00:23:08Z

ok minor comments (and I seem to remember _maybe_to_slice like functionality used somewhere else, maybe in core/indexing.py; if you can find something similar enough, pls combine).

need an API example in v0.15.1 (to show off the indexing beyond lex-sort depth).

looks gr8 otherwise

behzadnouri · 2014-10-26T17:00:26Z

could not find a similar functionality to _maybe_to_slice
added the api change example in v0.15.1.txt

jreback · 2014-10-26T23:10:06Z

- Indexing in ``MultiIndex`` beyond lex-sort depth is now supported, though
  a lexically sorted index will have a better performance. (:issue:`2646`)

  .. ipython:: python

    df = pd.DataFrame({'jim':[0, 0, 1, 1],
                       'joe':['x', 'x', 'z', 'y'],
                       'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])
    df
    df.index.lexsort_depth

    # in prior versions this would raise a KeyError
    # will now show a RuntimeWarning
    df.loc[(1, 'z')]

    # lexically sorting
    df2 = df.sortlevel()
    df2
    df2.index.lexsort_depth
    df.loc[(1,'z')]

pls update the docs to something like this (the v0.15.1). I had done this but then realized that the warning is not tested, so

need a test for the RuntimeWarning as well (so if this is ever changed it will be caught).
when you do np.arange, use a dtype of int64 rather than np.int_. This is not 32-bit windows friendly.

behzadnouri · 2014-10-27T02:31:29Z

anything special with tm.assert_produces_warning inside a for loop? travis build fails for 5 out of 7 but not the other two; tests pass on my machine as well.

the code says "not thread safe". anything to do with that?

jreback · 2014-10-27T02:36:38Z

hmm it is kind of finicky but should work

another way of testing this is
to set warnings to raise temporarily (and reset at the end of the test)

and see if it is failing for some reason
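
The suggestion above (temporarily turning warnings into errors) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `lookup_past_lexsort_depth` is a hypothetical stand-in for the indexing call, and it uses `RuntimeWarning` because that is the category the PR emits at this point in the thread.

```python
import warnings


def lookup_past_lexsort_depth():
    # Hypothetical stand-in for an indexing call that is expected to warn.
    warnings.warn("indexing past lexsort depth may impact performance.",
                  RuntimeWarning)
    return "result"


def warning_raised_as_error(func):
    """Return True if `func` warns while warnings are set to raise."""
    # catch_warnings restores the previous filter list on exit,
    # so the "error" filter does not leak into other tests.
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            func()
        except RuntimeWarning:
            return True
    return False


print(warning_raised_as_error(lookup_past_lexsort_depth))
```

Because the filter is reset when the `with` block exits, no explicit cleanup is needed at the end of the test.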

behzadnouri · 2014-10-29T20:58:36Z

There seem to be a number of closed bugs on warnings.simplefilter: 4180, 1548371, 1191104.

I traced the code on python 2.7.3, and it hits the warning line but it fails the test. setting the warnings to error does not help either.

behzadnouri · 2014-11-01T15:07:20Z

@jreback what shall we do regarding the warnings test?
I could not make this work on python 2.7.3, perhaps because of those issues on bugs.python.org.

jreback · 2014-11-01T17:39:07Z

pandas/tests/test_indexing.py

+                mask &= df.iloc[:, i] == k
+
+                if not mask.any():
+                    assert key[:i+1] not in mi.index, \


use self.assertIn (or self.assertTrue); these are more informative than a bare assert
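
The test excerpt above narrows a mask one level at a time and compares the result with index membership. A minimal pure-Python sketch of the same idea is below; `prefix_in_index` is a hypothetical brute-force oracle, and the rows are made-up sample data rather than the PR's actual test fixture.

```python
def prefix_in_index(prefix, keys):
    """Brute-force oracle: does any row of `keys` start with `prefix`?"""
    prefix = tuple(prefix)
    return any(key[:len(prefix)] == prefix for key in keys)


# Unsorted, non-unique rows (made-up sample data).
keys = [(1, "z", "a"), (0, "y", "e"), (0, "z", "d"), (1, "z", "a")]

for candidate in [(0, "z", "d"), (0, "$", "d")]:
    # Check every prefix of the candidate, as the loop in the test does.
    for i in range(len(candidate)):
        prefix = candidate[:i + 1]
        print(prefix, prefix_in_index(prefix, keys))
```

An oracle like this makes no assumption about sort order or uniqueness, so it can serve as the expected value when asserting on `key in index`.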

jreback · 2014-11-01T17:58:57Z

not sure that is the problem. It seems some indexing past lexsort_depth of 0 works (or at least works but doesn't show the warning). And the example that you are trying to catch the warning, does seem to work for me.

behzadnouri · 2014-11-08T20:16:50Z

@jreback

changed to PerformanceWarning, self.assertIn, self.assertNotIn
moved the release notes to v0.15.2.txt
gave up on warnings test. i cannot reproduce it on python 3.4, and when i trace it on python 2.7.3 it hits the warnings line, but fails the test.

As for your comment that "some indexing past lexsort_depth of 0 works": it may be the case that you are specifying a full key, since if it is not a partial key and the index is unique it takes a different code path. if you have an example for which this is not the case, let me know and i will look into it.

jreback · 2014-11-08T20:42:21Z

doc/source/whatsnew/v0.15.2.txt

+    df.index.lexsort_depth
+
+    # in prior versions this would raise a KeyError
+    # will now show a RuntimeWarning


this is now PerformanceWarning, yes?

jreback · 2014-11-08T20:43:08Z

@behzadnouri ok, minor doc change. lmk when you are ready (e.g. you are investigating that last issue)

behzadnouri · 2014-11-09T01:43:24Z

fixed the doc ( RuntimeWarning -> PerformanceWarning )

jreback · 2014-11-09T21:52:44Z

This warning has to be tested. See the example below.
So the len(key) is > lexsort_depth (e.g. 1 > 0).

In [9]:     df = pd.DataFrame({'jim':[0, 0, 1, 1],
                       'joe':['x', 'x', 'z', 'y'],
                       'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])

In [10]: df2 = df.iloc[[2,1,3,0]]

In [11]: df2.index.lexsort_depth
Out[11]: 0

In [12]: df2.loc[(0,)]
pandas/core/index.py:3947: PerformanceWarning: indexing past lexsort depth may impact performance.
  PerformanceWarning)
Out[12]: 
        jolie
joe          
x    0.095603
x    0.702337

But doing it with an even greater depth is ok?

In [14]: df2 = df.iloc[[2,1,3,0]]

In [15]: df2.loc[(0,'x')]
Out[15]: 
            jolie
jim joe          
0   x    0.095603
    x    0.702337

behzadnouri · 2014-11-15T15:57:49Z

@jreback my guess is that you did not set the warnings filter to always, or based on issue 4180 you were not using python 3.4

this is on python 3.4:

In [1]: import warnings

In [2]: warnings.simplefilter('always')

In [3]: df = pd.DataFrame({'jim':[0, 0, 1, 1],
   ...:                    'joe':['x', 'x', 'z', 'y'],
   ...:                    'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])

In [4]: df2 = df.iloc[[2,1,3,0]]

In [5]: df2.index.lexsort_depth
Out[5]: 0

In [6]: df2.loc[(0,)]
/home/acer/dev/pandas/build/lib.linux-x86_64-3.4/pandas/core/index.py:3947: PerformanceWarning: indexing past lexsort depth may impact performance.
  PerformanceWarning)
Out[6]: 
        jolie
joe          
x    0.778548
x    0.145349

In [7]: df2 = df.iloc[[2,1,3,0]]

In [8]: df2.loc[(0,'x')]
/home/acer/dev/pandas/build/lib.linux-x86_64-3.4/pandas/core/index.py:3947: PerformanceWarning: indexing past lexsort depth may impact performance.
  PerformanceWarning)
Out[8]: 
            jolie
jim joe          
0   x    0.778548
    x    0.145349

jreback · 2014-11-15T16:27:35Z

well, this has to work on py2.7 as well (and test that way). (and I directly tried this in a single ipython session). something still not right.

behzadnouri · 2014-11-15T17:11:18Z

i am almost sure this is a python bug, which is fixed in newer versions. i already linked to issues on python bug tracker, and have given up on making this work on older versions of python. I mean i understand the necessity for checking for warnings but i could not make it work; esp. my home computer is python 3.4, and i cannot reproduce it at home.

if you add from pdb import set_trace; set_trace() just before the warnings.warn('indexing pas ... line, it will hit but does not show the warning. that is how i traced on a python 2.7.3 .

jreback · 2014-11-15T17:19:33Z

Then make another test which does exactly what your whatsnew example does (which shows the warning properly). You can explicitly turn the warnings filter on and off.

The python bug is completely irrelevant; the warning DOES happen. You just need to assert that it is happening when you think it should happen. Otherwise when someone else makes a change to this code and the warning disappears it will not be caught (whether on purpose or not).

The key is that travis tests the warning. 2.7 is well in use and will be for the foreseeable future.

You can easily install 2.7 using conda (or even just create another environment, as they are pathed differently).
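
A test that explicitly turns the filter on and asserts the warning, as requested above, might look like the following sketch. The names are assumptions made for illustration: `PerformanceWarning` is a local stand-in class rather than the pandas one, and `lookup_past_lexsort_depth` is a hypothetical substitute for the `.loc` call in the whatsnew example.

```python
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for the pandas warning class discussed in this thread."""


def lookup_past_lexsort_depth():
    # Hypothetical stand-in for an indexing call that is expected to warn.
    warnings.warn("indexing past lexsort depth may impact performance.",
                  PerformanceWarning)
    return "result"


def collect_warning_categories(func):
    """Call `func` and return (result, categories of the warnings it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" disables the once-per-location suppression inside the block.
        warnings.simplefilter("always")
        result = func()
    return result, [w.category for w in caught]


result, categories = collect_warning_categories(lookup_past_lexsort_depth)
assert PerformanceWarning in categories, "expected a PerformanceWarning"
print(result, categories)
```

If a later change removed the `warnings.warn` call, the assertion would fail, which is the regression this kind of test is meant to catch.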

behzadnouri · 2014-11-16T21:50:27Z

i added a new test just to test the warning, as in the example above;
it fails on python 2.6 so i added nose.SkipTest for python version less than 2.7

jreback · 2014-11-20T23:33:41Z

doc/source/whatsnew/v0.15.2.txt

+    df2 = df.sortlevel()
+    df2
+    df2.index.lexsort_depth
+    df.loc[(1,'z')]


this should be df2?

behzadnouri · 2014-11-21T01:35:54Z

"this should be df2?"

yes, fixed that

jreback · 2014-11-21T23:22:07Z

@behzadnouri thanks for this!

I had to add a method of clearing the warnings (added to assert_produces_warning) as it was failing
for me. but all set now.
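
The actual change to the test helper is not shown in this thread. One common way to "clear the warnings", sketched here as an assumption, is to empty the per-module `__warningregistry__` dictionaries in which Python records warnings that have already been shown; `clear_warning_registries` is a hypothetical helper name.

```python
import types


def clear_warning_registries(modules):
    """Empty `__warningregistry__` on each given module, if it has one."""
    for module in modules:
        registry = getattr(module, "__warningregistry__", None)
        if isinstance(registry, dict):
            registry.clear()


# Demonstrate on a throwaway module object with a pre-populated registry.
fake_module = types.ModuleType("fake_indexing")
fake_module.__warningregistry__ = {("some message", Warning, 1): True}

clear_warning_registries([fake_module])
print(fake_module.__warningregistry__)
```

Once the registry is empty, a warning that was previously suppressed as "already seen" at that location can be emitted, and therefore asserted, again.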

thanks!

jreback added the Bug, Indexing, and MultiIndex labels Oct 10, 2014

jreback added this to the 0.15.1 milestone Oct 10, 2014

behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from caebcc7 to e31cb4f Compare October 10, 2014 23:05

behzadnouri force-pushed the mi-key-loc branch from e31cb4f to 2e48fea Compare October 11, 2014 13:53

jreback reviewed Oct 25, 2014

behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from e36a794 to 6e79d54 Compare October 26, 2014 15:29

behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from eb83bd3 to 06611ed Compare October 27, 2014 01:34

jreback reviewed Nov 1, 2014

jreback mentioned this pull request Nov 6, 2014 in "Add method to sort *within* a single MultiIndex level" #739 (closed)

behzadnouri force-pushed the mi-key-loc branch 3 times, most recently from cd47eef to 05ea639 Compare November 8, 2014 18:56

jreback reviewed Nov 8, 2014

behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from 17fbb2a to 1fa414e Compare November 9, 2014 01:14

behzadnouri force-pushed the mi-key-loc branch 4 times, most recently from d75cb01 to e74c4a0 Compare November 16, 2014 21:20

jreback reviewed Nov 20, 2014

index into multi-index past the lexsort depth

d0861e8

behzadnouri force-pushed the mi-key-loc branch from e74c4a0 to d0861e8 Compare November 21, 2014 01:06

jreback merged commit d0861e8 into pandas-dev:master Nov 21, 2014

behzadnouri deleted the mi-key-loc branch November 22, 2014 14:47
