index into multi-index past the lex-sort depth #8526

Merged
jreback merged 1 commit into pandas-dev:master from mi-key-loc on Nov 21, 2014

behzadnouri
Contributor

closes #7724
closes #2646

on current master:

>>> df
                           jolia
jim joe jolie      joline       
1   z   2014-10-14 a          30
    y   2014-10-13 b           3
    x   2014-10-12 c          15
0   z   2014-10-11 d          35
    y   2014-10-10 e          43
    x   2014-10-09 f          36

False negative if the key length exceeds lexsort depth:

>>> (0,) in df.index
False
>>> (0, 'z') in df.index
False
>>> (0, 'z', '2014-10-11') in df.index
False
>>> (0, 'z', Timestamp('2014-10-11')) in df.index
False
>>> (0, 'z', '2014-10-11', 'd') in df.index
False

only ones which work:

>>> 0 in df.index
True
>>> (0, 'z', Timestamp('2014-10-11'), 'd') in df.index
True

which take different code paths. The last one only works if the index is unique:

>>> (0, 'z', Timestamp('2014-10-11'), 'd') in pd.concat([df, df]).index
False

for all of the false negative cases above, obviously df.loc[key] fails:

>>> df.loc[(0, 'z', Timestamp('2014-10-11'))]
KeyError: 'Key length (3) was greater than MultiIndex lexsort depth (0)'
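
For anyone wanting to try these checks, here is a minimal sketch that rebuilds the frame above. The values are read off the printed output; how the frame was originally built (the `date_range` call, the `set_index` order) is an assumption, since the setup code is not shown. On a pandas release that includes this change, every key length is expected to be found; lookups past the lexsort depth emit a `PerformanceWarning`, which the sketch silences.

```python
import warnings

import pandas as pd

# Values copied from the printed frame; how it was originally built is assumed.
df = pd.DataFrame(
    {
        "jim": [1, 1, 1, 0, 0, 0],
        "joe": list("zyxzyx"),
        "jolie": pd.date_range("2014-10-09", periods=6)[::-1],
        "joline": list("abcdef"),
        "jolia": [30, 3, 15, 35, 43, 36],
    }
).set_index(["jim", "joe", "jolie", "joline"])

ts = pd.Timestamp("2014-10-11")
keys = [(0,), (0, "z"), (0, "z", ts), (0, "z", ts, "d")]

with warnings.catch_warnings():
    # Lookups past the lexsort depth warn about performance; silence that here.
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    # Map key length -> membership result, for the unique and the duplicated index.
    unique_hits = {len(key): key in df.index for key in keys}
    duplicated_hits = {len(key): key in pd.concat([df, df]).index for key in keys}

print(unique_hits)
print(duplicated_hits)
```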

Some of these issues persist even if the index is lexically sorted:

>>> df.sort_index(inplace=True)
>>> df  # lexically sorted
                           jolia
jim joe jolie      joline       
0   x   2014-10-09 f          36
    y   2014-10-10 e          43
    z   2014-10-11 d          35
1   x   2014-10-12 c          15
    y   2014-10-13 b           3
    z   2014-10-14 a          30

date-time indexing with a full-key fails if index is unique:

>>> (0, 'x', '2014-10-09') in df.index  # partial key, works!
True
>>> (0, 'x', '2014-10-09', 'f') in df.index  # full key, unique index, breaks!
False

also, a non-unique lexically sorted index always returns a false positive for any full key:

>>> xdf = pd.concat([df, df]).sort_index()
>>> xdf
                           jolia
jim joe jolie      joline       
0   x   2014-10-09 f          36
                   f          36
    y   2014-10-10 e          43
                   e          43
    z   2014-10-11 d          35
                   d          35
1   x   2014-10-12 c          15
                   c          15
    y   2014-10-13 b           3
                   b           3
    z   2014-10-14 a          30
                   a          30
>>> (0, '$', '2014-10-09') in xdf.index  # partial key works
False
>>> (0, '$', '2014-10-09', '#') in xdf.index  # full key always `True`
True
>>> xdf.loc[(0, '$', '2014-10-09', '#')]
KeyError: 'the label [$] is not in the [columns]'
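
The sorted, duplicated case can be checked the same way. This is a sketch under the same assumptions as before (frame values read off the printed output, construction guessed). With the change in place, a full key that is present should be found and the made-up key containing `'$'` and `'#'` should not.

```python
import pandas as pd

# Same values as the printed frame; the construction is an assumption.
df = pd.DataFrame(
    {
        "jim": [1, 1, 1, 0, 0, 0],
        "joe": list("zyxzyx"),
        "jolie": pd.date_range("2014-10-09", periods=6)[::-1],
        "joline": list("abcdef"),
        "jolia": [30, 3, 15, 35, 43, 36],
    }
).set_index(["jim", "joe", "jolie", "joline"])

# Duplicate every row, then sort lexically: non-unique but fully sorted.
xdf = pd.concat([df, df]).sort_index()

ts = pd.Timestamp("2014-10-09")
present = (0, "x", ts, "f") in xdf.index  # full key that exists (twice)
absent = (0, "$", ts, "#") in xdf.index   # full key that exists nowhere

print(present, absent)
```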

@jreback added the Bug, Indexing, and MultiIndex labels Oct 10, 2014
@jreback added this to the 0.15.1 milestone Oct 10, 2014
@jreback
Contributor

jreback commented Oct 10, 2014

we'll put this on tap for 0.15.1. Even though it is a slight API change.

will need to add an example at some point.

further, pls put a replica of the tests in the issue (as well as your complete test)

@behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from caebcc7 to e31cb4f, October 10, 2014 23:05
@behzadnouri
Contributor Author

updated the pr with some examples

if keylen == self.nlevels and self.is_unique:
def _maybe_str_to_time_stamp(key, lev):
if lev.is_all_dates and not isinstance(key, Timestamp):
try: return Timestamp(key, tz=getattr(lev, 'tz', None))
Contributor

can you put on sep lines (I know it's trivial, but more readable)

@jreback
Contributor

jreback commented Oct 25, 2014

ok minor comments (and I seem to remember _maybe_to_slice-like functionality used somewhere else, maybe in core/indexing.py; if you can find something similar enough, pls combine).

need an API example in v0.15.1 (to show off the indexing beyond lex-sort depth).

looks gr8 otherwise

@behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from e36a794 to 6e79d54, October 26, 2014 15:29
@behzadnouri
Contributor Author

could not find a similar functionality to _maybe_to_slice
added the api change example in v0.15.1.txt

@jreback
Contributor

jreback commented Oct 26, 2014

- Indexing in ``MultiIndex`` beyond lex-sort depth is now supported, though
  a lexically sorted index will have a better performance. (:issue:`2646`)

  .. ipython:: python

    df = pd.DataFrame({'jim':[0, 0, 1, 1],
                       'joe':['x', 'x', 'z', 'y'],
                       'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])
    df
    df.index.lexsort_depth

    # in prior versions this would raise a KeyError
    # will now show a RuntimeWarning
    df.loc[(1, 'z')]

    # lexically sorting
    df2 = df.sortlevel()
    df2
    df2.index.lexsort_depth
    df.loc[(1,'z')]

pls update the docs to something like this (the v0.15.1). I had done this but then realized that the warning is not tested, so

  • need a test for the RuntimeWarning as well (so if this is ever changed it will be caught).
  • when you do np.arange, use a dtype of int64 rather than np.int_. This is not 32-bit windows friendly.
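
On the second point, a small illustration of the difference. The background is an assumption stated from general NumPy knowledge rather than from this thread: `np.int_` follows the platform's default integer, which was 32 bits wide on Windows builds of NumPy at the time, whereas an explicit `int64` is the same everywhere.

```python
import numpy as np

# Explicit width: identical on every platform.
codes = np.arange(6, dtype="int64")

# Platform default: its width depends on the OS and NumPy build.
platform_default = np.dtype(np.int_)

print(codes.dtype, platform_default)
```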

@behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from eb83bd3 to 06611ed, October 27, 2014 01:34
@behzadnouri
Contributor Author

anything special with tm.assert_produces_warning inside a for loop? travis build fails for 5 out of 7 but not the other two; tests pass on my machine as well.

the code says "not thread safe". anything to do with that?

@jreback
Contributor

jreback commented Oct 27, 2014

hmm it is kind of finicky but should work

another way of testing this is
to set warnings to raise temporarily (and reset at the end of the test)

and see if it is failing for some reason
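
A sketch of that approach using only the standard library: escalate the warning to an exception inside `warnings.catch_warnings()`, which restores the previous filters on exit. Both `lookup_past_sort_depth` and `raises_warning` are hypothetical names made up for this illustration; the first stands in for whatever indexing call is expected to warn.

```python
import warnings


def lookup_past_sort_depth():
    """Hypothetical stand-in for an indexing call that warns."""
    warnings.warn("indexing past lexsort depth may impact performance.",
                  RuntimeWarning)
    return "result"


def raises_warning(func, category):
    """Return True if calling `func` emits a warning of type `category`."""
    with warnings.catch_warnings():  # filters are restored when the block exits
        # Turn matching warnings into exceptions for the duration of the call.
        warnings.simplefilter("error", category)
        try:
            func()
        except category:
            return True
    return False


print(raises_warning(lookup_past_sort_depth, RuntimeWarning))
print(raises_warning(lambda: None, RuntimeWarning))
```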

@behzadnouri
Contributor Author

There seem to be a number of closed bugs on warnings.simplefilter: 4180, 1548371, 1191104.

I traced the code on python 2.7.3, and it hits the warning line but it fails the test. setting the warnings to error does not help either.
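
One thing that may be relevant here, though the thread does not confirm it as the cause: under the `default` filter action Python reports a warning only once per call site and remembers that in a per-module `__warningregistry__`, so a warning raised repeatedly from the same line (for example inside a loop) can go unreported after the first time. The sketch below, with a hypothetical `noisy` function, contrasts that with the `always` action on a current Python 3.

```python
import warnings


def noisy():
    """Hypothetical function that warns from a single call site."""
    warnings.warn("indexing past lexsort depth may impact performance.",
                  RuntimeWarning)


def count_warnings(action, calls=3):
    """Call `noisy` repeatedly under one filter action and count what is recorded."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(calls):
            noisy()
    return len(caught)


always_hits = count_warnings("always")    # every call is recorded
default_hits = count_warnings("default")  # repeats from the same line are dropped

print(always_hits, default_hits)
```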

@behzadnouri
Contributor Author

@jreback what shall we do regarding the warnings test?
I could not make this work on python 2.7.3, perhaps because of those issues on bugs.python.org.

mask &= df.iloc[:, i] == k

if not mask.any():
assert key[:i+1] not in mi.index, \
Contributor

use self.assertIn (or self.assertTrue); these are more informative than a bare assert

@jreback
Contributor

jreback commented Nov 1, 2014

not sure that is the problem. It seems some indexing past lexsort_depth of 0 works (or at least works but doesn't show the warning). And the example that you are trying to catch the warning, does seem to work for me.

@behzadnouri
Contributor Author

@jreback

  • changed to PerformanceWarning, self.assertIn, self.assertNotIn
  • moved the release notes to v0.15.2.txt
  • gave up on warnings test. i cannot reproduce it on python 3.4, and when i trace it on python 2.7.3 it hits the warnings line, but fails the test. As for your comment that

some indexing past lexsort_depth of 0 works

it may be the case that you are specifying a full key, since if it is not a partial key and the index is unique it takes a different code path. if you have an example for which this is not the case, let me know and i will look into it.
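
The difference between the two code paths can be observed through the warning itself. This is a sketch against a small made-up index (not one from the PR), with a hypothetical helper `perf_warning_count`: on a unique, unsorted `MultiIndex`, a full key is expected to be looked up without a `PerformanceWarning`, while a partial key is expected to warn.

```python
import warnings

import pandas as pd

# Unique but not lexically sorted (first level goes 1, 0, 1, 0).
mi = pd.MultiIndex.from_tuples(
    [(1, "z"), (0, "y"), (1, "x"), (0, "x")], names=["jim", "joe"]
)


def perf_warning_count(index, key):
    """Number of PerformanceWarnings emitted while locating `key` in `index`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        index.get_loc(key)
    return sum(
        issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
    )


full_key = perf_warning_count(mi, (0, "y"))  # full key on a unique index
partial_key = perf_warning_count(mi, (0,))   # partial key past the sort depth

print(full_key, partial_key)
```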

df.index.lexsort_depth

# in prior versions this would raise a KeyError
# will now show a RuntimeWarning
Contributor

this is now PerformanceWarning, yes?

@jreback
Contributor

jreback commented Nov 8, 2014

@behzadnouri ok, minor doc change. lmk when you are ready (e.g. you are investigating that last issue)

@behzadnouri force-pushed the mi-key-loc branch 2 times, most recently from 17fbb2a to 1fa414e, November 9, 2014 01:14
@behzadnouri
Contributor Author

fixed the doc ( RuntimeWarning -> PerformanceWarning )

@jreback
Contributor

jreback commented Nov 9, 2014

This warning has to be tested. See the example below.
So the len(key) is > lexsort_depth (e.g. 1 > 0).

In [9]:     df = pd.DataFrame({'jim':[0, 0, 1, 1],
                       'joe':['x', 'x', 'z', 'y'],
                       'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])

In [10]: df2 = df.iloc[[2,1,3,0]]

In [11]: df2.index.lexsort_depth
Out[11]: 0

In [12]: df2.loc[(0,)]
pandas/core/index.py:3947: PerformanceWarning: indexing past lexsort depth may impact performance.
  PerformanceWarning)
Out[12]: 
        jolie
joe          
x    0.095603
x    0.702337

But doing it with an even greater depth is ok?

In [14]: df2 = df.iloc[[2,1,3,0]]

In [15]: df2.loc[(0,'x')]
Out[15]: 
            jolie
jim joe          
0   x    0.095603
    x    0.702337

@behzadnouri
Contributor Author

@jreback my guess is that you did not set the warnings filter to always, or based on issue 4180 you were not using python 3.4

this is on python 3.4:

In [1]: import warnings

In [2]: warnings.simplefilter('always')

In [3]: df = pd.DataFrame({'jim':[0, 0, 1, 1],
   ...:                    'joe':['x', 'x', 'z', 'y'],
   ...:                    'jolie':np.random.rand(4)}).set_index(['jim', 'joe'])

In [4]: df2 = df.iloc[[2,1,3,0]]

In [5]: df2.index.lexsort_depth
Out[5]: 0

In [6]: df2.loc[(0,)]
/home/acer/dev/pandas/build/lib.linux-x86_64-3.4/pandas/core/index.py:3947: PerformanceWarning: indexing past lexsort depth may impact performance.
  PerformanceWarning)
Out[6]: 
        jolie
joe          
x    0.778548
x    0.145349

In [7]: df2 = df.iloc[[2,1,3,0]]

In [8]: df2.loc[(0,'x')]
/home/acer/dev/pandas/build/lib.linux-x86_64-3.4/pandas/core/index.py:3947: PerformanceWarning: indexing past lexsort depth may impact performance.
  PerformanceWarning)
Out[8]: 
            jolie
jim joe          
0   x    0.778548
    x    0.145349

@jreback
Contributor

jreback commented Nov 15, 2014

well, this has to work on py2.7 as well (and test that way). (and I directly tried this in a single ipython session). something still not right.

@behzadnouri
Contributor Author

i am almost sure this is a python bug, which is fixed in newer versions. i already linked to the issues on the python bug tracker, and have given up on making this work on older versions of python. I mean i understand the necessity of checking for warnings but i could not make it work; esp. my home computer is python 3.4, and i cannot reproduce it at home.

if you add from pdb import set_trace; set_trace() just before the warnings.warn('indexing pas ... line, it will hit but does not show the warning. that is how i traced on a python 2.7.3 .

@jreback
Contributor

jreback commented Nov 15, 2014

Then make another test which does exactly what your whatsnew example does (which shows the warning properly). You can explicitly turn the warnings filter on and off.

The python bug is completely irrelevant; the warning DOES happen. You just need to assert that is happening when you think it should happen. Otherwise when someone else makes a change to this code and the warning disappears it will not be caught (whether on purpose or not).

The key is that travis tests the warning. 2.7 is well in use and will be for the foreseeable future.

You can easily install 2.7 using conda (or even just create another environment, as they are pathed differently).

@behzadnouri force-pushed the mi-key-loc branch 4 times, most recently from d75cb01 to e74c4a0, November 16, 2014 21:20
@behzadnouri
Contributor Author

i added a new test just to test the warning, as in the example above;
it fails on python 2.6 so i added nose.SkipTest for python version less than 2.7
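
A test along those lines might look like the sketch below. It is not the test added in this PR: it uses the frame from the whatsnew example, a hypothetical helper `performance_warnings`, and `sort_index()` in place of the `sortlevel()` call from the 2014 docs, which current pandas no longer provides. The expectation is a `PerformanceWarning` on the unsorted frame and none after sorting.

```python
import warnings

import numpy as np
import pandas as pd


def performance_warnings(func):
    """Run `func` and return the PerformanceWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return [
        w for w in caught
        if issubclass(w.category, pd.errors.PerformanceWarning)
    ]


df = pd.DataFrame(
    {
        "jim": [0, 0, 1, 1],
        "joe": ["x", "x", "z", "y"],
        "jolie": np.random.rand(4),
    }
).set_index(["jim", "joe"])

# Second level is out of order, so this lookup goes past the lexsort depth.
unsorted_hits = performance_warnings(lambda: df.loc[(1, "z")])

# After a lexical sort the same lookup should be silent.
df2 = df.sort_index()
sorted_hits = performance_warnings(lambda: df2.loc[(1, "z")])

print(len(unsorted_hits), len(sorted_hits))
```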

df2 = df.sortlevel()
df2
df2.index.lexsort_depth
df.loc[(1,'z')]
Contributor

this should be df2?

@behzadnouri
Contributor Author

this should be df2?

yes, fixed that

@jreback merged commit d0861e8 into pandas-dev:master Nov 21, 2014
@jreback
Contributor

jreback commented Nov 21, 2014

@behzadnouri thanks for this!

I had to add a method of clearing the warnings (added to assert_produces_warning) as it was failing
for me. but all set now.
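
The committed change is not shown in this thread, so the following is only a rough sketch of what "clearing the warnings" can mean: emptying the `__warningregistry__` of the modules involved before recording, so that a warning already reported once is not suppressed. `clear_warning_registries` and `assert_warns` are hypothetical names, not the pandas API.

```python
import types
import warnings


def clear_warning_registries(*modules):
    """Hypothetical helper: forget which warnings each module already showed."""
    for mod in modules:
        registry = getattr(mod, "__warningregistry__", None)
        if isinstance(registry, dict):
            registry.clear()


def assert_warns(category, func, *modules):
    """Clear the given modules' registries, run `func`, and require `category`."""
    clear_warning_registries(*modules)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    assert any(issubclass(w.category, category) for w in caught), (
        "no %s was emitted" % category.__name__
    )


# A stand-in module whose registry claims a warning was already shown.
demo = types.ModuleType("demo")
demo.__warningregistry__ = {("some message", RuntimeWarning, 1): True}

assert_warns(RuntimeWarning, lambda: warnings.warn("x", RuntimeWarning), demo)
print(demo.__warningregistry__)
```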

thanks!

@behzadnouri deleted the mi-key-loc branch November 22, 2014 14:47