
API: support multiple indexers for .iloc with a MultiIndex #7490


Closed
shoyer opened this issue Jun 18, 2014 · 14 comments

Comments

@shoyer
Member

shoyer commented Jun 18, 2014

MultiIndexing with multiple indexers (#6301) via .loc is great.

It would be nice to mirror this functionality with .iloc.

To my understanding, until this change, loc and iloc had a mirror syntax, where if you replaced all of your index labels with arrays of 0-indexed integers, they were equivalent, e.g., for the following series:

import pandas as pd
midx = pd.MultiIndex.from_product([range(3), range(5)])
s = pd.Series(range(15), midx)

Now they lack this symmetry, because indexing like s.iloc[0, 0] doesn't work like s.loc[0, 0]. I found this surprising. Thoughts?
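A minimal sketch of the asymmetry described above, assuming a reasonably recent pandas. The exception is caught generically because the import path of IndexingError has moved between pandas versions; the variable names are only illustrative.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([range(3), range(5)])
s = pd.Series(range(15), index=midx)

# Label-based lookup accepts one key per index level.
by_label = s.loc[0, 0]

# Position-based lookup treats a Series as one-dimensional,
# so a second indexer is rejected.
try:
    s.iloc[0, 0]
    iloc_error = None
except Exception as exc:  # reported as IndexingError in this thread
    iloc_error = type(exc).__name__

print(by_label, iloc_error)
```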

@jreback
Contributor

jreback commented Jun 18, 2014

You need to use a tuple: (0,0).
In fact, NOT using a tuple, though it works, is a bit wonky to support.

@shoyer
Member Author

shoyer commented Jun 18, 2014

As I'm sure you know, Python's __getitem__ syntax makes no distinction between x[(0, 0)] and x[0, 0], though I agree that the former makes the intent here clearer.
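To illustrate the point about `__getitem__`, here is a small sketch with a hypothetical helper class (`ShowKey` is not part of pandas) that simply returns the key it receives:

```python
class ShowKey:
    """Hypothetical helper: hands back whatever key __getitem__ was given."""

    def __getitem__(self, key):
        return key


x = ShowKey()

# Both spellings deliver the same tuple to __getitem__.
bare = x[0, 0]
parenthesized = x[(0, 0)]

# A list, by contrast, arrives as a list and can be told apart from a tuple.
as_list = x[[0, 0]]

print(bare, parenthesized, as_list)
```

So any special handling of s.iloc[0, 0] necessarily applies to s.iloc[(0, 0)] as well; only a list key is distinguishable.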

In any case, s.iloc[(0, 0)] should work.

@jreback
Contributor

jreback commented Jun 18, 2014

Indexing interpretation is amazingly complex.

Think about what df.loc[0,0] could possibly be.

So I guess this is technically a bug, in that iloc for Series should also try to index into a MultiIndex when passed multiple indexers.

Why don't you give it a shot to fix?

@jorisvandenbossche
Member

It doesn't matter whether you use a tuple or not (for a series!). With loc, s.loc[0,0] works just as expected. (But of course, for a dataframe this is something else; there you would need df.loc[(0,0),:].)

But I think the more important question is: what should it mean? You could say that if you are thinking of integer locations based on the whole dataframe, there is only one first row.
What would s.iloc[(0, 1)] mean? First location in level 0, but second location in level 1? That is a bit contradictory, no? I think you are actually thinking "the first location of the values where the label in level 0 is equal to the first occurring label in that level" (something like s.groupby(level=0).get_group(0).iloc[0]). But what if the index labels are not sorted? What should it mean then?

@jorisvandenbossche
Member

By the way, if you do this with a dataframe with iloc, it interprets the tuple as a list of two integer locations:

In [109]: s.to_frame().iloc[(0,0),:]
Out[109]: 
     0
0 0  0
  0  0

So for a dataframe, df.iloc[(0,0),:] and df.iloc[[0,0],:] are equivalent, and you could argue that s.iloc[(0,0)] (and thus s.iloc[0,0]) should also do the same as s.iloc[[0,0]] and return two rows.

@jreback
Contributor

jreback commented Jun 18, 2014

So I wonder, then: should iloc with a MultiIndex and a tuple be an error, since you probably mean a list (and if so, it should be specified as a list and not interpreted that way)?

The ordering of a MultiIndex is only guaranteed when sorted.

@jorisvandenbossche
Member

But with a dataframe, a tuple is interpreted as a list (as in df.iloc[(0,1),:]); shouldn't this be the same with a series?

However, that is a minor point. The main thing is that I think multi-indexing with iloc does not make sense (seeing the location as 'flat' even if you have multiple levels of labels). Or does it?

@jreback
Contributor

jreback commented Jun 18, 2014

Hmm, this seems to report a correct error. @shoyer, are you seeing something different?

In [1]: midx = pd.MultiIndex.from_product([range(3), range(5)])

In [2]: s = pd.Series(range(15), midx)

In [3]: s
Out[3]: 
0  0     0
   1     1
   2     2
   3     3
   4     4
1  0     5
   1     6
   2     7
   3     8
   4     9
2  0    10
   1    11
   2    12
   3    13
   4    14
dtype: int64

In [6]: s.loc[0,1]
Out[6]: 1

In [7]: s.loc[(0,1)]
Out[7]: 1

In [8]: s.iloc[(0,0)]
IndexingError: Too many indexers

In [9]: s.iloc[0,0]
IndexingError: Too many indexers

@shoyer
Member Author

shoyer commented Jun 18, 2014

@jreback I definitely do see the same error, and IndexingError: Too many indexers seems like the right error (if this is actually prohibited).

@jorisvandenbossche This is an interesting point about nested tuple indexing on a DataFrame invoking fancy indexing. That is indeed consistent with how numpy does things.

So it looks like we could not add this without breaking some user code, although I do think it is rather unusual to use tuples (instead of lists or arrays) for indexers along a dimension, given how it doesn't work for 1D. I would be OK breaking the current nested tuple indexing, but that is definitely a design trade-off. (Note that .loc is already different from numpy indexing in some cases, for example if you do fancy indexing in multiple dimensions at once.)
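For reference, a short sketch of the numpy behavior being compared against, using a made-up 3x2 array: a tuple nested inside the index is treated like a list (fancy indexing) along that dimension, whereas a bare tuple on a 1D array is read as one indexer per dimension and fails.

```python
import numpy as np

a = np.arange(6).reshape(3, 2)

# A nested tuple selects rows by position, just like a list would.
rows_from_tuple = a[(0, 0), :]
rows_from_list = a[[0, 0], :]

# On a 1D array, the same tuple means "two indexers", which is too many.
b = np.arange(3)
try:
    b[(0, 0)]
    one_d_error = None
except IndexError as exc:
    one_d_error = type(exc).__name__

print(rows_from_tuple, one_d_error)
```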

Let me try to reproduce your pathological case (in a series, for simplicity):

>>> idx = pd.MultiIndex([['a', 'b'], [2, 1]], [[0, 0, 1, 1], [0, 1, 1, 0]])
>>> idx
MultiIndex(levels=[[u'a', u'b'], [2, 1]],
           labels=[[0, 0, 1, 1], [0, 1, 1, 0]])
>>> s = pd.Series(np.arange(4), idx)
>>> s
a  2    0
   1    1
b  1    2
   2    3
dtype: int64

.iloc should use the MultiIndex labels, which would mean s.iloc[(1, 0)] == 3 (not 2). I do agree this is somewhat counter-intuitive if the levels aren't sorted, but this is an unsupported corner case: .iloc already contains a warning about sorted labels:
Warning You will need to make sure that the selection axes are fully lexsorted!
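A rough way to check the sortedness condition on the example above, assuming a recent pandas where `is_monotonic_increasing` on a MultiIndex reflects lexicographic order of its entries. The index is rebuilt here with `from_tuples`, so its levels are stored sorted, unlike the hand-built index in the comment; only the row order is the same.

```python
import pandas as pd

# Same row order as the example: (a, 2), (a, 1), (b, 1), (b, 2).
idx = pd.MultiIndex.from_tuples([("a", 2), ("a", 1), ("b", 1), ("b", 2)])
s = pd.Series(range(4), index=idx)

# The rows are not in lexicographic order as given.
unsorted_ok = s.index.is_monotonic_increasing

# sort_index() reorders the rows so the index is lexsorted.
s_sorted = s.sort_index()
sorted_ok = s_sorted.index.is_monotonic_increasing

print(unsorted_ok, sorted_ok, list(s_sorted))
```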

@jorisvandenbossche
Member

@shoyer I don't think I follow your example. Can you explain why you think s.iloc[(1,0)] should be 3 and not 2?
And when do you get that warning when using iloc?

@jreback
Contributor

jreback commented Jun 18, 2014

related is #5420

@shoyer
Member Author

shoyer commented Jun 19, 2014

OK, here's a prototype of my proposed functionality:

import numpy as np
import pandas as pd

def get_iloc(index, indexer):
    int_levels = [np.arange(len(level)) for level in index.levels]
    return pd.MultiIndex(int_levels, index.labels).get_loc(indexer)

def iloc(series, indexer):
    return series.iloc[get_iloc(series.index, indexer)]

And some code to delve into these issues:

idx = pd.MultiIndex([[0, 1], [0, 1]], [[0, 0, 1, 1], [0, 1, 1, 0]])
s = pd.Series(np.arange(4), idx, name='s')

idx2 = pd.MultiIndex([[1, 0], [1, 0]], [[1, 1, 0, 0], [1, 0, 0, 1]])
s2 = pd.Series(np.arange(4), idx2, name='s2')

data = [(i, j, s.loc[(i, j)], s2.loc[(i, j)],
         iloc(s, (i, j)), iloc(s2, (i, j)))
        for i in range(2) for j in range(2)]
results = pd.DataFrame.from_records(
    data, columns=['i', 'j', 'loc', 'loc2', 'iloc', 'iloc2']
    ).set_index(['i', 'j'])
>>> print s
0  0    0
   1    1
1  1    2
   0    3
Name: s, dtype: int64
>>> print s2
0  0    0
   1    1
1  1    2
   0    3
Name: s2, dtype: int64
>>> print results
     loc  loc2  iloc  iloc2
i j                        
0 0    0     0     0      2
  1    1     1     1      3
1 0    3     3     3      1
  1    2     2     2      0

So yes, as you can see, this proposal for iloc gives inconsistent results if the multi-index is not lexsorted -- but otherwise gives results that are fully consistent with loc for integer multi-indexes.

I'm not sure it's possible to define this sort of indexing unambiguously without lexsorting, but again, that is a mostly standard constraint of MultiIndex.
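For readers on later pandas releases, where the `labels` attribute and constructor argument of MultiIndex are named `codes`, the prototype can be sketched as follows. This is an adaptation under that assumption, not the code as originally posted; it reproduces the (0, 0) entry of the `loc2` and `iloc2` columns for the non-lexsorted index.

```python
import numpy as np
import pandas as pd


def get_iloc(index, indexer):
    # Swap each level's labels for their integer positions, keep the codes,
    # then look the indexer up as if it were a label tuple.
    int_levels = [np.arange(len(level)) for level in index.levels]
    positional = pd.MultiIndex(levels=int_levels, codes=index.codes)
    return positional.get_loc(indexer)


# Non-lexsorted index from the comment above (levels stored as [1, 0]).
idx2 = pd.MultiIndex(levels=[[1, 0], [1, 0]],
                     codes=[[1, 1, 0, 0], [1, 0, 0, 1]])
s2 = pd.Series(np.arange(4), index=idx2, name="s2")

by_label = s2.loc[(0, 0)]                       # looks up the label (0, 0)
by_code = s2.iloc[get_iloc(s2.index, (0, 0))]   # looks up the code pair (0, 0)

print(by_label, by_code)
```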

@jreback
Contributor

jreback commented Jun 19, 2014

@shoyer how is this useful? We already have many types of indexing, and it is a struggle to keep everything consistent as it is.

@shoyer
Member Author

shoyer commented Jun 23, 2014

Now that we've thought through the full implications of how this could work, I'm no longer convinced this is a good idea. Reasoning about non-lexsorted indexes is pretty convoluted, and I support .loc being as ndarray-like as possible.

@jreback jreback closed this as completed Jun 23, 2014
3 participants