
BUG: dataframe loading with duplicated columns and usecols #11823 #11882


Closed
wants to merge 1 commit into from

Conversation

sxwang
Contributor

@sxwang sxwang commented Dec 22, 2015

Fixes #11823 .

Changed usecols from a set to a list in the C parser to handle duplicated columns and usecols. Updated the Python parser to be consistent with the C parser.

@@ -1770,9 +1771,22 @@ def _handle_usecols(self, columns, usecols_key):
raise ValueError("If using multiple headers, usecols must "
"be integers.")
col_indices = []
# if duplicates exist, save the original usecols_key. GH11823
Contributor

use Index(...) and then use the proper index set operations and .is_unique here.

@jreback jreback added Bug IO CSV read_csv, to_csv labels Dec 23, 2015
@jreback
Contributor

jreback commented Jan 7, 2016

can you update

@sxwang
Contributor Author

sxwang commented Jan 8, 2016

Sorry haven't had a chance, will do soon ...

@sxwang
Contributor Author

sxwang commented Jan 12, 2016

Rewrote the logic using Index and .get_loc. It's still a bit tricky though, having to convert possible outputs of get_loc to unique indices for each duplicated column name.

else: # found duplicates in usecols_key.
# get indices of all instances of u in usecols_key. GH11823
u_index = Index(usecols_key).get_loc(u)
# convert the slice or mask array to index
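For context, a minimal sketch (plain pandas, not the PR's code) of why the `get_loc` output needs converting: on a non-unique `Index` it returns a slice when the duplicates are contiguous and a boolean mask otherwise, and either form must be normalised to integer positions.

```python
import numpy as np
import pandas as pd

# get_loc on a non-unique Index returns different types:
# a slice when the duplicates are contiguous, otherwise a boolean mask.
print(pd.Index(['a', 'a', 'b']).get_loc('a'))  # slice(0, 2, None)
print(pd.Index(['a', 'b', 'a']).get_loc('a'))  # [ True False  True]

# Either form can be normalised to integer positions by indexing
# an arange of the same length:
loc = pd.Index(['a', 'b', 'a']).get_loc('a')
positions = np.arange(3)[loc]
print(positions)  # [0 2]
```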
Contributor

this is way too complicated. Please show the inputs in the cases (e.g. duplicate and non-duplicate).

Contributor Author

I agree. Here are some examples of various usecols, with and without duplicates, for when names contain duplicates.

pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a'])
Out: 
   a
0  1
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out: 
   a  a
0  3  3
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','b'])
Out: 
   a  b
0  1  2
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','b','a'])
Out: 
   a  b  a
0  3  2  3

In retrospect I'm not sure if this complexity is worth it here, and in any case one can still argue that the above behavior is not ideal ...

What about throwing an error when duplicated columns exist with usecols? Or leave the python engine alone and edit only the c engine parser so that the "(b) looks like a bug" part of the original issue #11823 is addressed, but not the "(a) different" part?

Contributor

what I mean is you don't need a loop and it should be much, much simpler, something like:

(Pdb) p usecols_key 
['a', 'b', 'a']
(Pdb) !res = Index(self.usecols).get_indexer(usecols_key)
(Pdb) p res
array([ 0, -1,  0])
(Pdb) p pd.unique(res)
array([ 0, -1])

you can simply ignore the -1 (no column found)

Contributor Author

This only works if self.usecols is unique though, right?

usecols=['a','b','a']
pd.Index(usecols).get_indexer(['a'])

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I could do np.unique(usecols) before turning it into Index, but that would lose the duplicate column information which was the whole point.

Contributor

you can use .get_indexer_for which will return ordered indexers for non-uniques
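For reference, a minimal sketch (not from the thread) of the difference: `get_indexer` rejects a non-unique `Index`, while `get_indexer_for` falls back to non-unique lookup and returns the positions of every match.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'a'])  # non-unique, like usecols=['a','b','a']

# get_indexer_for works on non-unique indexes and returns the
# positions of every occurrence of each requested label.
positions = idx.get_indexer_for(['a'])
print(positions)  # [0 2]
```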

@jorisvandenbossche
Member

Before trying to solve the bug (if it is a bug), what is actually the desired behaviour?

Current behaviour on the example you gave:

In [1]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
   ...:             names=['a', 'b', 'a'], usecols=['a'])
Out[1]:
   a
0  1

In [2]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
   ...:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[2]:
   a
0  1

In the example output you gave here #11882 (comment), you changed Out[2] to include two columns:

pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out: 
   a  a
0  3  3

But I don't think this is the desired behaviour? Shouldn't it rather be something like:

pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out: 
   a  a
0  1  3

And I would argue that Out[1] (so using usecols=['a']) should yield the same.

@sxwang
Contributor Author

sxwang commented Jan 15, 2016

The desired behavior is simply to match the engine='c' parser, which is arguably not ideal either. For the example case with ['a', 'a']:

pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out[2]: 
   a  a
0  3  3

I agree that your suggested output of [1, 3] would be better. Then need to fix the c parser as well ...

@jorisvandenbossche
Member

Well, I would rather say that the behaviour of the c parser is a bug.
In any case, both are wrong IMHO, and I find the behaviour of the c parser not any better.

And BTW, I get even another behaviour on master:

In [1]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None,
   ...:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[1]:
   a  a
0  1  1

@jorisvandenbossche
Member

@sxwang

What about throwing an error when duplicated columns exist with usecols? Or leave the python engine alone and edit only the c engine parser so that the "(b) looks like a bug" part of the original issue #11823 is addressed, but not the "(a) different" part?

I think this is indeed better

@jreback
Contributor

jreback commented Jan 15, 2016

yeah, I think the ideal is to make the c parser / python parser the same for now (part b)

@sxwang
Contributor Author

sxwang commented Jan 16, 2016

@jorisvandenbossche which one is indeed better?
If I leave the python parser alone for now and just edit the C parser to fix the NaN bug in the original issue, then we get this:

In [5]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[5]: 
   a  a  b
0  2  2  3
In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[6]: 
   a  b
0  1  3

If there is consensus, I'll continue to try to fix the python parser to match the still-non-ideal c parser behavior with hopefully less complicated code.

@jorisvandenbossche
Member

@sxwang sorry, that was not really clear. I meant that I would rather try to fix the bug than just make the behaviour of the c and python parsers the same but still buggy. (Or if that's too difficult, I think raising an error is even better than the current situation.)

yeh I think the ideal is is to make c-parser / python parser the same for now (part b)

@jreback that was actually part (a) :-) But kidding aside, I rather think we should fix the bug than just make their behaviour the same.
But the question remains what we would like this to do.

@sxwang Personally, I think I would like to see this behaviour:

In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
                    names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[6]: 
   a  a  b
0  1  2  3

In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
                    names=['a', 'a', 'b'], usecols=['a','b'])
Out[6]: 
   a  a  b
0  1  2  3

So once the column 'a' is specified in the usecols keyword, get all columns with this name, regardless of how many times it is specified in usecols, as the behaviour of this keyword should not depend on the order of the columns specified or how many times a column is specified.
Of course this can be discussed, but to me that seems the most sensible (and predictable) thing to do.

Of course the current bug in the C parser is even worse than the behaviour in the python parser, but ideally we should fix them both at once.
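The semantics proposed above (select every column whose name appears in usecols, independent of order or repetition) can be sketched without pandas; `resolve_usecols` is a hypothetical helper name, not code from the PR:

```python
def resolve_usecols(names, usecols):
    """Return positions of every column whose name occurs in usecols.

    Repetition and order within usecols do not matter: once a name
    is requested, all columns carrying that name are selected.
    """
    wanted = set(usecols)
    return [i for i, name in enumerate(names) if name in wanted]

print(resolve_usecols(['a', 'a', 'b'], ['a', 'b']))       # [0, 1, 2]
print(resolve_usecols(['a', 'a', 'b'], ['a', 'a', 'b']))  # [0, 1, 2]
print(resolve_usecols(['a', 'b', 'a'], ['a']))            # [0, 2]
```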

@jreback
Contributor

jreback commented Mar 12, 2016

can you rebase/update

@sxwang sxwang force-pushed the dup_column_usecols branch 3 times, most recently from 19e9a4e to 5474511 Compare March 15, 2016 02:09
@sxwang
Contributor Author

sxwang commented Mar 15, 2016

Rebased.

On Sat, Mar 12, 2016 at 9:49 AM Jeff Reback [email protected]
wrote:

can you rebase/update



@jreback
Contributor

jreback commented Mar 15, 2016

So I don't think test coverage is sufficient here. We need these 2 to work (you can actually just add a single test w/o specifying the engine and use self.read_csv in test_parsers.py). The trick is to hard-code the expected frame.

In [1]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
   ...: names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[1]: 
   a  a    b
0  2  2  NaN

In [2]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
   ...: names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[2]: 
   a  a  b
0  2  2  3

col_indices.append(usecols_key.index(u))
else:
col_indices.append(u)
col_indices = Index(usecols_key).get_indexer_for(self.usecols)
Contributor

In [4]: Index(list('aab')).get_indexer_for(list('ab'))
Out[4]: Int64Index([0, 1, 2], dtype='int64')

Index(usecols_key).get_indexer_for(pd.unique(self.usecols)) might do the trick
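A minimal sketch of the suggested combination, assuming `usecols_key` holds the parsed column names and `usecols` the user's selection: deduplicating the selection first keeps repeated requests from repeating indices, while `get_indexer_for` still gathers every matching column position.

```python
import pandas as pd

usecols_key = ['a', 'a', 'b']   # column names, with duplicates
usecols = ['a', 'a', 'b']       # user-supplied selection

# Deduplicate the selection, then gather all matching positions.
col_indices = pd.Index(usecols_key).get_indexer_for(pd.unique(usecols))
print(col_indices)  # [0 1 2]
```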

@sxwang
Contributor Author

sxwang commented Mar 19, 2016

I messed something up ... sorry, still trying to learn my way around git. Let me close and submit a new PR ...

@jorisvandenbossche
Member

@sxwang you don't need to open a new PR to clean it up here. If you can clean up your branch locally, just force-push it to your origin/dup_column_usecols branch and it will be updated here

@sxwang sxwang force-pushed the dup_column_usecols branch from d2f4c51 to 7b4186f Compare March 20, 2016 00:11
@sxwang sxwang force-pushed the dup_column_usecols branch from 7b4186f to 27349b9 Compare March 20, 2016 03:20
@sxwang
Contributor Author

sxwang commented Mar 20, 2016

Ok, I managed to fix the branch.

I think there are 2 distinct things to test, which are reflected in the new test I added and the comments:
(1) matching behavior with and without usecols

In [4]: pd.read_csv(StringIO("""1,2,3"""), names=['a', 'a', 'b'], header=None, usecols=['a', 'a', 'b'])
Out[4]: 
   a  a  b
0  2  2  3

In [5]: pd.read_csv(StringIO("""1,2,3"""), names=['a', 'a', 'b'], header=None)
Out[5]: 
   a  a  b
0  2  2  3

(2) matching behavior between c and python engines

In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='c', names=['a', 'a', 'b'], header=None, usecols=['a', 'a', 'b'])
Out[6]: 
   a  a  b
0  2  2  3

In [7]: pd.read_csv(StringIO("""1,2,3"""), engine='python', names=['a', 'a', 'b'], header=None, usecols=['a', 'a', 'b'])
Out[7]: 
   a  a  b
0  2  2  3

result = self.read_csv(StringIO(data), names=['a', 'a', 'b'],
header=None, usecols=['a', 'a', 'b'])
expected = self.read_csv(StringIO(data), names=['a', 'a', 'b'],
header=None)
Contributor

I'd like to use a fixed result here, IOW actually construct the expected DataFrame, rather than reading it in with a different set of options.
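A minimal sketch (not from the PR) of hard-coding the expected frame with duplicate column labels instead of deriving it from a second read_csv call, using the "1,2,3" example from this thread:

```python
import pandas as pd

# Duplicate column labels are allowed when constructing a frame directly.
expected = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])
print(expected)
#    a  a  b
# 0  1  2  3
```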

@jreback
Contributor

jreback commented Mar 20, 2016

I think this is the result we'd like to achieve (in all the above cases).

In [7]: df = pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None)

In [8]: df.columns=list('aab')

In [9]: df
Out[9]: 
   a  a  b
0  1  2  3

@jreback jreback added this to the 0.18.1 milestone Mar 20, 2016
@jorisvandenbossche
Member

@sxwang quoting your expected behaviour of above:

In [5]: pd.read_csv(StringIO("""1,2,3"""), names=['a', 'a', 'b'], header=None)
Out[5]: 
   a  a  b
0  2  2  3

IMO this should not be the expected behaviour (unless it was a typo?), as you don't get the values that were in your csv file, even though you only provided names for the columns (no selection of the columns).

@jreback
Contributor

jreback commented Mar 20, 2016

@sxwang note that I just merged #12551; though this shouldn't affect this PR (even though it involves usecols)

@jreback
Contributor

jreback commented Mar 20, 2016

@sxwang also have a look at some discussion on #12678

@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 18, 2016
@jreback
Contributor

jreback commented Apr 18, 2016

can you rebase/update

@jreback jreback removed this from the 0.18.2 milestone Apr 18, 2016
@sxwang
Contributor Author

sxwang commented Apr 19, 2016

I think making this have the "ideal" behavior is quite involved ... and I'm afraid I don't really have the bandwidth to work on this anymore. Maybe somebody else can take a stab at it. Should I close the PR?

@gfyoung
Member

gfyoung commented May 2, 2016

@sxwang : I too have been working on fixing duplicate columns issues. If you don't mind, I'd be happy to pick this one up / resolve this issue once my first PR for duplicate columns gets merged.

@jreback : From what I see in this discussion, I should point out that I am not the only person who has been stating that this "ideal" (which you lay out here) would take a lot of work / refactoring to do. Two different contributors (@sxwang and myself) independently working on this issue have expressed this opinion. If you really do believe your idea of lists is that simple to implement, then IMHO it should not take very long for you to do so yourself, because right now neither of us seems to see an easy path ATM.

@sxwang
Contributor Author

sxwang commented May 3, 2016

@gfyoung thanks, I'll close this one.
