
BUG: dataframe loading with duplicated columns and usecols #11823 #11882


Closed
wants to merge 1 commit into from

Conversation

sxwang
Contributor

@sxwang sxwang commented Dec 22, 2015

Fixes #11823 .

Changed usecols from a set to a list in the C parser to handle duplicated columns and usecols. Updated the Python parser to be consistent with the C parser.

@@ -1770,9 +1771,22 @@ def _handle_usecols(self, columns, usecols_key):
raise ValueError("If using multiple headers, usecols must "
"be integers.")
col_indices = []
# if duplicates exist, save the original usecols_key. GH11823
Contributor

use Index(...) and then use the proper index set operations and .is_unique here.

@jreback jreback added Bug IO CSV read_csv, to_csv labels Dec 23, 2015
@jreback
Contributor

jreback commented Jan 7, 2016

can you update

@sxwang
Contributor Author

sxwang commented Jan 8, 2016

Sorry haven't had a chance, will do soon ...

@sxwang
Contributor Author

sxwang commented Jan 12, 2016

Rewrote the logic using Index and .get_loc. It's still a bit tricky though, having to convert possible outputs of get_loc to unique indices for each duplicated column name.

else: # found duplicates in usecols_key.
# get indices of all instances of u in usecols_key. GH11823
u_index = Index(usecols_key).get_loc(u)
# convert the slice or mask array to index
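For context, a minimal sketch (plain pandas, not the PR's code) of why the `get_loc` output needs converting: on a non-unique `Index` it returns a slice when the duplicates are contiguous and a boolean mask otherwise, and either form must be normalised to integer positions.

```python
import numpy as np
import pandas as pd

# get_loc on a non-unique Index returns different types:
# a slice when the duplicates are contiguous, otherwise a boolean mask.
print(pd.Index(['a', 'a', 'b']).get_loc('a'))  # slice(0, 2, None)
print(pd.Index(['a', 'b', 'a']).get_loc('a'))  # [ True False  True]

# Either form can be normalised to integer positions by indexing
# an arange of the same length:
loc = pd.Index(['a', 'b', 'a']).get_loc('a')
positions = np.arange(3)[loc]
print(positions)  # [0 2]
```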
Contributor

this is way too complicated. Please show the inputs in the cases (e.g. duplicate and non-duplicate).

Contributor Author

I agree. Here are some examples of various usecols, with and without duplicates, for when names contain duplicates.

pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a'])
Out: 
   a
0  1
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out: 
   a  a
0  3  3
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','b'])
Out: 
   a  b
0  1  2
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','b','a'])
Out: 
   a  b  a
0  3  2  3

In retrospect I'm not sure if this complexity is worth it here, and in any case one can still argue that the above behavior is not ideal ...

What about throwing an error when duplicated columns exist with usecols? Or leave the python engine alone and edit only the c engine parser so that the "(b) looks like a bug" part of the original issue #11823 is addressed, but not the "(a) different" part?

Contributor

what I mean is you don't need a loop and it should be much, much simpler, something like:

(Pdb) p usecols_key 
['a', 'b', 'a']
(Pdb) !res = Index(self.usecols).get_indexer(usecols_key)
(Pdb) p res
array([ 0, -1,  0])
(Pdb) p pd.unique(res)
array([ 0, -1])

you can simply ignore the -1 (no column found)

Contributor Author

This only works if self.usecols is unique though, right?

usecols=['a','b','a']
pd.Index(usecols).get_indexer(['a'])

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I could do np.unique(usecols) before turning it into Index, but that would lose the duplicate column information which was the whole point.

Contributor

you can use .get_indexer_for which will return ordered indexers for non-uniques
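For reference, a minimal sketch (not from the thread) of the difference: `get_indexer` rejects a non-unique `Index`, while `get_indexer_for` falls back to non-unique lookup and returns the positions of every match.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'a'])  # non-unique, like usecols=['a','b','a']

# get_indexer_for works on non-unique indexes and returns the
# positions of every occurrence of each requested label.
positions = idx.get_indexer_for(['a'])
print(positions)  # [0 2]
```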

@jorisvandenbossche
Member

Before trying to solve the bug (if it is a bug), what is actually the desired behaviour?

Current behaviour on the example you gave:

In [1]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
   ...:             names=['a', 'b', 'a'], usecols=['a'])
Out[1]:
   a
0  1

In [2]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
   ...:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[2]:
   a
0  1

In the example output you gave here #11882 (comment), you changed Out[2] to include two columns:

pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out: 
   a  a
0  3  3

But I don't think this is the desired behaviour? Shouldn't it rather be something like:

pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out: 
   a  a
0  1  3

And I would argue that Out[1] (so using usecols=['a']) should yield the same.

@sxwang
Contributor Author

sxwang commented Jan 15, 2016

The desired behavior is simply to match the engine='c' parser, which is arguably not ideal either. For the example case with ['a', 'a']:

pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out[2]: 
   a  a
0  3  3

I agree that your suggested output of [1, 3] would be better. Then need to fix the c parser as well ...

@jorisvandenbossche
Member

Well, I would rather say that the behaviour of the c parser is a bug.
In any case, both are wrong IMHO, and I find the behaviour of the c parser not any better.

And BTW, I get even another behaviour on master:

In [1]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None,
   ...:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[1]:
   a  a
0  1  1

@jorisvandenbossche
Member

@sxwang

What about throwing an error when duplicated columns exist with usecols? Or leave the python engine alone and edit only the c engine parser so that the "(b) looks like a bug" part of the original issue #11823 is addressed, but not the "(a) different" part?

I think this is indeed better

@jreback
Contributor

jreback commented Jan 15, 2016

yeah, I think the ideal is to make the c parser / python parser the same for now (part b)

@sxwang
Contributor Author

sxwang commented Jan 16, 2016

@jorisvandenbossche which one is indeed better?
If I leave the python parser alone for now and just edit the C parser to fix the NaN bug in the original issue, then we get this:

In [5]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[5]: 
   a  a  b
0  2  2  3
In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[6]: 
   a  b
0  1  3

If there is consensus, I'll continue to try to fix the python parser to match the still-non-ideal c parser behavior with hopefully less complicated code.

@jorisvandenbossche
Member

@sxwang sorry, that was not really clear. I meant that I would rather try to fix the bug than just make the behaviour of the c and python parsers the same but still buggy. (Or if that's too difficult, I think raising an error is even better than the current situation.)

yeh I think the ideal is is to make c-parser / python parser the same for now (part b)

@jreback that was actually part (a) :-) But kidding aside, I rather think we should fix the bug than just make their behaviour the same.
But the question remains what we would like this to do.

@sxwang Personally, I think I would like to see this behaviour:

In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
                    names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[6]: 
   a  a  b
0  1  2  3

In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
                    names=['a', 'a', 'b'], usecols=['a','b'])
Out[6]: 
   a  a  b
0  1  2  3

So once the column 'a' is specified in the usecols keyword, get all columns with this name, regardless of how many times it is specified in usecols, as the behaviour of this keyword should not depend on the order of the columns specified or how many times a column is specified.
Of course this can be discussed, but to me that seems the most sensible (and predictable) thing to do.

Of course the current bug in the C parser is even worse than the behaviour in the python parser, but ideally we should fix them both at once.
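The semantics proposed above (select every column whose name appears in usecols, independent of order or repetition) can be sketched without pandas; `resolve_usecols` is a hypothetical helper name, not code from the PR:

```python
def resolve_usecols(names, usecols):
    """Return positions of every column whose name occurs in usecols.

    Repetition and order within usecols do not matter: once a name
    is requested, all columns carrying that name are selected.
    """
    wanted = set(usecols)
    return [i for i, name in enumerate(names) if name in wanted]

print(resolve_usecols(['a', 'a', 'b'], ['a', 'b']))       # [0, 1, 2]
print(resolve_usecols(['a', 'a', 'b'], ['a', 'a', 'b']))  # [0, 1, 2]
print(resolve_usecols(['a', 'b', 'a'], ['a']))            # [0, 2]
```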

@jreback
Contributor

jreback commented Mar 12, 2016

can you rebase/update

@sxwang sxwang force-pushed the dup_column_usecols branch 3 times, most recently from 19e9a4e to 5474511 Compare March 15, 2016 02:09
@sxwang
Contributor Author

sxwang commented Mar 15, 2016

Rebased.

On Sat, Mar 12, 2016 at 9:49 AM Jeff Reback [email protected]
wrote:

can you rebase/update



@jreback
Contributor

jreback commented Mar 15, 2016

So I don't think test coverage is sufficient here. We need these 2 to work (you can actually just add a single test w/o specifying the engine and use self.read_csv in test_parsers.py). The trick is to hard-code the expected frame.

In [1]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
   ...: names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[1]: 
   a  a    b
0  2  2  NaN

In [2]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
   ...: names=['a', 'a', 'b'], usecols=['a','a','b'])
Out[2]: 
   a  a  b
0  2  2  3

col_indices.append(usecols_key.index(u))
else:
col_indices.append(u)
col_indices = Index(usecols_key).get_indexer_for(self.usecols)
Contributor

In [4]: Index(list('aab')).get_indexer_for(list('ab'))
Out[4]: Int64Index([0, 1, 2], dtype='int64')

Index(usecols_key).get_indexer_for(pd.unique(self.usecols)) might do the trick
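A minimal sketch of the suggested combination, assuming `usecols_key` holds the parsed column names and `usecols` the user's selection: deduplicating the selection first keeps repeated requests from repeating indices, while `get_indexer_for` still gathers every matching column position.

```python
import pandas as pd

usecols_key = ['a', 'a', 'b']   # column names, with duplicates
usecols = ['a', 'a', 'b']       # user-supplied selection

# Deduplicate the selection, then gather all matching positions.
col_indices = pd.Index(usecols_key).get_indexer_for(pd.unique(usecols))
print(col_indices)  # [0 1 2]
```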

@sxwang
Contributor Author

sxwang commented Mar 19, 2016

I messed something up ... sorry, still trying to learn my way around git. Let me close and submit a new PR ...

@jorisvandenbossche
Member

@sxwang you don't need to open a new PR to clean it up here. If you can clean up your branch locally, just force-push it to your origin/dup_column_usecols branch and it will be updated here

@sxwang sxwang force-pushed the dup_column_usecols branch from d2f4c51 to 7b4186f Compare March 20, 2016 00:11
@sxwang sxwang force-pushed the dup_column_usecols branch from 7b4186f to 27349b9 Compare March 20, 2016 03:20
@sxwang
Contributor Author

sxwang commented Mar 20, 2016

Ok, I managed to fix the branch.

I think there are 2 distinct things to test, which are reflected in the new test I added and the comments:
(1) matching behavior with and without usecols

In [4]: pd.read_csv(StringIO("""1,2,3"""), names=['a', 'a', 'b'], header=None, usecols=['a', 'a', 'b'])
Out[4]: 
   a  a  b
0  2  2  3

In [5]: pd.read_csv(StringIO("""1,2,3"""), names=['a', 'a', 'b'], header=None)
Out[5]: 
   a  a  b
0  2  2  3

(2) matching behavior between c and python engines

In [6]: pd.read_csv(StringIO("""1,2,3"""), engine='c', names=['a', 'a', 'b'], header=None, usecols=['a', 'a', 'b'])
Out[6]: 
   a  a  b
0  2  2  3

In [7]: pd.read_csv(StringIO("""1,2,3"""), engine='python', names=['a', 'a', 'b'], header=None, usecols=['a', 'a', 'b'])
Out[7]: 
   a  a  b
0  2  2  3

result = self.read_csv(StringIO(data), names=['a', 'a', 'b'],
header=None, usecols=['a', 'a', 'b'])
expected = self.read_csv(StringIO(data), names=['a', 'a', 'b'],
header=None)
Contributor

I'd like to use a fixed result here, IOW actually construct the expected DataFrame, rather than reading it in with a different set of options.
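A minimal sketch (not from the PR) of hard-coding the expected frame with duplicate column labels instead of deriving it from a second read_csv call, using the "1,2,3" example from this thread:

```python
import pandas as pd

# Duplicate column labels are allowed when constructing a frame directly.
expected = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])
print(expected)
#    a  a  b
# 0  1  2  3
```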

@jreback
Contributor

jreback commented Mar 20, 2016

I think this is the result we'd like to achieve (in all the above cases).

In [7]: df = pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None)

In [8]: df.columns=list('aab')

In [9]: df
Out[9]: 
   a  a  b
0  1  2  3

@jreback jreback added this to the 0.18.1 milestone Mar 20, 2016
@jorisvandenbossche
Member

@sxwang quoting your expected behaviour of above:

In [5]: pd.read_csv(StringIO("""1,2,3"""), names=['a', 'a', 'b'], header=None)
Out[5]: 
   a  a  b
0  2  2  3

IMO this should not be the expected behaviour (unless it was a typo?), as you don't get the values that were in your csv file, even though you only provided names for the columns (no selection of the columns).

@jreback
Contributor

jreback commented Mar 20, 2016

@sxwang note that I just merged #12551; though this shouldn't affect this PR (even though it involves usecols)

@jreback
Contributor

jreback commented Mar 20, 2016

@sxwang also have a look at some discussion on #12678

@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 18, 2016
@jreback
Contributor

jreback commented Apr 18, 2016

can you rebase/update

@jreback jreback removed this from the 0.18.2 milestone Apr 18, 2016
@sxwang
Contributor Author

sxwang commented Apr 19, 2016

I think making this have the "ideal" behavior is quite involved ... and I'm afraid I don't really have the bandwidth to work on this anymore. Maybe somebody else can take a stab at it. Should I close the PR?

@gfyoung
Member

gfyoung commented May 2, 2016

@sxwang : I too have been working on fixing duplicate columns issues. If you don't mind, I'd be happy to pick this one up / resolve this issue once my first PR for duplicate columns gets merged.

@jreback : From what I see in this discussion, I should point out that I am not the only person who has been stating that this "ideal" (which you lay out here) would take a lot of work / refactoring to do. Two different contributors (@sxwang and myself) independently working on this issue have expressed this opinion. If you really do believe your idea of lists is that simple to implement, then IMHO it should not take very long for you to do so yourself, because right now neither of us seems to see an easy path ATM.

@sxwang
Contributor Author

sxwang commented May 3, 2016

@gfyoung thanks, I'll close this one.
