BUG: dataframe loading with duplicated columns and usecols #11823 #11882
Conversation
@@ -1770,9 +1771,22 @@ def _handle_usecols(self, columns, usecols_key):
raise ValueError("If using multiple headers, usecols must "
                 "be integers.")
col_indices = []
# if duplicates exist, save the original usecols_key. GH11823
use Index(...) and then use the proper index set operations and .is_unique here.
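A rough sketch of that suggested direction, assuming usecols holds the user-requested column labels and usecols_key the names parsed from the file; the helper name resolve_usecols is hypothetical and not part of the patch:
from pandas import Index

def resolve_usecols(usecols, usecols_key):
    # map requested column labels to positions in the parsed header
    key = Index(usecols_key)
    if key.is_unique:
        # unique header names: a plain indexer lookup is enough
        return list(key.get_indexer(usecols))
    # duplicated header names: get_indexer_for returns every matching
    # position instead of raising like get_indexer would
    return list(key.get_indexer_for(usecols))
In practice get_indexer_for already dispatches on uniqueness internally, which is essentially what the later version of the diff relies on.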
can you update
Sorry haven't had a chance, will do soon ...
Rewrote the logic using …
else:  # found duplicates in usecols_key.
# get indices of all instances of u in usecols_key. GH11823
u_index = Index(usecols_key).get_loc(u)
# convert the slice or mask array to index
This is way too complicated. Please show the inputs in the cases (e.g. duplicate and non-duplicate).
I agree. Well, here are some examples of various usecols with and without duplicates for when names contain duplicates.
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
names=['a', 'b', 'a'], usecols=['a'])
Out:
a
0 1
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
names=['a', 'b', 'a'], usecols=['a','a'])
Out:
a a
0 3 3
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
names=['a', 'b', 'a'], usecols=['a','b'])
Out:
a b
0 1 2
pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None,
names=['a', 'b', 'a'], usecols=['a','b','a'])
Out:
a b a
0 3 2 3
In retrospect I'm not sure if this complexity is worth it here, and in any case one can still argue that the above behavior is not ideal ...
What about throwing an error when duplicated columns exist with usecols? Or leave the python engine alone and edit only the c engine parser, so that the "(b) looks like a bug" part of the original issue #11823 is addressed, but not the "(a) different" part?
What I mean is you don't need a loop and it should be much, much simpler, something like:
(Pdb) p usecols_key
['a', 'b', 'a']
(Pdb) !res = Index(self.usecols).get_indexer(usecols_key)
(Pdb) p res
array([ 0, -1, 0])
(Pdb) p pd.unique(res)
array([ 0, -1])
you can simply ignore the -1 (no column found).
This only works if self.usecols is unique though, right?
usecols=['a','b','a']
pd.Index(usecols).get_indexer(['a'])
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I could do np.unique(usecols) before turning it into Index, but that would lose the duplicate column information, which was the whole point.
you can use .get_indexer_for, which will return ordered indexers for non-uniques
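For illustration, a sketch using the same non-unique usecols as above: get_indexer raises InvalidIndexError on a non-unique Index, while get_indexer_for falls back to the non-unique code path and returns every matching position in order.
import pandas as pd

usecols = ['a', 'b', 'a']
pd.Index(usecols).get_indexer_for(['a'])
# -> array([0, 2]) : both positions of 'a'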
Before trying to solve the bug (if it is a bug), what is actually the desired behaviour? Current behaviour on the example you gave:
In the example output you gave here #11882 (comment), you changed …
But I don't think this is the desired behaviour? Shouldn't it rather be something like: …
And I would argue that …
The desired behavior is simply to match the …
I agree that your suggested output of …
Well, I would rather say that the behaviour of the c parser is a bug. And BTW, I get yet another behaviour on master:
Yeah, I think the ideal is to make the c parser and python parser the same for now (part b).
@jorisvandenbossche which one is indeed better?
If there is consensus, I'll continue to try to fix the python parser to match the still-non-ideal c parser behavior with hopefully less complicated code.
@sxwang sorry, that was not really clear. I meant that I would rather try to fix the bug than just make the behaviour of the c and python parsers the same but still buggy. (Or, if that's too difficult, I think raising an error is even better than the current situation.)
@jreback that was actually part (a) :-) But kidding aside, I would rather we fix the bug than just make their behaviour the same. @sxwang Personally, I think I would like to see this behaviour:
So once the column 'a' is specified in the … Of course, the current bug in the C parser is even worse than the behaviour in the python parser, but ideally we should fix them both at once.
can you rebase/update
Force-pushed 19e9a4e to 5474511
Rebased.
So I don't think test coverage is sufficient here. We need these 2 to work (you actually just added a single test w/o specifying …)
col_indices.append(usecols_key.index(u))
else:
col_indices.append(u)
col_indices = Index(usecols_key).get_indexer_for(self.usecols)
In [4]: Index(list('aab')).get_indexer_for(list('ab'))
Out[4]: Int64Index([0, 1, 2], dtype='int64')
Index(usecols_key).get_indexer_for(pd.unique(self.usecols)) might do the trick
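A small sketch of why deduplicating self.usecols first helps when the selection itself repeats a label: on a non-unique Index, get_indexer_for delegates to get_indexer_non_unique, which repeats positions for repeated targets.
import pandas as pd
from pandas import Index

# without dedup, a repeated label in the selection repeats its matches
Index(list('aab')).get_indexer_for(['a', 'a'])
# -> array([0, 1, 0, 1])

# deduplicating the selection keeps each matched position only once
Index(list('aab')).get_indexer_for(pd.unique(['a', 'a']))
# -> array([0, 1])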
I messed something up ... sorry, still trying to learn my way around git. Let me close and submit a new PR ...
@sxwang you don't need to open a new PR to clean it up here. If you can clean up your branch locally, just force-push it to your origin/dup_column_usecols branch and it will be updated here.
Force-pushed d2f4c51 to 7b4186f
Force-pushed 7b4186f to 27349b9
Ok, I managed to fix the branch. I think there are 2 distinct things to test, which are reflected in the new test I added and the comments:
(2) matching behavior between c and python engines
result = self.read_csv(StringIO(data), names=['a', 'a', 'b'],
                       header=None, usecols=['a', 'a', 'b'])
expected = self.read_csv(StringIO(data), names=['a', 'a', 'b'],
                         header=None)
I'd like to use a fixed result here, IOW actually construct the expected DataFrame, rather than reading it in with a different set of options. I think this is the result we'd like to achieve (in all the above cases).
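A sketch of what a fixed-expected version of that test could look like. The data value and the expected values are assumptions for illustration (the actual fixture and the agreed-upon output are not shown in this thread), tm is pandas.util.testing as used in pandas tests of that era, and recent pandas versions reject duplicated names in read_csv outright, so this only mirrors the behaviour under discussion then.
import pandas as pd
from io import StringIO
import pandas.util.testing as tm  # pandas.testing in newer versions

data = "1,2,3"  # hypothetical csv content

result = pd.read_csv(StringIO(data), names=['a', 'a', 'b'],
                     header=None, usecols=['a', 'a', 'b'])

# construct the expected frame explicitly instead of re-reading the csv
# with a different set of options
expected = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])
tm.assert_frame_equal(result, expected)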
@sxwang quoting your expected behaviour from above:
IMO this should not be the expected behaviour (unless it was a typo?), as here you don't get the values that were in your csv file, while you only provided names for the columns (no selection of the columns).
can you rebase/update
I think making this have the "ideal" behavior is quite involved ... and I'm afraid I don't really have the bandwidth to work on this anymore. Maybe somebody else can take a stab at it. Should I close the PR?
@sxwang: I too have been working on fixing duplicate-columns issues. If you don't mind, I'd be happy to pick this one up and resolve this issue once my first PR for duplicate columns gets merged. @jreback: From what I see in this discussion, I should point out that I am not the only person who has been stating that this "ideal" (which you lay out here) would take a lot of work / refactoring to do. Two different contributors (@sxwang and myself) independently working on this issue have expressed this opinion. If you really do believe your idea of lists is that simple to implement, then IMHO it should not take very long for you to do so yourself, because right now neither of us sees an easy path.
@gfyoung thanks, I'll close this one.
Fixes #11823. Changed usecols from a set to a list in the C parser to handle duplicated columns and usecols. Updated the python parser to be consistent with the C parser.
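As a quick way to exercise the behaviour this PR targets, a hedged reproduction sketch: it only prints the two engines' results rather than asserting them, since the thread shows the engines disagreeing on the output, and recent pandas versions reject duplicated names in read_csv outright, so this reproduces the discussion only on the pandas of that era.
import pandas as pd
from io import StringIO

data = "1,2,3"

# duplicated names combined with usecols: GH11823 reports that the
# c and python engines select different data here
for engine in ("c", "python"):
    df = pd.read_csv(StringIO(data), engine=engine, header=None,
                     names=['a', 'b', 'a'], usecols=['a', 'b'])
    print(engine, "engine:")
    print(df)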