BUG: Fix parse_dates processing with usecols and C engine #12512
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -75,8 +75,12 @@ class ParserWarning(Warning): | |
of each line, you might consider index_col=False to force pandas to _not_ | ||
use the first column as the index (row names) | ||
usecols : array-like, default None | ||
Return a subset of the columns. | ||
Results in much faster parsing time and lower memory usage. | ||
Return a subset of the columns. All elements in this array must either | ||
be positional (i.e. integer indices into the document columns) or strings | ||
Review comment: maybe add in a mini example:
Review comment: Done. |
||
that correspond to column names provided either by the user in `names` or | ||
inferred from the document header row(s). For example, a valid `usecols` | ||
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter | ||
results in much faster parsing time and lower memory usage. | ||
squeeze : boolean, default False | ||
If the parsed data only contains one column then return a Series | ||
prefix : str, default None | ||
|
@@ -801,6 +805,26 @@ def _is_index_col(col): | |
return col is not None and col is not False | ||
|
||
|
||
def _validate_usecols_arg(usecols): | ||
Review comment: then return the inferred type (which you can use later on).
Review comment: Okay, I see. Done.
Review comment: add a note on what this is doing
Review comment: add |
||
""" | ||
Check whether or not the 'usecols' parameter | ||
contains all integers (column selection by index) | ||
or strings (column by name). Raises a ValueError | ||
if that is not the case. | ||
""" | ||
# gh-12678 | ||
if usecols is not None: | ||
usecols_dtype = lib.infer_dtype(usecols) | ||
if usecols_dtype not in ('integer', 'string'): | ||
raise ValueError(("The elements of 'usecols' " | ||
"must either be all strings " | ||
"or all integers")) | ||
|
||
# validation has succeeded, so | ||
# return the argument for assignment | ||
return usecols | ||
|
||
|
||
class ParserBase(object): | ||
|
||
def __init__(self, kwds): | ||
|
@@ -1132,7 +1156,7 @@ def __init__(self, src, **kwds): | |
self._reader = _parser.TextReader(src, **kwds) | ||
|
||
# XXX | ||
self.usecols = self._reader.usecols | ||
self.usecols = _validate_usecols_arg(self._reader.usecols) | ||
|
||
passed_names = self.names is None | ||
|
||
|
@@ -1157,18 +1181,21 @@ def __init__(self, src, **kwds): | |
else: | ||
self.names = lrange(self._reader.table_width) | ||
|
||
# If the names were inferred (not passed by user) and usedcols is | ||
# defined, then ensure names refers to the used columns, not the | ||
# document's columns. | ||
if self.usecols and passed_names: | ||
col_indices = [] | ||
for u in self.usecols: | ||
if isinstance(u, string_types): | ||
col_indices.append(self.names.index(u)) | ||
else: | ||
col_indices.append(u) | ||
self.names = [n for i, n in enumerate(self.names) | ||
if i in col_indices] | ||
# gh-9755 | ||
# | ||
# need to set orig_names here first | ||
# so that proper indexing can be done | ||
# with _set_noconvert_columns | ||
# | ||
# once names has been filtered, we will | ||
# then set orig_names again to names | ||
self.orig_names = self.names[:] | ||
|
||
if self.usecols: | ||
Review comment: also at this point you know if you have positional usecols or named usecols. so use that information. I don't think you need to whole so this is worth refactoring a bit.
Review comment: The reason why you have to set on a per-name basis is because the elements of |
||
if len(self.names) > len(self.usecols): | ||
Review comment: I don't think this check is really valid. You might have len(names) < len(usecols) (which is kind of silly, but possible).
Review comment: I'm not sure I follow your comment. If you have
Review comment: if that is the case then the check is not necessary? can you confirm and add a test (or maybe move these tests next to the other ones).
Review comment: The check is necessary because if you have a situation where the user deliberately passes in only the column names for the filtered table but |
||
self.names = [n for i, n in enumerate(self.names) | ||
Review comment: this doesn't handle the mixed usecols issue.
Review comment: From an organization perspective, I was thinking of fixing the mixed usecols issue as a follow-up because it's a problem for both the Python and C engines, probably by putting a verification step in
Review comment: Well we have lots of things floating ATM. However I think it's simple to add the non-mixed usecols (w/o addressing the set/list issue). So you can do that in another commit if it works.
Review comment: Sounds good. I'll just add it as another commit and then close the issue in the process. |
||
if (i in self.usecols or n in self.usecols)] | ||
|
||
if len(self.names) < len(self.usecols): | ||
raise ValueError("Usecols do not match names.") | ||
|
||
|
@@ -1194,13 +1221,17 @@ def __init__(self, src, **kwds): | |
self._implicit_index = self._reader.leading_cols > 0 | ||
|
||
def _set_noconvert_columns(self): | ||
names = self.names | ||
names = self.orig_names | ||
usecols = self.usecols | ||
|
||
def _set(x): | ||
if com.is_integer(x): | ||
self._reader.set_noconvert(x) | ||
else: | ||
self._reader.set_noconvert(names.index(x)) | ||
if usecols and com.is_integer(x): | ||
x = list(usecols)[x] | ||
|
||
if not com.is_integer(x): | ||
x = names.index(x) | ||
|
||
self._reader.set_noconvert(x) | ||
|
||
if isinstance(self.parse_dates, list): | ||
for val in self.parse_dates: | ||
|
@@ -1472,7 +1503,7 @@ def __init__(self, f, **kwds): | |
self.lineterminator = kwds['lineterminator'] | ||
self.quoting = kwds['quoting'] | ||
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True) | ||
self.usecols = kwds['usecols'] | ||
self.usecols = _validate_usecols_arg(kwds['usecols']) | ||
self.skip_blank_lines = kwds['skip_blank_lines'] | ||
|
||
self.names_passed = kwds['names'] or None | ||
|
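As a rough illustration of the two behaviours this file's changes introduce, the sketch below reimplements them in plain Python without pandas. The names `validate_usecols` and `resolve_noconvert_index` are hypothetical stand-ins for `_validate_usecols_arg` and the inner `_set` helper. The `isinstance` checks replace `lib.infer_dtype` and may differ from it in edge cases such as an empty list, and sorting the selected positions is an assumption made here so the result does not depend on the order in which `usecols` was written. It is not the code in the diff.

```python
def validate_usecols(usecols):
    """Return usecols unchanged if it is None, all integers, or all strings."""
    if usecols is None:
        return usecols
    items = list(usecols)
    # bool is a subclass of int, so it is excluded explicitly
    all_ints = all(isinstance(u, int) and not isinstance(u, bool)
                   for u in items)
    all_strs = all(isinstance(u, str) for u in items)
    if not (all_ints or all_strs):
        raise ValueError("The elements of 'usecols' must either be "
                         "all strings or all integers")
    return usecols


def resolve_noconvert_index(x, orig_names, usecols=None):
    """Map one parse_dates entry to a position in the full document.

    An integer entry is read as a position within the selected columns
    when usecols is given; a string entry is looked up in orig_names.
    """
    if usecols and isinstance(x, int):
        # Assumption: normalise the selection to document positions,
        # in document order, before indexing into it.
        positions = sorted(
            u if isinstance(u, int) else orig_names.index(u)
            for u in usecols
        )
        return positions[x]
    if isinstance(x, int):
        return x
    return orig_names.index(x)


if __name__ == "__main__":
    names = list("abcde")
    # parse_dates=[[1, 2]] with usecols=[0, 2, 3] refers to columns c and d
    print([resolve_noconvert_index(i, names, [0, 2, 3]) for i in (1, 2)])
```

Under these assumptions, `parse_dates=[[1, 2]]` combined with `usecols=[0, 2, 3]` resolves to document columns 2 and 3, which is why the tests for gh-9755 expect a combined `c_d` column.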
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2682,12 +2682,118 @@ def test_uneven_lines_with_usecols(self): | |
df = self.read_csv(StringIO(csv), usecols=usecols) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
usecols = ['a', 1] | ||
usecols = ['a', 'b'] | ||
df = self.read_csv(StringIO(csv), usecols=usecols) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
usecols = ['a', 'b'] | ||
df = self.read_csv(StringIO(csv), usecols=usecols) | ||
def test_usecols_with_parse_dates(self): | ||
# See gh-9755 | ||
s = """a,b,c,d,e | ||
0,1,20140101,0900,4 | ||
0,1,20140102,1000,4""" | ||
parse_dates = [[1, 2]] | ||
|
||
cols = { | ||
'a' : [0, 0], | ||
'c_d': [ | ||
Timestamp('2014-01-01 09:00:00'), | ||
Timestamp('2014-01-02 10:00:00') | ||
] | ||
} | ||
expected = DataFrame(cols, columns=['c_d', 'a']) | ||
|
||
df = self.read_csv(StringIO(s), usecols=[0, 2, 3], | ||
parse_dates=parse_dates) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df = self.read_csv(StringIO(s), usecols=[3, 0, 2], | ||
parse_dates=parse_dates) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
def test_usecols_with_parse_dates_and_full_names(self): | ||
# See gh-9755 | ||
s = """0,1,20140101,0900,4 | ||
0,1,20140102,1000,4""" | ||
parse_dates = [[1, 2]] | ||
names = list('abcde') | ||
|
||
cols = { | ||
'a' : [0, 0], | ||
'c_d': [ | ||
Timestamp('2014-01-01 09:00:00'), | ||
Timestamp('2014-01-02 10:00:00') | ||
] | ||
} | ||
expected = DataFrame(cols, columns=['c_d', 'a']) | ||
|
||
df = self.read_csv(StringIO(s), names=names, | ||
usecols=[0, 2, 3], | ||
parse_dates=parse_dates) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df = self.read_csv(StringIO(s), names=names, | ||
usecols=[3, 0, 2], | ||
parse_dates=parse_dates) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
def test_usecols_with_parse_dates_and_usecol_names(self): | ||
# See gh-9755 | ||
s = """0,1,20140101,0900,4 | ||
0,1,20140102,1000,4""" | ||
parse_dates = [[1, 2]] | ||
names = list('acd') | ||
|
||
cols = { | ||
'a' : [0, 0], | ||
'c_d': [ | ||
Timestamp('2014-01-01 09:00:00'), | ||
Timestamp('2014-01-02 10:00:00') | ||
] | ||
} | ||
expected = DataFrame(cols, columns=['c_d', 'a']) | ||
|
||
df = self.read_csv(StringIO(s), names=names, | ||
usecols=[0, 2, 3], | ||
parse_dates=parse_dates) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df = self.read_csv(StringIO(s), names=names, | ||
usecols=[3, 0, 2], | ||
parse_dates=parse_dates) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
def test_mixed_dtype_usecols(self): | ||
# See gh-12678 | ||
data = """a,b,c | ||
Review comment: I think it's worth testing with a header that is say
Review comment:
>>> data = """a,b,c
1000,2000,3000
4000,5000,6000
"""
>>> usecols = [0, 2] # column selection by index
>>> read_csv(StringIO(data), usecols=usecols)
>>>
>>> usecols = ['0', '2'] # column selection by name
>>> read_csv(StringIO(data), usecols=usecols)
Review comment: this is clearly doing positional indexing, even though the label exists. not sure what to do here.
Review comment: As is stated in the new documentation, integers are interpreted as indices. Strings are interpreted as column names. Personally, I consider that behaviour to be correct. In fact, if you change
Review comment: hmm, ok
Review comment: I'll go ahead and add a separate test in there just in case this confusion comes up later on.
Review comment: ok, just document it
Review comment: @jreback : Double checked on the documentation to make sure that the wording clearly states that integers = indices and strings = column names. Also added a test where column names were integers. Travis gives the green light. |
||
1000,2000,3000 | ||
4000,5000,6000 | ||
""" | ||
msg = ("The elements of \'usecols\' " | ||
"must either be all strings " | ||
"or all integers") | ||
usecols = [0, 'b', 2] | ||
|
||
with tm.assertRaisesRegexp(ValueError, msg): | ||
df = self.read_csv(StringIO(data), usecols=usecols) | ||
|
||
def test_usecols_with_integer_like_header(self): | ||
data = """2,0,1 | ||
1000,2000,3000 | ||
4000,5000,6000 | ||
""" | ||
|
||
usecols = [0, 1] # column selection by index | ||
expected = DataFrame(data=[[1000, 2000], | ||
[4000, 5000]], | ||
columns=['2', '0']) | ||
df = self.read_csv(StringIO(data), usecols=usecols) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
usecols = ['0', '1'] # column selection by name | ||
expected = DataFrame(data=[[2000, 3000], | ||
[5000, 6000]], | ||
columns=['0', '1']) | ||
df = self.read_csv(StringIO(data), usecols=usecols) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
|
||
|
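The rule exercised by `test_usecols_with_integer_like_header` (integers select by position, strings select by name) can be illustrated without pandas. The sketch below uses the standard `csv` module and a hypothetical `select_columns` helper. It assumes the first row is the header and returns columns in document order, and it leaves values as strings, unlike `read_csv`.

```python
import csv
from io import StringIO


def select_columns(text, usecols):
    """Return (header, rows) restricted to the requested columns."""
    rows = list(csv.reader(StringIO(text)))
    header, body = rows[0], rows[1:]
    if all(isinstance(u, int) for u in usecols):
        # integers are positions in the document
        keep = sorted(usecols)
    else:
        # strings are matched against the header row
        keep = sorted(header.index(u) for u in usecols)
    return [header[i] for i in keep], [[row[i] for i in keep] for row in body]


if __name__ == "__main__":
    data = "2,0,1\n1000,2000,3000\n4000,5000,6000\n"
    print(select_columns(data, [0, 1]))      # first two columns by position
    print(select_columns(data, ["0", "1"]))  # columns labelled '0' and '1'
```

With the integer-like header `2,0,1`, the two calls pick different columns, which is the distinction the review thread asked to have documented.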
Review comment: make sure you update io.rst with the same
Review comment: Ah, good point. Done.