
BUG: Fix parse_dates processing with usecols and C engine #12512


Closed · wants to merge 2 commits
8 changes: 6 additions & 2 deletions doc/source/io.rst
@@ -120,8 +120,12 @@ index_col : int or sequence or ``False``, default ``None``
each line, you might consider ``index_col=False`` to force pandas to *not* use
the first column as the index (row names).
usecols : array-like, default ``None``
Return a subset of the columns. Results in much faster parsing time and lower
memory usage
Return a subset of the columns. All elements in this array must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid `usecols`
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
results in much faster parsing time and lower memory usage.
squeeze : boolean, default ``False``
If the parsed data only contains one column then return a Series.
prefix : str, default ``None``
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.18.1.txt
@@ -101,7 +101,7 @@ API changes


- ``CParserError`` is now a ``ValueError`` instead of just an ``Exception`` (:issue:`12551`)

- ``read_csv`` no longer allows a combination of strings and integers for the ``usecols`` parameter (:issue:`12678`)
- ``pd.show_versions()`` now includes ``pandas_datareader`` version (:issue:`12740`)
- Provide a proper ``__name__`` and ``__qualname__`` attributes for generic functions (:issue:`12021`)

@@ -211,6 +211,7 @@ Bug Fixes

- Bug in ``value_counts`` when ``normalize=True`` and ``dropna=True`` where nulls still contributed to the normalized count (:issue:`12558`)
- Bug in ``Panel.fillna()`` ignoring ``inplace=True`` (:issue:`12633`)
- Bug in ``read_csv`` when specifying ``names``, ``usecols``, and ``parse_dates`` simultaneously with the C engine (:issue:`9755`)
- Bug in ``Series.rename``, ``DataFrame.rename`` and ``DataFrame.rename_axis`` not treating ``Series`` as mappings to relabel (:issue:`12623`).
- Clean in ``.rolling.min`` and ``.rolling.max`` to enhance dtype handling (:issue:`12373`)

73 changes: 52 additions & 21 deletions pandas/io/parsers.py
@@ -75,8 +75,12 @@ class ParserWarning(Warning):
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
usecols : array-like, default None
Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
Return a subset of the columns. All elements in this array must either
Contributor: make sure you update io.rst with the same

Member Author: Ah, good point. Done.

be positional (i.e. integer indices into the document columns) or strings
Contributor: maybe add in a mini example: [0, 1, 2], ['foo', 'bar', 'baz'].

Member Author: Done.
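To make the documented semantics concrete, here is a standalone sketch (the helper `select_columns` is hypothetical, not part of pandas): an all-integer ``usecols`` selects by position, an all-string one selects by header name, and a mixture is rejected.

```python
def select_columns(header, rows, usecols):
    # Hypothetical sketch of the documented 'usecols' semantics:
    # all integers -> positional selection, all strings -> selection by name.
    if all(isinstance(c, int) for c in usecols):
        indices = sorted(usecols)
    elif all(isinstance(c, str) for c in usecols):
        indices = sorted(header.index(c) for c in usecols)
    else:
        raise ValueError("The elements of 'usecols' must either be "
                         "all strings or all integers")
    return ([header[i] for i in indices],
            [[row[i] for i in indices] for row in rows])

header = ['foo', 'bar', 'baz']
rows = [[1, 2, 3], [4, 5, 6]]

# [0, 2] and ['foo', 'baz'] pick the same columns here
assert select_columns(header, rows, [0, 2]) == \
    select_columns(header, rows, ['foo', 'baz'])
```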

that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid `usecols`
parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
results in much faster parsing time and lower memory usage.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
prefix : str, default None
@@ -801,6 +805,26 @@ def _is_index_col(col):
return col is not None and col is not False


def _validate_usecols_arg(usecols):
Contributor: then return the inferred type (which you can use later on).

Member Author: Okay, I see. Done.

Contributor: add a note what this is doing

Contributor: add Parameters / Returns

"""
Check whether or not the 'usecols' parameter
contains all integers (column selection by index)
or strings (column by name). Raises a ValueError
if that is not the case.
"""
# gh-12678
if usecols is not None:
usecols_dtype = lib.infer_dtype(usecols)
if usecols_dtype not in ('integer', 'string'):
raise ValueError(("The elements of 'usecols' "
"must either be all strings "
"or all integers"))

# validation has succeeded, so
# return the argument for assignment
return usecols
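The validation above relies on pandas' internal ``lib.infer_dtype``; a standalone sketch of the same check, with plain ``isinstance`` tests substituted for that internal (an assumption on my part), looks like:

```python
def validate_usecols_arg(usecols):
    # Standalone sketch of _validate_usecols_arg: reject any mixture of
    # strings and integers, and return the argument unchanged on success.
    # (Plain isinstance checks stand in for pandas' lib.infer_dtype.)
    if usecols is not None:
        all_ints = all(isinstance(u, int) for u in usecols)
        all_strs = all(isinstance(u, str) for u in usecols)
        if not (all_ints or all_strs):
            raise ValueError("The elements of 'usecols' must either be "
                             "all strings or all integers")
    return usecols

validate_usecols_arg(None)            # None passes through untouched
validate_usecols_arg([0, 1, 2])       # all integers: OK
validate_usecols_arg(['foo', 'bar'])  # all strings: OK
```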


class ParserBase(object):

def __init__(self, kwds):
@@ -1132,7 +1156,7 @@ def __init__(self, src, **kwds):
self._reader = _parser.TextReader(src, **kwds)

# XXX
self.usecols = self._reader.usecols
self.usecols = _validate_usecols_arg(self._reader.usecols)

passed_names = self.names is None

@@ -1157,18 +1181,21 @@ def __init__(self, src, **kwds):
else:
self.names = lrange(self._reader.table_width)

# If the names were inferred (not passed by user) and usecols is
# defined, then ensure names refers to the used columns, not the
# document's columns.
if self.usecols and passed_names:
col_indices = []
for u in self.usecols:
if isinstance(u, string_types):
col_indices.append(self.names.index(u))
else:
col_indices.append(u)
self.names = [n for i, n in enumerate(self.names)
if i in col_indices]
# gh-9755
#
# need to set orig_names here first
# so that proper indexing can be done
# with _set_noconvert_columns
#
# once names has been filtered, we will
# then set orig_names again to names
self.orig_names = self.names[:]

if self.usecols:
Contributor: also at this point you know if you have positional usecols or named usecols, so use that information. I don't think you need the whole _set thing (on a per-name basis); you already know what to do. So this is worth refactoring a bit.

Member Author: The reason why you have to set on a per-name basis is that the elements of parse_dates can be a mixture of strings and positional integers. Knowing whether usecols is string, integer, or None (i.e. when usecols is not specified) does not help much here.

if len(self.names) > len(self.usecols):
Contributor: I don't think this check is really valid. You might have len(names) < len(usecols) (which is kind of silly, but possible). I think if .usecols is specified you always need to hit this path. Maybe try to find a test which breaks this.

Member Author: I'm not sure I follow your comment. If you have len(names) < len(usecols), you would be automatically erroring just as we do already.

Contributor: If that is the case, then the check is not necessary? Can you confirm and add a test (or maybe move these tests next to the other ones)?

Member Author: The check is necessary because if the user deliberately passes in only the column names for the filtered table but usecols corresponding to the indices of the original table, you are going to get an error. Some of the examples provided in the original issue will break if that check is not there.

self.names = [n for i, n in enumerate(self.names)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't handle the mixed usecols issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an organization perspective, I was thinking of fixing the mixed usecols issue as a follow-up because it's a problem for both the Python and C engines, probably by putting a verification step in ParserBase. Or should I just kill two birds with one stone and fix it here (+ tests)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we have lots of things floating ATM. However I think its simple to add the non-mixed usecols (w/o addressing the set/list issue). So you can do that in another commit if it works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sound good. I'll just add it as another commit and then close the issue in the process.

if (i in self.usecols or n in self.usecols)]

if len(self.names) < len(self.usecols):
raise ValueError("Usecols do not match names.")

@@ -1194,13 +1221,17 @@ def __init__(self, src, **kwds):
self._implicit_index = self._reader.leading_cols > 0

def _set_noconvert_columns(self):
names = self.names
names = self.orig_names
usecols = self.usecols

def _set(x):
if com.is_integer(x):
self._reader.set_noconvert(x)
else:
self._reader.set_noconvert(names.index(x))
if usecols and com.is_integer(x):
x = list(usecols)[x]

if not com.is_integer(x):
x = names.index(x)

self._reader.set_noconvert(x)

if isinstance(self.parse_dates, list):
for val in self.parse_dates:
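The remapping that the patched `_set` performs can be sketched in isolation (hypothetical helper; `list(usecols)[x]` mirrors the diff above): a positional `parse_dates` entry refers to the *filtered* table when `usecols` is given, so it is remapped through `usecols` first, while a named entry is resolved against the original header.

```python
def resolve_column(x, orig_names, usecols):
    # Sketch of the patched _set(): a positional entry indexes the
    # filtered table when usecols is given, so remap it through usecols
    # first; a named entry is resolved against the original header.
    if usecols and isinstance(x, int):
        x = list(usecols)[x]
    if not isinstance(x, int):
        x = orig_names.index(x)
    return x

# With usecols=[0, 2, 3] over columns a..e, parse_dates position 1 is
# column 'c' (absolute index 2), as exercised by the gh-9755 tests.
assert resolve_column(1, list('abcde'), [0, 2, 3]) == 2
```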
@@ -1472,7 +1503,7 @@ def __init__(self, f, **kwds):
self.lineterminator = kwds['lineterminator']
self.quoting = kwds['quoting']
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
self.usecols = kwds['usecols']
self.usecols = _validate_usecols_arg(kwds['usecols'])
self.skip_blank_lines = kwds['skip_blank_lines']

self.names_passed = kwds['names'] or None
112 changes: 109 additions & 3 deletions pandas/io/tests/test_parsers.py
@@ -2682,12 +2682,118 @@ def test_uneven_lines_with_usecols(self):
df = self.read_csv(StringIO(csv), usecols=usecols)
tm.assert_frame_equal(df, expected)

usecols = ['a', 1]
usecols = ['a', 'b']
df = self.read_csv(StringIO(csv), usecols=usecols)
tm.assert_frame_equal(df, expected)

usecols = ['a', 'b']
df = self.read_csv(StringIO(csv), usecols=usecols)
def test_usecols_with_parse_dates(self):
# See gh-9755
s = """a,b,c,d,e
0,1,20140101,0900,4
0,1,20140102,1000,4"""
parse_dates = [[1, 2]]

cols = {
'a' : [0, 0],
'c_d': [
Timestamp('2014-01-01 09:00:00'),
Timestamp('2014-01-02 10:00:00')
]
}
expected = DataFrame(cols, columns=['c_d', 'a'])

df = self.read_csv(StringIO(s), usecols=[0, 2, 3],
parse_dates=parse_dates)
tm.assert_frame_equal(df, expected)

df = self.read_csv(StringIO(s), usecols=[3, 0, 2],
parse_dates=parse_dates)
tm.assert_frame_equal(df, expected)

def test_usecols_with_parse_dates_and_full_names(self):
# See gh-9755
s = """0,1,20140101,0900,4
0,1,20140102,1000,4"""
parse_dates = [[1, 2]]
names = list('abcde')

cols = {
'a' : [0, 0],
'c_d': [
Timestamp('2014-01-01 09:00:00'),
Timestamp('2014-01-02 10:00:00')
]
}
expected = DataFrame(cols, columns=['c_d', 'a'])

df = self.read_csv(StringIO(s), names=names,
usecols=[0, 2, 3],
parse_dates=parse_dates)
tm.assert_frame_equal(df, expected)

df = self.read_csv(StringIO(s), names=names,
usecols=[3, 0, 2],
parse_dates=parse_dates)
tm.assert_frame_equal(df, expected)

def test_usecols_with_parse_dates_and_usecol_names(self):
# See gh-9755
s = """0,1,20140101,0900,4
0,1,20140102,1000,4"""
parse_dates = [[1, 2]]
names = list('acd')

cols = {
'a' : [0, 0],
'c_d': [
Timestamp('2014-01-01 09:00:00'),
Timestamp('2014-01-02 10:00:00')
]
}
expected = DataFrame(cols, columns=['c_d', 'a'])

df = self.read_csv(StringIO(s), names=names,
usecols=[0, 2, 3],
parse_dates=parse_dates)
tm.assert_frame_equal(df, expected)

df = self.read_csv(StringIO(s), names=names,
usecols=[3, 0, 2],
parse_dates=parse_dates)
tm.assert_frame_equal(df, expected)
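For reference, `parse_dates=[[1, 2]]` joins the two selected columns into one before date parsing; a minimal sketch of that combination step (the helper `combine_date_cols` is hypothetical, not pandas internals):

```python
def combine_date_cols(header, rows, pair):
    # parse_dates=[[i, j]] concatenates the two columns' string values
    # with a space and names the result '<name_i>_<name_j>'.
    i, j = pair
    name = "{}_{}".format(header[i], header[j])
    values = ["{} {}".format(row[i], row[j]) for row in rows]
    return name, values

name, values = combine_date_cols(
    list('abcde'),
    [['0', '1', '20140101', '0900', '4'],
     ['0', '1', '20140102', '1000', '4']],
    (2, 3))
# name == 'c_d'; values == ['20140101 0900', '20140102 1000'],
# which the parser then converts to Timestamps.
```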

def test_mixed_dtype_usecols(self):
# See gh-12678
data = """a,b,c
Contributor: I think it's worth testing with a header that is, say, [2, 0, 1], and usecols specified as [0, 2]. IOW, the headers are actually labels AND we are specifying labels, but these are actually numbers, so it looks positional.

Member Author:
1. I'm not quite sure I understand what you're saying; is it something like this?
>>> data = """a,b,c
1000,2000,3000
4000,5000,6000
"""
>>> usecols = [0, 2]  # column selection by index
>>> read_csv(StringIO(data), usecols=usecols)
>>>
>>> usecols = ['0', '2']  # column selection by name
>>> read_csv(StringIO(data), usecols=usecols)
2. Where or how does this relate to the mixed-dtypes?

Contributor:
In [3]: read_csv(StringIO(data))
Out[3]:
   2  3  4
0  a  b  c
1  d  e  f

In [5]: read_csv(StringIO(data), usecols=[2])
Out[5]:
   4
0  c
1  f

Contributor: this is clearly doing positional indexing, even though the label exists. Not sure what to do here.

Member Author: As is stated in the new documentation, integers are interpreted as indices. Strings are interpreted as column names. Personally, I consider that behaviour to be correct. In fact, if you change usecols=[2] to usecols=['2'], you'll get the column with a "2" above it, as it should be.

Contributor: hmm, ok

Member Author: I'll go ahead and add a separate test in there just in case this confusion comes up later on.

Contributor: ok, just document it

Member Author: @jreback: Double-checked the documentation to make sure that the wording clearly states that integers = indices and strings = column names. Also added a test where column names were integers. Travis gives the green light.

1000,2000,3000
4000,5000,6000
"""
msg = ("The elements of \'usecols\' "
"must either be all strings "
"or all integers")
usecols = [0, 'b', 2]

with tm.assertRaisesRegexp(ValueError, msg):
df = self.read_csv(StringIO(data), usecols=usecols)

def test_usecols_with_integer_like_header(self):
data = """2,0,1
1000,2000,3000
4000,5000,6000
"""

usecols = [0, 1] # column selection by index
expected = DataFrame(data=[[1000, 2000],
[4000, 5000]],
columns=['2', '0'])
df = self.read_csv(StringIO(data), usecols=usecols)
tm.assert_frame_equal(df, expected)

usecols = ['0', '1'] # column selection by name
expected = DataFrame(data=[[2000, 3000],
[5000, 6000]],
columns=['0', '1'])
df = self.read_csv(StringIO(data), usecols=usecols)
tm.assert_frame_equal(df, expected)

