Skip to content

BUG: Fix parse_dates processing with usecols and C engine #12512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Mar 2, 2016

closes #9755
closes #12678

Continuing on my conquest of read_csv bugs, this PR fixes a bug brought up in #9755 in processing parse_dates with the C engine in which the wrong indices (those of the filtered column names) were being used to determine the date columns to not be dtype-parsed by the C engine. The correct indices are those of the original column names, as they are used later on in the actual data processing.

@gfyoung gfyoung force-pushed the parse_dates_usecols branch 2 times, most recently from bfb010b to a78ba29 Compare March 3, 2016 13:36
@jreback jreback added Bug IO CSV read_csv, to_csv labels Mar 3, 2016
@gfyoung gfyoung force-pushed the parse_dates_usecols branch 2 times, most recently from 58ca399 to 0f99251 Compare March 6, 2016 02:32

def _set(x):
if com.is_integer(x):
self._reader.set_noconvert(x)
if not identical:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems pretty duplicative here. pls try to make minimal code changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the parse_dates_usecols branch 3 times, most recently from 0fd57af to b0eef59 Compare March 8, 2016 06:27
@gfyoung
Copy link
Member Author

gfyoung commented Mar 8, 2016

@jreback : Tests are passing once more, so this should be good to merge AFAICT.

@jreback jreback added this to the 0.18.1 milestone Mar 8, 2016
@gfyoung gfyoung force-pushed the parse_dates_usecols branch 15 times, most recently from 8965565 to f96340e Compare March 14, 2016 21:05
@gfyoung gfyoung force-pushed the parse_dates_usecols branch 2 times, most recently from 9dd9566 to 90d6777 Compare March 16, 2016 15:27
@gfyoung gfyoung force-pushed the parse_dates_usecols branch 2 times, most recently from bcf7dbb to deb22fe Compare April 5, 2016 18:32
@gfyoung
Copy link
Member Author

gfyoung commented Apr 5, 2016

@jreback : Any updates with respect to my responses to your refactoring and check comments?




- Bug in ``read_csv`` when specifying ``usecols`` and ``parse_dates`` simultaneously with the C engine (:issue:`9755`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't add onto the end, put in-between other bug fixes (that way you won't have, nor cause conflicts)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the parse_dates_usecols branch 3 times, most recently from 6bba242 to f74f81f Compare April 6, 2016 00:48
}
expected = DataFrame(cols, columns=['c_d', 'a'])

df = read_csv(StringIO(s), usecols=[0, 2, 3],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are not testing this fully here. you MUST use self.read_csv. In fact we should REMOVE read_csv from the import as its just confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_csv will default to the c-engine ONLY. while self.read_csv cycles thru all of them (as separate tests)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, good catch. Let me go fix that.

@gfyoung gfyoung force-pushed the parse_dates_usecols branch from f74f81f to f05d27a Compare April 6, 2016 02:15
@@ -1133,6 +1151,7 @@ def __init__(self, src, **kwds):

# XXX
self.usecols = self._reader.usecols
self._usecols_dtype = _validate_usecols_arg(self.usecols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you are actually using this variable? _usecols_dtype, so no need to save it (I think we discussed using it, but not necessary i guess)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah...doesn't look like it. Removed.

Fixes bug in processing 'parse_dates' with the C engine
in which the wrong indices (those of the filtered column
names) were being used to determine the date columns to
not be dtype-parsed by the C engine. The correct indices
are those of the original (unfiltered) column names, as
they are used later on in the actual data processing.

Closes pandas-devgh-9755.
@gfyoung gfyoung force-pushed the parse_dates_usecols branch from f05d27a to 551c9f1 Compare April 6, 2016 13:35
@@ -211,6 +215,12 @@ Bug Fixes

- Bug in ``value_counts`` when ``normalize=True`` and ``dropna=True`` where nulls still contributed to the normalized count (:issue:`12558`)
- Bug in ``Panel.fillna()`` ignoring ``inplace=True`` (:issue:`12633`)

- Bug in ``read_csv`` when specifying ``names``, ```usecols``, and ``parse_dates`` simultaneously with the C engine (:issue:`9755`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine for now, but pls don't add extra lines

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the parse_dates_usecols branch from 551c9f1 to 8e0489f Compare April 6, 2016 19:01
Enforces the fact that 'usecols' must either
be all integers (indexing) or strings (column
names), as mixtures of the two are ambiguous.

Closes pandas-devgh-12678.
@gfyoung gfyoung force-pushed the parse_dates_usecols branch from 8e0489f to f0543a4 Compare April 6, 2016 19:10
@jreback jreback closed this in c6c201e Apr 6, 2016
@jreback
Copy link
Contributor

jreback commented Apr 6, 2016

thanks @gfyoung

@gfyoung gfyoung deleted the parse_dates_usecols branch April 6, 2016 19:19
@gfyoung
Copy link
Member Author

gfyoung commented Apr 6, 2016

@jreback : FYI, you can also cancel both of my builds for this PR on Travis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: dont' allow ambiguous usecols read_csv with names, usecols and parse_dates
2 participants