Regex C Engine Warning #10208

Closed
jseabold opened this issue May 26, 2015 · 9 comments
Labels
Docs Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Comments

@jseabold
Contributor

Using pd.read_csv(..., sep=", ", ...), I now get a warning about falling back to the Python engine because regex parsing isn't supported in the C engine. That's fine, but this call isn't actually using a regex.

I don't have an idea for a good transition strategy, and maybe the ship has sailed, but perhaps there should be a separate read_regex or a regex keyword instead of emitting this warning for any separator string longer than one character.

Pandas 0.16.0.
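
A minimal sketch of how the report above can be reproduced, assuming a reasonably recent pandas in which the warning class is exposed as pandas.errors.ParserWarning. The sample data is made up for illustration. The point is that the data is still parsed; the warning only reports the engine switch.

```python
import io
import warnings

import pandas as pd
from pandas.errors import ParserWarning

# Illustrative data: fields separated by a comma followed by a space.
data = "a, b, c\n1, 2, 3\n"

# Record warnings instead of printing them so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # No engine is specified, so pandas starts from the C engine and
    # falls back to the Python engine for the two-character separator.
    df = pd.read_csv(io.StringIO(data), sep=", ")

parser_warnings = [w for w in caught if issubclass(w.category, ParserWarning)]
print(len(parser_warnings) >= 1)   # a ParserWarning was recorded
print(df.columns.tolist())         # the frame was parsed anyway
```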

@jreback
Contributor

jreback commented May 26, 2015

The warning is emitted because your separator is longer than one character, which the C parser does not support (this probably isn't hard to implement, but someone would need to do it). Here is a way to get the same result while staying on the C parser.

In [37]: from io import StringIO; from pandas import read_csv

In [38]: data = """a, b, c\n1, 2, 3"""

In [39]: read_csv(StringIO(data),sep=",",engine='c',skipinitialspace=True)
Out[39]: 
   a  b  c
0  1  2  3

In [40]: read_csv(StringIO(data),sep=", ",engine='python')
Out[40]: 
   a  b  c
0  1  2  3
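
As a self-contained check of the session above, the following sketch compares the two calls directly. It assumes Python 3 with io.StringIO; the variable names df_c and df_py are only illustrative.

```python
import io

import pandas as pd

data = "a, b, c\n1, 2, 3\n"

# Single-character separator: stays on the C engine. skipinitialspace
# drops the space that follows each comma, in the header as well.
df_c = pd.read_csv(io.StringIO(data), sep=",", engine="c", skipinitialspace=True)

# Two-character separator: only the Python engine accepts it.
# Naming the engine explicitly avoids the fallback warning.
df_py = pd.read_csv(io.StringIO(data), sep=", ", engine="python")

print(df_c.equals(df_py))
```

One practical difference: skipinitialspace also tolerates rows where the space after the comma is missing or repeated, whereas sep=", " requires exactly a comma followed by one space.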

@jreback jreback added the IO CSV read_csv, to_csv label May 26, 2015
@dukebody
Contributor

From the documentation it is not clear when a separator is considered a regex and when it isn't. I was trying to use '::' as the separator (MovieLens dataset) when reading a file, and pandas was interpreting it as a regex when it really isn't one.

I think a separate sep_regex keyword would be cleaner. For the time being, we can also raise an exception "non-regex separators of more than 1 character are not supported". If it's the C engine that doesn't support >1 char separators, we can warn "C engine doesn't support separators longer than 1 character, falling back to Python engine".

@TomAugspurger
Contributor

I don't think there's any need to adjust the API, just a clearer warning message.

@dukebody
Contributor

I think documentation should also be amended.

sep: Delimiter to use. If sep is None, will try to automatically determine this. Regular expressions are accepted and will force use of the python parsing engine and will ignore quotes in the data.

When I first read this, I wondered how pandas knows when I am using a regexp as the delimiter and when I am using a normal string. I would change this to:

sep: Delimiter to use. If sep is None, will try to automatically determine this. If it is longer than 1 character, it will be interpreted as a regular expression, will force use of the python parsing engine and will ignore quotes in the data.

Anyhow, I still believe that accepting literal string separators longer than 1 character would be a good feature, but that might need a separate ticket/issue.
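
Since a separator longer than one character is interpreted as a regular expression, a literal separator that contains regex metacharacters needs escaping. One possible workaround, sketched here with the standard library's re.escape and made-up data, is:

```python
import io
import re

import pandas as pd

# Illustrative data using a literal two-character separator. In a regex,
# "|" means alternation, so the separator cannot be passed through as-is.
data = "a||b||c\n1||2||3\n"
literal_sep = "||"

# re.escape turns the literal text into a pattern that matches only it.
df = pd.read_csv(io.StringIO(data), sep=re.escape(literal_sep), engine="python")

print(df.columns.tolist())
print(df.iloc[0].tolist())
```

A separator such as '::' has no special meaning as a regex, so it matches itself and needs no escaping. As the documentation text quoted above notes, quotes in the data are ignored when a regex separator is used.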

@jreback
Contributor

jreback commented Mar 29, 2016

IIRC, if it's longer than 1 character, then by definition it defers to the Python engine.

@jreback
Contributor

jreback commented Mar 29, 2016

No need to add any more options to the parsers. But as @TomAugspurger points out, a clearer error message would be fine.

@jreback jreback added Docs Difficulty Novice Error Reporting Incorrect or improved errors from pandas labels Mar 29, 2016
@jreback jreback added this to the 0.18.1 milestone Mar 29, 2016
@jreback
Contributor

jreback commented Mar 29, 2016

@dukebody pull-requests welcome.

@nhhas

nhhas commented Sep 21, 2019

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
"""Entry point for launching an IPython kernel.

I can't view the data at all. Any help?
Thanks

@unojoe2

unojoe2 commented Oct 14, 2019

From the book Python for Data Analysis, p. 23: the pd.read_table call used there is deprecated, so change
In [XX]: users = pd.read_table... to pd.read_csv...
FILENAME: The file name referenced in the book is supposed to change for each call...
(users.dat is accidentally repeated in the book):
yourFilePath/users.dat
yourFilePath/ratings.dat
yourFilePath/movies.dat

And finally, add engine='python' (as stated above). Here unames, rnames, and mnames are the column-name lists defined in the book:
users = pd.read_csv('yourFilePath/users.dat', engine='python',
                    sep='::', header=None, names=unames)
ratings = pd.read_csv('yourFilePath/ratings.dat', engine='python',
                      sep='::', header=None, names=rnames)
movies = pd.read_csv('yourFilePath/movies.dat', engine='python',
                     sep='::', header=None, names=mnames)

I hope that is helpful info, sorry if I totally missed the point, but I was stuck on this and typing in circles for longer than desired.
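
For anyone who wants to try this without the dataset files, here is a small self-contained sketch. The two sample rows and the column names in unames are assumptions written in the MovieLens users.dat layout, not read from a real file. The warnings filter turns any ParserWarning into an error, so the call would fail if the fallback warning were still emitted.

```python
import io
import warnings

import pandas as pd
from pandas.errors import ParserWarning

# Assumed sample rows in the users.dat layout, with '::' between fields.
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

# Assumed column names for illustration.
unames = ["user_id", "gender", "age", "occupation", "zip"]

with warnings.catch_warnings():
    # Escalate ParserWarning to an exception inside this block only.
    warnings.simplefilter("error", ParserWarning)
    users = pd.read_csv(
        io.StringIO(sample),
        sep="::",
        header=None,
        names=unames,
        engine="python",  # named explicitly, so no fallback is needed
    )

print(users)
```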

6 participants