read_csv dtype argument not working when there is a footer #5232

Closed
socheon opened this issue Oct 15, 2013 · 12 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv

Comments

@socheon

socheon commented Oct 15, 2013

xref #7141

Datafile test.csv

col1|col2
a|438087272980
b|399432587827
c|592706116147
d|1584843561523
footer 1

Command

print pd.read_csv('test.csv', sep='|', skipfooter=1, dtype={'col2':'object'}).dtypes

Output

col1 object
col2 int64
dtype: object

Expected Output

col1 object
col2 object
dtype: object

Platform

Windows XP, Python 2.7, Pandas version 0.11.0

@jreback
Contributor

jreback commented Oct 15, 2013

This is still present in master. Thanks for the report.

The workaround is to coerce the column after reading:

In [11]: df
Out[11]: 
  col1           col2
0    a   438087272980
1    b   399432587827
2    c   592706116147
3    d  1584843561523

In [12]: df.dtypes
Out[12]: 
col1    object
col2     int64
dtype: object

In [13]: df['col2'].astype(object)
Out[13]: 
0     438087272980
1     399432587827
2     592706116147
3    1584843561523
Name: col2, dtype: object
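
The coerce-after workaround can be sketched end to end as follows. This is a minimal sketch, assuming a recent pandas on Python 3; the in-memory `data` string stands in for test.csv, and `engine="python"` is passed explicitly because skipfooter relies on the Python engine. Note that `astype` returns a new Series, so the result has to be assigned back to the column.

```python
import io

import pandas as pd

# Stand-in for test.csv from the report, including the one-line footer.
data = (
    "col1|col2\n"
    "a|438087272980\n"
    "b|399432587827\n"
    "c|592706116147\n"
    "d|1584843561523\n"
    "footer 1\n"
)

# Read without dtype; the footer line is dropped by the Python engine.
df = pd.read_csv(io.StringIO(data), sep="|", skipfooter=1, engine="python")

# Coerce afterwards. astype does not modify in place, so assign the result.
df["col2"] = df["col2"].astype(object)

print(df.dtypes)
```

The coercion happens after parsing, so the values in col2 are integers stored in an object column rather than the original strings.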

@guyrt
Contributor

guyrt commented Nov 15, 2013

dtype is only supported in the C parser, but skipfooter is only supported in the Python parser, which is what this example is implicitly using.

To fix this bug, we have to implement either dtype in the Python parser or skipfooter in the C parser (or both!)

There's another problem here, though. If you explicitly set engine to python, you'll get an error:

ValueError: The 'dtype' option is not supported with the 'python' engine

That check, however, happens before we implicitly switch the parser to python, so in this example dtype is silently ignored instead of raising.

I think it would be best to (a) issue a warning when we switch engines automatically, and (b) move the engine switch before we validate options against the eventual engine. This should probably be a separate ticket, though.
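
The ordering proposed in (a) and (b) might look roughly like the sketch below. Everything in it is hypothetical and for illustration only: `resolve_engine` and the `UNSUPPORTED` table are made-up names, not pandas internals, and the table only mirrors the engine limitations described in this thread.

```python
import warnings

# Hypothetical capability table (an assumption for this sketch, not pandas code).
UNSUPPORTED = {
    "c": {"skipfooter"},
    "python": {"dtype"},
}


def resolve_engine(requested, options):
    """Return the engine that will actually run, or raise ValueError."""
    # Only options the caller actually set (non-empty, non-zero) matter.
    used = {name for name, value in options.items() if value}
    engine = requested or "c"

    # Step 1: switch engines first, and warn about it.
    if requested is None and used & UNSUPPORTED["c"]:
        warnings.warn(
            "Falling back to the 'python' engine because the 'c' engine "
            "does not support: " + ", ".join(sorted(used & UNSUPPORTED["c"])),
            stacklevel=2,
        )
        engine = "python"

    # Step 2: validate against the engine chosen in step 1.
    bad = sorted(used & UNSUPPORTED[engine])
    if bad:
        raise ValueError(
            "The %r option is not supported with the %r engine" % (bad[0], engine)
        )
    return engine
```

With this ordering, the combination from the original report (skipfooter plus dtype, no explicit engine) would warn about the fallback and then raise the same ValueError as the explicit `engine='python'` case, rather than quietly dropping dtype.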

@michaelaye
Contributor

As I am regularly parsing Gigabytes of text files, I definitely support the idea of having a warning when I'm switched to a slower parsing engine.

@michaelaye
Contributor

I just realized that the dtype argument does not work for me at all in master, even when using the 'c' engine?

dtypes_dic

{'af': 'int',
 'c': 'int',
 'date': 'int',
 'det': 'int',
 'hour': 'int',
 'minute': 'int',
 'month': 'int',
 'orbit': 'int',
 'year': 'int'}

df = pd.read_csv(fname, delim_whitespace=True, sep='\s*',
                           dtype=dtypes_dic, engine='c')

df.dtypes

date     float64
month    float64
year     float64
...
qual     float64
sppsx    float64
sppsy    float64
Length: 40, dtype: object

@jreback
Contributor

jreback commented Dec 7, 2013

FYI, if you pass both delim_whitespace and sep, only sep is used (i.e. delim_whitespace implies what you are doing with sep)

@jreback
Contributor

jreback commented Dec 7, 2013

@michaelaye dtype works; maybe post an explicit example? (This issue is about the case where both skip_footer and dtype are specified with the Python parser.)

@michaelaye
Contributor

import StringIO

import numpy as np
import pandas as pd

s = """\ta\tb\tc\td
\t1.0\t4.2\t2\t6
\t6.0\t2.1\t3\t6
"""
s_in = StringIO.StringIO(s)
s_in.seek(0)
df = pd.read_csv(s_in, sep='\s*', dtype={'a':np.int32})
df.dtypes
a    float64
b    float64
c      int64
d      int64
dtype: object

I was using sep='\s*' because delim_whitespace could not cope with initial whitespace as you can try out with this example. Also skipinitialspace did not help. But this sep regex supersedes what delim_whitespace does, so thanks for letting me know that they overlap.

Maybe a down coercion is not allowed due to potential data loss?

@jreback
Contributor

jreback commented Feb 14, 2014

@guyrt want to take this? (I guess implement dtype in python parser and footer?)

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@Twizzledrizzle

skip_footer in the c-parser would be nice!

@gfyoung
Member

gfyoung commented May 27, 2016

skipfooter would indeed be nice to add to the C engine. However, the implementation is not so straightforward. In the abstract, you need to count the number of lines and then work out the line-number cutoff, but that assumes correctly formatted data. What happens if your data is not correctly formatted? How do you proceed with the counting? In the Python engine, it's simple: the CSV reader will most likely complain and break, whereas the C engine supports different types of messages for error handling/warning. It isn't hard to implement counting, but that would mean compromising the error_bad_lines and warn_bad_lines functionality.
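
The count-then-cut idea can be tried outside the parser as a pre-processing step. Below is a minimal sketch, assuming every record occupies exactly one physical line; `drop_footer` is a hypothetical helper, not a pandas function. A quoted field containing a newline would break that assumption, which is exactly the kind of badly formatted input the counting approach struggles with.

```python
def drop_footer(text, n_footer):
    """Return text without its last n_footer physical lines."""
    if n_footer <= 0:
        return text
    # Count physical lines, then cut at the line-number cutoff.
    lines = text.splitlines(keepends=True)
    return "".join(lines[:-n_footer])


# Same content as test.csv from the original report.
data = (
    "col1|col2\n"
    "a|438087272980\n"
    "b|399432587827\n"
    "c|592706116147\n"
    "d|1584843561523\n"
    "footer 1\n"
)

trimmed = drop_footer(data, 1)
print(trimmed)
```

The trimmed text could then be wrapped in `io.StringIO` and handed to `pd.read_csv` with `engine='c'` and a `dtype` mapping, sidestepping skipfooter entirely, at the cost of holding the whole file in memory.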

@gfyoung
Member

gfyoung commented Jul 31, 2016

@jreback : I think this issue can be closed because it has boiled down to either requesting dtype in the Python engine or skipfooter in the C engine, both of which are already part of the tracker in #12686.

@jreback
Contributor

jreback commented Aug 2, 2016

Closed, as this is already tracked in other issues, specifically as @gfyoung indicates above.


6 participants