
read_table with dtype=object and an int converter still returns float64 if NaN present #14558


Closed
radekholy24 opened this issue Nov 2, 2016 · 5 comments
Labels
IO CSV (read_csv, to_csv), Usage Question

Comments

@radekholy24

A small, complete example of the issue

>>> import io, pandas
>>> csvfile = io.StringIO('a\nN/A\n1\n')
>>> converters = {'a': lambda x: float('nan') if x == 'N/A' else int(x)}
>>> pandas.read_table(csvfile, dtype=object, converters=converters)
     a
0  NaN
1  1.0

Expected Output

     a
0  NaN
1    1
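For reference, one way to actually obtain the expected output is to defer missing-value handling and apply the converter after parsing. This is a sketch, not part of the original report: `keep_default_na=False` keeps `'N/A'` as a literal string, and building the column explicitly with `dtype=object` prevents pandas from upcasting the int to float during inference.

```python
import io
import pandas as pd

csvfile = io.StringIO('a\nN/A\n1\n')
# Parse everything as plain strings, keeping 'N/A' literal instead of NaN
df = pd.read_csv(csvfile, dtype=str, keep_default_na=False)

convert = lambda x: float('nan') if x == 'N/A' else int(x)
# Construct the column explicitly as object so the int is not upcast to float
df['a'] = pd.Series([convert(x) for x in df['a']], dtype=object, index=df.index)
```

After this, `df['a']` holds a real `int` alongside the NaN, at the cost of the object-dtype performance caveats mentioned later in the thread.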

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 7.1.0
setuptools: 18.0.1
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
@jorisvandenbossche
Member

This is because the dtype keyword is ignored when converters is specified. There is a PR to clarify this in the docstring: #14295. But we could probably trigger a warning to alert the user to that.
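A minimal sketch of the behaviour described above: when a converter is supplied for a column, the dtype requested for it is ignored and the result dtype is inferred from the converter's output (recent pandas versions also emit a ParserWarning about this; the warning did not exist at the time of this thread).

```python
import io
import pandas as pd

csv = io.StringIO('a\n1\n2\n')
# dtype=object is ignored for column 'a' because a converter is also given;
# the dtype is inferred from what the converter returns (ints here -> int64)
df = pd.read_csv(csv, dtype=object, converters={'a': int})
print(df['a'].dtype)  # int64, not object
```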

By the way, I suspect that you want the object dtype so your integers aren't cast to floats, but note that many operations on such an object-dtyped series will still show this casting behaviour, and keeping the column as object dtype will not be performant.

@radekholy24
Author

> This is because the dtype keyword is ignored when converters is specified. There is a PR to clarify in the docstring: #14295. But probably, we could trigger a warning to warn the user about that.

Thank you for the response. The API reference does not mention that converters switches parsing to the Python engine. I am looking forward to the docstring clarification.

> By the way, I suspect that you want the object type so your integers aren't cast to floats, but note that in many operations you do on such an object dtyped series, you will still get this casting behaviour + having this as object dtype will not be performant.

Yes, that's why I do that. In my case, the DataFrame is just an intermediate format that I use to aggregate my data. AFAIK, I have only two options: either do what I do now, or handle my ints as floats. But since the rest of the code differentiates between ints and floats, I'd have to convert the data twice (from str to float using read_table, and then from float to int). I believe the first option is better in my case.

@jorisvandenbossche
Member

> I'd have to convert the data twice (from str to float using read_table and from float to int then). I believe that the first option is better in my case.

Well, the first conversion from str to float is actually done automatically as 'N/A' by default is recognized as a missing value:

In [16]: import pandas as pd; from io import StringIO

In [17]: s = 'a\nN/A\n1\n'

In [18]: pd.read_csv(StringIO(s))
Out[18]: 
     a
0  NaN
1  1.0

So I think this is the much easier path instead of passing a custom converter.
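For later readers: pandas 0.24+ added a nullable integer dtype, which was not available at the time of this thread but makes this path yield real integers alongside missing values. A sketch:

```python
import io
import pandas as pd

s = 'a\nN/A\n1\n'
# 'N/A' is recognized as missing by default, so the column parses as float64...
df = pd.read_csv(io.StringIO(s))
# ...and can then be converted to the nullable integer dtype (pandas >= 0.24),
# which holds missing values without casting the ints to float
df['a'] = df['a'].astype('Int64')
```

This avoids both the custom converter and the object-dtype performance caveats.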

@radekholy24
Author

OK, that's a good point. But still, from the point of view of my interfaces: right now I have a function that converts strings to integers, which exactly matches the description of the data I use, and the function interface remains library-independent. If I change the signature to float -> int, it just feels wrong. I promise to keep those possible performance problems in mind, but until I hit them, code simplicity and clarity are preferred over performance optimizations.

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 2, 2016
@jorisvandenbossche
Member

OK! Closing this then.
