
read_table with dtype=object and an int converter still returns float64 if NaN present #14558


Closed
radekholy24 opened this issue Nov 2, 2016 · 5 comments
Labels
IO CSV (read_csv, to_csv), Usage Question

Comments

@radekholy24

A small, complete example of the issue

>>> import io, pandas
>>> csvfile = io.StringIO('a\nN/A\n1\n')
>>> converters = {'a': lambda x: float('nan') if x == 'N/A' else int(x)}
>>> pandas.read_table(csvfile, dtype=object, converters=converters)
     a
0  NaN
1  1.0

Expected Output

     a
0  NaN
1    1
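For reference, one way to actually obtain the expected output is to defer missing-value handling and apply the converter after parsing. This is a sketch, not part of the original report: `keep_default_na=False` keeps `'N/A'` as a literal string, and building the column explicitly with `dtype=object` prevents pandas from upcasting the int to float during inference.

```python
import io
import pandas as pd

csvfile = io.StringIO('a\nN/A\n1\n')
# Parse everything as plain strings, keeping 'N/A' literal instead of NaN
df = pd.read_csv(csvfile, dtype=str, keep_default_na=False)

convert = lambda x: float('nan') if x == 'N/A' else int(x)
# Construct the column explicitly as object so the int is not upcast to float
df['a'] = pd.Series([convert(x) for x in df['a']], dtype=object, index=df.index)
```

After this, `df['a']` holds a real `int` alongside the NaN, at the cost of the object-dtype performance caveats mentioned later in the thread.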

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 7.1.0
setuptools: 18.0.1
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
@jorisvandenbossche
Member

This is because the dtype keyword is ignored when converters is specified. There is a PR to clarify this in the docstring: #14295. But we could probably trigger a warning to alert the user to that.
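A minimal sketch of the behaviour described above: when a converter is supplied for a column, the dtype requested for it is ignored and the result dtype is inferred from the converter's output (recent pandas versions also emit a ParserWarning about this; the warning did not exist at the time of this thread).

```python
import io
import pandas as pd

csv = io.StringIO('a\n1\n2\n')
# dtype=object is ignored for column 'a' because a converter is also given;
# the dtype is inferred from what the converter returns (ints here -> int64)
df = pd.read_csv(csv, dtype=object, converters={'a': int})
print(df['a'].dtype)  # int64, not object
```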

By the way, I suspect that you want the object dtype so your integers aren't cast to floats, but note that many operations on such an object-dtyped series will still show this casting behaviour, and keeping the column as object dtype will not be performant.

@radekholy24
Author

> This is because the dtype keyword is ignored when converters is specified. There is a PR to clarify in the docstring: #14295. But probably, we could trigger a warning to warn the user about that.

Thank you for the response. The API reference does not mention that converters switches parsing to the Python engine. I am looking forward to the docstring clarification.

> By the way, I suspect that you want the object type so your integers aren't cast to floats, but note that in many operations you do on such an object dtyped series, you will still get this casting behaviour + having this as object dtype will not be performant.

Yes, that's why I do that. In my case, the DataFrame is just an intermediate format that I use to aggregate my data. AFAIK, I have only two options: either do what I do now, or handle my ints as floats. But since the rest of the code differentiates between ints and floats, I'd have to convert the data twice (from str to float using read_table, and then from float to int). I believe the first option is better in my case.

@jorisvandenbossche
Member

> I'd have to convert the data twice (from str to float using read_table and from float to int then). I believe that the first option is better in my case.

Well, the first conversion from str to float is actually done automatically as 'N/A' by default is recognized as a missing value:

In [16]: import pandas as pd; from io import StringIO

In [17]: s = 'a\nN/A\n1\n'

In [18]: pd.read_csv(StringIO(s))
Out[18]: 
     a
0  NaN
1  1.0

So I think this is the much easier path instead of passing a custom converter.
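For later readers: pandas 0.24+ added a nullable integer dtype, which was not available at the time of this thread but makes this path yield real integers alongside missing values. A sketch:

```python
import io
import pandas as pd

s = 'a\nN/A\n1\n'
# 'N/A' is recognized as missing by default, so the column parses as float64...
df = pd.read_csv(io.StringIO(s))
# ...and can then be converted to the nullable integer dtype (pandas >= 0.24),
# which holds missing values without casting the ints to float
df['a'] = df['a'].astype('Int64')
```

This avoids both the custom converter and the object-dtype performance caveats.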

@radekholy24
Author

OK, that's a good point. But still, from the point of view of my interfaces: right now I have a function that converts strings to integers, which exactly matches the description of the data I use, and the function interface remains library-independent. If I change the signature to float -> int, it just feels wrong. I promise to keep those possible performance problems in mind, but until I hit them, code simplicity and clarity are preferred over performance optimizations.

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Nov 2, 2016
@jorisvandenbossche
Member

OK! Closing this then.
