read_excel with dtype=str converts empty cells to the string 'nan' #20377

arnau126 · 2018-03-16T10:46:02Z

Code Sample, a copy-pastable example if possible

In [10]: df = pd.DataFrame({'a': ['x', 'y', '', 'z']})

In [11]: df
Out[11]: 
   a
0  x
1  y
2   
3  z

In [12]: df.to_excel('temp.xlsx')

In [13]: df = pd.read_excel('temp.xlsx', dtype=str)

In [14]: df
Out[14]: 
     a
0    x
1    y
2  nan
3    z

In [15]: df.loc[2, 'a']
Out[15]: 'nan'

In [16]: type(df.loc[2, 'a'])
Out[16]: str

Problem description

The empty string of the original dataframe becomes the string 'nan', instead of numpy.nan.

Expected Output

In [14]: df
Out[14]: 
     a
0    x
1    y
2  NaN
3    z

In [15]: df.loc[2, 'a']
Out[15]: nan

In [16]: type(df.loc[2, 'a'])
Out[16]: float

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ca_ES.UTF-8
LOCALE: ca_ES.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2018-03-16T11:27:16Z

Thanks for the report. Mind taking a look to see what's going on?

arnau126 · 2018-03-16T15:42:42Z

I've found the function that converts the np.nan to 'nan': pandas.core.dtypes.astype_nansafe.

read_excel uses PythonParser. This parser has a function called self._cast_types, which uses astype_nansafe.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1570

An example of astype_nansafe:

In [4]: from pandas.core.dtypes.cast import astype_nansafe

In [5]: arr = np.array([np.nan, 'a', 'b'], object)

In [6]: arr
Out[6]: array([nan, 'a', 'b'], dtype=object)

In [7]: astype_nansafe(arr, str)
Out[7]: array(['nan', 'a', 'b'], dtype=object)

I don't know if it's the expected behavior or not.
Should we modify astype_nansafe? or _cast_types?.

I'm not sure how to fix the issue.

cbertinato · 2018-03-17T16:57:12Z

Is that what one should expect when using dtype=str? Not specifying dtype produces the expected result. The default value is None and dtype conversion is done by row.

In [16]: df = pd.read_excel('temp.xlsx')

In [17]: df
Out[17]:
     a
0    x
1    y
2  NaN
3    z

In [18]: df.loc[2, 'a']
Out[18]: nan

In [19]: type(df.loc[2, 'a'])
Out[19]: float

In [20]: type(df.loc[1, 'a'])
Out[20]: str

I think that the output for dtype=str is expected behavior.

nikoskaragiannakis · 2018-03-17T22:40:14Z

read_csv returns what you would expect

from io import StringIO
import pandas as pd

arr = pd.read_csv(StringIO('col1, col2, col3\n,a,b'), dtype=str)

it will give you

  col1  col2  col3
0  NaN     a     b

so, I guess, what you expect from read_excel seems to be consistent with this.

I guess, if this needs to be changed, the best would be to change astype_unicode in pandas/_libs/lib.pyx.
What do you think?

nikoskaragiannakis · 2018-03-17T22:58:40Z

@cbertinato just saw your comment! Good point.

arnau126 · 2018-03-19T09:15:22Z

@cbertinato, if I have a column with numbers, but I want to read them as strings I need to specify dtype=str, otherwise they will be read as float.

In [1]: df = pd.DataFrame({'a': ['1', '2', '', '3']})

In [2]: df.to_excel('temp.xlsx')

In [3]: pd.read_excel('temp.xlsx', dtype=str)
Out[3]: 
     a
0   1
1   2
2  nan
3   3

In [5]: pd.read_excel('temp.xlsx')
Out[5]: 
     a
0  1.0
1  2.0
2  NaN
3  3.0

Moreover, as @nikoskaragiannakis says, it should be consistent with read_csv.

cbertinato · 2018-03-19T12:45:47Z

I missed the use case for numbers, and @nikoskaragiannakis's comment. It should probably be consistent with read_csv.

nikoskaragiannakis · 2018-03-19T13:19:58Z

@cbertinato if the case is as @arnau126 says, then I can make a fix for this.

nikoskaragiannakis · 2018-03-20T02:43:30Z

Question: When we use dtype=str do we want np.nan to turn to a string "nan" or to stay as it is?
I made a fix for this issue, but the above functionality is affected.

arnau126 · 2018-03-20T08:24:46Z

The purpose of this issue is just change this functionality: currently np.nan turns to a string "nan", and I think that it should stay as it.

cbertinato · 2018-03-20T16:17:38Z

As @arnau126 points out, the result from pd.read_excel with dtype=str is inconsistent with that from pd.read_csv. The value of placing a np.nan instead of the string representation is that you can use pd.isna, which does not work for 'nan'. So the thought is to make read_excel consistent with read_csv.

nikoskaragiannakis · 2018-03-20T17:49:15Z

@cbertinato @arnau126 maybe my question wasn't very clear.
I suggested changing astype_unicode and astype_str (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L460-L483), which would mean that 'nan' would always become np.nan.

If we only want to change the read_excel function, then we could just substitute all the 'nan' to np.nan before we return the result to the user. Not very elegant, but it would only affect read_excel.

Which one do we want here?

Sorry if I misunderstood what you guys are saying.

cbertinato · 2018-03-20T17:56:34Z

I think that the latter would be safer and less likely to break something else. This is clearly only an issue for read_excel and not read_csv.

nikoskaragiannakis · 2018-03-20T18:33:07Z

It is clear now, thanks. I'll do the changes.

…dev#20377)

… np.nan to empty string (pandas-dev#20377)

…andas-dev#20377)

… tests for np.nan (pandas-dev#20377)

… tests for np.nan (pandas-dev#20377) TST: pep8 (pandas-dev#20377) TST: Correction in a test (pandas-dev#20377)

Closes pandas-devgh-20377.

Closes gh-20377.

Closes pandas-devgh-20377.

yoonghm · 2018-12-11T14:30:11Z

Treating an empty value in Excel as nan has another side effect: integer will be converted to float. Subsequent operation on that columns will have other effects again.

Also read_excel() does not honor empty value handling offered by function in converters:

I have an Excel file temp.xlsx with the following data:

Key1,Key2,Key3,Key4
0,11.0,  Apple  ,1.12
1,12.0,,1.02
2,13.0,  Orange,
3,  Banana  ,0.01

This is the code:

import numpy as np
import pandas as pd

def handle_string(value):
    return value.replace(' ', '')

def handle_integer(value):
    if value == '':
        return 0
    else:
        int(value)

def handle_float(value):
    if value == '':
        return 0.0
    else:
        float(value)

df = pd.read_excel(
        'temp.xlsx',
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

print('')

df = pd.read_excel(
        'temp.xlsx',
        converters={\
            'Key1' : handle_integer,
            'Key2' : handle_integer,
            'Key3' : handle_string,
            'Key4' : handle_float,
        }
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

The output:

Key1 Key2 Key3 Key4
0 0 11.0 Apple 1.12
1 1 12.0 NaN 1.02
2 2 13.0 Orange NaN
3 3 NaN Banana 0.01
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

Key1 Key2 Key3 Key4
0 None NaN Apple NaN
1 None NaN NaN NaN
2 None NaN Orange 0.0
3 None 0.0 Banana NaN
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

daveknave · 2019-08-19T07:41:43Z

Hello everyone, I think I have encountered that same issue while using read_fwf. NaNs are being converted to 'nan' rather than np.nan() when dtype=str is set.

It would be great, if the behavior were consistent with read_csv for this one as well. Since it seems to be exactly this issue, I didn't want to open a new one.
I am using Python 3.6 panda 25.0.0

Best David

ptth222 · 2022-04-20T15:31:00Z

This change ended up breaking some code for me. It was a simple enough fix, and the implementation should have been better in the first place, but it still took me a whole day to track it down.

I am commenting to say that I feel like read_csv should have been changed to be like read_excel instead of the other way around. I find it rather confusing that when you tell the function to give you str data type it actually gives you a mix of str and float (for NaN). My expectation was that everything would be of str type since that is what I asked for. I understand some of the history and reasoning for why it does what it does, but I still think that if you tell a function to return a certain type that's what it should return.

There seem to be some recent developments on this described here: https://pandas.pydata.org/docs/user_guide/text.html. Once this is more stable I think it would make sense for dtype=str and dtype="string" to have the same functionality.

TomAugspurger added IO Data IO issues that don't fit into a more specific label Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO Excel read_excel, to_excel Difficulty Intermediate labels Mar 16, 2018

nikoskaragiannakis mentioned this issue Mar 20, 2018

read_excel with dtype=str converts empty cells to np.nan #20429

Closed

4 tasks

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

BUG: astype_unicode astype_str turn a np.nan to empty string (pandas-…

694849d

…dev#20377)

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

TST: added unitest for read_excel and modified series/test_dtypes for…

5ba95a1

… np.nan to empty string (pandas-dev#20377)

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

TST: added unitest for read_csv (pandas-dev#20377)

d3ceec3

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

BUG: patched TextReader to turn np.nan to empty string if dtype=str (p…

ea1d73a

…andas-dev#20377)

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

DOC: updated IO section (pandas-dev#20377)

c1376a5

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Mar 25, 2018

DOC: updated IO section (pandas-dev#20377)

3103811

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

DOC: updated IO section (pandas-dev#20377)

478d08d

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

BUG: np.nan stays as np.nan (pandas-dev#20377)

edb26d7

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

TXT: Moved test from series.test_io to io.parser.na_values. Corrected…

c3ab9cb

… tests for np.nan (pandas-dev#20377)

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

DOC: updated IO section (pandas-dev#20377)

69f6c95

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

TST: pep8 (pandas-dev#20377)

97a345a

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 2, 2018

DOC: updated IO section (pandas-dev#20377)

c9f5120

jreback added this to the 0.23.0 milestone Apr 5, 2018

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

DOC: Better explanation (pandas-dev#20377)

47bc105

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

BUG: use checknull (pandas-dev#20377)

3740dfe

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

TST: update tests (pandas-dev#20377)

7d453bb

nikoskaragiannakis added a commit to nikoskaragiannakis/pandas that referenced this issue Apr 8, 2018

BUG: string nans to np.nan in Series for list data (pandas-dev#20377)

bcd739d

jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 15, 2018

BUG: Don't parse NaN as 'nan' in Data IO

dafaf5c

Closes pandas-devgh-20377.

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 15, 2018

BUG: Don't parse NaN as 'nan' in Data IO

6637262

Closes pandas-devgh-20377.

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 15, 2018

BUG: Don't parse NaN as 'nan' in Data IO

a55e8d0

Closes pandas-devgh-20377.

gfyoung mentioned this issue Oct 15, 2018

BUG: Don't parse NaN as 'nan' in Data IO #23162

Merged

jreback modified the milestones: Contributions Welcome, 0.24.0 Oct 15, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 16, 2018

BUG: Don't parse NaN as 'nan' in Data IO

5285f5b

Closes pandas-devgh-20377.

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 16, 2018

BUG: Don't parse NaN as 'nan' in Data IO

dea9b7c

Closes pandas-devgh-20377.

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 17, 2018

BUG: Don't parse NaN as 'nan' in Data IO

524dbb3

Closes pandas-devgh-20377.

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 18, 2018

BUG: Don't parse NaN as 'nan' in Data IO

ddd1be3

Closes pandas-devgh-20377.

jreback closed this as completed in #23162 Oct 18, 2018

jreback pushed a commit that referenced this issue Oct 18, 2018

BUG: Don't parse NaN as 'nan' in Data IO (#23162)

0c51d4d

Closes gh-20377.

tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018

BUG: Don't parse NaN as 'nan' in Data IO (pandas-dev#23162)

06139bc

Closes pandas-devgh-20377.

joha1 mentioned this issue Jun 9, 2020

BUG: read_excel with dtype=np.int32 can't handle empty or nan cells #34662

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_excel with dtype=str converts empty cells to the string 'nan' #20377

read_excel with dtype=str converts empty cells to the string 'nan' #20377

arnau126 commented Mar 16, 2018

INSTALLED VERSIONS

TomAugspurger commented Mar 16, 2018

arnau126 commented Mar 16, 2018 •

edited

Loading

cbertinato commented Mar 17, 2018 •

edited

Loading

nikoskaragiannakis commented Mar 17, 2018

nikoskaragiannakis commented Mar 17, 2018

arnau126 commented Mar 19, 2018 •

edited

Loading

cbertinato commented Mar 19, 2018

nikoskaragiannakis commented Mar 19, 2018 •

edited

Loading

nikoskaragiannakis commented Mar 20, 2018 •

edited

Loading

arnau126 commented Mar 20, 2018

cbertinato commented Mar 20, 2018

nikoskaragiannakis commented Mar 20, 2018

cbertinato commented Mar 20, 2018 •

edited

Loading

nikoskaragiannakis commented Mar 20, 2018

yoonghm commented Dec 11, 2018 •

edited

Loading

daveknave commented Aug 19, 2019

ptth222 commented Apr 20, 2022

read_excel with dtype=str converts empty cells to the string 'nan' #20377

read_excel with dtype=str converts empty cells to the string 'nan' #20377

Comments

arnau126 commented Mar 16, 2018

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Mar 16, 2018

arnau126 commented Mar 16, 2018 • edited Loading

cbertinato commented Mar 17, 2018 • edited Loading

nikoskaragiannakis commented Mar 17, 2018

nikoskaragiannakis commented Mar 17, 2018

arnau126 commented Mar 19, 2018 • edited Loading

cbertinato commented Mar 19, 2018

nikoskaragiannakis commented Mar 19, 2018 • edited Loading

nikoskaragiannakis commented Mar 20, 2018 • edited Loading

arnau126 commented Mar 20, 2018

cbertinato commented Mar 20, 2018

nikoskaragiannakis commented Mar 20, 2018

cbertinato commented Mar 20, 2018 • edited Loading

nikoskaragiannakis commented Mar 20, 2018

yoonghm commented Dec 11, 2018 • edited Loading

daveknave commented Aug 19, 2019

ptth222 commented Apr 20, 2022

Output of `pd.show_versions()`

arnau126 commented Mar 16, 2018 •

edited

Loading

cbertinato commented Mar 17, 2018 •

edited

Loading

arnau126 commented Mar 19, 2018 •

edited

Loading

nikoskaragiannakis commented Mar 19, 2018 •

edited

Loading

nikoskaragiannakis commented Mar 20, 2018 •

edited

Loading

cbertinato commented Mar 20, 2018 •

edited

Loading

yoonghm commented Dec 11, 2018 •

edited

Loading