read_excel with dtype=str converts empty cells to the string 'nan' #20377
Comments
Thanks for the report. Mind taking a look to see what's going on? |
I've found the function that converts the
https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L1570 An example of
In [4]: from pandas.core.dtypes.cast import astype_nansafe
In [5]: arr = np.array([np.nan, 'a', 'b'], object)
In [6]: arr
Out[6]: array([nan, 'a', 'b'], dtype=object)
In [7]: astype_nansafe(arr, str)
Out[7]: array(['nan', 'a', 'b'], dtype=object)
I don't know if it's the expected behavior or not. I'm not sure how to fix the issue. |
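For readers following along, the conversion shown above can be reproduced with public NumPy calls only. This is a minimal sketch, not the pandas internals: `astype_str_keep_nan` is a hypothetical helper name invented for illustration, and it only shows one possible way to cast to `str` while leaving missing entries as `np.nan`.

```python
import numpy as np
import pandas as pd


def astype_str_keep_nan(values):
    """Hypothetical helper: cast to str, but keep missing entries as np.nan."""
    arr = np.asarray(values, dtype=object)
    mask = pd.isna(arr)                    # remember where the missing values are
    out = arr.astype(str).astype(object)   # plain cast turns NaN into 'nan'
    out[mask] = np.nan                     # restore the missing markers
    return out


arr = np.array([np.nan, "a", "b"], dtype=object)
naive = arr.astype(str)           # NaN becomes the literal string 'nan'
safe = astype_str_keep_nan(arr)   # NaN stays NaN, the rest are strings
```

The naive cast is what produces the string 'nan' reported in this issue; the masked version keeps the missing value distinguishable from real text.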
Is that what one should expect when using
I think that the output for |
from io import StringIO
import pandas as pd
arr = pd.read_csv(StringIO('col1, col2, col3\n,a,b'), dtype=str)
it will give you
so, I guess, what you expect from I guess, if this needs to be changed, the best would be to change |
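As a rough illustration of the `read_csv` behavior being compared here, the sketch below reads one row with an empty first field. It assumes a header without the spaces used in the snippet above, so the column names are exactly `col1`, `col2`, `col3`. The second call uses `keep_default_na=False`, an existing `read_csv` parameter, to show how an empty field can be kept as an empty string instead.

```python
from io import StringIO

import pandas as pd

csv_text = "col1,col2,col3\n,a,b"

# dtype=str: non-empty fields come back as strings, the empty field as missing.
df = pd.read_csv(StringIO(csv_text), dtype=str)

# keep_default_na=False: the empty field is kept as an empty string.
df_raw = pd.read_csv(StringIO(csv_text), dtype=str, keep_default_na=False)

empty_is_missing = pd.isna(df.loc[0, "col1"])
empty_as_text = df_raw.loc[0, "col1"]
```

In neither case does the empty field turn into the string 'nan', which is the consistency argument made in this thread.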
@cbertinato just saw your comment! Good point. |
@cbertinato, if I have a column with numbers, but I want to read them as strings I need to specify
In [1]: df = pd.DataFrame({'a': ['1', '2', '', '3']})
In [2]: df.to_excel('temp.xlsx')
In [3]: pd.read_excel('temp.xlsx', dtype=str)
Out[3]:
a
0 1
1 2
2 nan
3 3
In [5]: pd.read_excel('temp.xlsx')
Out[5]:
a
0 1.0
1 2.0
2 NaN
3 3.0
Moreover, as @nikoskaragiannakis says, it should be consistent with |
I missed the use case for numbers, and @nikoskaragiannakis's comment. It should probably be consistent with |
@cbertinato if the case is as @arnau126 says, then I can make a fix for this. |
Question: When we use |
The purpose of this issue is just to change this functionality: currently |
As @arnau126 points out, the result from |
@cbertinato @arnau126 maybe my question wasn't very clear. If we only want to change the Which one do we want here? Sorry if I misunderstood what you guys are saying. |
I think that the latter would be safer and less likely to break something else. This is clearly only an issue for |
It is clear now, thanks. I'll do the changes. |
… np.nan to empty string (pandas-dev#20377)
… tests for np.nan (pandas-dev#20377) TST: pep8 (pandas-dev#20377) TST: Correction in a test (pandas-dev#20377)
Treating an empty value in Excel as Also I have an Excel file
This is the code:
import numpy as np
import pandas as pd

def handle_string(value):
    # Remove all spaces from a text cell.
    return value.replace(' ', '')

def handle_integer(value):
    # Map an empty cell to 0, otherwise convert to int.
    if value == '':
        return 0
    else:
        return int(value)

def handle_float(value):
    # Map an empty cell to 0.0, otherwise convert to float.
    if value == '':
        return 0.0
    else:
        return float(value)

# First read: default type inference.
df = pd.read_excel(
    'temp.xlsx',
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")
print('')

# Second read: one converter per column.
df = pd.read_excel(
    'temp.xlsx',
    converters={
        'Key1': handle_integer,
        'Key2': handle_integer,
        'Key3': handle_string,
        'Key4': handle_float,
    }
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")
The output: Key1 Key2 Key3 Key4 Key1 Key2 Key3 Key4
|
Hello everyone, I think I have encountered that same issue while using It would be great, if the behavior were consistent with Best David |
This change ended up breaking some code for me. It was a simple enough fix, and the implementation should have been better in the first place, but it still took me a whole day to track it down. I am commenting to say that I feel like read_csv should have been changed to be like read_excel instead of the other way around. I find it rather confusing that when you tell the function to give you str data type it actually gives you a mix of str and float (for NaN). My expectation was that everything would be of str type since that is what I asked for. I understand some of the history and reasoning for why it does what it does, but I still think that if you tell a function to return a certain type that's what it should return. There seem to be some recent developments on this described here: https://pandas.pydata.org/docs/user_guide/text.html. Once this is more stable I think it would make sense for dtype=str and dtype="string" to have the same functionality. |
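To make the `dtype=str` versus `dtype="string"` comparison above concrete, here is a small sketch using `read_csv` on in-memory text. The data and variable names are made up for illustration. The exact missing-value marker and column dtype each call returns depend on the pandas version, so the sketch only checks what should hold either way: the empty field is reported as missing, and the non-missing values are Python strings.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b\n1,x\n,y\n3,z"

# dtype=str: numbers are read as text, the empty field is a missing value.
as_str = pd.read_csv(StringIO(csv_text), dtype=str)

# dtype="string": the dedicated string dtype from the text user guide.
as_string = pd.read_csv(StringIO(csv_text), dtype="string")

missing_str = as_str.loc[1, "a"]
missing_string = as_string.loc[1, "a"]

# Types of the values that are actually present in the "string" column.
present_types = {type(v) for v in as_string["a"].dropna()}
```

If a column that contains only `str` values is needed, one option is to fill the missing entries explicitly, for example with `fillna("")`, after reading.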
Code Sample, a copy-pastable example if possible
Problem description
The empty string of the original dataframe becomes the string 'nan', instead of numpy.nan.
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: ca_ES.UTF-8
LOCALE: ca_ES.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None