pandas.DataFrame.sum() skipna behavior does not match documentation #15542

jglicoes · 2017-03-01T20:52:36Z

Code Sample, a copy-pastable example if possible

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0,0],[1,1],[np.nan,0],[np.nan,1],[np.nan,np.nan]], columns=['a','b'])

df
Out[3]:
     a    b
0  0.0  0.0
1  1.0  1.0
2  NaN  0.0
3  NaN  1.0
4  NaN  NaN

df.sum(axis=1, skipna=True)
Out[4]:
0    0.0
1    2.0
2    0.0
3    1.0
4    0.0
dtype: float64
```

Problem description

The documentation for pandas.DataFrame.sum indicates that:

skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA

However, in this particular environment, `sum` with `skipna=True` evaluates a row consisting entirely of NA values to 0 rather than the documented NA.

Currently, this behavior occurs only in my local Windows environment, and NOT in the Linux environment my organization's grid runs on. I have provided the output of `pd.show_versions()` for both environments below to help localize the issue.

Expected Output

df.sum(axis=1, skipna=True)
Out[6]:
0 0.0
1 2.0
2 0.0
3 1.0
4 NaN
dtype: float64
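
For readers who need the expected output regardless of environment, one possible workaround is to mask the all-NA rows explicitly after summing. This is only a sketch, not something proposed in the report itself; the names `has_value` and `row_sums` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 1], [np.nan, 0], [np.nan, 1], [np.nan, np.nan]],
    columns=["a", "b"],
)

# True for rows that contain at least one non-NA value (hypothetical helper name).
has_value = df.notnull().any(axis=1)

# Sum each row while skipping NA, then set rows that were entirely NA back to NaN.
row_sums = df.sum(axis=1, skipna=True).where(has_value)
print(row_sums)
```

Because the mask is applied after the sum, the result should not depend on whether the underlying routine returns 0 or NaN for an all-NA row. On pandas 0.22 or later (newer than the 0.19.2 used here), `df.sum(axis=1, min_count=1)` is a built-in way to request NaN when a row has no valid values.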

Output of `pd.show_versions()`

Windows version in which issue is present

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

Linux version in which issue is NOT present

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-642.15.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.4.1
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.2
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1

jreback · 2017-03-01T21:09:08Z

This is a duplicate of #9422.

You have bottleneck installed in one environment and not in the other. This is unfortunate: the pandas behavior is correct and numpy/bottleneck are wrong (but they matched in bottleneck < 1).
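
A quick way to check the bottleneck difference described above is to test whether the package is importable and to rerun the sum with bottleneck switched off. This is a minimal sketch under two assumptions that do not hold for the environments in this report: Python 3 (for `importlib.util`) and pandas 0.20 or later (where the `compute.use_bottleneck` option exists). The variable names are illustrative.

```python
import importlib.util

import numpy as np
import pandas as pd

# True if the optional bottleneck package can be found in this environment.
has_bottleneck = importlib.util.find_spec("bottleneck") is not None
print("bottleneck installed:", has_bottleneck)

df = pd.DataFrame(
    [[0, 0], [1, 1], [np.nan, 0], [np.nan, 1], [np.nan, np.nan]],
    columns=["a", "b"],
)

# Assumption: pandas >= 0.20. Inside this block pandas uses its own
# reduction code path instead of bottleneck, even if bottleneck is installed.
with pd.option_context("compute.use_bottleneck", False):
    row_sums = df.sum(axis=1, skipna=True)
print(row_sums)
```

The value printed for the all-NA row depends on the pandas version, so compare the output across environments rather than against a fixed expectation.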

jreback added the Duplicate Report, Missing-data, and Numeric Operations labels Mar 1, 2017

jreback added this to the No action milestone Mar 1, 2017

jreback closed this as completed Mar 1, 2017

This was referenced Mar 1, 2017

BUG: Fix nansum overflow on Windows with bottleneck #15507

Closed

API: sum of Series of all NaN should return 0 or NaN ? #9422

Closed
